In a new paper, a team of OpenAI researchers sets out to advance methods for training large-scale language models such as GPT-3 on objectives that more closely capture human preferences — and does so by putting humans back into the loop. The work focuses on abstractive English text summarization — a subjective task that’s considered challenging because the notion of what makes a “good summary” is difficult to capture without human input.
“As our models become more powerful, we believe aligning them with our goals will be very important to ensure they are beneficial for humans,” the researchers explain in an OpenAI blog post. “In the short term, we wanted to test if human feedback techniques could help our models improve performance on useful tasks.”
By applying human feedback and reinforcement learning (RL) to the training of language models, the researchers were able to significantly improve the quality of their models’ summaries.
The team first trained an initial summarization model and collected a large, high-quality dataset of human comparisons between the summaries. They then trained a reward model to predict the human-preferred summary and used that model as a reward function to fine-tune the summarization model.
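The reward model in this pipeline is trained on pairwise human comparisons: given two candidate summaries, it should assign a higher score to the one humans preferred. A minimal sketch of that pairwise objective (the standard log-sigmoid comparison loss; function and variable names here are illustrative, not taken from the paper's code):

```python
import math

def reward_model_loss(r_preferred: float, r_rejected: float) -> float:
    """Pairwise comparison loss: -log(sigmoid(r_preferred - r_rejected)).
    Minimizing it pushes the reward model to score the human-preferred
    summary higher than the rejected one."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the model already ranks the preferred summary higher, loss is small:
print(round(reward_model_loss(2.0, -1.0), 4))  # 0.0486
# When the ranking is inverted, the loss is large:
print(round(reward_model_loss(-1.0, 2.0), 4))  # 3.0486
```

In practice the two scores come from the same network run on each summary, so gradients from many such comparisons shape a single scalar reward function.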
They found that this method significantly improves the quality of the summaries as evaluated by humans, even on datasets very different from the one used for fine-tuning.
In evaluations conducted on the TL;DR dataset of Reddit posts, the OpenAI models significantly outperformed both human reference summaries and those generated by much larger models fine-tuned with supervised learning alone. The models also produced summaries nearly as good as human references on the CNN/DailyMail news dataset without any news-specific fine-tuning, demonstrating that models informed by such human feedback mechanisms can generalize to new domains much better than traditional supervised models.
The researchers also examined the impact of model and data size and analyzed reward model performance using synthetic and human-written perturbations of summaries. Their reward model outperformed metrics such as ROUGE at predicting human preferences, while optimizing the reward model directly resulted in better summaries than optimizing ROUGE.
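When the learned reward is optimized directly with RL, the paper adds a KL penalty that discourages the policy from drifting too far from the supervised fine-tuned model (which would let it exploit quirks of the reward model). A minimal sketch of that penalized reward, with illustrative names and an illustrative beta value:

```python
def rl_reward(reward_model_score: float,
              logprob_rl: float,
              logprob_sft: float,
              beta: float = 0.05) -> float:
    """Per-sample reward used during RL fine-tuning: the learned reward
    minus a KL penalty keeping the policy near the supervised model.
    kl_term = log(pi_RL(y|x) / pi_SFT(y|x))."""
    kl_term = logprob_rl - logprob_sft
    return reward_model_score - beta * kl_term

# If the RL policy assigns the same log-probability as the supervised
# model, there is no penalty and the reward passes through unchanged:
print(rl_reward(1.0, -2.0, -2.0))  # 1.0
# If the RL policy makes the sample much more likely than the
# supervised model did, the reward is discounted:
print(rl_reward(1.0, -1.0, -3.0, beta=0.1))  # 0.8
```

The coefficient beta trades off summary quality against staying close to the supervised baseline; the paper tunes it rather than fixing a single value.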
However, outperforming human-written reference summaries on TL;DR still doesn’t mean these new models have reached human-level performance, as the reference summary baselines for TL;DR and CNN/DM are not the highest possible quality, the researchers explain.
In the future, in addition to tackling harder problems, the team also plans to explore different types of feedback beyond binary comparisons. For example, they could ask humans to provide demonstrations, edit model outputs to make them better, or give explanations as to why one model output is better than another.
Although the current work focuses on summarization, the long-term goal is much broader, namely to “figure out which kinds of feedback are most effective for training models that are aligned with human preferences.”
The paper Learning to Summarize from Human Feedback is on arXiv.
Reporter: Yuan Yuan | Editor: Michael Sarazen