A new study proposes using human feedback and interaction logs to boost offline reinforcement learning (RL) in natural language processing (NLP). Although humans are often inclined to critique or complain, thus far the practical use of their feedback in ML has been mostly limited to error reporting. In the paper Learning from Human Feedback: Challenges for Real-World Reinforcement Learning in NLP, a team from Google Research, Heidelberg University and NEC Laboratories Europe explores the “gold mine” of information buried in feedback and interaction logs and its potential in user-interactive RL and NLP systems.
The researchers note that once NLP systems are deployed in the real world, large volumes of interaction logs that contain user ratings, clicks or revisions are typically collected but tend to be referenced only for evaluation purposes.
They also caution that it is too “risky” to directly update models online, especially in business settings or when feedback is inappropriate or even harmful for training. Offline RL can help by pretraining RL agents on existing interaction log data without any further interactions with the environment.
The researchers focused on sequence-to-sequence (Seq2Seq) learning for NLP applications such as machine translation, summarization, semantic parsing and dialogue generation for chatbots, as these can include rich interactions with users.
The team identified three main challenges for off-policy RL in NLP:
- Large output space
- Deterministic logging
- Reliability and learnability of feedback
For RL methods to work well it is crucial to explore the output space — which, in the case of Seq2Seq, is particularly large. Although an output sentence may contain only 100 words, each of those words could come from a vocabulary of 30,000. As the Heidelberg University Statistical Natural Language Processing Group tweeted, that produces 30,000^100 possible outputs.
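To get a sense of that number, a quick back-of-the-envelope calculation (the 100-word and 30,000-vocabulary figures are from the example above) works in log space, since the count itself overflows any floating-point type:

```python
import math

# Rough size of the Seq2Seq output space: every position in a
# 100-token sentence can be filled by any of 30,000 vocabulary items.
vocab_size = 30_000
sentence_length = 100

# Work in log space, since 30,000^100 overflows a float.
log10_outputs = sentence_length * math.log10(vocab_size)
print(f"~10^{log10_outputs:.0f} possible outputs")  # roughly 10^448
```

No RL agent can explore a space of that size by random sampling, which is why the exploration strategy matters so much.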
Effective exploration is especially important in real-world commercial systems, as poor performance there will deliver inferior outputs to users. The researchers propose pretraining the RL policy on available supervised data to enable the model to concentrate on reasonable areas in the output space.
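The two-phase idea — supervised pretraining first, RL second — can be illustrated on a toy tabular policy. This is a minimal sketch of the general technique, not the paper's implementation: the contexts, outputs, and learning rate are all invented for illustration, and a real Seq2Seq policy would be a neural network rather than a logit table.

```python
import numpy as np

# Toy setup: 5 contexts, 20 possible outputs. The logits table stands in
# for a real Seq2Seq model's parameters.
n_contexts, n_outputs = 5, 20
logits = np.zeros((n_contexts, n_outputs))

def policy(ctx):
    """Softmax distribution over outputs for a given context."""
    z = np.exp(logits[ctx] - logits[ctx].max())
    return z / z.sum()

# Phase 1: supervised pretraining on (context, reference output) pairs by
# gradient ascent on log-likelihood. This concentrates probability mass on
# reasonable outputs before any RL happens.
supervised = [(c, c % n_outputs) for c in range(n_contexts)]  # toy references
for _ in range(200):
    for ctx, ref in supervised:
        p = policy(ctx)
        grad = -p
        grad[ref] += 1.0              # gradient of log p(ref | ctx) w.r.t. logits
        logits[ctx] += 0.5 * grad

# Phase 2 (not shown): offline RL fine-tuning would start from these logits
# rather than from a uniform policy, so exploration stays near sensible outputs.
print(policy(0).argmax(), round(float(policy(0).max()), 3))
```

After pretraining, each context's reference output dominates the distribution, which is exactly the "concentrate on reasonable areas of the output space" effect the researchers describe.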
Another associated constraint for NLP applications is deterministic logging policies. Production NLP systems deliver the most likely output to users and avoid displaying inferior options, which leads to deterministic logging policies that lack explicit exploration and can bias the collected dataset towards the logging policy choices. Once an NLP system is operational it is difficult to correct such biases. The team proposes two approaches to tackle this challenge:
- Implicit exploration due to input or context variability
- Concrete treatment of degenerate behaviour when estimating from logged data
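The degenerate-estimation problem can be seen in a few lines with standard inverse propensity scoring (IPS), a common off-policy estimator. The numbers and policies below are invented for illustration; the point is structural: under a deterministic logging policy, every log entry has propensity 1 for the single shown action and 0 for everything else, so the rewards of all other actions never enter the estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 4

# Logs from a DETERMINISTIC logging policy: it always shows its argmax
# output, so the logging propensity is 1 for that action, 0 for the rest.
logged_action = 2
logs = [(logged_action, 1.0, rng.normal(0.6, 0.1)) for _ in range(1000)]
# each entry: (shown action, logging propensity, observed user reward)

def ips_value(target_probs, logs):
    """Inverse propensity scoring estimate of the target policy's value."""
    return float(np.mean([target_probs[a] / p * r for a, p, r in logs]))

target = np.full(n_actions, 1 / n_actions)  # uniform target policy

# The estimate collapses to target_probs[logged_action] * mean logged reward
# (~0.25 * 0.6 = 0.15): the other three actions' rewards are absent from the
# logs, so their contribution is silently treated as zero.
print(ips_value(target, logs))
```

With a stochastic logging policy, IPS reweights every action it occasionally showed; with a deterministic one, the dataset is biased towards the logging policy's choices and no reweighting can recover what was never logged — which is why the authors look to implicit exploration from input variability instead.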
The team stresses that even if systems can learn from human feedback, not all human feedback is of equal value or even beneficial. It is unreasonable, for example, to expect anything other than “bandit feedback” from user interactions with a chatbot: users provide a reward signal only for the one output actually presented, and cannot rate a multitude of alternative outputs for the same input. This makes the feedback very sparse relative to the size of the output space. How feedback is collected also matters, as it affects both reliability and the ability to learn reward estimators — models that approximate human rewards and can be integrated into an end-to-end RL task for NLP applications.
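A reward estimator of the kind described can be sketched as a simple regression on logged bandit feedback. Everything here is a stand-in: real systems would featurize (input, output) pairs with a neural encoder rather than random vectors, and the "true" preference weights exist only to generate synthetic ratings for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)

# Bandit feedback: for each input we observe a rating only for the ONE
# output that was shown, never for alternatives.
n_logs, dim = 500, 8
features = rng.normal(size=(n_logs, dim))            # phi(input, shown output)
true_w = rng.normal(size=dim)                        # unknown "human" preference
ratings = features @ true_w + rng.normal(0, 0.1, n_logs)  # noisy user ratings

# Reward estimator: ridge regression fit to the logged ratings. Once
# trained, it can score outputs that were never shown to any user,
# turning sparse bandit feedback into a dense reward signal for RL.
lam = 1e-2
w_hat = np.linalg.solve(features.T @ features + lam * np.eye(dim),
                        features.T @ ratings)

new_pair = rng.normal(size=dim)                      # an unseen (input, output)
print(float(new_pair @ w_hat))                       # estimated human reward
```

The reliability concern in the text maps directly onto this sketch: if the collection interface produces noisy or biased ratings, the learned weights drift from the true preferences, and every downstream RL update inherits that error.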
The team proposes that since the interfaces for feedback collection affect the reward function that RL agents learn from, researchers should study user interfaces and experiences in real-world settings, especially for downstream tasks where agents interact with natural language. They say that focusing on such systems can help NLP researchers examine offline RL in real-world production settings and encourage the development of innovative algorithms for tackling challenges in NLP applications.
The paper Learning from Human Feedback: Challenges for Real-World Reinforcement Learning in NLP is on arXiv.
Reporter: Fangyu Cai | Editor: Michael Sarazen