Recent research has demonstrated that prompting language models to generate reasoning steps can improve performance on various natural language reasoning tasks. But how should such models be supervised? Is it better to use outcome-based approaches that supervise only the final answer, or process-based approaches that supervise the reasoning process itself?
In the new paper Solving Math Word Problems With Process- and Outcome-based Feedback, a DeepMind research team conducts the first comprehensive comparison between process- and outcome-based model supervision. The two approaches achieve comparable improvements in final-answer error rate on the GSM8K dataset of math word problems, while process-based supervision significantly reduces the rate of reasoning errors, from 14.0 to 3.4 percent.

The team summarizes their key findings as follows:
- Outcome-based and process-based approaches lead to similar final-answer error rates.
- Reward models trained with either process- or outcome-based supervision learn to emulate process-based feedback.
- Low trace error requires either process-based feedback or a reward model that emulates it.

To compare process- and outcome-based approaches, the team trains models that generate a sequence of reasoning steps leading to a final answer, varying only how the models are supervised.
For outcome-based approaches, supervision targets the correctness of the final answer alone. For process-based approaches, the team uses supervision from offline human-generated reasoning traces in the GSM8K dataset of math word problems and from online human correctness annotations. They compare these approaches across different modelling and training components, including few-shot prompting, supervised finetuning (SFT) and reinforcement learning (RL).
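The distinction between the two supervision signals can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the function names and the per-step label format are assumptions made for the example.

```python
# Toy contrast between the two supervision signals. The function names and
# data formats are illustrative assumptions, not the paper's implementation.

def outcome_based_reward(final_answer: str, reference_answer: str) -> float:
    """Outcome-based supervision: reward depends only on the final answer."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0

def process_based_reward(step_labels: list[bool]) -> float:
    """Process-based supervision: every reasoning step must be judged
    correct (e.g. by a human annotator) for the trace to earn reward."""
    return 1.0 if all(step_labels) else 0.0

# A trace that reaches the right answer through a flawed step is rewarded
# under outcome-based feedback but penalized under process-based feedback.
print(outcome_based_reward("72", "72"))           # 1.0
print(process_based_reward([True, False, True]))  # 0.0
```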

The results show that supervising final-answer correctness alone is sufficient to lower the final-answer error rate: the team's best model reduces the state-of-the-art final-answer error rate on GSM8K from 16.8 percent to 12.9 percent with less label supervision. Process-based feedback delivers a much larger improvement in trace error rate (how often the model makes reasoning mistakes according to human annotators), reducing the state-of-the-art rate from 14.0 percent to just 3.4 percent.
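As a rough illustration of how these two metrics differ, the snippet below computes them over a handful of annotated samples; the data layout is hypothetical, and trace error here simply follows the parenthetical definition above.

```python
# Hypothetical evaluation sketch: each sample records whether its final
# answer matched the reference, plus a human annotation for each reasoning
# step. The field names are illustrative, not from the paper.

samples = [
    {"final_correct": True,  "steps_correct": [True, True, True]},
    {"final_correct": True,  "steps_correct": [True, False, True]},  # right answer, flawed step
    {"final_correct": False, "steps_correct": [True, False, False]},
]

# Final-answer error rate: fraction of samples with a wrong final answer.
final_answer_error = sum(not s["final_correct"] for s in samples) / len(samples)

# Trace error rate: fraction of samples whose reasoning contains at least
# one step judged incorrect by the annotators.
trace_error = sum(not all(s["steps_correct"]) for s in samples) / len(samples)

print(f"final-answer error: {final_answer_error:.2f}")  # 0.33
print(f"trace error:        {trace_error:.2f}")         # 0.67
```

Note how the second sample inflates trace error but not final-answer error, which is why outcome-based supervision alone can leave reasoning mistakes undetected.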
Overall, this study provides valuable insights into process- and outcome-based model supervision, and the team hopes it will motivate future work exploring how these findings generalize to other domains.
The paper Solving Math Word Problems With Process- and Outcome-based Feedback is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.