Training on model-generated synthetic data is a promising approach for fine-tuning Large Language Models (LLMs). However, opinions among researchers are divided. Some highlight the benefits of synthetic data, while others caution that it can negatively impact model performance.
In a new paper, RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold, a research team from Carnegie Mellon University, Google DeepMind and MultiOn provides insights into how synthetic data affects performance. Their findings suggest that a specific training recipe, supplementing positive data with carefully constructed negative (incorrect) data, achieves consistent gains over using positive data alone, matching the performance of an eightfold increase in synthetic data volume.
The researchers set out to understand synthetic data's impact on LLM capabilities through a study of math reasoning, a prevalent scenario for synthetic data use. They derived scaling laws for both positive and negative data on common reasoning benchmarks such as GSM8K and MATH. Their key observations include:
- Training on positive synthetic data from capable models results in significantly slower scaling rates compared to standard empirical risk minimization.
- Using model-generated positive synthetic data can improve sample efficiency by 2×, but can also amplify spurious correlations.
- Constructing learner-specific negative data with a focus on critical steps leads to performance gains equivalent to an eightfold increase in positive data.
- Training with negative data helps unlearn spurious correlations.
- They present a conceptual model inspired by reinforcement learning (RL) to explain these observations and the generalization benefits of synthetic data.
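The advantage-weighted view behind these observations can be illustrated with a toy sketch. This is a minimal illustration, not the paper's exact procedure: the `advantage_weighted_loss` function, the hard-coded log-probabilities, and the advantage values below are all assumptions made for demonstration. The idea is that each step of a reasoning trace is weighted by an estimated advantage, so steps with negative advantage (the critical incorrect steps) are explicitly unlearned rather than merely omitted:

```python
# Toy sketch of advantage-weighted training on synthetic reasoning traces.
# Assumption: per-step advantages are estimated elsewhere (e.g. by Monte
# Carlo rollouts from each intermediate step); here they are hard-coded.

def advantage_weighted_loss(step_logps, advantages):
    """Negative advantage-weighted log-likelihood over one trace.

    Steps with positive advantage are reinforced (their log-probability
    is pushed up); steps with negative advantage are pushed down, which
    is how per-step negative data counteracts spurious steps.
    """
    assert len(step_logps) == len(advantages)
    return -sum(a * lp for a, lp in zip(advantages, step_logps))

# Illustrative trace: two helpful steps followed by one spurious step.
step_logps = [-0.5, -1.0, -2.0]   # model log-probs of each step
advantages = [1.0, 1.0, -1.0]     # the last step is a critical error

loss = advantage_weighted_loss(step_logps, advantages)
print(loss)  # -0.5 for these illustrative numbers
```

Ordinary fine-tuning on positive data corresponds to setting every weight to 1; allowing negative weights on incorrect steps is what gives the RL-like unlearning effect the authors describe.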
Overall, this study provides valuable insights and conceptual models to understand the role of synthetic data in reasoning tasks. It validates that consistent gains can be achieved over using only positive data, and that training on per-step negatives can help unlearn spurious correlations, offering robustness benefits similar to those of reinforcement learning.
The paper RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold is on arXiv.
Author: Hecate He | Editor: Chain Zhang