DeepSeek Signals Next-Gen R2 Model, Unveils Novel Approach to Scaling Inference with SPCT

DeepSeek AI, a prominent player in the large language model arena, has recently published a research paper detailing a new technique aimed at enhancing the scalability of general reward models (GRMs) during the inference phase. Simultaneously, the company has hinted at the imminent arrival of its next-generation model, R2, building anticipation within the AI community.

The paper, titled “Inference-Time Scaling for Generalist Reward Modeling,” introduces a novel method that allows GRMs to optimize reward generation by dynamically producing principles and critiques. This is achieved through rejection fine-tuning and rule-based online reinforcement learning [1-1].

This development comes at a time when the paradigm for scaling LLMs is shifting from the pre-training stage to post-training, and particularly to the inference phase, following the emergence of models like OpenAI’s o1. This approach leverages more reinforcement learning (greater computational effort during training) and more extensive “thinking time” (greater computational effort at inference) to continually improve model performance. Notably, o1 generates a lengthy internal chain of thought before responding to users, refining its reasoning process, exploring different strategies, and identifying its own errors.

DeepSeek’s own R1 series of models has further validated the potential of pure reinforcement learning training (without relying on supervised fine-tuning) to achieve significant leaps in LLM reasoning capabilities.

The fundamental “next token prediction” mechanism of LLMs, while providing vast knowledge, often lacks deep planning and the ability to anticipate long-term outcomes, leaving models susceptible to short-sighted decisions. Reinforcement learning serves as a crucial complement, providing LLMs with an “Internal World Model.” This enables them to simulate the potential outcomes of different reasoning paths, evaluate the quality of those paths, and select superior solutions, ultimately leading to more systematic long-term planning. The synergy between LLMs and RL is increasingly recognized as key to solving complex problems.

Wu Yi, an assistant professor at Tsinghua’s Institute for Interdisciplinary Information Sciences (IIIS), described the relationship between LLMs and reinforcement learning as “multiplicative” in a recent podcast. While reinforcement learning excels at decision-making, it inherently lacks understanding. Understanding is built by pre-trained models, and reinforcement learning then further optimizes decision-making on top of that foundation. This “multiplicative relationship” suggests that only when a strong foundation of understanding, memory, and logical reasoning is built during pre-training can reinforcement learning fully unlock its potential to create a complete intelligent agent [1-2].

A comprehensive survey paper titled “Reinforcement Learning Enhanced LLMs: A Survey” outlines the typical three-step process of using RL to train LLMs:

  1. Reward Model Training: Before fine-tuning, a reward model (or reward function) is trained to approximate human preferences and evaluate different LLM outputs.
  2. Preference-Based Fine-Tuning: In each fine-tuning iteration, the large language model generates multiple responses to a given instruction, and each response is scored using the trained reward model.
  3. Policy Optimization: Reinforcement learning optimization techniques are used to update the model’s weights based on the preference scores, aiming to improve response generation.

Integrating reinforcement learning allows large language models to dynamically adjust based on varying preference scores, moving beyond the limitations of a single, pre-determined answer.
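
To make this loop concrete, here is a minimal, self-contained Python sketch of the three-step process. The reward model and policy are toy stand-ins (a stub scorer and a small weighted sampler), not any real LLM or DeepSeek component; the point is only to show how reward-model scores drive the policy update.

```python
# Toy sketch of the three-step RL-for-LLMs loop described above.
# All names here (toy_reward_model, ToyPolicy) are hypothetical stand-ins.
import random

def toy_reward_model(prompt: str, response: str) -> float:
    """Stub for step 1's trained reward model: scores how well a response fits a prompt."""
    return len(set(prompt.split()) & set(response.split())) / (len(response.split()) or 1)

class ToyPolicy:
    """A trivial 'LLM' that picks one of a few canned responses, with learnable preferences."""
    def __init__(self, candidates):
        self.candidates = candidates
        self.weights = [1.0] * len(candidates)

    def sample(self, k=4):
        total = sum(self.weights)
        probs = [w / total for w in self.weights]
        return random.choices(range(len(self.candidates)), weights=probs, k=k)

    def update(self, idx, advantage, lr=0.1):
        # Step 3 (policy optimization): nudge a sampled response's weight by its advantage.
        self.weights[idx] = max(1e-3, self.weights[idx] + lr * advantage)

prompt = "explain reinforcement learning for language models"
policy = ToyPolicy([
    "reinforcement learning optimizes language models with reward signals",
    "bananas are yellow",
    "models language learning reinforcement reward",
])

for step in range(50):
    sampled = policy.sample(k=4)  # step 2: generate multiple responses per instruction
    scores = [toy_reward_model(prompt, policy.candidates[i]) for i in sampled]  # score each response
    baseline = sum(scores) / len(scores)
    for i, s in zip(sampled, scores):
        policy.update(i, s - baseline)  # push weights toward higher-scored responses

# Print the (weight, response) pair the toy policy now prefers.
print(max(zip(policy.weights, policy.candidates)))
```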

DeepSeek’s SPCT: Addressing the Scaling Challenges of RL for LLMs

Although reinforcement learning in post-training has proven to be a breakthrough for enhancing LLM performance, the algorithms themselves still have significant room for improvement, and the “scaling laws” of reinforcement learning remain in their nascent stages.

Unlike traditional scaling laws that focus on increasing data and compute to improve model performance, the scaling laws for reinforcement learning are influenced by more complex factors, including sample throughput, model parameter size, and the complexity of the training environment.

A major hurdle in scaling reinforcement learning is reward sparsity: the reward model is a critical component, and generating accurate reward signals is paramount. Building reward models that both generalize across domains and provide continuous, dense feedback is therefore a key focus.

DeepSeek and Tsinghua researchers addressed this challenge in their recent work by exploring the scalability and generalization of reward models at inference time. Their proposed Self-Principled Critique Tuning (SPCT) method aims to improve the scalability of general reward modeling during inference.

The SPCT approach involves two key stages:

  1. Rejection Fine-Tuning: This serves as a cold start, enabling the GRM to adapt to generating principles and critiques in the correct format and type (a rough sketch of this stage follows the list).
  2. Rule-Based Online RL: This stage further optimizes the generation of principles and critiques.
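
As a rough illustration of the rejection fine-tuning stage, the Python sketch below samples a hypothetical GRM several times per query, rejects generations that are badly formatted or whose scores disagree with the known preference, and collects the rest as cold-start fine-tuning data. `generate_with_grm`, `extract_scores`, and the expected output format are illustrative placeholders, not DeepSeek’s actual interfaces.

```python
# Hedged sketch of rejection sampling for the SPCT cold start; stubs only.
import re
import random

def generate_with_grm(query: str, responses: list[str]) -> str:
    """Placeholder for sampling the GRM: returns principles, critiques, and scores as text."""
    scores = [random.randint(1, 10) for _ in responses]
    return ("Principles: be factual; be helpful.\n"
            "Critique: response comparison.\n"
            f"Scores: {scores}")

def extract_scores(generation: str, n: int):
    """Check the expected format and pull out one score per response (None if malformed)."""
    match = re.search(r"Scores: \[(.*)\]", generation)
    if not match:
        return None
    scores = [int(s) for s in match.group(1).split(",")]
    return scores if len(scores) == n else None

def keep_for_finetuning(query, responses, best_index, num_samples=8):
    """Rejection sampling: keep generations whose predicted best response matches the label."""
    kept = []
    for _ in range(num_samples):
        gen = generate_with_grm(query, responses)
        scores = extract_scores(gen, len(responses))
        if scores is None:
            continue  # reject: wrong format
        if scores.index(max(scores)) != best_index:
            continue  # reject: disagrees with the ground-truth preference
        kept.append((query, responses, gen))
    return kept

data = keep_for_finetuning(
    "Which reply is more helpful?",
    ["Paris is the capital of France.", "I do not know."],
    best_index=0,
)
print(f"{len(data)} accepted generations would go into the cold-start fine-tuning set")
```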

To achieve effective inference-time scaling, the researchers employed parallel sampling to maximize computational utilization. By sampling multiple times, DeepSeek-GRM can generate different sets of principles and critiques and select the final reward through voting. Furthermore, a meta reward model (Meta RM) is trained to guide the voting process, further enhancing scaling performance. The Meta RM is a pointwise scalar reward model designed to assess the correctness of the principles and critiques generated by DeepSeek-GRM.
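
The Python sketch below shows the general shape of this inference-time recipe under stated assumptions: the GRM is sampled k times in parallel, a pointwise Meta RM scores each sampled set of principles and critiques, only the top-rated samples are kept, and their per-response scores are summed as votes. `sample_grm_scores` and `meta_rm_score` are stubs standing in for the real models.

```python
# Hedged sketch of parallel sampling + Meta-RM-guided voting at inference time.
import random

def sample_grm_scores(query, responses):
    """Stub for one GRM sample: fresh principles/critiques yield one score per response."""
    return [random.randint(1, 10) for _ in responses]

def meta_rm_score(query, responses, scores) -> float:
    """Stub for the pointwise Meta RM: how trustworthy is this particular generation?"""
    return random.random()

def vote_with_meta_rm(query, responses, k=8, top=4):
    samples = [sample_grm_scores(query, responses) for _ in range(k)]  # k parallel samples
    ranked = sorted(samples, key=lambda s: meta_rm_score(query, responses, s), reverse=True)
    tallies = [0] * len(responses)
    for scores in ranked[:top]:          # keep only Meta-RM-approved samples
        for i, s in enumerate(scores):
            tallies[i] += s              # voting here = summing per-response scores
    return max(range(len(responses)), key=lambda i: tallies[i])

best = vote_with_meta_rm("query", ["response A", "response B", "response C"])
print(f"Selected response index: {best}")
```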

Experimental results demonstrated that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on multiple comprehensive RM benchmarks without significant domain bias.

Looking Ahead: DeepSeek R2 on the Horizon

While the research paper focuses on advancements in reward modeling and inference-time scaling, the references to DeepSeek’s R1 series, and the progression they imply, suggest that the company is actively developing its next-generation model, R2. Given DeepSeek’s emphasis on pure reinforcement learning for enhancing reasoning, R2 is widely expected to incorporate and build upon the insights gained from this latest research on scalable reward models.

The AI community will be keenly watching for further announcements regarding DeepSeek R2, eager to see how the company leverages its innovative approaches to reinforcement learning and inference optimization to push the boundaries of large language model capabilities. The focus on scalable reward models hints at a potential emphasis on even more sophisticated self-evaluation and improvement mechanisms within their next flagship model.

The paper Inference-Time Scaling for Generalist Reward Modeling is on arXiv.

About Synced

Machine Intelligence | Technology & Industry | Information & Analysis
