In a new paper SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF, an NVIDIA research team introduces STEERLM, a novel supervised fine-tuning method that empowers end-users to control model responses during inference, surpassing even state-of-the-art baselines, including RLHF models like ChatGPT-3.5.
In the new paper Solving Math Word Problems With Process- and Outcome-based Feedback, a DeepMind research team conducts the first comprehensive comparison between process- and outcome-based model supervision. The two approaches achieve comparable final-answer error rate improvements on math word problems, while the process-based method significantly reduces reasoning errors from 14.0 to just 3.4 percent.
Facing the incomplete information environment, the asynchronous neural virtual self-play (ANFSP) method allows AI to learn to generate optimal decisions in multiple virtual environments. The approach has performed well in Texas Hold’em and multiplayer FPS video games.
Reinforcement learning (RL) has been making spectacular achievements, e.g., Atari games, AlphaGo, AlphaGo Zero, AlphaZero, DeepStack, Libratus, OpenAI Five, Dactyl, DeepMimic, Catch The Flag, learning to dress, data center cooling, chemical syntheses, drug design, etc. See more RL applications.