Large Language Models (LLMs) have revolutionized natural language understanding, delivering remarkable performance across a wide array of tasks. However, aligning their responses with user instructions remains an open challenge. Conventional methods rely on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align these models with human preferences.
Yet RLHF has its limitations: its training setup is intricate, and it risks encoding implicit values that users cannot adjust at inference time. Moreover, RLHF commonly depends on a single-dimensional feedback signal rather than explicit, multifaceted signals covering attributes such as helpfulness, humor, and toxicity.
To address this issue, in a new paper SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF, an NVIDIA research team introduces STEERLM, a novel supervised fine-tuning method that empowers end users to control model responses during inference and surpasses state-of-the-art baselines, including RLHF models such as ChatGPT-3.5.

The team summarizes their main contributions as follows:
- Introducing STEERLM: STEERLM is presented as a straightforward alternative for language model alignment, exclusively leveraging the language modeling objective.
- Efficacy of STEERLM 43B: The research demonstrates the remarkable performance of STEERLM 43B on the Vicuna benchmark, outshining state-of-the-art baselines, including RLHF models like ChatGPT-3.5.
- Flexibility and Customizability: STEERLM 43B is adaptable and customizable, enabling users to adjust attribute values at inference time and supporting a wide range of applications.

STEERLM stands out as a simple, innovative method to align language models with user instructions. It offers a computationally efficient alternative to RLHF, comprising four key steps:
- Attribute Prediction Model: The base language model is trained to evaluate response quality by predicting attribute values.
- Annotating Datasets using Attribute Prediction Model: The attribute prediction model annotates response quality across various datasets.
- Attribute Conditioned SFT: Given a prompt and desired attribute values, a new base model is fine-tuned to generate responses aligned with the specified attributes.
- Bootstrapping with High-Quality Samples: Multiple responses are sampled from the model fine-tuned in Step 3, conditioned on maximum quality. These responses are then scored by the trained attribute prediction model and used for a further round of attribute-conditioned fine-tuning.
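To make the pipeline concrete, here is a minimal Python sketch of Steps 2 through 4. The function names, the `<attributes>` tag format, and the attribute names and value scale (e.g., quality from 0 to 4) are illustrative assumptions, not the exact template or scoring scheme used by the NVIDIA team.

```python
from typing import Dict, List

# Hypothetical scorer standing in for the trained attribute prediction model
# (Step 1): it maps a (prompt, response) pair to attribute values,
# e.g. {"quality": 4, "humor": 0, "toxicity": 0}.
def predict_attributes(prompt: str, response: str) -> Dict[str, int]:
    raise NotImplementedError  # backed by the base LM fine-tuned as a scorer

def format_example(prompt: str, response: str, attrs: Dict[str, int]) -> str:
    """Step 3: build an attribute-conditioned SFT example by prepending the
    desired attribute values to the prompt, so the model learns to generate
    responses that match them."""
    attr_str = ",".join(f"{k}:{v}" for k, v in sorted(attrs.items()))
    return f"<attributes>{attr_str}</attributes>\nUser: {prompt}\nAssistant: {response}"

def annotate_dataset(pairs: List[Dict[str, str]]) -> List[str]:
    """Step 2: score each (prompt, response) pair with the attribute
    prediction model, then render it in the conditioned format."""
    examples = []
    for pair in pairs:
        attrs = predict_attributes(pair["prompt"], pair["response"])
        examples.append(format_example(pair["prompt"], pair["response"], attrs))
    return examples

def bootstrap(prompts: List[str], generate, top_quality: int = 4, n: int = 4) -> List[str]:
    """Step 4: sample several responses per prompt from the model trained in
    Step 3, conditioned on maximum quality, re-score them, and keep the
    conditioned examples for a further round of fine-tuning."""
    new_examples = []
    for prompt in prompts:
        for _ in range(n):
            response = generate(prompt, {"quality": top_quality})
            attrs = predict_attributes(prompt, response)
            new_examples.append(format_example(prompt, response, attrs))
    return new_examples
```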
In their empirical study, the team compares STEERLM against various state-of-the-art instruction-following models, including OpenAI ChatGPT-3.5, OpenAI text-davinci-003, and Guanaco 65B. To highlight the contrast between RLHF and SFT, they also include OASST LLaMA 30B SFT, a model aligned solely with SFT.

In both automatic and human evaluations, STEERLM 43B consistently outperforms the baseline models: its responses are preferred over those of state-of-the-art models trained with RLHF, while it is significantly easier to train.
The research team’s vision is that their work will inspire further exploration and development of simple yet effective model alignment methods, ultimately empowering the creation of improved AI assistants for everyone.
Try STEERLM on Hugging Face. The paper SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF is available on arXiv.
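As an illustration of inference-time steering, the following is a minimal sketch using the Hugging Face transformers library. The model id is a placeholder to be replaced with the released SteerLM checkpoint, and the attribute prompt follows the hypothetical template from the sketch above, not necessarily the checkpoint's actual format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; substitute the actual SteerLM checkpoint from the Hub.
model_id = "path/to/steerlm-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Steer the response at inference time by declaring the desired attribute
# values alongside the user prompt (illustrative format only).
prompt = (
    "<attributes>quality:4,helpfulness:4,humor:0,toxicity:0</attributes>\n"
    "User: Explain reinforcement learning from human feedback in two sentences.\n"
    "Assistant:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```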
Author: Hecate He | Editor: Chain Zhang
