For artificial intelligence to thrive in a complex, constantly evolving world, it must overcome significant challenges: training data of limited quality and scale, and the slow pace at which new, relevant information is created. An intriguing question emerges: can language models autonomously generate new tasks to practice on, enabling them to self-improve and better align with human preferences?
In a new paper Evolving Alignment via Asymmetric Self-Play, a research team from Google DeepMind and The University of Chicago presents a novel approach to Reinforcement Learning from Human Feedback (RLHF). Their method, called eva, is a flexible, scalable framework that can leverage any RLHF algorithm to drive more effective alignment with human values.

Traditional RLHF frameworks for aligning large language models (LLMs) often rely on a fixed prompt distribution, which limits adaptability and scalability. By contrast, the eva framework reconceives alignment as an asymmetric interaction between two roles: (1) a creator, which dynamically generates increasingly informative prompt distributions using feedback from a reward model, and (2) a solver, which learns to produce responses that align with human preferences based on these evolving prompts.

The eva framework advances alignment through a creator policy that refines prompt distributions via a straightforward estimate-sample-evolve process. It assesses each prompt’s informativeness by measuring the diversity in responses generated from it, guided by reward signals. From these insights, the creator evolves a new set of prompts, which the solver then uses to train and improve. The creator and solver can either share the same network or operate independently, depending on the implementation needs.
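The estimate-sample-evolve description maps naturally onto a short loop. The sketch below illustrates one possible reading of it; `policy` and `reward_model` are assumed interfaces (with `generate` and `score` methods), the reward-spread informativeness proxy is an illustrative stand-in for the paper's exact metric, and the prompt-mutation template is likewise only a placeholder.

```python
# Minimal sketch of the creator's estimate-sample-evolve step, under assumed
# interfaces: policy.generate(text) -> text and reward_model.score(prompt,
# response) -> float. The informativeness proxy (reward spread across sampled
# responses) and the mutation template are illustrative, not the paper's
# exact definitions.

def estimate_informativeness(prompt, policy, reward_model, n_samples=4):
    """Score a prompt by how much reward varies across responses sampled from it."""
    responses = [policy.generate(prompt) for _ in range(n_samples)]
    rewards = [reward_model.score(prompt, r) for r in responses]
    return max(rewards) - min(rewards)  # large spread => more signal to learn from

def mutate_prompt(prompt, policy):
    """Placeholder mutation: ask the model itself to rewrite the prompt into a
    harder variant (the rewrite instruction is illustrative)."""
    instruction = f"Rewrite the following prompt to be more challenging:\n{prompt}"
    return policy.generate(instruction)

def evolve_prompts(prompt_pool, policy, reward_model, top_k=100, n_mutations=2):
    """Estimate informativeness, keep the top prompts, and evolve new variants."""
    scored = sorted(
        prompt_pool,
        key=lambda p: estimate_informativeness(p, policy, reward_model),
        reverse=True,
    )
    selected = scored[:top_k]
    evolved = [mutate_prompt(p, policy) for p in selected for _ in range(n_mutations)]
    return selected + evolved
```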

To enable this interaction, the researchers designed an efficient asymmetric self-play algorithm that alternates between optimizing the creator and solver policies. This modular design allows eva to be integrated into existing alignment pipelines seamlessly.
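The alternation itself can be pictured as a simple outer loop. In the sketch below, `solver_update` stands in for any off-the-shelf preference-optimization step (e.g., DPO) and `build_preference_pairs` for the usual response-sampling-and-labeling step; both are passed in as callables rather than taken from any particular library, and `evolve_prompts` refers to the creator sketch above.

```python
# Illustrative outer loop alternating creator and solver updates. It reuses
# evolve_prompts from the previous sketch; build_preference_pairs and
# solver_update are caller-supplied callables (e.g., a reward-model labeling
# routine and one epoch of DPO), not functions from a specific library.

def eva_training_loop(prompt_pool, policy, reward_model,
                      build_preference_pairs, solver_update, n_rounds=3):
    for _ in range(n_rounds):
        # Creator step: shift the prompt distribution toward more informative prompts.
        prompt_pool = evolve_prompts(prompt_pool, policy, reward_model)

        # Solver step: collect preference pairs on the evolved prompts and run
        # any off-the-shelf preference-optimization update (DPO, SimPO, ...).
        pairs = build_preference_pairs(prompt_pool, policy, reward_model)
        policy = solver_update(policy, pairs)
    return policy
```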

In empirical tests on standard alignment benchmarks, eva delivered significant performance gains across a range of preference-optimization algorithms (e.g., DPO, SPPO, SimPO, ORPO). Notably, it achieved these improvements without any additional human-generated data, making alignment more efficient. In some cases, models trained on eva-generated prompts even matched or outperformed those trained on the human-written prompts from UltraFeedback, offering a cost-effective alternative.
In summary, eva introduces a fresh perspective on alignment by framing it as an asymmetric game where the creator role actively generates novel, learnable prompts, and the solver refines its responses to these evolving challenges. By encouraging agents to create tasks rather than just solve them, eva taps into an essential trait of intelligence—posing new problems—a dynamic often overlooked in model training.
The paper Evolving Alignment via Asymmetric Self-Play is on arXiv.
Author: Hecate He | Editor: Chain Zhang

