Natural selection driven by interspecific and intraspecific competition is a fundamental evolutionary mechanism that has led to the wide diversity and complexity of species inhabiting Earth. The process is mirrored to a degree in contemporary AI research, where competitive multi-agent reinforcement learning (RL) environments have enabled machines to reach superhuman performance.
Designing multi-agent RL environments with conditions conducive to the development of interesting and useful agent skills can, however, be a time-consuming and laborious process. A common approach in single-agent settings is domain randomization, where the agent is trained on a wide distribution of randomized environments. Recent works have improved this process via automatic environment curriculum techniques that adapt the environment distribution during training, favouring environments that produce better and more robust skills.
In the new paper AutoDIME: Automatic Design of Interesting Multi-Agent Environments, an OpenAI research team explores automatic environment design for multi-agent environments using an RL-trained teacher that samples environments to maximize student learning. The work demonstrates that intrinsic teacher rewards are a promising approach for automating both single and multi-agent environment design.
The team summarizes their main contributions as:
- We show that intrinsic teacher rewards that compare student reward or behaviour relative to some prediction can lead to faster skill emergence in multi-agent Hide and Seek and faster student learning in a single-agent random maze environment.
- We formulate an analogue of the noisy TV problem for automatic environment design and analyze the susceptibility of intrinsic teacher rewards to uncontrolled stochasticity in a single agent random-maze environment. We find that value prediction error and to a small extent policy disagreement is susceptible to stochasticity while value disagreement (teacher rewards that measure the disagreement of an ensemble of student value functions with different initializations) is much more robust.
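To make the distinction between these teacher rewards concrete, here is a minimal sketch of two of them. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of a mean absolute error, and the use of a population standard deviation across the ensemble are all choices made here, and the exact formulas in the paper may differ. The toy data mimics a "noisy TV" environment whose return is a coin flip that no student can predict.

```python
import random
from statistics import fmean, pstdev


def value_prediction_error(predicted_values, returns):
    """Mean absolute gap between a student's value predictions and observed returns.

    Assumption: absolute error averaged over time steps.
    """
    return fmean(abs(v - g) for v, g in zip(predicted_values, returns))


def value_disagreement(ensemble_values):
    """Average spread of an ensemble of student value functions.

    ensemble_values[m][t] is the value predicted by ensemble member m at step t.
    Assumption: spread is the population standard deviation across members.
    """
    per_step = zip(*ensemble_values)  # regroup predictions by time step
    return fmean(pstdev(step) for step in per_step)


# Toy "noisy TV": the return is a fair coin flip, so the best any student
# can do is predict the mean, 0.5, at every step.
rng = random.Random(0)
returns = [float(rng.random() < 0.5) for _ in range(1000)]
best_guess = [0.5] * len(returns)

# Three converged ensemble members all agree on the same best guess.
ensemble = [best_guess, best_guess, best_guess]

vpe = value_prediction_error(best_guess, returns)
vd = value_disagreement(ensemble)
print(f"value prediction error: {vpe}")
print(f"value disagreement:     {vd}")
```

In this toy case the prediction error stays high even though nothing more can be learned, so a teacher rewarded by it would keep sampling the noisy environment. The ensemble members agree with each other, so the disagreement reward drops to zero, which matches the robustness the authors report for value disagreement.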
The researchers employ the teacher-student curriculum learning (TSCL) training scheme, where an RL-trained teacher samples the environments in which the student agents are trained. The teacher is rewarded when it generates environments that enable its students to learn the most. In the proposed setup, the teacher samples an environment in a single time-step at the beginning of each student episode; the student policies are then rolled out and the teacher reward is calculated.
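The loop described above can be sketched as a simple bandit problem, since the teacher acts in a single time-step per student episode. Everything in the sketch is a hypothetical stand-in: `BanditTeacher`, `rollout_students`, the three discrete environments and their fixed rewards are illustrative assumptions, and the teacher is updated with a plain REINFORCE rule with a running baseline rather than whatever RL algorithm the paper uses.

```python
import math
import random


class BanditTeacher:
    """Softmax policy over a small discrete set of environments (hypothetical)."""

    def __init__(self, n_envs, lr=0.1, seed=0):
        self.logits = [0.0] * n_envs
        self.lr = lr
        self.baseline = 0.0  # running average of teacher rewards
        self.rng = random.Random(seed)

    def probs(self):
        m = max(self.logits)
        exps = [math.exp(x - m) for x in self.logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self):
        # One teacher action per student episode: pick an environment.
        return self.rng.choices(range(len(self.logits)), weights=self.probs())[0]

    def update(self, env_id, reward):
        # REINFORCE with a baseline: raise the probability of environments
        # whose teacher reward beats the running average.
        advantage = reward - self.baseline
        self.baseline += 0.05 * (reward - self.baseline)
        p = self.probs()
        for i in range(len(self.logits)):
            grad = (1.0 if i == env_id else 0.0) - p[i]
            self.logits[i] += self.lr * advantage * grad


def rollout_students(env_id):
    """Stand-in for rolling out the students and computing the teacher reward.

    Assumption: environment 1 is the one the students currently learn most from.
    """
    return [0.0, 1.0, 0.2][env_id]


teacher = BanditTeacher(n_envs=3)
for _ in range(2000):
    env_id = teacher.sample()           # teacher acts once, at episode start
    reward = rollout_students(env_id)   # students are rolled out, reward computed
    teacher.update(env_id, reward)      # teacher learns which environments help

final_probs = teacher.probs()
print([round(p, 3) for p in final_probs])
```

In a real system the fixed reward table would be replaced by an intrinsic signal computed from the students, and that signal would change as the students improve, so the teacher's preferred environments would shift over training.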
During their exploration, the team identified two advantages of conditional sampling, which they adopted for their teacher sampling strategy: 1) it is often easier to implement, as the teacher need not interact with every random sampling step of a procedurally generated environment; 2) having the teacher specify fewer environment parameters leads to better performance than having it specify more.
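One way to read this is that the teacher fixes only a handful of high-level parameters and leaves the rest to the procedural generator. The sketch below illustrates that reading only; `MazeConfig`, `generate_env`, the `n_walls` parameter and the grid layout are hypothetical and do not come from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class MazeConfig:
    n_walls: int          # the one parameter chosen by the teacher (hypothetical)
    wall_positions: list  # filled in by the procedural generator
    goal: tuple           # filled in by the procedural generator


def generate_env(n_walls, grid_size=8, seed=None):
    """Procedurally generate a maze conditioned on the teacher's choice.

    The teacher only supplies n_walls; every other random sampling step
    (wall and goal placement) stays inside the generator.
    """
    rng = random.Random(seed)
    cells = [(x, y) for x in range(grid_size) for y in range(grid_size)]
    chosen = rng.sample(cells, n_walls + 1)  # distinct cells, no overlap
    return MazeConfig(n_walls=n_walls, wall_positions=chosen[:-1], goal=chosen[-1])


env = generate_env(n_walls=5, seed=0)
print(env.n_walls, len(env.wall_positions), env.goal)
```

Under this reading, the teacher's action space stays small (here a single integer), which is consistent with the two advantages listed above: the generator code is left untouched, and the teacher has fewer parameters to learn.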
In their evaluations, the team compared the resulting trained teacher’s effect on student learning against baseline training with a uniform or stationary environment distribution. The experiments were conducted on environments simulated using the MuJoCo (Multi-Joint dynamics with Contact) physics engine, including a modified Hide and Seek quadrant environment and a single-agent random maze environment.
The team summarizes their findings as:
- Value prediction error and value disagreement lead to faster and more reliable skill discovery in multi-agent hide and seek than uniform sampling or policy disagreement.
- Value disagreement is a promising teacher reward for automatic environment design for both multi-agent environments and environments with stochasticity.
- Many previously proposed teacher reward schemes fall prey to adversarial situations where the teacher reward can be decoupled from genuine student learning progress.
- By sampling learnable environments, a well-designed teacher can speed up significantly the exploration of the student.
Overall, the results demonstrate that intrinsic teacher rewards, and value disagreement in particular, are a promising approach for automating both single and multi-agent environment design.
The paper AutoDIME: Automatic Design of Interesting Multi-Agent Environments is on arXiv.
Author: Hecate He | Editor: Michael Sarazen