Since the genesis of modern AI, researchers have regarded real-world strategy games as a convenient testbed for model development. Improving performance in such games requires learning not a single strategy but a population of strategies, typically through iterative training. This approach, however, suffers from two problems: 1) under a finite budget, approximate best-response operators tend to fill the population with undertrained "good responses"; 2) relearning basic skills from scratch at every iteration is wasteful and quickly becomes intractable against increasingly strong opponents.
In a new paper, a research team from DeepMind and University College London proposes Neural Population Learning (NeuPL), an efficient and general framework that learns and represents diverse policies in symmetric zero-sum games and enables transfer learning across policies within a single conditional network.
The researchers cite the popular game "rock-paper-scissors," where a population with two available strategies (rock, paper) will beat a singleton population (scissors) if both populations are revealed. This is reflected in the unifying population learning framework Policy-Space Response Oracles (PSRO, Lanctot et al., 2017), where at each iteration a new policy is trained to best-respond to a mixture over previous policies chosen by a meta-strategy solver. A PSRO-style approach was used to master StarCraft II in 2019.
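The rock-paper-scissors intuition can be checked with a few lines of numpy. The sketch below (an illustration of the article's example, not code from the paper) encodes the antisymmetric payoff matrix and shows that when both populations are revealed, the two-strategy population {rock, paper} always has a winning member against the singleton {scissors}:

```python
import numpy as np

# Row player's payoff in rock-paper-scissors; order: rock, paper, scissors.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

population = [0, 1]  # {rock, paper}
scissors = 2         # the singleton population

# With populations revealed, the larger population deploys its best member
# against scissors: rock wins (+1), so the population beats the singleton.
best = max(A[i, scissors] for i in population)
print(best)  # 1
```

A singleton population can never do better than a tie against itself, which is why population learning iteratively grows the set of available strategies rather than training one.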
Such iterative and isolated approaches from classic game theory are, however, fundamentally different from how humans learn diverse strategies: incremental strategic innovations let us develop new strategies by revisiting and improving upon those we have already mastered. The proposed NeuPL framework aims to endow AI agents with similar capabilities by extending population learning to real-world games.
NeuPL was designed to satisfy two desiderata: 1) At convergence, the resulting population of policies should represent a sequence of iterative best-responses under reasonable conditions; 2) Transfer learning can occur across policies throughout training. This approach deviates from PSRO in several important ways:
- NeuPL trains all unique policies concurrently and continually, so that no prematurely truncated "good response" enters the population.
- NeuPL represents an entire population of policies via a shared conditional network with each policy conditioned on and optimized against a meta-game mixture strategy, enabling transfer learning across policies.
- NeuPL allows for cyclic interaction graphs, beyond the scope of PSRO.
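The second point above, representing an entire population in one shared conditional network, can be sketched as follows. This is a hypothetical toy (class and parameter names are illustrative, not from the paper): each policy is identified by the opponent mixture sigma it best-responds to, and that mixture vector is fed to the network alongside the observation, so all policies share parameters and skills can transfer between them:

```python
import numpy as np

rng = np.random.default_rng(0)

class ConditionalPolicy:
    """Toy sketch of a NeuPL-style shared conditional network.

    One set of weights represents the whole population; a policy is
    selected by conditioning on the meta-game mixture sigma over
    opponents that it is optimized against.
    """
    def __init__(self, obs_dim, n_policies, n_actions, hidden=16):
        in_dim = obs_dim + n_policies  # observation + mixture vector
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, n_actions))

    def act_logits(self, obs, sigma):
        x = np.concatenate([obs, sigma])  # condition on the meta-mixture
        h = np.tanh(x @ self.W1)          # shared parameters enable transfer
        return h @ self.W2

pop = ConditionalPolicy(obs_dim=4, n_policies=3, n_actions=2)
sigma = np.array([1.0, 0.0, 0.0])  # "policy 2": best-respond to policy 1
logits = pop.act_logits(np.zeros(4), sigma)
print(logits.shape)  # (2,)
```

Because the mixture vector is just another input, nothing restricts which policies may respond to which, which is how the framework accommodates the cyclic interaction graphs mentioned in the last bullet.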
The researchers also note that NeuPL offers convergence guarantees to a population of best-responses under mild assumptions, is general, improves performance and efficiency across domains, and makes novel strategies more accessible, not less, as the neural population expands.
To evaluate NeuPL’s effectiveness, the team conducted experiments using Maximum a Posteriori Policy Optimisation (MPO, Abdolmaleki et al., 2018) as the underlying reinforcement learning (RL) algorithm across several domains.
The experiments validate NeuPL’s generality from two aspects: it recovers the expected results of existing population learning algorithms on rock-paper-scissors; and it generalizes to the spatiotemporal, partially observed strategy game of running-with-scissors (Vezhnevets et al., 2020), where players must infer opponent behaviours through tactical interactions. The results also show that NeuPL induces skill transfer across policies, enabling the discovery of exploiters to strong opponents that would otherwise have been out of reach; and that it scales to the large-scale game of skill of MuJoCo Football (Liu et al., 2019), where a concise sequence of best-responses is learned, reflecting the game’s prominent transitive skill dimension.
The team regards their study as a step toward scalable policy space exploration, and suggests going beyond the symmetric zero-sum setting as a possible direction for future research in this area.
The paper NeuPL: Neural Population Learning is on arXiv.
Author: Hecate He | Editor: Michael Sarazen