Games have become a popular proving ground for AI techniques, particularly in the field of reinforcement learning (RL). Introduced in 2017, Policy Space Response Oracles (PSRO) is a multi-agent RL method for finding approximate Nash equilibria (NE) that has achieved state-of-the-art performance in large imperfect-information two-player zero-sum games such as StarCraft and Stratego. Although PSRO will, in theory, eventually reach convergence, in large games it is often terminated early, which can result in the ending exploitability being even higher than it was at the beginning.
To address this issue, a research team from the University of California Irvine and DeepMind has proposed Anytime Optimal PSRO (AO-PSRO), a new PSRO variant for two-player zero-sum games that is guaranteed to converge to a Nash equilibrium while decreasing exploitability from iteration to iteration.
The team summarizes their work’s main contributions as:
- We present a tabular no regret double oracle algorithm, Anytime Optimal DO (AODO), that updates a no-regret algorithm against a new best response every iteration in the inner loop. AODO is guaranteed to not increase exploitability from one iteration to the next.
- We present a method for finding the minimum exploitability meta-distribution in AODO via a no-regret procedure against best responses, called RM-BR DO.
- We present a reinforcement learning version of AODO, Anytime Optimal PSRO (AO-PSRO), that outperforms PSRO on certain games and tends to not increase exploitability, unlike PSRO.
The team first introduces Anytime Optimal DO (AODO), a tabular no-regret double-oracle algorithm guaranteed to converge to a Nash equilibrium (settling on an optimal game-strategy solution). Because there are a finite number of pure strategies in the original game, AODO will terminate with a meta-distribution that is a Nash equilibrium in the original game. Moreover, unlike traditional double oracle methods, the exploitability of the meta-distribution for AODO decreases with every iteration.
The team then proposes Regret Minimizing against a Best Response Double Oracle (RM-BR DO), a specific AODO form that finds optimal meta-strategies for the restricted player in each restricted game. In a given game, RM-BR DO uses three steps: a no-regret algorithm is initialized; at every iteration, a best response to the current meta-strategy is calculated; and the meta-strategy is updated with respect to this best response.
Although the RM-BR DO algorithm is guaranteed to not significantly increase exploitability from one iteration to the next, it requires extensive computations on countless different best responses in each inner loop iteration, which may become infeasible in a larger game. The team therefore further refines it to create Anytime Optimal PSRO (AO-PSRO), which can update the approximate best response in fewer iterations to get a full best response. The researchers leverage tabular Q-learning to update to the best response, thus boosting performance compared to PSRO while not increasing exploitability.
In experiments on Leduc poker and random normal-form games, the proposed AO-PSRO demonstrated far lower exploitability than DO and PSRO and never increased exploitability, validating its contribution to the advancement of PSRO research. The team suggests future studies might explore whether such performance improvements could be leveraged at a larger scale using deep RL with function approximation.
The paper Anytime Optimal PSRO for Two-Player Zero-Sum Games is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.