UC Irvine & DeepMind’s Anytime Optimal PSRO: Guaranteed Convergence to a Nash Equilibrium With Decreased Exploitability in Two-Player Zero-Sum Games

Games have become a popular proving ground for AI techniques, particularly in the field of reinforcement learning (RL). Introduced in 2017, Policy Space Response Oracles (PSRO) is a multi-agent RL method for finding approximate Nash equilibria (NE) that has achieved state-of-the-art performance in large imperfect-information two-player zero-sum games such as StarCraft and Stratego. Although PSRO will, in theory, eventually reach convergence, in large games it is often terminated early, which can result in the ending exploitability being even higher than it was at the beginning.

To address this issue, a research team from the University of California Irvine and DeepMind has proposed Anytime Optimal PSRO (AO-PSRO), a new PSRO variant for two-player zero-sum games that is guaranteed to converge to a Nash equilibrium while decreasing exploitability from iteration to iteration.

The team summarizes their work’s main contributions as:

We present a tabular no regret double oracle algorithm, Anytime Optimal DO (AODO), that updates a no-regret algorithm against a new best response every iteration in the inner loop. AODO is guaranteed to not increase exploitability from one iteration to the next.
We present a method for finding the minimum exploitability meta-distribution in AODO via a no-regret procedure against best responses, called RM-BR DO.
We present a reinforcement learning version of AODO, Anytime Optimal PSRO (AO-PSRO), that outperforms PSRO on certain games and tends to not increase exploitability, unlike PSRO.

The team first introduces Anytime Optimal DO (AODO), a tabular no-regret double-oracle algorithm guaranteed to converge to a Nash equilibrium (settling on an optimal game-strategy solution). Because there are a finite number of pure strategies in the original game, AODO will terminate with a meta-distribution that is a Nash equilibrium in the original game. Moreover, unlike traditional double oracle methods, the exploitability of the meta-distribution for AODO decreases with every iteration.

The team then proposes Regret Minimizing against a Best Response Double Oracle (RM-BR DO), a specific AODO form that finds optimal meta-strategies for the restricted player in each restricted game. In a given game, RM-BR DO uses three steps: a no-regret algorithm is initialized; at every iteration, a best response to the current meta-strategy is calculated; and the meta-strategy is updated with respect to this best response.

Although the RM-BR DO algorithm is guaranteed to not significantly increase exploitability from one iteration to the next, it requires extensive computations on countless different best responses in each inner loop iteration, which may become infeasible in a larger game. The team therefore further refines it to create Anytime Optimal PSRO (AO-PSRO), which can update the approximate best response in fewer iterations to get a full best response. The researchers leverage tabular Q-learning to update to the best response, thus boosting performance compared to PSRO while not increasing exploitability.

In experiments on Leduc poker and random normal-form games, the proposed AO-PSRO demonstrated far lower exploitability than DO and PSRO and never increased exploitability, validating its contribution to the advancement of PSRO research. The team suggests future studies might explore whether such performance improvements could be leveraged at a larger scale using deep RL with function approximation.

The paper Anytime Optimal PSRO for Two-Player Zero-Sum Games is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

6 comments on “UC Irvine & DeepMind’s Anytime Optimal PSRO: Guaranteed Convergence to a Nash Equilibrium With Decreased Exploitability in Two-Player Zero-Sum Games”

Abram Abram

2022-01-21

I didn’t have any expectations concerning that title, but the more I was astonished. The author did a great job. I spent a few minutes reading and checking the facts. Everything is very clear and understandable. I like posts that fill in your knowledge gaps. This one is of the sort.

Loading...

kamir bouchareb st

2022-03-03

gooooooooooooooood

Loading...

tommyrider2000

2026-01-19

The advancements in AI technology, particularly in reinforcement learning and game theory, are truly fascinating. The introduction of Anytime Optimal PSRO by the UC Irvine and DeepMind team marks a significant step forward in achieving Nash equilibria in complex games. As someone residing in Germany, it’s exciting to see how these innovations can potentially influence various sectors, including online gaming. For instance, platforms like Ivybet casino could benefit from such technologies to enhance user engagement and optimize game strategies. The implications for fairness and strategy in gaming are profound, and I’m eager to see how this will unfold in the future.

Loading...

tracylucanda

2026-03-10

The idea of optimizing reinforcement learning systems specifically for TPU Pods shows how important Free Web Games hardware–software co-design is for scaling modern AI.

Loading...

Anna Kena

2026-03-25

Thank you for sharing this post with slope game 2.

Loading...

Major

2026-04-20

Hey, a colleague from Pakistan suggested trying something with extra perks, so I gave it a shot. At first I had no luck and almost quit, but I continued through https://glory.org.pk/bonus and decided to take one more chance with a higher bet. That final attempt paid off with a decent win that made the session worthwhile. Now I return sometimes when I want to unwind a bit.

Loading...

UC Irvine & DeepMind’s Anytime Optimal PSRO: Guaranteed Convergence to a Nash Equilibrium With Decreased Exploitability in Two-Player Zero-Sum Games

Like this:

6 comments on “UC Irvine & DeepMind’s Anytime Optimal PSRO: Guaranteed Convergence to a Nash Equilibrium With Decreased Exploitability in Two-Player Zero-Sum Games”

Leave a Reply Cancel reply

Related

Share this:

Like this:

6 comments on “UC Irvine & DeepMind’s Anytime Optimal PSRO: Guaranteed Convergence to a Nash Equilibrium With Decreased Exploitability in Two-Player Zero-Sum Games”

Leave a Reply Cancel reply

Related