Introduced in 2020, MuZero (Schrittwieser et al.) is a model-based general reinforcement learning agent that combines a learned model of the environment dynamics with a Monte Carlo tree search planning algorithm. Although MuZero has achieved state-of-the-art results across domains ranging from board games to visually rich environments, its learned model is deterministic, and it therefore struggles in real-world environments that are inherently stochastic.
In the new paper Planning in Stochastic Environments with a Learned Model, a research team from DeepMind and University College London proposes Stochastic MuZero, an extension that learns a stochastic model of the environment. The new approach achieves performance comparable to or better than state-of-the-art methods in complex single- and multi-agent environments while maintaining the superhuman performance of the original MuZero in deterministic environments such as the game of Go.

Stochastic MuZero combines a learned stochastic transition model of the environment with a Monte Carlo tree search (MCTS) variant that plans over this model. Unlike MuZero, which uses latent states only to represent real environment states, Stochastic MuZero also leverages afterstates (Sutton & Barto, 2018) to capture the stochastic dynamics. An afterstate is the hypothetical state of the environment after an action has been applied but before the environment has transitioned to its next true state.
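
To make the state/afterstate split concrete, here is a minimal sketch on a toy one-dimensional random-walk environment (the environment and function names are illustrative, not from the paper): the agent's action is applied deterministically to produce an afterstate, and an independent chance outcome then produces the next true state. In 2048, for instance, the afterstate is the board after the player's slide but before a new tile is randomly spawned.

```python
# Illustrative toy example (not from the paper): a 1-D random walk where the
# transition splits into a deterministic step (state + action -> afterstate)
# and a stochastic step (afterstate + chance outcome -> next state).
import random

def apply_action(state: int, action: int) -> int:
    """Deterministic part of the transition: produces the afterstate."""
    return state + action                 # e.g. move left (-1) or right (+1)

def sample_chance(afterstate: int) -> int:
    """Stochastic part of the transition: the environment's random outcome."""
    noise = random.choice([-1, 0, +1])    # random "wind" applied after the action
    return afterstate + noise

state = 0
afterstate = apply_action(state, action=+1)   # hypothetical post-action state
next_state = sample_chance(afterstate)        # true next state after the chance event
```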

The Stochastic MuZero algorithm comprises five functions: 1) a representation function that maps the current observation to a latent state; 2) an afterstate dynamics function that maps a latent state and an action to the next latent afterstate; 3) a dynamics function that maps a latent afterstate and a chance outcome to the next latent state and a reward prediction; 4) a prediction function that generates value and policy predictions from a latent state; and 5) an afterstate prediction function that generates a value prediction and a distribution over future chance outcomes from a latent afterstate.
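
As a rough illustration of how these five functions fit together, the sketch below stubs them out as plain callables; the class name, method names, and interface are assumptions for exposition, not the authors' implementation.

```python
# A rough sketch (assumed interface, not the authors' code) of the five learned
# functions that make up Stochastic MuZero.
from typing import Callable

class StochasticMuZeroModel:
    def __init__(self,
                 representation: Callable,         # observation -> latent state
                 afterstate_dynamics: Callable,    # (state, action) -> latent afterstate
                 dynamics: Callable,               # (afterstate, chance code) -> (next state, reward)
                 prediction: Callable,             # state -> (policy, value)
                 afterstate_prediction: Callable): # afterstate -> (chance distribution, value)
        self.representation = representation
        self.afterstate_dynamics = afterstate_dynamics
        self.dynamics = dynamics
        self.prediction = prediction
        self.afterstate_prediction = afterstate_prediction

    def initial_inference(self, observation):
        """Encode the root observation and predict its policy and value."""
        state = self.representation(observation)
        policy, value = self.prediction(state)
        return state, policy, value

    def recurrent_inference(self, state, action, chance_code):
        """Unroll the model one step: action -> afterstate -> chance -> next state."""
        afterstate = self.afterstate_dynamics(state, action)
        chance_dist, afterstate_value = self.afterstate_prediction(afterstate)
        next_state, reward = self.dynamics(afterstate, chance_code)
        policy, value = self.prediction(next_state)
        return next_state, reward, policy, value, chance_dist, afterstate_value
```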

Stochastic MuZero also extends the MCTS algorithm by introducing chance nodes and chance values into the search. Each chance node corresponds to a latent afterstate and is expanded by querying the stochastic model; the value obtained is then backpropagated up the tree, and whenever a chance node is traversed during the selection phase, a chance outcome (code) is drawn from the prior distribution predicted by the model. The search can thus be applied effectively to stochastic environments.
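
The sketch below shows, in highly simplified form, how such chance nodes can sit alongside ordinary decision nodes in the search tree; the class and method names are illustrative assumptions, not the paper's actual search code.

```python
# Simplified sketch (assumptions mine): decision nodes branch on actions,
# chance nodes hold a latent afterstate and branch on chance outcomes (codes)
# sampled from the learned prior; values are backpropagated along the path.
import random

class ChanceNode:
    """Corresponds to a latent afterstate; children are indexed by chance codes."""
    def __init__(self, afterstate, prior):
        self.afterstate = afterstate
        self.prior = prior          # dict: chance code -> probability (from the model)
        self.children = {}          # chance code -> DecisionNode
        self.visit_count = 0
        self.value_sum = 0.0

    def select_code(self):
        """During selection, draw a chance outcome from the learned prior."""
        codes = list(self.prior.keys())
        weights = list(self.prior.values())
        return random.choices(codes, weights=weights, k=1)[0]

class DecisionNode:
    """Corresponds to a latent state; children are indexed by actions."""
    def __init__(self, state):
        self.state = state
        self.children = {}          # action -> ChanceNode
        self.visit_count = 0
        self.value_sum = 0.0

def backpropagate(path, value):
    """Propagate the value obtained at an expanded leaf up the search path."""
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += value
```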

In their empirical studies, the team applied Stochastic MuZero to a variety of challenging stochastic and deterministic environments, including the classic puzzle game 2048, Backgammon, and the deterministic game of Go. The results show that Stochastic MuZero significantly outperforms MuZero in stochastic environments, performs on par with or better than AlphaZero, and matches or exceeds previous methods that use a perfect stochastic simulator, all without requiring any prior knowledge of the environment.
The paper Planning in Stochastic Environments with a Learned Model was published as a conference paper at ICLR 2022 and is available on OpenReview.
Author: Hecate He | Editor: Michael Sarazen
