Researchers from Zhejiang University have proposed a new method combining Monte Carlo tree search with Neural Fictitious Self-Play (NFSP) to improve performance on large-scale zero-sum imperfect-information games. Facing the incomplete information environment, the asynchronous neural virtual self-play (ANFSP) method allows AI to learn to generate optimal decisions in multiple virtual environments. The approach has performed well in Texas Hold’em and multiplayer FPS video games.
In May 2017 DeepMind’s AlphaGo Master dispatched the world’s top Go player Ke Jie in a three-game match — a feat many in the AI research community believed was still a decade away. Machines however have not yet achieved comparable progress in imperfect information games such as Starcraft or DOTA 2. The main motive behind games research is to evolve a practical real life AI that can solve complex problems with high computational dimensions, such as transaction and traffic control. Researchers believe that one of the main reasons machines struggle in this regard is a lack of theoretical and quantitative evaluation of training and results.
University College London’s Johannes Heinrich and DeepMind’s David Silver previously introduced Neural Fictitious Self-Play (NFSP), which combines FSP and neural network functions. A “player” consists of a Q-learning network and a supervised learning network. This algorithm calculates a best response through greedy deep Q-learning, and an average strategy by supervising the historical behavior of the agent. It solves coordination problems by introducing expected dynamics — the players act on their average strategy and optimal responses. This is the first end-to-end reinforcement learning method that can learn an approximate Nash equilibrium (mutually optimal fixed strategies in game theory) without any prior knowledge in an incomplete-information game. However, due to the complexity of the adversary strategy and the nature of the deep Q network learning in offline mode, NFSP performs poorly in games with large search spaces and depths. Another shortfall is that in NFSP, the optimal response depends on deep Q-learning calculations, which require a long time until convergence.
The researchers therefore introduced Monte Carlo Neural Fictitious Self Play (MC-NFSP), which combines NFSP and Monte Carlo Tree Search. They evaluated the method in the zero-sum board game Othello, concluding that MC-NFSP will converge to an approximate Nash equilibrium, while NFSP will not. Researchers also proposed Asynchronous Neural Fictitious Self-Play (ANFSP), which uses parallel actor learners to stabilize and accelerate training. This reduces the memory required for data storage compared to NFSP. They evaluated this method in a two-player zero-sum poker game and determined that ANFSP can approach the approximate Nash equilibrium more stably and quickly than NFSP.
To further demonstrate the advantages of MC-NFSP and ANFSP technologies in complex games, researchers evaluated the effectiveness of the algorithm in a multiplayer FPS battle game, in which the AI agent team competed with a human team. The new system provided a strategy and control which enabled the AI agents to win the game.
The paper Monte Carlo Neural Fictitious Self-Play: Achieve Approximate Nash equilibrium of Imperfect-Information Games is on arXiv.
Author: Reina Qi Wan | Editor: Michael Sarazen