DeepMind recently published a paper in Nature introducing the latest evolution of its AI-powered Go program. “AlphaGo Zero” learns entirely through self-play, with no human knowledge required. The program crushed previous “AlphaGo” versions (including the one that beat world number one Ke Jie) with a record of 100 wins and zero losses, stimulating discussion in the Go and AI communities.
Facebook AI researcher Yuandong Tian — who built Facebook’s Go program Darkfmcts3 in 2015 — shared his views on AlphaGo Zero with Synced:
DeepMind’s paper Mastering the Game of Go Without Human Knowledge is much better than its January 2016 predecessor, Mastering the Game of Go With Deep Neural Networks and Tree Search. The new paper’s method is clean and standardized, and it is surely destined to be a classic.
AlphaGo Zero merges the separate policy and value networks of its previous versions into a single network with shared parameters, which in itself is not novel. Most reinforcement learning algorithms now do the same thing, including my projects Doom AI Bot F1, which took first place in Track 1 at the Visual Doom AI Competition in 2016, and ELF (an Extensive, Lightweight and Flexible platform for fundamental reinforcement learning research).
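The shared-parameter idea can be sketched as a single trunk feeding two heads, one producing move probabilities and one producing a position evaluation. The plain-Python linear layers and the layer sizes below are illustrative assumptions for a 3×3 board, not AlphaGo Zero's actual architecture, which is a deep residual CNN:

```python
import math, random

random.seed(0)

def linear(n_in, n_out):
    # A weight matrix with small random values (no training here;
    # this sketch only shows the architecture).
    return [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

class PolicyValueNet:
    def __init__(self, n_in=9, n_hidden=16, n_moves=9):
        self.trunk = linear(n_in, n_hidden)          # shared by both heads
        self.policy_head = linear(n_hidden, n_moves)
        self.value_head = linear(n_hidden, 1)

    def forward(self, board):
        # Shared features computed once, then consumed by both heads.
        h = [math.tanh(z) for z in apply(self.trunk, board)]
        logits = apply(self.policy_head, h)
        exps = [math.exp(z - max(logits)) for z in logits]
        policy = [e / sum(exps) for e in exps]        # softmax over moves
        value = math.tanh(apply(self.value_head, h)[0])  # evaluation in (-1, 1)
        return policy, value
```

Because the trunk is shared, a gradient signal from either head would update the same features, which is the parameter-sharing the paragraph above refers to.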
I was, however, very surprised that AlphaGo Zero surpassed its predecessors after only about five million self-play games, with just 1,600 Monte Carlo rollouts per move. Many of the early games were played almost randomly, yet the agent learned very quickly, even though the total number of states covered across all five million games was about 10^9, a tiny fraction of all legal Go board positions (about 10^170).
AlphaGo Zero’s success proves that Convolutional Neural Networks (CNN) can work well in solving Go, which is something like extrapolating the entire Encyclopedia Britannica simply by looking at the first letter of the first word.
From a machine learning perspective, the CNN’s inductive bias is extremely well suited to the rules of Go, so strong results can be achieved from relatively few samples. However, many moves in human Go play don’t make sense to computers, which complicates CNN learning. For example, on servers like KGS it takes Go bots a long time to overfit to unusual human moves.
Even if I’m correct about the CNN’s suitability for Go, this does not mean we should be overly optimistic about applying CNNs in other fields. I suspect that for research problems such as protein folding, neural networks might not be a good fit: a CNN would fall back on rote memorization, which weakens generalization, and in that context self-play might not be effective.
In fact, this is why self-play in Go did not previously achieve significant progress. Before AlphaGo, Computer Go researchers used hand-tuned features with a linear classifier, which was not the right model. In short, the key to AlphaGo’s performance was finding the right model for self-play.
The success of CNN algorithms, however, does not make AlphaGo Zero the “God of Go”. Using a CNN with a ResNet architecture to learn Go resembles the human learning process, and although today’s top professionals’ moves are similar to AlphaGo Zero’s, the program operates orders of magnitude faster than any human. Imagine that aliens learned Go using an RNN, with a different inductive bias: they might discover another (and possibly stronger) way of playing the game. At this point, it is too early to predict the ultimate level that Computer Go can reach.
The CNN’s success in Go proves the importance of studying the theory of deep learning algorithms. Machines can solve problems that humans struggle with by adopting a model whose inductive bias matches the structure of the problem. But researchers usually don’t know how to tailor their algorithms to the key features of a problem except through trial and error. If we could understand quantitatively how deep learning works on different kinds of data, I believe it would become easier to know what types of data and models to adopt for different problems. I firmly believe that the proper structure of data is another key to unlocking the magical effect of deep learning.
A final question: why does DeepMind use Monte Carlo Tree Search (MCTS) rather than standard reinforcement learning methods? I am not a DeepMind employee, so I can only speculate. MCTS is a form of online planning: it uses non-parametric methods to estimate a local Q function, then uses that estimate to decide where to direct the next rollout.
MCTS is a good option for Go because the game satisfies MCTS’s preconditions: complete information about the environment and a perfect forward model that determines the next state from any state and move.
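To make those preconditions concrete, here is a minimal UCT-style MCTS sketch on a tiny Nim variant (players alternately remove 1 or 2 stones; whoever takes the last stone wins), a perfect-information game with a trivial forward model. The game, the exploration constant, and the simulation budget are illustrative assumptions; AlphaGo Zero additionally guides the search with its network instead of random rollouts:

```python
import math, random

def legal_moves(stones):
    return [m for m in (1, 2) if m <= stones]

class Node:
    def __init__(self, stones, to_move):
        self.stones = stones
        self.to_move = to_move       # player about to move (0 or 1)
        self.children = {}           # move -> child Node
        self.visits = 0
        self.wins = 0.0              # wins for the player who moved INTO this node

def select_child(node, c=1.4):
    # UCB1: trade off average win rate against exploration.
    log_n = math.log(node.visits)
    return max(node.children.items(),
               key=lambda kv: kv[1].wins / kv[1].visits
               + c * math.sqrt(log_n / kv[1].visits))

def rollout(stones, to_move):
    # Random playout using the perfect forward model; returns the winner.
    while True:
        stones -= random.choice(legal_moves(stones))
        if stones == 0:
            return to_move
        to_move = 1 - to_move

def mcts(root_stones, root_player, n_sims=2000):
    root = Node(root_stones, root_player)
    for _ in range(n_sims):
        node, path = root, [root]
        # 1. Selection: descend while the node is fully expanded.
        while node.stones > 0 and len(node.children) == len(legal_moves(node.stones)):
            _, node = select_child(node)
            path.append(node)
        # 2. Expansion: add one untried child.
        if node.stones > 0:
            m = random.choice([m for m in legal_moves(node.stones)
                               if m not in node.children])
            child = Node(node.stones - m, 1 - node.to_move)
            node.children[m] = child
            node = child
            path.append(node)
        # 3. Simulation: estimate the leaf's value with a random rollout.
        if node.stones == 0:
            winner = 1 - node.to_move   # the move into this node took the last stone
        else:
            winner = rollout(node.stones, node.to_move)
        # 4. Backpropagation: credit each node from its mover's perspective.
        for n in path:
            n.visits += 1
            if winner == 1 - n.to_move:
                n.wins += 1
    # Play the most-visited move.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

Note that steps 1–3 rely entirely on being able to compute the successor of any state exactly; this is the forward-model requirement that the next paragraph contrasts with the Atari setting.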
However, to apply MCTS to train a game bot for Atari, for example, researchers must either build a forward model or embed an Atari simulator in the training algorithm. Compared to actor-critic or policy gradient methods, which only need the states along the current trajectory, MCTS adds considerable complexity.
Finally, the new DeepMind paper reveals that AlphaGo Zero’s implementation is simpler and requires much less computing power than its predecessors.
I believe that researchers will further explore AlphaGo Zero’s methodology and pull out even more insights. I look forward to it.
Author: Yuandong Tian | Localization: Tony Peng | Editor: Michael Sarazen