Humans adapt to environmental challenges, and over eons that adaptability has driven our biological evolution. It is an essential characteristic found in animals but absent in AI.
Although machine learning has made remarkable progress in complex games such as Go and Dota 2, the skills mastered in these arenas do not necessarily generalize to practical applications in real-world scenarios. The goal for a growing number of researchers is to build a machine intelligence that behaves, learns and evolves more like humans.
A new paper from San Francisco-based OpenAI proposes that training models in the children’s game of hide-and-seek and pitting them against each other in tens of millions of contests results in the models automatically developing humanlike behaviors that increase their intelligence and improve subsequent performance.
Why hide-and-seek? Hide-and-seek was selected as a fun starting point mostly due to its simple rules, says the paper’s first author, OpenAI Researcher Bowen Baker.
The game rules: All agents are simulated as spherical objects that can perform three types of actions: navigate; grab and move objects; and lock objects, which can only be unlocked by that agent’s teammates.
Researchers placed one to three “hiders” and one to three “seekers” in simulated, physically-grounded environments with rooms bordered by static walls. The environments also included movable boxes of different sizes, and movable ramps.
Researchers used reinforcement learning to train the agents, whose reward signal — the incentive mechanism to stimulate agents to achieve their goals — followed the simple rules of hide-and-seek: hiders get a reward when they remain hidden from seekers, and seekers are rewarded when they find the hiders.
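The reward rule described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's code: the function name `team_rewards`, the per-hider visibility flags, the team-level scoring and the +1/-1 magnitudes are all assumptions made for the example. The article itself only states that hiders are rewarded for staying hidden and seekers for finding them.

```python
from typing import List, Tuple


def team_rewards(hider_seen: List[bool]) -> Tuple[float, float]:
    """Return (hider_reward, seeker_reward) for a single timestep.

    hider_seen[i] is True when at least one seeker currently sees hider i.
    The +1/-1 values and the team-level scoring are illustrative assumptions.
    """
    any_seen = any(hider_seen)
    # Hiders are rewarded only while every one of them stays hidden.
    hider_reward = -1.0 if any_seen else 1.0
    # Zero-sum game: whatever the hiders gain, the seekers lose.
    seeker_reward = -hider_reward
    return hider_reward, seeker_reward


if __name__ == "__main__":
    print(team_rewards([False, False]))  # every hider is out of sight
    print(team_rewards([False, True]))   # one hider has been spotted
```

Because the two rewards always sum to zero in this sketch, any improvement by one team directly raises the pressure on the other, which is the competitive dynamic the rest of the article describes.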
How AI plays hide-and-seek: Over the course of the training process, the AI agents evolved through several stages of strategies. In the beginning the hiders (blue) only learned to run away from seekers (red). As gameplay progressed, the hiders began using tools to their advantage, for example, moving boxes to block the doors so seekers could not enter their room.
Seekers then developed corresponding counter strategies, for example using ramps to climb over the walls.
In response, hiders learned to move the ramp into their room so seekers could not use it to get over walls.
As the environments became more complex, hiders learned to build more robust “forts” using elongated boxes. Although OpenAI believed this would be the final strategy, seekers once again countered successfully — figuring out a way to jump onto a box and use momentum to “surf” atop it, over the wall and into the hiders’ fort.
In the final stage, hiders learned to defend against “box surfing” by locking all the boxes before building their fort.
Researchers refer to the evolution of these different strategies as “Emergent Skill Progression from Multi-agent Autocurricula.” The term “autocurricula” was coined by DeepMind this year and applies to multiple agents gradually creating new tasks to challenge one another in a given environment. OpenAI researchers believe this process has parallels in natural selection.
“Why we’re really excited about this is we kind of see similar dynamics that we’ve seen on Earth with evolution. So you have all of these kind of organisms on Earth that were competing and co-evolving together. And eventually out of that you got humans which are kind of the AGI of the natural world in a sense,” says Baker.
Why this research matters: Given the relatively simple objective of hide-and-seek, multiple agents trained through competitive self-play learned to use tools and adopted human-relevant skills to win. OpenAI believes this presents a promising research direction for future intelligent agent development and deployment.
OpenAI is open-sourcing its code and the environments to encourage further research in this area. One of the paper’s authors, OpenAI Researcher Yi Wu, told Synced, “The academic community really needs good and interesting environments and problems to study. This environment is a bit more complicated than the 2D particle world, and not as super complicated as StarCraft.”
Why OpenAI is interested: OpenAI’s ultimate goal is to build an Artificial General Intelligence (AGI) capable of performing a multitude of tasks within one general system. While there might be different paths towards that goal, OpenAI is doubling down on reinforcement learning research enabled by massive compute power. OpenAI recently signed a 10-year compute contract with Microsoft that is worth US$1 billion.
This hide-and-seek research also excites OpenAI because as the environment complexity increases, agents continuously self-adapt to the new challenges with new strategies. “If a process like this can scale up and be put into a much more complex environment, you might get agents that are complex enough that they could solve real tasks for us,” says Baker.
The core algorithm: Each agent is composed of two networks: a policy network that produces an action distribution and a critic network that predicts the corresponding future returns. To optimize the policy, OpenAI researchers used Proximal Policy Optimization (PPO), the same technique they used to train their Dota 2 agents. The architecture is shown below.
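To make the policy and critic roles concrete, here is a minimal Python sketch of PPO's clipped surrogate objective. It is not OpenAI's implementation: the function names, the plain "return minus critic prediction" advantage, and the clipping value of 0.2 are assumptions chosen for illustration, and a real training run would compute these quantities in batches on neural network outputs.

```python
import math
from typing import List, Sequence


def simple_advantages(returns: Sequence[float], values: Sequence[float]) -> List[float]:
    """Advantage = observed return minus the critic's predicted return.

    This one-step form is an assumption made to keep the sketch short.
    """
    return [g - v for g, v in zip(returns, values)]


def ppo_clipped_objective(
    new_log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,  # assumed value, not taken from the article
) -> float:
    """Mean clipped surrogate objective, which PPO tries to maximize."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps] to limit the update size.
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    adv = simple_advantages(returns=[1.0, 0.0], values=[0.5, 0.5])
    print(ppo_clipped_objective([0.0, 0.0], [0.0, 0.0], adv))
```

When the new and old policies assign the same probabilities, the ratio is 1 and the objective reduces to the mean advantage. The clipping only takes effect once the policy starts to drift away from the one that collected the data.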
The AI agents were trained over millions of games, with many games running in parallel. Training toward the final stage (surf defense) in the most complicated environment took three to four days on 16 GPUs and 4,000 CPUs.
Experiment results: Compared to previous approaches such as intrinsic motivation, the hide-and-seek policy is much more human-interpretable. Researchers also evaluated agents trained with the multi-agent hide-and-seek method on five intelligence tasks: object counting, lock and return, sequential lock, blueprint construction, and shelter construction. The agents performed better than baseline models in three of the five tasks.
Challenges: Baker told Synced the agents sometimes exhibited surprising behaviors. For example, hiders attempted to escape the game area altogether until researchers applied a penalty on that.
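A penalty of this kind is a small piece of reward shaping. The Python sketch below shows one way it could look. Everything in it is hypothetical: the function name `apply_boundary_penalty`, the square arena centered on the origin, and the size and penalty values are assumptions for illustration, since the article does not say how the penalty was implemented.

```python
from typing import Tuple


def apply_boundary_penalty(
    reward: float,
    position: Tuple[float, float],
    arena_half_size: float = 10.0,  # hypothetical arena extent
    penalty: float = 10.0,          # hypothetical penalty magnitude
) -> float:
    """Subtract a fixed penalty when an agent is outside the play area."""
    x, y = position
    outside = abs(x) > arena_half_size or abs(y) > arena_half_size
    return reward - penalty if outside else reward


if __name__ == "__main__":
    print(apply_boundary_penalty(1.0, (0.0, 0.0)))   # inside the arena
    print(apply_boundary_penalty(1.0, (11.0, 0.0)))  # escaped past the wall
```

Making the penalty larger than the reward for staying hidden means that running away can never be the best-scoring option in this sketch.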
Other challenges could be attributed to bugs in the contact physics of the simulated environment. For example, hiders learned that if they pushed a ramp against walls at corners, the ramp would for some reason pass through the walls and then disappear. Such “cheats” illustrate how the safety of algorithms can play a critical role in machine learning. “Before it happens, you never know. These kind of systems always have flaws. What we did is basically observe, and visualized the policy so we can see this weird thing happening. Then we try to fix the physics,” says Wu.
Journalist: Tony Peng | Editor: Michael Sarazen