Although human infants are not tasked to do so, they naturally crawl around and interact with objects — a process of exploration that plays a huge role in developing their understanding of physics and the environment. This observation has inspired machine learning researchers to explore intrinsic motivation, which aims to identify and provide agents with mathematical objectives that don’t rely on a specific task and can be applied to any unknown environment.
To accelerate the development of intrinsic objectives for reinforcement learning (RL) agents, a team of researchers from Vector Institute, University of Toronto and Google Brain recently studied three common types of intrinsic motivation across seven agents, three Atari games, and the 3D game Minecraft. They found that all three of the intrinsic objectives correlate more strongly with a human behaviour similarity metric than with any task reward.
Basically, RL works by enabling independent agents to make decisions and solve complex tasks in a simulated environment. Agents are repeatedly punished or rewarded according to how well they perform on a task, and they ultimately learn a policy that maximizes rewards and minimizes punishment to succeed in complex tasks.
“Unfortunately, designing informative reward functions is often expensive, time-consuming, and prone to human error,” the team notes, and this is a pain point of existing RL approaches. Many previous studies have looked at infants for inspiration, as these natural agents learn without externally provided tasks, but rather through intrinsic objectives.
The team examined three common types of intrinsic motivation in their work, Evaluating Agents without Rewards:
- Input entropy encourages encountering rare sensory inputs, measured by a learned density model
- Information gain rewards the agent for discovering the rules of its environment
- Empowerment rewards the agent for maximizing the influence it has over its sensory inputs or environment
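To make the first of these objectives concrete, here is a minimal sketch of an input-entropy-style reward. The paper uses learned neural density models; the count-based model with Laplace smoothing below is a hypothetical stand-in for illustration only.

```python
import numpy as np

# Hypothetical count-based density model over discretized observations.
# The real work uses a learned density model; counts are a simple proxy.
class CountDensityModel:
    def __init__(self):
        self.counts = {}
        self.total = 0

    def update(self, obs_key):
        self.counts[obs_key] = self.counts.get(obs_key, 0) + 1
        self.total += 1

    def prob(self, obs_key):
        # Laplace smoothing so never-seen inputs still get nonzero probability
        return (self.counts.get(obs_key, 0) + 1) / (self.total + len(self.counts) + 1)

def input_entropy_reward(model, obs_key):
    # Rare inputs (low probability) yield high reward, encouraging exploration
    return -np.log(model.prob(obs_key))

model = CountDensityModel()
for obs in ["a", "a", "a", "b"]:
    model.update(obs)

# A rarely seen observation earns more reward than a common one
assert input_entropy_reward(model, "b") > input_entropy_reward(model, "a")
```

The key property, regardless of the density model used, is that reward decreases as an input becomes more familiar, pushing the agent toward novel sensory experiences.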
The researchers evaluated the different intrinsic objectives by collecting a diverse dataset of different environments and behaviours and retrospectively computing agent objectives from it. By analyzing the correlations between intrinsic objectives and supervised objectives such as task reward and human similarity, the researchers were able to identify relationships between different intrinsic objectives without training a new agent for each objective, which sped up the iteration time.
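The retrospective analysis reduces to computing correlations between per-agent scores. The sketch below shows the shape of that computation with made-up numbers; the arrays and their values are illustrative assumptions, not figures from the paper.

```python
import numpy as np

# Hypothetical per-agent scores: one intrinsic objective (e.g. input entropy)
# and two supervised objectives. Values are fabricated for illustration.
intrinsic_scores = np.array([0.10, 0.40, 0.35, 0.80, 0.75, 0.90, 0.85])
task_reward      = np.array([0.00, 0.20, 0.10, 0.50, 0.60, 0.40, 0.55])
human_similarity = np.array([0.05, 0.30, 0.30, 0.70, 0.70, 0.85, 0.80])

def pearson(x, y):
    # Standardize, then average the elementwise products
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float((x * y).mean())

r_task = pearson(intrinsic_scores, task_reward)
r_human = pearson(intrinsic_scores, human_similarity)
```

Comparing coefficients like `r_task` and `r_human` across objectives and environments is what lets the researchers rank objectives without retraining an agent for each one.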
For evaluation purposes, the team used 100 million frames from the three Atari game environments to train seven agents: random, no-op, PPO, and RND and ICM agent versions with and without a task reward. For the 3D game Minecraft environment, the evaluation used 12 million frames per agent since that simulation is slower than Atari. For the “human similarity” supervised objective, the team took human behaviour as the ground truth and computed the similarity between agents’ and humans’ behaviours in the same environment.
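One simple way to score behavioural similarity is to compare how often an agent and a human visit the same states. The histogram-intersection measure below is a hypothetical illustration; the paper's exact human-similarity metric may differ.

```python
import numpy as np

def visitation_histogram(states, n_bins):
    # Normalized count of visits to each discretized state
    hist = np.zeros(n_bins)
    for s in states:
        hist[s] += 1
    return hist / hist.sum()

def similarity(hist_a, hist_b):
    # Histogram intersection: 1.0 means identical visitation distributions
    return float(np.minimum(hist_a, hist_b).sum())

human = visitation_histogram([0, 0, 1, 2, 2, 2], n_bins=4)
agent = visitation_histogram([0, 1, 1, 2, 2, 3], n_bins=4)
score = similarity(human, agent)  # overlap between the two behaviours
```

Treating the human trajectories as ground truth, a higher score means the agent spends its time in the same parts of the environment that people do.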
In tests across all environments, the three intrinsic objectives correlated more strongly with the human behaviour similarity metric than with the task rewards, suggesting that intrinsic objectives are more relevant than typical task rewards if the goal is to design agents that behave like humans.
The researchers note that the current human dataset is relatively small for identifying human similarity values. They propose that additional human data, as well as knowing what instructions the human agents received, would help further work in this area.
The paper Evaluating Agents without Rewards is on arXiv, and the source code for replicating the analyses and the collected dataset can be found on co-author Danijar Hafner’s website.
Reporter: Fangyu Cai | Editor: Michael Sarazen
Synced Report | A Survey of China’s Artificial Intelligence Solutions in Response to the COVID-19 Pandemic — 87 Case Studies from 700+ AI Vendors
This report offers a look at how China has leveraged artificial intelligence technologies in the battle against COVID-19. It is also available on Amazon Kindle. Along with this report, we also introduced a database covering an additional 1,428 artificial intelligence solutions from 12 pandemic scenarios.