
Google, OpenAI & DeepMind: Shared Task Behaviour Priors Can Boost RL and Generalization

Researchers in recent years have deployed reinforcement learning (RL) agents to solve increasingly challenging problems. As this trend has continued, so has the development of new methods for injecting “priors” (prior knowledge) into agents to help them better understand the structure of the world and come up with more effective solution strategies.

In a new paper, researchers from Google, OpenAI, and DeepMind introduce “behaviour priors,” a framework designed to capture common movement and interaction patterns that are shared across a set of related tasks or contexts. The researchers discuss how such behaviour patterns can be captured using probabilistic trajectory models and how they can be integrated effectively into RL schemes, such as for facilitating multi-task and transfer learning.

Their method for learning behaviour priors can lead to significant speedups on complex tasks, the researchers say. Moreover, more restricted forms of the priors can encourage generalization and thus lead to faster learning.

RL is not a supervised learning setup that learns from labelled examples; instead, an agent takes actions and interacts with an environment in order to learn to maximize its cumulative reward. The agent is expected to reason about the long-term consequences of its actions even when the immediate rewards are zero or negative. Because RL is well suited to problems with such long-term versus short-term reward trade-offs, it has been successfully applied to robot control, elevator scheduling, telecommunications, and the ancient board game Go.
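For readers unfamiliar with this setup, the sketch below shows the basic agent-environment interaction loop that RL builds on, using the Gymnasium API with a random policy standing in for a learned one (the environment name and random policy are illustrative, not from the paper):

```python
import gymnasium as gym

# Minimal RL interaction loop: the agent observes the environment state,
# picks an action, and receives a reward; its goal is to maximize the
# total reward accumulated over an episode.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy pi(a | s)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

env.close()
print(f"Episode return: {total_reward}")
```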

The team explains that while recent advances in data efficiency, scalability, and stability of RL algorithms have led to successful applications in various domains, many RL problems remain challenging to solve or require large numbers of interactions with the environment. And these limitations are likely to get worse as researchers push the boundaries of RL to tackle increasingly challenging and diverse problems.

Leveraging methods that can inject prior knowledge into the learning process is one way to address this issue, as knowledge extracted from experts or from previously solved tasks can help inform solutions to new ones, according to the researchers. But which representations are best suited to capture and reuse prior knowledge? Is it better to directly use prior data to constrain the space of solutions, or could the answer be hierarchical policies that combine and sequence various skills and behaviours?

“In this work, we present a unifying perspective to introduce priors into the RL problem. The framework we develop presents an alternative view that allows us to understand some previous approaches in a new light,” the researchers say. Their paper views the problem of extracting reusable knowledge through the lens of probabilistic modelling. Building on the insight that a policy combined with the environment dynamics defines a distribution over trajectories, they designed their systematic behaviour-priors framework to provide task solutions at different levels of detail and generality.
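In probabilistic terms, that insight can be sketched as follows (a general KL-regularized form commonly used by approaches in this family; the notation is illustrative rather than a verbatim statement of the paper’s objective): the policy induces a distribution over trajectories, and a behaviour prior enters the objective as a term penalizing deviation from it.

```latex
% Trajectory distribution induced by policy \pi and the environment dynamics
p_{\pi}(\tau) = p(s_1) \prod_{t} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)

% KL-regularized objective: maximize expected reward while staying close to
% the trajectory distribution p_0 induced by a behaviour prior \pi_0
J(\pi) = \mathbb{E}_{p_{\pi}}\!\Big[\sum_{t} r(s_t, a_t)\Big]
         - \alpha \, \mathrm{KL}\big(p_{\pi}(\tau) \,\|\, p_0(\tau)\big)
```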

Whether hand-defined or learned from data, behaviour priors can be integrated into RL schemes and deployed in different learning scenarios, for example to constrain the solution or to guide exploration. The framework also allows modular or hierarchical models to selectively constrain or generalize certain aspects of behaviour, such as low-level skills or high-level goals; a toy sketch of the constraining case follows below.
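As a simplified illustration of what “constraining the solution” with a learned prior might look like in practice, the toy sketch below adds a KL penalty toward a prior policy to an actor loss. This is an assumption-laden sketch, not the paper’s implementation: it assumes PyTorch, Gaussian action distributions for both policy and prior, and a generic policy-gradient loss.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Toy sketch (not the paper's implementation): regularize a policy toward a
# behaviour prior by adding a KL penalty to a standard policy-gradient loss.
def actor_loss(policy_dist: Normal,            # policy's action distribution pi(a | s)
               prior_dist: Normal,             # behaviour prior's distribution pi_0(a | s)
               log_prob_actions: torch.Tensor, # log pi(a | s) for the sampled actions
               advantages: torch.Tensor,       # advantage estimates for those actions
               alpha: float = 0.1) -> torch.Tensor:
    pg_term = -(log_prob_actions * advantages.detach()).mean()       # reward-seeking term
    kl_term = kl_divergence(policy_dist, prior_dist).sum(-1).mean()  # stay close to the prior
    return pg_term + alpha * kl_term
```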

In their experiments, the researchers analyzed the effect of behaviour priors on a number of simulated motor control domains using humanlike walkers from the DeepMind Control Suite. Their purpose was to understand how priors with various capacity and information constraints can learn to capture general, task-agnostic behaviours at different levels of abstraction, from basic low-level motor skills to goal-directed and other temporally extended behaviours.

The DeepMind Control Suite walker: a controllable entity with common locomotion-related methods, such as projecting vectors into an egocentric frame

The researchers found that since the prior does not have access to task-specific information on which target to visit next, it learns to encode the behaviour of visiting targets in general. This “target-centric” movement behaviour is useful for guiding the final policy during training and transfer.

The researchers also show that structured priors can be advantageous for modelling more complex distributions, a feature that can be important in some domains. While they focused only on the modelling perspective, through the use of a hierarchical, latent-variable prior, they believe the potential applications extend well beyond this.

Because the real world is often challenging to work with, impossible to perfectly simulate, and quite often constrained in the kinds of solutions that can be applied, the researchers say the ability to introduce prior knowledge about problems in order to constrain the solution space or to improve agent exploration is likely to be of increasing importance. They suggest the methods presented in this work can help with these issues, and expect further research efforts to emerge in this regard.

The paper Behavior Priors for Efficient Reinforcement Learning is on arXiv.


Reporter: Yuan Yuan | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.


Thinking of contributing to Synced Review? Synced’s new column Share My Research welcomes scholars to share their own research breakthroughs with global AI enthusiasts.
