Although deep learning has produced systems that can outperform human experts in complex games and image classification, reinforcement learning (RL) agents generally lack ability outside their narrowly defined target tasks. Moreover, learning these tasks from scratch requires a large amount of task-specific environment interaction, so the sample efficiency of such agents remains a challenge.
Enter Plan2Explore, a self-supervised RL agent designed to generalize quickly to unseen tasks in a zero-shot or few-shot manner. Evaluated on 20 challenging control tasks without access to proprioceptive states or rewards, Plan2Explore achieves SOTA zero-shot and adaptation performance. The novel agent is the product of researchers from the University of Pennsylvania, UC Berkeley, Google Brain, the University of Toronto, Carnegie Mellon University, and Facebook AI.
Plan2Explore first sets out to learn a new environment efficiently, which will help it solve tasks therein. The agent leverages planning to explore, doing so in a self-supervised manner and without a task-specific reward function. During this exploration, the agent uses the data it has collected to learn a global model, which in turn directs its exploration toward collecting additional data. This is achieved by training an exploration policy inside the global model to seek out novel states.
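To make the idea of seeking out novel states concrete, here is a minimal sketch of a novelty bonus computed as disagreement among an ensemble of one-step dynamics models, the kind of intrinsic reward the paper's exploration policy maximizes. Everything here is a toy stand-in: the random linear maps play the role of learned neural dynamics models, and `disagreement_reward` is a hypothetical name, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an ensemble of K one-step latent dynamics models.
# Each "model" is a fixed random linear map; in Plan2Explore these would be
# neural networks trained on the agent's own exploration data.
K, LATENT = 5, 8
ensemble = [rng.normal(size=(LATENT, LATENT)) for _ in range(K)]

def disagreement_reward(state):
    """Intrinsic reward: variance of the ensemble's next-state predictions.

    High variance means the models disagree, i.e. the state is novel and
    worth exploring; well-modelled regions yield a low bonus.
    """
    preds = np.stack([W @ state for W in ensemble])  # shape (K, LATENT)
    return float(preds.var(axis=0).mean())           # scalar novelty score

state = rng.normal(size=LATENT)
print(disagreement_reward(state))  # larger for poorly modelled states
```

An exploration policy trained to maximize this bonus inside the learned model is steered toward regions where the model is still uncertain, which is exactly where new data is most informative.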
After the task-agnostic exploration phase, the agent receives reward functions to allow it to adapt to downstream tasks such as standing, walking, running, and flipping using either zero or few task-specific interactions.
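The adaptation step can be illustrated with a deliberately simplified sketch: once rewards are revealed for a handful of previously explored states, the agent fits a reward predictor and ranks candidate behaviours by predicted reward. The linear reward model and all variable names below are hypothetical simplifications; the actual agent learns a neural reward predictor on world-model features and plans with it.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT = 8

# After task-agnostic exploration: a buffer of visited latent states.
states = rng.normal(size=(100, LATENT))
true_w = rng.normal(size=LATENT)   # unknown task reward direction
rewards = states @ true_w          # rewards revealed only at adaptation time

few = 20  # few-shot budget of reward-labelled interactions
# Fit a linear reward predictor by least squares on the labelled subset.
w_hat, *_ = np.linalg.lstsq(states[:few], rewards[:few], rcond=None)

# "Zero/few-shot" behaviour selection: pick the candidate state the
# learned reward model scores highest.
candidates = rng.normal(size=(50, LATENT))
best = candidates[np.argmax(candidates @ w_hat)]
```

Because the exploration phase already covered the environment broadly, a small labelled sample suffices to pin down the task, which is the intuition behind the agent's few-shot adaptation.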
The model-based RL agent is thus able to seek out expected future novelty during exploration and quickly adapt to solve multiple downstream tasks, while learning directly from high-dimensional image inputs.
Experiments show that Plan2Explore achieves SOTA zero-shot task performance on the DeepMind Control Suite. The agent’s zero-shot performance is also competitive with Dreamer, the high-performance supervised RL agent Google introduced in March, which learns a world model from images and uses it to learn long-horizon behaviours. The researchers regard Plan2Explore as a step towards building scalable real-world reinforcement learning systems.
The paper Planning to Explore via Self-Supervised World Models is on arXiv.
Journalist: Yuan Yuan | Editor: Michael Sarazen