A crucial component of natural embodied intelligence is discovering useful behaviours from past experience and transferring them to unseen tasks, a process that continues over the lifetime of humans and other animals and enables them to tackle new situations efficiently. Efforts to build such abilities into AI models have generally been limited either to online, on-policy discovery of bottleneck states, which hurts sample efficiency, or to discrete state-action domains, which restricts their real-world applicability in robotics.
A DeepMind research team addresses these issues in the new paper MO2: Model-Based Offline Options, proposing an offline hindsight bottleneck options framework that supports sample-efficient option discovery over continuous state-action spaces for efficient skill transfer to new tasks.
The researchers’ goal is to use temporally abstract skills extracted from large, unstructured and multi-task datasets in the source domain to support efficient learning of new tasks in the transfer domain. They decompose these skill transfer tasks into two sub-problems: 1) the extraction of skills suited for planning and acting from offline data, and 2) the learning of transfer tasks over the temporally abstracted skill space.
The team trains the framework using options, which represent temporally abstract skills that can be discovered and reused. A maximum-likelihood behavioural cloning objective is used to discover task-agnostic skills shared across tasks and to reconstruct the offline behaviours. In the transfer stage, the team freezes these options, and the model learns to act over them to accelerate online reinforcement learning (RL) on new tasks.
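As a rough illustration of what a maximum-likelihood behavioural cloning objective over options can look like, the sketch below scores a logged action under a mixture of per-option policies. It is a simplified, single-step toy rather than the paper's objective: it ignores option termination and temporal structure, uses one-dimensional Gaussian option policies, and all function names and parameters are hypothetical.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian, standing in for one option's low-level policy."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (action - mean) ** 2 / (2.0 * var)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def option_bc_log_likelihood(action, option_probs, option_means, option_stds):
    """log p(a) = log sum_o p(o) * p(a | o), marginalising over options.

    Assumes every entry of option_probs is strictly positive.
    """
    terms = [
        math.log(p) + gaussian_log_prob(action, mu, sd)
        for p, mu, sd in zip(option_probs, option_means, option_stds)
    ]
    return log_sum_exp(terms)


def bc_loss(actions, option_probs, option_means, option_stds):
    """Negative mean log-likelihood of logged actions (lower is better)."""
    total = sum(
        option_bc_log_likelihood(a, option_probs, option_means, option_stds)
        for a in actions
    )
    return -total / len(actions)


# Hypothetical logged actions clustered around two behaviours.
logged_actions = [-1.0, -0.9, 1.0, 1.1]
well_matched = bc_loss(logged_actions, [0.5, 0.5], [-1.0, 1.0], [0.2, 0.2])
poorly_matched = bc_loss(logged_actions, [0.5, 0.5], [0.0, 0.0], [0.2, 0.2])
```

In this toy setting, options whose means sit on the two clusters of logged actions reconstruct the data better, so `well_matched` is lower than `poorly_matched`. In a real system the option probabilities and option policies would be state-conditioned neural networks trained by gradient descent on this kind of loss.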
The options framework is built upon Hindsight Off-Policy Options (HO2), an actor-critic options algorithm that achieves higher sample efficiency by training all options in hindsight across all experience. HO2 relies on Maximum A-Posteriori Policy Optimization (MPO), a method that guarantees monotonic improvement.
To build their Model-Based Offline Options (MO2) framework, the team extends HO2 with an additional learned option-level transition model and a predictability objective. This objective encourages predictable option-level transitions across the episode, which drives the discovery of bottleneck options.
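To give a feel for why a predictability objective on option-level transitions can favour bottlenecks, here is a toy sketch. It is not MO2's formulation: it uses a small discrete state space and a count-based transition model with Laplace smoothing, whereas the paper works with continuous states and learned models. The function names, the `(start_state, option, end_state)` tuple layout, and the example data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def fit_option_transition_model(transitions, num_states, smoothing=1.0):
    """Fit a tabular model p(end_state | start_state, option) from counts.

    `transitions` is a list of (start_state, option, end_state) tuples, where
    end_state is the state at which the option terminated.
    """
    counts = defaultdict(Counter)
    for start, option, end in transitions:
        counts[(start, option)][end] += 1

    def prob(start, option, end):
        c = counts[(start, option)]
        total = sum(c.values())
        # Laplace smoothing keeps unseen end states at non-zero probability.
        return (c[end] + smoothing) / (total + smoothing * num_states)

    return prob


def predictability(transitions, num_states):
    """Mean log-likelihood of option end states under the fitted model."""
    prob = fit_option_transition_model(transitions, num_states)
    return sum(math.log(prob(s, o, e)) for s, o, e in transitions) / len(transitions)


# Option 0 started in state 0 always terminates in state 3 (a "bottleneck").
bottleneck = [(0, 0, 3)] * 4
# The same option terminates in a different state each time.
scattered = [(0, 0, 1), (0, 0, 2), (0, 0, 3), (0, 0, 4)]

bottleneck_score = predictability(bottleneck, num_states=5)
scattered_score = predictability(scattered, num_states=5)
```

Options that reliably end in the same place are easier for the transition model to predict, so `bottleneck_score` exceeds `scattered_score`. Maximising a score of this kind is one intuitive way to push option terminations toward a small set of states.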
In their empirical study, the team compared MO2 with state-of-the-art baselines on complex continuous control domains. MO2 outperformed all baselines by a large margin on the challenging AntMaze domain. The results also show that MO2’s options are bottleneck-aligned and improve acting, and that value estimation over MO2’s more temporally compressed options yields a faster-learning, less-biased critic.
The paper MO2: Model-Based Offline Options is on arXiv.
Author: Hecate He | Editor: Michael Sarazen