In recent years there has been growing interest in reinforcement learning (RL) algorithms that can learn entirely from fixed datasets without interaction (offline RL). A number of relatively unexplored challenges remain in this research field, such as how to get the most out of the collected data, how to work with growing datasets, and how to compose the most effective datasets.
In a new paper, a DeepMind research team proposes a clear conceptual separation of the RL process into data collection and knowledge inference to improve RL data efficiency. The team introduces a “Collect and Infer” (C&I) paradigm, provides insights on how to interpret RL algorithms from the C&I perspective, and shows how the paradigm could guide future research into more data-efficient RL.
The key idea informing the C&I paradigm is the separation of RL into two distinct but interconnected processes: collecting data into a transition memory by interacting with the environment, and inferring knowledge about the environment by learning from the data of said memory.
To optimize each process, the team poses two questions: (1) Given a fixed data batch, what is the right learning setup to get to the maximally performing policy? (optimal inference); and (2) Given an inference process, what is the minimal set of data required to get to a maximally performing policy? (optimal collection).
The team describes their algorithm development desiderata as:
- Learning is done offline in a ‘batch’ setting assuming fixed data as suggested by (1). Data may have been collected by a behaviour policy different from the one that is the learning target. This enables utilization of the same data to optimize for multiple objectives simultaneously, and coincides with interest in offline RL.
- Data-collection is a process that should be optimized in its own right. Naive exploration schemes that employ simple random perturbations of a task policy, such as epsilon greedy, are likely to be inadequate. The behaviour that is optimal for data collection in the sense of (2) may be quite different from the optimal behaviour for a task of interest.
- Treating data collection as a separate process offers novel ways to integrate known methods like skills, model-based approaches, or innovative exploration schemes into the learning process without biasing the final task solution.
- Data collection may happen concurrently with inference (in which case the two processes actively influence each other and we get close to online RL) or can be conducted separately.
- C&I suggests a different focus for evaluation: in contrast to usual regret-based frameworks for exploration, C&I does not aim to optimize task performance during collection. Instead, we distinguish between a learning phase, during which a certain amount of data is collected, and a deployment phase, during which the performance of the agent is assessed.
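To make the separation concrete, the desiderata above can be sketched as three independent steps: collect transitions into a memory, infer a policy from that fixed memory, and only then assess the result in a deployment phase. The following is a minimal illustration under stated assumptions, not the paper's method: the five-state chain environment, the function names `collect`, `infer` and `deploy`, and all hyperparameters are hypothetical, and tabular Q-learning stands in for whatever off-policy learner one might actually use. The uniform random behaviour policy is only a placeholder; the paper argues that such naive collection schemes are likely to be inadequate on harder tasks.

```python
import random
from collections import namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

N_STATES = 5          # hypothetical chain: states 0..4, goal at state 4
ACTIONS = (0, 1)      # 0 = left, 1 = right
GAMMA = 0.9           # discount factor (illustrative value)


def step(state, action):
    """Toy deterministic chain environment (an assumption for illustration)."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def collect(memory, behaviour_policy, episodes, max_steps=20):
    """Collect: run a behaviour policy and append transitions to the memory."""
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = behaviour_policy(state)
            next_state, reward, done = step(state, action)
            memory.append(Transition(state, action, reward, next_state, done))
            state = next_state
            if done:
                break
    return memory


def infer(memory, sweeps=50, lr=0.5):
    """Infer: tabular Q-learning over the fixed memory, with no env interaction."""
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for t in memory:
            target = t.reward if t.done else t.reward + GAMMA * max(q[t.next_state])
            q[t.state][t.action] += lr * (target - q[t.state][t.action])
    return q


def deploy(q, max_steps=20):
    """Deploy: run the greedy policy once and return its undiscounted return."""
    state, total = 0, 0.0
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[state][a])
        state, reward, done = step(state, action)
        total += reward
        if done:
            break
    return total


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# The behaviour policy (uniform random) differs from the learned task policy.
memory = collect([], lambda s: rng.choice(ACTIONS), episodes=200)
q_table = infer(memory)
print("deployment return:", deploy(q_table))
```

Because `infer` only reads from `memory`, the same batch could be reused to learn policies for other reward functions, and `collect` could be swapped for a more deliberate exploration scheme without touching the learner. Task performance is measured only in `deploy`, not while data is being gathered.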
The C&I paradigm offers considerable opportunities and flexibility. Its interpolation between pure offline (batch) and more conventional online learning scenarios can enable rapid learning of new behaviours with only small amounts of online experience. By decoupling acting and learning, it can optimize data collection strategies and schemes for unsupervised RL and unsupervised skill discovery. By considering data as a vehicle for knowledge transfer, C&I can enable new algorithms for multi-task and transfer scenarios. It also provides a different emphasis when considering meta-learning or life-long learning scenarios.
Traditional Bayesian methods attempt to find an optimal trade-off between exploration and exploitation but are usually intractable. The proposed C&I approach instead focuses its optimization emphasis on data collection.
The paper’s overall message is a rethinking of data-efficient RL through a clear separation of data collection and exploitation, one that takes advantage of the flexibility of off-policy RL in agent design.
The team believes their work can encourage further research into strategies for the acquisition of information, while also providing a flexible framework that may facilitate a conceptual disentanglement of objectives, representations, and execution strategies.
The paper Collect & Infer — A Fresh Look at Data-Efficient Reinforcement Learning is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang