Recent research has demonstrated exciting progress in formulating offline reinforcement learning (RL) problems as context-conditioned sequence modelling problems, enabling the use of powerful transformer architectures to significantly improve upon purely model-free performance, especially in scenarios where online interactions are expensive. These works have focused on the offline RL setting, with the resulting agents subsequently requiring finetuning via online exploration and interactions.
In the new paper Online Decision Transformer, a team from Facebook AI Research, UC Berkeley and UCLA proposes Online Decision Transformers (ODT), an RL algorithm based on sequence modelling that incorporates both offline pretraining and online finetuning in a unified framework and achieves performance competitive with state-of-the-art models on the D4RL benchmark.
RL policies trained purely on offline datasets are typically sub-optimal, as the offline trajectories might not have high return and may cover only a limited part of the state space. Online interactions are thus essential for improving model performance. The learning formulation for a standard transformer, however, is insufficient for online learning and, the researchers note, can collapse when used naively for online data acquisition.
The team therefore modifies this formulation to account for exploration: the goal is to learn a stochastic policy that maximizes the likelihood of the dataset.
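To make the likelihood objective concrete, here is a minimal sketch of the negative log-likelihood a stochastic policy would minimize on logged actions. It assumes a diagonal Gaussian policy, which the article does not specify; the function names and the `(action, mean, log_std)` batch layout are hypothetical and chosen only for illustration.

```python
import math


def gaussian_log_prob(action, mean, log_std):
    """Log-density of `action` under a diagonal Gaussian (assumed policy family)."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        var = math.exp(2.0 * ls)
        total += -0.5 * ((a - m) ** 2 / var + 2.0 * ls + math.log(2.0 * math.pi))
    return total


def negative_log_likelihood(batch):
    """Average NLL over (logged_action, predicted_mean, predicted_log_std) triples.

    Minimizing this value is equivalent to maximizing the likelihood of the
    logged actions under the policy.
    """
    return -sum(gaussian_log_prob(a, m, s) for a, m, s in batch) / len(batch)


# Hypothetical 2-D actions with the policy's predicted mean and log-std.
batch = [
    ([0.1, -0.3], [0.0, -0.2], [-1.0, -1.0]),
    ([0.5, 0.2], [0.4, 0.1], [-1.0, -1.0]),
]
loss = negative_log_likelihood(batch)
```

Because the policy outputs a distribution rather than a single action, sampling from it during online rollouts yields varied behaviour, which is what makes exploration possible.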
After shifting from deterministic to stochastic policies for defining exploration objectives during the online phase, the team develops a novel replay buffer that stores trajectories and is populated via online rollouts from the ODT. They then extend a notion of hindsight experience replay to ensure the ODT returns match the true returns observed during an online rollout.
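The trajectory-level buffer and the return relabelling can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the `TrajectoryBuffer` class, its field names, and the oldest-first eviction policy are assumptions made for the example.

```python
from collections import deque


def returns_to_go(rewards):
    """Return-to-go at each step: the sum of rewards from that step to the end."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


class TrajectoryBuffer:
    """Stores whole trajectories; the oldest is evicted when full (assumed policy)."""

    def __init__(self, capacity):
        self.trajectories = deque(maxlen=capacity)

    def add_rollout(self, states, actions, rewards):
        # Hindsight relabelling: store the returns actually observed in the
        # rollout, not the target return the agent was conditioned on.
        self.trajectories.append({
            "states": list(states),
            "actions": list(actions),
            "rewards": list(rewards),
            "returns_to_go": returns_to_go(rewards),
        })

    def __len__(self):
        return len(self.trajectories)


buffer = TrajectoryBuffer(capacity=2)
buffer.add_rollout(states=[0, 1, 2], actions=[1, 0, 1], rewards=[1.0, 2.0, 3.0])
```

Storing the observed returns means that every trajectory used for training is labelled with a return it actually achieved, even when the rollout fell short of the return it was asked to reach.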
In their empirical study, the team evaluated ODT against other state-of-the-art approaches for finetuning pretrained policies under a limited online budget and investigated how the individual ODT components influence its overall performance. They compared ODT’s offline performance with DT and implicit Q-learning (IQL) (Kostrikov et al., 2021a), a state-of-the-art algorithm for offline RL, and compared ODT’s online finetuning performance to an IQL finetuning variant. For a purely online baseline, the team also reported the results of the soft actor-critic (SAC) algorithm (Haarnoja et al., 2018a). All experiments were conducted on the D4RL benchmark.
While IQL outperformed both ODT and DT on most tasks in the offline comparison, it struggled to improve its performance after online finetuning, whereas ODT was able to catch up and achieve comparable performance. For online learning, ODT performed substantially better than SAC under a sample budget of 200k online interactions.
Overall, the study shows the proposed ODT RL algorithm can significantly benefit practical regimes with offline data and limited budgets for online interactions and is competitive with state-of-the-art RL methods. The team suggests future work in this area could investigate whether supervised learning approaches can account for purely online RL and probe the limits of supervised learning algorithms for RL.
The paper Online Decision Transformer is on arXiv.
Author: Hecate He | Editor: Michael Sarazen