Recent studies have shown that transformers can model high-dimensional distributions of semantic concepts at scale, opening up the intriguing possibility of applying them to sequential decision-making problems formalized as reinforcement learning (RL). New research from a UC Berkeley, Facebook AI Research and Google Brain team that includes esteemed Belgian professor Pieter Abbeel explores whether generative trajectory modelling (i.e. modelling the joint distribution of a sequence of states, actions, and rewards) could serve as a replacement for conventional RL algorithms.
In the paper Decision Transformer: Reinforcement Learning via Sequence Modeling, the researchers abstract RL as a sequence modelling problem. Their proposed Decision Transformer outputs optimal actions by leveraging a causally masked transformer, and when conditioned on a desired return it can generate future actions that achieve that return. Moreover, despite its relative simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
Transformer architectures are able to efficiently model sequential data, and their self-attention mechanism allows the layer to assign “credit” by implicitly forming state-return associations via maximizing the dot product of the query and key vectors. Transformers can thus function effectively in the presence of sparse or distracting rewards. Previous studies have also shown that transformers can model a wide distribution of behaviours, enabling better generalization and transfer abilities.
Motivated by these advantages, the team applies a causal self-attention mask to the transformer architecture, so that each token can attend only to itself and to earlier tokens. The resulting Decision Transformer can therefore model and generate trajectories autoregressively.
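To make the causal mask concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is not the paper's implementation: the function name `causal_self_attention` and the toy inputs are assumptions for illustration, and a real model would use learned query, key, and value projections and a tensor library.

```python
import math


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of float vectors, one per token.
    Token i may only attend to tokens 0..i.
    """
    dim = len(queries[0])
    n = len(queries)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score only the visible positions (j <= i); future tokens are masked out.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights = [e / total for e in exps] + [0.0] * (n - i - 1)
        all_weights.append(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(values[0]))]
        )
    return outputs, all_weights


# Toy example: three 2-dimensional tokens used as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_self_attention(tokens, tokens, tokens)
# attn[0] == [1.0, 0.0, 0.0]: the first token can only attend to itself.
```

Each row of `attn` sums to one and is zero above the diagonal, which is what lets the model be trained on whole sequences while still predicting each token only from what came before it.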
Intuitively, the proposed approach can be illustrated with a typical RL task: finding the shortest path on a directed graph, where the reward is 0 when the agent is at the goal node and −1 otherwise. The team trained GPT-based transformers to predict the next token in a sequence of returns-to-go, states, and actions. At test time, optimal trajectories are obtained by adding a prior that favours the highest possible return and then generating the corresponding sequence of actions via conditioning. In this way, the proposed model can achieve policy improvement without the need for dynamic programming.
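The shortest-path intuition can be sketched with a small tabular stand-in. This is not the transformer and not the authors' code: it simply records the best return-to-go observed from each move in some offline walks and then follows the moves associated with the highest return. The graph, the walks, all function names, and the convention that a move earns 0 when it lands on the goal are assumptions made for illustration.

```python
def returns_to_go(rewards):
    """Suffix sums: entry t is the total reward from step t to the end."""
    totals, running = [], 0
    for reward in reversed(rewards):
        running += reward
        totals.append(running)
    return totals[::-1]


def move_rewards(path, goal):
    """Assumed convention: a move earns 0 if it lands on the goal, else -1."""
    return [0 if node == goal else -1 for node in path[1:]]


def best_observed_returns(walks, goal):
    """Map each (node, next_node) move to the best return-to-go seen from it."""
    best = {}
    for path in walks:
        rtg = returns_to_go(move_rewards(path, goal))
        for step, move in enumerate(zip(path, path[1:])):
            best[move] = max(rtg[step], best.get(move, rtg[step]))
    return best


def follow_highest_return(best, start, goal, max_steps=10):
    """Greedily take the move whose recorded return-to-go is highest."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        options = {nxt: value for (node, nxt), value in best.items()
                   if node == path[-1]}
        if not options:
            break
        path.append(max(options, key=options.get))
    return path


# Hypothetical offline data: walks that wander before reaching goal "D".
walks = [["A", "B", "C", "A", "C", "D"], ["A", "C", "D"], ["B", "D"]]
best = best_observed_returns(walks, goal="D")
shortest_from_a = follow_highest_return(best, "A", "D")  # ["A", "C", "D"]
shortest_from_b = follow_highest_return(best, "B", "D")  # ["B", "D"]
```

No value function is bootstrapped here. The improvement over the wandering walks comes only from asking for the highest return seen in the data, which mirrors, in a much cruder form, the conditioning idea described above.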
A key component of the proposed Decision Transformer is its trajectory representation, which is chosen so that the transformer can learn meaningful patterns. Instead of feeding rewards directly, the model is fed returns-to-go (the sum of future rewards), which yields a trajectory representation amenable to autoregressive training and generation.
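As a rough illustration of this representation, the sketch below turns one episode into an interleaved sequence of return-to-go, state, and action tokens. The function name `to_token_sequence`, the string placeholders for states and actions, and the reward values are all hypothetical.

```python
from itertools import accumulate


def to_token_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples in time order."""
    # Suffix sums of the rewards, computed by accumulating from the end.
    rtg = list(accumulate(reversed(rewards)))[::-1]
    sequence = []
    for g, s, a in zip(rtg, states, actions):
        sequence += [("return_to_go", g), ("state", s), ("action", a)]
    return sequence


# Hypothetical three-step episode.
tokens = to_token_sequence(
    states=["s0", "s1", "s2"],
    actions=["a0", "a1", "a2"],
    rewards=[1.0, 0.0, 2.0],
)
# tokens[0] == ("return_to_go", 3.0): the whole episode's return comes first.
```

Because the return-to-go precedes the state and action at every timestep, a model that predicts the action token has already seen how much return is still expected, which is what makes it possible to condition on a chosen return at generation time.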
Token embeddings are obtained by learning a linear layer for each modality (return-to-go, state, and action) that projects the raw inputs to the embedding dimension. For environments with visual inputs, the state is instead fed into a convolutional encoder. An embedding for each timestep is also learned and added to each token. Finally, after the tokens are processed by a GPT model, future action tokens are predicted via autoregressive modelling.
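The embedding step can be sketched in plain Python as follows. This is only a toy under stated assumptions: the weights are random rather than learned, the dimensions and helper names (`make_linear`, `apply_linear`, `embed_timestep`) are made up for the example, and the convolutional encoder for visual inputs is omitted.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Hypothetical helper: a randomly initialised linear layer (weights, bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def apply_linear(layer, x):
    """Project the raw input vector x to the embedding dimension."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def embed_timestep(layers, time_table, t, rtg, state, action):
    """Return three embeddings (return-to-go, state, action) for timestep t."""
    # One vector per timestep, shared by all three tokens of that timestep.
    time_embedding = time_table[t]
    tokens = []
    for name, raw in (("rtg", [rtg]), ("state", state), ("action", action)):
        projected = apply_linear(layers[name], raw)
        tokens.append([p + e for p, e in zip(projected, time_embedding)])
    return tokens


# Assumed toy dimensions; a trained model would learn all of these parameters.
rng = random.Random(0)
EMBED_DIM, STATE_DIM, ACTION_DIM, MAX_T = 8, 4, 2, 16
layers = {
    "rtg": make_linear(1, EMBED_DIM, rng),
    "state": make_linear(STATE_DIM, EMBED_DIM, rng),
    "action": make_linear(ACTION_DIM, EMBED_DIM, rng),
}
time_table = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
              for _ in range(MAX_T)]
tokens = embed_timestep(layers, time_table, 3, -2.0,
                        [0.1, 0.2, 0.3, 0.4], [1.0, 0.0])
```

Note that the timestep embedding is indexed by the environment timestep rather than by token position, so the three tokens belonging to one timestep receive the same added vector.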
The team evaluated Decision Transformer’s performance against dedicated offline RL and imitation learning algorithms on both discrete (Atari) and continuous (OpenAI Gym) control tasks. Their comparisons focused on Conservative Q-Learning (CQL; Kumar et al., 2020), a model-free offline RL approach based on temporal difference learning (TD-learning), which is the dominant RL paradigm for sample efficiency and a sub-routine in many model-based RL algorithms. They also compared Decision Transformer against behaviour cloning (BC) and its variants.
For their Atari experiments, the team compared Decision Transformer to four baselines (CQL, REM, QR-DQN, and BC) on four Atari tasks (Breakout, Qbert, Pong, and Seaquest). The results show that the proposed method is competitive with CQL in three out of four games and outperforms or matches REM, QR-DQN, and BC on all four games.
In the OpenAI Gym experiments, the team compared Decision Transformer to CQL, BEAR, BRAC, and AWR, with Decision Transformer achieving the best performance in a majority of the tasks and remaining competitive with the state-of-the-art in the remaining tasks.
Overall, the study effectively bridges sequence modelling and transformers with RL, suggesting sequence modelling can serve as a strong algorithmic paradigm for RL.
The code has been open-sourced on the project GitHub. The paper Decision Transformer: Reinforcement Learning via Sequence Modeling is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang