A new DeepMind paper introduces two architectures designed for the efficient use of Tensor Processing Units (TPUs) in reinforcement learning (RL) research at scale.
Deep learning (DL) frameworks such as TensorFlow, PyTorch and JAX enable easy, rapid model prototyping while also optimizing execution speed for model training. Although such frameworks are popular across the general DL community, scalable research platforms for deep RL remain relatively underdeveloped.
The proposed DeepMind architectures, Anakin and Sebulba, address this deficiency, demonstrating how TPU-based RL platforms can deliver exceptional performance at low cost.
RL agents are designed to make a sequence of decisions in a given environment in order to maximize their cumulative reward. The DeepMind team proposes that Cloud TPUs can satisfy the compute requirements of large-scale RL systems, and that this is especially true with TPU Pods, a special Google data center configuration that features multiple TPU devices interconnected via extremely low-latency communication channels. The proposed architectures, which the researchers call “Podracers,” are designed to support scalable RL research on TPU Pods. Anakin targets online agent training with environments written in JAX, while Sebulba uses an actor-learner decomposition to support environments that run on the host CPU.
For the Anakin framework, all environments must be written in JAX — a Google Research-developed Python library designed for high-performance numerical computing. While this restricts the range of supported environments, the team identifies positive trade-offs such as improved performance and “cleanliness of the research platform.”
Another Anakin benefit is that it is easy to scale up using JAX. The minimal computation unit is first vectorized with JAX’s vmap transformation, using a batch size large enough to ensure good utilization of an entire TPU core. The vectorized function is then replicated and distributed across a TPU’s eight cores.
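The two-stage scaling described above can be sketched in a few lines of JAX. This is a minimal illustration rather than the paper's code: the `unit_step` function, the shapes and the batch size are all hypothetical, and `jax.pmap` is assumed here as the mechanism for replicating across cores.

```python
import jax
import jax.numpy as jnp


def unit_step(params, obs):
    """Hypothetical minimal unit of computation: a linear policy scoring one observation."""
    return jnp.dot(obs, params)


# Stage 1: vectorize over a batch of observations; params are shared (in_axes=None).
batched_step = jax.vmap(unit_step, in_axes=(None, 0))

# Stage 2: replicate the vectorized function across all local devices
# (eight cores on a TPU host; a single device on a CPU-only machine).
replicated_step = jax.pmap(batched_step, in_axes=(None, 0))

num_devices = jax.local_device_count()
batch_per_device = 4  # assumed value; in practice large enough to keep a core busy
obs_dim, num_actions = 3, 2  # assumed toy sizes

params = jnp.ones((obs_dim, num_actions))
obs = jnp.ones((num_devices, batch_per_device, obs_dim))

logits = replicated_step(params, obs)
print(logits.shape)  # (num_devices, batch_per_device, num_actions)
```

The leading axis of `obs` must equal the number of devices, which is why the sketch sizes it from `jax.local_device_count()` instead of hard-coding eight.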
Such a design ensures the efficiency of this architecture, as the agent-environment interaction can be compiled into a single Accelerated Linear Algebra (XLA) program. It is also possible to replicate the basic setup to larger TPU slices, making the method scalable.
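To make the single-program idea concrete, the sketch below writes both a toy environment and a toy policy as pure JAX functions, then rolls them out with `jax.lax.scan` under `jax.jit`, so the whole loop is traced and compiled together. The environment, the policy and the 100-step horizon are hypothetical stand-ins, not taken from the paper.

```python
import jax
import jax.numpy as jnp


def env_step(state, action):
    """Hypothetical toy environment: a 1-D walk rewarded for staying near zero."""
    new_state = state + action
    reward = -jnp.abs(new_state)
    return new_state, reward


def policy(params, state):
    """Hypothetical policy: move left if the linear score is positive, else right."""
    return jnp.where(params * state > 0.0, -1.0, 1.0)


@jax.jit
def rollout(params, init_state):
    """Runs 100 agent-environment steps inside one compiled computation."""

    def body(state, _):
        action = policy(params, state)
        new_state, reward = env_step(state, action)
        return new_state, reward

    final_state, rewards = jax.lax.scan(body, init_state, None, length=100)
    return final_state, rewards.sum()


final_state, total_reward = rollout(jnp.float32(1.0), jnp.float32(5.0))
print(final_state, total_reward)
```

Because no step leaves the compiled program, there is no per-step transfer between the accelerator and Python, which is the efficiency argument the paragraph above makes.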
Sebulba, meanwhile, supports arbitrary environments and relies on an actor-learner decomposition for performance. Like Anakin, Sebulba co-locates acting and learning on a single TPU machine, but it steps the environments on the host CPU and splits the eight TPU cores into two groups, one used for acting and the other for learning.
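A rough sketch of this division of labor follows. Everything in it is an assumption made for illustration: the `split_devices` helper, the `HostEnv` class, the choice of two actor cores and the tiny policy are hypothetical, and the learner update itself is omitted.

```python
import jax
import jax.numpy as jnp
import numpy as np


def split_devices(devices, num_actor_devices):
    """Partitions devices into (actor, learner) groups.

    If there are too few devices to split (e.g. a CPU-only machine), both
    groups share all devices so the sketch still runs.
    """
    if len(devices) <= num_actor_devices:
        return list(devices), list(devices)
    return list(devices[:num_actor_devices]), list(devices[num_actor_devices:])


class HostEnv:
    """Hypothetical environment stepped on the host CPU with NumPy."""

    def __init__(self):
        self.state = np.ones(4, dtype=np.float32)

    def step(self, action):
        self.state = self.state + np.float32(action)
        reward = -float(np.abs(self.state).sum())
        return self.state.copy(), reward


@jax.jit
def act(params, obs):
    """Policy inference; runs on the device that holds its inputs."""
    return jnp.tanh(jnp.dot(obs, params))


devices = jax.local_devices()
actor_devices, learner_devices = split_devices(devices, num_actor_devices=2)

params = jnp.full((4,), 0.1)
actor_params = jax.device_put(params, actor_devices[0])  # copy used for acting

env = HostEnv()
obs, trajectory = env.state.copy(), []
for _ in range(5):
    action = act(actor_params, jax.device_put(obs, actor_devices[0]))
    next_obs, reward = env.step(float(action))
    trajectory.append((obs, float(action), reward))
    obs = next_obs

# Hand the collected observations to a learner device, where an update would run.
batch_obs = jax.device_put(np.stack([t[0] for t in trajectory]), learner_devices[0])
print(len(actor_devices), len(learner_devices), batch_obs.shape)
```

The point of the sketch is only the data flow: the environment lives in ordinary Python on the host, inference is dispatched to one group of cores, and trajectories are handed to the other group.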
Sebulba’s computation can also be scaled up via replication. The learner cores in each replica only process trajectories generated on the corresponding host, but the parameter updates use JAX’s collective operations across all learner cores from all replicas.
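The pattern of local data with a shared update can be illustrated with `jax.lax.pmean`, one of JAX's collective operations. The loss, the learning rate and the shapes below are hypothetical, and the sketch runs on a single host; it assumes gradient averaging is the collective in question.

```python
import functools

import jax
import jax.numpy as jnp


def loss_fn(params, batch):
    """Hypothetical loss: mean squared output of a linear model."""
    return jnp.mean(jnp.dot(batch, params) ** 2)


@functools.partial(jax.pmap, axis_name="learners")
def update(params, batch):
    """One SGD step per learner core, with gradients averaged across cores."""
    grads = jax.grad(loss_fn)(params, batch)
    grads = jax.lax.pmean(grads, axis_name="learners")  # collective all-reduce
    return params - 0.1 * grads  # assumed learning rate


num_learners = jax.local_device_count()
# Each learner core holds its own parameter copy and its own shard of trajectories.
params = jnp.ones((num_learners, 3))
batches = jnp.ones((num_learners, 8, 3))

new_params = update(params, batches)
print(new_params.shape)  # (num_learners, 3)
```

Each core computes gradients only from its own shard, yet every parameter copy stays identical after the step because the averaged gradient is the same everywhere.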
To evaluate the performance of the Anakin and Sebulba architectures, the team looked at concrete use cases.
When training small neural networks on grid-world environments, Anakin performed 5 million steps per second. In a complex environment, Anakin achieved over 3 million steps per second on a 16-core TPU. Sebulba, meanwhile, completed 200 million frames of training on an Atari game in just one hour while running on an 8-core TPU.
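For comparison with the steps-per-second figures, the Sebulba result can be converted to an approximate frame rate. The calculation takes "one hour" at face value and makes no assumption about how frames map to agent steps.

```python
frames = 200_000_000
seconds = 60 * 60  # "one hour", taken at face value

frames_per_second = round(frames / seconds)
print(frames_per_second)  # 55556
```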
The study shows that Podracer architectures can effectively support various research use cases and are scalable and easy to use — validating the potential for RL platforms based on TPUs to deliver exceptional performance at low cost.
The paper Podracer Architectures for Scalable Reinforcement Learning is on arXiv.
Author: Hecate He | Editor: Michael Sarazen