
DeepMind Introduces Gato: A Generalist, Multi-Modal, Multi-Task, Multi-Embodiment Agent

A DeepMind research team proposes Gato, a single general-purpose transformer sequence model that can engage in dialogue, caption images, stack blocks with a real robot arm, navigate in simulated 3D environments and even beat human players at Atari games.

The practicality of a single tool with multi-purpose capabilities is why the good old Swiss Army knife has remained popular for over a century. In the new paper A Generalist Agent, a DeepMind research team introduces an AI take on this concept, proposing Gato, a single general-purpose agent that can perform over 600 diverse tasks ranging from captioning images to stacking blocks with a real robot arm and navigating simulated 3D environments — all while using the same network with the same weights. A novel transformer sequence model, Gato even beats human players in Atari games.

The DeepMind researchers start from the hypothesis that it is possible to train an agent that is generally capable on a large number of tasks, and that this general agent can then be adapted with little extra data to succeed at an even larger number of tasks. They note that a general-purpose agent offers significant advantages, such as reducing the need to hand-craft policy models for each domain, increasing the amount and diversity of training data, and achieving continuous improvements at the frontier of data, compute and model scale. A general-purpose agent can also be regarded as a step toward machine learning’s ultimate goal of artificial general intelligence (AGI).

Gato is designed to be trained on the widest possible variety of relevant data. By processing massive multi-modal data and serializing it into a flat sequence of tokens, Gato functions as a multi-modal, multi-task, multi-embodiment generalist policy model able to adapt to and succeed at tasks with varying modalities, observations and action specifications, and handle new tasks given minimal additional data.
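To make the idea of "serializing into a flat sequence of tokens" concrete, here is a minimal Python sketch. It assumes that text tokens occupy the integer range below 32000 and that continuous observations and actions are squashed with mu-law companding, discretized into 1024 bins and shifted above the text range. These constants, the function names and the toy episode are illustrative assumptions, not details stated in this article, and image patches and separator tokens are left out entirely.

```python
import math

TEXT_VOCAB_SIZE = 32000  # assumed size of the text vocabulary
NUM_BINS = 1024          # assumed number of bins for continuous values


def mu_law(x, mu=100.0, m=256.0):
    """Compress a real value with mu-law companding (parameters are assumptions)."""
    return math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)


def tokenize_continuous(values):
    """Map real values to integer tokens placed just above the text vocabulary."""
    tokens = []
    for v in values:
        squashed = max(-1.0, min(1.0, mu_law(v)))  # clip to [-1, 1]
        bin_index = min(NUM_BINS - 1, int((squashed + 1.0) / 2.0 * NUM_BINS))
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def serialize_episode(timesteps):
    """Flatten (text_tokens, observation, action) timesteps into one token list."""
    sequence = []
    for text_tokens, observation, action in timesteps:
        sequence.extend(text_tokens)                       # already integer token ids
        sequence.extend(tokenize_continuous(observation))  # e.g. joint angles
        sequence.extend(tokenize_continuous(action))       # e.g. joint torques
    return sequence


# Hypothetical two-step episode: a short instruction, then control data.
episode = [
    ([17, 942, 5], [0.0, 0.25], [-1.5]),
    ([], [0.1, 0.3], [2.0]),
]
flat_tokens = serialize_episode(episode)
```

Because every modality ends up as integers in one shared vocabulary, a single sequence model can be trained on the result with an ordinary next-token objective, which is what allows one network with one set of weights to cover such different tasks.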

Gato was trained on a large number of datasets comprising agent experience in both simulated and real-world environments. For language, training used MassiveText, a collection of large English-language text datasets drawn from web pages, books, news articles and code; for vision and language, it used datasets such as ALIGN (Jia et al., 2021) and COCO captions (Chen et al., 2015).

The team evaluated Gato on a variety of tasks, including simulated control, robotic stacking, and ALE Atari games. In the experiments, Gato crossed the 50 percent expert score threshold on more than 450 of its 604 tasks.
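As a rough illustration of how an "expert score threshold" can be computed, the Python sketch below normalizes each raw task score so that a random policy maps to 0 and the expert maps to 100, then counts the tasks at or above 50. The normalization formula, the helper names and the three toy tasks are assumptions made for this example; they are not figures or code from the paper.

```python
def normalized_score(agent, random_baseline, expert):
    """Scale a raw score so random play maps to 0 and expert play to 100."""
    return 100.0 * (agent - random_baseline) / (expert - random_baseline)


def count_tasks_above(results, threshold=50.0):
    """Count tasks whose normalized score reaches the threshold."""
    return sum(
        1
        for agent, random_baseline, expert in results.values()
        if normalized_score(agent, random_baseline, expert) >= threshold
    )


# Hypothetical raw scores per task: (agent, random baseline, expert).
toy_results = {
    "task_a": (75.0, 0.0, 100.0),   # normalized 75
    "task_b": (10.0, 0.0, 100.0),   # normalized 10
    "task_c": (30.0, 10.0, 50.0),   # normalized 50
}
tasks_over_threshold = count_tasks_above(toy_results)
```

Normalizing per task in this way lets scores from very different environments, such as Atari games and robot stacking, be compared against one common threshold.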

Overall, this work shows the potential of Gato-like transformer sequence models as multi-task, multi-embodiment policies that can be applied to real-world text, vision and robotics tasks, as well as their promise for few-shot learning of out-of-distribution tasks. The researchers envision such models one day being used as default starting points for learning new behaviours rather than training from scratch.

While Gato has re-ignited the AGI debate in the Twitterverse, it is worth noting that the term “AGI” does not appear in the DeepMind paper or associated blog post, which use the less ambitious “general-purpose agent” descriptor.

The paper A Generalist Agent is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

