Human beings perceive and understand the world by processing high-dimensional inputs from modalities as diverse as vision, audio, touch and proprioception. Yet most machine learning models rely on modality-specific architectures that handle a stereotyped set of inputs and outputs associated with a single task. This forces researchers to redesign their architectures every time the inputs change.
The recently proposed Perceiver model (Jaegle et al., 2021) aims to support diverse inputs, and obtains impressive results on domains such as images, audio and point clouds while scaling linearly in compute and memory with input size. However, this model can only produce simple outputs such as class scores.
To broaden the Perceiver model’s capabilities, a DeepMind research team has proposed Perceiver IO, a single network that can easily integrate and transform arbitrary information for arbitrary tasks. Perceiver IO retains Perceiver’s appealing properties, now scaling linearly with both input and output sizes, and achieves outstanding results on tasks with highly structured output spaces, such as natural language and visual understanding.
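To make the linear-scaling claim concrete, here is a rough back-of-the-envelope sketch. It counts only attention score entries for a single head and ignores projections and feed-forward layers; the function names and the example sizes are illustrative assumptions, not figures from the paper.

```python
def self_attention_scores(num_inputs: int, num_layers: int) -> int:
    """Pairwise attention scores for a standard Transformer run directly on the inputs."""
    return num_layers * num_inputs * num_inputs


def perceiver_io_scores(num_inputs: int, num_outputs: int,
                        num_latents: int, num_layers: int) -> int:
    """Attention scores for an encode / process / decode layout (simplified count).

    encode:  every latent attends to every input        -> num_latents * num_inputs
    process: latents attend to each other in each layer -> num_layers * num_latents**2
    decode:  every output query attends to every latent -> num_outputs * num_latents
    """
    encode = num_latents * num_inputs
    process = num_layers * num_latents * num_latents
    decode = num_outputs * num_latents
    return encode + process + decode


# Hypothetical sizes, chosen only to show the trend when the input doubles.
for num_inputs in (1000, 2000):
    print(num_inputs,
          self_attention_scores(num_inputs, num_layers=2),
          perceiver_io_scores(num_inputs, num_outputs=1000,
                              num_latents=10, num_layers=2))
```

Under this simplified count, doubling the input size quadruples the cost of self-attention over the raw inputs, while the latent-based layout only grows by the number of latents times the number of added inputs.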
The team describes their main contribution as Perceiver IO’s novel decoding procedure. The model is built upon Perceiver, which uses attention to map inputs in a wide range of modalities to a fixed-size latent space. This process decouples the bulk of the network’s processing from the size and modality-specific details of the input, enabling it to scale efficiently to large and multimodal data.
Perceiver IO, meanwhile, uses a cross-attention mechanism to map from latents to arbitrarily sized and structured outputs via a querying system that can specify the semantics needed for outputs across a wide range of domains. This design adds the decoder to the original Perceiver, enabling Perceiver IO to serve as a drop-in replacement for many specialist networks while improving model efficiency compared to the original Perceiver network.
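The query-based decoding described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the team's implementation: the function name decode, the random weight matrices and all of the sizes are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def decode(latents, output_queries, w_q, w_k, w_v):
    """Single-head cross-attention from output queries to latents.

    latents:        (num_latents, latent_dim)
    output_queries: (num_outputs, query_dim), one row per desired output element
    returns:        (num_outputs, out_dim)
    """
    q = output_queries @ w_q            # (num_outputs, attn_dim)
    k = latents @ w_k                   # (num_latents, attn_dim)
    v = latents @ w_v                   # (num_latents, out_dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each output attends over all latents
    return weights @ v


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
num_latents, latent_dim, query_dim, attn_dim, out_dim = 8, 16, 4, 16, 3
latents = rng.normal(size=(num_latents, latent_dim))
w_q = rng.normal(size=(query_dim, attn_dim))
w_k = rng.normal(size=(latent_dim, attn_dim))
w_v = rng.normal(size=(latent_dim, out_dim))

# Same latents, two differently sized query sets: the output size follows the queries.
few_queries = rng.normal(size=(5, query_dim))
many_queries = rng.normal(size=(200, query_dim))
few = decode(latents, few_queries, w_q, w_k, w_v)
many = decode(latents, many_queries, w_q, w_k, w_v)
print(few.shape, many.shape)  # (5, 3) (200, 3)
```

Because each output row is computed from its own query alone, the number and structure of outputs is set entirely by the queries, while the latents stay the same size.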
The Perceiver IO pipeline comprises three main steps: 1) inputs are encoded to a latent space; 2) the latent representation is refined via many layers of processing; 3) the latent space is decoded to produce outputs. The approach leverages domain-agnostic primitives for nonlocal processing of inputs, decoupling the size of the latent representation that does most of the processing from the sizes of the input and output spaces, while making minimal assumptions about the spatial or locality structure of the inputs and outputs. Perceiver IO can thus handle more complex output spaces with arbitrary shape and structure, while the latent features remain agnostic to the shape and structure of the outputs.
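The three steps can be put together in a heavily simplified sketch. It assumes that inputs, latents and output queries all share one feature dimension, and it omits the learned projections, normalization and feed-forward blocks a real model would include; attend and perceiver_io_forward are hypothetical names used only for this illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    """Scaled dot-product attention with no learned parameters (a simplification)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ keys_values


def perceiver_io_forward(inputs, latent_init, output_queries, num_layers=4):
    """inputs: (M, D), latent_init: (N, D), output_queries: (O, D) -> (O, D)."""
    # 1) Encode: the fixed-size latents cross-attend to the inputs.
    latents = attend(latent_init, inputs)
    # 2) Process: the latents self-attend repeatedly, with a residual connection.
    for _ in range(num_layers):
        latents = latents + attend(latents, latents)
    # 3) Decode: the output queries cross-attend to the processed latents.
    return attend(output_queries, latents)


# Hypothetical sizes: the latent array stays at 6 rows whatever the input length.
rng = np.random.default_rng(0)
dim = 8
latent_init = rng.normal(size=(6, dim))
output_queries = rng.normal(size=(10, dim))

small_out = perceiver_io_forward(rng.normal(size=(50, dim)), latent_init, output_queries)
large_out = perceiver_io_forward(rng.normal(size=(500, dim)), latent_init, output_queries)
print(small_out.shape, large_out.shape)  # (10, 8) (10, 8)
```

Only the encode step touches the inputs and only the decode step touches the queries, so the repeated processing in the middle costs the same regardless of input or output size.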
To evaluate the generality of Perceiver IO, the DeepMind team conducted experiments on various domains, including language understanding, visual understanding, symbolic representations for games (StarCraft II), and multimodal and multi-task settings.
In the evaluations, the proposed Perceiver IO achieved impressive results on tasks with highly structured output spaces, matching a transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization. It also achieved state-of-the-art results on Sintel optical flow estimation, indicating its promising future as a general-purpose neural network architecture.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang