
DeepMind’s GATS: A Novel Module for Seamless Integration of Multimodal Foundation Models

In our inherently multimodal world, information is encoded in various modalities, such as text, images, and video. While deep learning has made significant strides in unimodal or bimodal tasks, the increasing adoption of large-scale Artificial Intelligence (AI) models necessitates the development of general and flexible tools for their integration.

In a new paper GATS: Gather-Attend-Scatter, a Google DeepMind research team introduces Gather-Attend-Scatter (GATS), a pioneering module designed to seamlessly combine pretrained foundation models—whether trainable or frozen—into larger multimodal networks.

The GATS module comprises multiple GATS layers, akin to vanilla transformer layers, each equipped with local attention. These layers interleave with the layers of the component networks, serving as a bridge between them.

GATS operates by gathering activations from all component models, attending to the most relevant information, and scattering the combined representation back to all models by modifying their original activations. Its versatility lies in being applicable to any deep neural network, as it processes activations layer by layer. GATS is agnostic to the specific details of the neural networks it combines, making it a powerful and flexible tool.
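The gather-attend-scatter cycle described above can be illustrated with a toy, single-head sketch. This is not the paper's implementation: the weight matrices `W_q`, `W_k`, `W_v`, the residual update, and the use of full (rather than local) attention are all simplifying assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gats_layer(activations, W_q, W_k, W_v):
    """Toy illustration of one gather-attend-scatter step.

    activations: list of (tokens_i, d) arrays, one per component model.
    Note: an illustrative sketch, not DeepMind's GATS; the real module
    uses local attention interleaved with each network's layers.
    """
    # Gather: pool activations from all component models into one sequence.
    gathered = np.concatenate(activations, axis=0)        # (T, d)

    # Attend: scaled dot-product attention over the pooled sequence,
    # letting each model's tokens attend to every other model's tokens.
    q, k, v = gathered @ W_q, gathered @ W_k, gathered @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attended = softmax(scores) @ v                        # (T, d)

    # Scatter: split the attended sequence back per model and modify
    # each model's original activations (here, a residual update).
    sizes = [a.shape[0] for a in activations]
    splits = np.split(attended, np.cumsum(sizes)[:-1], axis=0)
    return [a + s for a, s in zip(activations, splits)]

# Example: combine activations from a 3-token and a 5-token model, d = 8.
rng = np.random.default_rng(0)
acts = [rng.standard_normal((3, 8)), rng.standard_normal((5, 8))]
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = gats_layer(acts, W_q, W_k, W_v)  # shapes preserved: (3, 8), (5, 8)
```

Because the update is applied to the component models' activations rather than their weights, this structure is compatible with keeping the pretrained models themselves frozen.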

The resulting GATS multimodal architectures only necessitate training the GATS module, eliminating the need to fine-tune the original pretrained models and preventing potential loss of knowledge. This renders GATS a highly versatile and general-purpose tool for building multimodal models from diverse pretrained sources.

The research team showcased the efficacy of GATS through agent experiments in diverse environments, including Atari Pong, Language-Table, and YCB. GATS demonstrated its ability to seamlessly integrate text and image models, showcasing its versatility in processing and generating various modalities. The results underscored the capability of GATS-based models to effectively leverage pretrained models and achieve state-of-the-art performance.

GATS-based architectures exhibit modularity and extendibility, paving the way for exciting possibilities in future research. This framework provides flexibility to incorporate additional modalities, leverage larger and more powerful foundation models, and explore novel applications requiring the coordination and processing of multimodal inputs.

Overall, GATS represents a significant stride in the realm of multimodal AI integration. Its ability to seamlessly combine pretrained models, eliminate the need for extensive finetuning, and exhibit modularity positions it as a versatile tool for researchers and practitioners. The demonstrated effectiveness across diverse experiments opens the door to innovative applications and inspires further exploration in the field of multimodal AI.

The paper GATS: Gather-Attend-Scatter is available on arXiv.


Author: Hecate He | Editor: Chain Zhang


