The recent rapid rise of large language models (LLMs) has piqued research interest in the power and potential of representation models, which are designed to encode and understand data. While contemporary representation models achieve outstanding performance on unimodal tasks, they typically remain underequipped for multimodal tasks.
In the new paper ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities, a research team from Alibaba Group’s DAMO Academy and the Huazhong University of Science and Technology releases ONE-PEACE, a highly extensible model that can align and integrate representations across vision, audio, and language modalities. The team sees their work as a scalable approach that can be applied toward the creation of a general representation model for unlimited modalities.
The researchers envision a general representation model that satisfies three conditions:

1. The model architecture must be flexible enough to accommodate various modalities and support multimodal interaction.
2. Pretraining tasks should not only extract information within each modality but also ensure alignment across modalities.
3. Pretraining tasks should be general and straightforward enough to apply to different modalities.
The ONE-PEACE architecture employs a vision adapter (V-Adapter), an audio adapter (A-Adapter), and a language adapter (L-Adapter) to convert the respective raw signals into unified features. The V-Adapter uses a hierarchical multilayer perceptron (hMLP) stem to patchify a given image; the image patches are then flattened into sequences to produce image embeddings and output representations. The A-Adapter normalizes raw audio and applies a convolutional feature extractor to the normalized waveform to produce audio embeddings and output representations. The L-Adapter leverages byte-pair encoding (BPE) to transform a given text into a subword sequence; after special tokens are inserted into the resulting sequence, an embedding layer transforms the sequence into text embeddings, which inform the output text representations.
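To make the adapter idea concrete, here is a minimal toy sketch (not the paper's implementation) in which each adapter maps its raw input to a sequence of features with a shared embedding width. The function names, projection shapes, and `EMBED_DIM` value are illustrative assumptions; ONE-PEACE's real hMLP stem, convolutional extractor, and BPE tokenizer are far more elaborate.

```python
import numpy as np

EMBED_DIM = 8  # unified feature width shared by all adapters (toy value)

def v_adapter(image, patch=4):
    """Toy vision adapter: patchify an image (hMLP-stem stand-in) and
    project each flattened patch to the shared embedding width."""
    h, w = image.shape
    patches = [
        image[i:i + patch, j:j + patch].reshape(-1)
        for i in range(0, h, patch)
        for j in range(0, w, patch)
    ]
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((patch * patch, EMBED_DIM))
    return np.stack(patches) @ proj  # (num_patches, EMBED_DIM)

def a_adapter(audio, win=4):
    """Toy audio adapter: normalize the waveform, then apply a strided
    windowed projection (convolutional-extractor stand-in)."""
    audio = (audio - audio.mean()) / (audio.std() + 1e-6)
    frames = audio[: len(audio) // win * win].reshape(-1, win)
    rng = np.random.default_rng(1)
    proj = rng.standard_normal((win, EMBED_DIM))
    return frames @ proj  # (num_frames, EMBED_DIM)

def l_adapter(text, vocab):
    """Toy language adapter: split into tokens (BPE stand-in), insert
    special tokens, and look up embeddings in a table."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    rng = np.random.default_rng(2)
    table = {t: rng.standard_normal(EMBED_DIM) for t in vocab + ["<s>", "</s>"]}
    return np.stack([table[t] for t in tokens])  # (seq_len, EMBED_DIM)

# All three adapters emit sequences with the same feature width:
img_feats = v_adapter(np.ones((8, 8)))
aud_feats = a_adapter(np.sin(np.linspace(0.0, 6.28, 32)))
txt_feats = l_adapter("a dog barks", ["a", "dog", "barks"])
```

The key property this illustrates is that, whatever the raw signal, every adapter ends in the same feature space, so downstream modules can treat all modalities uniformly.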
A transformer-based modality fusion encoder with three modality feed-forward networks (V-FFN, A-FFN, and L-FFN) and a novel shared self-attention layer enables interaction between the different modalities through its attention mechanism.
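A rough sketch of one such fusion block, assuming a single attention head and toy weight shapes (all names and dimensions here are hypothetical, not taken from the paper's code): tokens from all modalities attend to each other through shared attention weights, then each modality's tokens are routed through its own feed-forward branch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_self_attention(x, wq, wk, wv):
    """Single-head self-attention whose weights are SHARED across
    modalities, so tokens from all modalities attend to each other."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def modality_ffn(x, w1, w2):
    """Modality-specific feed-forward branch (V-FFN / A-FFN / L-FFN)."""
    return np.maximum(x @ w1, 0) @ w2

def fusion_block(vision, audio, text, params):
    # 1) Concatenate the modality sequences and run shared attention.
    x = np.concatenate([vision, audio, text], axis=0)
    x = x + shared_self_attention(x, *params["attn"])
    # 2) Route each modality's tokens through its own FFN branch.
    nv, na = len(vision), len(audio)
    v, a, t = x[:nv], x[nv:nv + na], x[nv + na:]
    return (
        v + modality_ffn(v, *params["v_ffn"]),
        a + modality_ffn(a, *params["a_ffn"]),
        t + modality_ffn(t, *params["l_ffn"]),
    )

rng = np.random.default_rng(0)
d = 8
params = {
    "attn": [rng.standard_normal((d, d)) * 0.1 for _ in range(3)],
    "v_ffn": [rng.standard_normal((d, d)) * 0.1 for _ in range(2)],
    "a_ffn": [rng.standard_normal((d, d)) * 0.1 for _ in range(2)],
    "l_ffn": [rng.standard_normal((d, d)) * 0.1 for _ in range(2)],
}
v_out, a_out, t_out = fusion_block(
    rng.standard_normal((4, d)), rng.standard_normal((6, d)),
    rng.standard_normal((5, d)), params,
)
```

The design choice to share attention while separating the FFNs is what makes the architecture extensible: adding a new modality means adding one adapter and one FFN branch, while the shared attention layer handles cross-modal interaction unchanged.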
The team designs two modality-agnostic pretraining tasks — cross-modal aligning contrast and intra-modal denoising contrast — to pretrain ONE-PEACE. This pretraining strategy enables the model to align the semantic space of different modalities and obtain fine-grained details concurrently in all the modalities.
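The cross-modal half of this strategy can be sketched as a standard InfoNCE-style contrastive loss between paired embeddings from two modalities. This is only an illustration of the general idea, under assumed function names and a toy temperature; ONE-PEACE's actual aligning-contrast and denoising-contrast objectives differ in detail.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_modal_contrastive_loss(feats_a, feats_b, temperature=0.07):
    """InfoNCE-style loss: matched pairs (row i of each modality) are
    pulled together; mismatched pairs are pushed apart."""
    a, b = l2_normalize(feats_a), l2_normalize(feats_b)
    logits = a @ b.T / temperature           # (N, N) similarity matrix
    labels = np.arange(len(a))               # diagonal = positive pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()

rng = np.random.default_rng(0)
paired = rng.standard_normal((4, 8))
# Nearly-identical pairs (well aligned) vs. random pairings (unaligned):
loss_aligned = cross_modal_contrastive_loss(
    paired, paired + 0.01 * rng.standard_normal((4, 8)))
loss_random = cross_modal_contrastive_loss(
    paired, rng.standard_normal((4, 8)))
```

Well-aligned embeddings yield a much lower loss than random pairings, which is exactly the gradient signal that drives the semantic spaces of different modalities together.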
In their empirical study, the team evaluated ONE-PEACE on vision, audio, vision-language, and audio-language tasks. ONE-PEACE achieved state-of-the-art results on multimodal tasks including audio-visual question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g).
The team plans to test ONE-PEACE on additional downstream tasks involving video, 3D point clouds, and other modalities, and to explore combining it with large language models (LLMs) to build a more powerful general representation model and, ultimately, a more general multimodal language model.
Author: Hecate He | Editor: Michael Sarazen