AI models’ task-specific performance has improved dramatically in recent years, and many in the research community are now exploring ways to create more flexible “omnipotent models” that can handle multiple tasks and modalities in a humanlike manner. The progress of transformer architectures has demonstrated their potential as such universal models, but achieving this goal requires satisfying three conditions: the model must be task-agnostic (TA), modality-agnostic (MA) and demonstrate task comprehensiveness (TC), all while maintaining strong performance.
In the new paper Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, a research team from Alibaba Group’s DAMO Academy proposes OFA (One For All), a unified multimodal pretrained model that unifies modalities and tasks in a simple Seq2Seq (sequence-to-sequence) learning framework and achieves new state-of-the-art results on a series of multimodal tasks.
Inspired by the transformer’s success in multimodal pretraining, the researchers adopted the conventional transformer architecture as their OFA backbone, leveraging its encoder-decoder network as a unified architecture for pretraining, finetuning and zero-shot tasks. To stabilize training and accelerate convergence, they added head scaling to self-attention, a post-attention layer normalization (LN), and an LN after the first layer of the feed-forward network (FFN).
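The three stabilization tricks can be sketched in PyTorch as below. This is a minimal illustration, not the authors’ code: the exact placement of the extra LayerNorms (here following a Normformer-style pre-LN layer) and all dimensions are assumptions for the sake of the example.

```python
import torch
import torch.nn as nn


class HeadScaledSelfAttention(nn.Module):
    """Self-attention with a learnable per-head gain ("head scaling"),
    applied before the heads are concatenated and projected."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.head_scale = nn.Parameter(torch.ones(n_heads))  # one gain per head

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):  # (b, t, d) -> (b, heads, t, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        h = att @ v                                # (b, heads, t, d_head)
        h = h * self.head_scale.view(1, -1, 1, 1)  # scale each head's output
        return self.out(h.transpose(1, 2).reshape(b, t, d))


class OFAEncoderLayer(nn.Module):
    """Pre-LN transformer layer with the two extra LayerNorms mentioned in
    the article: one after attention (before the residual add) and one
    after the first FFN linear. Placement here is an illustrative guess."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ffn: int = 2048):
        super().__init__()
        self.attn = HeadScaledSelfAttention(d_model, n_heads)
        self.ln_pre_attn = nn.LayerNorm(d_model)
        self.ln_post_attn = nn.LayerNorm(d_model)  # post-attention LN
        self.ln_pre_ffn = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.ln_mid_ffn = nn.LayerNorm(d_ffn)      # LN after first FFN layer
        self.fc2 = nn.Linear(d_ffn, d_model)

    def forward(self, x):
        x = x + self.ln_post_attn(self.attn(self.ln_pre_attn(x)))
        return x + self.fc2(self.ln_mid_ffn(torch.relu(self.fc1(self.ln_pre_ffn(x)))))
```

Both the per-head gains and the extra normalizations keep activation magnitudes in check across deep stacks, which is why they help convergence.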
To unify tasks and modalities, the team introduced a simple unified Seq2Seq learning paradigm for pretraining, finetuning, and inference on all tasks with multiple modalities. This approach enables performing multitask pretraining on multimodal and uni-modal data to endow the model with more comprehensive capabilities.
The team designed five tasks for cross-modal representation learning: visual grounding (VG), grounded captioning (GC), image-text matching (ITM), image captioning (IC), and visual question answering (VQA). For uni-modal representation learning, they used two vision tasks (object detection and image infilling) and a masked language modelling task. After jointly pretraining on these tasks and multiple datasets, OFA was able to perform different tasks with various modalities and complex cross-modal scenarios.
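In the unified Seq2Seq framing, every one of these tasks reduces to mapping an instruction (plus task inputs) to a target token sequence. The sketch below illustrates the idea for the five cross-modal tasks; the instruction wording paraphrases the paper’s templates and the `build_input` helper is hypothetical, as is the omission of image features, which the model would prepend to the token sequence.

```python
# Illustrative instruction templates for the five cross-modal tasks.
# Wording is a paraphrase, not necessarily the paper's exact prompts.
TASK_TEMPLATES = {
    "image_captioning":    "what does the image describe?",
    "vqa":                 "{question}",
    "visual_grounding":    'which region does the text "{text}" describe?',
    "grounded_captioning": "what does the region describe? region: {region}",
    "image_text_matching": 'does the image describe "{text}"?',
}


def build_input(task: str, **fields) -> str:
    """Render the instruction for one task as a plain text sequence."""
    return TASK_TEMPLATES[task].format(**fields)
```

For example, `build_input("visual_grounding", text="a red car")` yields `'which region does the text "a red car" describe?'`; because every task shares this text-in, text-out interface, one encoder-decoder can be pretrained, finetuned and prompted zero-shot on all of them.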
The team conducted extensive evaluation experiments on OFA and summarized the results as follows:
- Experiments demonstrate that OFA achieves new SOTAs on multimodal benchmarks, including image captioning (COCO test CIDEr: 149.6), text-to-image generation (COCO test FID: 10.5), VQA (test-std acc.: 80.02), SNLI-VE (test acc.: 90.20), and referring expression comprehension (RefCOCO / RefCOCO+ / RefCOCOg test acc.: 92.93 / 90.10 / 85.20). It also performs competitively with uni-modal pretrained models on language and vision tasks, whereas most previous multimodal pretrained models still far underperform their uni-modal counterparts.
- We verify that OFA achieves competitive performance in zero-shot learning. Also, it can transfer to unseen tasks with new task instructions and adapt to out-of-domain information without finetuning.
Overall, the study validates OFA as capable of multimodal and uni-modal understanding and generation, and demonstrates its promising potential in zero-shot learning and domain and task transfer. The team plans to continue in this research direction, working toward a practical recipe for building an “omni-model” that generalizes to the complex real world.
The associated code will be released on the project GitHub. The paper Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework is on arXiv.
Author: Hecate He | Editor: Michael Sarazen