Building a general-purpose unified model that can solve diverse tasks in different modalities while maintaining high performance is a long-standing challenge in the machine learning research community. A conventional approach in this direction is building models with task-specialized heads on top of a shared architectural backbone — but such models require expert knowledge to design a specialized head for each task, and their lack of parameter-sharing for new tasks limits their transfer-learning capabilities.
In the new paper Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks, a research team from the Allen Institute for AI and the University of Washington introduces UNIFIED-IO, a neural model with no task- or modality-specific branches that achieves competitive performance across a wide variety of computer vision (CV), natural language processing (NLP), and multi-modal benchmark tasks without fine-tuning.
The researchers set out to build a unified neural architecture that ML practitioners with little or no knowledge of the underlying machinery could use to efficiently and effectively train their models for new NLP and CV tasks.
For models to support a variety of modalities (images, language, boxes, binary masks, segmentation, etc.), they must represent all modalities in a shared space. The proposed UNIFIED-IO is a pure transformer encoder-decoder model inspired by and built on a modified T5 (Text-to-Text Transfer Transformer; Raffel et al., 2020). On the input side, images are reshaped into a sequence of flattened 2D patches, which are then embedded with a linear projection. The team also expands the model vocabulary to include location tokens and the image tokens used in vector quantized generative adversarial networks (VQ-GANs), extends the 1D relative position embeddings to 2D with a fixed number of learned embeddings, and adds absolute position embeddings to the token embeddings to help with vision tasks.
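The patch-embedding step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `patchify` and `embed_patches`, the 16-pixel patch size, the 8-dimensional embedding width and the random weights are all hypothetical choices made for illustration only.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Reshape an (H, W, C) image into (num_patches, patch_size * patch_size * C)."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Split rows and columns into (grid index, offset within patch) axes.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    # Move both grid axes to the front: (gh, gw, patch_size, patch_size, c).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each 2D patch into a single vector, in row-major patch order.
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(patches: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linear projection of flattened patches into the model's embedding space."""
    return patches @ weight + bias

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patch_size = 16
d_model = 8

patches = patchify(image, patch_size)
weight = rng.normal(scale=0.02, size=(patches.shape[1], d_model))  # would be learned in a real model
bias = np.zeros(d_model)
tokens = embed_patches(patches, weight, bias)

print(patches.shape, tokens.shape)  # (4, 768) (4, 8)
```

Each row of `tokens` plays the same role as a word embedding in a text-only transformer, which is what allows image patches and text tokens to share a single input sequence.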
UNIFIED-IO is jointly trained on a large variety of tasks. These include classical CV tasks such as pose estimation, object detection, depth estimation and image generation; vision-and-language tasks such as region captioning and referring expression comprehension; and NLP tasks such as question answering and paraphrasing.
In the team's empirical study, UNIFIED-IO achieved state-of-the-art results across the seven tasks in the General Robust Image Task (GRIT) benchmark and competitive performance on 16 additional NLP and CV benchmark tasks, all without any fine-tuning, task-specific heads or other modifications.
Author: Hecate He | Editor: Michael Sarazen