Transformer architectures have achieved great success across machine learning (ML) tasks in natural language processing and beyond, but most models remain limited to a single domain or to specific multimodal settings. For example, ViT handles only vision-related tasks, BERT focuses on language tasks, and ViLBERT-MT works only on related vision-and-language tasks.
A question naturally arises: Could we build a single transformer capable of handling a wide range of applications in different domains over multiple modalities? Recently, a Facebook AI research team took up the challenge with a novel Unified Transformer (UniT) encoder-decoder model that jointly trains on multiple tasks across different modalities and achieves strong performance on these different tasks with a unified set of model parameters.

Transformers were first applied to the language domain for sequence-to-sequence models. They have since been extended to the visual domain and have even been applied to joint vision-and-language reasoning tasks. Although pretrained transformers can be fine-tuned on various downstream tasks with good results, this fine-tuning approach produces a different set of parameters for each downstream task.
The Facebook AI researchers propose that a single transformer may be all we really need. Their UniT is built on the traditional transformer encoder-decoder architecture, comprising separate encoders for each input modality followed by a decoder with simple task-specific heads. Inputs come in two modalities: images and text. First, a convolutional neural network backbone extracts visual features, while BERT encodes the language inputs into hidden state sequences. Then, a transformer decoder is applied over the encoded single modality or over the concatenated sequence of both encoded modalities, depending on whether the task is unimodal or multimodal. Finally, the representations from the transformer decoder are passed to a task-specific head, which outputs the final predictions.
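To make the data flow concrete, here is a minimal, illustrative sketch of a UniT-style forward pass. It is not the authors' code: the toy CNN backbone, the small transformer encoder standing in for BERT, the learned query embeddings, the head names, and all sizes are assumptions chosen for brevity.

```python
# Illustrative UniT-style sketch (assumptions noted above), not the paper's implementation.
import torch
import torch.nn as nn


class UniTSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_queries=100,
                 vocab_size=30522, num_classes=3129):
        super().__init__()
        # Visual encoder: a toy CNN backbone projected to the shared hidden size.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Text encoder: a small transformer encoder standing in for BERT.
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Shared decoder attends over the (possibly concatenated) encodings
        # from a set of learned query embeddings.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        # One lightweight head per task (names and output sizes are placeholders).
        self.heads = nn.ModuleDict({
            "vqa": nn.Linear(d_model, num_classes),
            "sst2": nn.Linear(d_model, 2),
        })

    def forward(self, task, image=None, text_ids=None):
        memory = []
        if image is not None:                      # encode the visual modality
            feats = self.backbone(image)           # (B, d_model, H', W')
            memory.append(feats.flatten(2).transpose(1, 2))
        if text_ids is not None:                   # encode the language modality
            memory.append(self.text_encoder(self.embed(text_ids)))
        memory = torch.cat(memory, dim=1)          # concatenate when multimodal
        queries = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        hidden = self.decoder(queries, memory)     # (B, num_queries, d_model)
        return self.heads[task](hidden[:, 0])      # task-specific prediction


model = UniTSketch()
img = torch.randn(2, 3, 224, 224)
txt = torch.randint(0, 30522, (2, 16))
print(model("vqa", image=img, text_ids=txt).shape)  # torch.Size([2, 3129])
```

The key design point mirrored here is that only the small output heads differ per task; the backbone, text encoder, and decoder weights are shared across every task and modality.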

To evaluate UniT’s performance, the researchers conducted experiments that required jointly learning a number of popular tasks from different domains: object detection on the COCO and Visual Genome datasets, language understanding tasks from the GLUE benchmark (QNLI, QQP, MNLI-mismatched, and SST-2), and visual reasoning tasks on the VQAv2 and SNLI-VE datasets.
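The following sketch hedges at what such joint training could look like with a single parameter set: each step samples one task, runs the shared model with that task's head, and updates the same weights. The per-step random task sampling, the two-task setup, and the dummy batches are illustrative assumptions, not the paper's exact recipe, and it reuses the hypothetical UniTSketch class from the sketch above.

```python
# Hedged joint multi-task training sketch (assumptions noted above).
import random
import torch

model = UniTSketch()                                   # one shared model for all tasks
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss_fn = torch.nn.CrossEntropyLoss()

def dummy_batch(task):
    """Stand-in for real COCO / GLUE / VQAv2 dataloaders."""
    if task == "vqa":          # multimodal: image + question
        return {"image": torch.randn(2, 3, 224, 224),
                "text_ids": torch.randint(0, 30522, (2, 16)),
                "labels": torch.randint(0, 3129, (2,))}
    return {"image": None,     # unimodal: text-only sentiment
            "text_ids": torch.randint(0, 30522, (2, 16)),
            "labels": torch.randint(0, 2, (2,))}

for step in range(10):
    task = random.choice(["vqa", "sst2"])              # sample a task for this step
    batch = dummy_batch(task)
    logits = model(task, image=batch["image"], text_ids=batch["text_ids"])
    loss = loss_fn(logits, batch["labels"])            # task-specific loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # update the shared parameters
```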

In the experiments, the proposed UniT model simultaneously handled seven tasks across eight datasets with a unified set of model parameters, achieving strong performance on each. These results suggest UniT’s potential as a domain-agnostic transformer architecture, a step toward the goal of more generalized intelligent agents.
The paper Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.