
Facebook AI’s Multitask & Multimodal Unified Transformer: A Step Toward General-Purpose Intelligent Agents

A research team from Facebook AI has proposed a Unified Transformer (UniT) encoder-decoder model that jointly trains on multiple tasks across different modalities and achieves strong performance on seven tasks with a unified set of model parameters.

Transformer architectures have shown great success across machine learning (ML) tasks in natural language processing and beyond, but have mostly been limited to tasks from a single domain or specific multimodal domains. For example, ViT is exclusively for vision-related tasks, BERT focuses on language tasks, and VILBERT-MT works only on related vision-and-language tasks.

A question naturally arises: Could we build a single transformer capable of handling a wide range of applications in different domains over multiple modalities? Recently, a Facebook AI research team took up the challenge with a novel Unified Transformer (UniT) encoder-decoder model that jointly trains on multiple tasks across different modalities and achieves strong performance on these different tasks with a unified set of model parameters.


Transformers were first applied to the language domain for sequence-to-sequence models. They have since been extended to the visual domain and have even been applied to joint vision and language reasoning tasks. Although pretrained transformers can be fine-tuned for application in various downstream tasks and achieve good results, this model fine-tuning approach results in a different set of parameters being created for each downstream task.

The Facebook AI researchers propose that a single transformer may be all we really need. Their UniT is built on a traditional transformer encoder-decoder architecture, comprising separate encoders for each input modality followed by a decoder with simple task-specific heads. Inputs come in two modalities: images and text. First, a convolutional neural network backbone extracts visual features, and BERT encodes the language inputs into hidden state sequences. Then, a transformer decoder is applied to a single encoded modality or to the concatenated sequence of both encoded modalities (depending on whether the task is unimodal or multimodal). Finally, the representations from the transformer decoder are passed to a task-specific head, which outputs the final predictions.
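The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the encoders are trivial stand-ins for the CNN backbone and BERT, the decoder is a single cross-attention read, and the names `encode_inputs`, `unit_forward`, `HEADS`, the hidden size, and the fixed query vector are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional

Vector = List[float]
Sequence = List[Vector]  # one hidden-state vector per input position

HIDDEN = 4  # toy hidden size (assumption; real models are far wider)


def encode_image(pixels: List[List[float]]) -> Sequence:
    """Stand-in for the CNN backbone: one feature vector per image row."""
    return [[sum(row) / len(row)] * HIDDEN for row in pixels]


def encode_text(tokens: List[str]) -> Sequence:
    """Stand-in for BERT: one hidden-state vector per token."""
    return [[len(tok) / 10.0] * HIDDEN for tok in tokens]


def encode_inputs(image: Optional[List[List[float]]] = None,
                  text: Optional[List[str]] = None) -> Sequence:
    """Encode each modality that is present and concatenate the sequences."""
    memory: Sequence = []
    if image is not None:
        memory += encode_image(image)
    if text is not None:
        memory += encode_text(text)
    if not memory:
        raise ValueError("at least one modality is required")
    return memory


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(query: Vector, memory: Sequence) -> Vector:
    """Stand-in for the shared decoder: one cross-attention read of memory."""
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, vec)) / scale for vec in memory]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, memory))
            for i in range(len(query))]


# Hypothetical task-specific heads: each maps the decoder output to a label.
HEADS: Dict[str, Callable[[Vector], str]] = {
    "sst2": lambda h: "positive" if sum(h) > 1.5 else "negative",
    "vqa": lambda h: "yes" if sum(h) > 1.0 else "no",
}


def unit_forward(task: str,
                 image: Optional[List[List[float]]] = None,
                 text: Optional[List[str]] = None) -> str:
    memory = encode_inputs(image=image, text=text)  # per-modality encoders
    query = [1.0] * HIDDEN                          # assumed fixed query vector
    hidden = decode(query, memory)                  # shared decoder
    return HEADS[task](hidden)                      # task-specific head


# Unimodal task (text only) and multimodal task (image plus text).
print(unit_forward("sst2", text=["great", "movie"]))
print(unit_forward("vqa", image=[[0.2, 0.4], [0.6, 0.8]], text=["is", "it", "red"]))
```

The point of the sketch is the routing: every task passes through the same `decode` function, and only the entry in `HEADS` changes from task to task.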

Figure: UniT model overview

To evaluate UniT’s performance, the researchers conducted experiments that required jointly learning a number of popular tasks from different domains: object detection on the COCO and Visual Genome datasets, language understanding tasks from the GLUE benchmark (QNLI, QQP, MNLI-mismatched, and SST-2), and visual reasoning tasks on the VQAv2 and SNLI-VE datasets.
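As a rough illustration of what joint training over these datasets can look like, the sketch below picks one dataset per training step and tallies how often each task is visited. The uniform sampling, the `sample_schedule` helper, and the grouping of both detection datasets under a single task are assumptions made for illustration; the article does not describe the actual schedule.

```python
import random
from collections import Counter
from typing import Dict, List

# Dataset -> task mapping built from the experiment list above. Counting
# detection on COCO and Visual Genome as one task is an assumption.
DATASET_TASK: Dict[str, str] = {
    "COCO": "object detection",
    "Visual Genome": "object detection",
    "QNLI": "question-answer inference",
    "QQP": "paraphrase detection",
    "MNLI-mismatched": "natural language inference",
    "SST-2": "sentiment classification",
    "VQAv2": "visual question answering",
    "SNLI-VE": "visual entailment",
}


def sample_schedule(steps: int, seed: int = 0) -> List[str]:
    """Pick one dataset per training step, uniformly at random (assumption)."""
    rng = random.Random(seed)
    names = list(DATASET_TASK)
    return [rng.choice(names) for _ in range(steps)]


schedule = sample_schedule(1000)
per_task = Counter(DATASET_TASK[name] for name in schedule)

print(len(set(DATASET_TASK.values())), "tasks over", len(DATASET_TASK), "datasets")
for task, count in per_task.most_common():
    print(f"{task}: {count} steps")
```

Under this grouping the mapping contains seven distinct tasks over eight datasets. In a real setup each sampled step would run a forward and backward pass on a batch from the chosen dataset, updating the shared parameters together with that task's head.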

Figure: UniT performance on multi-task training over object detection and VQA

Figure: Analyses on object detection and VQA with the UniT model

Figure: UniT model performance on seven tasks across eight datasets

Figure: UniT model predictions with a shared decoder

The results show that UniT can handle seven tasks across eight datasets simultaneously, achieving strong performance on each task with a unified set of model parameters. This suggests UniT’s potential as a domain-agnostic transformer architecture, a step toward more general intelligent agents.

The paper "Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer" is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

