AI Research

12-in-1: Facebook AI’s New Framework Tackles Multiple Vision-and-Language Tasks

The model reduces the number of parameters from some 3 billion to 270 million while improving task performance by an average of 2.05 points.

In recent years, researchers across the deep learning, computer vision, and natural language processing communities have become increasingly interested in vision and language (V&L).

A compelling reason to study language and vision jointly is the promise of language as a universal and natural interface for visual reasoning problems — useful both for specifying a wide range of problems and for communicating AI responses. However, previous research in visually grounded language understanding has been mostly task-specific.

Researchers from Facebook AI Research, the Georgia Institute of Technology, and Oregon State University found that the skills required for different V&L tasks, such as visual question answering and caption-based image retrieval, overlap significantly — a finding made exploitable by the rise of general-purpose V&L architectures.

The wide variety of independent V&L tasks motivated these researchers to explore ways to consolidate some of them — and the result of their efforts is an all-in-one model that learns from 12 datasets spanning four broad categories of V&L tasks. The resulting model cuts the parameter count from some 3 billion to 270 million while improving task performance by an average of 2.05 points.

Based on the recently proposed ViLBERT (Vision-and-Language BERT) model for learning joint representations of image content and natural language, the new model focuses on four categories — visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification.


Among the 12 datasets are three for vocab-based VQA (VQAv2, GQA, and VGQA), two for image retrieval (COCO and Flickr30K), five for referring expressions (RefCOCO, RefCOCO+, RefCOCOg, Visual7W, and GuessWhat), and two for multi-modal verification (NLVR2 and SNLI-VE).

Previous V&L datasets were notorious for their wide variation in size, quality, interface, and difficulty. The new research not only shows that a single model can perform multiple tasks, but also demonstrates that with the same architecture, training on multiple datasets can actually improve task metrics compared with single-task training.
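Training one model on 12 datasets of very different sizes requires interleaving batches across tasks so that small datasets are not drowned out by large ones. The following is a minimal illustrative sketch of such interleaving — a simple round-robin scheduler with hypothetical dataset names — not the paper's actual method, which uses a more sophisticated dynamic "stop-and-go" scheduling scheme:

```python
def round_robin_batches(datasets, steps):
    """Yield (task_name, batch) pairs, cycling through tasks so every
    dataset contributes regardless of its size. Exhausted datasets are
    restarted, so small datasets are revisited more often per epoch.

    datasets: dict mapping task name -> iterable of batches
    steps:    total number of training steps to schedule
    """
    names = list(datasets)
    iterators = {name: iter(datasets[name]) for name in names}
    for step in range(steps):
        name = names[step % len(names)]  # visit tasks in fixed rotation
        try:
            batch = next(iterators[name])
        except StopIteration:
            # Restart this dataset once all its batches have been seen.
            iterators[name] = iter(datasets[name])
            batch = next(iterators[name])
        yield name, batch


# Toy usage with two unequally sized "datasets" (hypothetical batches):
schedule = list(round_robin_batches(
    {"VQAv2": ["a1", "a2", "a3"], "COCO": ["b1", "b2"]}, steps=6))
```

In the toy run above, the small COCO list is restarted after its two batches are consumed, so both tasks receive three batches each over six steps.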

The paper further demonstrates that multi-task training can serve as an effective pretraining step for single-task models, yielding further gains and setting a new state of the art on 7 of the 12 dataset tasks.

The paper 12-in-1: Multi-Task Vision and Language Representation Learning is available on arXiv.


Journalist: Yuan Yuan | Editor: Michael Sarazen
