Current state-of-the-art vision and vision-and-language models are generally either cross-modal (contrastive) or multi-modal (with earlier fusion) and tend to target specific modalities or tasks. A promising direction envisioned by many in the machine learning research community is the development of next-generation “foundation models” or “universal” transformers capable of tackling vision, language, and vision-and-language problems simultaneously.
Taking a step toward the realization of such universal transformers, a Facebook AI Research (FAIR) team has introduced FLAVA, a foundational language and vision alignment model that explicitly targets language, vision, and their multimodal combination all at once and achieves impressive performance on 35 tasks across these domains.
The FAIR team set out to learn a foundational language and vision representation capable of unimodal vision and language understanding as well as multimodal reasoning within a single pretrained model.
The FLAVA architecture comprises an image encoder that extracts unimodal image representations, a text encoder that obtains unimodal text representations, and a multimodal encoder that fuses and aligns the image and text representations for multimodal reasoning.
The model’s image encoder is based on vision transformer (ViT) architectures. The processed input images are linearly embedded and fed into a transformer model, which outputs a list of image hidden state vectors and an additional vector for an image classification token. The text encoder meanwhile tokenizes and embeds text inputs into a list of word vectors. A transformer model then encodes the word vectors into a list of hidden state vectors, including the additional vector for the text classification token.
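As a rough illustration of the "linearly embedded" step described above, the following minimal sketch splits an image into non-overlapping patches, applies one linear projection to each flattened patch, and prepends a classification-token vector. The function names, the toy image size, the patch size, and the hidden size are all hypothetical choices made for readability, not values taken from the paper, and position embeddings are left out for brevity.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append every channel value of this pixel
            patches.append(flat)
    return patches


def embed_patches(patches, weight, bias, cls_vector):
    """Linearly embed each flattened patch and prepend the classification-token vector."""
    embedded = [
        [sum(x * w for x, w in zip(patch, w_row)) + b for w_row, b in zip(weight, bias)]
        for patch in patches
    ]
    return [cls_vector] + embedded


# Toy example (all sizes hypothetical): a 4x4 RGB image, 2x2 patches, hidden size 3.
image = [[[float(r + c), 0.5, 1.0] for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)               # 4 patches of 2*2*3 = 12 values each
weight = [[0.01 * (i + 1)] * 12 for i in range(3)]    # 3 x 12 projection matrix
bias = [0.0, 0.0, 0.0]
cls_vector = [0.0, 0.0, 0.0]                          # stands in for a learned vector
sequence = embed_patches(patches, weight, bias, cls_vector)
print(len(sequence), len(sequence[0]))
```

The resulting list has one vector per patch plus one for the classification token, which is the sequence a transformer encoder would then process.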
The team uses the same ViT architecture (with different parameters) for the image and text encoders, and a separate transformer, the multimodal encoder, to fuse the image and text hidden states. A learned linear projection is applied to each image hidden state vector and another to each text hidden state vector; the projected vectors are then concatenated into a single list, which is fed into the multimodal encoder transformer to allow cross-attention between the two modalities.
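The fusion step can be pictured with a small sketch, under stated assumptions: each modality gets its own linear projection into a shared width, and the projected vectors are joined into one list behind a multimodal classification token before entering the multimodal encoder. The helper names, the dimensions, the random weights, and the zero-valued `cls_m` vector are hypothetical placeholders; the transformer itself is omitted.

```python
import random


def make_projection(in_dim, out_dim, rng):
    """Create a randomly initialised (weight, bias) pair standing in for a learned projection."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(vector, projection):
    """Apply a linear projection to one hidden state vector."""
    weight, bias = projection
    return [sum(x * w for x, w in zip(vector, row)) + b for row, b in zip(weight, bias)]


def build_multimodal_input(image_states, text_states, image_proj, text_proj, cls_m):
    """Project each modality separately, then join everything into one sequence."""
    projected_image = [project(h, image_proj) for h in image_states]
    projected_text = [project(h, text_proj) for h in text_states]
    return [cls_m] + projected_image + projected_text


rng = random.Random(0)
image_states = [[rng.random() for _ in range(4)] for _ in range(5)]  # 5 image vectors, width 4
text_states = [[rng.random() for _ in range(6)] for _ in range(3)]   # 3 text vectors, width 6
fused_dim = 8                                                        # shared width (hypothetical)
image_proj = make_projection(4, fused_dim, rng)
text_proj = make_projection(6, fused_dim, rng)
cls_m = [0.0] * fused_dim                                            # placeholder token vector
fused_input = build_multimodal_input(image_states, text_states, image_proj, text_proj, cls_m)
print(len(fused_input), len(fused_input[0]))
```

Because every vector in the joined list has the same width, a single transformer can attend across image and text positions alike, which is what makes the fusion possible.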
The team pretrained FLAVA on Public Multimodal Datasets (PMD), a corpus constructed from publicly available data sources that contain text-image pairs, and evaluated the model on 35 tasks: 22 common vision tasks, 8 tasks from the GLUE (General Language Understanding Evaluation) benchmark, and a variety of multimodal tasks such as image and text retrieval on the COCO (Common Objects in Context) dataset.
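Retrieval tasks of this kind are commonly scored with recall@K: the fraction of queries whose correct match appears among the K highest-scoring candidates. The sketch below is a simplified illustration rather than the paper's evaluation code. It assumes a square similarity matrix in which query i has exactly one correct candidate, candidate i, whereas COCO pairs each image with several captions. The function name and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching candidate (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score for this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 text queries against 3 images (values are made up).
similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.6],
]
print(recall_at_k(similarity, 1))
print(recall_at_k(similarity, 2))
```

In this toy matrix only the first query ranks its match first, while all three matches fall within the top two, so recall rises from one third at K=1 to 1.0 at K=2.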
Full FLAVA pretraining achieved the best average scores on vision, natural language processing (NLP), and multimodal tasks compared to baseline masked image modelling (MIM) and masked language modelling (MLM) methods. Although its training corpus was far smaller than those used for similar recent models, FLAVA obtained better or competitive performance across the 35 evaluation tasks, even approaching the performance of Google’s BERT language model on several GLUE benchmark tasks.
Overall, this study validates the effectiveness of the proposed FLAVA as a single foundational model for vision tasks, language tasks, and cross- and multi-modal vision and language tasks. The team believes their work can point the way forward toward the creation of generalized but open models that perform well across a wide variety of multimodal tasks.
The paper FLAVA: A Foundational Language and Vision Alignment Model is on arXiv.
Author: Hecate He | Editor: Michael Sarazen