In recent years the machine learning research community has turned its attention to the convergence of language, vision, and multimodal pretraining, aiming to develop general-purpose foundation models that can handle multiple modalities and be easily adapted to diverse downstream tasks.
In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3 (BERT Pretraining of Image Transformers), a general-purpose state-of-the-art multimodal foundation model for both vision and vision-language tasks that advances the big convergence of backbone architectures, pretraining tasks, and model scaling.
BEiT-3 performs masked “language” modeling on images, texts, and image-text pairs in a unified manner. It is pretrained on massive monomodal and multimodal data via a novel shared Multiway Transformer network, which the team uses as their backbone for encoding different modalities. The Multiway Transformer blocks employ a shared self-attention module that learns to align different modalities and provides deep fusion for multimodal tasks, and a pool of feed-forward networks for representing different modalities.
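To make the Multiway Transformer idea concrete, here is a minimal sketch of one block in plain Python. It is not the team's implementation. The class name MultiwayBlock, the two expert names ("vision" and "language"), the GELU activation, the tiny dimensions, and the random weights are all assumptions chosen for illustration, and layer normalization and multi-head attention are omitted for brevity. What it shows is the routing described above: every token passes through the same self-attention weights, then through the feed-forward network that matches its modality.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    """Small random weight matrix (a stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def gelu(x):
    """Exact GELU activation (assumed here; the article does not name one)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MultiwayBlock:
    """Hypothetical block: shared self-attention plus one FFN per modality."""

    def __init__(self, dim, hidden, modalities=("vision", "language"), seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One set of attention weights shared by every modality.
        self.wq = rand_matrix(dim, dim, rng)
        self.wk = rand_matrix(dim, dim, rng)
        self.wv = rand_matrix(dim, dim, rng)
        # A pool of feed-forward "experts", one per modality.
        self.experts = {
            name: (rand_matrix(hidden, dim, rng), rand_matrix(dim, hidden, rng))
            for name in modalities
        }

    def attend(self, tokens):
        """Single-head scaled dot-product attention over all tokens."""
        queries = [matvec(self.wq, t) for t in tokens]
        keys = [matvec(self.wk, t) for t in tokens]
        values = [matvec(self.wv, t) for t in tokens]
        scale = math.sqrt(self.dim)
        mixed = []
        for q in queries:
            weights = softmax(
                [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
            )
            mixed.append(
                [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(self.dim)]
            )
        return mixed

    def expert_ffn(self, vec, modality):
        """Apply the feed-forward network that belongs to this modality."""
        w_in, w_out = self.experts[modality]
        return matvec(w_out, [gelu(h) for h in matvec(w_in, vec)])

    def forward(self, tokens, modalities):
        attended = self.attend(tokens)
        # Residual connection around the shared attention.
        hidden = [[t + a for t, a in zip(tok, att)]
                  for tok, att in zip(tokens, attended)]
        # Residual connection around the modality-specific FFN.
        return [
            [h + f for h, f in zip(vec, self.expert_ffn(vec, m))]
            for vec, m in zip(hidden, modalities)
        ]


# Two image-patch tokens followed by one text token (toy values).
block = MultiwayBlock(dim=4, hidden=8)
tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.4, 0.1, 0.0]]
output = block.forward(tokens, ["vision", "vision", "language"])
print(len(output), len(output[0]))
```

Because the attention step sees image and text tokens together, it is where cross-modal alignment can be learned, while the per-modality feed-forward networks keep modality-specific parameters separate.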
In the BEiT-3 pretraining process, the team leverages a unified masked data modeling objective on monomodal and multimodal data. They mask text tokens or image patches and train the model to predict the masked tokens. For multimodal data, they use 15M images and 21M image-text pairs collected from various public datasets. The monomodal data comprises 14M images from ImageNet-21K and a 160GB text corpus.
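The masking step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's code: the function name mask_tokens, the "[MASK]" placeholder, the integer ids standing in for image patches, and the masking ratios are all hypothetical and are not taken from this article. The sketch only shows the shape of the objective: hide a random subset of positions and keep the original values as prediction targets.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked position


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of positions.

    Returns the corrupted sequence plus a dict mapping each masked
    position to the original token the model would be trained to predict.
    """
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        corrupted[pos] = MASK
    return corrupted, targets


rng = random.Random(0)

# Text: a list of word-level tokens (toy example).
text = ["a", "dog", "runs", "across", "the", "green", "field", "today"]
masked_text, text_targets = mask_tokens(text, mask_ratio=0.15, rng=rng)

# Image: patches represented by placeholder integer ids (an assumption).
patches = list(range(100, 110))
masked_patches, patch_targets = mask_tokens(patches, mask_ratio=0.4, rng=rng)

print(masked_text, text_targets)
print(masked_patches, patch_targets)
```

The same function handles both inputs, which mirrors the point of a unified objective: once images and text are both sequences of tokens, one masking-and-prediction recipe covers them.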
In their empirical study, the team applied BEiT-3 to major public benchmarks such as Visual Question Answering (VQA), Visual Reasoning, Image Captioning, and Semantic Segmentation. In the evaluations, BEiT-3 achieved state-of-the-art performance on all the vision and vision-language benchmarks tested, including object detection on COCO, semantic segmentation on ADE20K, image classification on ImageNet, visual question answering on VQAv2, and image captioning on COCO.
Overall, the proposed BEiT-3 advances the building of multimodal foundation models and also opens a new and promising direction for efficiently scaling up such models.
The paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks is on arXiv.
Author: Hecate He | Editor: Michael Sarazen