In recent years the machine learning research community has turned its attention to the convergence of language, vision, and multimodal pretraining, aiming to develop general-purpose foundation models that can handle multiple modalities and be easily adapted to diverse downstream tasks.
In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3 (BERT Pretraining of Image Transformers), a state-of-the-art general-purpose multimodal foundation model for vision and vision-language tasks that advances the big convergence of backbone architectures, pretraining tasks, and model scaling.


BEiT-3 performs masked “language” modeling on images, texts, and image-text pairs in a unified manner. It is pretrained on massive monomodal and multimodal data via a novel shared Multiway Transformer network, which the team uses as the backbone for encoding different modalities. Each Multiway Transformer block employs a shared self-attention module, which learns to align the modalities and provides deep fusion for multimodal tasks, together with a pool of modality-specific feed-forward networks used to represent each modality.
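
To make the routing idea concrete, here is a minimal PyTorch-style sketch of a Multiway Transformer block: one self-attention module shared by all modalities, plus a separate feed-forward expert per modality. The class and parameter names are illustrative assumptions, not taken from the released BEiT-3 code.

```python
# Illustrative sketch of a Multiway Transformer block (not the official BEiT-3 implementation).
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Shared self-attention: learns cross-modal alignment and deep fusion.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward "expert" per modality: vision, language, vision-language.
        self.ffn = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for name in ("vision", "language", "vl")
        })

    def forward(self, x, modality="vision"):
        # Pre-norm residual attention, shared by all modalities.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Route tokens to the feed-forward expert of the given modality.
        x = x + self.ffn[modality](self.norm2(x))
        return x

# Example: encode a batch of 196 image-patch tokens.
block = MultiwayBlock()
patches = torch.randn(2, 196, 768)
out = block(patches, modality="vision")
```
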

In the BEiT-3 pretraining process, the team leverages a unified masked data modeling objective on monomodal and multimodal data. They mask text tokens or image patches and train the model to predict the masked tokens. For multimodal data, they use 15M images and 21M image-text pairs collected from various public datasets. The monomodal data comprises 14M images from ImageNet-21K and a 160GB text corpus.
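
The sketch below illustrates this masked-prediction objective in Python: a fraction of discrete tokens (text tokens or visual tokens) is replaced with a mask token, and the loss is computed only at the masked positions. The function, toy model, and masking ratio are illustrative assumptions rather than the paper's exact setup.

```python
# Hypothetical illustration of a unified masked data modeling loss (not the BEiT-3 code).
import torch
import torch.nn.functional as F

def masked_prediction_loss(model, token_ids, mask_token_id, mask_ratio=0.15):
    # token_ids: (batch, seq_len) discrete ids for text tokens or visual tokens.
    mask = torch.rand(token_ids.shape) < mask_ratio
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id          # replace sampled positions with the mask token
    logits = model(corrupted)                # (batch, seq_len, vocab_size)
    # Cross-entropy over the masked positions only.
    return F.cross_entropy(logits[mask], token_ids[mask])

# Usage with a toy "model" that embeds tokens and projects back to the vocabulary.
vocab_size, mask_token_id = 8192, 0
toy_model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Linear(64, vocab_size),
)
ids = torch.randint(1, vocab_size, (2, 32))
loss = masked_prediction_loss(toy_model, ids, mask_token_id)
```
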


In their empirical study, the team applied BEiT-3 to major public benchmarks such as Visual Question Answering (VQA), Visual Reasoning, Image Captioning, and Semantic Segmentation. In the evaluations, BEiT-3 achieved state-of-the-art performance on all vision and vision-language benchmarks: object detection on COCO, semantic segmentation on ADE20K, image classification on ImageNet, visual question answering on VQAv2, and image captioning on COCO.
Overall, the proposed BEiT-3 advances the building of multimodal foundation models and also opens a new and promising direction for efficiently scaling up such models.
The paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
