
Microsoft’s BEiT-3 Foundation Model: A ‘Big Convergence of Language, Vision, and Multimodal Pretraining’ That Achieves SOTA Results on Popular Benchmarks

In recent years the machine learning research community has turned its attention to the convergence of language, vision, and multimodal pretraining, aiming to develop general-purpose foundation models that can handle multiple modalities and be easily adapted to diverse downstream tasks.

In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3 (BERT Pretraining of Image Transformers), a general-purpose state-of-the-art multimodal foundation model for both vision and vision-language tasks that advances the big convergence of backbone architectures, pretraining tasks, and model scaling.

BEiT-3 performs masked “language” modeling on images, texts, and image-text pairs in a unified manner. It is pretrained on massive monomodal and multimodal data via a novel shared Multiway Transformer network, which the team uses as the backbone for encoding different modalities. Each Multiway Transformer block employs a shared self-attention module, which learns to align the modalities and provides deep fusion for multimodal tasks, together with a pool of feed-forward networks dedicated to the different modalities.
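
To make the architecture concrete, below is a minimal PyTorch-style sketch of a Multiway Transformer block under the description above: one self-attention module shared by all modalities and a pool of modality-specific feed-forward "experts". The class name, dimensions, and routing-by-modality interface are illustrative assumptions, not the released BEiT-3 code.

```python
# Minimal sketch of a Multiway Transformer block: shared self-attention plus
# per-modality feed-forward networks. Names and sizes are illustrative.
import torch
import torch.nn as nn


class MultiwayTransformerBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4,
                 modalities=("vision", "language", "vl")):
        super().__init__()
        # Self-attention is shared across modalities and learns to align them.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One feed-forward network ("modality expert") per modality.
        self.norm2 = nn.LayerNorm(dim)
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for m in modalities
        })

    def forward(self, x, modality):
        # x: (batch, seq_len, dim); `modality` selects which expert FFN is used.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.ffns[modality](self.norm2(x))
        return x


# Example: image patches and text tokens pass through the same block,
# sharing attention but using different feed-forward experts.
block = MultiwayTransformerBlock()
image_tokens = torch.randn(2, 196, 768)   # e.g. 14x14 patch embeddings
text_tokens = torch.randn(2, 32, 768)     # e.g. subword embeddings
img_out = block(image_tokens, modality="vision")
txt_out = block(text_tokens, modality="language")
```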

In the BEiT-3 pretraining process, the team leverages a unified masked data modeling objective on monomodal and multimodal data. They mask text tokens or image patches and train the model to predict the masked tokens. For multimodal data, they use 15M images and 21M image-text pairs collected from various public datasets. The monomodal data comprises 14M images from ImageNet-21K and a 160GB text corpus.
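
As a rough illustration of this objective, the sketch below masks a fraction of a discrete token sequence (subword tokens for text, visual tokens for images) and computes cross-entropy only on the masked positions. The mask ratio, vocabulary size, and toy encoder are assumptions for illustration only; in BEiT-3 the encoder would be the Multiway Transformer sketched earlier.

```python
# Minimal sketch of a unified masked data modeling loss: corrupt a fraction of
# the tokens with a [MASK] id and predict the originals at those positions.
import torch
import torch.nn.functional as F


def masked_data_modeling_loss(token_ids, encoder, mask_token_id, mask_ratio=0.15):
    # token_ids: (batch, seq_len) discrete tokens (subwords for text,
    # visual tokens for images); encoder maps token ids to per-position logits.
    mask = torch.rand(token_ids.shape) < mask_ratio            # positions to predict
    corrupted = token_ids.masked_fill(mask, mask_token_id)     # replace with [MASK]
    logits = encoder(corrupted)                                # (batch, seq_len, vocab)
    return F.cross_entropy(
        logits[mask],        # predictions at masked positions only
        token_ids[mask],     # original tokens as targets
    )


# Tiny stand-in encoder for demonstration purposes only.
encoder = torch.nn.Sequential(
    torch.nn.Embedding(8192, 64),
    torch.nn.Linear(64, 8192),
)
tokens = torch.randint(0, 8190, (2, 32))
loss = masked_data_modeling_loss(tokens, encoder, mask_token_id=8191)
```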

In their empirical study, the team applied BEiT-3 to major public benchmarks spanning vision and vision-language tasks. BEiT-3 achieved state-of-the-art performance across the evaluated benchmarks: object detection on COCO, semantic segmentation on ADE20K, image classification on ImageNet, visual question answering on VQAv2, and image captioning on COCO.

Overall, the proposed BEiT-3 advances the development of multimodal foundation models and opens a new and promising direction for scaling up such models efficiently.

The paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
