
Microsoft’s BEiT-3 Foundation Model: A ‘Big Convergence of Language, Vision, and Multimodal Pretraining’ That Achieves SOTA Results on Popular Benchmarks

In recent years the machine learning research community has turned its attention to the convergence of language, vision, and multimodal pretraining, aiming to develop general-purpose foundation models that can handle multiple modalities and be easily adapted to diverse downstream tasks.

In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3 (BERT Pretraining of Image Transformers), a general-purpose state-of-the-art multimodal foundation model for both vision and vision-language tasks that advances the big convergence of backbone architectures, pretraining tasks, and model scaling.

BEiT-3 performs masked “language” modeling on images, texts, and image-text pairs in a unified manner. It is pretrained on massive monomodal and multimodal data via a novel shared Multiway Transformer network, which the team uses as the backbone for encoding different modalities. Each Multiway Transformer block employs a shared self-attention module, which learns to align the modalities and provides deep fusion for multimodal tasks, together with a pool of feed-forward networks dedicated to the different modalities.
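
To make the architecture concrete, below is a minimal PyTorch-style sketch of a Multiway Transformer block under the description above: one self-attention module shared by all modalities and a pool of modality-specific feed-forward "experts". The class name, dimensions, and routing-by-modality interface are illustrative assumptions, not the released BEiT-3 code.

```python
# Minimal sketch of a Multiway Transformer block: shared self-attention plus
# per-modality feed-forward networks. Names and sizes are illustrative.
import torch
import torch.nn as nn


class MultiwayTransformerBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4,
                 modalities=("vision", "language", "vl")):
        super().__init__()
        # Self-attention is shared across modalities and learns to align them.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One feed-forward network ("modality expert") per modality.
        self.norm2 = nn.LayerNorm(dim)
        self.ffns = nn.ModuleDict({
            m: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for m in modalities
        })

    def forward(self, x, modality):
        # x: (batch, seq_len, dim); `modality` selects which expert FFN is used.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.ffns[modality](self.norm2(x))
        return x


# Example: image patches and text tokens pass through the same block,
# sharing attention but using different feed-forward experts.
block = MultiwayTransformerBlock()
image_tokens = torch.randn(2, 196, 768)   # e.g. 14x14 patch embeddings
text_tokens = torch.randn(2, 32, 768)     # e.g. subword embeddings
img_out = block(image_tokens, modality="vision")
txt_out = block(text_tokens, modality="language")
```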

In the BEiT-3 pretraining process, the team leverages a unified masked data modeling objective on monomodal and multimodal data. They mask text tokens or image patches and train the model to predict the masked tokens. For multimodal data, they use 15M images and 21M image-text pairs collected from various public datasets. The monomodal data comprises 14M images from ImageNet-21K and a 160GB text corpus.
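
As a rough illustration of this objective, the sketch below masks a fraction of a discrete token sequence (subword tokens for text, visual tokens for images) and computes cross-entropy only on the masked positions. The mask ratio, vocabulary size, and toy encoder are assumptions for illustration only; in BEiT-3 the encoder would be the Multiway Transformer sketched earlier.

```python
# Minimal sketch of a unified masked data modeling loss: corrupt a fraction of
# the tokens with a [MASK] id and predict the originals at those positions.
import torch
import torch.nn.functional as F


def masked_data_modeling_loss(token_ids, encoder, mask_token_id, mask_ratio=0.15):
    # token_ids: (batch, seq_len) discrete tokens (subwords for text,
    # visual tokens for images); encoder maps token ids to per-position logits.
    mask = torch.rand(token_ids.shape) < mask_ratio            # positions to predict
    corrupted = token_ids.masked_fill(mask, mask_token_id)     # replace with [MASK]
    logits = encoder(corrupted)                                # (batch, seq_len, vocab)
    return F.cross_entropy(
        logits[mask],        # predictions at masked positions only
        token_ids[mask],     # original tokens as targets
    )


# Tiny stand-in encoder for demonstration purposes only.
encoder = torch.nn.Sequential(
    torch.nn.Embedding(8192, 64),
    torch.nn.Linear(64, 8192),
)
tokens = torch.randint(0, 8190, (2, 32))
loss = masked_data_modeling_loss(tokens, encoder, mask_token_id=8191)
```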

In their empirical study, the team applied BEiT-3 to major public benchmarks spanning vision and vision-language tasks. BEiT-3 achieved state-of-the-art performance across the evaluated benchmarks: object detection on COCO, semantic segmentation on ADE20K, image classification on ImageNet, visual question answering on VQAv2, and image captioning on COCO.

Overall, the proposed BEiT-3 advances the development of multimodal foundation models and opens a new and promising direction for scaling up such models efficiently.

The paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
