Microsoft’s BEiT-3 Foundation Model: A ‘Big Convergence of Language, Vision, and Multimodal Pretraining’ That Achieves SOTA Results on Popular Benchmarks

In recent years, the machine learning research community has turned its attention to the convergence of language, vision, and multimodal pretraining, aiming to develop general-purpose foundation models that can handle multiple modalities and be easily adapted to diverse downstream tasks.

In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3 (BERT Pretraining of Image Transformers), a general-purpose multimodal foundation model that achieves state-of-the-art results on both vision and vision-language tasks and advances the big convergence of backbone architectures, pretraining tasks, and model scaling.

BEiT-3 performs masked “language” modeling on images, texts, and image-text pairs in a unified manner. It is pretrained on massive monomodal and multimodal data using a shared Multiway Transformer network as the backbone for encoding the different modalities. Each Multiway Transformer block pairs a shared self-attention module, which learns to align modalities and provides deep fusion for multimodal tasks, with a pool of modality-specific feed-forward networks, as sketched below.
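
As a concrete illustration, here is a minimal PyTorch sketch of such a block: one self-attention module shared by all tokens, plus a pool of per-modality feed-forward “experts”. The class name, dimensions, and routing scheme are illustrative assumptions, not the team’s reference implementation.

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Sketch of a Multiway Transformer block (hypothetical names/dims)."""

    def __init__(self, dim=768, num_heads=12, num_modalities=3):
        super().__init__()
        # One self-attention module shared across modalities: it attends over
        # the full (possibly mixed) token sequence, which is what lets it
        # align modalities and fuse them for multimodal tasks.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # A pool of feed-forward networks, one per modality
        # (e.g. 0: vision, 1: language, 2: vision-language fusion).
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_modalities)
        ])

    def forward(self, x, modality_ids):
        # x: (batch, seq_len, dim); modality_ids: (batch, seq_len) ints
        # selecting which feed-forward expert handles each token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # shared attention over all tokens
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, ffn in enumerate(self.ffns):
            mask = modality_ids == m      # route tokens to their modality FFN
            if mask.any():
                out[mask] = ffn(h[mask])
        return x + out
```

Routing each token to its own expert while attending jointly over the whole sequence gives the block modality-specific capacity on top of shared, cross-modal alignment.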

In the BEiT-3 pretraining process, the team leverages a unified masked data modeling objective on monomodal and multimodal data: they mask text tokens or image patches and train the model to predict the masked tokens. The multimodal data comprises 15M images and 21M image-text pairs collected from various public datasets; the monomodal data comprises 14M images from ImageNet-21K and a 160GB text corpus.
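
The unified objective itself is conceptually simple. Below is a minimal sketch, assuming discrete targets for both modalities (text tokens from a text vocabulary, image patches mapped to visual tokens by an image tokenizer, as in the earlier BEiT work); the function name, model interface, and mask ratio are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_modeling_loss(model, tokens, mask_token_id, mask_ratio=0.15):
    # tokens: (batch, seq_len) discrete ids, either text tokens or
    # visual tokens produced by an image tokenizer.
    inputs = tokens.clone()
    # Randomly choose positions to mask and replace them with [MASK].
    masked = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    inputs[masked] = mask_token_id
    logits = model(inputs)  # (batch, seq_len, vocab_size)
    # Cross-entropy only on the masked positions: the model must recover
    # each original token from its (possibly multimodal) context.
    return F.cross_entropy(logits[masked], tokens[masked])
```

During pretraining, the same loss would be applied to text-only, image-only, and concatenated image-text sequences, which is what makes the objective unified across modalities.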

In their empirical study, the team applied BEiT-3 to major public benchmarks such as Visual Question Answering (VQA), Visual Reasoning, Image Captioning, and Semantic Segmentation. BEiT-3 achieved state-of-the-art performance across the evaluated vision and vision-language benchmarks: object detection on COCO, semantic segmentation on ADE20K, image classification on ImageNet, visual question answering on VQAv2, and image captioning on COCO.

Overall, the proposed BEiT-3 advances the development of multimodal foundation models and opens a promising direction for efficiently scaling them up.

The paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


