In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3, a general-purpose state-of-the-art multimodal foundation model for both vision and vision-language tasks that advances the big convergence of backbone architectures, pretraining tasks, and model scaling.
A research team from Sun Yat-sen University and UBTECH proposes a unified approach for justifying, analyzing, and improving foundation models in the new paper Big Learning: A Universal Machine Learning Paradigm? The team’s big learning framework can model many-to-all joint/conditional/marginal data distributions and delivers extraordinary data and task flexibilities.
A Facebook AI Research team presents FLAVA, a foundational language and vision alignment model that explicitly targets language, vision, and their multimodal combination all at once, achieving impressive performance on 35 tasks across the vision, language, and multimodal domains.
In the paper A New Foundation Model for Computer Vision, a Microsoft research team proposes Florence, a novel foundation model for computer vision that significantly outperforms previous large-scale pretraining approaches and achieves new SOTA results across a wide range of visual and visual-linguistic benchmarks.