In the new paper I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data, an Allen Institute for Artificial Intelligence team proposes Cross Modal Transfer On Semantic Embeddings (CLOSE), an approach that learns high-level skills from textual data, then uses these skills to complete vision tasks without additional visual training data.
In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3, a general-purpose state-of-the-art multimodal foundation model for both vision and vision-language tasks that advances the big convergence of backbone architectures, pretraining tasks, and model scaling.
In the new paper Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks, a research team from the Allen Institute for AI and the University of Washington introduces UNIFIED-IO, a neural model that achieves strong performance across a wide variety of vision, language, and multi-modal tasks without task- or modality-specific branches or fine-tuning.
In the new paper i-Code: An Integrative and Composable Multimodal Learning Framework, a Microsoft Azure Cognitive Services Research team presents i-Code, a self-supervised pretraining framework that enables the flexible integration of vision, speech, and language modalities and learns their vector representations in a unified manner.
In the new paper Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, Google researchers argue that the diversity of different foundation models is symbiotic and that it is possible to build a framework that uses structured Socratic dialogue between pre-existing foundation models to formulate new multimodal tasks as a guided exchange between the models without additional finetuning.
The Swiss Federal Institute of Technology Lausanne (EPFL) presents Multi-modal Multi-task Masked Autoencoders (MultiMAE), a simple and effective pretraining strategy that enables masked autoencoding to include multiple modalities and tasks and is applicable to any RGB dataset.
DeepMind researchers propose Hierarchical Perceiver (HiP), a model that retains the original Perceiver’s ability to process arbitrary modalities but is faster, can scale up to even more inputs/outputs, reduces the need for input engineering, and improves both efficiency and accuracy on classical computer vision benchmarks.
Baidu researchers propose ERNIE-ViLG, a 10-billion parameter scale pretraining framework for bidirectional text-image generation. Pretrained on 145 million (Chinese) image-text pairs, ERNIE-ViLG achieves state-of-the-art performance on both text-to-image and image-to-text generation tasks.
A Facebook AI Research team presents FLAVA, a foundational language and vision alignment model that explicitly targets language, vision, and their multimodal combination all at once, achieving impressive performance on 35 tasks across the vision, language, and multimodal domains.
A research team from Google Research, University of Cambridge and Alan Turing Institute proposes PolyViT, a single transformer model capable of processing multiple modalities and datasets. PolyViT is parameter-efficient and learns representations that generalize across multiple domains.