Tag: Multimodal Learning

AI Machine Learning & Data Science Natural Language Tech Popular Research

Toward AGI: Microsoft’s KOSMOS-1 MLLM Can Perceive General Modalities, Follow Instructions, and Perform In-Context Learning

In the new paper Language Is Not All You Need: Aligning Perception with Language Models, a Microsoft research team presents KOSMOS-1, a multimodal large language model (MLLM) that can perceive general modalities, learn in context, and follow instructions.

AI Machine Learning & Data Science Research

No Images Are Needed! Allen AI’s CLOSE Learns to Complete Visual Tasks From Text Inputs Alone

In the new paper I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data, an Allen Institute for Artificial Intelligence team proposes Cross Modal Transfer On Semantic Embeddings (CLOSE), an approach that learns high-level skills from textual data, then uses these skills to complete vision tasks without additional visual training data.
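The core idea of CLOSE is cross-modal transfer through a shared text-image embedding space. Below is a minimal, hypothetical sketch of that idea, assuming CLIP as the shared embedding space and toy caption-label pairs as the text-only supervision; it is illustrative, not the authors' code.

```python
# Hypothetical sketch of cross-modal transfer in the spirit of CLOSE:
# train a task head on CLIP *text* embeddings only, then apply it to
# CLIP *image* embeddings at test time. Data and model choice are assumptions.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Toy text-only supervision: captions paired with class labels (assumed data).
captions = ["a photo of a cat sleeping", "a dog running in a park"]
labels = torch.tensor([0, 1], device=device)

with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(captions).to(device)).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Task head trained purely on text embeddings.
head = nn.Linear(text_feats.shape[-1], 2).to(device)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(head(text_feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()

# At inference, the same head is applied to *image* embeddings, relying on
# CLIP's (approximately) shared text-image embedding space:
# image = preprocess(PIL.Image.open("photo.jpg")).unsqueeze(0).to(device)
# with torch.no_grad():
#     img_feat = model.encode_image(image).float()
#     img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
#     pred = head(img_feat).argmax(-1)
```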

AI Computer Vision & Graphics Machine Learning & Data Science Popular Research

Microsoft’s BEiT-3 Foundation Model: A ‘Big Convergence of Language, Vision, and Multimodal Pretraining’ That Achieves SOTA Results on Popular Benchmarks

In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3, a general-purpose state-of-the-art multimodal foundation model for both vision and vision-language tasks that advances the big convergence of backbone architectures, pretraining tasks, and model scaling.

AI Machine Learning & Data Science Research

Allen AI & UW Propose Unified-IO: A High-Performance, Task-Agnostic Model for CV, NLP, and Multi-Modal Tasks

In the new paper Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks, a research team from the Allen Institute for AI and the University of Washington introduces UNIFIED-IO, a neural model that achieves strong performance across a wide variety of vision, language, and multi-modal tasks without task- or modality-specific branches or fine-tuning.

AI Computer Vision & Graphics Machine Learning & Data Science Research

Microsoft Azure Introduces i-Code: A General Framework That Enables Flexible Multimodal Representation Learning

In the new paper i-Code: An Integrative and Composable Multimodal Learning Framework, a Microsoft Azure Cognitive Services Research team presents i-Code, a self-supervised pretraining framework that enables the flexible integration of vision, speech, and language modalities and learns their vector representations in a unified manner.

AI Machine Learning & Data Science Research

Google Builds Language Models with Socratic Dialogue to Improve Zero-Shot Multimodal Reasoning Capabilities

In the new paper Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, Google researchers argue that the diversity among foundation models is symbiotic, and that a framework can leverage structured Socratic dialogue, a guided language-based exchange between pre-existing foundation models, to perform new multimodal tasks without additional finetuning.
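A minimal sketch of that compositional idea follows, assuming a BLIP captioning checkpoint and GPT-2 via Hugging Face pipelines; the specific models, prompt wording, and the file scene.jpg are illustrative assumptions, not the paper's setup.

```python
# Sketch of the Socratic Models idea: compose pre-trained models through
# language alone, with no finetuning. Checkpoints and prompt are assumptions.
from transformers import pipeline

# A vision-language model verbalizes the image...
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("scene.jpg")[0]["generated_text"]

# ...and a language model reasons over that verbal description.
llm = pipeline("text-generation", model="gpt2")
prompt = (
    f"Image description: {caption}\n"
    "Question: What activity is most likely happening here?\n"
    "Answer:"
)
print(llm(prompt, max_new_tokens=30)[0]["generated_text"])
```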

AI Machine Learning & Data Science Research

EPFL’s Multi-modal Multi-task Masked Autoencoder: A Simple, Flexible and Effective ViT Pretraining Strategy Applicable to Any RGB Dataset

A research team from the Swiss Federal Institute of Technology Lausanne (EPFL) presents Multi-modal Multi-task Masked Autoencoders (MultiMAE), a simple and effective pretraining strategy that extends masked autoencoding to multiple input modalities and output tasks and is applicable to any RGB dataset.
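The sketch below illustrates the general flavour of multi-modal masked autoencoding under stated assumptions (random tensors standing in for RGB and pseudo-labeled depth, toy patch embeddings, a tiny shared Transformer, and no decoders); it is not the MultiMAE implementation.

```python
# Toy sketch of multi-modal masked autoencoding: patchify several input
# modalities, mask most tokens, and encode only the visible ones with a
# shared Transformer. Per-modality decoders (omitted) would reconstruct
# the masked patches. All sizes and data here are assumptions.
import torch
import torch.nn as nn

patch, dim, mask_ratio = 16, 256, 0.75
rgb = torch.randn(2, 3, 224, 224)      # batch of RGB images
depth = torch.randn(2, 1, 224, 224)    # stand-in for pseudo-labeled depth

def patchify(x):
    # (B, C, H, W) -> (B, N, patch*patch*C) token sequence
    return nn.functional.unfold(x, patch, stride=patch).transpose(1, 2)

embed_rgb = nn.Linear(patch * patch * 3, dim)
embed_depth = nn.Linear(patch * patch * 1, dim)
tokens = torch.cat([embed_rgb(patchify(rgb)),
                    embed_depth(patchify(depth))], dim=1)  # (B, N_total, dim)

# Randomly keep a subset of tokens across *all* modalities.
B, N, _ = tokens.shape
keep = int(N * (1 - mask_ratio))
idx = torch.rand(B, N).argsort(dim=1)[:, :keep]
visible = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, dim))

# Shared encoder over the visible tokens only.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
latent = encoder(visible)
print(latent.shape)  # torch.Size([2, 98, 256])
```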

AI Computer Vision & Graphics Machine Learning & Data Science Research

DeepMind’s Upgraded Hierarchical Perceiver Is Faster, Scales to Larger Data Without Preprocessing, and Delivers Higher Resolution and Accuracy

DeepMind researchers propose Hierarchical Perceiver (HiP), a model that retains the original Perceiver’s ability to process arbitrary modalities but is faster, can scale up to even more inputs/outputs, reduces the need for input engineering, and improves both efficiency and accuracy on classical computer vision benchmarks.

AI Machine Learning & Data Science Research

Baidu’s 10-Billion Scale ERNIE-ViLG Unified Generative Pretraining Framework Achieves SOTA Performance on Bidirectional Vision-Language Generation Tasks

Baidu researchers propose ERNIE-ViLG, a 10-billion parameter scale pretraining framework for bidirectional text-image generation. Pretrained on 145 million Chinese image-text pairs, ERNIE-ViLG achieves state-of-the-art performance on both text-to-image and image-to-text generation tasks.

AI Machine Learning & Data Science Popular Research

Google, Cambridge U & Alan Turing Institute Propose PolyViT: A Universal Transformer for Image, Video, and Audio Classification

A research team from Google Research, the University of Cambridge and the Alan Turing Institute proposes PolyViT, a single transformer model capable of processing multiple modalities and datasets. PolyViT is parameter-efficient and learns representations that generalize across multiple domains.
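The parameter-sharing idea behind co-training a single transformer on several modalities can be sketched as follows. The sketch is a rough assumption-laden illustration (toy dimensions, mean pooling, made-up tokenizer shapes), not the authors' architecture.

```python
# Sketch of one shared Transformer trunk co-trained on several modalities,
# with lightweight modality-specific tokenizers and per-task heads.
# All module names, shapes and pooling choices are assumptions.
import torch
import torch.nn as nn

dim = 256
trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=4,
)

# Modality-specific tokenizers and task-specific classification heads.
tokenizers = nn.ModuleDict({
    "image": nn.Linear(16 * 16 * 3, dim),   # flattened image patches
    "audio": nn.Linear(128, dim),           # spectrogram frames
})
heads = nn.ModuleDict({
    "image_cls": nn.Linear(dim, 1000),
    "audio_cls": nn.Linear(dim, 527),
})

def forward(modality, task, tokens):
    z = trunk(tokenizers[modality](tokens))  # shared parameters for all tasks
    return heads[task](z.mean(dim=1))        # pooled representation -> task head

# Co-training alternates (or mixes) batches from the different tasks,
# updating the shared trunk on every step.
logits = forward("image", "image_cls", torch.randn(2, 196, 16 * 16 * 3))
print(logits.shape)  # torch.Size([2, 1000])
```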