Multimodal Learning

by Synced 2024-12-26 21

DeepMind’s JetFormer: Unified Multimodal Models Without Modelling Constraints

A DeepMind research team introduces JetFormer, a Transformer designed to directly model raw data. This model maximizes the likelihood of raw data without depending on any pre-trained components, and is capable of both understanding and generating text and images seamlessly.

by Synced 2024-08-06 9

AI Machine Learning & Data Science Research

Llama 3: Meta AI’s Multilingual and Multimodal Marvel

In a new paper The Llama 3 Herd of Models, a Meta AI research team presents Llama 3, a new set of foundation models for language that delivers performance comparable to state-of-the-art language models such as GPT-4 on a plethora of tasks.

by Synced 2024-07-26 4

AI Machine Learning & Data Science Research

From Images to Insights: DeepMind’s Versatile Vision-Language Model PaliGemma Achieves SOTA Results

A DeepMind research team releases PaliGemma, a robust and versatile vision language model with 3 billion parameters. PaliGemma excels in transfer learning across various vision and language tasks, achieving state-of-the-art performance in a multitude of open-world applications.

by Synced 2024-06-21 3

AI Machine Learning & Data Science Research

Contrastive Learning Advances Sleep Science: Superior Multi-Modal Model Enhances Disorder Detection

In a new paper SleepFM: Multi-modal Representation Learning for Sleep Across Brain Activity, ECG and Respiratory Signals, a research team introduces SleepFM, the first attempt at developing a multi-modal contrastive learning (CL) approach for polysomnography (PSG) analysis. SleepFM outperforms baselines in tasks like demographic attribute prediction and sleep stage classification.

by Synced 2024-06-17 4

AI Machine Learning & Data Science Research

AI Pioneers Gather at BAAI 2024: Unveiling Innovations in Large-Scaled AI Models for Language, Multimodal, Embodied, Bio-Computing, and FlagOpen 2.0

“Global Vision, Ideas in Collision, Leading Cutting-Edge Innovations” – The 6th annual BAAI Conference successfully concluded on June 15. Over 200 AI scholars and industry leaders gathered to discuss the trajectories and applications of advanced AI technologies.

by Synced 2023-12-06 4

AI Machine Learning & Data Science Research

Tencent & Sydney U’s GPT4Video: A Unified Multimodal Large Language Model That Significantly Elevates LMs’ Video Generative Capabilities

A collaborative effort between Tencent AI Lab and The University of Sydney introduces GPT4Video, a unified multi-model framework that gives Large Language Models (LLMs) the ability to both understand and generate video.

by Synced 2023-10-04 4

AI Machine Learning & Data Science Research

Microsoft Unveils the Potential of Large Multimodal Models with GPT-4V(ision)

A Microsoft research team conducts an in-depth analysis of the latest model, GPT-4V(ision). Their report delves into the emerging application scenarios and outlines future research directions for GPT-4V-based systems, with the goal of inspiring research on next-generation multimodal task formulation and the development of more robust LLMs.

by Synced 2023-07-11 11

AI Machine Learning & Data Science Research

Google & CMU’s Semantic Pyramid AutoEncoder Marks the First Successful Attempt for Multimodal Generation with Frozen LLMs

In a new paper SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs, a research team from Google Research and Carnegie Mellon University introduces Semantic Pyramid AutoEncoder (SPAE), the first successful method for enabling frozen LLMs to solve cross-modal tasks.

by Synced 2023-06-06 2

AI Machine Learning & Data Science Research

Microsoft’s LLaVA-Med Trains a Large Language-and-Vision Assistant for Biomedicine Within 15 Hours

In a new paper LLaVA-Med: Training a Large Language-and-Vision Assistant, a Microsoft research team proposes a Large Language and Vision Assistant for BioMedicine (LLaVA-Med), which can be trained in less than 15 hours and demonstrates strong multimodal conversational capability, helping answer inquiries about biomedical images.

by Synced 2023-05-24 1

AI Machine Learning & Data Science Research

Alibaba & HUST’s ONE-PEACE: Toward a General Representation Model For Unlimited Modalities

In the new paper ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities, a research team from Alibaba Group’s DAMO Academy and the Huazhong University of Science and Technology releases ONE-PEACE, a highly extensible model that can align and integrate representations across vision, audio, and language modalities, opening a path toward the creation of a general representation model for unlimited modalities.

by Synced 2023-03-07 23

AI Machine Learning & Data Science Natural Language Tech Popular Research

Toward AGI: Microsoft’s KOSMOS-1 MLLM Can Perceive General Modalities, Follow Instructions, and Perform In-Context Learning

In the new paper Language Is Not All You Need: Aligning Perception with Language Models, a Microsoft research team presents KOSMOS-1, a multimodal large language model (MLLM) that can perceive general modalities, learn in context, and follow instructions.

by Synced 2022-12-22 1

AI Machine Learning & Data Science Research

Google’s Mu2SLAM: Toward a Single Model For All Speech and Text Understanding Tasks

In the new paper Mu2SLAM: Multitask, Multilingual Speech and Language Models, a Google Research team presents Mu2SLAM, a multilingual sequence-to-sequence pretraining method for speech and text models that covers arbitrary tasks in over 100 languages.

by Synced 2022-11-29 0

AI Machine Learning & Data Science Research

No Images Are Needed! Allen AI’s CLOSE Learns to Complete Visual Tasks From Text Inputs Alone

In the new paper I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data, an Allen Institute for Artificial Intelligence team proposes Cross Modal Transfer On Semantic Embeddings (CLOSE), an approach that learns high-level skills from textual data, then uses these skills to complete vision tasks without additional visual training data.

by Synced 2022-08-30 16

AI Computer Vision & Graphics Machine Learning & Data Science Popular Research

Microsoft’s BEiT-3 Foundation Model: A ‘Big Convergence of Language, Vision, and Multimodal Pretraining’ That Achieves SOTA Results on Popular Benchmarks

In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3, a general-purpose state-of-the-art multimodal foundation model for both vision and vision-language tasks that advances the big convergence of backbone architectures, pretraining tasks, and model scaling.

by Synced 2022-06-24 12

AI Machine Learning & Data Science Research

Allen AI & UW Propose Unified-IO: A High-Performance, Task-Agnostic Model for CV, NLP, and Multi-Modal Tasks

In the new paper Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks, a research team from the Allen Institute for AI and the University of Washington introduces UNIFIED-IO, a neural model that achieves strong performance across a wide variety of vision, language, and multi-modal tasks without task- or modality-specific branches or fine-tuning.

by Synced 2022-05-18 2

AI Machine Learning & Data Science Research

DeepMind Introduces Gato: A Generalist, Multi-Modal, Multi-Task, Multi-Embodiment Agent

A DeepMind research team proposes Gato, a single general-purpose transformer sequence model that can engage in dialogue, caption images, stack blocks with a real robot arm, navigate in simulated 3D environments and even beat human players at Atari games.

by Synced 2022-05-11 1

AI Computer Vision & Graphics Machine Learning & Data Science Research

Microsoft Azure Introduces i-Code: A General Framework That Enables Flexible Multimodal Representation Learning

In the new paper i-Code: An Integrative and Composable Multimodal Learning Framework, a Microsoft Azure Cognitive Services Research team presents i-Code, a self-supervised pretraining framework that enables the flexible integration of vision, speech, and language modalities and learns their vector representations in a unified manner.

by Synced 2022-04-12 1

AI Machine Learning & Data Science Research

Google Builds Language Models with Socratic Dialogue to Improve Zero-Shot Multimodal Reasoning Capabilities

In the new paper Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, Google researchers argue that the diversity of different foundation models is symbiotic. They propose a framework that uses structured Socratic dialogue between pre-existing foundation models to formulate new multimodal tasks as a guided exchange between the models, without additional finetuning.

by Synced 2022-04-08 1

AI Machine Learning & Data Science Research

EPFL’s Multi-modal Multi-task Masked Autoencoder: A Simple, Flexible and Effective ViT Pretraining Strategy Applicable to Any RGB Dataset

The Swiss Federal Institute of Technology Lausanne (EPFL) presents Multi-modal Multi-task Masked Autoencoders (MultiMAE), a simple and effective pretraining strategy that enables masked autoencoding to include multiple modalities and tasks and is applicable to any RGB dataset.

by Synced 2022-02-24 0

AI Computer Vision & Graphics Machine Learning & Data Science Research

DeepMind’s Upgraded Hierarchical Perceiver Is Faster, Scales to Larger Data Without Preprocessing, and Delivers Higher Resolution and Accuracy

DeepMind researchers propose Hierarchical Perceiver (HiP), a model that retains the original Perceiver’s ability to process arbitrary modalities but is faster, can scale up to even more inputs/outputs, reduces the need for input engineering, and improves both efficiency and accuracy on classical computer vision benchmarks.

by Synced 2022-02-09 1

AI Machine Learning & Data Science Research

DAMO Academy Proposes One For All, a Task- and Modality-Agnostic Framework for Multimodal and Uni-Modal Understanding and Generation

A research team from Alibaba Group’s DAMO Academy proposes OFA (One For All), a pretrained model that unifies modalities and tasks to a simple Seq2Seq learning framework and achieves SOTA results on a series of multimodal tasks.

by Synced 2022-01-24 2

AI Computer Vision & Graphics Machine Learning & Data Science Research

Meta AI’s OMNIVORE: A Modality-Agnostic Single Vision Model With Cross-Modal Generalization

A Meta AI research team presents OMNIVORE, a single vision model for various visual modalities that can perform cross-modal generalization and achieves performance on par with or better than traditional modality-specific models of the same size.

by Synced 2022-01-07 0

AI Machine Learning & Data Science Research

Baidu’s 10-Billion Scale ERNIE-ViLG Unified Generative Pretraining Framework Achieves SOTA Performance on Bidirectional Vision-Language Generation Tasks

Baidu researchers propose ERNIE-ViLG, a 10-billion parameter scale pretraining framework for bidirectional text-image generation. Pretrained on 145 million (Chinese) image-text pairs, ERNIE-ViLG achieves state-of-the-art performance on both text-to-image and image-to-text generation tasks.

by Synced 2021-12-15 2

AI Machine Learning & Data Science Research

Facebook AI’s FLAVA Foundational Model Tackles Vision, Language, and Vision & Language Tasks All at Once

A Facebook AI Research team presents FLAVA, a foundational language and vision alignment model that explicitly targets language, vision, and their multimodal combination all at once, achieving impressive performance on 35 tasks across the vision, language, and multimodal domains.

by Synced 2021-11-30 0

AI Machine Learning & Data Science Popular Research

Google, Cambridge U & Alan Turing Institute Propose PolyViT: A Universal Transformer for Image, Video, and Audio Classification

A research team from Google Research, University of Cambridge and Alan Turing Institute proposes PolyViT, a single transformer model capable of processing multiple modalities and datasets. PolyViT is parameter-efficient and learns representations that generalize across multiple domains.

by Synced 2021-04-30 2

AI Computer Vision & Graphics Machine Learning & Data Science Research

Yann LeCun Team’s Novel End-to-End Modulated Detector Captures Visual Concepts in Free-Form Text

A research team from NYU and Facebook proposes MDETR, an end-to-end modulated detector that identifies objects in images conditioned on a raw text query and is able to capture a long tail of visual concepts expressed in free-form text.