Tag: multi-modal model

AI Machine Learning & Data Science Research

The Future of Vision AI: How Apple’s AIMV2 Leverages Images and Text to Lead the Pack

An Apple research team introduces AIMV2, a family of vision encoders designed to predict both image patches and text tokens within a unified sequence. This combined objective enables the models to excel across a range of tasks, including image recognition, visual grounding, and multimodal understanding.

AI Machine Learning & Data Science Research

Contrastive Learning Advances Sleep Science: Superior Multi-Modal Model Enhances Disorder Detection

In a new paper SleepFM: Multi-modal Representation Learning for Sleep Across Brain Activity, ECG and Respiratory Signals, a research team introduces SleepFM, the first multi-modal contrastive learning (CL) approach for polysomnography (PSG) analysis, which outperforms baselines on tasks such as demographic attribute prediction and sleep stage classification.

AI Machine Learning & Data Science Research

Microsoft’s DeepSpeed-VisualChat: Breaking Boundaries in Multi-Modal Language Models

In a new paper DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention, a research team from Microsoft's DeepSpeed group presents the DeepSpeed-VisualChat framework, which extends LLMs with multi-modal capabilities and demonstrates superior scalability, up to a model size of 70 billion parameters.