
From OCR to Multi-Image Insight: Apple’s MM1.5 with Enhanced Text-Rich Image Understanding and Visual Reasoning

Multimodal Large Language Models (MLLMs) have rapidly become a focal point in AI research. Closed-source models such as GPT-4o, GPT-4V, Gemini-1.5, and Claude-3.5 showcase how advanced multimodal understanding has become.

This April, Apple introduced MM1, a suite of multimodal models with up to 30 billion parameters, setting new benchmarks in multimodal performance with features like enhanced in-context learning and multi-image reasoning. These innovations support advanced few-shot chain-of-thought prompting.

Building on MM1’s success, Apple’s new paper, MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning, introduces an improved model family aimed at enhancing capabilities in text-rich image understanding, visual grounding, and multi-image reasoning.

MM1.5 leverages a data-centric training approach, examining the effects of diverse data combinations across the model's training lifecycle. Its key enhancements are as follows.

Expanding on MM1’s pre-training and supervised fine-tuning (SFT) stages, MM1.5 introduces a continual pre-training phase built on high-quality OCR data and synthetic captions. The researchers categorize the SFT data into groups based on the model capabilities each targets, adjusting the mixing ratios to maintain a balanced skill set.
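
To make the capability-grouped mixing concrete, here is a minimal Python sketch of weighted sampling across SFT categories. The group names, mixing ratios, and placeholder examples below are hypothetical illustrations, not the actual categories or weights used for MM1.5.

```python
import random

# Hypothetical SFT capability groups and mixing ratios -- illustrative
# placeholders, not the actual categories or weights used for MM1.5.
sft_groups = {
    "general":     {"weight": 0.35, "examples": ["<general QA sample>"]},
    "text_rich":   {"weight": 0.25, "examples": ["<OCR/document sample>"]},
    "grounding":   {"weight": 0.15, "examples": ["<referring/grounding sample>"]},
    "multi_image": {"weight": 0.15, "examples": ["<multi-image sample>"]},
    "text_only":   {"weight": 0.10, "examples": ["<text-only sample>"]},
}

def sample_sft_batch(groups, batch_size, rng=random):
    """Draw a batch whose expected composition follows the group weights."""
    names = list(groups)
    weights = [groups[n]["weight"] for n in names]
    batch = []
    for _ in range(batch_size):
        group = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(groups[group]["examples"]))
    return batch

batch = sample_sft_batch(sft_groups, batch_size=8)
```

Tuning such weights lets one trade off, say, text-rich document understanding against general-domain performance without touching the model architecture, which is the spirit of the paper's data-centric approach.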

To further refine the model, especially for text-rich images, they meticulously evaluate and adjust the dataset selections for continual pre-training. For knowledge-intensive benchmarks like MMMU, they retain MM1’s image-caption and interleaved image-text datasets, update text-only data, and fine-tune the data mix, yielding a highly optimized final composition.
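
This mixture selection can be viewed as a small search over candidate source ratios, each scored on held-out benchmarks. The sketch below illustrates the idea under stated assumptions: the source names echo the datasets mentioned above, but the ratio grid and the train_and_eval stub are hypothetical stand-ins, not the paper's actual procedure.

```python
from itertools import product

# Assumed data-source names, echoing the datasets mentioned in the text.
SOURCES = ["image_caption", "interleaved", "text_only", "ocr"]

def train_and_eval(mix):
    """Hypothetical stub: continually pre-train with the given source
    ratios, then score on held-out benchmarks such as MMMU."""
    return 0.0  # placeholder metric

best_mix, best_score = None, float("-inf")
for ratios in product([0.1, 0.3, 0.5], repeat=len(SOURCES)):
    total = sum(ratios)
    mix = {s: r / total for s, r in zip(SOURCES, ratios)}  # normalize to sum to 1
    score = train_and_eval(mix)
    if score > best_score:
        best_mix, best_score = mix, score
```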

Empirical findings indicate that MM1.5 improves significantly upon MM1, excelling across a broad spectrum of multimodal tasks: from general-domain to text-rich image interpretation, from coarse- to fine-grained understanding, and from single- to multi-image reasoning.

The paper MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning is on arXiv.


Author: Hecate He | Editor: Chain Zhang
