
From OCR to Multi-Image Insight: Apple’s MM1.5 with Enhanced Text-Rich Image Understanding and Visual Reasoning

Multimodal Large Language Models (MLLMs) have rapidly become a focal point in AI research. Closed-source models such as GPT-4o, GPT-4V, Gemini-1.5, and Claude-3.5 showcase how advanced multimodal understanding has become.

This April, Apple introduced MM1, a suite of multimodal models with up to 30 billion parameters, setting new benchmarks in multimodal performance with features like enhanced in-context learning and multi-image reasoning. These innovations support advanced few-shot chain-of-thought prompting.

Building on MM1’s success, Apple’s new paper, MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning, introduces an improved model family aimed at enhancing capabilities in text-rich image understanding, visual grounding, and multi-image reasoning.

MM1.5 leverages a data-centric training approach, examining the effects of diverse data combinations across the model's training lifecycle. Its key enhancements are as follows.

Expanding on MM1’s pre-training and supervised fine-tuning (SFT) stages, MM1.5 introduces a continual pre-training phase built on high-quality OCR data and synthetic captions. The researchers categorize the SFT data into groups based on the model capabilities each targets, adjusting the mixing ratios to maintain a balanced skill set.
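
To make the capability-grouped mixing concrete, here is a minimal Python sketch of weighted sampling across SFT categories. The group names, mixing ratios, and placeholder examples below are hypothetical illustrations, not the actual categories or weights used for MM1.5.

```python
import random

# Hypothetical SFT capability groups and mixing ratios -- illustrative
# placeholders, not the actual categories or weights used for MM1.5.
sft_groups = {
    "general":     {"weight": 0.35, "examples": ["<general QA sample>"]},
    "text_rich":   {"weight": 0.25, "examples": ["<OCR/document sample>"]},
    "grounding":   {"weight": 0.15, "examples": ["<referring/grounding sample>"]},
    "multi_image": {"weight": 0.15, "examples": ["<multi-image sample>"]},
    "text_only":   {"weight": 0.10, "examples": ["<text-only sample>"]},
}

def sample_sft_batch(groups, batch_size, rng=random):
    """Draw a batch whose expected composition follows the group weights."""
    names = list(groups)
    weights = [groups[n]["weight"] for n in names]
    batch = []
    for _ in range(batch_size):
        group = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(groups[group]["examples"]))
    return batch

batch = sample_sft_batch(sft_groups, batch_size=8)
```

Tuning such weights lets one trade off, say, text-rich document understanding against general-domain performance without touching the model architecture, which is the spirit of the paper's data-centric approach.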

To further refine the model, especially for text-rich images, they meticulously evaluate and adjust the dataset selections for continual pre-training. For knowledge-intensive benchmarks like MMMU, they retain MM1’s image-caption and interleaved image-text datasets, update text-only data, and fine-tune the data mix, yielding a highly optimized final composition.
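
This mixture selection can be viewed as a small search over candidate source ratios, each scored on held-out benchmarks. The sketch below illustrates the idea under stated assumptions: the source names echo the datasets mentioned above, but the ratio grid and the train_and_eval stub are hypothetical stand-ins, not the paper's actual procedure.

```python
from itertools import product

# Assumed data-source names, echoing the datasets mentioned in the text.
SOURCES = ["image_caption", "interleaved", "text_only", "ocr"]

def train_and_eval(mix):
    """Hypothetical stub: continually pre-train with the given source
    ratios, then score on held-out benchmarks such as MMMU."""
    return 0.0  # placeholder metric

best_mix, best_score = None, float("-inf")
for ratios in product([0.1, 0.3, 0.5], repeat=len(SOURCES)):
    total = sum(ratios)
    mix = {s: r / total for s, r in zip(SOURCES, ratios)}  # normalize to sum to 1
    score = train_and_eval(mix)
    if score > best_score:
        best_mix, best_score = mix, score
```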

Empirical findings indicate that MM1.5 improves significantly upon MM1, excelling across a broad spectrum of multimodal tasks: from general-domain to text-rich image interpretation, from coarse- to fine-grained understanding, and from single- to multi-image reasoning.

The paper MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning is on arXiv.


Author: Hecate He | Editor: Chain Zhang
