Recent strides in large language models (LLMs) have showcased their remarkable versatility across domains and tasks. The next frontier in this field is the development of large multimodal models (LMMs), which aim to extend LLMs with multi-sensory skills to achieve even greater general intelligence. However, most existing LMMs are constrained by model and data scales, leaving a gap in our understanding of the current state and emergent abilities of LMMs built upon state-of-the-art LLMs.
In a new paper, The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), a Microsoft research team conducts an in-depth analysis of the latest such model, GPT-4V(ision). Their report delves into emerging application scenarios and outlines future research directions for GPT-4V-based systems, with the goal of inspiring research on next-generation multimodal task formulation and the development of more capable LMMs.
The study centers on qualitative results to shed light on GPT-4V’s new capabilities and potential emerging use cases, even though these capabilities may not yet be entirely reliable.
The report is structured around four key questions guiding their exploration: 1) What are GPT-4V’s supported inputs and working modes? 2) What are the quality and genericity of GPT-4V’s capabilities on different domains and tasks? 3) What are effective ways to use and prompt GPT-4V? and 4) What are promising future directions?
The contributions of this paper can be summarized as follows:
- Supported Inputs and Working Modes:
  - GPT-4V shows strong proficiency in comprehending and processing a flexible mix of input types, including images, sub-images, text, scene text, and visual pointers.
  - GPT-4V seamlessly supports test-time techniques observed in LLMs, such as instruction following, chain-of-thought prompting, and in-context few-shot learning.
- Quality and Genericity of Capabilities:
  - GPT-4V demonstrates impressive human-level capabilities across a wide range of domains, including open-world visual understanding, visual description, multimodal knowledge, commonsense reasoning, scene text understanding, document reasoning, coding, temporal reasoning, abstract reasoning, and emotion understanding.
- Effective Prompting Techniques:
  - Visual referring prompting can be seamlessly integrated with other image and text prompts in GPT-4V, creating a flexible interface for giving instructions and demonstrating examples. For example, visual referring prompts place visual pointers and scene text on the input images themselves to instruct GPT-4V (a minimal API sketch follows this list).
- Promising Future Directions:
  - The researchers explore novel use cases enabled by GPT-4V and suggest powerful future systems that can be built upon its foundation. These include multimodal plugins, multimodal chains, self-reflection, self-consistency, and retrieval-augmented LMMs, among others.
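To make the visual referring prompting idea concrete: rather than describing a region in words or passing coordinates as text, one draws the pointer (an arrow, a circle, or scene text) directly into the image pixels and then asks about it. The report demonstrates this through GPT-4V itself; the snippet below is only a rough sketch of how such a prompt could be issued through a public multimodal chat API in Python. The model name and image URL are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of visual referring prompting via a multimodal chat API.
# Assumptions (not from the paper): the OpenAI Python SDK (v1+), an API key in
# the OPENAI_API_KEY environment variable, a vision-capable model name, and a
# hypothetical image URL whose picture already has a red circle drawn on it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # substitute whichever vision model you can access
    messages=[
        {
            "role": "user",
            "content": [
                # The pointer lives in pixel space: the instruction refers to a
                # red circle drawn directly onto the image beforehand.
                {
                    "type": "text",
                    "text": "What is the object inside the red circle, and what is it used for?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo_with_red_circle.png"},  # placeholder
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```

The same message list could be extended with additional image-and-answer pairs to supply the in-context few-shot demonstrations the report also describes.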
In summary, the report offers a comprehensive analysis of GPT-4V, encompassing a broad spectrum of domains, tasks, working modes, and prompting techniques, all probed on a single model of fixed capacity. The researchers believe this organized compilation of explorations will serve as a source of inspiration for future research, sparking innovations in emerging applications, next-generation multimodal task formulation, and the development of advanced LMM-based intelligent systems.
The paper The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) is available on arXiv.
Author: Hecate He | Editor: Chain Zhang