Cutting-edge large language models (LLMs) have displayed remarkable capabilities in various text-based tasks, but there is a growing desire to extend their abilities beyond the confines of text-only processing. This expansion into multi-modal capabilities holds great promise.
Among these multi-modal endeavors, large vision and language models (LVLMs) have garnered significant attention due to their potential for enhancing visual-textual comprehension. However, existing models face limitations in handling interleaved image-and-text inputs in multi-image, multi-round dialogues, and their adaptability and scalability across diverse interaction realms are hampered by constraints related to training and data accessibility.
To address these issues, in the new paper DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention, a research team from Microsoft's DeepSpeed group presents DeepSpeed-VisualChat, a framework that extends large language models with multi-modal capabilities and demonstrates superior scalability, up to a 70-billion-parameter language model, compared to existing frameworks.
The team summarizes their main contributions as follows:
- Fully Open-Sourced Multi-round Multi-image Framework: DeepSpeed-VisualChat, one of the pioneering fully open-sourced frameworks, enables multi-round and multi-image dialogues, accommodating interleaved text-and-image inputs.
- Multi-Modal Causal Attention (MMCA): We devise a novel MMCA for multi-modal models that independently computes attention weights across various modalities.
- Data Blending for Interleaved Inputs: To facilitate conversations with interleaved modalities, DeepSpeed-VisualChat employs assorted data blending techniques on existing datasets, overcoming the shortage of interleaved text-and-image inputs in most available open-sourced datasets.
- Unprecedented Scalability: We leverage the DeepSpeed framework to amplify our training with a 2B visual encoder and a 70B language decoder from LLaMA-2, illustrating the remarkable scalability of our framework.
DeepSpeed-VisualChat is structured after MiniGPT-4: a pre-trained vision encoder encodes each image, and a linear layer aligns the encoder's output with the hidden dimension of the text embedding layer. The resulting interleaved inputs are then passed to a language model such as LLaMA-2, powered by the new Multi-Modal Causal Attention (MMCA) mechanism. Both the vision encoder and the language model are kept frozen.
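The alignment step above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the dimensions, random stand-in features, and the single projection matrix are all hypothetical, standing in for the frozen vision encoder, the trainable linear alignment layer, and the frozen text embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
vision_dim, hidden_dim = 1024, 4096
num_img_tokens, num_txt_tokens = 4, 6

# Stand-in for the frozen vision encoder's output for one image.
img_features = rng.normal(size=(num_img_tokens, vision_dim))

# Trainable linear layer aligning visual features with the
# text embedding dimension.
W = rng.normal(size=(vision_dim, hidden_dim)) * 0.01
b = np.zeros(hidden_dim)
img_tokens = img_features @ W + b  # shape: (4, 4096)

# Stand-in for the frozen language model's text embeddings.
txt_tokens = rng.normal(size=(num_txt_tokens, hidden_dim))

# Interleaved sequence fed to the language model.
sequence = np.concatenate([img_tokens, txt_tokens], axis=0)
print(sequence.shape)  # (10, 4096)
```

The key design point is that only the small projection layer introduces trainable parameters; everything on either side of it stays frozen.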
In contrast to conventional Cross Attention (CrA), which introduces new parameters and added complexity, MMCA has visual tokens attend to themselves and textual tokens attend to their previous tokens, using separate attention weight matrices for the text and image modalities.
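The attention rule can be made concrete as a boolean mask. The sketch below is one reading of the description, under the assumption that "visual tokens attend to themselves" means image tokens attend only to image tokens within the same image segment, while text tokens attend causally to all earlier tokens of either modality; the token layout is a made-up example.

```python
import numpy as np

# Hypothetical interleaved layout: True = image token, False = text token.
is_image = np.array([True, True, True, False, False, True, True, False])
n = len(is_image)

# Label each image token with an image id, so that tokens from
# different images stay in separate segments.
image_id = np.full(n, -1)
cur = -1
for i in range(n):
    if is_image[i]:
        if i == 0 or not is_image[i - 1]:
            cur += 1  # a new image segment starts here
        image_id[i] = cur

causal = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask

# Build the MMCA-style mask: mask[q, k] == True means query q may
# attend to key k.
mask = np.zeros((n, n), dtype=bool)
for q in range(n):
    for k in range(n):
        if not causal[q, k]:
            continue  # never attend to future tokens
        if is_image[q]:
            # Image queries attend only within their own image.
            mask[q, k] = bool(is_image[k]) and image_id[q] == image_id[k]
        else:
            # Text queries attend to all previous tokens, any modality.
            mask[q, k] = True
```

In a full model, attention scores for image and text keys would additionally be normalized with separate weight matrices per modality, as the paragraph above notes; this sketch covers only the masking pattern.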
Empirical results show that DeepSpeed-VisualChat outperforms contemporary models in scalability, improving adaptability across diverse interactive scenarios without incurring additional training cost or complexity. It scales up to a 70-billion-parameter language model, a significant milestone for multi-modal language models and a strong foundation for future work in this field.
The code will be released soon as part of https://github.com/microsoft/DeepSpeedExamples. The paper DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention is available on arXiv.
Author: Hecate He | Editor: Chain Zhang