Cutting-edge large language models (LLMs) have displayed remarkable capabilities in various text-based tasks, but there is a growing desire to extend their abilities beyond the confines of text-only processing. This expansion into multi-modal capabilities holds great promise.
Among these multi-modal endeavors, large vision and language models (LVLMs) have garnered significant attention due to their potential for enhancing visual-textual comprehension. However, existing models face limitations in handling interleaved image-and-text inputs in multi-image, multi-round dialogues, and their adaptability and scalability across diverse interaction realms are hampered by constraints related to training and data accessibility.
To address these issues, in the new paper DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention, a research team from Microsoft's DeepSpeed group presents DeepSpeed-VisualChat, a framework that extends large language models with multi-modal capabilities and demonstrates superior scalability, up to a 70-billion-parameter language model, compared to existing frameworks.
The team summarizes their main contributions as follows:
- Fully Open-Sourced Multi-round Multi-image Framework: DeepSpeed-VisualChat, one of the pioneering fully open-sourced frameworks, enables multi-round and multi-image dialogues, accommodating interleaved text-and-image inputs.
- Multi-Modal Causal Attention (MMCA): We devise a novel MMCA for multi-modal models that independently computes attention weights across various modalities.
- Data Blending for Interleaved Inputs: To facilitate conversations with interleaved modalities, DeepSpeed-VisualChat employs assorted data blending techniques on existing datasets, overcoming the shortage of interleaved text-and-image inputs in most available open-sourced datasets.
- Unprecedented Scalability: We leverage the DeepSpeed framework to amplify our training with a 2B visual encoder and a 70B language decoder from LLaMA-2, illustrating the remarkable scalability of our framework.
DeepSpeed-VisualChat is structured after MiniGPT-4: a pre-trained vision encoder encodes each image, and a linear layer aligns the encoder's output with the hidden dimension of the text embedding layer. The resulting interleaved inputs are then passed to a language model such as LLaMA-2, powered by the new Multi-Modal Causal Attention (MMCA) mechanism. Both the vision encoder and the language model are kept frozen.
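The alignment step above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the dimensions, random stand-in features, and the single projection matrix are all hypothetical, standing in for the frozen vision encoder, the trainable linear alignment layer, and the frozen text embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
vision_dim, hidden_dim = 1024, 4096
num_img_tokens, num_txt_tokens = 4, 6

# Stand-in for the frozen vision encoder's output for one image.
img_features = rng.normal(size=(num_img_tokens, vision_dim))

# Trainable linear layer aligning visual features with the
# text embedding dimension.
W = rng.normal(size=(vision_dim, hidden_dim)) * 0.01
b = np.zeros(hidden_dim)
img_tokens = img_features @ W + b  # shape: (4, 4096)

# Stand-in for the frozen language model's text embeddings.
txt_tokens = rng.normal(size=(num_txt_tokens, hidden_dim))

# Interleaved sequence fed to the language model.
sequence = np.concatenate([img_tokens, txt_tokens], axis=0)
print(sequence.shape)  # (10, 4096)
```

The key design point is that only the small projection layer introduces trainable parameters; everything on either side of it stays frozen.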
In contrast to conventional Cross Attention (CrA), which introduces new parameters and added complexity, MMCA has visual tokens attend to themselves and textual tokens attend to their previous tokens, using separate attention weight matrices for the text and image modalities.
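The attention rule can be made concrete as a boolean mask. The sketch below is one reading of the description, under the assumption that "visual tokens attend to themselves" means image tokens attend only to image tokens within the same image segment, while text tokens attend causally to all earlier tokens of either modality; the token layout is a made-up example.

```python
import numpy as np

# Hypothetical interleaved layout: True = image token, False = text token.
is_image = np.array([True, True, True, False, False, True, True, False])
n = len(is_image)

# Label each image token with an image id, so that tokens from
# different images stay in separate segments.
image_id = np.full(n, -1)
cur = -1
for i in range(n):
    if is_image[i]:
        if i == 0 or not is_image[i - 1]:
            cur += 1  # a new image segment starts here
        image_id[i] = cur

causal = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask

# Build the MMCA-style mask: mask[q, k] == True means query q may
# attend to key k.
mask = np.zeros((n, n), dtype=bool)
for q in range(n):
    for k in range(n):
        if not causal[q, k]:
            continue  # never attend to future tokens
        if is_image[q]:
            # Image queries attend only within their own image.
            mask[q, k] = bool(is_image[k]) and image_id[q] == image_id[k]
        else:
            # Text queries attend to all previous tokens, any modality.
            mask[q, k] = True
```

In a full model, attention scores for image and text keys would additionally be normalized with separate weight matrices per modality, as the paragraph above notes; this sketch covers only the masking pattern.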
Empirical results show that DeepSpeed-VisualChat outperforms contemporary models in scalability, improving adaptability across diverse interactive scenarios without incurring additional training cost or complexity. It scales up to a 70-billion-parameter language model, a significant milestone for multi-modal language models and a strong foundation for future work in this field.
The code will be released soon as part of https://github.com/microsoft/DeepSpeedExamples. The paper DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention is available on arXiv.
Author: Hecate He | Editor: Chain Zhang