
Microsoft’s LLaVA-Med Trains a Large Language-and-Vision Assistant for Biomedicine Within 15 Hours

In the new paper LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, a Microsoft research team proposes a Large Language and Vision Assistant for BioMedicine (LLaVA-Med), which can be trained in less than 15 hours and demonstrates strong multimodal conversational capability, aiding inquiries about biomedical images.

Conversational generative large multimodal models (LMMs) have achieved impressive performance on a wide variety of vision-language tasks. Despite their success in general domains, these LMMs typically perform worse in the biomedical field, which demands knowledge of domain-specific biomedical image-text pairs.

In an effort to bridge this gap, the new paper LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day by a Microsoft research team introduces a Large Language and Vision Assistant for BioMedicine (LLaVA-Med). This assistant can be trained in less than 15 hours and demonstrates strong multimodal conversational capability, effectively assisting with inquiries about biomedical images.

The team summarizes their main contributions as follows:

  1. Biomedical multimodal instruction-following data. We present a novel data generation pipeline to create diverse (image, instruction, output) instances, by sampling biomedical image-text pairs from PMC-15M and using GPT-4 to create instructions from the text alone.
  2. LLaVA-Med. We propose a novel curriculum learning method for adapting LLaVA to the biomedical domain using our self-generated biomedical multi-modal instruction-following dataset.
  3. Open-source. To facilitate research in biomedical multimodal learning, we will release the following assets to the public: the biomedical multimodal instruction-following dataset and the codebase for data generation and model training.

To address the lack of multimodal biomedical datasets for training an instruction-following assistant, the team first proposes a novel data generation pipeline that samples 600K image-text pairs from PMC-15M, curates diverse instruction-following data with GPT-4, and aligns the model to the generated instructions.
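The following is a minimal sketch of what such an instruction-generation step could look like: sample a PMC-15M image-caption pair and prompt GPT-4 to produce a multi-turn conversation grounded only in the caption text. The function name, prompt template, and OpenAI client usage are illustrative assumptions, not the paper's actual pipeline code.

```python
# Hypothetical sketch of the GPT-4-based instruction-data generation step.
# Names and the prompt wording are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "You are given the caption of a biomedical figure.\n"
    "Without seeing the image, write a multi-turn conversation between a user\n"
    "asking about the image and an assistant answering, grounded only in the text.\n\n"
    "Caption: {caption}\n"
    "Return JSON: [{{\"from\": \"human\" or \"gpt\", \"value\": \"...\"}}, ...]"
)

def generate_instruction_sample(image_id: str, caption: str) -> dict:
    """Turn one image-caption pair into an (image, instruction, output) instance."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(caption=caption)}],
        temperature=0.7,
    )
    conversation = json.loads(response.choices[0].message.content)
    return {"image": image_id, "conversations": conversation}
```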

Next, the researchers present a novel curriculum learning approach to train LLaVA-Med. Specifically, they start from LLaVA, a multimodal conversation model trained in the general domain, and then continually train it to adapt to the biomedical domain. The training procedure consists of two stages: 1) Biomedical Concept Feature Alignment, which aligns the image features of a vast number of novel biomedical visual concepts to their corresponding textual word embeddings; and 2) End-to-End Instruction-Tuning, which fine-tunes the model on biomedical language-image instructions. As a result, LLaVA-Med is able to effectively interact with users and demonstrates strong zero-shot task transfer capability.
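Below is a minimal PyTorch-style sketch of this two-stage curriculum as described above. The module attribute names (vision_encoder, projection, language_model), the HuggingFace-style forward that returns a .loss, and the hyperparameters are illustrative assumptions rather than the released training code; the freezing scheme follows the stage descriptions in the text.

```python
# Sketch of the two-stage curriculum: feature alignment, then instruction tuning.
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a submodule."""
    for p in module.parameters():
        p.requires_grad = trainable

def run_stage(model, dataloader, epochs: int, lr: float) -> None:
    """Optimize only the currently trainable parameters with the LM loss."""
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    for _ in range(epochs):
        for batch in dataloader:
            loss = model(**batch).loss  # autoregressive loss on the answer tokens
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

def train_llava_med(model, alignment_loader, instruction_loader) -> None:
    # Stage 1: biomedical concept feature alignment.
    # Vision encoder and language model stay frozen; only the projection layer
    # that maps image features into the word-embedding space is updated.
    set_trainable(model.vision_encoder, False)
    set_trainable(model.language_model, False)
    set_trainable(model.projection, True)
    run_stage(model, alignment_loader, epochs=1, lr=2e-3)

    # Stage 2: end-to-end instruction tuning.
    # Vision encoder remains frozen; the projection layer and the language model
    # are fine-tuned on the GPT-4-generated instruction-following conversations.
    set_trainable(model.language_model, True)
    run_stage(model, instruction_loader, epochs=3, lr=2e-5)
```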

In their empirical study, the team compared LLaVA-Med with supervised state-of-the-art methods such as VL Encoder–Decoder, Q2ATransformer and BiomedCLIP. LLaVA-Med can be trained in less than 15 hours and surpasses most supervised state-of-the-art (SoTA) approaches.

Overall, this work validates the effectiveness of the proposed LLaVA-Med, and the team believes their contributions pave the way for the development of general-purpose multimodal conversational assistants.

The code is available on the project’s GitHub. The paper LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day is on arXiv.


Author: Hecate He | Editor: Chain Zhang


