Conversational generative large multimodal models (LMMs) have achieved impressive performance on a wide variety of vision-language tasks. Despite their success in general domains, however, these LMMs typically perform worse in the biomedical field, where tasks involve domain-specific biomedical image-text pairs.
In an effort to bridge this gap, a new paper titled ‘LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day’ by a Microsoft research team introduces LLaVA-Med, a Large Language and Vision Assistant for BioMedicine. The assistant can be trained in less than 15 hours and demonstrates strong multimodal conversational capabilities, effectively assisting with inquiries about biomedical images.

The team summarizes their main contributions as follows:
- Biomedical multimodal instruction-following data. We present a novel data generation pipeline to create diverse (image, instruction, output) instances, by sampling biomedical image-text pairs from PMC-15M and using GPT-4 to create instructions from the text alone.
- LLaVA-Med. We propose a novel curriculum learning method for adapting LLaVA to the biomedical domain using our self-generated biomedical multimodal instruction-following dataset.
- Open-source. To facilitate research in biomedical multimodal learning, we will release the following assets to the public: the biomedical multimodal instruction-following dataset and the codebase for data generation and model training.

To address the lack of multimodal biomedical datasets for training an instruction-following assistant, the team first proposes a novel data generation pipeline that samples 600K image-text pairs from PMC-15M, curates diverse instruction-following data with GPT-4, and uses the generated instructions to align the model.
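The pipeline can be pictured roughly as follows. This is a minimal sketch, not the authors' released code: the loader `sample_pmc_pairs`, the wrapper `call_gpt4`, the prompt wording, and the output format are hypothetical placeholders chosen for illustration; the core idea it mirrors is that GPT-4 sees only the caption text, never the image.

```python
# Sketch of the instruction-data generation idea: sample biomedical
# image-caption pairs and prompt GPT-4 with the *text only* to produce
# multi-turn instruction-following conversations about the image.
import json

SYSTEM_PROMPT = (
    "You are given the caption and inline mentions of a biomedical figure. "
    "Write a multi-turn conversation between a user asking about the image "
    "and an assistant answering as if it can see the image. "
    "Use only information supported by the provided text."
)

def sample_pmc_pairs(n):
    """Hypothetical loader: yields dicts with 'image', 'caption', 'mentions'."""
    raise NotImplementedError

def call_gpt4(system_prompt, user_prompt):
    """Hypothetical wrapper around a GPT-4 chat-completion call."""
    raise NotImplementedError

def build_instruction_data(num_pairs=600_000, out_path="llava_med_instruct.jsonl"):
    # One JSON line per sample: the image reference plus the GPT-4-generated
    # (instruction, response) conversation grounded in the caption text.
    with open(out_path, "w") as f:
        for pair in sample_pmc_pairs(num_pairs):
            context = f"Caption: {pair['caption']}\nMentions: {pair['mentions']}"
            conversation = call_gpt4(SYSTEM_PROMPT, context)
            f.write(json.dumps({"image": pair["image"],
                                "conversations": conversation}) + "\n")
```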

Next, the researchers present a novel curriculum learning approach to train LLaVA-Med. Specifically, they start from LLaVA, a multimodal conversation model trained on general domains, and then continually train it to adapt to the biomedical domain. The training procedure consists of two stages: 1) Biomedical Concept Feature Alignment, which aligns the image features of a vast number of novel biomedical visual concepts to their corresponding textual word embeddings; and 2) End-to-End Instruction-Tuning, which fine-tunes the model on biomedical language-image instructions. As a result, LLaVA-Med can effectively interact with users and demonstrates strong zero-shot task transfer capability.
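A minimal sketch of this two-stage curriculum is shown below, assuming a LLaVA-style architecture (a frozen vision encoder, a projection layer, and an LLM). The component names and the `model(images=..., input_ids=..., labels=...)` interface are assumptions for illustration, not the authors' actual code; which modules are unfrozen in each stage follows the stage descriptions above.

```python
# Two-stage curriculum: (1) align biomedical image features to the LLM's
# word-embedding space, (2) instruction-tune end to end on biomedical data.
import torch

def set_trainable(module, flag):
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def train_stage(model, dataloader, trainable_parts, epochs=1, lr=2e-5):
    # Freeze everything, then unfreeze only the parts this stage updates.
    set_trainable(model, False)
    for part in trainable_parts:
        set_trainable(part, True)
    params = [p for p in model.parameters() if p.requires_grad]
    optim = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            # Hypothetical forward signature returning a language-modeling loss.
            loss = model(images=batch["images"],
                         input_ids=batch["input_ids"],
                         labels=batch["labels"]).loss
            loss.backward()
            optim.step()
            optim.zero_grad()

# Stage 1: biomedical concept feature alignment -- only the projection layer
# learns to map novel biomedical image features onto the LLM's embeddings.
# train_stage(llava_med, alignment_loader, trainable_parts=[llava_med.projector])

# Stage 2: end-to-end instruction tuning -- projection layer and LLM are
# updated on the GPT-4-generated biomedical instruction data.
# train_stage(llava_med, instruct_loader,
#             trainable_parts=[llava_med.projector, llava_med.language_model])
```

The essential design choice highlighted by the article is the curriculum itself: concept-level alignment first, open-ended instruction following second.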

In their empirical study, the team compared LLaVA-Med with supervised state-of-the-art methods such as VL Encoder–Decoder, Q2ATransformer and BiomedCLIP. LLaVA-Med can be trained in less than 15 hours and surpasses most supervised state-of-the-art (SoTA) approaches.
Overall, this work validates the effectiveness of the proposed LLaVA-Med, and the team believes their contribution paves the way for the development of general-purpose multimodal conversational assistants.
The code is available on the project’s GitHub. The paper LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day is on arXiv.
Author: Hecate He | Editor: Chain Zhang
