Microsoft’s LLaVA-Med Trains a Large Language-and-Vision Assistant for Biomedicine Within 15 Hours

Conversational generative large multimodal models (LMMs) have achieved impressive performance on a wide variety of vision-language tasks. Despite their success in the general domain, these LMMs typically perform worse in the biomedical field, where image-text pairs are highly domain-specific.

In an effort to bridge this gap, a new paper titled ‘LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day’ by a Microsoft research team introduces a Large Language and Vision Assistant for BioMedicine (LLaVA-Med). This assistant can be trained in less than 15 hours and demonstrates strong multimodal conversational capability, effectively assisting with inquiries about biomedical images.

The team summarizes their main contributions as follows:

Biomedical multimodal instruction-following data. We present a novel data generation pipeline to create diverse (image, instruction, output) instances, by sampling biomedical image-text pairs from PMC-15M and using GPT-4 to create instructions from the text alone.
LLaVA-Med. We propose a novel curriculum learning method for adapting LLaVA to the biomedical domain using our self-generated biomedical multi-modal instruction-following dataset.
Open-source. To facilitate research in biomedical multimodal learning, we will release the following assets to the public: the biomedical multimodal instruction-following dataset and the codebase for data generation and model training.

To address the lack of multimodal biomedical datasets for training an instruction-following assistant, the team first proposes a novel data generation pipeline that samples 600K image-text pairs from PMC-15M, uses GPT-4 to curate diverse instruction-following data from them, and aligns the created instructions to the model.
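The data generation step can be illustrated with a minimal sketch. It is not the team's pipeline: the names `ImageTextPair`, `build_prompt`, `make_instances` and `fake_text_model` are hypothetical, the prompt wording is invented for illustration, and the language model is passed in as a plain callable so that no particular GPT-4 API is assumed. The one idea taken from the article is that the instruction is written from the caption text alone, while the image is only attached to the finished instance.

```python
import json
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ImageTextPair:
    image_id: str  # hypothetical identifier for a biomedical figure
    caption: str   # text that accompanies the figure


def build_prompt(caption: str) -> str:
    """Build a text-only prompt; the language model never sees the image."""
    return (
        "You are given the caption of a biomedical figure. "
        "Write one question a user might ask about the image and its answer, "
        "as JSON with the keys 'instruction' and 'output'.\n"
        f"Caption: {caption}"
    )


def make_instances(
    pairs: List[ImageTextPair],
    text_model: Callable[[str], str],
    sample_size: int,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Sample pairs and turn each into an (image, instruction, output) instance."""
    rng = random.Random(seed)
    sampled = rng.sample(pairs, min(sample_size, len(pairs)))
    instances = []
    for pair in sampled:
        reply = json.loads(text_model(build_prompt(pair.caption)))
        instances.append(
            {
                "image": pair.image_id,
                "instruction": reply["instruction"],
                "output": reply["output"],
            }
        )
    return instances


def fake_text_model(prompt: str) -> str:
    """Stand-in for a real language model so the sketch runs offline."""
    caption = prompt.split("Caption: ", 1)[1]
    return json.dumps(
        {"instruction": "What does the image show?", "output": caption}
    )
```

In a real setting, `fake_text_model` would be replaced by a call to an actual language model, and the sample would be drawn from a large corpus rather than a short in-memory list.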

Next, the researchers present a novel curriculum learning approach to train LLaVA-Med. Specifically, they start from LLaVA, a multimodal conversation model trained on general domains, then continue training the model to adapt it to the biomedical domain. The training procedure consists of two stages: 1) Biomedical Concept Feature Alignment, which aligns the image features of a vast number of novel biomedical visual concepts to their corresponding textual word embeddings; and 2) End-to-End Instruction-Tuning, which fine-tunes the model on the biomedical language-image instructions. As a result, LLaVA-Med is able to interact effectively with users and demonstrates strong zero-shot task transfer capability.
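The two-stage curriculum can be written down as a small schedule. The sketch below is illustrative only, and it rests on an assumption the article does not state: that, as in LLaVA-style setups, the first stage updates only the projection between image features and word embeddings, while the second stage also updates the language model and keeps the vision encoder frozen. The names `Stage`, `CURRICULUM`, `trainable_flags` and `run_curriculum` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Assumed model parts; the article names none of them explicitly.
MODULES = ("vision_encoder", "projection", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: Tuple[str, ...]  # modules whose weights are updated (assumption)


CURRICULUM = (
    Stage(
        name="biomedical_concept_feature_alignment",
        data="biomedical image-text pairs",
        trainable=("projection",),
    ),
    Stage(
        name="end_to_end_instruction_tuning",
        data="biomedical instruction-following data",
        trainable=("projection", "language_model"),
    ),
)


def trainable_flags(stage: Stage) -> Dict[str, bool]:
    """Map every module to whether it is updated during this stage."""
    return {module: module in stage.trainable for module in MODULES}


def run_curriculum(train_one_stage: Callable[[Stage, Dict[str, bool]], None]) -> None:
    """Run the stages in order; each stage continues from the previous weights."""
    for stage in CURRICULUM:
        train_one_stage(stage, trainable_flags(stage))
```

Passing the training step in as a callable keeps the schedule separate from any particular deep learning framework; a real training loop would use the flags to freeze or unfreeze the corresponding parameters before each stage.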

In their empirical study, the team compared LLaVA-Med with supervised state-of-the-art (SoTA) methods such as VL Encoder–Decoder, Q2ATransformer and BiomedCLIP. LLaVA-Med can be trained in less than 15 hours and surpasses most of these supervised SoTA approaches.

Overall, this work validates the effectiveness of the proposed LLaVA-Med, and the team believes their contribution paves the way for the development of general-purpose multimodal conversational assistants.

The code is available on the project’s GitHub. The paper LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day is on arXiv.

Author: Hecate He | Editor: Chain Zhang

