Recent developments in text-to-music editing, which use text prompts to modify music, have opened up new challenges and opportunities for AI-driven music creation. Traditional approaches require training dedicated editing models from scratch, a process that is both resource-intensive and inefficient, while methods that rely on large language models to predict edited music suffer from inaccurate audio reconstruction.
Addressing these limitations, a new paper titled Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning, from a research team at Queen Mary University of London, Sony AI, and MBZUAI, introduces Instruct-MusicGen, a method that fine-tunes a pretrained MusicGen model to follow editing instructions efficiently, outperforming existing baselines across a range of editing tasks.

Instruct-MusicGen applies instruction tuning to the pretrained MusicGen model, teaching it to follow editing instructions without fine-tuning all of its parameters.

Specifically, Instruct-MusicGen integrates an audio fusion module based on LLaMA-Adapter and a text fusion module based on LoRA into the original MusicGen framework. This dual integration enables the model to process a precise audio condition and a text-based editing instruction simultaneously, capabilities the original MusicGen lacks. Consequently, Instruct-MusicGen can handle a variety of editing tasks, such as adding, separating, and removing stems, within a single training process, dramatically reducing computational demands compared to training specialized editing models from scratch.
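To make this concrete, below is a minimal, hypothetical PyTorch sketch of the dual-fusion idea: a LoRA-style low-rank update wrapped around a frozen linear projection (text fusion) and a LLaMA-Adapter-style gated cross-attention over encoded audio tokens (audio fusion). The class names, dimensions, and wiring here are illustrative assumptions, not MusicGen's actual internals.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Text fusion sketch: a frozen pretrained linear layer plus a trainable
    low-rank (LoRA) update, so only rank*(d_in + d_out) new weights train."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

class AudioFusion(nn.Module):
    """Audio fusion sketch: LLaMA-Adapter-style gated cross-attention that lets
    decoder hidden states attend to encoded condition-audio tokens. The
    zero-initialized gate means training starts from the unmodified
    pretrained behaviour."""
    def __init__(self, d_audio: int, d_model: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_audio, d_model)     # map audio features into the decoder space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))    # tanh(0) = 0: fusion is off at init

    def forward(self, hidden, audio_tokens):
        kv = self.proj(audio_tokens)
        fused, _ = self.attn(hidden, kv, kv)
        return hidden + torch.tanh(self.gate) * fused

# Toy usage with random tensors standing in for real activations:
d_model, d_audio = 1024, 128
text_fusion = LoRALinear(nn.Linear(d_model, d_model))
audio_fusion = AudioFusion(d_audio, d_model)
hidden = torch.randn(2, 50, d_model)                # decoder hidden states
audio = torch.randn(2, 100, d_audio)                # encoded condition audio
out = audio_fusion(text_fusion(hidden), audio)      # both new modules in one forward pass
print(out.shape)                                    # torch.Size([2, 50, 1024])
```

The zero-initialized gate and zero-initialized LoRA matrix are the standard trick for this kind of adapter: at the start of fine-tuning the model behaves exactly like the pretrained MusicGen, and the editing capability is learned gradually.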
For training, the team synthesized an instruction dataset from the Slakh2100 dataset. The added modules introduce only an 8% increase in parameters over the original model, and fine-tuning runs for just 5,000 steps, less than 1% of the training required to build a music editing model from scratch.
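As an illustration of how such a dataset could be assembled, the sketch below builds (input audio, instruction, target audio) triplets from multitrack stems for the add/remove/separate tasks. The function name, instruction phrasing, and mixing-by-summation are assumptions for illustration; the paper's exact recipe may differ.

```python
import random
import torch

def make_edit_triplet(stems: dict[str, torch.Tensor]):
    """Build one (input_audio, instruction, target_audio) training example
    from a dict mapping stem names to equal-length waveforms."""
    names = list(stems)
    target_stem = random.choice(names)                    # the stem being edited
    others = [n for n in names if n != target_stem]
    context = random.sample(others, k=random.randint(1, len(others)))

    def mix(keys):                                        # sum stems into a mixture
        return torch.stack([stems[k] for k in keys]).sum(dim=0)

    task = random.choice(["add", "remove", "separate"])
    if task == "add":        # input lacks the stem; target contains it
        return mix(context), f"Add {target_stem}.", mix(context + [target_stem])
    if task == "remove":     # input contains the stem; target drops it
        return mix(context + [target_stem]), f"Remove {target_stem}.", mix(context)
    # separate: isolate the stem from the mixture
    return mix(context + [target_stem]), f"Separate {target_stem}.", stems[target_stem]

# Toy usage with dummy stems (one second of mono audio at 32 kHz):
stems = {k: torch.randn(32000) for k in ["drums", "bass", "piano", "guitar"]}
inp, instruction, target = make_edit_triplet(stems)
print(instruction, inp.shape, target.shape)
```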

The researchers also conducted a thorough comparison across various models. Instruct-MusicGen not only outperformed existing baselines but also matched the performance of models trained specifically for individual tasks, demonstrating the effectiveness and versatility of the approach. This advance improves the efficiency of text-to-music editing and broadens the practical applications of music language models in dynamic music production environments.
Demos are available at: https://bit.ly/instruct-musicgen. The paper Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning is on arXiv.
Author: Hecate He | Editor: Chain Zhang

