AI Jam Session: Google & Sorbonne U’s MusicLM Achieves SOTA Performance on High-Fidelity Music Generation from Text

In the new paper MusicLM: Generating Music From Text, a Google Research and Sorbonne University team presents MusicLM, a model for generating high-fidelity music that can be conditioned on both text and melody. MusicLM surpasses baselines in both audio quality and adherence to the text description.

AI’s evolution over the last decade has been incredible. While researchers might point to the successes of AlexNet or AlphaGo as milestones, the “wow” moments for the general public have come from prompt-based image generation models such as Stable Diffusion and, more recently, the power of ChatGPT. Might the next frontier be prompt-based music generation?

Conditional neural audio generation already has applications in areas such as text-to-speech and audio synthesis. Conventional approaches are facilitated by a temporal alignment between the conditioning signal and the corresponding audio output, but recent studies have also revealed the potential for generating complex audio outputs from sequence-wide, high-level captions.

In their paper MusicLM: Generating Music From Text, the Google Research and Sorbonne University team takes up this challenge, presenting a model that generates high-fidelity music conditioned on both text and melody.

The team summarizes their main contributions as follows:

  1. We introduce MusicLM, a generative model that produces high-quality music at 24 kHz which is consistent over several minutes while being faithful to a text conditioning signal.
  2. We extend our method to other conditioning signals, such as a melody that is then synthesized according to the text prompt. Furthermore, we demonstrate long and coherent music generation of up to 5-minute long clips.
  3. We release the first evaluation dataset collected specifically for the task of text-to-music generation: MusicCaps is a hand-curated, high-quality dataset of 5.5k music-text pairs prepared by musicians.

The proposed MusicLM builds on AudioLM (Borsos et al., 2022), an audio generation framework that achieves high fidelity and long-term coherence over dozens of seconds. MusicLM leverages AudioLM’s multi-stage autoregressive modelling and extends it to incorporate text conditioning.
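To make this staged design concrete, here is a minimal Python sketch of the data flow: conditioning tokens derived from the text prompt drive a semantic stage that lays down coarse, long-horizon structure, whose output in turn drives an acoustic stage that fills in fine detail. All function names, vocabulary sizes, and sequence lengths below are illustrative assumptions, with random sampling standing in for the actual autoregressive Transformer stages; this is not the authors’ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mulan_text_tokens(text: str, n: int = 12) -> np.ndarray:
    # Stand-in for MuLan: map a caption to a sequence of conditioning tokens.
    local = np.random.default_rng(abs(hash(text)) % 2**32)
    return local.integers(0, 1024, size=n)

def semantic_stage(cond: np.ndarray, steps: int) -> np.ndarray:
    # Stand-in for the first stage: sample semantic tokens given the
    # conditioning. A real model samples autoregressively from a Transformer.
    return rng.integers(0, 512, size=steps)

def acoustic_stage(cond: np.ndarray, semantic: np.ndarray, steps: int) -> np.ndarray:
    # Stand-in for the second stage: sample acoustic codec tokens given the
    # conditioning and the semantic tokens.
    return rng.integers(0, 1024, size=steps)

caption = "a calming violin melody backed by a distorted guitar riff"
cond = mulan_text_tokens(caption)
semantic = semantic_stage(cond, steps=250)             # coarse, long-horizon structure
acoustic = acoustic_stage(cond, semantic, steps=600)   # fine acoustic detail
# A neural audio codec (SoundStream in the paper) would decode the acoustic
# tokens to a 24 kHz waveform; here we just report the token counts.
print(len(semantic), len(acoustic))
```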

To address the scarcity of paired music-text training data, the researchers turn to MuLan (Huang et al., 2022), a joint music-text model trained to match music with its corresponding text description. MuLan does not require high-quality training data; it can learn cross-modal correspondences even when the music-text pairs are only weakly associated.
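The following toy Python sketch illustrates the joint-embedding idea behind such a model: an audio encoder and a text encoder map their inputs into a shared space, where a matched (even noisy) pair scores a higher cosine similarity than a mismatched one. The learned encoders are replaced with random stand-in vectors here; this is an assumption-laden illustration of the concept, not MuLan’s actual architecture or training objective.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 128

# Stand-ins for the two encoders' outputs: a weakly associated music-text
# pair should still land close together in the shared space, while an
# unrelated caption should not.
audio_emb = rng.standard_normal(dim)
matched_text_emb = audio_emb + 0.5 * rng.standard_normal(dim)
unrelated_text_emb = rng.standard_normal(dim)

print(f"matched pair:   {cosine(audio_emb, matched_text_emb):.2f}")    # high
print(f"unrelated pair: {cosine(audio_emb, unrelated_text_emb):.2f}")  # near zero
```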

In their empirical study, the team compared MusicLM against two recent baselines, Mubert (Mubert-Inc, 2022) and Riffusion (Forsgren & Martiros, 2022). The results show that MusicLM captures fine-grained information from rich free-text captions, produces high-quality music at 24 kHz, remains consistent over several minutes, and adheres to the text conditioning signal more faithfully than the baselines.

This paper aptly demonstrates MusicLM’s power and potential in text-to-music generation and identifies possible future research avenues such as lyrics generation and the modelling of high-level song structures (e.g. intro, verse, chorus) to enable the generation of more complex compositions.

Samples are available on the project’s GitHub. The paper MusicLM: Generating Music From Text is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

