Music, a universal form of artistic expression, carries deep cultural significance and appeals to people across civilizations. Deep generative models for music have made significant progress. However, generating high-fidelity, realistic music conditioned on free-form textual descriptions, a task known as text-to-music, remains challenging.
In a new paper JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models, a Futureverse research team presents JEN-1, a universal framework that combines bidirectional and unidirectional modes to generate high-quality music conditioned on either text or music representations, achieving new state-of-the-art results in text-music alignment and music quality while maintaining computational efficiency.
The team summarizes their key contributions as follows:
- We propose JEN-1 as a solution to the challenging text-to-music generation task. JEN-1 employs in-context learning and is trained with multi-task objectives, enabling music generation, music continuation, and music inpainting within a single model.
- JEN-1 utilizes an extremely efficient approach by directly modeling waveforms, avoiding the conversion loss associated with spectrograms.
- Our JEN-1 model integrates both autoregressive and non-autoregressive diffusion modes to improve sequential dependency and enhance sequence generation concurrently.
- Our paper presents a significant advancement in the field of text-to-music generation, offering a powerful, efficient, and controllable framework for generating high-quality music aligned with textual prompts and melodic structures.
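One way to picture how a single model could serve generation, continuation, and inpainting is to describe each task by a mask over time steps that marks which part of the audio is given as context and which part must be generated. The sketch below is only an illustration of that idea: the function `task_mask`, its parameters, and the 0/1 convention are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def task_mask(task: str, length: int, start: int = 0, end: int = 0) -> List[int]:
    """Return a mask over time steps: 1 = given as context, 0 = to be generated.

    Hypothetical helper for illustration only; not JEN-1's actual interface.
    """
    if task == "generation":
        # Nothing is given: every step is generated from the text prompt.
        return [0] * length
    if task == "continuation":
        # The first `start` steps are known; everything after is generated.
        return [1 if t < start else 0 for t in range(length)]
    if task == "inpainting":
        # Steps in [start, end) are missing; the rest is known context.
        return [0 if start <= t < end else 1 for t in range(length)]
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for name, args in [("generation", {}),
                       ("continuation", {"start": 2}),
                       ("inpainting", {"start": 1, "end": 3})]:
        print(name, task_mask(name, 5, **args))
```

Under this assumed convention, the three tasks differ only in the mask that accompanies the input, which is what makes training them jointly in one model plausible.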
JEN-1 combines bidirectional and unidirectional modes to offer a unified approach to universal text-to-music generation. Unlike previous generative models that rely on discrete tokens or involve multiple serial stages, JEN-1 uses a novel framework to enable continuous, high-fidelity music generation with a single model.
Moreover, JEN-1 uses both autoregressive training, to improve sequential dependency, and non-autoregressive training, to improve sequence generation, at the same time. Specifically, JEN-1 leverages an efficient temporal 1D U-Net to model the waveform and implement the required blocks in the diffusion model. The researchers further introduce a novel omnidirectional latent diffusion model to achieve multi-task training. JEN-1 also integrates a unidirectional diffusion mode to capture the inherent sequential nature of music.
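The difference between a bidirectional and a unidirectional mode can be illustrated with a plain 1D convolution. In the sketch below, the bidirectional variant pads the signal on both sides, so each output sees past and future samples, while the unidirectional variant pads only on the left, so each output depends on the current and earlier samples alone. Left-only (causal) padding is a common way to obtain this behaviour; whether JEN-1 implements its unidirectional mode exactly this way is an assumption here, and the function `conv1d` is a hypothetical helper written for this example.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float], causal: bool) -> List[float]:
    """Same-length 1D convolution (cross-correlation form) with zero padding.

    causal=True  -> pad only on the left: output[t] uses signal[t-k+1 .. t].
    causal=False -> pad on both sides: output[t] also uses future samples.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(signal))]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    w = [1.0, 1.0, 1.0]
    print("unidirectional:", conv1d(x, w, causal=True))
    print("bidirectional: ", conv1d(x, w, causal=False))
```

A quick check of the unidirectional property is to change the last sample of the input: with `causal=True` every earlier output stays the same, whereas with `causal=False` the neighbouring output changes as well.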
In their empirical study, the team compares JEN-1 with state-of-the-art methods, including Riffusion, Mousai, MusicLM, MusicGen and Noise2Music. JEN-1 surpasses all SOTA baselines in terms of subjective quality, diversity, and controllability.
Overall, this work advances text-to-music generation and introduces a powerful text-to-music generator. The team hopes their work will encourage more research on developing generative models to create impactful and realistic art.
Author: Hecate He | Editor: Chain Zhang