The ongoing progress in neural networks has enabled high-fidelity audio compression, a useful technique for mitigating the increasing demands that audio and video streaming have placed on the Internet. However, current lossy neural compression models are prone to problems such as overfitting to their training sets and low compression efficiency.
Researchers from Meta AI address these issues in the new paper High Fidelity Neural Audio Compression, presenting EnCodec, a state-of-the-art real-time neural audio compression model that can generate high-fidelity audio samples across a wide range of sample rates and bandwidths.
EnCodec is a streaming convolutional encoder-decoder architecture with three principal components: 1) an encoder network that takes an audio extract as input and outputs a latent representation; 2) a quantization layer that produces a compressed representation via vector quantization; and 3) a decoder network that reconstructs the time-domain signal from the compressed latent representation.
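The three-stage pipeline can be illustrated with a minimal sketch. This is not the actual EnCodec model: average pooling stands in for the strided convolutions of the encoder, a scalar codebook stands in for the learned vector codebooks, and sample repetition stands in for the transposed convolutions of the decoder. All names here are illustrative.

```python
import numpy as np

def encode(wave, hop=4):
    # Toy "encoder": average pooling as a stand-in for strided convolutions,
    # downsampling the waveform into latent frames.
    n = len(wave) // hop * hop
    return wave[:n].reshape(-1, hop).mean(axis=1)

def quantize(latents, codebook):
    # Toy quantizer: snap each latent frame to its nearest codebook entry
    # and return both the discrete indices and the quantized values.
    idx = np.argmin(np.abs(latents[:, None] - codebook[None, :]), axis=1)
    return idx, codebook[idx]

def decode(latents, hop=4):
    # Toy "decoder": repeat each latent value, a stand-in for the
    # transposed convolutions that upsample back to the time domain.
    return np.repeat(latents, hop)

codebook = np.linspace(-1, 1, 256)            # 256 entries = 8 bits per frame
wave = np.sin(np.linspace(0, 8 * np.pi, 64))  # toy input signal
idx, q = quantize(encode(wave), codebook)     # idx is the compressed stream
recon = decode(q)                             # reconstructed time-domain signal
```

Only the integer indices need to be transmitted; the decoder shares the codebook and reconstructs the signal from them.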
To simplify and speed up the training, the team adopts a single multiscale spectrogram adversary that reduces artifacts and produces high-quality samples. They deal with the overfitting issues that have hindered previous compression methods by employing a large and diverse training set and discriminator networks that serve as perceptual losses. The perceptual loss term is based on a multiscale STFT-based (MS-STFT) discriminator that effectively captures different structures in audio signals to significantly stabilize the training process.
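The multiscale STFT idea underlying the discriminator can be sketched as follows: the same waveform is analyzed at several window sizes, so that both fine temporal detail and broad spectral structure are captured. This toy version only computes the magnitude spectrograms; in the paper, a small convolutional discriminator operates on each scale, and its feature maps serve as the perceptual loss. The window sizes below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def stft_mag(x, win, hop):
    # Magnitude spectrogram from a Hann-windowed FFT (toy STFT).
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def multiscale_stft_features(x, wins=(256, 512, 1024)):
    # One spectrogram per scale; short windows resolve transients,
    # long windows resolve harmonic structure.
    return [stft_mag(x, w, w // 4) for w in wins]

x = np.sin(np.linspace(0, 100, 4096))  # toy input signal
feats = multiscale_stft_features(x)
```

Comparing such multi-resolution features between the original and reconstructed audio penalizes artifacts that any single analysis scale would miss.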
To improve compression efficiency in both compute time and size, the team limits their study to models that run in real time on a single CPU core and applies residual vector quantization to the neural encoder's floating-point output. They also leverage lightweight transformer models to further compress the obtained representations by up to 40 percent while maintaining real-time or faster speeds.
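Residual vector quantization works by chaining several codebooks, each one quantizing the error left over by the previous stage, so precision grows with the number of codebooks (and hence with bandwidth). A minimal NumPy sketch of the idea, with toy random codebooks rather than EnCodec's learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    # Each codebook quantizes the residual left by the previous stage;
    # the output is one index per codebook.
    residual = x.copy()
    indices = []
    for cb in codebooks:
        d = np.linalg.norm(residual[None, :] - cb, axis=1)  # distance to entries
        i = int(np.argmin(d))                               # nearest codeword
        indices.append(i)
        residual = residual - cb[i]                         # pass on the error
    return indices

def rvq_decode(indices, codebooks):
    # Reconstruction is the sum of the selected codewords from every stage.
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Toy setup (illustrative sizes): 3 codebooks of 16 entries, 8-dim latents.
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]
x = rng.normal(size=8)
idx = rvq_encode(x, codebooks)
x_hat = rvq_decode(idx, codebooks)
```

Dropping trailing codebooks at decode time simply yields a coarser reconstruction, which is what lets a single model serve multiple target bandwidths.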
In the team’s empirical study, human evaluators using the MUSHRA test methodology compared EnCodec with baselines Opus (Valin et al., 2012), Enhanced Voice Service (EVS, Dietz et al., 2015), Lyra 2 (SoundStream, Zeghidour et al., 2021), and MP3 compression at 64 kbps. In the evaluations, EnCodec achieved new state-of-the-art results for speech; monophonic music at 1.5, 3, 6, and 12 kbps at 24 kHz; and stereophonic music at 6, 12, and 24 kbps at 48 kHz.
Overall, this work shows that EnCodec can produce high-fidelity audio samples across a range of sample rates and bandwidths while stabilizing the training process. Moreover, employing a small transformer model can further reduce bandwidth by up to 40 percent without sacrificing compression quality.
The code and models are available on the project’s GitHub. The paper High Fidelity Neural Audio Compression is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang