In recent years, generative adversarial networks (GANs) have emerged as popular solutions for generating high-quality audio, thanks to feedforward parallel generators that enable fast inference. A key component contributing to the success of these models is a high-quality neural compression model that compresses high-dimensional natural signals into lower-dimensional tokens.
These compression models, however, are often tailored to a specific type of audio signal and thus fail to serve as universal compression models for generic sounds such as speech, music, and environmental sounds.
To bridge this gap, in the new paper High-Fidelity Audio Compression with Improved RVQGAN, a Descript research team presents Improved RVQGAN, a high-fidelity universal audio compression model that combines advances in high-fidelity audio generation with improved adversarial and reconstruction losses to achieve ~90x compression of 44.1 kHz audio at only 8 kbps bandwidth.
The team summarizes their main contributions as follows:
- We introduce Improved RVQGAN, a high-fidelity universal audio compression model that can compress 44.1 kHz audio into discrete codes at 8 kbps bitrate (~90x compression) with minimal loss in quality and fewer artifacts.
- We identify a critical issue in existing models, which fail to utilize the full bandwidth due to codebook collapse (where a fraction of the codes go unused), and fix it using improved codebook learning techniques.
- We identify a side effect of quantizer dropout, a technique designed to allow a single model to support variable bitrates: it actually hurts full-bandwidth audio quality. We propose a solution to mitigate it.
- We make impactful design changes to existing neural audio codecs by adding periodic inductive biases, a multi-scale STFT discriminator, and a multi-scale mel loss, and provide thorough ablations and intuitions to motivate them.
- Our proposed method is a universal audio compression model, capable of handling speech, music, environmental sounds, different sampling rates and audio encoding formats.
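The multi-scale spectral loss mentioned in the contributions above can be illustrated with a minimal sketch: compare (log-)magnitude spectrograms of the reference and reconstructed audio at several window sizes, so short windows capture transients and long windows capture tonality. This is a simplified stand-in, not the paper's exact implementation — it uses plain magnitude spectrograms rather than mel-warped ones, and the window sizes here are illustrative:

```python
import numpy as np

def stft_mag(x, win, hop):
    """Magnitude spectrogram via framed FFT with a Hann window."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multiscale_spectral_loss(x, y, wins=(64, 256, 1024)):
    """Sum of L1 distances between log-magnitude spectrograms at
    several window sizes (simplified: no mel filterbank)."""
    loss = 0.0
    for win in wins:
        X = stft_mag(x, win, win // 4)
        Y = stft_mag(y, win, win // 4)
        loss += np.mean(np.abs(np.log(X + 1e-5) - np.log(Y + 1e-5)))
    return loss
```

For identical signals the loss is exactly zero, and any spectral mismatch at any scale raises it, which is what makes it a useful reconstruction objective.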
The Improved RVQGAN model builds upon VQ-GANs. Specifically, it uses the fully convolutional encoder-decoder network from SoundStream to perform temporal downscaling with a chosen striding factor. The researchers apply Residual Vector Quantization (RVQ) to quantize the encodings and use quantizer dropout during training so that a single model can operate at several bitrates.
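The mechanics of RVQ with quantizer dropout can be sketched as follows: each quantizer encodes the residual left over by the previous ones, and during training the number of active quantizers is randomly sampled so the model learns to reconstruct audio at every depth (i.e., every bitrate). Codebook sizes and dimensions below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_code(codebook, vec):
    """Index of the codebook entry closest to vec (L2 distance)."""
    return int(np.argmin(np.linalg.norm(codebook - vec, axis=1)))

def rvq_encode(x, codebooks, n_quantizers):
    """Residual vector quantization: quantizer i encodes the residual
    left after subtracting the reconstructions of quantizers 0..i-1."""
    codes, quantized = [], np.zeros_like(x)
    residual = x.copy()
    for cb in codebooks[:n_quantizers]:
        idx = nearest_code(cb, residual)
        codes.append(idx)
        quantized += cb[idx]
        residual = x - quantized
    return codes, quantized

dim, codebook_size, n_codebooks = 8, 16, 4
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_codebooks)]
x = rng.normal(size=dim)

# Quantizer dropout: sample how many quantizers to use for this
# training example, so one model supports several bitrates.
n_active = int(rng.integers(1, n_codebooks + 1))
codes, x_hat = rvq_encode(x, codebooks, n_active)
```

At inference, using more quantizers spends more bits and generally yields a closer reconstruction; truncating the code stack lowers the bitrate.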
To improve audio fidelity, the team adopts the Snake activation function, which adds a periodic inductive bias to the generator. And to address the poor initialization that leads to low codebook usage in vanilla VQ-VAEs, the researchers adopt the factorized and L2-normalized codes introduced in the Improved VQGAN image model.
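Both tricks are simple to state. Snake is x + (1/α)·sin²(αx), whose periodic term suits audio waveforms. Factorized, L2-normalized lookup projects the encoding into a low-dimensional space and normalizes both sides, so nearest-neighbor search becomes a cosine-similarity argmax. A minimal sketch, with illustrative dimensions (64-d encoding, 8-d code space, 1024 codes) that are assumptions rather than the paper's settings:

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).
    The sin^2 term injects a periodic inductive bias."""
    return x + np.sin(alpha * x) ** 2 / alpha

def l2_normalize(v, axis=-1, eps=1e-8):
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def factorized_lookup(z, proj, codebook):
    """Project the encoding to a low-dimensional space, L2-normalize
    both the query and the codebook, then pick the code with the
    highest cosine similarity."""
    z_low = l2_normalize(z @ proj)
    cb = l2_normalize(codebook)
    return int(np.argmax(cb @ z_low))

rng = np.random.default_rng(0)
proj = rng.normal(size=(64, 8))        # 64-d encoding -> 8-d code space
codebook = rng.normal(size=(1024, 8))  # 1024 codes (illustrative)
z = snake(rng.normal(size=64))
idx = factorized_lookup(z, proj, codebook)
```

For unit vectors, maximizing the dot product is equivalent to minimizing Euclidean distance, so this lookup matches the usual nearest-neighbor quantization while decoupling the lookup space from the (larger) encoding space.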
These two techniques, together with the overall model architecture design, significantly improve bitrate efficiency and reconstruction quality while reducing artifacts.
In their empirical study, the team trained the model on a large dataset comprising speech, music, and environmental sounds and compared the trained Improved RVQGAN against competitive baselines, including EnCodec, Lyra, and Opus. The proposed Improved RVQGAN achieves higher scores than EnCodec at all bitrates and significantly outperforms state-of-the-art methods even at lower bitrates, demonstrating its ability to compress 44.1 kHz audio into discrete tokens at only 8 kbps.
Overall, this work validates the proposed Improved RVQGAN as a high-fidelity universal neural audio compression model. The team hopes their contributions will lay the foundation for the next generation of high-fidelity audio modeling.
Author: Hecate He | Editor: Chain Zhang