
A WaveNet Rival? Stanford U Study Models Raw Audio Waveforms Over Contexts of 500k Samples

In the new paper GoodBye WaveNet — A Language Model for Raw Audio with Context of 1/2 Million Samples, Stanford University researcher Prateek Verma presents a generative auto-regressive architecture that models audio waveforms over contexts greater than 500,000 samples and outperforms state-of-the-art WaveNet baselines.

The effective modelling of long-term dependencies enables conditioning new model outputs on previous inputs and is critical when dealing with longer text, audio or video contexts. However, when modelling the long-term dependencies of audio signals, even short time scales can yield hundreds of thousands of samples. While transformer architectures have helped handle this workload, their quadratic complexity in input length makes scaling them very computationally expensive.

Verma's architecture targets this bottleneck. It is a generative auto-regressive model that can handle audio waveforms over contexts greater than 500,000 samples, and it outperforms state-of-the-art WaveNet baselines on the modelling of long-term structures.

The paper’s main contributions are summarized as follows:

  1. We produce state-of-the-art results in generative modelling for raw audio. We compare our work against WaveNet, Sample-RNN, and SaSHIMI on the same dataset, with a similar number of parameters, and with a fewer number of training steps.
  2. To the best of our knowledge, for raw audio, this is the first work that can do generative modelling for such large contexts. Given that we can model context over 100,000 training examples, this work can further be improved to show improvements over very long contexts, even up to a million samples of the past.

The proposed generative, auto-regressive architecture models the probability distribution over a chunk of a given waveform, then uses this distribution to predict the next sample. The model pipeline comprises three modules, which are used for 1) latent representation learning, 2) learning dependencies over latent representations, and 3) prediction of the next sample.

For latent representation learning, the input audio is divided into non-overlapping chunks of 2,000 samples, each of which is encoded by a convolutional encoder. The model then learns time-frequency representations and the dependencies across them. Compressing each 2,000-sample chunk into a latent representation lets the model handle longer sequences with a smaller attention map.
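To make the saving concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes each 2,000-sample chunk becomes exactly one token for the transformer and that any trailing partial chunk is dropped, neither of which is stated in the article. The helper names `chunk_waveform` and `attention_entries` are hypothetical.

```python
def chunk_waveform(samples, chunk_size=2000):
    """Split a sequence into non-overlapping chunks.

    Assumption: a trailing partial chunk is simply dropped.
    """
    n_chunks = len(samples) // chunk_size
    return [samples[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]


def attention_entries(sequence_length):
    """Number of entries in one full self-attention map (length x length)."""
    return sequence_length ** 2


context = 500_000                  # context length in raw samples
waveform = [0] * context           # placeholder audio, values are irrelevant here

chunks = chunk_waveform(waveform)
num_tokens = len(chunks)           # assumption: one latent token per chunk

sample_level = attention_entries(context)     # attending over raw samples
chunk_level = attention_entries(num_tokens)   # attending over chunk latents
reduction = sample_level // chunk_level

print(num_tokens, chunk_level, reduction)
```

Under these assumptions, a 500,000-sample context becomes 250 tokens, so a single attention map holds 62,500 entries rather than 250 billion.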

To learn dependencies over the latent representations, the learned embeddings are fed into a stacked layer of transformer modules, with sinusoidal positional encodings (Vaswani et al., 2017) added to the latent representations to enable the model to know the position to which each of the latent representations belongs. A dropout rate of 0.1 is applied on input sequences and intermediate embedding tokens across time to improve robustness.
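The sinusoidal encoding of Vaswani et al. (2017) assigns each position a fixed vector of sines and cosines at geometrically spaced frequencies. The sketch below follows that published formula in plain Python; the sequence length of 250 and the latent dimension of 8 are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
import math


def sinusoidal_positional_encoding(num_positions, dim):
    """Build a num_positions x dim table of sinusoidal position vectors.

    Even indices hold sin(pos / 10000^(i/dim)) and the following odd index
    holds the matching cosine, as in Vaswani et al. (2017).
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


def add_positions(latents, encodings):
    """Element-wise sum of latent vectors and their position vectors."""
    return [[x + p for x, p in zip(latent, enc)]
            for latent, enc in zip(latents, encodings)]


NUM_TOKENS = 250   # assumption: one token per 2,000-sample chunk
LATENT_DIM = 8     # assumption: tiny dimension for readability

encodings = sinusoidal_positional_encoding(NUM_TOKENS, LATENT_DIM)
latents = [[0.0] * LATENT_DIM for _ in range(NUM_TOKENS)]  # placeholder embeddings
positioned = add_positions(latents, encodings)
```

Because the encodings are added rather than concatenated, the latent dimension stays unchanged when it enters the transformer stack.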

A single classification token from the final transformer layer is used to predict the next 8-bit sample given the preceding context. The token is first passed through a linear classification head that maps the latent space to a classification output, and a linear layer then produces an output whose dimension equals the number of possible states (256 for 8-bit audio signals).
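The prediction step can be sketched as a projection to 256 logits followed by a softmax. This is a simplified illustration: it collapses the head into a single linear layer, uses random untrained weights, and picks an arbitrary latent dimension of 16. Sampling from the resulting distribution is likewise an assumption about how generation might proceed, not a detail from the article.

```python
import math
import random

NUM_STATES = 256   # possible values of an 8-bit sample
LATENT_DIM = 16    # assumption: illustrative latent size


def linear(x, weights, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)

# Placeholder for the classification token from the final transformer layer.
token = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]

# Random, untrained weights standing in for the learned head.
weights = [[rng.gauss(0.0, 0.1) for _ in range(LATENT_DIM)]
           for _ in range(NUM_STATES)]
bias = [0.0] * NUM_STATES

probs = softmax(linear(token, weights, bias))

# Draw the next 8-bit sample from the predicted distribution.
next_sample = rng.choices(range(NUM_STATES), weights=probs, k=1)[0]
```

In an auto-regressive loop, the drawn sample would be appended to the context before the next prediction is made.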

An empirical study compared the proposed model to recent state-of-the-art neural architectures such as DeepMind’s WaveNet (Van den Oord et al., 2016), Sample-RNN (Mehri et al., 2016) and SaShiMi (Goel et al., 2022) on the YouTubeMix piano dataset. The proposed model achieved similar negative log-likelihood (NLL) scores with a similar number of parameters and outperformed all baseline models with similar contexts. The paper notes that the reported results were achieved without any tuning of the architecture topology or parameters such as regularization, latent dimension, number of layers, feed-forward layers, or dropout rates, and that tuning these parameters would further improve the NLL scores.
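For readers unfamiliar with the metric, NLL is the average negative log-probability that the model assigns to the samples that actually occurred, so lower is better. The sketch below computes it in bits (base-2 logarithm); the choice of unit and the function name `average_nll_bits` are assumptions made for illustration, and the numbers are toy values rather than results from the paper.

```python
import math


def average_nll_bits(predicted_distributions, targets):
    """Mean negative log2-probability assigned to the true samples."""
    total = 0.0
    for probs, target in zip(predicted_distributions, targets):
        total -= math.log2(probs[target])
    return total / len(targets)


NUM_STATES = 256
targets = [0, 127, 255]   # toy ground-truth 8-bit samples

# A model that knows nothing spreads probability evenly over all 256 states.
uniform = [1.0 / NUM_STATES] * NUM_STATES
uniform_nll = average_nll_bits([uniform] * len(targets), targets)

# A toy model that puts half of its probability on the correct state.
confident = []
for target in targets:
    probs = [0.5 / (NUM_STATES - 1)] * NUM_STATES
    probs[target] = 0.5
    confident.append(probs)
confident_nll = average_nll_bits(confident, targets)

print(uniform_nll, confident_nll)
```

A uniform guess over 256 states costs 8 bits per sample, which gives a natural reference point for any 8-bit audio model, while the toy model that assigns probability 0.5 to each true sample scores 1 bit.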

Overall, the work demonstrates the strong potential of auto-regressive models for raw audio processing with extremely long context inputs.

The paper GoodBye WaveNet — A Language Model for Raw Audio With Context of 1/2 Million Samples is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

