AI Machine Learning & Data Science Research

Google & TUAT’s WaveFit Neural Vocoder Achieves Inference Speeds 240x Faster Than WaveRNN

A neural vocoder is a neural network designed to generate speech waveforms from acoustic features, and is often used as a backbone module in speech generation tasks such as text-to-speech (TTS), speech-to-speech translation (S2ST), etc. Current neural vocoders, however, can struggle to maintain high sound quality without incurring high computational costs.
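To make the vocoder's role concrete, the minimal sketch below (not the authors' code) shows the interface a neural vocoder exposes: acoustic features such as a log-mel spectrogram go in, a raw audio waveform comes out. The tiny model, layer choice, shapes, and names are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    """Illustrative stand-in for a neural vocoder: mel spectrogram -> waveform."""
    def __init__(self, n_mels: int = 128, hop_length: int = 300):
        super().__init__()
        # A single transposed convolution that upsamples feature frames to
        # audio samples (real vocoders use far deeper networks).
        self.upsample = nn.ConvTranspose1d(
            n_mels, 1, kernel_size=2 * hop_length, stride=hop_length,
            padding=hop_length // 2,
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> waveform: (batch, samples)
        return torch.tanh(self.upsample(mel)).squeeze(1)

mel = torch.randn(1, 128, 200)      # a couple of seconds of acoustic features
waveform = ToyVocoder()(mel)        # (1, ~60000) synthetic waveform samples
```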

In the new paper WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration, a team from Google Research and the Tokyo University of Agriculture and Technology presents WaveFit, a fast and high-quality neural vocoder that achieves natural human speech with inference speeds that are 240 times faster than WaveRNN.

The first breakthrough in neural vocoder development was the introduction of autoregressive (AR) models such as WaveNet (van den Oord et al., 2016), which revolutionized the quality of speech generation but proved inefficient as they required a huge number of sequential operations for signal generation.

Non-AR models were subsequently proposed to speed up inference, with denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) among the most popular approaches.

Generating human-comparable speech waveforms in only a few iterations, however, remains challenging and typically involves an undesirable trade-off between sound quality and computational cost.

The proposed WaveFit non-AR neural vocoder is inspired by the theory of fixed-point iteration and introduces a novel method for combining DDPMs and GANs to boost the performance of conventional non-AR models. WaveFit iteratively applies a DNN as a denoising mapping that eliminates noise components from an input signal. A GAN-based loss and a short-time Fourier transform (STFT)-based loss are combined into a loss function that is insensitive to imperceptible phase differences and encourages the intermediate output signals to approach the target speech over the iterations.
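The paper's exact architecture and losses are not reproduced here, but the fixed-point-style refinement loop can be sketched as follows: a denoising DNN is applied repeatedly to the current signal, and every intermediate output is supervised by a phase-insensitive spectral loss (a simple STFT magnitude loss below stands in for the paper's full GAN + STFT objective). All function and parameter names are illustrative assumptions.

```python
import torch

def wavefit_style_refinement(denoiser, z0, cond, num_iters=5):
    """Iteratively refine an initial noisy signal z0 toward clean speech.

    denoiser: DNN taking (current signal, conditioning features) -> noise estimate
    z0:       initial signal derived from noise and the conditioning features
    cond:     acoustic features such as a log-mel spectrogram
    """
    z = z0
    intermediates = []
    for _ in range(num_iters):
        # Denoising mapping: subtract the estimated noise component.
        z = z - denoiser(z, cond)
        intermediates.append(z)
    return z, intermediates


def stft_magnitude_loss(pred, target, n_fft=1024, hop=256):
    """Spectral loss on magnitudes only, so it is insensitive to phase
    differences that listeners cannot perceive."""
    window = torch.hann_window(n_fft, device=pred.device)
    P = torch.stft(pred, n_fft, hop, window=window, return_complex=True).abs()
    T = torch.stft(target, n_fft, hop, window=window, return_complex=True).abs()
    return (P - T).abs().mean()


def training_loss(intermediates, target):
    # Supervise every intermediate output so that each iteration pushes
    # the signal closer to the target speech.
    return sum(stft_magnitude_loss(z, target) for z in intermediates)
```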

In their empirical study, the team evaluated WaveFit in subjective listening experiments against baselines that included WaveRNN, DDPM-based models and GAN-based models. The results show that WaveFit with five iterations can generate synthetic speech with audio quality comparable to that of WaveRNN and natural human speech, while achieving inference speeds more than 240 times faster than WaveRNN.

Audio demos are available at google.github.io/df-conformer/wavefit/. The paper WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


