A neural vocoder is a neural network designed to generate speech waveforms from acoustic features, and is often used as a backbone module in speech synthesis tasks such as text-to-speech (TTS) and speech-to-speech translation (S2ST). Current neural vocoders, however, can struggle to maintain high sound quality without incurring high computational costs.
In the new paper WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration, a team from Google Research and the Tokyo University of Agriculture and Technology presents WaveFit, a fast and high-quality neural vocoder that achieves natural human speech with inference speeds that are 240 times faster than WaveRNN.
The first breakthrough in neural vocoder development was the introduction of autoregressive (AR) models such as WaveNet (van den Oord et al., 2016), which revolutionized the quality of speech generation but proved inefficient as they required a huge number of sequential operations for signal generation.
Non-AR models were subsequently proposed to speed up inference, with denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) among the most popular.
Generating human-comparable speech waveforms in only a few iterations, however, remains challenging and typically involves an undesirable trade-off between sound quality and computational cost.
The proposed WaveFit non-AR neural vocoder is inspired by the theory of fixed-point iteration and introduces a novel method for combining DDPMs and GANs to boost the performance of conventional non-AR models. WaveFit iteratively applies a DNN as a denoising mapping that removes noise components from an input signal. A GAN-based loss and a short-time Fourier transform (STFT)-based loss are combined into a loss function that is insensitive to imperceptible phase differences and encourages the intermediate output signals to approach the target speech over the iterations.
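The fixed-point view can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `denoise_step` below is a hypothetical linear contraction toward a known target waveform, standing in for WaveFit's DNN denoising mapping (which is conditioned on acoustic features such as a log-mel spectrogram). The point is only the iteration structure: the target speech is a fixed point of the mapping, so repeated application drives the signal toward it.

```python
import numpy as np

def denoise_step(y, target, alpha=0.5):
    """Toy denoising mapping: pull y toward the target waveform.

    A contraction like this has the target as its fixed point, which is
    the property the fixed-point-iteration view relies on. In WaveFit,
    this mapping is a learned DNN, not a linear pull toward a known target.
    """
    return y + alpha * (target - y)

def wavefit_iterations(noise, target, num_iters=5):
    """Apply the denoising mapping num_iters times (the paper reports
    good quality with five iterations)."""
    y = noise
    for _ in range(num_iters):
        y = denoise_step(y, target)
    return y

rng = np.random.default_rng(0)
# 1 second of a 440 Hz sine at 16 kHz as a stand-in "target speech".
target = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
y0 = rng.standard_normal(16000)  # start from noise
y5 = wavefit_iterations(y0, target, num_iters=5)

# With this linear toy mapping the residual shrinks geometrically:
# after 5 steps at alpha=0.5 it is (1 - 0.5)^5 ≈ 3% of the initial error.
ratio = np.linalg.norm(y5 - target) / np.linalg.norm(y0 - target)
print(ratio)
```

The geometric error decay is what makes a small, fixed number of iterations viable, in contrast to the hundreds of refinement steps typical of DDPM sampling.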
In their empirical study, the team evaluated WaveFit in subjective listening experiments against baselines that included WaveRNN, DDPM-based models and GAN-based models. The results show that WaveFit with five iterations can generate synthetic speech with audio quality comparable to that of WaveRNN and to natural human speech, while achieving inference speeds more than 240 times faster than WaveRNN.
Author: Hecate He | Editor: Michael Sarazen