
Microsoft’s VALL-E 2: First Time Human Parity in Zero-Shot Text-to-Speech Achieved

Over the past decade, significant breakthroughs in speech synthesis have emerged, driven by the development of neural networks and end-to-end modeling. Last year, Microsoft introduced VALL-E, a neural codec language model capable of synthesizing high-quality personalized speech from just a 3-second recording of an unseen speaker. The model notably outperformed state-of-the-art zero-shot text-to-speech (TTS) systems at the time.

Building on this progress, in a new paper VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers, a Microsoft research team presents VALL-E 2, the latest advancement in neural codec language models. This innovation marks a milestone in zero-shot TTS synthesis by achieving human parity for the first time.

VALL-E 2, an evolution of its predecessor, employs a neural codec language modeling method for speech synthesis and introduces two significant enhancements: repetition-aware sampling and grouped code modeling.

Repetition-aware sampling improves upon the random sampling used in VALL-E by adaptively switching between random sampling and nucleus sampling for each token prediction, based on token repetition in the decoding history. This enhances the stability of the decoding process and prevents the infinite-loop issue encountered in VALL-E.
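The idea can be illustrated with a minimal sketch. The function names, the repetition criterion (fraction of the most frequent token in a recent window), and the `window`/`threshold`/`top_p` defaults below are all illustrative assumptions, not the paper's exact formulation:

```python
import random
from collections import Counter

def nucleus_sample(probs, top_p=0.9):
    """Sample from the smallest set of tokens whose cumulative probability
    reaches top_p (nucleus / top-p sampling)."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    cum, nucleus = 0.0, []
    for tok, p in ranked:
        nucleus.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    toks, ps = zip(*nucleus)
    total = sum(ps)
    return random.choices(toks, weights=[p / total for p in ps])[0]

def repetition_aware_sample(probs, history, window=10, threshold=0.3):
    """Use nucleus sampling by default, but fall back to plain random
    sampling when the recent decoding history is dominated by one token
    (hypothetical criterion; the paper's exact rule may differ)."""
    recent = history[-window:]
    if recent:
        _, count = Counter(recent).most_common(1)[0]
        repetition_ratio = count / len(recent)
    else:
        repetition_ratio = 0.0
    if repetition_ratio > threshold:
        # Too much repetition: sample from the full distribution to escape loops.
        return random.choices(range(len(probs)), weights=probs)[0]
    return nucleus_sample(probs)
```

The key point is that the sampling strategy is chosen per step from the decoding history, rather than being fixed in advance.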

Grouped code modeling partitions the codec codes into groups, each of which is modeled in a single frame during autoregressive (AR) decoding. This approach accelerates inference by shortening the sequence and improves performance by mitigating the long-context modeling problem.
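A toy sketch of the grouping step shows how the AR sequence length shrinks. The function name and the group size of 2 are arbitrary choices for illustration, not values from the paper:

```python
def group_codes(codes, group_size=2):
    """Partition a flat sequence of codec codes into fixed-size groups,
    so the AR model predicts one group per step instead of one code.
    Assumes the sequence has been padded to a multiple of group_size."""
    assert len(codes) % group_size == 0, "pad the sequence to a multiple of group_size"
    return [tuple(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]

codes = [11, 42, 7, 99, 3, 18]       # 6 AR steps if modeled code by code
grouped = group_codes(codes, 2)      # 3 AR steps when modeled group by group
```

With a group size of g, the AR model takes roughly 1/g as many decoding steps, which is where the inference speedup comes from.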

Notably, VALL-E 2 requires only simple utterance-wise speech-transcription pair data for training, greatly simplifying the data collection and processing. This advancement facilitates potential scalability and streamlines the training process.

Experiments on the LibriSpeech and VCTK datasets demonstrate that VALL-E 2 surpasses previous systems in terms of speech robustness, naturalness, and speaker similarity. It is the first model to achieve human parity on these benchmarks. Furthermore, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are complex or contain repetitive phrases.

Demos of VALL-E 2 will be posted to https://aka.ms/valle2. The paper VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers is on arXiv.


Author: Hecate He | Editor: Chain Zhang


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
