Today’s text-to-speech (TTS) systems, trained on large amounts of acoustic data, have made tremendous progress in synthesizing high-quality speech. Such systems, however, generalize poorly, suffering dramatic performance drops when dealing with speakers unseen during training, i.e. under zero-shot settings.
A Microsoft research team addresses this issue in the new paper Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, presenting VALL-E, the first language-model-based TTS system with strong in-context learning capabilities. VALL-E achieves state-of-the-art personalized speech synthesis quality via prompting in a zero-shot setting, significantly surpassing the previous best zero-shot TTS system on the LibriSpeech and VCTK benchmarks.

The team summarizes their main contributions as follows:
- We propose VALL-E, the first TTS framework with in-context learning capabilities as strong as GPT-3’s. Its prompt-based approach enables zero-shot TTS without the additional structure engineering, pre-designed acoustic features, or fine-tuning required in previous work.
- We build a TTS system that generalizes across speakers by leveraging a huge amount of semi-supervised data, suggesting that simply scaling up semi-supervised data has been an underestimated direction for TTS.
- VALL-E can produce diverse outputs for the same input text while preserving the acoustic environment and the speaker’s emotion of the acoustic prompt.
- We verify that VALL-E synthesizes natural speech with high speaker similarity by prompting in the zero-shot scenario. Evaluation results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system on LibriSpeech and VCTK.

The researchers’ goal was to train a TTS system capable of generating speech for unseen speakers by treating zero-shot TTS as a conditional codec language modelling problem. They trained VALL-E on 60,000 hours of unlabelled English audiobook speech, a training set hundreds of times larger than those used by existing TTS systems.
VALL-E is designed to generate an acoustic code matrix conditioned on a phoneme sequence and an acoustic prompt matrix, extracting content information from the former and speaker information from the latter. At inference time, VALL-E synthesizes high-quality speech for a given phoneme sequence using only a 3-second recording of the unseen speaker as the acoustic prompt.
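For readers who want a concrete picture of what this “acoustic code matrix” looks like, the sketch below encodes a short speaker prompt into discrete codec tokens using the open-source EnCodec codec, the neural codec employed in the paper. The API calls and the file name are illustrative assumptions that may differ across package versions; none of this is taken from Microsoft’s (unreleased) VALL-E code.

```python
# Illustrative sketch (not the VALL-E release): encode a short speaker prompt into
# the discrete acoustic codes that VALL-E models as a conditional language model.
# Assumes the open-source `encodec` package; "prompt_3s.wav" is a placeholder file.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks per 75 Hz frame

wav, sr = torchaudio.load("prompt_3s.wav")  # the 3-second enrolment recording
wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)

with torch.no_grad():
    frames = codec.encode(wav.unsqueeze(0))        # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)  # acoustic code matrix, shape [1, 8, T]
```

Here `codes` plays the role of the acoustic prompt matrix described above: T frames, each represented by 8 codebook indices. Conditioned on this matrix and the target phoneme sequence, VALL-E’s autoregressive stage predicts the first codebook of the new utterance and a non-autoregressive stage fills in the remaining seven, after which the codec decoder converts the predicted codes back into a waveform.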
In their empirical experiments, the team compared VALL-E with the state-of-the-art zero-shot TTS model YourTTS (Casanova et al., 2022), evaluating speaker similarity between the prompt and the synthesized speech, naturalness, and synthesis robustness.

In the evaluations, VALL-E outperformed the baseline YourTTS model on all metrics, demonstrating strong in-context learning capabilities in zero-shot scenarios. The team notes that VALL-E can also preserve a prompt’s emotion in the zero-shot setting and generate diverse outputs under different sampling-based decoding runs.
The paper Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
