Today’s text-to-speech (TTS) systems have made tremendous progress in synthesizing high-quality speech from raw acoustic data. These systems nevertheless generalize poorly: their performance drops dramatically on speakers unseen during training, i.e. in zero-shot settings.
A Microsoft research team addresses this issue in the new paper Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, presenting VALL-E, the first language-model-based TTS system with strong in-context learning capabilities. VALL-E achieves state-of-the-art personalized speech synthesis quality via prompting in a zero-shot setting, significantly surpassing the previous state-of-the-art zero-shot TTS system on the LibriSpeech and VCTK benchmarks.

The team summarizes their main contributions as follows:
- We propose VALL-E, the first TTS framework with in-context learning capabilities as strong as GPT-3’s. It enables prompt-based approaches to zero-shot TTS that do not require additional structure engineering, pre-designed acoustic features, or fine-tuning as in previous work.
- We build a generalized TTS system in the speaker dimension by leveraging a huge amount of semi-supervised data, suggesting that simply scaling up semi-supervised data has been underestimated for TTS.
- VALL-E is able to provide diverse outputs for the same input text while preserving the acoustic environment and the speaker’s emotion of the acoustic prompt.
- We verify that VALL-E synthesizes natural speech with high speaker similarity by prompting in the zero-shot scenario. Evaluation results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system on LibriSpeech and VCTK.

The researchers’ goal was to train a TTS system capable of generating speech for unseen speakers by treating zero-shot TTS as a conditional codec language modelling problem. They trained VALL-E on a corpus of 60,000 hours of unlabelled speech from English audiobooks, a training set hundreds of times larger than those used by existing TTS systems.
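The conditional codec language modelling view can be written compactly as below; the notation is adapted from the paper and intended only as a paraphrase of the objective, not a verbatim quote:

```latex
% x        : phoneme sequence of the input text (content condition)
% \tilde{C}: discrete acoustic code matrix of the short enrolled prompt (speaker condition)
% C        : discrete acoustic code matrix of the target speech to be generated
\max_{\theta}\; p\!\left(C \mid x, \tilde{C}; \theta\right)
```

In other words, instead of regressing continuous acoustic features, the model predicts the discrete codes of a neural audio codec, conditioned on the phonemes and the prompt’s codes.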
VALL-E is designed to generate an acoustic code matrix conditioned on a phoneme sequence and an acoustic prompt matrix, extracting content information from the phoneme sequence and speaker information from the acoustic prompt. At inference time, VALL-E synthesizes high-quality speech for a given phoneme sequence, a procedure that requires only a 3-second sample recording of the unseen speaker as the acoustic prompt.
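The overall inference flow can be sketched as follows. This is a minimal illustration of the pipeline described above, not the authors’ code: the `codec`, `phonemizer`, and `model` interfaces are hypothetical stand-ins (the paper builds on an EnCodec-style residual codec with 8 codebooks).

```python
import torch

def synthesize(text: str, prompt_wav: torch.Tensor, codec, phonemizer, model) -> torch.Tensor:
    """Generate speech for `text` in the voice of a ~3-second `prompt_wav`."""
    # 1. Content condition: convert the input text into a phoneme sequence.
    phonemes = phonemizer(text)                            # e.g. ["HH", "AH0", "L", "OW1", ...]

    # 2. Speaker condition: encode the short prompt into discrete acoustic codes
    #    (a T x 8 matrix when using an 8-codebook residual codec such as EnCodec).
    prompt_codes = codec.encode(prompt_wav)                # shape: (T_prompt, 8)

    # 3. Conditional codec language modelling: predict the target code matrix,
    #    conditioned on the phonemes (content) and the prompt codes (speaker).
    target_codes = model.generate(phonemes, prompt_codes)  # shape: (T_target, 8)

    # 4. Decode the predicted codes back into a waveform with the codec decoder.
    return codec.decode(target_codes)
```

Because step 3 is sampling-based, repeated calls with the same text can yield the diverse outputs the authors describe, while the prompt codes carry over the speaker’s voice, emotion, and acoustic environment.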
In their empirical experiments, the team compared VALL-E with the SOTA zero-shot TTS model YourTTS (Casanova et al., 2022) on speaker similarity between prompt and synthesized speech, naturalness, and synthesis robustness.

In the evaluations, VALL-E outperformed the baseline YourTTS model on all metrics, demonstrating strong in-context learning capabilities in zero-shot scenarios. The team notes that VALL-E can also preserve a prompt’s emotion in a zero-shot setting and generate diverse outputs under different sampling-based decoding processes.
The paper Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen


The idea of high-quality, zero-shot TTS from just a 3-second speech sample shows how powerful in-context learning has become for speech models.