Microsoft’s VALL-E 2: First Time Human Parity in Zero-Shot Text-to-Speech Achieved

Over the past decade, significant breakthroughs in speech synthesis have emerged, driven by the development of neural networks and end-to-end modeling. Last year, Microsoft introduced VALL-E, a neural codec language model capable of synthesizing high-quality personalized speech from just a 3-second recording of an unseen speaker. This model notably outperformed the state-of-the-art zero-shot text-to-speech (TTS) systems at the time.

Building on this progress, in a new paper, VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers, a Microsoft research team presents VALL-E 2, the latest advancement in neural codec language models. The work marks a milestone in zero-shot TTS synthesis, achieving human parity for the first time.

VALL-E 2, an evolution of its predecessor, employs a neural codec language modeling method for speech synthesis and introduces two significant enhancements: repetition-aware sampling and grouped code modeling.

Repetition-aware sampling improves upon the random sampling used in VALL-E by adaptively choosing between random and nucleus sampling for the token prediction at each time step. The choice is based on how often the token repeats in the decoding history, which stabilizes decoding and prevents the infinite-loop issue encountered in VALL-E.
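The paper does not publish reference code, but the idea can be sketched roughly as follows. This is a minimal illustration, not Microsoft's implementation: the function names, the repetition window, and the repeat threshold are all assumptions, and a real system would operate on model logits rather than a plain probability list.

```python
import random
from collections import Counter

def nucleus_sample(probs, top_p=0.9):
    """Sample from the smallest set of tokens whose cumulative
    probability mass reaches top_p (nucleus / top-p sampling)."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return random.choices(kept, weights=[probs[i] / mass for i in kept])[0]

def repetition_aware_sample(probs, history, window=10, max_repeats=3):
    """Default to nucleus sampling; if the sampled token already repeats
    too often in the recent decoding history, fall back to random
    sampling over the full distribution to break out of a loop."""
    token = nucleus_sample(probs)
    recent = history[-window:]
    if Counter(recent)[token] >= max_repeats:
        token = random.choices(range(len(probs)), weights=probs)[0]
    return token
```

The key design point is that the fallback is triggered per time step by the decoding history itself, so stable stretches of speech keep the quality benefits of nucleus sampling while degenerate repetition is actively disrupted.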

Grouped code modeling divides the codec codes into groups, each modeled in a single frame during the autoregressive (AR) modeling process. This approach accelerates inference by reducing sequence length and improves performance by addressing the long context modeling problem.
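The grouping step itself is simple to picture. The sketch below is an assumption-laden toy, not the paper's code: it shows only how a flat sequence of codec codes can be partitioned into fixed-size groups so the AR model predicts one group per step, cutting the modeled sequence length by the group size (the pad value of 0 is hypothetical).

```python
def group_codes(codes, group_size=2):
    """Partition a flat list of codec codes into fixed-size groups,
    padding the tail so every group is full. Each group is then
    modeled in a single AR frame, shortening the sequence."""
    pad = (-len(codes)) % group_size
    padded = codes + [0] * pad  # 0 used as a placeholder pad code
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]
```

With a group size of 2, a 10-second utterance's code sequence is modeled in half as many AR steps, which is where both the inference speedup and the relief on long-context modeling come from.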

Notably, VALL-E 2 requires only simple utterance-wise speech-transcription pair data for training, greatly simplifying data collection and processing. This facilitates scaling and streamlines the training pipeline.

Experiments on the LibriSpeech and VCTK datasets demonstrate that VALL-E 2 surpasses previous systems in terms of speech robustness, naturalness, and speaker similarity. It is the first model to achieve human parity on these benchmarks. Furthermore, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are complex or contain repetitive phrases.

Demos of VALL-E 2 will be posted to https://aka.ms/valle2. The paper VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers is on arXiv.


Author: Hecate He | Editor: Chain Zhang

