Microsoft’s VALL-E 2: First Time Human Parity in Zero-Shot Text-to-Speech Achieved

Over the past decade, significant breakthroughs in speech synthesis have emerged, driven by the development of neural networks and end-to-end modeling. Last year, Microsoft introduced VALL-E, a neural codec language model capable of synthesizing high-quality personalized speech from just a 3-second recording of an unseen speaker. This model notably outperformed the state-of-the-art zero-shot text-to-speech (TTS) systems at the time.

Building on this progress, a Microsoft research team presents VALL-E 2, the latest advancement in neural codec language models, in the new paper VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers. According to the authors, the model marks a milestone in zero-shot TTS synthesis by achieving human parity for the first time.

VALL-E 2, an evolution of its predecessor, employs a neural codec language modeling method for speech synthesis and introduces two significant enhancements: repetition-aware sampling and grouped code modeling.

Repetition-aware sampling improves upon the random sampling used in VALL-E by adaptively choosing between random and nucleus sampling at each decoding step. The choice depends on how much the predicted token repeats in the decoding history, which stabilizes decoding and prevents the infinite loop issue encountered in VALL-E.
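
The idea can be illustrated with a minimal sketch. It is one plausible reading of the description above, not the authors' implementation: the function names, the window size, the repetition threshold, and the top-p value are all assumptions made for illustration. The sketch draws a token with nucleus sampling first and falls back to random sampling over the full distribution when that token already dominates the recent history.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of top-ranked tokens whose mass reaches top_p."""
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample within the kept set, weighted by the original probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: List[int],
    rng: random.Random,
    top_p: float = 0.8,      # assumed value, for illustration only
    window: int = 10,        # assumed history window, for illustration only
    max_ratio: float = 0.5,  # assumed repetition threshold, for illustration only
) -> int:
    """Pick the next token, switching sampling strategy when the output repeats."""
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        # Fraction of the recent window already occupied by the candidate token.
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Too repetitive: resample from the full distribution instead.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

With a peaked distribution such as `[0.9, 0.05, 0.05]`, nucleus sampling alone would keep emitting token 0. Once the history is saturated with that token, the fallback gives the other tokens a chance to break the loop.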

Grouped code modeling divides the codec codes into groups, each modeled in a single frame during the autoregressive (AR) modeling process. This approach accelerates inference by reducing sequence length and improves performance by addressing the long context modeling problem.
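
The effect on sequence length can be shown with a small sketch. It covers only the partitioning step, not the model itself. The helper names, the group size, and the padding of a trailing incomplete group are assumptions for illustration, since the article does not say how leftover codes are handled.

```python
from typing import List


def group_codes(codes: List[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Partition a flat code sequence into consecutive groups of group_size.

    Each group stands for one step ("frame") of the autoregressive model.
    Padding the last group with pad_id is an assumption of this sketch.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    remainder = len(codes) % group_size
    if remainder:
        codes = codes + [pad_id] * (group_size - remainder)
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]


def ungroup_codes(groups: List[List[int]]) -> List[int]:
    """Flatten grouped codes back into a single sequence."""
    return [code for group in groups for code in group]


codes = [11, 12, 13, 14, 15, 16, 17, 18]
groups = group_codes(codes, group_size=2)
# Eight codes become four autoregressive steps with a group size of 2.
print(len(codes), "codes ->", len(groups), "steps:", groups)
```

With a group size of 2, the autoregressive model takes half as many steps over the same audio, which is where the speed-up and the shorter context come from.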

Notably, VALL-E 2 requires only simple utterance-wise speech-transcription pair data for training, which greatly simplifies data collection and processing and makes the approach easier to scale.

Experiments on the LibriSpeech and VCTK datasets demonstrate that VALL-E 2 surpasses previous systems in terms of speech robustness, naturalness, and speaker similarity. It is the first model to achieve human parity on these benchmarks. Furthermore, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are complex or contain repetitive phrases.

Demos of VALL-E 2 will be posted to https://aka.ms/valle2. The paper VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers is on arXiv.


Author: Hecate He | Editor: Chain Zhang

