Microsoft’s SpeechX: A Leap in Versatile Generative Speech Synthesis

In a new paper SpeechX: Neural Codec Language Model as a Versatile Speech Transformer, a Microsoft research team presents SpeechX, a versatile, robust, and extensible speech generation model that is capable to address zero-shot TTS and various speech transformation tasks, handling both clean and noisy signals.

Generative speech models leveraging audio-text prompts have paved the way for exceptional advancements in zero-shot text-to-speech synthesis. Yet, these models still grapple with diverse challenges, particularly when tasked with transforming input speech across varied audio-text-based speech generation scenarios.

To address these challenges, in a new paper SpeechX: Neural Codec Language Model as a Versatile Speech Transformer, a Microsoft research team presents SpeechX, a versatile, robust, and extensible speech generation model that is capable to address zero-shot TTS and various speech transformation tasks, handling both clean and noisy signals.

The proposed SpeechX is built upon VALL-E, which leverages the Transformer-based neural codec language model – EnCodec to generate neural codes conditioned on textual and acoustic prompts. More specifically, SpeechX uses autoregressive (AR) to output the neural codes of the first quantization layer of EnCodec and non-auto-regressive (NAR) Transformer models to produce the neural codes of all the layers above the first layer. The combination of these two models provides a reasonable trade-off between generation flexibility and inference speed.

To enable SpeechX to handle multiple tasks, the researchers adopt task-based prompting, which incorporates additional tokens in the multi-task learning setup, where the tokens collectively control what task to be executed. As a result, SpeechX is able to acquire knowledge of diverse tasks, facilitating a versatile and highly extensible speech generation process.

In their empirical study, the team compared SpeechX to the baseline expert models on various tasks, e.g. noise suppression, target speaker extraction, zero-shot TTS, clean speech editing, speech removal and etc. SpeechX achieves comparable and even superior performance to baseline models across various tasks. The team believes their work is an important step toward unified generative speech models.

See https://aka.ms/speechx for demo samples. The paper SpeechX: Neural Codec Language Model as a Versatile Speech Transformer on arXiv.

Author: Hecate He | Editor: Chain Zhang

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

11 comments on “Microsoft’s SpeechX: A Leap in Versatile Generative Speech Synthesis”

Pingback: Microsoft’s SpeechX: A Leap in Versatile Generative Speech Synthesis – Ai Headlines
AdamRock

2023-10-28

In fact, it’s quite interesting! And there are many ways to find the right tools, I decided to do a little research and eventually found Microsoft office for Mac. Thanks to this, I got acquainted with Microsoft products and now I know exactly how to access productivity and development tools. I think it would be very useful for many other people as well!

Loading...

Reply
avanorman

2023-12-26

For trustworthy exterior deep cleaning services in Canberra and the surrounding regions, turn to Canberra Pressure Cleaning at https://www.canberrapressurecleaning.com.au/ Our use of state-of-the-art equipment and superior pressure cleaning methods makes us stand out, ensuring exceptional results for diverse surfaces, regardless of the size of the external cleaning endeavor.

Loading...

Reply
Yunkods

2025-07-10

You’re seeking a quick exhibition match or a challenging season to climb the ranks.
baseball 9

Loading...

Reply
- Jenny
  
  2025-07-22
  
  With each clever obstacle I overcame in Moto X3M, I felt more relaxed and content, like I was in sync with the game’s rhythm and my own growing skill.
  
  Loading...
  
  Reply
Jeny

2025-07-22

With each clever obstacle I overcame in Moto X3M, I felt more relaxed and content, like I was in sync with the game’s rhythm and my own growing skill.

Loading...

Reply
image describer

2025-09-25

Impressive tech! Microsoft’s SpeechX marks a major leap in generative speech synthesis—versatile, expressive, and eerily human-like. The future of voice AI is here!
free kawaii coloring pages

Loading...

Reply
andanwale

2025-11-03

Incredible technology! When it comes to expressive, adaptable, and uncannily human-like generative voice synthesis, Microsoft’s SpeechX represents a giant leap forward. Welcome to the future of artificial voice! rocket fortress

Loading...

Reply
MotoX3m

2026-02-24

I love you writing such great stuff for us. I am confident that a large number of people will profit from approach Moto X3m.

Loading...

Reply
Ragdoll Hit

2026-05-27

Multi-task learning on zero-shot TTS is wild.Ragdoll Hit

Loading...

Reply
Crutch Manufacturer

2026-06-07

We actually tested similar neural codec setups in a noisy facility next to a heavy-duty Crutch Manufacturer, and zero-shot synthesis usually fails terribly in that ambient environment.

Loading...

Reply