
Microsoft’s SpeechX: A Leap in Versatile Generative Speech Synthesis

In the new paper SpeechX: Neural Codec Language Model as a Versatile Speech Transformer, a Microsoft research team presents SpeechX, a versatile, robust, and extensible speech generation model capable of addressing zero-shot TTS and various speech transformation tasks while handling both clean and noisy signals.

Generative speech models leveraging audio-text prompts have paved the way for exceptional advancements in zero-shot text-to-speech synthesis. Yet, these models still grapple with diverse challenges, particularly when tasked with transforming input speech across varied audio-text-based speech generation scenarios.

To address these challenges, in the new paper SpeechX: Neural Codec Language Model as a Versatile Speech Transformer, a Microsoft research team presents SpeechX, a versatile, robust, and extensible speech generation model capable of addressing zero-shot TTS and various speech transformation tasks while handling both clean and noisy signals.

The proposed SpeechX is built upon VALL-E, a Transformer-based neural codec language model that generates EnCodec neural codes conditioned on textual and acoustic prompts. More specifically, SpeechX uses an autoregressive (AR) Transformer model to output the neural codes of the first EnCodec quantization layer and a non-autoregressive (NAR) Transformer model to produce the neural codes of the remaining quantization layers. The combination of these two models provides a reasonable trade-off between generation flexibility and inference speed.
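To make the AR/NAR split concrete, the toy PyTorch sketch below walks through the two-stage decoding flow: an AR model emits the first-codebook codes frame by frame, and a NAR model then fills in the remaining codebooks in parallel. The module definitions, codebook sizes, and conditioning are simplified assumptions for illustration, not the authors' implementation.

```python
# Toy sketch of a VALL-E-style AR + NAR codec language model pipeline.
# All names, sizes, and conditioning choices here are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CODEBOOKS = 8      # EnCodec residual quantization layers (assumed)
CODEBOOK_SIZE = 1024   # entries per codebook (assumed)
DIM = 256

class ToyCodecLM(nn.Module):
    """Tiny Transformer standing in for the AR or NAR codec language model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK_SIZE, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, CODEBOOK_SIZE)

    def forward(self, codes):                 # codes: (batch, time)
        return self.head(self.encoder(self.embed(codes)))  # (batch, time, vocab)

ar_model = ToyCodecLM()                        # predicts the first codebook, token by token
nar_model = ToyCodecLM()                       # predicts the remaining codebooks in parallel

prompt = torch.randint(0, CODEBOOK_SIZE, (1, 10))  # stand-in for text/acoustic prompt tokens

# 1) AR stage: generate first-layer codes autoregressively.
seq = prompt
for _ in range(20):                            # generate 20 frames
    logits = ar_model(seq)[:, -1]              # next-frame distribution
    next_code = logits.argmax(-1, keepdim=True)
    seq = torch.cat([seq, next_code], dim=1)
first_layer = seq[:, prompt.size(1):]

# 2) NAR stage: predict each remaining codebook for all frames at once,
#    conditioned (crudely, for this sketch) on the previously generated layer.
layers = [first_layer]
for _ in range(1, NUM_CODEBOOKS):
    logits = nar_model(layers[-1])             # one parallel pass per codebook
    layers.append(logits.argmax(-1))

codes = torch.stack(layers, dim=1)             # (batch, codebooks, frames) -> EnCodec decoder
print(codes.shape)                             # torch.Size([1, 8, 20])
```

The AR loop gives the model flexibility over utterance length and content, while the parallel NAR passes keep inference fast across the remaining codebooks, which is the trade-off the paragraph above describes.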

To enable SpeechX to handle multiple tasks, the researchers adopt task-based prompting, which incorporates additional tokens into the multi-task learning setup; together, these tokens specify which task is to be executed. As a result, SpeechX is able to acquire knowledge of diverse tasks, facilitating a versatile and highly extensible speech generation process.
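As a rough illustration of task-based prompting, the sketch below prepends task-specific tokens to the conditioning sequence so that a single model can be steered toward different tasks. The token names and prompt layout are hypothetical and only meant to convey the idea; the paper's actual token inventory may differ.

```python
# Illustrative sketch of task-based prompting: special tokens prepended to the
# text/acoustic prompt select the task. Token names and layout are assumptions.
TASK_TOKENS = {
    "zero_shot_tts": ["<tts>"],
    "noise_suppression": ["<denoise>"],
    "target_speaker_extraction": ["<tse>"],
    "speech_editing": ["<edit>"],
    "speech_removal": ["<remove>"],
}

def build_prompt(task: str, text_tokens: list, acoustic_tokens: list) -> list:
    """Compose the conditioning sequence: task tokens, then text, then acoustic prompt."""
    return TASK_TOKENS[task] + text_tokens + ["<sep>"] + acoustic_tokens

# The same model is pointed at zero-shot TTS vs. noise suppression
# purely by swapping the leading task tokens.
print(build_prompt("zero_shot_tts", ["h", "e", "l", "l", "o"], ["a13", "a57"]))
print(build_prompt("noise_suppression", [], ["a13", "a57"]))
```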

In their empirical study, the team compared SpeechX to baseline expert models on various tasks, including noise suppression, target speaker extraction, zero-shot TTS, clean speech editing, and speech removal. SpeechX achieves comparable or even superior performance relative to the baseline models across these tasks. The team believes their work is an important step toward unified generative speech models.

See https://aka.ms/speechx for demo samples. The paper SpeechX: Neural Codec Language Model as a Versatile Speech Transformer is available on arXiv.


Author: Hecate He | Editor: Chain Zhang


