Not so long ago, text-to-speech (TTS) outputs were disappointingly deadpan and robotic. In recent years, deep neural networks have dramatically transformed TTS, enabling models to condition on factors such as stress and intonation and produce far higher-quality, more humanlike results. However, contemporary TTS models still perform best when dealing with a specific speaker in a specific language. Cross-lingual speech synthesis, which aims to transfer the characteristics of a user's voice from one language to another, has remained relatively underexplored. That just changed.
In the new paper Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling, a Microsoft research team presents VALL-E X, a simple yet effective cross-lingual neural codec language model that inherits strong in-context learning capabilities from the VALL-E TTS model and demonstrates high-quality zero-shot cross-lingual speech synthesis performance.
The team summarizes their main contributions as follows:
- We develop a cross-lingual neural codec language model, VALL-E X, with large-scale multilingual, multi-speaker, multi-domain unclean speech data.
- The multilingual in-context learning framework enables VALL-E X to generate cross-lingual speech while maintaining the unseen speaker's voice, emotion, and speech background, prompted by only a single sentence in the source language.
- Based on its learned cross-lingual speech modelling ability and the introduced language ID, VALL-E X can generate native-sounding speech for any speaker and significantly reduce the foreign accent problem, a well-known issue in cross-lingual speech synthesis tasks.
- We apply VALL-E X to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experiments show that VALL-E X beats strong baselines in terms of speaker similarity, speech quality, translation quality, speech naturalness, and human evaluation.
The proposed VALL-E X is built upon VALL-E, a neural codec language model Microsoft introduced in January that demonstrates strong in-context learning capabilities and achieves state-of-the-art TTS synthesis performance. This study extends VALL-E to enable zero-shot cross-lingual TTS and speech-to-speech translation (S2ST).
The team first extracts multilingual speech-transcription data from ASR (automatic speech recognition) datasets or pseudo-labelled speech data. They then employ a rule-based grapheme-to-phoneme (G2P) converter to turn the transcriptions into phoneme sequences, and an offline neural codec encoder to convert the speech into acoustic tokens. Finally, they train a multilingual conditional language model on the paired phoneme and acoustic-token sequences of each language.
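The data-preparation steps above can be sketched in miniature. This is an illustrative toy, not the paper's code: the lookup-table G2P, the nearest-neighbour "codec" quantizer, and all function names are hypothetical stand-ins for the real rule-based G2P tool and offline neural codec encoder.

```python
import numpy as np

# Toy grapheme-to-phoneme table standing in for a real rule-based G2P converter.
G2P = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def text_to_phonemes(text):
    return [p for word in text.lower().split() for p in G2P[word]]

def speech_to_acoustic_tokens(waveform, codebook, frame=4):
    """Stand-in for an offline neural codec encoder: quantize each frame
    of the waveform to the index of its nearest codebook vector."""
    n = len(waveform) // frame
    frames = np.asarray(waveform[: n * frame]).reshape(n, frame)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).tolist()

def make_training_pair(text, waveform, codebook, lang_id):
    """Pair a phoneme sequence (prefixed with a language-ID token, as
    VALL-E X does) with acoustic tokens for conditional LM training."""
    phonemes = [f"<{lang_id}>"] + text_to_phonemes(text)
    tokens = speech_to_acoustic_tokens(waveform, codebook)
    return phonemes, tokens

codebook = np.random.default_rng(0).normal(size=(8, 4))  # 8 toy codec entries
wave = np.random.default_rng(1).normal(size=32)          # toy "speech" signal
phonemes, tokens = make_training_pair("hello world", wave, codebook, "en")
print(phonemes)     # ['<en>', 'HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
print(len(tokens))  # 8 acoustic tokens (32 samples / frame of 4)
```

In the actual system the quantizer would be a trained neural codec (EnCodec-style) and the paired sequences would train a large autoregressive Transformer; here the pairing structure is the point.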
The trained VALL-E X is thus able, prompted by a single sentence spoken in the source language, to generate high-quality cross-lingual speech that maintains the source speaker's voice characteristics, emotion, and speech background.
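Conceptually, that prompting step concatenates the source and target phoneme sequences (each with a language-ID token) and uses the source acoustic tokens as a voice prompt, then decodes target acoustic tokens autoregressively. The sketch below is a hedged illustration under those assumptions; `build_prompt`, `generate`, and the dummy model are all hypothetical, standing in for the real trained codec language model.

```python
def build_prompt(src_phonemes, src_tokens, tgt_phonemes):
    """Concatenate source/target phonemes (each with a language-ID token)
    with the source acoustic tokens acting as the voice prompt."""
    return (["<zh>"] + src_phonemes + ["<en>"] + tgt_phonemes, src_tokens)

def generate(phoneme_seq, prompt_tokens, step_fn, max_len=16):
    """Autoregressively extend the acoustic-token sequence; step_fn stands
    in for the trained model's next-token prediction."""
    tokens = list(prompt_tokens)
    for _ in range(max_len):
        nxt = step_fn(phoneme_seq, tokens)
        if nxt is None:  # end-of-sequence
            break
        tokens.append(nxt)
    return tokens[len(prompt_tokens):]  # only the newly generated tokens

# Dummy "model": echo codes from the growing sequence, then stop after 5.
def dummy_step(phonemes, tokens, _state={"n": 0}):
    _state["n"] += 1
    return tokens[_state["n"] % 3] if _state["n"] <= 5 else None

seq, prompt = build_prompt(["n", "i3"], [4, 7, 2], ["HH", "AH"])
out = generate(seq, prompt, dummy_step)
print(out)  # [7, 2, 4, 7, 2] — five generated target acoustic tokens
```

The generated acoustic tokens would then be passed to the codec decoder to synthesize the target-language waveform in the prompt speaker's voice.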
In their empirical study, the team evaluated VALL-E X on zero-shot cross-lingual TTS and zero-shot S2ST. In the evaluations, VALL-E X surpassed the strong baselines with higher speaker similarity scores, lower word error rates, higher BLEU scores and better speech naturalness.
This paper introduces a promising model with strong potential for cross-lingual speech synthesis. The debut version of VALL-E X was trained on large-scale multi-speaker speech-transcription data in Chinese and English, but the researchers plan to expand their approach with additional data and languages in the future.
Audio samples are here, and the code is available on the project’s GitHub. The paper Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.