An Apple research team has published a study showing that data from speakers of languages other than the target language can be used to improve the voice quality of text-to-speech (TTS) systems.
The quality of synthetic speech has improved dramatically with the development of neural networks, but this progress has come at the cost of heavy compute and huge amounts of training data. The Apple researchers set out to improve voice quality when data in the target language is limited, and to enable more efficient cross-lingual synthesis.
In their paper Combining Speakers of Multiple Languages to Improve Quality of Neural Voices, the researchers explore multiple architectures and training procedures to develop a multi-speaker and multi-lingual neural TTS system. Their large-scale study constructs a novel neural TTS model by combining speech from 30 speakers from 15 locales in 8 different languages. Test results show that for the vast majority of voices, the proposed multi-lingual and multi-speaker model yields better overall quality than single-speaker models.
The study addresses three questions: a) How effective is it to combine speakers from different languages compared with training only on the data of the target speaker; b) What types of model architectures and training protocols yield the best quality when using multi-lingual data; and c) To what extent can voices created in this way speak other languages included in the training data?
The researchers conducted a number of large-scale experiments to answer these questions. The base architecture of their proposed model is Tacotron2. The input is a sequence of phones (basic speech segments used in phonetic analysis, generally vowels or consonants) and punctuation marks, and the output is a sequence of 80-dimensional mel-spectrogram features. The encoder, comprising three 1D-CNNs and one bi-LSTM layer, converts the sequence of phone-IDs into a sequence of 512-dimensional vectors. The attention mechanism is stepwise monotonic attention, and the decoder comprises two LSTMs followed by one feed-forward (FF) layer, which decodes the mel-spectrograms to generate the output.
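The encoder stage described above can be sketched in PyTorch. This is a minimal, hypothetical reconstruction based only on the components named in the article (three 1D-CNNs and one bi-LSTM mapping phone-IDs to 512-dimensional vectors); the kernel size, vocabulary size, and other hyperparameters are assumptions following common Tacotron2 defaults, not figures confirmed by the paper.

```python
import torch
import torch.nn as nn

class PhoneEncoder(nn.Module):
    """Sketch of a Tacotron2-style encoder: phone embeddings pass through
    three 1-D convolutions and one bidirectional LSTM, yielding one
    512-dimensional vector per input phone."""

    def __init__(self, n_phones=100, dim=512):
        super().__init__()
        self.embed = nn.Embedding(n_phones, dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=5, padding=2) for _ in range(3)]
        )
        # Bidirectional LSTM: dim // 2 per direction -> dim total per frame
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phone_ids):                    # (batch, time)
        x = self.embed(phone_ids).transpose(1, 2)    # (batch, dim, time)
        for conv in self.convs:
            x = torch.relu(conv(x))                  # padding keeps time length
        x = x.transpose(1, 2)                        # (batch, time, dim)
        out, _ = self.lstm(x)                        # (batch, time, dim)
        return out

enc = PhoneEncoder()
seq = torch.randint(0, 100, (1, 12))                 # a 12-phone utterance
print(enc(seq).shape)                                # torch.Size([1, 12, 512])
```

In the full model, these per-phone vectors would feed the stepwise monotonic attention module, which in turn drives the two-LSTM decoder producing the 80-dimensional mel-spectrogram frames.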
The team built two variants based on this architecture: the first uses a 16-dimensional residual variational auto-encoder (resVAE) to normalize differences between utterances that cannot be explained by the input; the second adds speaker and language embeddings. Notably, unlike one-hot encodings, these learned speaker embeddings can take different values for each utterance.
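One way the second variant's conditioning can work is to broadcast a learned speaker vector and a learned language vector over the time axis and concatenate them onto each encoder frame. The sketch below is an assumed illustration of that mechanism, not the paper's implementation; the embedding dimension and the concatenation strategy are made-up choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, n_languages = 30, 8     # speaker/language counts from the study
emb_dim, enc_dim, T = 64, 512, 12   # emb_dim is an assumed hyperparameter

# Learned lookup tables: one trainable row per speaker and per language
speaker_table = rng.normal(size=(n_speakers, emb_dim))
language_table = rng.normal(size=(n_languages, emb_dim))

def condition(encoder_out, speaker_id, language_id):
    """Concatenate speaker and language embeddings onto every encoder frame."""
    spk = np.tile(speaker_table[speaker_id], (encoder_out.shape[0], 1))
    lang = np.tile(language_table[language_id], (encoder_out.shape[0], 1))
    return np.concatenate([encoder_out, spk, lang], axis=-1)

enc_out = rng.normal(size=(T, enc_dim))              # 12 encoder frames
cond = condition(enc_out, speaker_id=3, language_id=1)
print(cond.shape)                                    # (12, 640)
```

Swapping the speaker ID at synthesis time while keeping the language ID fixed is what makes cross-lingual synthesis possible in models conditioned this way.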
The team ran two subjective evaluations: "inlingual" synthesis (where the synthesized language matches that of the target voice) and cross-lingual synthesis (where the voice speaks a language other than its own).
The model was trained on 8 languages and 30 crowdsourced voices from 15 locales (Australia, India, Ireland, South Africa, the UK and the US for English; Germany; Italy; etc.). The total number of listeners per voice was around 120 for the inlingual experiments and 140 for the cross-lingual experiments, and the evaluation metric was a five-point mean opinion score (MOS).
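For readers unfamiliar with the metric, a MOS is simply the arithmetic mean of listener ratings on a 1 (bad) to 5 (excellent) scale. The ratings below are invented for illustration and are not the paper's data.

```python
# Each listener rates one synthesized utterance from 1 (bad) to 5 (excellent);
# the mean opinion score (MOS) is the average across all listeners.
ratings = [5, 4, 4, 3, 5, 4, 2, 5, 4, 4]

mos = sum(ratings) / len(ratings)
print(round(mos, 2))  # 4.0
```

In the study, each voice's MOS was averaged over roughly 120 listeners (inlingual) or 140 listeners (cross-lingual), which helps smooth out individual rating noise.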
Compared to a single-speaker model, the proposed multi-speaker model produced significantly better quality in most cases while requiring less than 40 percent of the speaker data. In cross-lingual synthesis, the MOS of the proposed models averaged around 80 percent of the MOS obtained by inlingual single-speaker native voices.
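To make the 80 percent figure concrete: if a native single-speaker voice scored an assumed MOS of 4.2 (a hypothetical number, not from the paper), the corresponding cross-lingual voice would land around 3.4 on the five-point scale.

```python
# Illustrative arithmetic only; the 4.2 native MOS is an assumed figure.
native_mos = 4.2
cross_lingual_mos = 0.80 * native_mos  # ~80% of the native voice's score
print(round(cross_lingual_mos, 2))     # 3.36
```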
Overall, the results confirm that fine-tuning a multi-lingual and multi-speaker model can produce equal or better quality than single-speaker models. The team believes the insights provided by this study will be useful for researchers and practitioners who are developing synthetic voices.
The paper Combining Speakers of Multiple Languages to Improve Quality of Neural Voices is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang