
Google’s Zero-Shot Cross-Lingual Voice Transfer for Dysarthric Speakers

In recent years, Voice Transfer (VT) technology has made notable strides, particularly in applications such as Text-to-Speech (TTS), Voice Conversion (VC), and Speech-to-Speech Translation. However, achieving high-quality zero-shot or one-shot voice transfer, especially for unseen speakers, remains a significant challenge.

In a new paper, Zero-shot Cross-lingual Voice Transfer for TTS, a Google research team presents a new VT module that integrates seamlessly into a multilingual TTS system, enabling voice transfer across languages.

At a high level, the team's approach works as follows:

The model itself is a joint speech-text framework with both feature-to-text (F2T) and text-to-feature (T2F) components, jointly optimized on data from both Automatic Speech Recognition (ASR) and TTS systems. The input speech is processed via UTF-8 byte-based hidden vectors to produce an embedding tensor that captures the key acoustic, phonetic, and prosodic features of the reference speech.
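The paper's encoder details are not reproduced in this summary, but given the GST-named variants evaluated later (SegmentGST, SharedGST, MultiGST), a Global-Style-Token-style attention pooling is a plausible reading. The PyTorch sketch below is illustrative only: GSTStyleEncoder, all dimensions, and the pooling choice are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GSTStyleEncoder(nn.Module):
    """Illustrative Global-Style-Token-style pooling (not the paper's code).

    A bank of learned style tokens attends over the hidden vectors of the
    reference speech and returns a fixed-size embedding tensor summarizing
    its acoustic, phonetic, and prosodic character.
    """
    def __init__(self, hidden_dim=256, num_tokens=10, embed_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, embed_dim))
        self.attn = nn.MultiheadAttention(
            embed_dim, num_heads=4, kdim=hidden_dim, vdim=hidden_dim,
            batch_first=True)

    def forward(self, ref_hidden):
        # ref_hidden: (batch, frames, hidden_dim) reference-speech hidden vectors
        query = self.tokens.unsqueeze(0).expand(ref_hidden.size(0), -1, -1)
        style, _ = self.attn(query, ref_hidden, ref_hidden)
        return style.mean(dim=1)  # (batch, embed_dim) embedding tensor

# Example: pool 200 frames of 256-dim reference features into one embedding.
encoder = GSTStyleEncoder()
embedding = encoder(torch.randn(2, 200, 256))  # -> shape (2, 128)
```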

The embedding tensor is passed through a bottleneck layer that constrains the embedding space, keeping it continuous and well-formed. The choice of bottleneck has been shown to have a substantial effect on the Mean Opinion Score (MOS) and on the model's ability to preserve the speaker's voice.
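As a rough sketch of the bottleneck idea, one simple way to constrain the embedding to a small, bounded, continuous space is a low-dimensional linear projection followed by a tanh squashing. The Bottleneck class and its dimensions below are assumptions for illustration; the paper compares several bottleneck choices, and its exact design may differ.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Illustrative bottleneck: down-project and squash with tanh.

    Mapping the embedding into a small, bounded range is one simple way to
    keep the embedding space continuous and well-formed; the paper's exact
    bottleneck design may differ.
    """
    def __init__(self, embed_dim=128, bottleneck_dim=16):
        super().__init__()
        self.down = nn.Linear(embed_dim, bottleneck_dim)

    def forward(self, embedding):
        # embedding: (batch, embed_dim) from the speaker/style encoder
        return torch.tanh(self.down(embedding))  # (batch, bottleneck_dim)
```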

A residual adapter is then introduced between two consecutive layers in the duration and feature predictor blocks. It combines the output of the bottleneck layer with that of the preceding layer. This design provides two key benefits: it makes the model modular, so the VT module can be enabled or disabled independently without affecting the core TTS system, and it permits dynamic loading of parameters at runtime.
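The adapter mechanism described above might look roughly like the following sketch, where a projection of the bottlenecked embedding is added to the hidden activations of the preceding layer. VTResidualAdapter and its signature are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VTResidualAdapter(nn.Module):
    """Illustrative residual adapter between two predictor layers.

    Adds a projection of the bottlenecked VT embedding to the hidden
    activations of the preceding layer. When disabled (or when its weights
    are simply not loaded), it reduces to the identity, leaving the core
    TTS system untouched.
    """
    def __init__(self, hidden_dim=256, bottleneck_dim=16, enabled=True):
        super().__init__()
        self.proj = nn.Linear(bottleneck_dim, hidden_dim)
        self.enabled = enabled

    def forward(self, hidden, vt_embedding):
        # hidden: (batch, time, hidden_dim) output of the preceding layer
        # vt_embedding: (batch, bottleneck_dim) from the bottleneck
        if not self.enabled:
            return hidden  # identity: VT module off, core TTS unaffected
        return hidden + self.proj(vt_embedding).unsqueeze(1)
```

Because the adapter contributes only an additive residual, its weights could in principle be loaded or dropped at runtime without retraining or modifying the core predictor stack, which is consistent with the modularity benefit described above.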

Empirical results back up the design. The SegmentGST approach achieved the highest MOS, averaging 3.9, with a speaker similarity score of 73% when typical speech references were used across nine languages. When only atypical speech samples were available, the SharedGST and MultiGST variants excelled, achieving 80% speaker similarity and a word error rate of just 2.7% across the evaluated conditions.

These findings suggest that the proposed method is not only effective in typical voice transfer scenarios but also shows promise in restoring the voices of individuals with dysarthria or other speech challenges, using atypical speech samples.

The paper Zero-shot Cross-lingual Voice Transfer for TTS is on arXiv.


Author: Hecate He | Editor: Chain Zhang
