In recent years, Voice Transfer (VT) technology has made notable strides, particularly in applications such as Text-to-Speech (TTS), Voice Conversion (VC), and Speech-to-Speech Translation. However, achieving high-quality zero-shot or one-shot voice transfer, especially for unseen speakers, remains a significant challenge.
In a new paper, Zero-shot Cross-lingual Voice Transfer for TTS, a Google research team presents a VT module that integrates into a multilingual TTS system and enables voice transfer across languages.

The team summarizes their main contributions as follows:
- The team presents a zero-shot VT module that can easily be incorporated into advanced TTS systems. This module enables voice transfer from a previously unseen speaker using just a short reference speech sample, while maintaining high quality and fidelity.
- The VT module allows voice transfer even when the language of the input speech sample differs from the target language, showcasing its cross-lingual capabilities.
- Novel bottleneck layers are proposed, which significantly enhance the zero-shot TTS quality and speaker similarity, and their impact is thoroughly analyzed.
- The model demonstrates its ability to generate high-quality, authentic-sounding speech across languages, even from atypical input references. This can be especially beneficial for users who have not pre-recorded their voices or those who experience speech atypicalities. Audio and video samples accompanying the study highlight these capabilities.

The model itself is a joint speech-text framework, featuring both feature-to-text (F2T) and text-to-feature (T2F) components, which are jointly optimized using data from both Automatic Speech Recognition (ASR) and TTS systems. The input speech is processed using UTF-8 byte-based hidden vectors to create an embedding tensor. This tensor captures key acoustic, phonetic, and prosodic features of the reference speech.
The embedding tensor is passed through a bottleneck layer, which constrains the embedding space, ensuring it remains continuous and well-formed. The choice of this bottleneck has been shown to have a substantial effect on the Mean Opinion Score (MOS) and the model’s ability to preserve the speaker’s voice. A residual adapter is introduced between two consecutive layers in the duration and feature predictor blocks.
The residual adapter operates by combining the output of the bottleneck layer with that of the preceding layer. This design provides two key benefits: it makes the model modular, allowing the VT module to be independently enabled or disabled without affecting the core TTS system, and it permits dynamic loading of parameters during runtime.
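To make the data flow concrete, here is a minimal NumPy sketch of a bottleneck followed by a residual adapter. It is an illustration under stated assumptions, not the paper's implementation. The variant names SegmentGST, SharedGST and MultiGST suggest a Global Style Token design, so the bottleneck is modeled as attention over a small bank of tokens; that reading is an assumption. The class names `TokenBottleneck` and `ResidualAdapter`, the dimensions, and the random weights are all hypothetical.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


class TokenBottleneck:
    """Hypothetical GST-style bottleneck (an assumption, not the paper's code).

    The output is a convex combination of a fixed bank of tokens, so every
    voice embedding stays inside the region spanned by those tokens.
    """

    def __init__(self, num_tokens, dim, seed=0):
        rng = np.random.default_rng(seed)
        # These would be learned in a real system; random here for illustration.
        self.tokens = rng.normal(size=(num_tokens, dim))

    def __call__(self, reference_embedding):
        # Scaled dot-product attention of the reference against each token.
        scores = self.tokens @ reference_embedding / np.sqrt(self.tokens.shape[1])
        weights = softmax(scores)
        return weights @ self.tokens, weights


class ResidualAdapter:
    """Hypothetical adapter: adds a projected voice embedding to a layer output."""

    def __init__(self, embed_dim, hidden_dim, seed=1):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
        self.enabled = True  # switch the voice transfer path on or off

    def __call__(self, hidden, voice_embedding):
        if not self.enabled:
            # The core TTS path is left untouched when the adapter is disabled.
            return hidden
        # The projected embedding is broadcast over the time axis and added.
        return hidden + voice_embedding @ self.proj


# Toy inputs: a 16-dim reference embedding and 5 frames of 8-dim hidden states.
reference = np.random.default_rng(2).normal(size=16)
hidden = np.zeros((5, 8))

bottleneck = TokenBottleneck(num_tokens=4, dim=16)
adapter = ResidualAdapter(embed_dim=16, hidden_dim=8)

voice, weights = bottleneck(reference)
conditioned = adapter(hidden, voice)
```

Because the adapter only adds a residual term, setting `adapter.enabled = False` returns the preceding layer's output unchanged, which mirrors the modularity described above.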

The model’s performance has been validated through empirical results. The SegmentGST approach achieved the highest MOS, averaging 3.9, with a speaker similarity score of 73%, based on typical speech references across nine different languages. Furthermore, in cases where only atypical speech samples were available, both SharedGST and MultiGST models excelled, achieving 80% speaker similarity and a word error rate of just 2.7% across the evaluated conditions.
These findings suggest that the proposed method is not only effective in typical voice transfer scenarios but also shows promise in restoring the voices of individuals with dysarthria or other speech challenges, using atypical speech samples.
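For readers unfamiliar with the word error rate quoted above, the following sketch shows the standard definition: the word-level edit distance between a reference transcript and a recognizer's output, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative; the article does not say which ASR system or text normalization the authors used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four gives a WER of 25%.
wer = word_error_rate("please call me tomorrow", "please call we tomorrow")
```

A WER of 2.7% therefore corresponds to roughly one word error in every 37 reference words.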
The paper Zero-shot Cross-lingual Voice Transfer for TTS is on arXiv.
Author: Hecate He | Editor: Chain Zhang

This innovation in Voice Transfer is exciting! Zero-shot VT modules in multilingual TTS systems enable remarkable cross-language communication. The device can transfer voices from unseen speakers with a brief sample in excellent quality. Novel bottleneck layers to improve TTS and speaker similarity seem game-changing. Looking forward to improving application accessibility and user experience! Maintain your excellence, Team Google research!
This is inspiring news from Google! The development of a zero-shot, cross-lingual voice transfer module for Text-to-Speech (TTS) systems, especially with its demonstrated benefit for dysarthric speakers, is a significant breakthrough with immense potential for accessibility and inclusion.
Thanks for sharing this cool article about Google’s new voice transfer tech! It’s awesome they’re tackling challenges like helping dysarthric speakers with zero-shot translation across languages. Excited to see how this helps communication!
Amazing work! This cross-lingual voice transfer has huge potential for dysarthric speakers. Thank you for sharing!