Baidu AI Can Clone Your Voice in Seconds

Baidu’s research arm announced yesterday that its 2017 text-to-speech (TTS) system Deep Voice has learned how to imitate a person’s voice using a mere three seconds of voice sample data.

The technique, known as voice cloning, could be used to personalize virtual assistants such as Apple’s Siri, Google Assistant, Amazon Alexa and Baidu’s Mandarin virtual assistant platform DuerOS, which supports 50 million devices in China with human-machine conversational interfaces.

In healthcare, voice cloning has helped patients who lost their voices by building a duplicate. Voice cloning may even find traction in the entertainment industry and in social media as a tool for satirists.

Baidu researchers implemented two approaches: speaker adaptation and speaker encoding. Both deliver good performance with minimal audio input data, and both can be integrated into a multi-speaker generative model in the Deep Voice system using speaker embeddings, without degrading quality.

Speaker adaptation is a backpropagation-based approach that fine-tunes a multi-speaker generative model on a few audio samples from the target speaker. The fine-tuning can be applied to the whole model or to only the low-dimensional speaker embeddings. Speaker encoding, meanwhile, combines the multi-speaker generative model with a separate model that infers a new speaker embedding directly from the target speaker’s audio samples. This approach shortens cloning time to just a few seconds and requires only a small number of parameters to represent each speaker, making it favorable for low-resource deployment.

Speaker adaptation and speaker encoding approaches for training, cloning and audio generation. Courtesy of Baidu Research.
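
To make the difference between the two approaches concrete, here is a minimal toy sketch in Python. It is not Baidu’s code: the "multi-speaker model" is assumed to be a tiny fixed linear map, the encoder weights are hand-set stand-ins for a trained network, and every name (generate, adapt_embedding, encode_speaker) is hypothetical. It only illustrates that adaptation finds an embedding through repeated gradient steps, while encoding produces one in a single forward pass.

```python
# Toy illustration only. A frozen linear "multi-speaker model" maps a
# 2-dimensional speaker embedding to 3-dimensional audio features.
# All names and numbers are hypothetical, not taken from Deep Voice.

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # frozen generative model weights

# Stand-in for a separately trained speaker encoder (hand-set here to the
# pseudo-inverse of W so the toy example works without a training loop).
ENCODER = [[2 / 3, -1 / 3, 1 / 3], [-1 / 3, 2 / 3, 1 / 3]]


def generate(embedding):
    """Forward pass of the frozen model: features = W @ embedding."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in W]


def adapt_embedding(samples, steps=200, lr=0.1):
    """Speaker adaptation (embedding-only variant): gradient descent on the
    embedding to minimise mean squared error against the cloning samples."""
    emb = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        pred = generate(emb)
        for y in samples:
            for i, row in enumerate(W):
                err = pred[i] - y[i]
                for j in range(len(emb)):
                    # d/d emb[j] of (pred[i] - y[i])**2, averaged over samples
                    grad[j] += 2.0 * err * row[j] / len(samples)
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


def encode_speaker(samples):
    """Speaker encoding: one forward pass from cloning audio to embedding."""
    mean = [sum(col) / len(samples) for col in zip(*samples)]
    return [sum(w * m for w, m in zip(row, mean)) for row in ENCODER]


# A hypothetical new speaker and three cloning samples of that speaker.
true_embedding = [0.5, -1.0]
samples = [generate(true_embedding) for _ in range(3)]

adapted = adapt_embedding(samples)   # many optimisation steps
encoded = encode_speaker(samples)    # a single forward pass
print("adapted:", adapted)
print("encoded:", encoded)
```

In this toy setting both routes arrive at roughly the same embedding, but the adaptation route runs an optimisation loop for every new speaker while the encoding route does not, which mirrors the cloning-time advantage described above.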

Baidu has released multiple three-second cloned audio clips which track the process from original voices to synthesized voices that are strikingly similar.

Baidu is upbeat about the possibilities in the field of voice cloning research. For example, advances in meta-learning, a systematic approach to learning how to learn, could significantly boost voice cloning quality.

Baidu is not the only organization working on imitating human voices with AI. Google’s DeepMind, which produced the epoch-making Go program AlphaGo, introduced its TTS project WaveNet in 2016. The system models audio waveforms from real human voices and produces convincingly natural simulations. Adobe also unveiled prototype software called Project VoCo that can learn to mimic a voice from 20 minutes of audio. Last year, Montreal-based startup Lyrebird pushed voice cloning technology to the next level with a TTS system that required only a 60-second audio sample to deliver “a digital voice that sounds like you.”

The recent breakthroughs in synthesizing human voices have also raised concerns. AI could undermine trust in the voice as a marker of identity, both in everyday life and in security systems. For example, voice technology could be used maliciously against a public figure by creating false statements in their voice. A BBC reporter’s test with his twin brother also demonstrated the capacity for voice mimicking to fool voiceprint security systems.

Baidu’s Deep Voice has reduced training time and advanced the development of voice cloning, opening possibilities for improvements in virtual assistants, advances in healthcare solutions and applications in many other sectors.

Journalist: Tony Peng | Editor: Michael Sarazen