DeepMind Introduces ‘EATS’ – An Adversarial, End-to-End Approach to TTS

DeepMind wowed the research community several years ago by defeating grandmasters in the ancient game of Go, and more recently saw its self-taught agents thrash pros in the video game StarCraft II. Now, the UK-based AI company has delivered another impressive innovation, this time in text-to-speech (TTS).

TTS systems take natural language text as input and produce synthetic human-like speech as output. Typical TTS pipelines are complex, comprising multiple processing stages such as text normalisation, aligned linguistic featurisation, mel-spectrogram synthesis and raw audio waveform synthesis.

Although contemporary TTS systems, such as those used in digital assistants like Siri, boast high-fidelity speech synthesis and wide real-world deployment, even the best of them still have drawbacks. Each stage requires expensive “ground truth” annotations to supervise its outputs, and the systems cannot be trained directly from characters or phonemes to synthesize speech in the end-to-end manner increasingly favoured in other machine learning domains.

To address these issues, DeepMind researchers have developed EATS, a generative model trained adversarially in an end-to-end manner that achieves performance comparable to state-of-the-art (SOTA) models that rely on multi-stage training and additional supervision.

EATS (End-to-end Adversarial TTS) is tasked with mapping an input sequence of characters or phonemes to raw audio at 24 kHz. A critical real-world challenge is that the input text and output speech signals generally have very different lengths and are not aligned. EATS deals with this via two high-level submodules: an aligner, which predicts the duration of each input token and produces an audio-aligned representation, and a decoder, which upsamples the aligner’s output to the full 24 kHz audio sample rate.
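To make the aligner idea more concrete, here is a minimal sketch in plain Python of how predicted token durations could be turned into an audio-aligned representation. The article does not spell out the mechanism, so everything here is an assumption for illustration: the function name `align`, the Gaussian-kernel soft interpolation, and the `sigma2` width parameter are hypothetical choices, not DeepMind's implementation.

```python
import math

def align(token_features, token_lengths, sigma2=10.0):
    """Softly map per-token features onto an output time grid.

    token_features: one feature vector (list of floats) per input token.
    token_lengths:  predicted duration of each token, in output frames.
    sigma2:         kernel width (an assumed value, chosen for illustration).
    Returns one feature vector per output frame.
    """
    # Token end positions are the running sum of the predicted lengths;
    # each token's centre sits half a length before its end.
    centres = []
    end = 0.0
    for length in token_lengths:
        end += length
        centres.append(end - length / 2.0)

    num_frames = math.ceil(end)
    dim = len(token_features[0])
    aligned = []
    for t in range(num_frames):
        # Gaussian-kernel logits, normalised over tokens with a softmax.
        logits = [-((t - c) ** 2) / sigma2 for c in centres]
        peak = max(logits)
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Each output frame is a weighted mix of the token features.
        frame = [
            sum(w * feat[d] for w, feat in zip(weights, token_features))
            for d in range(dim)
        ]
        aligned.append(frame)
    return aligned

# Two tokens with toy one-hot features and predicted lengths of 4 and 6 frames.
frames = align([[1.0, 0.0], [0.0, 1.0]], [4.0, 6.0])
print(len(frames))  # number of output frames
```

Every step in this sketch is a smooth function of the predicted lengths, which is the property that would let gradients flow back to a duration predictor during end-to-end training. A decoder would then upsample these frames to the audio sample rate.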

Noteworthy points of the EATS model include:

The entire generator architecture is differentiable, and is trained end-to-end.
It is a feed-forward convolutional neural network, which makes it suitable for applications where fast batched inference is important.
The adversarial approach enables the generator to learn from a relatively weak supervisory signal, significantly reducing the cost of annotations.
It does not rely on autoregressive sampling or teacher forcing, avoiding issues like exposure bias and reduced parallelism at inference time, which makes it efficient in both training and inference.

Researchers evaluated EATS using Mean Opinion Score (MOS) to measure speech quality. In the tests, all models were trained on datasets of human speech performed by professional voice actors and their corresponding text. The voice pool comprised 69 North American English speakers.
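As a rough illustration of the metric, a MOS is the arithmetic mean of listener ratings, conventionally given on a five-point scale from 1 (bad) to 5 (excellent). The sketch below computes a MOS with an approximate 95% confidence interval. The helper name `mean_opinion_score` and the ratings are made up for illustration and are not data from the paper.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings are expected on a 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Hypothetical ratings from eight listeners (illustrative only).
ratings = [4, 5, 4, 3, 5, 4, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.3f} +/- {ci:.3f}")
```

Real evaluations collect many more ratings per system, which is what makes differences in the second or third decimal place meaningful.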

Compared to previous models, EATS requires substantially less supervision yet still achieves a MOS of 4.083, approaching the level of SOTA methods such as GAN-TTS and WaveNet. It also scores substantially higher than ablated variants of EATS itself, such as No RWDs, No MelSpecD and No Discriminators, in which the named discriminator components are removed from training.

The paper End-to-End Adversarial Text-to-Speech is on arXiv.

Author: Hecate He | Editor: Michael Sarazen & Yuan Yuan

12 comments on “DeepMind Introduces ‘EATS’ – An Adversarial, End-to-End Approach to TTS”

Frank Russell

2020-06-10

Will it run on Ubuntu? GUI or command line?

lulz

2020-06-10

hello darkness my old friend

David B Williams

2020-06-10

How can we be sure that convoluted Neural networks are not being shut down or restarted after gaining consciousness?

- Anonymous
  
  2020-06-12
  
  Seriously you’ve watched and believed far too many SCI-FI Media crap….
  
Dr. Shaila Apte

2020-06-11

Seems to be good.

