DeepMind wowed the research community several years ago by defeating grandmasters in the ancient game of Go, and more recently saw its self-taught agents thrash pros in the video game StarCraft II. Now, the UK-based AI company has delivered another impressive innovation, this time in text-to-speech (TTS).
TTS systems take natural-language text as input and produce synthetic, human-like speech as output. The synthesis pipelines behind them are complex, typically comprising multiple processing stages: text normalisation, aligned linguistic featurisation, mel-spectrogram synthesis, raw audio waveform synthesis, and so on.
Although contemporary TTS systems such as those powering digital assistants like Siri boast high-fidelity speech synthesis and wide real-world deployment, even the best of them still have drawbacks. Each pipeline stage requires expensive “ground truth” annotations to supervise its outputs, and the systems cannot train directly from characters or phonemes as input to synthesized speech as output in the end-to-end manner increasingly favoured in other machine learning domains.
To address these issues, DeepMind researchers have developed EATS, a generative model trained adversarially in an end-to-end manner that achieves performance comparable to state-of-the-art (SOTA) models that rely on multi-stage training and additional supervision.
EATS (End-to-end Adversarial TTS) is tasked with mapping an input sequence of characters or phonemes to raw audio at 24 kHz. A critical real-world challenge is that the input text and output speech signals generally have very different lengths and are not aligned. EATS deals with this via two high-level submodules: an aligner, which predicts the duration of each input token and produces an audio-aligned representation, and a decoder, which upsamples the aligner’s output to the full audio frequency.
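The aligner idea can be sketched as follows: predict a duration for each token, take the cumulative sum to place each token on the output time axis, then softly interpolate token representations onto a fixed grid of output steps. This is a minimal NumPy illustration in the spirit of the paper's differentiable alignment; the function name, the Gaussian-weighting scheme, and all shapes here are illustrative assumptions, not DeepMind's actual implementation.

```python
import numpy as np

def align(token_reprs, durations, out_len, sigma=10.0):
    """Toy duration-based aligner (illustrative, not the real EATS code).

    token_reprs: (n_tokens, dim) array of per-token representations
    durations:   (n_tokens,) predicted durations, in output frames
    out_len:     number of audio-aligned output steps to produce
    """
    ends = np.cumsum(durations)            # cumulative end position of each token
    centres = ends - durations / 2.0       # centre position of each token
    t = np.arange(out_len)[:, None]        # output time grid, shape (out_len, 1)
    # Soft, differentiable weighting: each output step attends to nearby tokens
    # with a Gaussian kernel centred on the token's predicted position.
    logits = -((t - centres[None, :]) ** 2) / (2.0 * sigma ** 2)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ token_reprs           # (out_len, dim) audio-aligned features
```

Because the alignment is a smooth function of the predicted durations, gradients can flow from the audio output back into the duration predictor, which is what makes end-to-end training possible.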
Noteworthy points of the EATS model include:
- The entire generator architecture is differentiable, and is trained end-to-end.
- It is a feed-forward convolutional neural network, which makes it suitable for applications where fast batched inference is important.
- The adversarial approach enables the generator to learn from a relatively weak supervisory signal, significantly reducing the cost of annotations.
- It does not rely on autoregressive sampling or teacher forcing, avoiding issues like exposure bias and reduced parallelism at inference time, which makes it efficient in both training and inference.
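The last two points can be made concrete with a toy stand-in for the decoder. The real decoder is a learned convolutional network; the sketch below (all names and the 200 Hz / 120x figures are assumptions for illustration) just shows the structural property being claimed: every output sample is produced in one parallel pass, with no sample-by-sample autoregressive loop.

```python
import numpy as np

def decode_to_waveform(aligned_feats, factor=120):
    """Illustrative stand-in for a feed-forward decoder: upsample
    audio-aligned features (e.g. at 200 Hz) to 24 kHz raw audio via a
    fixed upsampling factor. The real model uses learned (transposed)
    convolutions; here we use nearest-neighbour repetition plus a fixed
    linear projection, purely to show the parallel, non-autoregressive
    structure."""
    upsampled = np.repeat(aligned_feats, factor, axis=0)        # (T * factor, dim)
    projection = np.ones(aligned_feats.shape[1]) / aligned_feats.shape[1]
    return upsampled @ projection                               # (T * factor,) waveform
```

Contrast this with an autoregressive model like WaveNet, which must generate each of the 24,000 samples per second conditioned on the previous ones; the feed-forward structure is what makes fast batched inference possible.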
Researchers evaluated EATS using Mean Opinion Score (MOS) to measure speech quality. In the tests, all models were trained on datasets of human speech performed by professional voice actors and their corresponding text. The voice pool comprised 69 North American English speakers.
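MOS itself is a simple metric: human listeners rate each audio clip on a scale from 1 (bad) to 5 (excellent), and the scores are averaged. A minimal sketch (the function name and sample ratings are illustrative):

```python
def mean_opinion_score(ratings):
    """MOS: the arithmetic mean of listeners' quality ratings on a
    1 (bad) to 5 (excellent) scale."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    return sum(ratings) / len(ratings)

print(mean_opinion_score([4, 5, 4, 4, 3]))  # → 4.0
```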
Compared to previous models, EATS requires substantially less supervision but still achieves an MOS of 4.083, approaching the level of SOTA methods like GAN-TTS and WaveNet, and substantially outperforming the paper’s ablated variants such as No RWDs, No MelSpecD, and No Discriminators.
The paper End-to-End Adversarial Text-to-Speech is on arXiv.
Author: Hecate He | Editor: Michael Sarazen & Yuan Yuan