Diffusion models have garnered significant recognition for their outstanding performance across a wide range of image and audio generation tasks. Text-to-speech (TTS) systems employing diffusion models have proven their mettle by delivering high-fidelity speech on par with state-of-the-art systems. Nonetheless, many existing TTS systems share common weaknesses, such as heavy reliance on the quality of intermediate features and complex deployment, training, and setup procedures.
In a new paper E3 TTS: Easy End-to-End Diffusion-based Text to Speech, a Google research team proposes Easy End-to-End Diffusion-based Text to Speech. This streamlined and efficient text-to-speech model hinges solely on diffusion to preserve temporal structure, allowing it to accept plain text as input and generate audio waveforms directly.
The E3 TTS model takes plain text as input and operates in a non-autoregressive manner, producing the entire waveform in a single pass rather than sample by sample. The architecture consists of two primary modules:
- A pretrained BERT model extracts relevant information from the input text.
- A diffusion UNet model processes the BERT output, iteratively refining the initial noisy waveform to predict the final raw waveform.
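The two-module flow can be sketched as a text encoder followed by an iterative denoising loop. The shapes, the toy "denoiser", and all function names below are illustrative assumptions for exposition, not the paper's actual networks:

```python
import numpy as np

def encode_text(text, dim=8):
    """Stand-in for the pretrained BERT encoder: one vector per token."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal((len(text.split()), dim))

def denoise_step(x, cond):
    """Toy stand-in for the diffusion UNet: predicts noise to remove,
    conditioned (crudely) on the text embedding."""
    return 0.1 * (x - cond.mean())

def sample_waveform(text, n_samples=24000, steps=10, seed=0):
    cond = encode_text(text)
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)   # start from pure noise
    for _ in range(steps):               # iterative refinement
        x = x - denoise_step(x, cond)
    return x

wave = sample_waveform("hello world")
```

The key structural point is that the loop refines a full-length noisy waveform at every step, which is what lets the model emit audio directly without an intermediate acoustic representation.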
E3 TTS leverages recent advances in large language models, relying on text representations from a pretrained BERT model. Unlike prior approaches that require intermediate representations such as phonemes or graphemes, E3 TTS depends solely on a pretrained text language model. This model can be trained on multiple languages using text data alone, which streamlines the system and broadens its versatility.
The U-Net structure comprises a sequence of downsampling and upsampling blocks linked by residual connections. To enhance information extraction from the BERT output, the team incorporates cross-attention in the top downsampling/upsampling blocks. In the lower blocks, an adaptive softmax Convolutional Neural Network (CNN) kernel is employed, with its kernel size determined by the timestep and speaker. In the other layers, speaker and timestep embeddings are combined through Feature-wise Linear Modulation (FiLM), which includes a composite layer that predicts channel-wise scaling and bias.
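FiLM conditioning itself is simple: a conditioning vector (here, concatenated timestep and speaker embeddings) is projected to a per-channel scale and bias that modulate the feature map. The weight shapes and names below are illustrative assumptions, not E3 TTS's exact parameterization:

```python
import numpy as np

def film(features, cond, w_scale, w_bias):
    """Feature-wise Linear Modulation.
    features: (channels, length); cond: (cond_dim,)."""
    scale = w_scale @ cond               # per-channel scale, (channels,)
    bias = w_bias @ cond                 # per-channel bias, (channels,)
    return features * scale[:, None] + bias[:, None]

rng = np.random.default_rng(0)
channels, length, cond_dim = 4, 16, 6
features = rng.standard_normal((channels, length))
cond = np.concatenate([rng.standard_normal(3),   # timestep embedding
                       rng.standard_normal(3)])  # speaker embedding
out = film(features, cond,
           rng.standard_normal((channels, cond_dim)),
           rng.standard_normal((channels, cond_dim)))
```

Because the scale and bias apply uniformly along the time axis, FiLM injects global conditioning (who is speaking, which diffusion step) without disturbing local temporal structure.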
The downsampler plays a crucial role in refining the noisy input, converting the 24 kHz waveform into a sequence whose length is comparable to that of the encoded BERT output, which significantly improves overall quality. The upsampler, in turn, predicts noise of the same length as the input waveform.
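The length matching can be illustrated with simple average pooling: a 24 kHz waveform is reduced to a frame sequence of roughly the same length as the text encoding, so audio frames and text tokens can be aligned by cross-attention. The pooling scheme and target length here are assumptions for illustration, not the model's learned downsampler:

```python
import numpy as np

def downsample(waveform, target_len):
    """Average-pool a 1-D waveform down to target_len frames."""
    factor = len(waveform) // target_len
    trimmed = waveform[: factor * target_len]   # drop any remainder
    return trimmed.reshape(target_len, factor).mean(axis=1)

one_second = np.random.default_rng(0).standard_normal(24000)  # 24 kHz audio
frames = downsample(one_second, target_len=100)  # text-scale sequence
```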
Empirical evidence demonstrates that E3 TTS can generate high-fidelity audio, approaching the performance of state-of-the-art neural TTS systems. Furthermore, it enables various zero-shot tasks, such as speech editing and prompt-based generation.
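Zero-shot speech editing is a natural fit for diffusion samplers: at every denoising step, unedited regions are clamped to the reference audio, so only the masked region is regenerated. The toy denoising update below is an assumption; E3 TTS's actual network is the diffusion UNet described above:

```python
import numpy as np

def edit(reference, mask, steps=10, seed=0):
    """Mask-based editing sketch.
    mask == 1 marks samples to regenerate; 0 keeps the reference."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(reference.shape)
    for _ in range(steps):
        x = x - 0.1 * x                        # toy denoising update
        x = np.where(mask == 1, x, reference)  # clamp unedited samples
    return x

ref = np.ones(1000)                 # stands in for reference audio
mask = np.zeros(1000)
mask[200:400] = 1                   # regenerate only this span
edited = edit(ref, mask)
```

Because the clamping happens inside the sampling loop rather than after it, the regenerated span stays consistent with its surrounding context.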
In summary, this work underscores the remarkable capabilities of E3 TTS in generating high-quality audio directly from BERT features. It simplifies the design of end-to-end TTS systems and has proven to deliver impressive results in experiments.
Author: Hecate He | Editor: Chain Zhang