Microsoft today announced a new neural text-to-speech synthesis system that makes computer voices nearly indistinguishable from human recordings. Neural TTS is currently available in preview through Azure Cognitive Services Speech Services.
The synthesis system uses deep neural networks to overcome the limits of traditional text-to-speech systems in matching the stress and intonation patterns of spoken language. Humanlike prosody and articulation can significantly reduce listening fatigue when people interact with AI systems.
Traditional text-to-speech systems split prosody modeling into separate linguistic analysis and acoustic prediction steps, each controlled independently. That can lead to a muffled or buzzy voice when speech units are synthesized into a computer voice. In contrast, Neural TTS performs prosody prediction and voice synthesis at the same time, which results in a more fluid and humanlike voice.
Microsoft’s milestone in text-to-speech synthesis will enable more natural and engaging interactions with chatbots and virtual assistants, with applications such as enhancing in-car navigation systems and converting digital texts into audiobooks.
Author: Herin Zhao | Editor: Michael Sarazen