The Collins Dictionary named “fake news” as its Word of the Year for 2017, and AI has just made it a lot more believable. The Montreal Institute for Learning Algorithms (MILA) recently launched ObamaNet — a photo-realistic lip-sync neural network that can make anyone appear to be saying anything.
ObamaNet comprises three trainable neural modules: a text-to-speech network based on Char2Wav; a time-delayed LSTM that generates mouth keypoints synced to the audio; and a network based on Pix2Pix, which is emerging as a good general-purpose solution for image-to-image translation problems, that generates the video frames conditioned on those keypoints.
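To make the three-stage pipeline concrete, here is a minimal illustrative sketch in Python. The stage names follow the article, but every function body, tensor shape, and parameter here (e.g. 80 audio features per frame, 20 mouth keypoints, a 5-frame delay) is an invented placeholder, not the researchers' actual code.

```python
# Illustrative sketch of ObamaNet's three-stage pipeline.
# All shapes and stub logic are hypothetical placeholders.

def text_to_speech(text):
    """Stand-in for the Char2Wav-based TTS module: text -> audio frames."""
    # Hypothetical: one audio frame per character, 80 features each.
    return [[0.0] * 80 for _ in text]

def audio_to_keypoints(audio_frames, delay=5):
    """Stand-in for the time-delayed LSTM: audio -> mouth keypoints.

    The time delay lets the model 'look ahead' at upcoming audio before
    committing to a mouth shape for the current frame.
    """
    # Hypothetical: 20 (x, y) mouth keypoints per audio frame.
    padded = audio_frames + [audio_frames[-1]] * delay
    return [[(0.0, 0.0)] * 20 for _ in padded[delay:]]

def keypoints_to_frames(keypoints, target_video):
    """Stand-in for the Pix2Pix-based module: keypoints -> video frames,
    conditioned on an existing video of the target person."""
    return [{"base": target_video, "mouth": kp} for kp in keypoints]

def obamanet(text, target_video="weekly_address.mp4"):
    audio = text_to_speech(text)
    keypoints = audio_to_keypoints(audio)
    return keypoints_to_frames(keypoints, target_video)

frames = obamanet("Hello")
print(len(frames))  # one synthesized video frame per audio frame
```

The point of the sketch is the data flow: text becomes audio, audio drives mouth keypoints, and the keypoints condition an image-to-image network that rewrites only the mouth region of real footage.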
Using generative networks for images and videos is nothing new, and decades of work have gone into speech synthesis; ObamaNet combines the two. According to MILA researchers, their system “can be trained on any set of close shot videos of a person speaking, along with the corresponding transcript. The result is a system that generates speech from an arbitrary text and modifies the mouth area of an existing video accordingly, so that it looks natural and realistic.”
MILA researchers explain that they chose the former US President because “his videos are commonly used to benchmark lip-sync methods.” There is also more online video data available for public figures such as Obama. For the project, MILA extracted 17 hours of footage from Obama’s 300 weekly presidential addresses.
The concept of Obama as a benchmark is rooted in the paper Synthesizing Obama: Learning Lip Sync from Audio, published in July 2017 by Supasorn Suwajanakorn from the University of Washington’s Graphics and Imaging Laboratory (GRAIL). When Suwajanakorn uploaded his results to YouTube, the videos quickly attracted over 750,000 views.
GRAIL’s synthesized Obama video stirred up quite a discussion online. The model worked so well that many YouTube viewers were frightened, one warning “I see the first episode of Black Mirror has begun.” Others welcomed the videos: “It’s better that the public is aware of such technology than being oblivious. If we were ignorant we would believe things without thinking it was tampered.”
The Guardian described the project as “the future of fake news. We’ve long been told not to believe everything we read, but soon we’ll have to question everything we see and hear as well.”
MILA says their model differs from the GRAIL project because instead of a traditional computer vision model, it uses a neural network topped with a text-to-speech synthesizer. The lab created ObamaNet in conjunction with Lyrebird.ai, whose beta voice synthesis product allows users to generate text-to-speech files in their own “digital voice” after providing just one minute of sample speech.
While applications are still limited, projects like Synthesizing Obama and ObamaNet provide a glimpse of future possibilities for the tech, including the chilling prospect of just how realistic fake news may become.
Journalist: Meghan Han | Editor: Michael Sarazen