AI Technology

Deep Learning Model Morphs VTube Talking Heads With a Few Mouse Clicks

A Google researcher has released a deep neural network model that makes animating a VTube persona a little easier.

Every day is Halloween for Virtual YouTubers or “VTubers” — the new generation of wildly popular online entertainers whose voices and actions are represented in real time by colourful and expressive anime characters. Now, a Google researcher has released a deep neural network model that makes animating a VTube persona a little easier.

Using motion capture systems to transfer human movements to cartoon characters in real time is a process that can be traced back to the 1990s. The approach, however, was not widely popularized, and the term “Virtual YouTuber” did not enter our vocabulary until the virtual character “Kizuna AI” debuted in 2016.

Kizuna is a cute young girl with wide eyes and a pink butterfly bow perched atop her long flowing hair — any otaku’s dream. She livestreams and records videos in various virtual environments for her 2.68 million YouTube subscribers. Although Kizuna communicates exclusively in Japanese, huge unofficial communities have emerged around the avid fans who contribute translations for her videos.

Recently released Kizuna AI music video

Anyone who wants to become a VTuber must first build a controllable character model with basic movement ability. Existing approaches require designing and creating a full 3D model of the character at an estimated cost of around US$5,000, or creating a 2D model for roughly one-tenth that price. Even in 2D, however, the modeler has to manually segment the character into multiple movable parts and then reassemble them using specialized software such as Live2D.

Google Japan software engineer Pramook Khungurn — himself a VTuber fan — set out to find a way to add variety and possibility to the presentation of VTube anime characters at a lower cost with the help of AI. He developed a neural network system that can morph a single image of an anime character’s face into a number of new basic poses.

Khungurn built a dataset specifically for the project, selecting some 8,000 characters created for the 3D animation software MikuMikuDance and rendering their faces under random poses to produce training data.
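The idea of rendering faces under random poses can be illustrated with a short sketch. The parameter names and value ranges below are assumptions chosen for illustration, not the project's actual training configuration:

```python
import random

# Hypothetical pose parameters and ranges -- illustrative only, not the
# project's real configuration.
def sample_training_pose(rng: random.Random) -> dict:
    """Sample one random face pose to render a training example from."""
    return {
        "left_eye_open": rng.uniform(0.0, 1.0),   # 0 = closed, 1 = open
        "right_eye_open": rng.uniform(0.0, 1.0),
        "mouth_open": rng.uniform(0.0, 1.0),
        "head_tilt": rng.uniform(-15.0, 15.0),    # degrees
        "head_rotate": rng.uniform(-15.0, 15.0),  # degrees
    }
```

Rendering each of the 8,000 characters under many such sampled poses yields paired (input image, pose, target image) examples for supervised training.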

The resulting network enables users to adjust facial tilt and rotation and to control the degree to which the character’s eyes and mouth are open. Khungurn developed separate face morpher and face rotator networks, where the rotator takes the morpher’s output as input to generate the basic poses and movements associated with a talking head.
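The two-stage data flow can be sketched in a few lines of plain Python. The `FacePose` fields, value ranges, and function signatures here are illustrative assumptions standing in for the trained networks, not the author's actual interface:

```python
from dataclasses import dataclass

# Hypothetical pose parameters; names and ranges are illustrative.
@dataclass
class FacePose:
    left_eye_open: float   # 0.0 = closed, 1.0 = fully open
    right_eye_open: float
    mouth_open: float      # 0.0 = closed, 1.0 = fully open
    head_tilt: float       # degrees, assumed range [-15, 15]
    head_rotate: float     # degrees, assumed range [-15, 15]

def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def normalize(pose: FacePose) -> FacePose:
    """Clamp each parameter to its assumed valid range."""
    return FacePose(
        left_eye_open=clamp(pose.left_eye_open, 0.0, 1.0),
        right_eye_open=clamp(pose.right_eye_open, 0.0, 1.0),
        mouth_open=clamp(pose.mouth_open, 0.0, 1.0),
        head_tilt=clamp(pose.head_tilt, -15.0, 15.0),
        head_rotate=clamp(pose.head_rotate, -15.0, 15.0),
    )

def animate(image, pose: FacePose, morpher, rotator):
    """Two-stage pipeline: morph facial features first, then rotate the head.

    `morpher` and `rotator` stand in for the trained networks; here they can
    be any callables with the signatures shown.
    """
    pose = normalize(pose)
    # Stage 1: the face morpher opens/closes the eyes and mouth.
    morphed = morpher(image, (pose.left_eye_open,
                              pose.right_eye_open,
                              pose.mouth_open))
    # Stage 2: the face rotator takes the morpher's output and turns the head.
    return rotator(morphed, (pose.head_tilt, pose.head_rotate))
```

The key design point the sketch captures is the composition order: the rotator never sees the original image directly, only the morpher's output.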

Finally, Khungurn connected the system to a face tracker, enabling the anime character to mimic facial movements in real time from either existing videos or in the real world livestream scenarios that have made VTubers the megastars they are in Japan and around the world.
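Driving the network from a face tracker means converting tracked landmarks into the pose parameters the network expects. One common way to estimate eye openness from landmarks is the eye-aspect-ratio heuristic (Soukupová and Čech, 2016); the sketch below uses it with illustrative thresholds, and is an assumption about how such a mapping could work rather than the author's implementation:

```python
import math

def eye_openness(eye_landmarks) -> float:
    """Estimate eye openness in [0, 1] from six (x, y) eye landmarks
    ordered around the eye, using the eye-aspect-ratio (EAR) heuristic.

    The 0.1 / 0.25 rescaling constants are illustrative assumptions.
    """
    p1, p2, p3, p4, p5, p6 = eye_landmarks

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # Ratio of the eye's vertical extent to its horizontal extent.
    ear = (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))
    # Typical EAR runs roughly 0.1 (closed) to 0.35 (open); rescale to [0, 1].
    return max(0.0, min(1.0, (ear - 0.1) / 0.25))
```

Per frame, values like this for each eye and the mouth, plus a head-pose estimate, would be packed into the pose vector fed to the networks.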

Transferring Barack Obama’s movements to anime characters

Khungurn believes that his system has additional application potential in the field of video game production.

The comprehensive Talking Head Anime from a Single Image project page is here.


Journalist: Yuan Yuan | Editor: Michael Sarazen
