Electrifying an entire dance club is easy if you have killer moves like John Travolta in Saturday Night Fever. But for the rest of us, not so much. We may shake our butts and swing our arms, but let’s face it: some people just can’t dance. But now there’s hope, thanks to AI.
Researchers at the University of California, Berkeley have proposed a simple motion transfer method that can make a klutz dance like Travolta. The model takes a source dance video as input, along with a few minutes of footage of the target individual performing a range of motions. It then synthesizes a smooth dance video featuring the target subject. “Everybody Dance Now” indeed!
To transfer motion between human subjects in different videos, the researchers used an end-to-end pixel-based pipeline. The process requires a mapping between images of the two individuals. Since the source and target subjects likely differ in body size and shape, the method relies on keypoint-based poses, which comprehensively encode body position.
The researchers designed pose stick figures as the intermediate representation between source and target, which allows the model to be trained in a supervised way. At inference time, pose stick figures extracted from the source video are fed into the trained model, which outputs images of the target in the same pose as the source. In this way, the source’s motions can be fully transferred to the target.
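The supervised setup can be sketched as follows: pose detection is run on the target’s own footage, each detected pose is rendered as a stick figure, and that stick figure is paired with the real frame it came from as a training example. This is an illustrative outline, not the paper’s code; the function names and interfaces here are assumptions.

```python
def make_training_pairs(target_frames, pose_detector, render_stick):
    """Hypothetical construction of supervised (input, target) pairs:
    detect the pose in each frame of the target's own video, render it
    as a stick figure, and pair it with the real frame it came from."""
    pairs = []
    for frame in target_frames:
        keypoints = pose_detector(frame)   # estimate joint coordinates
        stick = render_stick(keypoints)    # draw the pose stick figure
        pairs.append((stick, frame))       # (model input, ground truth)
    return pairs
```

Because the stick figure is derived from the frame itself, every training example has a perfectly aligned ground-truth image, which is what makes supervised learning possible here.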
The researchers divided the pipeline into three stages: pose detection, global pose normalization, and mapping from normalized pose stick figures to images of the target subject.
The method’s pre-trained pose detector accurately estimates joint coordinates, from which the pose stick figure representation is drawn. The global pose normalization stage accounts for differences between the source and target figures’ body proportions and their locations within the frame. Lastly, the system maps the normalized pose stick figures to synthesized images of the target using a modified adversarial training setup.
A separately trained GAN adds realism to the target’s face. The researchers also enforced temporal smoothness in the generated video by conditioning each frame’s prediction on the previously generated frame.
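The temporal conditioning idea can be sketched as a simple generation loop: each output frame is produced from the current pose stick figure together with the frame generated at the previous step, which discourages flicker between consecutive frames. The generator interface below is an assumption for illustration.

```python
def synthesize_video(stick_frames, generator, init_frame):
    """Sketch of temporally smoothed synthesis (assumed interface):
    the generator sees both the current pose stick figure and the
    previously generated frame, so consecutive outputs stay coherent."""
    prev = init_frame
    video = []
    for pose in stick_frames:
        frame = generator(pose, prev)  # condition on the previous output
        video.append(frame)
        prev = frame                   # feed this frame into the next step
    return video
```

Generating each frame independently from its pose alone would tend to produce frame-to-frame jitter; threading the previous output through the generator is what gives the final video its smoothness.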
The Everybody Dance Now paper is available on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen