Dancing may come naturally to many people, but there is a lot more to it than we might imagine. A key challenge in training AI models to perform humanlike dancing is the high spatio-temporal complexity of modelling human motion dynamics. In recent years many researchers have worked to synthesize dance movements from music, but these attempts have tended to be limited to short-term dance generation of under 30 seconds.
Now, researchers from Fudan University and Microsoft have proposed a novel seq2seq architecture that generates dance sequences for music clips running a minute or longer.
The new model comprises a transformer-based music encoder and a recurrent dance decoder. The encoder first transforms low-level acoustic features of an input music clip into high-level representations. The decoder then uses a recurrent structure to predict frame-by-frame dance movements conditioned on the corresponding musical elements.
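The encoder-decoder flow above can be sketched in miniature. This is a minimal numpy illustration, not the authors' implementation: the real encoder is a transformer and the real decoder predicts 3D skeleton keypoints, but the frame-by-frame conditioning pattern is the same. All dimensions, weight names, and the single-layer recurrence here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(acoustic_feats, W_enc):
    """Map low-level acoustic features (T x d_in) to high-level
    representations (T x d_hid); stands in for the transformer encoder."""
    return np.tanh(acoustic_feats @ W_enc)

def decode(music_hidden, W_in, W_rec, W_out, d_pose):
    """Recurrent decoder: predict one pose frame per music frame,
    conditioned on the previous pose and the aligned music encoding."""
    T, d_hid = music_hidden.shape
    h = np.zeros(d_hid)
    pose = np.zeros(d_pose)            # initial (neutral) pose frame
    poses = []
    for t in range(T):
        x = np.concatenate([pose, music_hidden[t]])
        h = np.tanh(W_in @ x + W_rec @ h)
        pose = W_out @ h               # next pose frame (e.g. joint coordinates)
        poses.append(pose)
    return np.stack(poses)

# Toy sizes: 8 music frames, 20-dim acoustic features, 16-dim hidden state,
# 18-dim pose (e.g. 9 joints in 2D) -- all purely illustrative.
T, d_in, d_hid, d_pose = 8, 20, 16, 18
W_enc = rng.standard_normal((d_in, d_hid)) * 0.1
W_in  = rng.standard_normal((d_hid, d_pose + d_hid)) * 0.1
W_rec = rng.standard_normal((d_hid, d_hid)) * 0.1
W_out = rng.standard_normal((d_pose, d_hid)) * 0.1

music = rng.standard_normal((T, d_in))
dance = decode(encode(music, W_enc), W_in, W_rec, W_out, d_pose)
print(dance.shape)  # (8, 18): one pose frame per music frame
```

Because the decoder emits exactly one pose per music frame, the generated dance stays time-aligned with the music for arbitrarily long clips.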
The team used a local self-attention mechanism in the encoder to reduce the memory requirement for long sequence modelling. The mechanism enables the encoder to not only process long musical sequences efficiently but also model local musical characteristics such as chord progressions and rhythm patterns.
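The memory saving comes from restricting each frame's attention to a fixed neighbourhood: cost grows as O(T x window) rather than O(T^2). A simplified single-head sketch (the paper's actual mechanism uses learned query/key/value projections and multiple heads, which are omitted here):

```python
import numpy as np

def local_self_attention(x, window):
    """Each of the T positions attends only to neighbours within
    `window` frames on either side, capturing local musical structure
    such as rhythm patterns while keeping memory linear in T."""
    T, d = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        scores = x[lo:hi] @ x[t] / np.sqrt(d)      # scaled dot-product
        weights = np.exp(scores - scores.max())    # stable softmax
        weights /= weights.sum()
        out[t] = weights @ x[lo:hi]                # weighted local average
    return out

rng = np.random.default_rng(0)
feats = rng.standard_normal((120, 8))   # 120 music frames, 8-dim features
attended = local_self_attention(feats, window=5)
print(attended.shape)  # (120, 8)
```

With a window of 5, each frame mixes information from at most 11 neighbours, which is why the encoder can handle minute-long sequences without the quadratic memory of full self-attention.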
The researchers also proposed a dynamic auto-condition training strategy, a new curriculum learning method that alleviates the error accumulation problem in human motion prediction and thus facilitates the generation of longer dance sequences.
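The idea behind auto-conditioned training is to alternate between feeding the decoder ground-truth frames and its own predictions, so it learns to recover from its own errors; "dynamic" means the self-predicted span grows as training progresses. The sketch below shows only the feeding schedule, not the training loop, and the linear growth rule is an assumption for illustration (the paper's exact schedule may differ):

```python
def autocondition_mask(seq_len, gt_len, epoch, step_epochs):
    """Return a per-frame schedule: True = feed the decoder the
    ground-truth frame, False = feed it its own previous prediction.
    The predicted span grows by one frame every `step_epochs` epochs
    (a hypothetical linear curriculum)."""
    pred_len = min(seq_len, epoch // step_epochs + 1)
    mask = []
    t = 0
    while t < seq_len:
        mask.extend([True] * min(gt_len, seq_len - t))    # teacher-forced span
        t += gt_len
        mask.extend([False] * min(pred_len, max(0, seq_len - t)))  # self-fed span
        t += pred_len
    return mask[:seq_len]

# Early in training the model predicts only 1 frame at a time between
# ground-truth spans; later it must sustain longer self-fed stretches.
early = autocondition_mask(10, gt_len=2, epoch=0, step_epochs=2)
later = autocondition_mask(10, gt_len=2, epoch=3, step_epochs=2)
print(''.join('G' if m else 'P' for m in early))  # GGPGGPGGPG
print(''.join('G' if m else 'P' for m in later))  # GGPPGGPPGG
```

Gradually lengthening the self-fed spans exposes the decoder to its own compounding prediction errors during training, which is what keeps minute-long generated sequences from drifting or freezing.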
The proposed method was evaluated on automatic metrics against LSTM, Aud-MoCoGAN, and the state-of-the-art music-to-dance method Dancing2Music. The researchers also enlisted human evaluators to judge motion realism, dance smoothness, and even the style consistency of generated dances with their corresponding music clips.
The human evaluators were impressed, ranking the new approach above the baselines on motion realism, style consistency and smoothness. Even in comparisons with motion-capture recordings of real human dancers, 57.9 percent of the annotators preferred the new method's smoothness, 41.2 percent its motion realism, and 30.3 percent its style consistency.
The researchers say they will soon release a high-quality dataset with music and dance pairs along with their source code. In future work, they plan to consider the explicit modelling of style information in dance generation and incorporate additional dance styles.
The paper Dance Revolution: Long Sequence Dance Generation with Music via Curriculum Learning is on arXiv.
Journalist: Yuan Yuan | Editor: Michael Sarazen