Humans build AI systems to deal with environments largely populated by our own kind, so it is not surprising that human pose estimation is one of the more popular research areas in computer vision. In the recent paper Contact and Human Dynamics from Monocular Video, a research team from Stanford University and Adobe Research proposes a new approach that combines learned pose estimation with physical reasoning through trajectory optimization to extract dynamically valid full-body motions from monocular video. The researchers say the approach produces motions that are visually and physically much more plausible than those from state-of-the-art methods.
Existing methods for human pose estimation from monocular video can estimate 2D and 3D kinematic poses. Their results, however, often still contain visible errors that violate physical constraints: the feet in a recovered motion may, for example, float slightly above the ground or penetrate into it. These errors can distort or prevent subsequent uses of the motion information.
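To make the kind of error described above concrete, here is a minimal sketch of how floating and penetrating feet could be flagged in a recovered motion. It is not taken from the paper: the function name `find_contact_violations`, the flat ground plane at height 0, the per-frame contact flags and the 1 cm tolerance are all assumptions made for illustration.

```python
from typing import List, Tuple


def find_contact_violations(
    foot_heights: List[float],
    in_contact: List[bool],
    tol: float = 0.01,
) -> List[Tuple[int, str]]:
    """Flag frames where a foot breaks simple ground constraints.

    Assumes a flat ground plane at height 0 (meters) and a hypothetical
    per-frame flag saying whether the foot should be touching the ground.
    """
    violations = []
    for frame, (height, contact) in enumerate(zip(foot_heights, in_contact)):
        if height < -tol:
            # The foot is below the ground plane, whether or not it is in contact.
            violations.append((frame, "penetration"))
        elif contact and height > tol:
            # The foot is labeled as in contact but hovers above the ground.
            violations.append((frame, "floating"))
    return violations


if __name__ == "__main__":
    heights = [0.0, 0.03, -0.02, 0.2]
    contacts = [True, True, True, False]
    print(find_contact_violations(heights, contacts))
```

A foot that is high above the ground during a swing phase is not an error, which is why the floating check only applies to frames labeled as contact.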
The researchers use the results of kinematic pose estimation techniques as input, focusing on single-person dynamic motions from dance, walking and sports. These techniques can produce accurate overall poses but struggle with contacts and dynamics. A physics-based trajectory optimization therefore enforces dynamics on the input motion, and the researchers leverage a reduced-dimensional body model with centroidal dynamics and contact constraints to produce physically valid motions that closely match the inputs.
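As a rough illustration of what a centroidal model reasons about, the sketch below applies Newton's second law to the body's center of mass (COM): mass times COM acceleration equals the sum of contact forces plus gravity. This is only the linear part of centroidal dynamics and is not the authors' implementation; the function name `com_acceleration`, the y-up coordinate convention and the example values are assumptions.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Assumed convention: y axis points up, units are meters and seconds.
GRAVITY: Vec3 = (0.0, -9.81, 0.0)


def com_acceleration(mass: float, contact_forces: Sequence[Vec3]) -> Vec3:
    """Linear COM dynamics: m * a = sum(contact forces) + m * g."""
    total = [0.0, 0.0, 0.0]
    for force in contact_forces:
        for axis in range(3):
            total[axis] += force[axis]
    return (
        total[0] / mass + GRAVITY[0],
        total[1] / mass + GRAVITY[1],
        total[2] / mass + GRAVITY[2],
    )


if __name__ == "__main__":
    mass = 70.0  # hypothetical body mass in kilograms
    # Standing still: the ground pushes up with exactly the body's weight.
    print(com_acceleration(mass, [(0.0, mass * 9.81, 0.0)]))
    # Airborne: no contact forces, so the COM is in free fall.
    print(com_acceleration(mass, []))
```

The second case shows why contacts matter: with no foot on the ground, no force is available to counter gravity, so a motion whose COM hovers in the air without contact cannot be physically valid under this model.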
As in previous work, the researchers assume that there is no camera motion and that the full body is visible in order to recover full-body motion. Under these assumptions, the approach can recover highly dynamic motions without sacrificing physical accuracy.
The researchers conducted extensive qualitative and quantitative evaluations of the contact estimation and motion optimization methods. The results show that the proposed method significantly improves the realism of inferred motions over state-of-the-art methods, and that it also estimates various physical properties that might be useful for future inference of scene properties and for action recognition.
The team also identifies some limitations. For example, the optimization is computationally expensive: the physical optimization process can take from 30 minutes to one hour for just a two-second (69 frames) video clip. The researchers hope to find a more efficient implementation to speed up execution.
The paper Contact and Human Dynamics from Monocular Video is on arXiv, and a project page is also available.
Analyst: Yuqing Li | Editor: Michael Sarazen; Yuan Yuan