Famed British mathematician Jacob Bronowski once said “the hand is the cutting edge of the mind.” From offering a friendly handshake to putting up fists, the whole gamut of human emotions and behaviours can be conveyed through hand gestures. The high expressive power of human hands is naturally of interest to machine learning researchers in the fields of human-computer interaction, social artificial intelligence, and robotics.
Existing monocular motion capture methods, however, tend to focus on overall body motion, while current hand motion capture approaches largely overlook the body. Now, researchers from The Chinese University of Hong Kong, Facebook Reality Labs, and Facebook AI Research have unveiled FrankMocap, a state-of-the-art monocular 3D motion capture method that estimates both 3D hand and body motions from in-the-wild monocular inputs faster and more accurately than previous approaches.
The method comprises two regression modules that independently predict the 3D poses of the hands and the body from a single RGB image input, followed by an integration module that combines the two modules' outputs into a whole-body pose.
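The regress-then-integrate pipeline can be sketched roughly as follows. This is a minimal illustration with stub functions and dummy parameter vectors, not FrankMocap's actual implementation; the module names, parameter sizes, and dictionary keys are all hypothetical.

```python
import numpy as np

def body_module(image):
    """Regress body pose parameters from an RGB image (stub).

    In FrankMocap this is a deep regressor; here we return dummy
    axis-angle parameters for the body and the two wrists.
    """
    return {"body_pose": np.zeros(63), "wrist_pose": np.zeros(6)}

def hand_module(image):
    """Regress 3D pose parameters for each hand from the same image (stub)."""
    return {"left_hand_pose": np.zeros(45), "right_hand_pose": np.zeros(45)}

def integrate(body_out, hand_out):
    """Combine the two modules' outputs into one whole-body parameter vector."""
    return np.concatenate([
        body_out["body_pose"],
        body_out["wrist_pose"],
        hand_out["left_hand_pose"],
        hand_out["right_hand_pose"],
    ])

image = np.zeros((224, 224, 3))  # placeholder RGB input
whole_body = integrate(body_module(image), hand_module(image))
print(whole_body.shape)          # (159,)
```

The key design point is that both modules emit parameters in a shared format, so integration reduces to stitching compatible vectors together rather than converting between incompatible representations.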
The 3D hand motion capture outputs were designed to efficiently integrate with monocular body motion capture outputs to produce whole body motion results in a unified parametric model structure. “A main idea of our approach is to make the outputs from body module and hand module as compatible as possible, enabling us to efficiently integrate the outputs for whole body motion capture,” the researchers explain in the paper FrankMocap: Fast Monocular 3D Hand and Body Motion Capture by Regression and Integration.
The researchers demonstrated FrankMocap's fast and accurate performance on various real-world monocular videos, including a real-time demo, and compared their method with previous hand and whole-body motion capture methods through ablation studies.
They tested their hand module against previous SOTA hand approaches on three public hand benchmarks — the Stereo Hand Pose Tracking Benchmark (STB), the Rendered Hand Dataset (RHD), and the MPII+NZSL dataset. They calculated the percentage of correct keypoints (PCK) under different error thresholds and the corresponding area under the curve (AUC).
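The PCK and AUC metrics above can be computed as in the following sketch. The keypoint data here is synthetic, and the thresholds are illustrative, not the benchmarks' actual settings.

```python
import numpy as np

# Synthetic ground-truth and predicted 2D keypoints:
# 50 frames, 21 hand keypoints each, in pixel coordinates.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 100, size=(50, 21, 2))
pred = gt + rng.normal(0, 5, size=gt.shape)   # predictions with pixel noise

# Per-keypoint Euclidean error in pixels.
errors = np.linalg.norm(pred - gt, axis=-1)

# PCK: fraction of keypoints whose error falls under each threshold.
thresholds = np.linspace(0, 30, 31)
pck = np.array([(errors <= t).mean() for t in thresholds])

# AUC: normalized area under the PCK-vs-threshold curve (trapezoidal rule).
auc = np.sum((pck[1:] + pck[:-1]) / 2 * np.diff(thresholds))
auc /= thresholds[-1] - thresholds[0]
print(f"PCK@20px = {pck[20]:.3f}, AUC = {auc:.3f}")
```

Because PCK is a cumulative measure, the curve is non-decreasing in the threshold, and a higher AUC summarizes better localization across all tolerances at once.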
The proposed method outperforms all baselines on RHD and MPII+NZSL and shows comparable performance on STB. Notably, it achieves significantly better 2D localization accuracy on the challenging in-the-wild MPII+NZSL dataset, demonstrating its ability to generalize to unstructured scenarios.
The researchers also note a few limitations of their method — it still requires bounding boxes to infer 3D body and hands and struggles to estimate hand pose when the hands are too close together. The researchers believe addressing these issues in future studies will enable the model to handle cases of multiple people interacting with each other, and eventually enable machines to better understand what we mean when our hands do the talking.
The paper FrankMocap: Fast Monocular 3D Hand and Body Motion Capture by Regression and Integration is on arXiv.
Reporter: Yuan Yuan | Editor: Michael Sarazen