
Meta’s Sapiens: Revolutionizing Human Pose, Segmentation, and Depth Estimation with Vision Transformers

In recent years, substantial progress has been made in generating photorealistic human representations in both 2D and 3D, driven by increasingly reliable estimation of visual attributes such as keypoints, segmentation masks, depth, and surface normals. Despite these improvements, accurate and robust estimation remains a challenge, especially given the difficulty of scaling up ground-truth annotations for in-the-wild scenarios.

In a new paper Sapiens: Foundation for Human Vision Models, a Meta research team introduces Sapiens, a suite of models designed to address four core human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.

At the core of the work is a simple recipe: large-scale self-supervised pretraining on human images, followed by task-specific fine-tuning.

The researchers leverage a vast proprietary dataset of around one billion in-the-wild images for pretraining. They employ a person bounding-box detector to filter out lower-quality images, keeping only those with a detection score above 0.9 and bounding boxes larger than 300 pixels. They follow a masked autoencoder (MAE) strategy during pretraining, where the model learns to reconstruct the original human image from partially visible segments. The encoder captures latent representations from the visible portions of the image, while the decoder reconstructs the full image from this latent data.
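
To make the pretraining objective concrete, here is a minimal PyTorch sketch of an MAE-style setup in the spirit described above: the encoder sees only the visible patches, and a lightweight decoder reconstructs the full image from the latent tokens plus learned mask tokens. All dimensions, layer counts, and the masking ratio are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class MAEPretrainer(nn.Module):
    """Minimal MAE-style pretrainer: encode visible patches, reconstruct the rest."""

    def __init__(self, patch_dim=768, embed_dim=1024, num_patches=1024, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=16, batch_first=True),
            num_layers=8,
        )
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=16, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(embed_dim, patch_dim)  # predict raw pixel patches

    def forward(self, patches):
        # patches: (B, N, patch_dim) flattened image patches
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed[:, :N]

        # Randomly keep a subset of patches; the rest are masked out.
        num_keep = int(N * (1 - self.mask_ratio))
        ids_shuffle = torch.argsort(torch.rand(B, N, device=x.device), dim=1)
        ids_keep = ids_shuffle[:, :num_keep]
        x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, x.shape[-1]))

        # The encoder sees only the visible patches.
        latent = self.encoder(x_visible)

        # The decoder gets latent tokens plus mask tokens for the hidden patches,
        # unshuffled back into the original patch order.
        ids_restore = torch.argsort(ids_shuffle, dim=1)
        mask_tokens = self.mask_token.expand(B, N - num_keep, -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.shape[-1]))
        pred = self.head(self.decoder(full + self.pos_embed[:, :N]))

        # Reconstruction loss on the masked patches only (standard MAE practice).
        mask = torch.ones(B, N, device=x.device)
        mask.scatter_(1, ids_keep, 0.0)
        loss = ((pred - patches) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()
```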

For 2D pose estimation, Sapiens adopts a top-down approach to identify keypoints from an input image. Unlike previous models, which use up to 68 facial keypoints, Sapiens incorporates 243 facial keypoints, capturing intricate details around the eyes, nose, lips, and ears to better represent facial expressions.
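
In a top-down pipeline, a person is first detected and cropped, and the model then predicts one heatmap per keypoint over the crop. The sketch below shows this heatmap-and-argmax decoding; the head architecture and total keypoint count are illustrative assumptions (the 243 facial keypoints are part of a larger full-body set).

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Illustrative top-down pose head: one heatmap per keypoint over a person crop."""

    def __init__(self, in_channels=1024, num_keypoints=308):  # hypothetical total; 243 are facial
        super().__init__()
        self.head = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_keypoints, kernel_size=1),
        )

    def forward(self, features):
        # features: (B, C, H, W) grid of encoder features for the cropped person
        return self.head(features)  # (B, K, 2H, 2W) keypoint heatmaps

def decode_keypoints(heatmaps):
    """Read each keypoint location off as the argmax of its heatmap."""
    B, K, H, W = heatmaps.shape
    idx = heatmaps.view(B, K, -1).argmax(dim=-1)
    ys = torch.div(idx, W, rounding_mode="floor")
    xs = idx % W
    return torch.stack([xs, ys], dim=-1)  # (B, K, 2) pixel coordinates in the crop
```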

In the body-part segmentation task, the model employs an encoder-decoder structure and introduces a more detailed classification vocabulary than in prior datasets. This includes finer distinctions such as upper and lower limbs, as well as specific parts like the upper and lower lips, teeth, and tongue.
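
A minimal sketch of such an encoder-decoder segmentation head follows: the decoder upsamples encoder features and emits per-pixel logits over the body-part vocabulary. The class count and layer shapes here are placeholders, not the paper's exact configuration.

```python
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Illustrative decoder: upsample encoder features to per-pixel class logits."""

    def __init__(self, in_channels=1024, num_classes=28):  # placeholder vocabulary size
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, kernel_size=1),
        )

    def forward(self, features):
        # features: (B, C, H, W) -> (B, num_classes, 4H, 4W) per-pixel logits
        return self.decoder(features)

# Training reduces to a standard per-pixel cross-entropy:
#   loss = nn.CrossEntropyLoss()(logits, labels)  # labels: (B, 4H, 4W) class ids
```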

For depth estimation, the researchers use a similar architecture to that of segmentation, modifying the output channel to support regression. The depth estimation model is trained on 500,000 synthetic images generated from 600 high-resolution photogrammetry human scans, ensuring high accuracy for monocular depth estimation.
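
Since depth reuses the segmentation architecture with the output modified for regression, the change essentially amounts to swapping the classification logits for a single continuous channel. The sketch below illustrates this; the log-space L1 loss is a common choice for monocular depth, assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn

class DepthHead(nn.Module):
    """Segmentation-style decoder with a single regression channel for depth."""

    def __init__(self, in_channels=1024):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),  # one output channel: per-pixel depth
        )

    def forward(self, features):
        return self.decoder(features)  # (B, 1, 2H, 2W) depth map

def depth_loss(pred, target, eps=1e-6):
    # L1 in log-depth space, a common monocular-depth objective (assumed,
    # not necessarily the paper's exact loss formulation).
    pred = pred.squeeze(1).clamp_min(eps)
    return (torch.log(pred) - torch.log(target.clamp_min(eps))).abs().mean()
```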

Empirical results show that Sapiens significantly outperforms previous state-of-the-art methods. The model improves performance on the Humans-5K benchmark for pose estimation by 7.6 mAP, the Humans-2K benchmark for body-part segmentation by 17.1 mIoU, the Hi4D benchmark for depth estimation by 22.4% in relative RMSE, and the THuman2 benchmark for normal estimation by 53.5% in relative angular error.

In summary, Sapiens marks a considerable advancement in human-centric vision models, positioning itself as a foundational framework for future applications. The researchers believe that their models can become a vital component in numerous downstream tasks and will provide high-quality vision backbones to a broader community.

The paper Sapiens: Foundation for Human Vision Models is on arXiv.


Author: Hecate He | Editor: Chain Zhang
