Meta’s Sapiens: Revolutionizing Human Pose, Segmentation, and Depth Estimation with Vision Transformers

In recent years, substantial progress has been made in generating photorealistic human representations in both 2D and 3D, thanks to advancements in the precise estimation of various visual assets. Despite these improvements, achieving accurate and robust estimations remains a challenge, especially given the difficulties in scaling up ground-truth annotations for in-the-wild scenarios.

In a new paper Sapiens: Foundation for Human Vision Models, a Meta research team introduces Sapiens, a suite of models designed to address four core human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.

The team summarizes their main contributions as follows:

  • Sapiens is a family of vision transformers pretrained on an extensive dataset of human images.
  • The study demonstrates that by combining straightforward data curation with large-scale pretraining, significant performance gains can be achieved without increasing computational costs.
  • The models, fine-tuned using both high-quality and synthetic labels, show strong generalization capabilities in real-world settings.
  • Sapiens is the first family of models to natively support 1K-resolution inference for human-centric tasks, setting new benchmarks in 2D pose estimation, body-part segmentation, depth estimation, and surface-normal estimation.

The researchers leverage a vast proprietary dataset of around one billion in-the-wild images for pretraining. They employ a person bounding-box detector to filter out lower-quality images, keeping only those with a detection score above 0.9 and bounding boxes larger than 300 pixels. They follow a masked autoencoder (MAE) strategy during pretraining, where the model learns to reconstruct the original human image from partially visible segments. The encoder captures latent representations from the visible portions of the image, while the decoder reconstructs the full image from this latent data.
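The curation and masking steps above can be sketched in a few lines. This is a minimal illustration, not Meta's pipeline: `keep_image` applies the paper's stated thresholds (detection score above 0.9, boxes larger than 300 pixels), and `mask_patches` mimics the MAE recipe of hiding a fixed fraction of image patches so the encoder only sees the visible remainder. All function names here are hypothetical.

```python
import random

# Hypothetical curation filter using the thresholds stated in the paper:
# keep a detection only if its score exceeds 0.9 and its bounding box
# is larger than 300 pixels on its shorter side.
def keep_image(det_score, box_w, box_h, min_score=0.9, min_side=300):
    return det_score > min_score and min(box_w, box_h) > min_side

# MAE-style masking: split the image into patches and hide a fixed
# fraction of them; only the visible patches are fed to the encoder,
# and the decoder is trained to reconstruct the rest.
def mask_patches(num_patches, mask_ratio=0.75, seed=0):
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_visible = int(num_patches * (1 - mask_ratio))
    visible = sorted(ids[:num_visible])   # seen by the encoder
    masked = sorted(ids[num_visible:])    # reconstructed by the decoder
    return visible, masked

visible, masked = mask_patches(num_patches=196)  # e.g. a 14x14 patch grid
```

With a 75% mask ratio on a 196-patch grid, the encoder processes only 49 patches, which is what makes large-scale MAE pretraining computationally affordable.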

For 2D pose estimation, Sapiens adopts a top-down approach to identify keypoints from an input image. Unlike previous models, which use up to 68 facial keypoints, Sapiens incorporates 243 facial keypoints, capturing intricate details around the eyes, nose, lips, and ears to better represent facial expressions.
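In a typical top-down pipeline of this kind, a detector first crops each person, then the pose model predicts one heatmap per keypoint and the keypoint location is read off at the heatmap's peak. The sketch below shows only that generic decoding step, with a toy hand-written heatmap; it is not Sapiens' actual decoder.

```python
# Decode a single keypoint from its predicted heatmap by taking the
# location and value of the maximum response (a common top-down scheme).
def decode_keypoint(heatmap):
    """Return (row, col, score) of the heatmap's peak."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > best[2]:
                best = (r, c, v)
    return best

# Toy 3x3 heatmap with its peak at the center.
hm = [[0.0, 0.1, 0.0],
      [0.2, 0.9, 0.1],
      [0.0, 0.1, 0.0]]
decode_keypoint(hm)  # -> (1, 1, 0.9)
```

A model with 243 facial keypoints simply predicts 243 such heatmaps per face crop, one per keypoint.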

In the body-part segmentation task, the model employs an encoder-decoder structure and introduces a more detailed classification vocabulary compared to prior datasets. This includes finer distinctions such as upper and lower limbs, as well as specific parts like the upper and lower lips, teeth, and tongue.
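Per-pixel segmentation then reduces to picking, at each pixel, the class with the highest decoder score. The vocabulary below is an illustrative subset invented for this sketch, not the full Sapiens label set.

```python
# Illustrative subset of a fine-grained body-part vocabulary; the real
# Sapiens vocabulary is larger and more detailed.
PART_CLASSES = [
    "background", "upper_arm_left", "lower_arm_left",
    "upper_lip", "lower_lip", "teeth", "tongue",
]

def segment_pixel(logits):
    """Assign one pixel to the class with the highest decoder score."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return PART_CLASSES[best]

segment_pixel([0.1, 0.2, 0.0, 1.3, 0.4, 0.2, 0.1])  # -> "upper_lip"
```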

For depth estimation, the researchers use a similar architecture to that of segmentation, modifying the output channel to support regression. The depth estimation model is trained on 500,000 synthetic images generated from 600 high-resolution photogrammetry human scans, ensuring high accuracy for monocular depth estimation.
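Conceptually, reusing the segmentation architecture for depth just means swapping the final projection: instead of one output channel per body-part class, the head emits a single regression channel. The toy sketch below makes that point with a plain linear head applied to one pixel's feature vector; the dimensions and helper names are hypothetical.

```python
import random

# Build a linear output head: one weight row per output channel.
def make_head(in_dim, out_channels, seed=0):
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_channels)]

# Apply the head to a single pixel's feature vector.
def apply_head(head, feature):
    return [sum(w * x for w, x in zip(row, feature)) for row in head]

seg_head = make_head(in_dim=256, out_channels=20)   # per-class logits
depth_head = make_head(in_dim=256, out_channels=1)  # scalar depth value
```

The segmentation head yields a vector of class scores per pixel, while the depth head yields a single continuous value per pixel, trained with a regression loss rather than cross-entropy.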

Empirical results show that Sapiens significantly outperforms previous state-of-the-art methods. The model improves performance on the Humans-5K benchmark for pose estimation by 7.6 mAP, the Humans-2K benchmark for body-part segmentation by 17.1 mIoU, the Hi4D benchmark for depth estimation by 22.4% in relative RMSE, and the THuman2 benchmark for normal estimation by 53.5% in relative angular error.

In summary, Sapiens marks a considerable advancement in human-centric vision models, positioning itself as a foundational framework for future applications. The researchers believe that their models can become a vital component in numerous downstream tasks and will provide high-quality vision backbones to a broader community.

The paper Sapiens: Foundation for Human Vision Models is on arXiv.


Author: Hecate He | Editor: Chain Zhang
