Non-verbal human communication through pose, appearance, facial expressions, hand gestures and the like, collectively known as body language, has been studied extensively. Non-verbal signals carry rich information about human expression, and effectively capturing, interpreting and generating such signals can significantly improve the realism of digital humans (avatars) in telepresence, augmented reality (AR) and virtual reality (VR) environments.
While current state-of-the-art avatar models such as those in the SMPL family can accurately represent a wide range of human body shapes in natural poses, they rely on mesh-based representations and are therefore constrained by a fixed topology and the resolution of the underlying 3D mesh. Moreover, such models tend to focus on minimally clothed bodies and do not model garments or hair, which limits the realism of their outputs.
In the new paper X-Avatar: Expressive Human Avatars, a research team from ETH Zurich and Microsoft presents X-Avatar, an expressive implicit human avatar model designed to capture high-fidelity human body and hand poses, facial expressions and other appearance characteristics in a holistic fashion.

The team summarizes their main contributions as follows:
- X-Avatar, the first expressive implicit human avatar model that captures body pose, hand pose, facial expressions and appearance.
- Part-aware initialization and sampling strategies, which together improve the quality of the results and keep training efficient.
- X-Humans, a new dataset comprising 233 sequences of high-quality textured scans showing 20 participants with varied body and hand movements and facial expressions, totalling 35,500 frames.

X-Avatar accepts two types of input: 3D posed scans and RGB-D images. Its architecture comprises a shape network that models geometry in canonical space and a deformation network that employs learned linear blend skinning (LBS) to establish correspondences between the canonical and deformed spaces.
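
As a rough illustration of this two-network design, the following PyTorch-style sketch pairs a canonical shape network with a deformation network that predicts learned LBS weights. All names, layer sizes and details are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of the shape network / deformation network split,
# assuming a PyTorch implementation. All names, layer sizes and the
# joint count are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

NUM_JOINTS = 55  # SMPL-X articulates body, hands and face with 55 joints


def mlp(in_dim, out_dim, hidden=256, depth=4):
    """Small fully connected network used by both branches."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.Softplus(beta=100)]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


class ShapeNet(nn.Module):
    """Implicit surface: occupancy of a 3D point in canonical space."""

    def __init__(self):
        super().__init__()
        self.net = mlp(3, 1)

    def forward(self, x_canonical):  # (N, 3) -> (N, 1)
        return self.net(x_canonical)


class DeformNet(nn.Module):
    """Learned LBS: skinning weights of a 3D point in canonical space."""

    def __init__(self):
        super().__init__()
        self.net = mlp(3, NUM_JOINTS)

    def forward(self, x_canonical):  # (N, 3) -> (N, NUM_JOINTS)
        return torch.softmax(self.net(x_canonical), dim=-1)


def skin(x_canonical, bone_transforms, deform_net):
    """Warp canonical points into posed space with learned LBS.

    bone_transforms: (NUM_JOINTS, 4, 4) rigid transforms derived from
    the SMPL-X pose parameters.
    """
    w = deform_net(x_canonical)  # (N, J)
    x_h = torch.cat([x_canonical, torch.ones(len(x_canonical), 1)], dim=-1)
    # Blend the per-joint transforms with the predicted skinning weights.
    T = torch.einsum("nj,jab->nab", w, bone_transforms)  # (N, 4, 4)
    return torch.einsum("nab,nb->na", T, x_h)[:, :3]  # (N, 3)
```

Note that this sketch shows only the forward (canonical-to-posed) direction; SNARF-style models of this family recover canonical correspondences for points observed in the deformed space by iteratively inverting the skinning function.
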
To create their expressive and controllable human avatars, the researchers start from the parameter space of SMPL-X, an SMPL extension that captures the shape, appearance and deformations of full-body humans with attention to hand poses and facial expressions. A human model defined by articulated neural implicit surfaces captures the varying topology of clothed humans, while novel part-aware initialization and sampling strategies significantly improve the fidelity of the final result, for instance by raising the sampling rate for smaller body parts such as the hands and face.
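
As a concrete (and hypothetical) sketch of the sampling side of this strategy, the snippet below draws more training points from small but expressive body parts than uniform surface sampling would. The part labels and weight ratios are assumptions for illustration and do not come from the paper.

```python
# A hypothetical sketch of part-aware sampling: points on small but
# expressive regions (hands, face) are drawn more often than their
# surface area alone would dictate. Part names and weight ratios are
# invented for illustration.
import numpy as np


def part_aware_sample(points, part_labels, n_samples, weights=None):
    """Draw training points with per-part importance weights.

    points:      (N, 3) array of surface points from a scan.
    part_labels: length-N sequence of part names, one per point.
    """
    if weights is None:
        weights = {"body": 1.0, "hand": 5.0, "face": 3.0}
    p = np.array([weights[label] for label in part_labels], dtype=np.float64)
    p /= p.sum()
    idx = np.random.choice(len(points), size=n_samples, replace=True, p=p)
    return points[idx]
```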

In their empirical study, the team compared X-Avatar with SCANimate and SNARF on animation tasks with minimally clothed humans and evaluated its ability to learn from 3D scans and (synthesized) RGB-D data. X-Avatar’s animation quality surpassed all baselines in the experiments.
This work demonstrates X-Avatar’s ability to capture human body pose, hand pose, facial expressions and appearance to generate more personalized and realistic avatars. The team hopes their approach will encourage further research on improving expressivity in digital humans.
Code and additional information are available on the project’s GitHub. The paper X-Avatar: Expressive Human Avatars is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
