Total global retail e-commerce sales have more than tripled over the last six years and are projected to top US$7 trillion by 2025. With fashion claiming a growing share of this market, retailers are increasingly deploying AI-powered virtual try-on systems. Such systems are not only changing shoppers’ habits and boosting the e-commerce industry, they also have applications in short-form video and other popular domains. While the quality of image-based virtual try-on methods has improved dramatically, video-based virtual try-on remains relatively underdeveloped, as generating visually pleasing and temporally coherent video results is both difficult and computationally costly.
In the new paper ClothFormer: Taming Video Virtual Try-on in All Module, a research team from BIGO Technology and iQIYI Inc. presents ClothFormer, a novel video virtual try-on framework that preserves the features and details of both the clothing and the person, generating realistic and temporally smooth try-on videos that surpass the outputs of current state-of-the-art systems by a large margin.
The team summarizes their main contributions as follows:
- A novel warp module combines the advantages of thin-plate spline (TPS)-based and appearance-flow-based methods to address inaccurate warping caused by occlusions in the clothing region.
- A tracking module based on ridge regression and optical flow correction produces a temporally smooth warped clothing sequence, a prerequisite for the try-on module to generate coherent videos (see the toy sketch after this list).
- A carefully designed Multi-scale Patch-based Dual-stream Transformer (MPDT) generator in the try-on module extracts and fuses clothing textures, person features and environment information to generate realistic try-on videos. To the best of the team’s knowledge, this is the first time a transformer has been applied to video virtual try-on.
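The paper’s exact tracking formulation is not spelled out in this summary, but the core idea of using ridge regression to enforce temporal smoothness can be illustrated with a minimal, hypothetical sketch: given per-frame dense appearance flows, fit a low-order polynomial of time to every flow component with an L2 penalty and use the fitted trajectory as the smoothed flow. All names and shapes below are assumptions for illustration, not the authors’ implementation.

```python
# Toy sketch only: ridge-regression temporal smoothing of per-frame flows.
# Assumed input shape (T, H, W, 2); this is not ClothFormer's released code.
import numpy as np

def ridge_smooth_flows(flows: np.ndarray, degree: int = 2, lam: float = 1.0) -> np.ndarray:
    """Smooth a sequence of dense flows by fitting a ridge-regularized
    polynomial of time to every flow component independently."""
    T = flows.shape[0]
    t = np.linspace(-1.0, 1.0, T)                   # normalized time stamps
    A = np.vander(t, degree + 1, increasing=True)   # (T, degree+1) design matrix
    Y = flows.reshape(T, -1)                        # (T, H*W*2) regression targets
    # Closed-form ridge solution: W = (A^T A + lam*I)^(-1) A^T Y
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return (A @ W).reshape(flows.shape)             # evaluate the fitted trajectories
```

In ClothFormer, the tracking result is additionally corrected with optical flow; that refinement step is omitted from this sketch.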
The main limitations of existing video virtual try-on methods are poor frame consistency and spatio-temporal smoothness. The researchers trace these problems to two factors: 1) existing models focus too heavily on the try-on module while neglecting the spatio-temporal dimensions, which leads to blurring and temporal artifacts in the generated videos; and 2) most models were trained on simple datasets with clean backgrounds, and thus struggle in more complex real-life environments.
ClothFormer aims to address these issues. The team first designs a clothing-agnostic person representation that eliminates any clothing information while preserving backgrounds and occlusions. They then employ a frame-level TPS-based warp method to predict and mask the clothing’s occlusion regions, and feed these predictions to an appearance-flow-based method to obtain accurate, occlusion-robust dense flow mappings between the body and clothing regions. An appearance-flow tracking module then produces warped clothing sequences with improved spatio-temporal consistency. Finally, the MPDT generator extracts and fuses clothing textures, person features such as pose, and environment information to synthesize the final output video sequence.
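The pipeline described above can be summarized as a short, conceptual composition. This is a hedged sketch in PyTorch style: every class and argument name (ClothFormerSketch, tps_warp, flow_net, flow_tracker, mpdt_generator, and the tensors passed between them) is a hypothetical placeholder chosen for illustration, not the authors’ released API.

```python
# Conceptual composition of the described pipeline; names are illustrative only.
import torch.nn as nn

class ClothFormerSketch(nn.Module):
    def __init__(self, tps_warp, flow_net, flow_tracker, mpdt_generator):
        super().__init__()
        self.tps_warp = tps_warp          # frame-level TPS warp + occlusion-mask prediction
        self.flow_net = flow_net          # appearance-flow estimation guided by the TPS result
        self.flow_tracker = flow_tracker  # ridge-regression / optical-flow temporal smoothing
        self.generator = mpdt_generator   # Multi-scale Patch-based Dual-stream Transformer

    def forward(self, person_frames, agnostic_frames, cloth_image):
        # 1) Frame-level TPS warp predicts coarse warps and occlusion masks.
        coarse_warp, occlusion_mask = self.tps_warp(agnostic_frames, cloth_image)
        # 2) Appearance flow refines them into occlusion-robust dense flows.
        dense_flows = self.flow_net(agnostic_frames, cloth_image, coarse_warp, occlusion_mask)
        # 3) The tracking module smooths the flows over time and warps the clothing.
        warped_cloth_seq = self.flow_tracker(dense_flows, cloth_image)
        # 4) The dual-stream transformer fuses cloth texture, person and background cues.
        return self.generator(warped_cloth_seq, agnostic_frames, person_frames)
```

The design choice reflected here is that warping, temporal tracking and generation are handled by dedicated modules, so temporal consistency is enforced before the transformer-based generator rather than left for it to learn implicitly.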
To validate ClothFormer’s effectiveness, the team compared its outputs with those of existing state-of-the-art methods (FW-GAN, MV-TON, CP-VTON, ACGPN and PB-AFN) using both image-based and video-based evaluation metrics on the VVT video virtual try-on dataset. In the experiments, ClothFormer surpassed the baselines by a large margin, both quantitatively and qualitatively, in generating high-quality, spatio-temporally consistent try-on videos.
Author: Hecate He | Editor: Michael Sarazen