Consistency models, a burgeoning family of generative models, have been making waves in the AI research community. Unlike diffusion models, they produce high-quality samples in a single step, and unlike GANs they avoid the intricate adversarial training process. Their secret lies in distilling knowledge from pre-trained diffusion models and leveraging learned metrics such as Learned Perceptual Image Patch Similarity (LPIPS). But there’s a caveat: distillation caps the quality of consistency models at that of their parent diffusion models, and LPIPS introduces undesirable bias into evaluation.
To confront these challenges head-on, in a new paper, Improved Techniques for Training Consistency Models, an OpenAI research team introduces methods that enable consistency models to learn directly from data, surpassing consistency distillation (CD) in sample quality while dispensing with LPIPS altogether.
Consistency models have traditionally been trained with either consistency distillation (CD) or consistency training (CT). Prior research has consistently shown that CD outperforms CT, but at the cost of extra computation: CD requires a separately trained diffusion model, which in turn caps the sample quality the consistency model can reach.
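The core of CT can be sketched on a scalar toy problem: the model is trained so that its outputs at two adjacent noise levels of the same noised data point agree, with no diffusion-model teacher involved. This is a minimal illustration under our own naming, not the paper's implementation, and a plain squared error stands in for the metric:

```python
def consistency_loss(f, x, z, sigmas, n):
    """One-sample consistency-training (CT) loss on a scalar toy problem.

    f(x, sigma) is the consistency function being learned; sigmas is an
    increasing list of noise levels; z is a noise draw. Squared error is
    used here purely for illustration."""
    # The same clean point x and the same noise draw z, perturbed at two
    # adjacent noise levels.
    x_noisier = x + sigmas[n + 1] * z
    x_cleaner = x + sigmas[n] * z
    # The student evaluates the noisier input; in actual training, the
    # output on the cleaner input acts as a constant (no-gradient) target.
    return (f(x_noisier, sigmas[n + 1]) - f(x_cleaner, sigmas[n])) ** 2
```

With an identity model `f(x, s) = x`, the loss reduces to the squared gap between the two perturbed inputs, which shrinks as adjacent noise levels get closer.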
In response, this research focuses on enhancing CT to match or even surpass CD. The proposed improvements stem from a blend of theoretical insights and extensive experimentation on the CIFAR-10 dataset. The researchers thoroughly examine the practical impact of weighting functions, noise embeddings, and dropout in CT. Notably, they uncover an overlooked flaw in previous theoretical analyses and propose a simple remedy: removing the Exponential Moving Average (EMA) from the teacher network.
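The EMA remedy amounts to a one-line change in how the teacher's weights are set each iteration. The sketch below contrasts the two rules on scalar "weights" for clarity; the function names are ours, and a real implementation would copy full parameter tensors and block gradients through the teacher:

```python
def update_teacher_ema(teacher_w, student_w, mu=0.99):
    # Earlier CT practice: the teacher's weights track an exponential
    # moving average (EMA) of the student's weights with decay rate mu.
    return mu * teacher_w + (1.0 - mu) * student_w

def update_teacher_no_ema(teacher_w, student_w):
    # The paper's remedy: the teacher is simply the current student
    # (equivalently, an EMA decay rate of zero); gradients are still
    # stopped through the teacher's output during training.
    return student_w
```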
To overcome the evaluation bias introduced by LPIPS, the team adopts Pseudo-Huber losses from robust statistics. They also study how sample quality improves as the number of discretization steps grows, and leverage this insight to introduce a simple yet effective curriculum for the total number of discretization steps. They further propose a novel schedule for sampling noise levels in the CT objective, based on lognormal distributions.
The culmination of these innovations allows CT to achieve remarkable FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet 64×64, respectively, all within a single sampling step. These scores not only outperform CD but also improve on previous CT methods by factors of 3.5× and 4×. What’s more, the models achieve these results without any distillation, even surpassing the best few-step diffusion distillation techniques.
In summary, the enhanced techniques for CT have effectively overcome its past limitations, putting forth results that stand shoulder to shoulder with top-tier diffusion models and GANs. This achievement underscores the immense potential of consistency models as an independent and promising family of generative models.
The paper Improved Techniques for Training Consistency Models is available on arXiv.
Author: Hecate He | Editor: Chain Zhang