Large-scale text-to-image generative models have achieved remarkable success in recent years in synthesizing diverse and realistic images with complex objects and scenes. However, directly applying such models to the editing of real images remains challenging, as it requires carefully tailored text prompts, and the results can often include additional undesirable modifications.
In the new paper Zero-Shot Image-to-Image Translation, a team from Carnegie Mellon University and Adobe Research introduces pix2pix-zero, a diffusion-based image-to-image translation method that requires only simple edit-direction text instructions (e.g., cat → dog) to perform structure-preserving photorealistic image editing without additional prompting or training.
The team summarizes their main contributions as follows:
- An efficient, automatic editing direction discovery mechanism without input text prompting.
- Content preservation via cross-attention guidance.
The team first identifies generic edit directions that work for a broad range of input images. Given a source word (e.g., cat) and a target word (e.g., dog), two groups of sentences are generated, one containing the source word and the other containing the target word, and the CLIP embedding direction between the two groups is computed. This process takes about five seconds and can be pre-computed. Because these edit directions are based on multiple sentences, the pix2pix-zero approach is more robust than conventional methods that only compute the direction between the source and target words themselves.
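As a rough illustration of the idea, the sketch below computes an edit direction as the difference between the mean embeddings of the two sentence groups. Taking the mean difference is an assumption about how the direction "between the two groups" is formed, and the tiny hand-written vectors are stand-ins for real CLIP text embeddings; the function name `edit_direction` is hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def edit_direction(source_embs, target_embs):
    """Direction from the source group to the target group (assumed: mean difference)."""
    src_mean = mean_vector(source_embs)
    tgt_mean = mean_vector(target_embs)
    return [t - s for s, t in zip(src_mean, tgt_mean)]

# Stand-ins for embeddings of sentences containing "cat" and "dog".
# Real embeddings would come from a CLIP text encoder (not shown here).
cat_sentence_embs = [[1.0, 0.0], [3.0, 2.0]]
dog_sentence_embs = [[2.0, 4.0], [4.0, 2.0]]

direction = edit_direction(cat_sentence_embs, dog_sentence_embs)

# Applying the direction: shift the input text embedding toward the target concept.
input_text_emb = [0.5, 0.5]
edited_text_emb = [c + d for c, d in zip(input_text_emb, direction)]
print(direction, edited_text_emb)
```

Because the direction depends only on the word pair and not on the image, it can be computed once and reused for every input, which is consistent with the article's note that it can be pre-computed.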
To preserve the content structure of the input image after editing, the researchers use cross-attention guidance, which keeps the cross-attention maps of the edited image consistent with those of the original. They also apply two techniques to improve image quality and inference speed: 1) autocorrelation regularization, which keeps the noise close to Gaussian during inversion, and 2) conditional GAN distillation, which enables interactive editing and real-time inference.
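To give a feel for what an autocorrelation penalty measures, here is a simplified sketch and not the paper's exact loss: it penalizes the correlation between a noise map and shifted copies of itself, which should be near zero for Gaussian white noise. The function name, the choice of shifts, and the circular wrap-around are all assumptions made for illustration.

```python
import random

def autocorrelation_penalty(noise, shifts=(1, 2)):
    """Sum of squared mean products between a 2D map and its shifted copies."""
    h, w = len(noise), len(noise[0])
    total = 0.0
    for s in shifts:
        # Mean product with a horizontally shifted copy (wrapping at the border).
        horiz = sum(noise[y][x] * noise[y][(x + s) % w]
                    for y in range(h) for x in range(w)) / (h * w)
        # Mean product with a vertically shifted copy.
        vert = sum(noise[y][x] * noise[(y + s) % h][x]
                   for y in range(h) for x in range(w)) / (h * w)
        total += horiz ** 2 + vert ** 2
    return total

random.seed(0)
white_noise = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(16)]
structured = [[1.0] * 16 for _ in range(16)]  # fully correlated, clearly not noise-like

penalty_noise = autocorrelation_penalty(white_noise)
penalty_structured = autocorrelation_penalty(structured)
print(penalty_noise, penalty_structured)
```

A map with visible structure scores much higher than white noise, so minimizing a term of this kind during inversion would push the inverted noise toward being uncorrelated.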
In the pix2pix-zero pipeline, an input image is first reconstructed using the input text only, without the edit direction, to obtain reference cross-attention maps for each timestep. These reference maps correspond to the original image’s structure. The edit direction is then applied to produce edited cross-attention maps, and at each timestep a gradient step is taken to reduce the cross-attention loss, which measures how far the edited maps are from the reference maps. This discourages the edited cross-attention maps from deviating from the reference maps and enables the model to retain the original image structure in its output image.
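The guidance step described above can be sketched on a toy problem. Everything here is an assumption for illustration: `attention_map` is a softmax stand-in for a real diffusion model's cross-attention, the loss is a squared difference between maps, and the gradient is estimated by finite differences rather than backpropagation.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(latent, text_emb):
    """Toy cross-attention: one row per latent element, one column per text token."""
    return [softmax([q * k for k in text_emb]) for q in latent]

def xa_loss(latent, text_emb, ref_map):
    """Squared distance between the current attention map and the reference map."""
    cur = attention_map(latent, text_emb)
    return sum((a - b) ** 2
               for row_c, row_r in zip(cur, ref_map)
               for a, b in zip(row_c, row_r))

def guidance_step(latent, text_emb, ref_map, step_size=0.1, eps=1e-5):
    """One gradient-descent step on the latent, using central finite differences."""
    grad = []
    for i in range(len(latent)):
        up = latent[:i] + [latent[i] + eps] + latent[i + 1:]
        down = latent[:i] + [latent[i] - eps] + latent[i + 1:]
        grad.append((xa_loss(up, text_emb, ref_map)
                     - xa_loss(down, text_emb, ref_map)) / (2 * eps))
    return [x - step_size * g for x, g in zip(latent, grad)]

latent = [0.5, -1.0, 2.0]            # stand-in for the noisy latent at one timestep
source_emb = [1.0, 0.2, -0.5]        # stand-in for the input text embedding
direction = [0.3, -0.2, 0.1]         # stand-in for a pre-computed edit direction
edited_emb = [c + d for c, d in zip(source_emb, direction)]

# Reference maps come from reconstruction with the input text only.
ref_map = attention_map(latent, source_emb)

loss_before = xa_loss(latent, edited_emb, ref_map)
guided = guidance_step(latent, edited_emb, ref_map)
loss_after = xa_loss(guided, edited_emb, ref_map)
print(loss_before, loss_after)
```

In a real pipeline this update would be interleaved with the denoising steps, so the latent is nudged toward the reference structure at every timestep while the edited text embedding changes the content.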
In their empirical study, the team compared pix2pix-zero with three real-image editing baselines (SDEdit + word swap, Prompt-to-Prompt, and DDIM + word swap) on image-to-image translation tasks. The results show that pix2pix-zero surpasses the baselines on most tasks in terms of photorealism and content preservation, without requiring complex text prompting or additional training.
Author: Hecate He | Editor: Michael Sarazen