Generative image models conditioned on text prompts have made astounding progress in recent years thanks to novel deep learning architectures, advanced training paradigms such as masked modelling, increasing availability of massive image-text paired training data, and new families of diffusion and masking-based models.
In the new paper Muse: Text-To-Image Generation via Masked Generative Transformers, a Google Research team introduces Muse, a transformer-based text-to-image synthesis model that leverages a masked image modelling approach to achieve state-of-the-art performance — a 7.88 FID score on zero-shot COCO evaluation and a 0.32 CLIP score — while being significantly faster than diffusion or traditional autoregressive models.
The team summarizes their main contributions as follows:
- We present a state-of-the-art model for text-to-image generation which achieves excellent FID and CLIP scores (quantitative measures of image generation quality, diversity and alignment with text prompts).
- Our model is significantly faster than comparable models due to the use of quantized image tokens and parallel decoding.
- Our architecture enables out-of-the-box, zero-shot editing capabilities including inpainting, outpainting, and mask-free editing.
Muse is built on embeddings from Google’s T5, a large language model pretrained on a wide range of text-to-text tasks, and generates high-quality images via a masked transformer architecture. Muse inherits rich information about objects, actions, visual properties, spatial relationships and so on from the T5 embeddings, and learns to match these concepts to the generated images.
The paper details eight of Muse’s core components, such as its semantic tokenization, which uses a VQGAN model’s encoder and decoder to encode images from different resolutions and output discrete tokens that capture higher-level semantics of the image without being affected by low-level noise.
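As a rough illustration of what "discrete tokens" means here, the sketch below shows the nearest-codebook lookup at the heart of vector quantization. It is a minimal NumPy example, not Muse's actual tokenizer: the function names, array shapes and the toy codebook are assumptions made for illustration, and a real VQGAN learns its encoder, decoder and codebook jointly.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (H, W, D) array standing in for encoder outputs (hypothetical shape).
    codebook: (K, D) array of code vectors.
    Returns an (H, W) integer array of token ids.
    """
    flat = latents.reshape(-1, latents.shape[-1])  # (H*W, D)
    # Squared Euclidean distance between every latent and every code vector.
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).reshape(latents.shape[:-1])

def dequantize(tokens, codebook):
    """Look up the code vector for each token id (the input a decoder would consume)."""
    return codebook[tokens]

# Toy codebook with 8 well-separated codes in 4 dimensions: code k is (k, k, k, k).
codebook = np.arange(8, dtype=float)[:, None] * np.ones(4)

# Fake "encoder output": known codes plus a little low-level noise.
rng = np.random.default_rng(0)
true_ids = np.array([[0, 3], [5, 7]])
latents = codebook[true_ids] + 0.01 * rng.normal(size=(2, 2, 4))

tokens = quantize(latents, codebook)
reconstructed = dequantize(tokens, codebook)
```

In this toy setup the small perturbation does not change which code is nearest, which mirrors the idea that the tokens capture content while ignoring low-level noise.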
A super-resolution model learns to translate the lower-resolution latent map to a higher-resolution latent map, which is then decoded through the higher-resolution VQGAN to generate the final high-resolution image. The researchers also add extra residual layers and channels to the VQGAN decoder while keeping the encoder’s capacity fixed, then fine-tune only these new layers while keeping the VQGAN encoder’s weights and the other pretrained components frozen. Because the visual token “language” remains the same, it is possible to improve the generated images’ details and visual quality without retraining any other model components.
To improve Muse’s text-image alignment, the team uses a classifier-free guidance (CFG) approach that linearly increases the guidance scale so that early tokens are sampled with low or no guidance, and the influence of the conditioning prompt is increased for the later tokens. They also employ parallel decoding to reduce inference time.
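The two ideas in this paragraph can be sketched together. The example below is a simplified illustration under stated assumptions, not the team's implementation: `predict_logits` is a hypothetical stand-in for the transformer, tokens are chosen greedily rather than sampled, and the cosine unmasking schedule is borrowed from MaskGIT-style decoders as an assumption. The guidance formula is the standard classifier-free form, in which conditional logits are pushed away from unconditional ones.

```python
import numpy as np

def guidance_scale(step, num_steps, max_scale):
    """Linearly ramp the guidance scale from 0 at the first step to max_scale at the last."""
    if num_steps == 1:
        return max_scale
    return max_scale * step / (num_steps - 1)

def guided_logits(cond, uncond, scale):
    """Classifier-free guidance: (1 + scale) * conditional - scale * unconditional."""
    return (1.0 + scale) * cond - scale * uncond

def parallel_decode(predict_logits, num_tokens, num_steps, max_scale, mask_id=-1):
    """Fill in all tokens over a few steps, committing many tokens per step.

    predict_logits(tokens, conditional) must return a (num_tokens, vocab) array;
    it is a hypothetical placeholder for the model's forward pass.
    """
    tokens = np.full(num_tokens, mask_id)
    for step in range(num_steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        scale = guidance_scale(step, num_steps, max_scale)
        logits = guided_logits(predict_logits(tokens, True),
                               predict_logits(tokens, False), scale)
        # Softmax over the vocabulary to obtain per-position confidences.
        z = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(z)
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        conf[~masked] = -np.inf  # never overwrite tokens that are already committed
        # Assumed cosine schedule: how many tokens should stay masked after this step.
        keep_masked = int(np.floor(num_tokens * np.cos(np.pi / 2 * (step + 1) / num_steps)))
        n_commit = min(max(masked.sum() - keep_masked, 1), masked.sum())
        commit = np.argsort(-conf)[:n_commit]  # most confident masked positions
        tokens[commit] = pred[commit]
    return tokens
```

With 16 tokens and 4 steps, this loop commits 2, 3, 5 and then 6 tokens, so the whole sequence is produced in 4 forward-pass rounds instead of the 16 a token-by-token autoregressive decoder would need. Early steps run with little or no guidance, and the prompt's influence grows as the scale ramps up.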
In their empirical study, the team compared Muse with popular benchmark models on various text-to-image generation tasks. The Muse 900M parameter model achieved a new SOTA on the CC3M dataset with a 6.06 FID score (lower is better), while the Muse 3B parameter model recorded a 7.88 FID score on zero-shot COCO and a CLIP score of 0.32.
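For readers unfamiliar with the metric, FID (Fréchet Inception Distance) is commonly defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The symbols below are notation introduced here for illustration: $\mu_r, \Sigma_r$ are the feature mean and covariance for real images, and $\mu_g, \Sigma_g$ are those for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is zero when the two feature distributions have identical means and covariances, which is why lower values indicate generated images that are statistically closer to real ones.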
Muse also demonstrated impressive out-of-the-box, zero-shot editing capabilities, further confirming the potential of frozen large pretrained language models as powerful and efficient text encoders for text-to-image generation.
Author: Hecate He | Editor: Michael Sarazen