AI Machine Learning & Data Science Research

Google’s Masked Generative Transformers Achieve SOTA Text-To-Image Performance With Improved Efficiency

Generative image models conditioned on text prompts have made astounding progress in recent years thanks to novel deep learning architectures, advanced training paradigms such as masked modelling, increasing availability of massive image-text paired training data, and new families of diffusion and masking-based models.

In the new paper Muse: Text-To-Image Generation via Masked Generative Transformers, a Google Research team introduces Muse, a transformer-based text-to-image synthesis model that leverages a masked image modelling approach to achieve state-of-the-art performance — a 7.88 FID score on zero-shot COCO evaluation and a 0.32 CLIP score — while being significantly faster than diffusion or traditional autoregressive models.

The team summarizes their main contributions as follows:

  1. We present a state-of-the-art model for text-to-image generation which achieves excellent FID and CLIP scores (quantitative measures of image generation quality, diversity and alignment with text prompts).
  2. Our model is significantly faster than comparable models due to the use of quantized image tokens and parallel decoding.
  3. Our architecture enables out-of-the-box, zero-shot editing capabilities including inpainting, outpainting, and mask-free editing.

Muse is built on Google’s T5, a large language model pretrained on a wide range of text-to-text tasks. Conditioned on embeddings from a frozen T5 encoder, Muse generates high-quality images via a masked transformer architecture, inheriting rich information about objects, actions, visual properties, spatial relationships and so on from the T5 embeddings and learning to match these concepts to the images it generates.

The paper details eight of Muse’s core components, such as its semantic tokenization, which uses a VQGAN model’s encoder and decoder to encode images at different resolutions into discrete tokens that capture the higher-level semantics of an image while remaining robust to low-level noise.
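The tokenization step can be sketched in plain Python. The tiny codebook, 2-d latent vectors, and nearest-neighbour lookup below are illustrative assumptions only; a real VQGAN learns its codebook jointly with CNN encoder/decoder networks:

```python
import math

def quantize(latents, codebook):
    """Map each continuous latent vector to the id of its nearest
    codebook entry (Euclidean distance), yielding discrete tokens."""
    tokens = []
    for z in latents:
        dists = [math.dist(z, entry) for entry in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def decode(tokens, codebook):
    """Invert tokenization by looking up codebook entries; a real
    VQGAN decoder would then map these latents back to pixels."""
    return [codebook[t] for t in tokens]

# A toy 4-entry codebook of 2-d latents (illustrative only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 1.1)]
tokens = quantize(latents, codebook)   # → [0, 3]
```

Because the output is a grid of discrete ids rather than raw pixels, the transformer can treat image generation like masked language modelling over a fixed vocabulary.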

A super-resolution model learns to translate the lower-resolution latent map into a higher-resolution latent map, which is decoded through the higher-resolution VQGAN to produce the final high-resolution image. The researchers also add extra residual layers and channels to the VQGAN decoder while holding the encoder’s capacity fixed, then fine-tune only these new layers, leaving the VQGAN encoder’s weights and the rest of the model frozen. Because the visual token “language” stays the same, this improves the detail and visual quality of generated images without retraining any other model components.
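The selective fine-tuning pattern can be sketched with a toy parameter dictionary and a plain SGD update. The parameter names (`encoder.w`, `decoder.res_new.w`) and the scalar weights are hypothetical; the real model holds full CNN weight tensors and would use a framework optimizer:

```python
def finetune_step(params, grads, frozen, lr=0.1):
    """One gradient step that updates only trainable parameters,
    leaving frozen ones (e.g. the VQGAN encoder) untouched."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }

# Hypothetical names: the pretrained encoder stays frozen while the
# newly added decoder residual layers receive gradient updates.
params = {"encoder.w": 1.0, "decoder.res_new.w": 0.5}
grads = {"encoder.w": 2.0, "decoder.res_new.w": 2.0}
updated = finetune_step(params, grads, frozen={"encoder.w"})
# encoder.w stays 1.0; decoder.res_new.w moves to 0.5 - 0.1 * 2.0
```

In a deep learning framework the same effect is typically achieved by disabling gradients on the frozen modules before training, so the optimizer never sees their parameters.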

To improve Muse’s text-image alignment, the team uses a classifier-free guidance (CFG) approach that linearly increases the guidance scale, so that early tokens are sampled with little or no guidance while the influence of the conditioning prompt grows for later tokens. They also employ parallel decoding, which predicts many masked tokens per forward pass, to reduce inference time.
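These two ideas can be sketched together in a few lines. The linear ramp, the step counts, the 8192-entry codebook size, and the random stand-in for model confidence below are illustrative assumptions, not the paper’s exact procedure:

```python
import random

def guidance_scale(step, total_steps, max_scale=8.0):
    """Linearly ramp the CFG scale: early decoding steps get little
    or no guidance, later steps get the full scale."""
    return max_scale * step / max(total_steps - 1, 1)

def apply_cfg(cond_logit, uncond_logit, scale):
    """One common CFG form: push logits toward the conditional
    prediction by `scale` (scale 0 → unconditional sampling)."""
    return uncond_logit + scale * (cond_logit - uncond_logit)

def parallel_decode(num_tokens, total_steps, seed=0):
    """Toy iterative parallel decoding: each step fills in a batch
    of masked positions at once instead of one token per pass."""
    rng = random.Random(seed)
    tokens = [None] * num_tokens          # None marks a masked slot
    for step in range(total_steps):
        scale = guidance_scale(step, total_steps)
        masked = [i for i, t in enumerate(tokens) if t is None]
        # Unmask an equal share of remaining positions each step; the
        # real model keeps its highest-confidence predictions, scored
        # from CFG-adjusted logits (see apply_cfg) at this `scale`.
        k = max(1, len(masked) // (total_steps - step))
        for i in rng.sample(masked, min(k, len(masked))):
            tokens[i] = rng.randrange(8192)   # sample a codebook id
    return tokens

tokens = parallel_decode(num_tokens=16, total_steps=4)
```

Filling many positions per step is why the approach needs far fewer forward passes than autoregressive decoding, which commits to one token at a time.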

In their empirical study, the team compared Muse with popular benchmark models on various text-to-image generation tasks. The Muse 900M parameter model achieved a new SOTA on the CC3M dataset with a 6.06 FID score (lower is better), while the Muse 3B parameter model recorded a 7.88 FID score on zero-shot COCO and a CLIP score of 0.32.

Muse also demonstrated impressive out-of-the-box, zero-shot editing capabilities, further confirming the potential of frozen large pretrained language models as powerful and efficient text encoders for text-to-image generation.

More information and results are available on Google’s project hub: http://muse-model.github.io. The paper Muse: Text-To-Image Generation via Masked Generative Transformers is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


