AI Computer Vision & Graphics Machine Learning & Data Science Research

OpenAI Releases GLIDE: A Scaled-Down Text-to-Image Model That Rivals DALL-E Performance

An OpenAI research team proposes GLIDE (Guided Language-to-Image Diffusion for Generation and Editing) for high-quality synthetic image generation. Human evaluators prefer GLIDE samples over DALL-E’s, and the model size is much smaller (3.5 billion vs. 12 billion parameters).

Text-to-image generation has been one of the most active and exciting AI fields of 2021. In January, OpenAI introduced DALL-E, a 12-billion parameter version of the company’s GPT-3 transformer language model designed to generate photorealistic images using text captions as prompts. An instant hit in the AI community, DALL-E’s stunning performance also attracted widespread mainstream media coverage. Last month, tech giant NVIDIA released the GAN-based GauGAN2 — the name taking inspiration from French Post-Impressionist painter Paul Gauguin as DALL-E had from Surrealist artist Salvador Dali.

Not to be outdone, OpenAI researchers this week presented GLIDE (Guided Language-to-Image Diffusion for Generation and Editing), a diffusion model that achieves performance competitive with DALL-E while using less than one-third of the parameters.

While most images can be relatively easily described in words, generating images from text inputs requires specialized skills and many hours of labour. Enabling an AI agent to automatically generate photorealistic images from natural language prompts not only empowers humans with the ability to create rich and diverse visual content with unprecedented ease, it also enables easier iterative refinement and fine-grained control of the generated images.

Recent studies have shown that likelihood-based diffusion models also have the ability to generate high-quality synthetic images, especially when paired with a guidance technique designed to trade-off diversity for fidelity. In May, OpenAI introduced a guided diffusion model that enables diffusion models to condition on a classifier’s labels.

GLIDE builds on this progress, applying guided diffusion to the challenge of text-conditional image synthesis. After training a 3.5 billion parameter GLIDE diffusion model that uses a text encoder to condition on natural language descriptions, the researchers compared two different guidance strategies: CLIP guidance and classifier-free guidance.

CLIP (Radford et al., 2021) is a scalable approach for learning joint representations between text and images that provides a score reflecting how close an image is to a caption. The team applied this method to their diffusion models by replacing the classifier with a CLIP model that “guides” the models.

Classifier-free guidance meanwhile is a technique for guiding diffusion models that does not require training a separate classifier. It has two appealing properties: 1) Enabling a single model to leverage its own knowledge during guidance, rather than relying on the knowledge of a separate classification model; 2) Simplifying guidance when conditioning on information that is difficult to predict with a classifier.

The researchers observed that classifier-free guidance image outputs were preferred by human evaluators for both photorealism and caption similarity.

In tests, GLIDE produced high-quality images with realistic shadows, reflections, and textures. The model can also combine multiple concepts (for example, corgis, bow ties, and birthday hats) while binding attributes such as colours to these objects.

In addition to generating images from text, GLIDE can also be used to edit existing images — inserting new objects, adding shadows and reflections, performing image inpainting, etc. — via natural language text prompts.

GLIDE can also transform simple line sketches into photorealistic images, and its zero-sample generation and repair capability for complex scenarios is strong.

In comparisons with DALL-E, GLIDE’s output images were preferred by human evaluators, even though it is a much smaller model (3.5 billion vs. 12 billion parameters), requires less sampling latency, and does not require CLIP reordering.

The team is aware their model could make it easier for malicious players to produce convincing disinformation or deepfakes. To safeguard against such use cases, they have only released a smaller diffusion model and a noised CLIP model trained on filtered datasets. The code and weights for these models are available on the project’s GitHub.

The paper GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

7 comments on “OpenAI Releases GLIDE: A Scaled-Down Text-to-Image Model That Rivals DALL-E Performance

  1. Pingback: r/artificial - [R] OpenAI Releases GLIDE: A Scaled-Down Text-to-Image Model That Rivals DALL-E Performance - Cyber Bharat

  2. Pingback: OpenAI Releases GLIDE: A Scaled-Down Text-to-Image Model That Rivals DALL-E Performance | Synced – Emsi’s feed

  3. Pingback: OpenAI Releases GLIDE: A Scaled-Down Text-to-Image Model That Rivals DALL-E Performance – Synced – xxp5 grooves


  5. I installed all the softwares. The ‘corgi in oil’ example looked like a corgi dog painted in oils, but when I did “cheesecakes across the ground” it just showed a bright wood panel floor. When I did “starship in the driveway” I just got a picture of a toaster sized object in a driveway, a large turquoise ball in a driveway, and part of a regular car in a driveway. I did one that was “giant cheesecake” and both of the results looked like regular cheesecakes. Google image search gives closer results. Also this goes very slow. Like 20 mins for 1 pic.

    • Micah,

      I have also installed their filtered models and am seeing similar results (now batching 10+ prompts overnight). Some of them are pretty good drawings, but definitely not integrating concepts reliably (like corgi with red bowtie and purple party hat; all 6 generated were corgis, 1 had a bowtie, and 1 had a hat, but the hat was not purple). This is likely the smaller size of the model after filtering for all things even vaguely related to humans, violence, and other things that could be controversial — about 75% of the original model.

      Still pretty neat for how easy it was to get going.

  6. Pingback: GLIDE: The new "creative" Artificial Intelligence which can compete with human graphic designers and artists | Creativity Overdrive

Leave a Reply

Your email address will not be published. Required fields are marked *

%d bloggers like this: