The long and rich history of human storytelling is based in large part on how our imaginations enable us to picture a scene based on its description in text or the spoken word. We also find it easy and natural to describe a given scene in natural language once we’ve constructed or received a mental image. These abilities suggest that at some level, humans have deeply coupled representations for textual and visual structures, and that these play a key role in how we understand our everyday world.
In February of this year, OpenAI open-sourced CLIP (Contrastive Language-Image Pretraining), a dual language-image encoder that takes a step toward unifying textual and visual information. In a new paper, a research team from Cross Compass Ltd, Massachusetts Institute of Technology, Tokyo Institute of Technology and University of Tokyo introduces the CLIP-based CLIPDraw, an algorithm that synthesizes drawings based on natural language input. CLIPDraw does not require any training, as its drawings are synthesized through iterative optimization via evaluation-time gradient descent.
A CLIP model comprises an image encoder and a text encoder that map image and textual inputs into a shared embedding space. CLIPDraw encodes a given description prompt via CLIP's text encoder and aims to synthesize a drawing whose image encoding matches that prompt encoding. Depending on the description prompt provided, CLIPDraw adjusts not only the content of its synthesized drawings but also their style. The researchers regard CLIPDraw as a testbed for exploring language-image relationships, and as a vehicle for synthesizing and studying AI-assisted artworks.
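The matching objective can be sketched with a toy example: both encoders map into the same space, and a drawing is scored by the cosine similarity between its image embedding and the prompt's text embedding. The 3-d vectors below are made-up stand-ins for real CLIP embeddings, used purely for illustration:

```python
import math

def cosine_similarity(u, v):
    """CLIP scores a text-image pair by the cosine of their embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for CLIP's two encoders; both map into the same space.
text_embedding = [0.2, 0.9, 0.1]      # e.g. encode_text("a watercolour cat")
image_embedding = [0.25, 0.85, 0.05]  # e.g. encode_image(current_drawing)

score = cosine_similarity(text_embedding, image_embedding)
# CLIPDraw's optimization maximizes this score (minimizes its negative)
# over the drawing's curve parameters.
```

In the real system the embeddings are high-dimensional vectors produced by CLIP's trained encoders; only the scoring-by-similarity structure is shown here.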
Drawings in CLIPDraw are represented as a set of differentiable RGBA Bézier curves, with each curve parametrized by control points along with a thickness and an RGBA colour vector. At initialization, the curves are randomly distributed throughout the image. During optimization, the number of curves and control points is fixed, while the positions of the control points along with the thickness and colour vectors are optimized using gradient descent.
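This representation can be sketched as follows. The four-control-point cubic form, the parameter ranges, and the stroke count below are illustrative assumptions; the actual system renders the curves with a differentiable vector-graphics rasterizer so that gradients flow back to these parameters:

```python
import random

def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bézier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)

# Endpoints are interpolated exactly; interior control points bend the stroke.
start = cubic_bezier((0, 0), (1, 2), (3, 2), (4, 0), 0.0)  # (0.0, 0.0)
mid = cubic_bezier((0, 0), (1, 2), (3, 2), (4, 0), 0.5)    # (2.0, 1.5)
end = cubic_bezier((0, 0), (1, 2), (3, 2), (4, 0), 1.0)    # (4.0, 0.0)

def random_curve(num_control_points=4):
    """One stroke: control points plus a thickness and an RGBA colour vector.
    These values are exactly the parameters that gradient descent updates."""
    return {
        "points": [(random.random(), random.random()) for _ in range(num_control_points)],
        "thickness": random.uniform(0.5, 3.0),
        "rgba": [random.random() for _ in range(4)],
    }

random.seed(0)
curves = [random_curve() for _ in range(256)]  # stroke count stays fixed during optimization
```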
The team compared CLIPDraw with various synthesis-through-optimization methods, including Pixel Optimization, BigGAN Optimization, and CLIPDraw (No Augment) — a method identical to CLIPDraw except that no image augmentation is applied to the synthesized drawings. They also explored various nuances of their approach: What kinds of visual techniques does CLIPDraw use to satisfy the textual description? Can CLIPDraw reliably produce drawings in different styles? How does the stroke count affect the drawings CLIPDraw produces? What happens if abstract words are given as a description prompt? Can synthesized drawings be fine-tuned via additional negative prompts?
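The role of augmentation can be sketched as follows: rather than scoring the raw drawing once, the drawing is scored through several randomly augmented copies and the average is optimized, which discourages brittle, adversarial patterns that satisfy CLIP only at one fixed view. Both `clip_score` and `augment` below are hypothetical stand-ins, not the paper's implementation:

```python
import random

def clip_score(image, prompt):
    """Stand-in for cosine similarity between CLIP encodings (hypothetical)."""
    return random.random()

def augment(image):
    """Stand-in for a random crop / perspective transform (hypothetical)."""
    return image

def augmented_score(image, prompt, num_augments=8):
    """Average the CLIP score over several random augmentations of the drawing.
    CLIPDraw (No Augment) corresponds to scoring the single raw image instead."""
    total = sum(clip_score(augment(image), prompt) for _ in range(num_augments))
    return total / num_augments
```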
From their empirical studies, the team identified a number of interesting CLIPDraw behaviours:
- By adjusting descriptive adjectives, such as “watercolour” or “3D rendering,” CLIPDraw produces drawings of vastly different styles.
- CLIPDraw often matches the description prompt in creative ways, such as writing words from the prompt inside the image itself, or interpreting ambiguous nouns in multiple ways.
- By giving CLIPDraw abstract prompts such as “happiness” or “self,” it is possible to examine what visual concepts the CLIP model associates with them.
- CLIPDraw behaviour can be further controlled through the use of negative prompts, such as “a messy drawing,” to encourage the opposite behaviour.
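The negative-prompt idea in the last point can be sketched as a modified objective: reward similarity to the description prompt while penalizing similarity to the negative prompt. The subtraction form and the weight below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def combined_score(image_emb, pos_emb, neg_emb, neg_weight=0.3):
    """Score a drawing against a positive prompt while steering it away from
    a negative prompt such as "a messy drawing"."""
    return cosine(image_emb, pos_emb) - neg_weight * cosine(image_emb, neg_emb)
```

Maximizing this combined score pushes the synthesized drawing toward the positive concept and away from the negated one.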
Overall, the study shows that CLIPDraw biases towards simple drawings of human-recognizable concepts, does not require learning a new model, and can generally synthesize images within about a minute on a typical GPU.
The CLIPDraw code is available in this Colab notebook. The paper CLIPDraw: Exploring Text-to-Drawing Synthesis Through Language-Image Encoders is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang