In the latest demonstration of popular large language model GPT-3’s power and potential, OpenAI researchers today unveiled DALL·E, a neural network trained to create images from text captions across a wide range of concepts expressible in natural language.
OpenAI’s GPT-3, released last June, showed that natural language inputs could be used to instruct a large neural network to perform a variety of text generation tasks. The same month, the company’s ImageGPT research showed that similar neural networks could generate high-fidelity images.
To start the new year, OpenAI's DALL·E builds on this work, aiming "to show that manipulating visual concepts through language is now within reach."
Deriving its name from a portmanteau of artist Salvador Dalí and Pixar's WALL·E, DALL·E is a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text–image pairs. DALL·E boasts a diverse set of capabilities, such as creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.
DALL·E is a transformer-based language model whose vocabulary includes tokens for both text and image concepts. It receives text and images as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens sequentially, enabling it to create images from scratch. It can also regenerate regions of existing images in a manner consistent with the text prompt.
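The single-stream setup can be sketched in a few lines of numpy. This is only an illustration of the idea, not OpenAI's implementation: the 256/1024 split between text and image tokens and the vocabulary size are assumptions made for the sketch, and random logits stand in for the transformer's actual outputs.

```python
import numpy as np

# Hypothetical sizes for illustration; DALL·E's real stream holds up to
# 1280 tokens, but the exact text/image split and vocabulary size here
# are assumptions for the sketch.
TEXT_LEN, IMAGE_LEN = 256, 1024
VOCAB = 16384

rng = np.random.default_rng(0)

# A caption and an image, each already encoded as token IDs.
text_tokens = rng.integers(0, VOCAB, size=TEXT_LEN)
image_tokens = rng.integers(0, VOCAB, size=IMAGE_LEN)

# 1. Both modalities enter the model as one stream of up to 1280 tokens.
stream = np.concatenate([text_tokens, image_tokens])
assert stream.shape[0] == 1280

# 2. Maximum-likelihood training: predict each token from the ones before
#    it; the loss is the negative log-probability of the observed next
#    token (cross-entropy), averaged over positions.
def next_token_nll(logits, targets):
    """logits: (seq, vocab) scores for the next token at each position."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Random logits stand in for the transformer's per-position predictions.
logits = rng.standard_normal((len(stream) - 1, VOCAB))
loss = next_token_nll(logits, stream[1:])
print(float(loss))
```

Because text tokens precede image tokens in the stream, sampling image tokens conditioned on a fixed caption prefix is what turns this next-token objective into text-to-image generation.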
OpenAI today also introduced CLIP (Contrastive Language–Image Pretraining), a neural network that efficiently learns visual concepts from natural language supervision. The researchers say CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, which is similar to the "zero-shot" capabilities of GPT-2 and GPT-3.
Trained on a wide variety of images paired with the natural language supervision abundantly available on the Internet, the network can be instructed in natural language to perform a variety of classification tasks without being directly optimized for any single benchmark's performance.
CLIP is able to learn from unfiltered, highly varied, and highly noisy data, and CLIP models are significantly more flexible and general than existing ImageNet models, the researchers say. The results from their tests with CLIP show that task-agnostic pretraining on Internet-scale natural language — which has powered recent breakthroughs in NLP — can also be leveraged to improve the performance of deep learning in fields such as computer vision.
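The zero-shot recipe the researchers describe reduces to comparing one image embedding against the text embeddings of the candidate class names. Below is a minimal numpy sketch of that step; random vectors stand in for CLIP's real image and text encoders, and the class names and embedding dimension are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 512  # assumed size of the shared image-text embedding space

# Candidate categories, phrased as natural-language prompts.
class_names = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Stand-ins for the encoders: in the real model, a text encoder embeds
# each class name and an image encoder embeds the query image into the
# same space. Here we fake an image that is "cat"-like by construction.
text_embeddings = rng.standard_normal((len(class_names), DIM))
image_embedding = text_embeddings[1] + 0.1 * rng.standard_normal(DIM)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Zero-shot classification: cosine similarity between the image embedding
# and each class-name embedding; the best-matching name is the prediction.
sims = normalize(text_embeddings) @ normalize(image_embedding)
predicted = class_names[int(np.argmax(sims))]
print(predicted)  # → "a photo of a cat"
```

Swapping in a different list of `class_names` retargets the classifier to a new benchmark with no retraining, which is what makes the approach "zero-shot."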
Reporter: Yuan Yuan | Editor: Michael Sarazen