
From Texts to Kitties: OpenAI’s GPT Language Model Tackles Image Generation

Large transformer-based language models trained on pixel sequences can generate coherent images without the use of labels.

It’s been just three weeks since OpenAI wowed the world with its gigantic 175-billion-parameter GPT-3 language model. Now, the San Francisco-based AI company has triggered a new stir on social media — proposing that large transformer-based language models trained on pixel sequences can generate coherent images without the use of labels. The new paper comes from an OpenAI research team that includes Founder and Chief Scientist Ilya Sutskever.

The success of unsupervised learning methods and transformer models in natural language processing (NLP) inspired OpenAI researchers to explore this new direction. Can similar models also learn useful representations for images?

Explains OpenAI in a blog post: “Just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, we show that our best generative model also contains features competitive with top convolutional nets in the unsupervised setting.” Unsupervised learning generally refers to model training that does not require manual data labelling.


AI pioneer and Turing Award honouree Geoffrey Hinton has tweeted that “unsupervised learning of representations is beginning to work quite well without requiring reconstruction.” In the paper A Simple Framework for Contrastive Learning of Visual Representations, co-authored by Hinton, a linear classifier trained on the self-supervised representations learned by the SimCLR framework achieves a significant performance leap in image recognition.


One of the vital insights the OpenAI researchers drew from transformer models like BERT and GPT-2 is that they are domain agnostic, meaning they can be applied directly to 1D sequences of any form. The team decided to downsample raw images to a low resolution and reshape them into long, text-like sequences of pixels, as the sequences unrolled from full-resolution images would be far too large to handle: “If we naively trained a transformer on a sequence of length 224² × 3, our attention logits would be tens of thousands of times larger than those used in language models, and even a single layer would not fit on a GPU.”
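The downsample-and-flatten step can be sketched as follows. This is a minimal NumPy illustration; the 32 × 32 target resolution and strided downsampling are simplifying assumptions for the example, not OpenAI’s exact preprocessing pipeline:

```python
import numpy as np

def image_to_sequence(image, target=32):
    """Downsample an RGB image to target x target by striding,
    then flatten it into a 1D pixel sequence in raster order."""
    h, w, c = image.shape
    step_h, step_w = h // target, w // target
    small = image[::step_h, ::step_w][:target, :target]  # naive downsampling
    return small.reshape(-1)  # length = target * target * c

# A full-resolution 224x224 RGB image would unroll to a sequence of
# 224 * 224 * 3 = 150,528 values -- far too long for a transformer to
# attend over, which is why iGPT works at low resolutions instead.
img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
seq = image_to_sequence(img)
print(seq.shape)  # 32 * 32 * 3 = 3072 pixel values
```

Once flattened this way, an image looks to the model exactly like a text sequence: a list of discrete tokens to be predicted one after another.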

The team trained a model dubbed iGPT that uses the same transformer architecture as GPT-2 in language, and it learned strong image representations as measured by linear probing, fine-tuning, and low-data classification. The approach consists of a pretraining stage completed without labels, followed by a fine-tuning step. The team used one of two pretraining objectives for pixel prediction: an autoregressive objective, which is also the GPT-2 pretraining approach, and a BERT-style masked objective. Once representations were learned under these objectives, the team evaluated them with linear probes or fine-tuning.
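The autoregressive objective amounts to a next-pixel cross-entropy loss: at each position the model scores every possible value for the following pixel, and training minimizes the negative log-likelihood of the true sequence. A minimal sketch with NumPy, using illustrative shapes (the 16-value palette and random logits are assumptions for the toy example, not iGPT’s actual vocabulary or outputs):

```python
import numpy as np

def autoregressive_nll(logits, sequence):
    """Average negative log-likelihood of predicting each pixel from
    the model's logits at the preceding position.
    logits: (seq_len - 1, vocab) scores for positions 1..seq_len-1
    sequence: (seq_len,) integer pixel values."""
    targets = sequence[1:]  # each position predicts the next pixel
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: a 10-pixel sequence drawn from a 16-value palette,
# scored by random (untrained) logits.
rng = np.random.default_rng(0)
seq = rng.integers(0, 16, size=10)
logits = rng.standard_normal((9, 16))
loss = autoregressive_nll(logits, seq)
```

The BERT-style variant differs only in what is predicted: masked pixels are reconstructed from the unmasked context rather than each pixel from its predecessors.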

In experiments on CIFAR-10 the iGPT-L model achieved 96.3 percent accuracy with a linear probe, outperforming a supervised Wide ResNet. It also reached 99.0 percent accuracy with full fine-tuning, matching the top supervised pretrained models. On ImageNet, the larger model iGPT-XL, trained on a mixture of ImageNet and web images, was comparable with self-supervised benchmarks, achieving an accuracy of 72.0 percent.

Even without the guidance of any human-labelled data, iGPT managed to generate a wide range of coherent images. But the researchers note that this performance came with a hefty price: “iGPT-L was trained for roughly 2500 V100-days while a similarly performing MoCo model can be trained in roughly 70 V100-days.”


OpenAI sees the work as a proof of concept demonstrating the enormous potential of large transformer-based language models to learn unsupervised representations in new domains. The drawback is the jaw-dropping compute cost of training the models, which may be a deal-breaker for researchers who don’t have access to a supercomputer.

The paper Generative Pretraining from Pixels is available on the OpenAI project page, and the code can be found on GitHub.

Journalist: Fangyu Cai | Editor: Michael Sarazen

