Contrastive Language-Image Pretraining (CLIP) is one of the most popular pretraining strategies for high-quality vision backbones, as it demonstrates impressive zero-shot transfer capabilities and its performance rivals that of the best label-supervised approaches. Image captioning, meanwhile, despite its simplicity, has attracted less attention than CLIP due to its inferior zero-shot learning capabilities.
In the new paper Image Captioners Are Scalable Vision Learners Too, a DeepMind research team presents CapPa, an image-captioning-based pretraining strategy that can compete with CLIP and exhibits favorable model and data scaling properties, verifying that plain image captioning can be a competitive pretraining strategy for vision backbones.

The goal of this work is to develop a plain image captioning approach that is comparable to the popular CLIP in terms of simplicity, scalability, and efficiency.

To this end, the team first conducts a comprehensive comparison between image captioning (which they refer to simply as Captioner, or Cap) and the CLIP strategy, carefully matching pretraining compute, model capacity, and training data.
They observe that Cap vision backbones surpass CLIP models on few-shot classification, captioning, OCR, and VQA tasks, and achieve comparable performance when transferred to classification tasks with large labeled training data, indicating that Cap vision backbones may be superior for multimodal downstream tasks.
The researchers further introduce the CapPa pretraining procedure, a mixed training strategy that combines standard autoregressive prediction (Cap) and parallel prediction (Pa).
In particular, in terms of model architecture, they adopt a Vision Transformer (ViT) as the vision encoder and apply a standard Transformer decoder architecture to predict image captions, using cross-attention to feed the ViT-encoded sequence to the decoder.
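To make this setup concrete, below is a minimal PyTorch-style sketch of such an encoder-decoder captioner: a Transformer decoder cross-attends to ViT patch features and predicts caption tokens either autoregressively (Cap) or in parallel (Pa). This is an illustrative reconstruction rather than the authors' implementation; the class name and all sizes (vocabulary, width, depth, caption length) are assumptions.

```python
import torch
import torch.nn as nn

class CapPaDecoderSketch(nn.Module):
    """Illustrative captioning decoder that cross-attends to ViT patch features.

    Hypothetical reconstruction; all hyperparameters are placeholder assumptions.
    """

    def __init__(self, vocab_size=32_000, d_model=768, nhead=12, depth=6, max_len=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.randn(max_len, d_model) * 0.02)
        self.mask_emb = nn.Parameter(torch.randn(d_model) * 0.02)  # shared [MASK] embedding for parallel prediction
        self.bos_emb = nn.Parameter(torch.randn(d_model) * 0.02)   # start-token embedding for autoregression
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)         # cross-attends to image tokens via `memory`
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_tokens, captions, parallel: bool):
        B, T = captions.shape
        if parallel:
            # Parallel prediction (Pa): every decoder input is [MASK] and no causal mask
            # is applied, so each caption token must be predicted from the image alone.
            x = self.mask_emb.expand(B, T, -1)
            tgt_mask = None
        else:
            # Autoregressive captioning (Cap): teacher forcing with inputs shifted right.
            x = torch.cat([self.bos_emb.expand(B, 1, -1), self.token_emb(captions[:, :-1])], dim=1)
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(x + self.pos_emb[:T], memory=image_tokens, tgt_mask=tgt_mask)
        return self.lm_head(h)  # logits over the vocabulary at every caption position
```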
During training, instead of training the model only autoregressively, they also train it to predict all tokens in parallel. In this setting, the model predicts all caption tokens independently and in parallel, so the decoder can rely only on the image information to predict each token; it therefore has to extract much more from the image, which improves prediction accuracy.
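As a rough sketch of how such a mixed objective could be driven during training, the snippet below samples per step whether to use parallel or autoregressive prediction and applies a standard cross-entropy loss over all caption positions. The 75% parallel-prediction fraction, the batch shapes, and the `CapPaDecoderSketch` class from the sketch above are assumptions for illustration, not details taken from the paper's code.

```python
import torch
import torch.nn.functional as F

# Illustrative training step for the mixed Cap + Pa objective; shapes, the parallel
# fraction, and CapPaDecoderSketch (defined in the sketch above) are assumptions.
model = CapPaDecoderSketch()
image_tokens = torch.randn(8, 196, 768)          # ViT patch features: (batch, patches, width)
caption_ids = torch.randint(0, 32_000, (8, 64))  # tokenized captions: (batch, length)

parallel = bool(torch.rand(()) < 0.75)           # most steps use parallel prediction
logits = model(image_tokens, caption_ids, parallel=parallel)

# Targets are the caption tokens in both modes; what changes is only what the decoder
# may condition on (previous tokens for Cap, nothing but the image for Pa).
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), caption_ids.reshape(-1))
loss.backward()
```

Because the parallel branch gives the decoder no caption context, the training signal pushes the vision encoder to capture everything needed to reconstruct the caption, which is the intuition behind the stronger vision backbone.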


In their empirical study, the team compared CapPa with the conventional Cap and the popular state-of-the-art CLIP approach on a wide variety of downstream tasks, such as image classification, captioning, OCR, and visual question answering. CapPa outperforms Cap on almost all tasks and surpasses or matches CLIP∗ trained with the same batch size. CapPa also exhibits strong zero-shot capabilities and promising scaling properties.
Overall, this work establishes plain image captioning as a competitive pretraining strategy for vision backbones. The team hopes their contribution will inspire more research on captioning as a pretraining task for vision encoders.
The paper Image Captioners Are Scalable Vision Learners Too is on arXiv.
Author: Hecate He | Editor: Chain Zhang
