Vision-language tasks are a hot topic in natural language processing (NLP) and computer vision. Most existing methods are based on pretrained models that use late fusion to combine multi-modal inputs for downstream tasks. Such approaches, however, usually require task-specific data annotations during training, and meeting this requirement remains difficult and expensive for many multi-modal tasks. A recent paper by Microsoft researchers proposes a new vision-language pretrained model for image-text joint embedding, ImageBERT, which achieves SOTA performance on image retrieval and text retrieval on both the MSCOCO and Flickr30k datasets.
Like Google’s BERT (Bidirectional Encoder Representations from Transformers) language model, ImageBERT is Transformer-based. It takes tokens from different modalities (both textual and visual) as inputs, which an embedding layer encodes into embeddings. These embeddings are then fed into a multi-layer bidirectional self-attention Transformer, which is trained to model the relationships between images and text.
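The joint input described above can be pictured with a minimal sketch. It is not the paper's implementation: the `Token` class, the `build_joint_sequence` helper and the `region_N` placeholders are hypothetical names introduced here, and the `[CLS]`/`[SEP]` markers are assumed to follow BERT's convention. In a real model each image token would carry a feature vector for a detected region rather than a string ID.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    value: str     # word piece, or a placeholder ID for an image region
    modality: str  # "text" or "image"
    position: int  # index within its own modality


def build_joint_sequence(words: List[str], regions: List[str]) -> List[Token]:
    """Concatenate textual and visual tokens into a single input sequence."""
    seq = [Token("[CLS]", "text", 0)]
    seq += [Token(w, "text", i + 1) for i, w in enumerate(words)]
    seq.append(Token("[SEP]", "text", len(words) + 1))
    # Region tokens follow the text; they restart their own position index.
    seq += [Token(r, "image", i) for i, r in enumerate(regions)]
    return seq


seq = build_joint_sequence(["a", "dog", "runs"], ["region_0", "region_1"])
print([(t.value, t.modality) for t in seq])
```

Because both modalities sit in one sequence, self-attention can relate any word to any image region, which is what lets a single Transformer model cross-modal relationships.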
The quantity and quality of data are critical for the cross-modal pretraining of vision-language tasks, so the researchers developed a weakly supervised method for collecting large-scale image-text data from the Internet in a bid to boost pretraining performance. Their Large-scale weAk-supervised Image-Text (LAIT) dataset includes 10 million vision-language pairs (image + description), and was used to pretrain the ImageBERT model.
After pretraining on LAIT, the researchers pretrained the model in a second stage on the public datasets Conceptual Captions (the most widely used data for image-text pretraining) and SBU Captions (the SBU Captioned Photo Dataset). The model was pretrained simultaneously on four tasks designed by the researchers to model text and visual content and their interrelationships:
- Task 1: Masked Language Modeling (MLM) – The same task as MLM in BERT training: masked text tokens are predicted from their context, which enables the training of deep bidirectional embeddings.
- Task 2: Masked Object Classification (MOC) – An expansion of the MLM task to the visual modality: the model predicts the object category of masked image regions.
- Task 3: Masked Region Feature Regression (MRFR) – Similar to MOC, this task also models the visual content, but more precisely: it predicts the feature of each masked region rather than only its category.
- Task 4: Image Text Matching (ITM) – Learns image-text alignment by predicting whether an image and a text belong together.
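To make the training signal of these tasks concrete, here is a minimal sketch of how examples for the masked-prediction tasks and for ITM could be built. It rests on assumptions not stated in the article: the function names are hypothetical, the 15% masking rate is BERT's, and the 50/50 split between matched and mismatched pairs is an illustrative choice.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: List[str], prob: float,
                rng: random.Random) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked inputs, labels); a label is None where nothing was masked."""
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < prob:
            inputs.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must recover it
        else:
            inputs.append(tok)
            labels.append(None)   # position is ignored by the loss
    return inputs, labels


def make_itm_pair(idx: int, captions: List[str],
                  rng: random.Random) -> Tuple[str, int]:
    """Pair image `idx` with its own caption (label 1) or another one (label 0)."""
    if rng.random() < 0.5:
        return captions[idx], 1
    other = rng.choice([i for i in range(len(captions)) if i != idx])
    return captions[other], 0


rng = random.Random(0)
masked, labels = mask_tokens(["a", "dog", "runs", "on", "grass"], 0.15, rng)
caption, is_match = make_itm_pair(0, ["a dog runs", "a red car", "two cats"], rng)
print(masked, labels, caption, is_match)
```

The same masking routine would apply to image-region tokens for MOC and MRFR. Only the label changes: an object category for MOC, the region's feature vector for MRFR.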
Experimental results show that the multi-stage pretraining approach achieves better results than single-stage pretraining. The researchers also fine-tuned the pretrained ImageBERT model and compared it with SOTA methods on image retrieval and text retrieval tasks, where ImageBERT obtained top results on both the MSCOCO and Flickr30k datasets.
The researchers hope that their new model and dataset can further advance the study and development of cross-modal pretraining.
The paper ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data is on arXiv.
Author: Herin Zhao | Editor: Michael Sarazen