Microsoft ImageBERT | Cross-modal Pretraining with Large-scale Image-Text Data

A recent paper by Microsoft researchers proposes ImageBERT, a new vision-language pretrained model for image-text joint embedding, which achieves state-of-the-art (SOTA) performance on both the MSCOCO and Flickr30k datasets.