The emergence of powerful vision-language pretraining models in recent years has significantly boosted performance on a range of image-to-text generation tasks. The development of large-scale pretraining models for text-to-image synthesis, however, remains relatively underexplored in the machine learning research community.
In the new paper ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation, a Baidu research team proposes ERNIE-ViLG, a 10-billion-parameter pretraining framework for bidirectional text-image generation. Pretrained on 145 million Chinese image-text pairs, ERNIE-ViLG achieves state-of-the-art performance on both text-to-image and image-to-text generation tasks.

The team summarizes their contributions as:
- We propose ERNIE-ViLG, a unified generative pretraining method for bidirectional image-text generation tasks, where both image and text generation are formulated as autoregressive generative tasks. We also propose the first end-to-end training method for text-to-image synthesis based on image discrete representation (illustrated in the sketch after this list), which enhances both the generator and the reconstructor and outperforms the traditional two-stage approach.
- We train a 10-billion parameter ERNIE-ViLG model and obtain superior performance for both text-to-image and image-to-text generation tasks, setting new SOTA results for text-to-image synthesis on the MS-COCO benchmark and obtaining SOTA results for image captioning on two popular Chinese datasets.
- Superior performance on the generative visual question answering (VQA) task shows that our bidirectional generative model captures the complex semantic alignments between the vision and the language modalities.
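The "image discrete representation" referenced in the first contribution means that each image is converted into a sequence of codebook indices by a vector quantization autoencoder before being modelled autoregressively. The minimal PyTorch sketch below illustrates only this quantization step; the codebook size, embedding dimension, and toy encoder output are illustrative assumptions rather than ERNIE-ViLG's actual configuration.

```python
import torch
import torch.nn as nn

class ToyVectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization over a learned codebook.

    Illustrative only: the codebook size and dimensions are arbitrary and
    do not reflect ERNIE-ViLG's actual VQVAE configuration.
    """
    def __init__(self, num_codes=8192, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):
        # z: (batch, positions, code_dim) continuous encoder features.
        # Euclidean distance to every codebook entry, then pick the nearest one.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        indices = dists.argmin(dim=-1)          # discrete image tokens
        quantized = self.codebook(indices)      # embeddings fed back to the decoder
        return indices, quantized

# A toy 16x16 grid of visual positions stands in for one encoded image.
features = torch.randn(1, 16 * 16, 256)
tokens, _ = ToyVectorQuantizer()(features)
print(tokens.shape)  # torch.Size([1, 256]) -- one discrete token per position
```

In the full model, these token indices play the same role for images that word pieces play for text, so both modalities can be generated token by token by the same transformer.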

ERNIE-ViLG adopts a unified framework for bidirectional image-text generation. Images are represented as sequences of discrete tokens produced by a vector quantization variational autoencoder (VQVAE). For image-to-text generation, this discrete image sequence is used as the input to generate a corresponding textual sequence. For text-to-image synthesis, the parameter-sharing transformer model uses text inputs to generate a corresponding discrete visual sequence, which is then used to reconstruct the image. The text-to-image synthesis model is thus trained in an end-to-end manner (a rough sketch of this setup follows the list below), which the researchers explain provides two advantages:
- More contextual features for the reconstructor. Compared with the context-independent embeddings in the codebook, the hidden embeddings are encoded by a deep model and carry richer image semantics; through the attention interactions they also perceive the textual information.
- Enhancing the generator with the reconstruction task. The hidden embeddings receive supervision signals from both the abstract generation task and the original reconstruction task, helping the generator learn better image representations.
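As a rough, hypothetical illustration of this parameter-sharing setup (not the paper's implementation), the PyTorch sketch below runs text tokens and discrete image tokens through one shared autoregressive transformer in either order, depending on the generation direction; the module sizes, the simple concatenation scheme, and the toy shapes are all assumptions.

```python
import torch
import torch.nn as nn

class ToyBidirectionalGenerator(nn.Module):
    """A single shared transformer handles both directions: text -> image
    tokens and image tokens -> text. All sizes here are illustrative."""
    def __init__(self, text_vocab=30000, image_vocab=8192, dim=512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.image_emb = nn.Embedding(image_vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.shared_transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.to_text = nn.Linear(dim, text_vocab)    # predicts next text token
        self.to_image = nn.Linear(dim, image_vocab)  # predicts next image token

    def forward(self, text_ids, image_ids, direction="text2image"):
        # The conditioning sequence comes first, the target sequence second;
        # the same transformer weights serve both generation directions.
        if direction == "text2image":
            seq = torch.cat([self.text_emb(text_ids), self.image_emb(image_ids)], dim=1)
        else:  # "image2text"
            seq = torch.cat([self.image_emb(image_ids), self.text_emb(text_ids)], dim=1)
        # Causal mask so each position only attends to earlier positions.
        n = seq.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        hidden = self.shared_transformer(seq, mask=mask)
        if direction == "text2image":
            # Hidden states over the image positions could be handed directly to the
            # VQVAE reconstructor, which is the end-to-end idea described above.
            return self.to_image(hidden[:, text_ids.size(1):])
        return self.to_text(hidden[:, image_ids.size(1):])

model = ToyBidirectionalGenerator()
text = torch.randint(0, 30000, (1, 12))    # toy caption token ids
image = torch.randint(0, 8192, (1, 256))   # toy 16x16 grid of visual tokens
print(model(text, image, "text2image").shape)  # torch.Size([1, 256, 8192])
```

The point of the sketch is only that one set of transformer weights serves both generation directions and exposes hidden states that a reconstructor can consume, which is what allows the text-to-image branch to be trained end to end.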
The 10-billion-parameter ERNIE-ViLG model was implemented on the PaddlePaddle platform and pretrained on a large-scale dataset comprising over 145 million high-quality Chinese image-text pairs. For their empirical study, the team applied ERNIE-ViLG to two bidirectional image-text tasks: text-to-image synthesis and image captioning. To evaluate the model’s cross-modal understanding ability, the team also applied it to the challenging generative VQA task.

For text-to-image synthesis, ERNIE-ViLG surpassed DALL-E with a significant FID improvement of 12.8 in the zero-shot setting and achieved performance comparable to fully supervised models. Moreover, the results show that ERNIE-ViLG can not only draw entities based on a given text description but also combine these drawings with backgrounds in a visually coherent way.
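For context, FID measures the distance between Inception-feature statistics of generated and reference images, so lower is better. A minimal evaluation sketch using the torchmetrics implementation might look like the following; the batch sizes, random placeholder images, and preprocessing are assumptions, and a real zero-shot evaluation would use MS-COCO reference images and the model's generations for their captions.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-feature statistics of real vs. generated images; lower is better.
fid = FrechetInceptionDistance(feature=2048)

# Placeholder uint8 RGB batches in (N, 3, H, W) layout; stand-ins for MS-COCO
# reference images and the images generated from their captions.
real_images = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(fid.compute())  # the statistic on which ERNIE-ViLG improves over DALL-E by 12.8
```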

In the image captioning experiments, ERNIE-ViLG achieved the best results on both the COCO-CN and AIC-ICC datasets. On the generative VQA task, ERNIE-ViLG achieved a Turing Test passing rate of 78.5 percent, significantly improving on the performance of the mQA model and indicating its ability to better capture semantic alignments across the vision and language modalities.
Overall, the proposed ERNIE-ViLG advances unified pretraining performance for image-to-text and text-to-image cross-modal generation tasks to a new state-of-the-art. It represents a powerful addition to Baidu’s “Wenxin” large-scale model panorama, which aims to boost the development of new artificial intelligence systems in China.
The paper ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
