Building versatile vision-language models that not only work on a single language but can generalize across all the world’s approximately 7,000 languages is difficult — and the task becomes even more challenging if the model is transferred without any additional annotated training data.
To tackle this issue, a research team from Carnegie Mellon, Oxford and Facebook AI has proposed a transformer-based model, Multilingual Multimodal Pretraining (MMP), that can learn contextualized multilingual multimodal embeddings under a zero-shot setting.

Recent research in cross-lingual transfer learning has demonstrated that models trained with only English annotations can nonetheless generalize to non-English languages. This success is attributed to the shared underlying vocabulary or structure among many languages. For example, many English and German words stem from the same origin, and many languages share the same recursive structures.
The researchers note that all humans have similar vision systems, and so many common visual concepts can be understood universally. This enables the leveraging of language and visual properties to associate sentences in different languages with visual concepts, so as to enable the cross-lingual transfer of vision-language models under a zero-shot (no labels) setting.
The team summarizes their main contributions as:
- Propose a transformer-based video-text model that learns contextualized multilingual multimodal representations.
- Empirically demonstrate that vision-language models, unlike NLP models, have limited zero-shot cross-lingual transferability.
- Introduce a multilingual multimodal pretraining strategy and construct a new Multi-HowTo100M dataset for pretraining to improve the zero-shot cross-lingual capability of vision-language models.
- Demonstrate the effectiveness of the proposed approach by achieving state-of-the-art multilingual text-to-video search performance in both zero-shot and fully supervised setups.

The text encoder of the proposed MMP consists of a multilingual transformer and a text transformer pooling head, while the video encoder consists of a 3D-CNN and a video transformer pooling head. The multilingual transformer generates contextualized multilingual representations to encode a sentence x; the transformer pooling head serves as a pooling function that selectively encodes variable-length sentences and aligns them with the corresponding visual content; the 3D-CNN encodes spatio-temporal context in a video; and the video transformer pooling head encodes videos of different lengths into fixed-length representations.
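The dual-encoder layout described above can be sketched roughly as follows in PyTorch. This is a minimal illustration rather than the authors' released code: the class names, layer sizes, and pooling choice (mean pooling after a small transformer) are assumptions, and the multilingual-transformer token outputs (mBERT/XLM-R) and 3D-CNN clip features are assumed to be computed upstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerPoolingHead(nn.Module):
    """Pools a variable-length sequence of features into one fixed-size embedding:
    a small transformer contextualizes the sequence, then mean pooling and a
    linear projection compress it."""
    def __init__(self, in_dim, embed_dim=512, nhead=8, nlayers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=in_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=nlayers)
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, feats, pad_mask=None):
        h = self.encoder(feats, src_key_padding_mask=pad_mask)
        return self.proj(h.mean(dim=1))

class DualEncoder(nn.Module):
    """Text branch: multilingual-transformer token embeddings -> pooling head.
    Video branch: 3D-CNN clip features -> pooling head.
    Both branches map into a shared embedding space."""
    def __init__(self, text_dim=768, video_dim=2048, embed_dim=512):
        super().__init__()
        self.text_pool = TransformerPoolingHead(text_dim, embed_dim)
        self.video_pool = TransformerPoolingHead(video_dim, embed_dim)

    def encode_text(self, token_embs, pad_mask=None):
        # token_embs: (batch, num_tokens, text_dim) from mBERT / XLM-R
        return F.normalize(self.text_pool(token_embs, pad_mask), dim=-1)

    def encode_video(self, clip_feats, pad_mask=None):
        # clip_feats: (batch, num_clips, video_dim) from a 3D-CNN
        return F.normalize(self.video_pool(clip_feats, pad_mask), dim=-1)
```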
Next, the researchers align the encoded text and video to form multimodal representations by minimizing a contrastive objective that maps associated (video, text) embeddings close to each other in a shared embedding space. Inspired by Translation Language Modeling (TLM), they also propose a TLM-like contrastive objective to improve alignment quality.
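A common way to instantiate such a contrastive objective is a symmetric InfoNCE-style loss over a batch of (video, text) pairs. The sketch below follows that pattern and is only an approximation of the paper's objective; the exact formulation, temperature value, and the TLM-like variant may differ.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.05):
    """Pulls matched (video, text) pairs together and pushes apart all
    other pairings in the batch, in both retrieval directions."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)          # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)      # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```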
The team also built a “Multilingual How To 100M dataset” (Multi-HowTo100M) for multilingual multimodal learning to exploit the weak supervision from large-scale multilingual text-video data. The Multi-HowTo100M dataset contains 1.1 million videos, each with subtitles in seven to nine languages.
To validate the effectiveness of their proposed multilingual text-video approach, the researchers conducted experiments with multilingual BERT (mBERT) and XLM-RoBERTa-large (XLM-R) as text backbones, using Multi-HowTo100M for multilingual multimodal pretraining (MMP). They fine-tuned the proposed model on the VTT, VATEX, and Multi30K datasets to evaluate it on text-to-video search tasks. In the zero-shot cross-lingual transfer experiments, they fine-tuned with only English-language video data and tested the model with non-English queries.
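Text-to-video search is typically scored with Recall@K, the fraction of queries whose ground-truth video appears among the top-K ranked results. Below is a small sketch of how that metric can be computed from normalized query and video embeddings; the function name and the convention that query i matches video i are illustrative assumptions.

```python
import torch

def recall_at_k(text_emb, video_emb, ks=(1, 5, 10)):
    """For each query embedding, rank all videos by cosine similarity and
    check whether the ground-truth video (same index) is in the top K."""
    sims = text_emb @ video_emb.t()                     # (N, N) similarity scores
    ranks = sims.argsort(dim=1, descending=True)        # video indices sorted by score
    gt = torch.arange(sims.size(0)).unsqueeze(1)
    positions = (ranks == gt).float().argmax(dim=1)     # rank position of the correct video
    return {f"R@{k}": (positions < k).float().mean().item() * 100 for k in ks}
```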

For XLM-R, improvements in R@1 asymptotically converge when pretraining with more multilingual text-video pairs. For zero-shot German to video search, pretraining with more languages improves the search performance.

With the in-domain translated pseudo-multilingual annotations, both mBERT and XLM-R showed better performance across non-English languages. Notably, there is still a performance gap between zero-shot and translate-train settings for mBERT, but the gap is much smaller for XLM-R.

The proposed MMP improved all recall metrics even with the modality gap, with its average R@1 improvement reaching an impressive 3.2. Moreover, without using any Czech-language annotations, the zero-shot model with MMP achieved comparable Czech-to-image search performance to SMALR (Scalable Multilingual Aligned Language Representation), which uses ten languages as annotations.
The results demonstrate MMP’s effectiveness for zero-shot cross-lingual transfer of vision-language models.
The paper Multilingual Multimodal Pretraining for Zero-Shot Cross-Lingual Transfer of Vision-Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen