Research

Facebook Boosts Cross-Lingual Language Model Pretraining Performance

Facebook researchers have introduced two new methods for pretraining cross-lingual language models (XLMs). The unsupervised method uses monolingual data only, while the supervised version leverages parallel data with a new cross-lingual language model objective. The research aims to build an efficient cross-lingual encoder that maps sentences from different languages into a shared embedding space, an approach that benefits tasks such as machine translation.

The results show strong performance across a range of cross-lingual understanding tasks, with state-of-the-art results on cross-lingual classification and on both unsupervised and supervised machine translation.

The Facebook XLM project contains code for:

  • Language model pretraining:
    • Causal Language Model (CLM) – monolingual
    • Masked Language Model (MLM) – monolingual
    • Translation Language Model (TLM) – cross-lingual
  • Supervised / Unsupervised MT training:
    • Denoising auto-encoder
    • Parallel data training
    • Online back-translation
  • XNLI fine-tuning
  • GLUE fine-tuning

XLM also supports multi-GPU and multi-node training.

Generating cross-lingual sentence representations

The project provides sample code for quickly obtaining cross-lingual sentence representations from pretrained models. These representations are useful for machine translation, for computing sentence similarities, or for building cross-lingual classifiers. The examples are written in Python 3 and require the NumPy, PyTorch, fastBPE, and Moses libraries.

To generate cross-lingual sentence representations, the first step is to import the required modules and load the pretrained model:

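Below is a minimal sketch of this step, modeled on the project's example code for generating embeddings; the checkpoint filename and the src.* module paths are assumptions based on the repository layout and may differ from the actual code.

    # Step 1: imports and loading a pretrained XLM checkpoint.
    # Run from the root of the XLM repository so that the `src` package is importable.
    import torch

    from src.utils import AttrDict
    from src.data.dictionary import Dictionary, BOS_WORD, EOS_WORD, PAD_WORD, UNK_WORD, MASK_WORD
    from src.model.transformer import TransformerModel

    # Path to a downloaded pretrained checkpoint (assumed filename).
    model_path = 'mlm_tlm_xnli15_1024.pth'

    # The checkpoint bundles the model weights, the training parameters, and the vocabulary.
    reloaded = torch.load(model_path)
    params = AttrDict(reloaded['params'])
    print("Supported languages:", ", ".join(params.lang2id.keys()))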

Next, build the dictionary, update the parameters, and build the model:

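Continuing from the snippet above, a sketch of this step could look as follows; the checkpoint keys and the TransformerModel constructor arguments are assumptions.

    # Step 2: rebuild the vocabulary, update the parameters, and instantiate the encoder.
    dico = Dictionary(reloaded['dico_id2word'], reloaded['dico_word2id'], reloaded['dico_counts'])
    params.n_words = len(dico)
    params.bos_index = dico.index(BOS_WORD)
    params.eos_index = dico.index(EOS_WORD)
    params.pad_index = dico.index(PAD_WORD)
    params.unk_index = dico.index(UNK_WORD)
    params.mask_index = dico.index(MASK_WORD)

    # Build the Transformer encoder and load the pretrained weights.
    model = TransformerModel(params, dico, True, True)
    model.eval()
    model.load_state_dict(reloaded['model'])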

The model expects input sentences in BPE format (produced with the fastBPE library); sentence representations are then extracted from the pretrained model for these sentences, as sketched below:

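The sentences below are illustrative stand-ins, not the researchers' own examples; they show the expected input format: whitespace-tokenized text with fastBPE subword splits marked by "@@", each paired with its language code.

    # Step 3: example sentences, already tokenized and BPE-split with fastBPE.
    # Each sentence is paired with its language code.
    sentences = [
        ('the follow@@ ing example is toke@@ nized and split into sub@@ words .', 'en'),
        ('ce@@ tte phrase est un exemple en fran@@ cais .', 'fr'),
    ]

    # Wrap each sentence in the </s> delimiters expected by the model.
    sentences = [(('</s> %s </s>' % sent.strip()).split(), lang) for sent, lang in sentences]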

The last step is to create a batch and run a forward pass to produce the final sentence embeddings:
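
A sketch of the final step, again following the project's embedding-generation example; the 'fwd' calling convention and the use of the first hidden state as the sentence embedding are assumptions.

    # Step 4: pack the sentences into a padded batch and run a forward pass.
    bs = len(sentences)                             # batch size
    slen = max(len(sent) for sent, _ in sentences)  # longest sentence, in BPE tokens

    # Word indices of shape (slen, bs), padded with the pad index.
    word_ids = torch.LongTensor(slen, bs).fill_(params.pad_index)
    for i, (sent, _) in enumerate(sentences):
        ids = torch.LongTensor([dico.index(w) for w in sent])
        word_ids[:len(ids), i] = ids

    # Actual lengths and language IDs for each sentence.
    lengths = torch.LongTensor([len(sent) for sent, _ in sentences])
    langs = torch.LongTensor([params.lang2id[lang] for _, lang in sentences]).unsqueeze(0).expand(slen, bs)

    # Forward pass through the encoder (no causal masking).
    with torch.no_grad():
        tensor = model('fwd', x=word_ids, lengths=lengths, langs=langs, causal=False).contiguous()

    # tensor has shape (slen, bs, hidden_dim); the hidden state of the first token
    # serves as the sentence embedding.
    embeddings = tensor[0]
    print(embeddings.size())

These embeddings can then be compared with cosine similarity or fed into a downstream classifier, as described above.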