In 2018 Google released BERT (Bidirectional Encoder Representations from Transformers), a pretrained language model that scored SOTA results on a range of natural language processing (NLP) tasks and revolutionized the research field. Similar transformer-based models such as OpenAI's GPT-2 and Baidu's ERNIE followed. In October 2019 Facebook AI introduced BART, a new pretrained model for text generation and comprehension that combines bidirectional and autoregressive methods.
Now, Facebook AI researchers have further developed the BART model with the introduction of mBART, which they say is the first method for pretraining a complete sequence-to-sequence model by denoising full texts in multiple languages for machine translation purposes.
Machine translation can be briefly described as automatically converting text in one language into another. With most current machine translation methods, only some model components, for instance the encoder or the decoder, can be pretrained. Functionality is also limited, as most models can reconstruct only parts of the text or focus solely on English corpora. The new method presented by the Facebook AI research group shows significant performance gains on translation across multiple languages thanks to the addition of a pretrained autoregressive model.
For any pretrained model, the quality of the pretraining process is critical. The Facebook researchers used a 25-language subset of the Common Crawl corpus (CC25), rebalanced with up/down-sampling according to each language's share of the data. The text corpus was then tokenized with a SentencePiece model (SPM), which implements subword units and a unigram language model and can be trained directly from raw sentences.
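The up/down-sampling step rebalances the corpus so low-resource languages are not drowned out by high-resource ones. A minimal sketch of a smoothed sampling scheme of the kind the paper describes (the paper reports a smoothing exponent of roughly 0.7; the function name and exact formulation here are illustrative assumptions, not the authors' code):

```python
def sampling_ratios(sizes, alpha=0.7):
    """Per-language up/down-sampling ratios for a multilingual corpus.

    sizes: dict mapping language code -> corpus size (e.g. sentence count).
    alpha < 1 smooths the distribution, up-sampling low-resource
    languages and down-sampling high-resource ones, while keeping the
    total amount of sampled data unchanged.
    """
    total = sum(sizes.values())
    p = {lang: n / total for lang, n in sizes.items()}   # language shares
    z = sum(pi ** alpha for pi in p.values())            # normalizer
    # ratio_i = (1 / p_i) * p_i^alpha / z: multiply corpus i by this factor
    return {lang: (1.0 / p[lang]) * (p[lang] ** alpha) / z for lang in p}
```

For example, with a 90/10 split between a high- and a low-resource language, the low-resource ratio comes out above 1 (up-sampled) and the high-resource ratio below 1 (down-sampled), with the overall data volume preserved.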
A BART model with 12 encoder layers and 12 decoder layers was pretrained on different sets of languages. The final models were named mBARTNum, where "Num" represents the number of languages used for training (e.g. mBART25), alongside a baseline model, Random, that was randomly initialized without pretraining.
These pretrained models were then separately fine-tuned on 24 pairs of publicly available parallel corpora by feeding the source language into the encoder and decoding the target language. The models' machine translation quality was evaluated with BLEU (bilingual evaluation understudy) scores, calculated by comparing sentence-level machine translation outputs against a set of human reference translations.
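BLEU scores a candidate translation by its n-gram overlap with a reference, with a brevity penalty for candidates shorter than the reference. A minimal single-reference sketch (real evaluations such as sacreBLEU add smoothing, multiple references, and standardized tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference string."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped matches
        precisions.append(overlap / max(sum(c_ngrams.values()), 1))
    if min(precisions) == 0:        # no smoothing: any empty overlap -> 0
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_avg)
```

An exact match scores 1.0; a translation sharing no n-grams with the reference scores 0.0.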
The results are promising: the mBART25 model significantly outperformed the Random model. One interesting observation was that fine-tuning datasets exceeding 25M parallel sentences actually hurt performance; the researchers suspect that at that scale, supervised training "washes out" the benefits of pretraining.
Besides direct BLEU tests, the researchers also assessed the model's translation ability via back-translation, translating the target language back into the source language and computing a BLEU score on the result. These results are also informative, as the BLEU scores show improved sentence translation quality under back-translation as well.
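The back-translation evaluation amounts to a round-trip loop: translate each source sentence to the target language, translate it back, and score the reconstruction against the original. A sketch of that loop, where the translator callables and the scoring function are hypothetical placeholders for the actual forward and backward translation systems:

```python
def back_translation_score(source_sents, translate_fwd, translate_back, score_fn):
    """Average round-trip score over a list of source sentences.

    translate_fwd / translate_back: placeholder callables for the
    source->target and target->source systems.
    score_fn(hypothesis, reference): similarity metric, e.g. a BLEU variant.
    """
    scores = []
    for src in source_sents:
        tgt = translate_fwd(src)             # source -> target language
        reconstructed = translate_back(tgt)  # target -> source language
        scores.append(score_fn(reconstructed, src))
    return sum(scores) / len(scores)
```

With toy word-for-word dictionary translators, a consistent round trip recovers the source exactly and scores 1.0.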
The new mBART model offers several advantages over existing models. In the pretraining step, mBART is trained on all the included languages at once, producing a single set of parameters that can be fine-tuned for any future language pair, in supervised or unsupervised settings alike. Pretraining also reduces the cost of subsequent training and fine-tuning, although the pretraining step itself is expensive.
In future work, the researchers plan to expand the language pool and conduct larger-scale pretraining by incorporating datasets from more languages.
The paper Multilingual Denoising Pre-Training for Neural Machine Translation is on arXiv.
Author: Linyang Yu | Editor: Michael Sarazen