Contemporary pretrained multilingual language models (LMs) aim to represent more than 100 languages in a single model. However, despite their state-of-the-art results in cross-lingual transfer, such multilingual models are often incapable of equitably representing their diverse set of languages due to limited capacity, skewed pretraining data and suboptimal vocabularies.
Although language-specific models trained on large custom vocabularies can avoid these issues, they lack the strong cross-lingual transfer abilities of multilingual LMs.
In a bid to encompass the “best of both worlds,” a team from Google Research has proposed MergeDistill, a framework for merging multiple pretrained monolingual and multilingual teacher LMs into a single task-agnostic multilingual student LM. The approach is designed to leverage the capabilities of powerful language-specific LMs while remaining multilingual and enabling positive cross-lingual transfer.
To achieve their goal, the team uses knowledge distillation (KD), a technique proposed by Hinton et al. in 2015. In most LM applications, KD is used to compress a large teacher model into a smaller, single-task student model. But KD can also be employed with pretraining objectives such as masked language modelling (MLM) to obtain a task-agnostic student model.
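At the core of KD is a simple training signal: the student is trained to match the teacher’s full output distribution (“soft labels”) rather than only hard gold labels. The following is a minimal sketch in plain Python of the temperature-scaled distillation loss from Hinton et al. (2015); the toy logits and variable names are illustrative, not taken from the MergeDistill paper.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature yields a softer distribution.
    z = [x / temperature for x in logits]
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's soft labels and the student's
    # predicted distribution, as in Hinton et al. (2015).
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Toy logits over a 3-token vocabulary (illustrative values only).
teacher = [4.0, 1.0, 0.5]
matched = distillation_loss([4.0, 1.0, 0.5], teacher)     # student mimics teacher
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher)  # student disagrees
```

The loss is minimized when the student reproduces the teacher’s distribution, so `matched` comes out lower than `mismatched`.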
In the paper MergeDistill: Merging Pre-trained Language Models using Distillation, the researchers focus on merging multiple pretrained LMs into a single multilingual student LM in the task-agnostic setting. The team says this is the first study of its kind, and summarizes their contributions as:
- MergeDistill is a task-agnostic distillation approach that merges multiple teacher LMs at the pretraining stage to train a strong multilingual student LM, which can then be finetuned for any task on all of the student LM’s languages. The approach is more maintainable (fewer models), compute-efficient and teacher-architecture-agnostic (since predictions are obtained offline).
- MergeDistill is used to i) combine monolingual teacher LMs into a single multilingual student LM that is competitive with or outperforms individual teachers, ii) combine multilingual teacher LMs, such that the overlapping languages can learn from multiple teachers.
- Through extensive experiments and analysis, we study the importance of typological similarity in building multilingual models, and the impact of strong teacher LM vocabularies and predictions in our framework.
The inputs to the proposed MergeDistill framework are a set of pretrained teacher LMs and pretraining transfer corpora for all the languages the student LM will cover. In this work, the set of teachers comprises four LMs: three monolingual LMs trained on English, Spanish and Korean respectively, and one multilingual LM trained on English and Hindi.
The first step in training the student LM from multiple teacher LMs is to tokenize and mask the pretraining transfer corpus for each language using the respective teacher LM’s tokenizer. The method then obtains predictions and logits for each masked, tokenized example by evaluating the corresponding teacher LM. The next step is vocabulary mapping, in which the input indices, prediction indices and gold-label indices obtained from each teacher LM are remapped using a teacher-to-student vocabulary map. Finally, with the remapped indices in hand, the researchers train the multilingual student LM with the masked language modelling (MLM) objective, using teacher predictions as soft labels and minimizing the cross-entropy between the student and teacher distributions.
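The vocabulary-mapping step is necessary because each teacher tokenizer assigns its own indices to tokens, so teacher outputs must be re-indexed into the shared student vocabulary before distillation. The sketch below illustrates the idea; the helper name, toy vocabularies and example token sequence are our own assumptions for illustration, not details from the paper.

```python
def build_vocab_map(teacher_vocab, student_vocab):
    # token string -> index lookup for the student vocabulary
    student_index = {tok: i for i, tok in enumerate(student_vocab)}
    # teacher index -> student index, for tokens present in the student vocab
    return {i: student_index[tok] for i, tok in enumerate(teacher_vocab)
            if tok in student_index}

# Toy vocabularies: the student vocabulary covers the teacher's tokens
# (e.g. built from the union of all teacher vocabularies).
teacher_vocab = ["[MASK]", "the", "cat", "sat"]
student_vocab = ["[PAD]", "[MASK]", "the", "cat", "sat", "perro"]

vocab_map = build_vocab_map(teacher_vocab, student_vocab)

teacher_token_ids = [1, 2, 0]  # "the cat [MASK]" in teacher indices
student_token_ids = [vocab_map[i] for i in teacher_token_ids]
# The same remapping is applied to the teacher's prediction indices and
# gold-label indices before the MLM distillation step.
```

Because the map is precomputed, the student never needs access to the teacher models at training time, only to their offline, remapped predictions.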
The team conducted extensive experiments on Wikipedia text data to evaluate the effectiveness of the proposed MergeDistill approach. They reported F1 scores for structured prediction tasks (NER, POS), accuracy (Acc) scores for sentence classification tasks (XNLI, PAWS-X), and F1/Exact Match (F1/EM) scores for question answering tasks (XQuAD, MLQA, TyDiQA).
In their monolingual teacher LMs experiment, the team used preexisting monolingual teacher LMs to train the student LM. In every language, the resulting student LM was either competitive with or outperformed its respective teacher, validating MergeDistill’s ability to effectively train a multilingual student LM from monolingual teachers.
In the multilingual teacher LM setting, the team distilled from the multilingual models mBERT and MuRIL (Multilingual Representations for Indian Languages, 2020) and evaluated the student LM on the XTREME benchmark. On non-MuRIL languages, the student LM beat the mBERT teacher by an average relative score of 3.8 percent. On the MuRIL languages, the student LM beat the mBERT teacher by 8.8 percent but underperformed the MuRIL teacher by 3.8 percent.
Overall, the study demonstrates the effectiveness and the potential of the proposed MergeDistill approach in bridging the gap between the ever-expanding universe of strong language-specific models and the proven cross-lingual performance of massively multilingual LMs.
The paper MergeDistill: Merging Pre-trained Language Models using Distillation is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang