Introduced by a Google team led by Jacob Devlin in 2018, the powerful Bidirectional Encoder Representations from Transformers (BERT) language model has enabled many breakthroughs in the field of natural language processing (NLP). Google, which built its brand on industry-leading Search performance, says BERT has even dramatically improved its understanding of search queries.
Google has also released a multilingual language model, mBERT, which is trained on a corpus of 104 languages and can be leveraged as a universal language model. While the NLP research community has seen impressive performance from BERT models trained on a particular language, there hasn’t been a clear comparison between mBERT and these language-specific BERT models. To evaluate the advantage of each model from the perspectives of architecture, tasks, and domain, a team of researchers from Bocconi University has prepared an online overview of the commonalities and differences between language-specific BERT models and mBERT.
Currently, approximately 5,000 GitHub repositories mention “BERT”. For researchers, deciding which language-specific model best suits their needs is a choice that can affect an entire research project. Models trained on a particular language and tested on specific data domains and tasks commonly draw their training data from sources such as Wikipedia, news, legislative and administrative texts, translated movie subtitles, etc. Common NLP tasks include Named Entity Recognition, Natural Language Inference, Paraphrase Identification, etc. To make sense of the different models and tasks and their interrelationships, the Bocconi University research team launched the “BertLang” website.
The team says identifying whether mBERT or a language-specific model performs best at a specific task is important for NLP progress and can also affect the use of computational resources. They tested 30 pretrained language-specific BERT models on 29 tasks in 18 languages, collecting 177 performance results.
Language-specific BERT models scored higher than mBERT on all 29 tasks. Cross-checking the average performance of different language-specific BERT models on various tasks provided additional insights. For example, the researchers observed that specialized models for low-resource languages such as Mongolian showed the largest improvement over mBERT. The paper suggests this is because the developers of language-specific BERT models are likely to be experts at sourcing and using appropriate training resources beyond Wikipedia.
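The kind of cross-checking described above can be sketched as a simple aggregation over (language, model, task) score triples: for each language, average the gap between the language-specific model and the mBERT baseline, then rank languages by that gap. The model names and scores below are invented placeholders for illustration, not numbers from the paper.

```python
from collections import defaultdict

# (language, model, task) -> score; "mBERT" rows serve as the baseline.
# All values here are hypothetical, not the paper's reported results.
results = {
    ("Mongolian", "MongolianBERT", "NER"): 0.87,
    ("Mongolian", "mBERT", "NER"): 0.72,
    ("Italian", "ItalianBERT", "NER"): 0.90,
    ("Italian", "mBERT", "NER"): 0.88,
    ("Italian", "ItalianBERT", "Sentiment"): 0.85,
    ("Italian", "mBERT", "Sentiment"): 0.82,
}

def avg_improvement(results):
    """Average (language-specific score - mBERT score) per language."""
    deltas = defaultdict(list)
    for (lang, model, task), score in results.items():
        if model == "mBERT":
            continue  # skip baseline rows themselves
        baseline = results.get((lang, "mBERT", task))
        if baseline is not None:
            deltas[lang].append(score - baseline)
    return {lang: sum(d) / len(d) for lang, d in deltas.items()}

improvements = avg_improvement(results)
# Rank languages by how much their dedicated model gains over mBERT;
# low-resource languages would surface at the top of such a ranking.
ranking = sorted(improvements, key=improvements.get, reverse=True)
print(ranking)
```

Under these toy numbers, Mongolian ranks first because its dedicated model's average gain over mBERT (0.15) far exceeds Italian's (0.025), mirroring the low-resource pattern the paper reports.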
In the future, the team plans to add independent verification of reported results and direct comparisons of language-specific BERT models across domains and tasks.
The paper What the [MASK]? Making Sense of Language-Specific BERT Models is on arXiv, and the BertLang website is here.
Journalist: Fangyu Cai | Editor: Michael Sarazen