The first language of around 24 million people and a second language for nearly 5 million, Dutch is the third most widely spoken Germanic language, after English and German. A group of researchers from Belgium’s Katholieke Universiteit Leuven and the Technische Universität Berlin recently introduced a Dutch RoBERTa-based language model, RobBERT.
First introduced in 2018, Google’s BERT (Bidirectional Encoder Representations from Transformers) is a powerful and popular language representation model designed to pre-train deep bidirectional representations from unlabeled text. Studies show that BERT models trained on a single language notably outperform the multilingual version.
Previous efforts to train a Dutch-language BERT relied on earlier implementations of BERT. The new research instead uses RoBERTa, a robustly optimized version of BERT introduced in July 2019 by researchers from Facebook AI and the University of Washington. RobBERT was pre-trained on 39 GB of text, about 6.6 billion words, from the Dutch section of the OSCAR corpus.
Researchers evaluated RobBERT in different settings on multiple downstream tasks, including sentiment analysis on the Dutch Book Reviews Dataset (DBRD) and a task specific to the Dutch language: choosing between “die” and “dat” (both of which can translate to “that”) on the Europarl utterances corpus. The results show that RobBERT outperforms existing Dutch BERT-based models such as BERTje in sentiment analysis and achieves state-of-the-art results on the “die/dat” disambiguation task.
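One way to frame the “die/dat” task is as a masked-word problem: hide the pronoun and ask a model to recover it. The following is a minimal sketch of how such examples could be built in plain Python. The helper name `make_die_dat_examples` and the `<mask>` placeholder are illustrative assumptions, not the authors' preprocessing code.

```python
import re

# Assumed placeholder string; RoBERTa-style tokenizers typically use "<mask>".
MASK = "<mask>"

# Match "die" or "dat" as whole words, regardless of case.
PATTERN = re.compile(r"\b(die|dat)\b", re.IGNORECASE)


def make_die_dat_examples(sentence):
    """Return one (masked_sentence, label) pair per "die"/"dat" occurrence.

    Only one occurrence is masked per example; the others stay visible.
    """
    examples = []
    for match in PATTERN.finditer(sentence):
        masked = sentence[:match.start()] + MASK + sentence[match.end():]
        examples.append((masked, match.group(1).lower()))
    return examples
```

For instance, `make_die_dat_examples("Het boek dat ik lees is goed.")` yields a single pair: the sentence with “dat” replaced by the mask, and the label `"dat"`.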
The paper identifies possible improvements and future directions for this research, such as training similar models, changing the training data format, adopting alternative pre-training tasks such as sentence order prediction, and applying RobBERT to additional Dutch language tasks.
The pretrained RobBERT models can be used with Hugging Face’s transformers and Facebook’s Fairseq toolkit. The RobBERT logo, incidentally, derives from the fact that the word “rob” also means “seal” in Dutch.
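As a rough illustration of the Hugging Face route, the sketch below loads a fill-mask pipeline and picks between “die” and “dat” for a masked sentence. The model identifier `pdelobelle/robbert-v2-dutch-base` is an assumption and may differ from the checkpoint described in the paper; the helper names are hypothetical. Running `demo()` requires the `transformers` package and a network connection to download the weights.

```python
def load_fill_mask(model_id="pdelobelle/robbert-v2-dutch-base"):
    """Build a fill-mask pipeline. The default model_id is an assumption."""
    # Imported lazily so the helpers below can be used without transformers.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_id)


def best_candidate(predictions, candidates=("die", "dat")):
    """Return the highest-scoring candidate word in fill-mask output.

    `predictions` is a list of dicts with "token_str" and "score" keys,
    as returned by the pipeline for a sentence with one mask.
    Returns None if no candidate appears among the predictions.
    """
    scored = [
        (p["score"], p["token_str"].strip())
        for p in predictions
        if p["token_str"].strip().lower() in candidates
    ]
    return max(scored)[1] if scored else None


def demo():
    """Download the model and disambiguate one example sentence."""
    fill = load_fill_mask()
    text = f"Het boek {fill.tokenizer.mask_token} ik lees is goed."
    # By default the pipeline returns only its top few predictions,
    # so the result can be None if neither word is among them.
    print(best_candidate(fill(text)))
```

Calling `demo()` prints whichever of the two words the model ranks higher for the masked position, if either appears in the returned predictions.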
The paper RobBERT: a Dutch RoBERTa-based Language Model is on arXiv. The model and code are available on GitHub.
Author: Yuqing Li | Editor: Michael Sarazen