The new Google AI paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding is receiving accolades from across the machine learning community. Google researchers present a deep bidirectional Transformer model that redefines the state of the art for 11 natural language processing tasks, even surpassing human performance in the challenging area of question answering. Some highlights from the paper:
- NLP researchers are exploiting today’s large amount of available language data and maturing transfer learning techniques to develop novel pre-training approaches. They first train a model architecture on one language modeling objective, and then fine-tune it for a supervised downstream task. Aylien Research Scientist Sebastian Ruder suggests in his blog that pre-trained models may have “the same wide-ranging impact on NLP as pretrained ImageNet models had on computer vision.”
- The BERT model’s architecture is a bidirectional Transformer encoder. The use of a Transformer comes as no surprise — this is a recent trend due to Transformers’ training efficiency and superior performance in capturing long-distance dependencies compared to recurrent neural network architectures. The bidirectional encoder, meanwhile, is a standout feature that differentiates BERT from OpenAI GPT (a left-to-right Transformer) and ELMo (a concatenation of independently trained left-to-right and right-to-left LSTMs).
- BERT is a huge model, with 24 Transformer blocks, a hidden size of 1024, 16 self-attention heads, and 340M parameters.
- The model is pre-trained on 40 epochs over a 3.3 billion word corpus, including BooksCorpus (800 million words) and English Wikipedia (2.5 billion words).
- The model was trained on 16 Cloud TPUs.
- In the pre-training process, the researchers randomly masked a percentage of the input tokens (15 percent) and trained the model to predict them, yielding a deep bidirectional representation. They refer to this method as a Masked Language Model (MLM).
- A standard language model cannot capture relationships between sentences, which is vital to language tasks such as question answering and natural language inference. The researchers therefore also pre-trained the model on a binarized next-sentence prediction task, whose examples can be trivially generated from any monolingual corpus.
- Fine-tuned on different datasets, the model raises the GLUE benchmark to 80.4 percent (7.6 percent absolute improvement), MultiNLI accuracy to 86.7 percent (5.6 percent absolute improvement), and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5 absolute improvement), setting new state-of-the-art results across a total of 11 language tasks.
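The masking procedure described above can be sketched in a few lines of Python. This is an illustrative toy implementation, not the authors' code: the paper states that of the 15 percent of tokens selected for prediction, 80 percent are replaced with a [MASK] token, 10 percent with a random token, and 10 percent left unchanged. The token list and vocabulary here are invented for the example.

```python
import random

MASK = "[MASK]"
# Toy vocabulary for the 10% random-replacement case (illustrative only).
VOCAB = ["the", "cat", "dog", "sat", "on", "mat"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets), where targets lists the
    (position, original_token) pairs the model must predict."""
    rng = random.Random(seed)
    out, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:      # select ~15% of positions
            targets.append((i, tok))
            r = rng.random()
            if r < 0.8:                   # 80%: replace with [MASK]
                out[i] = MASK
            elif r < 0.9:                 # 10%: replace with a random token
                out[i] = rng.choice(VOCAB)
            # remaining 10%: keep the original token unchanged
    return out, targets
```

Because selected positions are sometimes left unchanged or replaced with a random word, the model cannot know which tokens it will be asked to predict, forcing it to maintain a contextual representation of every input token.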
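The next-sentence prediction data described above is likewise easy to generate from any ordered corpus. A minimal sketch, assuming the 50/50 split the paper uses: half the time sentence B is the true next sentence (IsNext), half the time it is a random sentence from the corpus (NotNext).

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sent_a, sent_b, is_next) training examples from an
    ordered list of sentences, with a roughly 50/50 label split."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the actual next sentence.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: a random sentence from the corpus
            # (a real pipeline would also avoid picking the true next one).
            j = rng.randrange(len(sentences))
            pairs.append((sentences[i], sentences[j], False))
    return pairs
```

Training a binary classifier on such pairs is what lets the model learn inter-sentence coherence, which plain left-to-right language modeling never observes.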
The paper’s first author is Jacob Devlin, a Google senior research scientist whose primary research interest is developing deep learning models for natural language tasks. He previously led Microsoft Translator’s transition from phrase-based translation to neural machine translation (NMT) as a Principal Research Scientist at Microsoft Research from 2014 to 2017.
Google Brain Research Scientist Thang Luong enthusiastically tweeted “a new era of NLP has just begun a few days ago: large pre-training models (Transformer 24 layers, 1024 dim, 16 heads) + massive compute is all you need.”
Baoxun Wang, Chief Scientist of Chinese AI startup Tricorn, also praised the Google paper as “a milestone” in his keynote address at this week’s Artificial Intelligence Industry Alliance conference in Suzhou, China. The paper leverages massive amounts of data and compute along with well-polished engineering, representing what Wang calls “Google’s tradition of violent aesthetics.”
The pre-trained model and code will be released in the next two weeks. The paper is on arXiv.
Journalist: Tony Peng | Editor: Michael Sarazen