Researchers from the Huazhong University of Science and Technology and Huawei Noah’s Ark Lab have introduced TinyBERT, a smaller and faster version of Google’s popular large-scale pretrained language model BERT (Bidirectional Encoder Representations from Transformers). The new research addresses the pain points of conventional NLP models and the limitations of large-scale pretraining.
Scientists working in natural language processing (NLP) are very familiar with pretrained language models, which have significantly improved performance on many NLP tasks. Compared with a language model initialized from scratch and trained on a limited dataset, models pretrained on large-scale data usually achieve higher accuracy because they have already learned linguistic patterns that can be transferred to other tasks.
Although these pretrained models achieve higher accuracy, they usually have deep layered structures and a huge number of parameters, which makes them computationally expensive and very difficult to deploy on resource-restricted devices. The most popular pretrained NLP model, Google’s BERT, also suffers from these limitations.
To compress model size while maintaining accuracy, NLP researchers have developed knowledge distillation (KD), a teacher-student framework that transfers the linguistic knowledge learned by a large teacher network to a smaller student network trained to mimic the teacher’s behaviour. KD has been widely explored for Transformer-based NLP models in general, but very few studies have applied it to BERT. This is because BERT’s pretraining-then-fine-tuning paradigm, in which the model is trained on a large-scale unlabeled text corpus before being fine-tuned on a task-specific dataset, greatly increases the difficulty of distillation.
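The teacher-student idea can be illustrated with a small Python sketch. It shows the classic soft-target form of distillation, in which the student is penalized by the cross-entropy between the teacher's and its own temperature-softened output distributions. This is a generic illustration rather than TinyBERT's method; the function names, the temperature value and the toy logits are all assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and student's softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# Toy logits (assumed values): one student agrees with the teacher, one does not.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = soft_target_loss(close_student, teacher)
loss_far = soft_target_loss(far_student, teacher)
print(loss_close < loss_far)
```

A student whose outputs track the teacher's receives the lower loss, which is the signal used to train the smaller network.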
To further develop the KD concept, the paper’s authors propose a novel distillation method designed specifically for BERT models. TinyBERT is built on two key ideas: Transformer distillation, a new method developed by the researchers, and a two-stage learning framework that comprises a general distillation stage and a task-specific distillation stage.
Transformer distillation is designed to efficiently distill the linguistic knowledge embedded in the teacher BERT. A separate loss function is designed for each type of BERT layer to fit that layer’s own representation. These losses include one that fits the teacher’s attention weights, which helps make the transfer of knowledge from teacher to student more efficient.
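A rough Python sketch of what such a per-layer loss could look like for one pair of teacher and student Transformer layers follows. It assumes mean squared error on the attention matrices plus mean squared error on the hidden states, with a projection matrix mapping the narrower student to the teacher's width. The hidden-state term, the projection, and every name and toy value here are assumptions for illustration, not the authors' implementation.

```python
def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def matmul(a, b):
    """Plain matrix product of two 2-D lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_loss(student_attn, teacher_attn, student_hidden, teacher_hidden, projection):
    """Attention loss plus hidden-state loss for one student/teacher layer pair."""
    attn_loss = mse(student_attn, teacher_attn)
    # The student is narrower than the teacher, so its hidden states are
    # projected to the teacher's width before they are compared.
    hidden_loss = mse(matmul(student_hidden, projection), teacher_hidden)
    return attn_loss + hidden_loss

# Toy example (assumed values): 2 tokens, student width 2, teacher width 3.
teacher_attn = [[0.7, 0.3], [0.4, 0.6]]
student_attn = [[0.6, 0.4], [0.5, 0.5]]
student_hidden = [[1.0, 0.0], [0.0, 1.0]]
projection = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]      # hypothetical 2x3 projection
teacher_hidden = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

loss = layer_loss(student_attn, teacher_attn, student_hidden, teacher_hidden, projection)
print(loss)
```

In this toy case the projected hidden states match the teacher exactly, so the whole loss comes from the mismatch in attention weights.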
A key innovation of the two-stage learning framework is the addition of a general TinyBERT as an intermediate step between the large-scale text corpus and the downstream fine-tuned TinyBERT. First, general Transformer distillation is performed on the large-scale text corpus to produce the general TinyBERT, which can then be fine-tuned for downstream tasks. Second, by incorporating task-specific data, the knowledge in the general TinyBERT is further transferred to a task-specific, fine-tuned TinyBERT.
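The two-stage flow can be mimicked with a deliberately tiny Python sketch in which the student has a single weight and each teacher is a simple function. The point is only the ordering: the task-specific stage starts from the student produced by the general stage rather than from scratch. The `distill` helper, the stand-in teachers (including the assumption that the second stage uses a separately fine-tuned teacher) and the data values are all hypothetical.

```python
def distill(weight, teacher, inputs, lr=0.05, epochs=200):
    """Fit a one-parameter student y = weight * x to the teacher's outputs
    by gradient descent on the mean squared error."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - teacher(x)) * x for x in inputs) / len(inputs)
        weight -= lr * grad
    return weight

def general_teacher(x):
    return 2.0 * x   # stand-in for the pretrained teacher

def finetuned_teacher(x):
    return 2.5 * x   # stand-in for a task-specific, fine-tuned teacher

general_corpus = [0.5, 1.0, 1.5, 2.0]   # stand-in for the large text corpus
task_data = [1.0, 2.0]                  # stand-in for task-specific data

# Stage 1: general distillation, starting from an untrained student.
general_student = distill(0.0, general_teacher, general_corpus)
# Stage 2: task-specific distillation, initialized from the stage 1 student.
task_student = distill(general_student, finetuned_teacher, task_data)
print(general_student, task_student)
```

The stage 1 result plays the role of the general TinyBERT, and the stage 2 result plays the role of the fine-tuned TinyBERT.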
By combining Transformer distillation with the two-stage learning framework, the researchers achieved accuracy similar to that of the full-size BERT model with a model seven times smaller and nine times faster. TinyBERT also shows higher accuracy at lower computational cost than Patient Knowledge Distillation for BERT (BERT-PKD), a state-of-the-art baseline method for BERT distillation.
The paper TinyBERT: Distilling BERT for Natural Language Understanding is on arXiv.
Author: Linyang Yu | Editor: Michael Sarazen