Powerful large-scale pretrained language models such as Google’s BERT have been a game-changer in natural language processing (NLP) and beyond. These achievements, however, have come with huge computational and memory demands, which have made it difficult to deploy such models on resource-restricted devices.
Previous studies have proposed task-agnostic BERT distillation to tackle this issue. The approach aims to obtain a general small BERT model that can be fine-tuned directly, just like its teacher model (such as BERT-Base). But even task-agnostic BERT distillation is computationally expensive, due to the large-scale corpora involved and the need to perform both a forward pass through the teacher model and a forward-backward pass through the student model.
In the paper Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation, a research team from Huawei Noah’s Ark Lab and Tsinghua University proposes Extract Then Distill (ETD), a generic and flexible strategy that reuses teacher model parameters for efficient and effective task-agnostic distillation that can be applied to student models of any size.
The researchers summarize their contributions as:
- Propose ETD, an effective method that improves the efficiency of task-agnostic BERT distillation by reusing the teacher's parameters to initialize the student model.
- Show that ETD is flexible and applicable to student models of any size.
- Demonstrate the effectiveness of ETD on the GLUE benchmark and SQuAD.
- Demonstrate that ETD is general and can be applied to existing state-of-the-art distillation methods, such as TinyBERT and MiniLM, to further boost their performance.
- Validate that the extraction process of ETD is efficient and adds almost no extra computation.
The proposed ETD strategy comprises three steps: width-wise extraction, uniform layer selection and transformer distillation.
Width-wise extraction copies a subset of the teacher's parameters into a thin teacher model. The process covers FFN (feed-forward network) neurons, head neurons and hidden neurons. The extraction of hidden neurons follows the hidden consistency principle, which ensures that the hidden neurons extracted from different modules share the same position indexes. The researchers propose two approaches for choosing which parameters to extract: ETD-Rand selects them randomly, while ETD-Impt selects them according to importance scores. This step results in a thin teacher model with the same width as the student.
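The width-wise step can be illustrated with a minimal sketch on a single toy FFN block. Everything here is an assumption made for illustration: the helper names `select_indices` and `take`, the matrix sizes, and the importance scores are invented, and the article does not say how ETD-Impt computes its scores. The point is only to show the hidden consistency principle, where one set of hidden indexes is reused for every matrix that touches the hidden dimension.

```python
import random

def select_indices(total, keep, scores=None, seed=0):
    """Pick `keep` of `total` neuron positions, returned in ascending order."""
    if scores is None:
        # ETD-Rand style: choose positions at random (seeded for repeatability).
        chosen = random.Random(seed).sample(range(total), keep)
    else:
        # ETD-Impt style: keep the positions with the highest scores.
        chosen = sorted(range(total), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(chosen)

def take(weight, rows, cols):
    """Return the sub-matrix of `weight` at the given row and column indexes."""
    return [[weight[r][c] for c in cols] for r in rows]

# Toy teacher FFN with made-up sizes: hidden width 6, FFN width 8.
HIDDEN_T, FFN_T = 6, 8
w_in = [[r * 10 + c for c in range(HIDDEN_T)] for r in range(FFN_T)]   # FFN_T x HIDDEN_T
w_out = [[r * 10 + c for c in range(FFN_T)] for r in range(HIDDEN_T)]  # HIDDEN_T x FFN_T

# Hypothetical importance scores for the hidden neurons (placeholder values).
hidden_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
hidden_idx = select_indices(HIDDEN_T, 3, scores=hidden_scores)  # importance-based
ffn_idx = select_indices(FFN_T, 4)                              # random

# Hidden consistency: the same hidden_idx slices both matrices.
thin_w_in = take(w_in, ffn_idx, hidden_idx)    # 4 x 3
thin_w_out = take(w_out, hidden_idx, ffn_idx)  # 3 x 4
```

In a full model the same `hidden_idx` would presumably also be applied to the embeddings, attention projections and layer-norm parameters, so that every module agrees on which hidden positions survive.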
After width-wise extraction, the researchers are able to adopt the strategy of uniform layer selection for depth-wise extraction. Specifically, given a thin teacher that has N transformer layers and a student with M transformer layers, they apply a uniform strategy to choose M layers from the thin teacher model to initialize the student model.
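As a rough illustration of uniform layer selection, the sketch below picks M evenly spaced layers out of N. The article does not state the exact mapping, so the choice made here (taking the last layer of each of M equal-sized blocks, with 0-based indexes) is an assumption, and `uniform_layer_selection` is a hypothetical helper name.

```python
def uniform_layer_selection(n_teacher, m_student):
    """Return 0-based indexes of m_student evenly spaced layers out of n_teacher.

    Assumed mapping: split the teacher into m_student blocks and keep the
    last layer of each block.
    """
    if not 0 < m_student <= n_teacher:
        raise ValueError("need 0 < m_student <= n_teacher")
    return [((k + 1) * n_teacher) // m_student - 1 for k in range(m_student)]

# A 12-layer thin teacher initializing a 6-layer student.
layers_for_6 = uniform_layer_selection(12, 6)  # [1, 3, 5, 7, 9, 11]
# The same teacher initializing a 4-layer student.
layers_for_4 = uniform_layer_selection(12, 4)  # [2, 5, 8, 11]
```

Under this mapping the teacher's final layer is always selected, which fits naturally with a distillation objective defined on the last layer.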
Finally, they initialize the student with the extracted parameters and adopt the last-layer distillation strategy in ETD.
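The article does not spell out the loss used for last-layer distillation. As one plausible reading, the sketch below compares only the final layer's hidden states of student and teacher with a mean squared error, after projecting the student's narrower vectors to the teacher's width. The MSE objective, the projection matrix `proj`, and the function names are all assumptions for illustration, not the paper's confirmed formulation.

```python
def project(hidden, proj):
    """Map each student vector (width d_s) to teacher width d_t using proj (d_s x d_t)."""
    d_t = len(proj[0])
    return [[sum(h[i] * proj[i][j] for i in range(len(h))) for j in range(d_t)]
            for h in hidden]

def last_layer_mse(student_layers, teacher_layers, proj):
    """MSE between the last-layer hidden states only; earlier layers are ignored."""
    s_last = project(student_layers[-1], proj)
    t_last = teacher_layers[-1]
    diffs = [(s - t) ** 2
             for s_row, t_row in zip(s_last, t_last)
             for s, t in zip(s_row, t_row)]
    return sum(diffs) / len(diffs)

# Toy example: one token, student width 2, teacher width 3 (made-up values).
student_layers = [[[0.0, 0.0]], [[1.0, 2.0]]]            # 2 layers
teacher_layers = [[[9.0, 9.0, 9.0]], [[1.0, 2.0, 5.0]]]  # 2 layers
proj = [[1.0, 0.0, 1.0],
        [0.0, 1.0, 1.0]]  # hypothetical projection, learned in practice

loss = last_layer_mse(student_layers, teacher_layers, proj)
```

Because only the last layer enters the loss, the student's depth does not need a layer-by-layer mapping to the teacher during training, which is part of what makes this strategy convenient for students of different sizes.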
The team used English Wikipedia and the Toronto Book Corpus as their distillation datasets. They employed two baseline models as teachers to test ETD’s distillation performance: BERT-Base (a 12-layer transformer with a hidden size of 768) and DistilBERT (a 6-layer model whose width remains the same as the teacher model). They also applied ETD-Impt to popular distillation methods such as TinyBERT and MiniLM to evaluate ETD’s generality.
In the experiments, the ETD-Rand and ETD-Impt strategies achieved results comparable to Rand-Init while using only 43 percent and 28 percent of the computation cost, respectively. Compared to TinyBERT and MiniLM, ETD-Impt achieved similar performance at a compute cost even lower than 28 percent, validating both the efficiency and the generality of the proposed ETD approach.
The paper Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation is on arXiv.
Author: Hecate He | Editor: Michael Sarazen