Pretrained transformer models have grown dramatically in recent years and now reach hundreds of billions of parameters. Although these behemoths are achieving unprecedented performance on natural language processing (NLP) tasks, their ever-expanding size has limited their real-world deployment on resource-constrained edge or embedded devices.
In the new paper Extreme Compression for Pre-trained Transformers Made Simple and Efficient, a Microsoft research team proposes XTC, a simple yet effective extreme compression pipeline for pre-trained transformers. XTC can skip the compute-heavy pretraining knowledge distillation (KD) process to obtain a 5-layer BERT model with better performance than previous state-of-the-art distillation methods, and its extreme quantization and layer reduction can cut model sizes by 50x.
The team summarizes their main contributions as:
- We present a systematic study of extreme quantization methods by fine-tuning ≥1000 pre-trained transformer models, which includes a careful evaluation of the effects of hyperparameters and several methods introduced in extreme quantization.
- We find that previous extreme quantization studies overlooked certain design choices, which lead to under-trained binarized networks and unnecessarily complex optimizations. Instead, we derive a celebrating recipe for extreme quantization, which is not only simpler but also allows us to achieve an even larger compression ratio and higher accuracy than existing methods.
- We find that extreme quantization can be effectively combined with lightweight layer reduction, which allows us to achieve greater compression rates for pretrained transformers with better accuracy than prior methods while enjoying the additional benefits of flexibly adjusting the size of the student model for each use-case individually, without the expensive pretraining distillation.
The proposed XTC pipeline comprises two steps:
1) Lightweight layer reduction. Instead of adopting computationally expensive pretraining distillation, the researchers initialize a layer-reduced model from a subset of the fine-tuned teacher's weights. When combined with the team’s other training strategies, this lightweight approach reduces computational cost and achieves a much higher compression ratio than existing methods.
2) 1-bit quantization with 1S-KD, DA and long training. The team applies quantization-aware 1S-KD (one-step knowledge distillation): an ultra-low-bit (1-bit/2-bit) quantizer compresses the layer-reduced model weights from step 1 in the forward pass, and a straight-through estimator (STE) passes gradients through the quantizer in the backward pass. The team minimizes the single-stage deep KD objective with data augmentation (DA) and longer training, such that the training loss is close to zero.
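The two steps can be illustrated with a minimal NumPy sketch. This is a toy under stated assumptions, not the paper's implementation: the choice of which teacher layers to keep, the mean-absolute-value scaling in the 1-bit quantizer, the single linear layer standing in for a transformer layer, and the names `select_teacher_layers`, `binarize` and `kd_step` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def select_teacher_layers(teacher_layers, keep_indices):
    """Step 1 (sketch): initialize a shallower student by copying a
    subset of the fine-tuned teacher's layers. Which layers to keep is
    an assumption here, not taken from the paper."""
    return [teacher_layers[i].copy() for i in keep_indices]

def binarize(w):
    """1-bit quantizer (sketch): sign of each weight, scaled by the
    mean absolute value of the tensor (an assumed scaling rule)."""
    alpha = np.abs(w).mean()
    return alpha * np.where(w >= 0, 1.0, -1.0)

def kd_step(w_student, w_teacher, x, lr=0.01):
    """One distillation step on a toy linear layer.
    Forward: the student uses binarized weights.
    Backward: the straight-through estimator treats the quantizer as
    the identity, so the gradient w.r.t. the binarized weights is
    applied directly to the latent full-precision weights."""
    y_teacher = x @ w_teacher
    y_student = x @ binarize(w_student)
    diff = y_student - y_teacher
    loss = float(np.mean(diff ** 2))       # MSE distillation loss
    grad_out = 2.0 * diff / diff.size      # dL/dy_student
    grad_wq = x.T @ grad_out               # dL/d(binarized weights)
    w_new = w_student - lr * grad_wq       # STE: reuse grad for latent weights
    return w_new, loss

rng = np.random.default_rng(0)
# Toy "teacher": 12 layers, each an 8x8 weight matrix.
teacher = [rng.normal(size=(8, 8)) for _ in range(12)]
# Keep every other layer (an arbitrary, illustrative choice).
student = select_teacher_layers(teacher, keep_indices=[1, 3, 5, 7, 9, 11])

x = rng.normal(size=(4, 8))
w_updated, loss = kd_step(student[0], teacher[1], x)
```

In a real setup the student would be a full transformer, the distillation loss would also cover intermediate outputs, and training would run for many steps with augmented data; the sketch only shows where the quantizer and the STE sit in one update.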
In their empirical study, the team applied their compression approach to BERT and evaluated it on the standard General Language Understanding Evaluation (GLUE) benchmark.
The experimental results show that the proposed XTC can compress BERTbase to a 5-layer BERTbase while outperforming previous state-of-the-art distillation methods such as the 6-layer TinyBERT, without incurring the computationally expensive pretraining distillation. The method’s robust extreme quantization can also reduce model size by 50x with better accuracy than prior extreme quantization methods, and it achieves state-of-the-art results on GLUE tasks.
Overall, this work introduces a simple yet effective compression pipeline for extreme compression in pretrained transformers, providing a possible solution for deploying such models on resource-constrained devices.
Author: Hecate He | Editor: Michael Sarazen