Large transformer language models (LMs) that scale up to billions of parameters have demonstrated state-of-the-art performance across a wide variety of natural language processing (NLP) tasks. Real-world deployment of such models, however, remains limited by their slow inference speed and heavy compute demands.
Researchers from Intel Corporation and Intel Labs address this issue in the new paper Fast DistilBERT on CPUs, proposing a pipeline and hardware-aware extreme compression technique for creating and running fast transformer models on CPUs. The approach achieves substantial speedups and state-of-the-art (SOTA) performance in production environments.
The team summarizes their main contributions as follows:
- Propose a hardware-aware extreme compression technique for fast transformer models on CPUs.
- Create an efficient transformer inference runtime for sparse & quantized transformer models.
- Demonstrate new SOTA performance under typical constraints in common production environments.
To apply the proposed model compression technique to transformer-based LMs, the researchers first use specialized sparse general matrix multiplication (GEMM) operators to accelerate sparse transformer models and extend Zafrir et al.'s model compression infrastructure to create sparse pretrained LMs with block-wise structured sparsity. They then apply knowledge distillation while fine-tuning the block-wise sparse pretrained LM on downstream tasks, which bridges the accuracy gap caused by compression. Finally, they apply post-training quantization with automatic accuracy-aware tuning to optimize the model.
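To make two of these steps concrete, here is a minimal pure-Python sketch of block-wise magnitude pruning followed by symmetric int8 post-training quantization. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 1 x 4 block shape, the L1-norm scoring rule and the per-tensor scale are all hypothetical choices made for this example, and the knowledge distillation and accuracy-aware tuning steps are omitted.

```python
def prune_blocks(weights, block_width=4, sparsity=0.5):
    """Zero the lowest-magnitude 1 x block_width blocks of a 2-D weight matrix.

    Assumption: blocks are scored by their L1 norm; other criteria are possible.
    """
    cols = len(weights[0])
    if cols % block_width != 0:
        raise ValueError("column count must be a multiple of block_width")
    # Score every block and remember its position (row, starting column).
    scored = []
    for r, row in enumerate(weights):
        for c in range(0, cols, block_width):
            score = sum(abs(v) for v in row[c:c + block_width])
            scored.append((score, r, c))
    scored.sort()  # lowest-magnitude blocks come first
    n_prune = int(sparsity * len(scored))
    pruned = [list(row) for row in weights]  # leave the input untouched
    for _, r, c in scored[:n_prune]:
        pruned[r][c:c + block_width] = [0.0] * block_width
    return pruned


def quantize_int8(matrix):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [[v * scale for v in row] for row in q]


# Toy weights: each row holds one large-magnitude and one small-magnitude block.
weights = [
    [0.9, -0.8, 0.7, 0.6, 0.01, 0.02, -0.01, 0.0],
    [0.05, 0.0, -0.02, 0.01, -1.2, 1.1, 0.4, -0.3],
]
pruned = prune_blocks(weights, block_width=4, sparsity=0.5)
quantized, scale = quantize_int8(pruned)
```

Because whole blocks are zeroed rather than scattered individual weights, a runtime can skip them with regular memory access patterns. Pruned entries also quantize to exactly zero, so the sparsity pattern survives quantization.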
For software acceleration, the team develops a CPU transformer inference engine that combines an advanced runtime, graph optimization and sparse GEMM operators.
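The idea behind a sparse GEMM operator can be sketched in a few lines of pure Python: store only the nonzero blocks of the weight matrix and skip the rest during multiplication. This is a toy illustration, not the team's engine. The storage layout, the function names and the 1 x `block_width` block shape are assumptions made for this example, and a production kernel would rely on vectorized CPU instructions rather than Python loops.

```python
def to_block_sparse(weights, block_width=4):
    """Keep only the nonzero 1 x block_width blocks of each row.

    Returns, per row, a list of (starting column, block values) pairs.
    """
    sparse_rows = []
    for row in weights:
        kept = []
        for c in range(0, len(row), block_width):
            block = row[c:c + block_width]
            if any(v != 0 for v in block):
                kept.append((c, block))
        sparse_rows.append(kept)
    return sparse_rows


def block_sparse_matmul(sparse_rows, dense):
    """Compute W @ X for a block-sparse W and a dense X (list of rows)."""
    n_out = len(dense[0])
    result = []
    for kept in sparse_rows:
        acc = [0.0] * n_out
        # Only stored blocks contribute; zero blocks are never visited.
        for col_start, block in kept:
            for offset, w in enumerate(block):
                x_row = dense[col_start + offset]
                for j in range(n_out):
                    acc[j] += w * x_row[j]
        result.append(acc)
    return result


# Toy example: each weight row has one nonzero block of width 2.
W = [[1, 2, 0, 0],
     [0, 0, 3, 4]]
X = [[1, 0],
     [0, 1],
     [1, 1],
     [2, 0]]
Y = block_sparse_matmul(to_block_sparse(W, block_width=2), X)
```

In this toy case half of the blocks are zero, so half of the multiply-accumulate work is skipped. The higher the block sparsity, the larger the share of work that can be avoided.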
In their empirical study, the team trained a block-wise sparse pretrained DistilBERT model and applied the proposed approach while fine-tuning it on the question-answering SQuADv1.1 benchmark. Their resulting Fast DistilBERT model surpasses the runtime performance of Neural Magic’s state-of-the-art DeepSparse model by up to 50 percent with only a minimal loss in accuracy and achieves 4.1x better performance than ONNX Runtime.
The novel combination of block-wise structured sparsity, knowledge distillation and quantization that defines this hardware-aware model compression technique enables transformer-based LMs to run efficiently on CPUs. The researchers plan to apply their method to other common transformer models to further test its inference efficiency.
The paper Fast DistilBERT on CPUs has been accepted by the 36th Conference on Neural Information Processing Systems (NeurIPS 2022) and is available on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang