Running Fast Transformers on CPUs: Intel Approach Achieves Significant Speed Ups and SOTA Performance

Large transformer language models (LMs) that scale up to billions of parameters have demonstrated state-of-the-art performance across a wide variety of natural language processing (NLP) tasks. Real-world deployment of such models, however, remains limited by their slow inference and heavy compute demands.

Researchers from Intel Corporation and Intel Labs address this issue in the new paper Fast DistilBERT on CPUs, proposing a pipeline and a hardware-aware extreme compression technique for creating and running fast transformer models on CPUs. The approach delivers substantial speedups over existing CPU inference runtimes and state-of-the-art (SOTA) performance under typical production constraints.

The team summarizes their main contributions as follows:

Propose a hardware-aware extreme compression technique for fast transformer models on CPUs.
Create an efficient transformer inference runtime for sparse & quantized transformer models.
Demonstrate new SOTA performance under typical constraints in common production environments.

To apply the proposed model compression technique to transformer-based LMs, the researchers first adopt specialized sparse GEMM operators to accelerate sparse transformer models and extend Zafrir et al.’s model compression infrastructure to create sparse pretrained LMs with block-wise structured sparsity. They then apply knowledge distillation while fine-tuning the block-wise sparse pretrained LM on downstream tasks, which bridges the accuracy gap caused by compression. Finally, they apply post-training quantization with automatic accuracy-aware tuning to optimize the model.
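The three compression steps described above can be illustrated with a minimal, standard-library Python sketch. This is not the paper's implementation: the function names, the magnitude-based block scoring, the temperature-scaled KL distillation loss, and the symmetric per-tensor int8 scheme are all assumptions chosen for illustration, as are the block shape and sparsity ratio passed in by the caller.

```python
import math


def block_prune(weights, block_rows, block_cols, sparsity):
    """Zero the lowest-magnitude blocks of a 2-D weight matrix (list of lists)."""
    rows, cols = len(weights), len(weights[0])
    if rows % block_rows or cols % block_cols:
        raise ValueError("matrix shape must be divisible by the block shape")
    # Score every block by the sum of the absolute weights it contains.
    scored = []
    for br in range(0, rows, block_rows):
        for bc in range(0, cols, block_cols):
            score = sum(
                abs(weights[r][c])
                for r in range(br, br + block_rows)
                for c in range(bc, bc + block_cols)
            )
            scored.append((score, br, bc))
    scored.sort()  # lowest-scoring blocks come first
    pruned = [row[:] for row in weights]
    for _, br, bc in scored[: int(len(scored) * sparsity)]:
        for r in range(br, br + block_rows):
            for c in range(bc, bc + block_cols):
                pruned[r][c] = 0.0
    return pruned


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return kl * temperature ** 2


def quantize_int8(values):
    """Symmetric per-tensor quantization (assumed scheme): value ~ q * scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale
```

In a real pipeline, pruning would be applied gradually during pretraining and the distillation term would be combined with the task loss during fine-tuning; the sketch only shows what each step computes on a single tensor.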

For software acceleration, the team develops a CPU transformer inference engine that combines an advanced runtime, graph optimization and sparse GEMM operators.
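To show why block-wise sparsity helps at inference time, here is a small Python sketch of a block-sparse matrix multiply that stores and visits only the non-zero blocks. The storage layout and the helper names `to_block_sparse` and `block_sparse_matmul` are hypothetical; the engine described in the paper uses optimized CPU kernels, not Python loops.

```python
def to_block_sparse(weights, block_rows, block_cols):
    """Keep only the non-zero blocks, keyed by their top-left (row, col)."""
    blocks = {}
    for br in range(0, len(weights), block_rows):
        for bc in range(0, len(weights[0]), block_cols):
            block = [row[bc:bc + block_cols] for row in weights[br:br + block_rows]]
            if any(v != 0 for row in block for v in row):
                blocks[(br, bc)] = block
    return blocks


def block_sparse_matmul(blocks, n_rows, dense):
    """Compute W @ X where W is given as non-zero blocks and X is a dense matrix."""
    n_out_cols = len(dense[0])
    out = [[0.0] * n_out_cols for _ in range(n_rows)]
    # Pruned (all-zero) blocks are absent from the dict, so their work is skipped.
    for (br, bc), block in blocks.items():
        for i, block_row in enumerate(block):
            out_row = out[br + i]
            for k, w in enumerate(block_row):
                x_row = dense[bc + k]
                for j in range(n_out_cols):
                    out_row[j] += w * x_row[j]
    return out


# Example: a 4x2 weight matrix whose first 4x1 block has been pruned away.
W = [[0.0, 2.0], [0.0, 3.0], [0.0, 1.0], [0.0, 4.0]]
X = [[1.0, 1.0], [1.0, 2.0]]
sparse_W = to_block_sparse(W, block_rows=4, block_cols=1)
Y = block_sparse_matmul(sparse_W, n_rows=len(W), dense=X)
```

Because whole blocks are either present or absent, the kernel can work on contiguous groups of weights, which is friendlier to CPU vector instructions than skipping individual scattered zeros.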

In their empirical study, the team trained a block-wise sparse pretrained DistilBERT model and applied the proposed approach while fine-tuning it on the question-answering SQuADv1.1 benchmark. Their resulting Fast DistilBERT model surpasses the runtime performance of Neural Magic’s state-of-the-art DeepSparse model by up to 50 percent with only a minimal loss in accuracy and achieves 4.1x better performance than ONNX Runtime.

The novel combination of block-wise structured sparsity, knowledge distillation and quantization that defines this hardware-aware model compression technique enables transformer-based LMs to run efficiently on CPUs. The researchers plan to apply their method to other common transformer models to further test its inference efficiency.

The paper Fast DistilBERT on CPUs has been accepted by the 36th Conference on Neural Information Processing Systems (NeurIPS 2022) and is available on arXiv.

Author: Hecate He | Editor: Michael Sarazen, Chain Zhang

