Intel’s Prune Once for All Compression Method Achieves SOTA Compression-to-Accuracy Results on BERT

Although transformer-based language models have become the de facto standard for a wide range of natural language processing (NLP) tasks, the ever-increasing scale of these models makes them inefficient and difficult to deploy in production environments and on edge devices. To address this issue, some recent studies have introduced compression algorithms designed to increase the implementation efficiency of large transformer-based models.

Following this path, an Intel Labs research team has proposed Prune Once for All (Prune OFA), a training method that leverages weight pruning and model distillation to produce pretrained transformer-based language models with high sparsity ratios. Applied to BERT, the approach achieves state-of-the-art results in terms of compression-to-accuracy ratio.

The team summarizes their work’s main contributions as:

We introduce a new architecture-agnostic method of training sparse pretrained language models.
We demonstrate how to fine-tune these sparse models on downstream tasks to create sparse and quantized models, removing the burden of pruning and tuning for a specific language task.
We publish our compression research library with example scripts to reproduce our work for other architectures, along with our sparse pretrained models presented in this paper.

Weight pruning is a method that forces some of the neural network’s weights to zero, resulting in sparse neural networks that reduce the computation and the memory footprint of the trained model. The team leverages weight pruning along with model distillation to create Prune OFA, a novel training method that generates sparse pretrained language models that can later be fine-tuned to downstream tasks with minimal accuracy loss at high sparsity ratios. The term “Prune Once for All” derives from the method’s ability to fine-tune sparse pretrained models for various language tasks while pruning the pretrained model only once.
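To make the idea of forcing weights to zero concrete, here is a minimal sketch of pruning by weight magnitude, one common criterion. The article does not say which criterion Prune OFA uses, so the magnitude rule, the function names and the toy weights below are illustrative assumptions rather than the team's implementation.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a flat list of weights.

    Hypothetical helper for illustration only.
    sparsity is the fraction of entries to force to zero, in [0, 1].
    Returns (pruned_weights, mask), where mask is 1 for kept entries.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by ascending absolute value; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


def sparsity_ratio(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


# Toy weights, not taken from any real model.
weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.4, -0.03, 0.06, -0.2]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
print(pruned)
print(sparsity_ratio(pruned))
```

The returned mask records which positions survive, which is the "sparsity pattern" that later training stages would need to respect.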

Prune OFA takes a pretrained language model as input and outputs a sparse language model of the same architecture. It comprises two steps: teacher preparation and student pruning. The resulting sparse pretrained model is then used for transfer learning, and it keeps its sparsity pattern throughout fine-tuning.
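One simple way to keep a sparsity pattern fixed during fine-tuning is to apply updates only where a binary mask is 1, so pruned weights stay at zero. The sketch below assumes this masking approach; the article does not describe how the pattern is enforced. It uses a toy quadratic loss, and `masked_sgd_step`, `target` and the learning rate are hypothetical names and values chosen for illustration.

```python
def masked_sgd_step(weights, grads, mask, lr):
    """One gradient step that updates kept weights and leaves pruned ones at zero."""
    return [w - lr * g if m else 0.0 for w, g, m in zip(weights, grads, mask)]


def quadratic_grads(weights, target):
    """Gradient of the toy loss sum((w - t)^2) with respect to each weight."""
    return [2.0 * (w - t) for w, t in zip(weights, target)]


# A sparse starting point: positions 1 and 3 were pruned beforehand.
weights = [0.5, 0.0, -0.25, 0.0]
mask = [1, 0, 1, 0]
# Stand-in for a downstream task; the dense optimum is non-zero everywhere.
target = [1.0, 2.0, -1.0, 3.0]

for _ in range(3):
    grads = quadratic_grads(weights, target)
    weights = masked_sgd_step(weights, grads, mask, lr=0.25)

print(weights)
```

Even though the gradient at the pruned positions is non-zero, those weights never move, so the fine-tuned model has the same zeros as the sparse pretrained one.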

For their evaluations, the team applied Prune OFA to the BERT-Base, BERT-Large and DistilBERT architectures and fine-tuned the pretrained models on the English Wikipedia dataset. They then executed the student pruning step to obtain sparse pretrained models, pruning BERT-Base and DistilBERT to both 85 and 90 percent sparsity ratios, and BERT-Large to a 90 percent sparsity ratio.

Prune OFA achieved SOTA compression-to-accuracy ratios for BERT-Base, BERT-Large and DistilBERT. The team hopes their work can help researchers develop more efficient models, and suggests that future studies could investigate whether large, sparse pretrained models are better at capturing and transferring natural language knowledge than smaller dense models of the same architecture with similar non-zero parameter counts.

The paper Prune Once for All: Sparse Pre-Trained Language Models has been accepted for a poster session at NeurIPS 2021 (December 6-14), and is on arXiv.

Author: Hecate He | Editor: Michael Sarazen
