Recent AI research and development has shown that deep neural networks can achieve impressive performance, but often at the cost of enormous computational burdens. For instance, training OpenAI’s GPT-3, which has 175 billion parameters, requires access to large server clusters with powerful graphics cards, at costs that can soar into the millions of dollars.
Two popular approaches designed to alleviate this issue are neural network pruning and distillation. The former aims to reduce the size of the resulting network while maintaining similar accuracy, and the latter is used to speed up inference. Existing pruning techniques, however, yield only limited efficiency benefits, as it is necessary to reconstruct the original dense shape before running a model on standard hardware. Distillation methods also have their limitations, as they require careful engineering and size selection to ensure that the distilled model is not larger than a pruned model.
To close these gaps, a research team from New York-based natural language processing (NLP) company Hugging Face has introduced a block pruning approach that targets models that are both small and fast. The approach can learn to eliminate entire components of the original model, effectively dropping a large number of attention heads.
In their paper Block Pruning For Faster Transformers, the Hugging Face researchers focus on three recent varieties of large-scale pretrained language model compression methods: distillation, pruning, and structured pruning. Their goal is to produce a set of parameters for transformer models that are both fine-tuned for a specific end-task and smaller — in such a way that inference can be efficiently computed on parallel hardware.
Knowledge distillation is a popular compression technique used to obtain significantly smaller BERT models with competitive performance, while unstructured pruning removes individual model weights. On transformers, this entails selecting the weights to prune based on their magnitude, or by computing an importance score using a first-order method. In contrast, structured pruning removes coherent groups of weights, an approach motivated by recent findings that most attention heads provide redundant information.
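To make the magnitude-based criterion concrete, here is a minimal sketch in plain Python. It is an illustration only, not the code used in the paper: the function name `magnitude_prune` and the `sparsity` parameter are hypothetical, and the weight matrix is a toy example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a 2-D list of weights.

    `sparsity` is the fraction of entries to remove (0.0 to 1.0).
    Hypothetical helper for illustration, not the paper's implementation.
    """
    cols = len(weights[0])
    flat = [w for row in weights for w in row]
    n_prune = int(len(flat) * sparsity)
    # Indices of the n_prune entries with the smallest absolute value.
    order = sorted(range(len(flat)), key=lambda i: abs(flat[i]))
    pruned = set(order[:n_prune])
    kept = [0.0 if i in pruned else w for i, w in enumerate(flat)]
    # Restore the original row layout.
    return [kept[r * cols:(r + 1) * cols] for r in range(len(weights))]


toy_weights = [[0.9, -0.05, 0.3],
               [0.01, -0.7, 0.2]]
print(magnitude_prune(toy_weights, 0.5))
```

Note that the zeroed entries end up scattered across the matrix. The matrix keeps its dense shape, which is why this kind of pruning alone brings limited speed benefits on standard hardware.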
In this work, the team extends movement pruning, a score-based pruning approach that learns importance score parameters alongside the model weights during fine-tuning, to work on blocks of local parameters. The researchers partition each matrix in the transformer into fixed-size blocks, with the goal of bringing data locality closer to what efficient execution requires.
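The block idea can be sketched as follows: each fixed-size block of a weight matrix shares a single score, and a block is kept or dropped as a whole. This is a simplified illustration under stated assumptions. In the paper the scores are learned during fine-tuning, whereas here they are fixed, made-up values, and the simple threshold rule and the names `block_mask` and `apply_mask` are hypothetical.

```python
def block_mask(scores, block_rows, block_cols, threshold=0.0):
    """Expand one score per block into an element-wise 0/1 mask.

    A block is kept (mask = 1) only if its score exceeds `threshold`.
    The threshold rule is an assumption made for this sketch.
    """
    mask = []
    for score_row in scores:
        expanded = []
        for s in score_row:
            keep = 1 if s > threshold else 0
            expanded.extend([keep] * block_cols)
        # Every row inside the same band of blocks shares this pattern.
        for _ in range(block_rows):
            mask.append(expanded[:])
    return mask


def apply_mask(weights, mask):
    """Multiply weights by the mask, zeroing every dropped block."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


# A 4x4 matrix split into four 2x2 blocks, with one made-up score per block.
weights = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
scores = [[1.5, -0.3],
          [-2.0, 0.8]]
mask = block_mask(scores, block_rows=2, block_cols=2)
print(apply_mask(weights, mask))
```

Because whole blocks become zero together, the surviving weights stay contiguous, which is the data-locality property described above.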
Similar to past work, the proposed approach is trained with distillation to match the performance of a teacher model. Unlike distillation approaches, which require fully specifying the new model structure, this method requires only the size and shape of the blocks for each parameter matrix in a model.
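For readers unfamiliar with distillation, the sketch below shows one common way to measure how far a student's predictions are from a teacher's: the KL divergence between temperature-softened output distributions. The article does not specify the exact loss used in the paper, so this formulation, the `temperature` value, and the function names are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's to the student's distribution.

    An assumed, commonly used formulation, not necessarily the paper's.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [1.0, 2.0, 3.0]
print(distillation_loss([1.0, 2.0, 3.0], teacher))  # student matches teacher
print(distillation_loss([3.0, 2.0, 1.0], teacher))  # student disagrees
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pushes the pruned model toward the teacher's behavior.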
The team conducted evaluation experiments on five pretrained language model tasks: question answering (SQuAD v1.1 and SQuAD v2), natural language inference (MNLI), sentence similarity (QQP), sentiment classification (SST2), and abstractive summarization (CNN/DailyMail). They used BERT for sentence classification and question answering, and BART for summarization. They also compared their results against state-of-the-art approaches developed for fast inference of transformer-based language models.
The results show a 2.4x speedup on SQuAD v1.1 with a 1 percent drop in F1, a 2.3x speedup on QQP with a 1 percent loss in F1, and a 1.39x speedup with an average drop of 2 points across all ROUGE metrics on CNN/DailyMail. This demonstrates that the proposed method can extract small pruned models that are equivalent to or better than distilled networks, and can reduce compute burdens and accelerate deep neural network inference while maintaining most of the original model accuracy.
The researchers also believe the proposed method could help alleviate privacy concerns around NLP systems, as migrating large server-side models to smaller versions running on user devices allows more information to stay private.
The paper Block Pruning For Faster Transformers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang