Neural network pruning techniques can reduce the parameter counts of large trained networks by over 90 percent without compromising accuracy. It follows that if a network can be reduced in size, it may be possible to train this smaller architecture instead, which would make the training process more efficient.
Following this insight, a 2018 MIT CSAIL paper uncovered subnetworks that train from the start and learn at least as fast as their larger counterparts while reaching similar test accuracy. The authors termed these subnetworks “lottery tickets” or “winning tickets,” and the existence of such subnetworks is now widely referred to as “The Lottery Ticket Hypothesis” (LTH).
A research team from Georgia Tech, Microsoft Research and Microsoft Azure AI recently revisited the concept of lottery tickets in extremely over-parametrized models and found that at certain compression ratios, the generalization performance of the winning tickets can not only match but also exceed that of their full counterpart models. What’s more, the team revealed a phase transition phenomenon: as the compression ratio increases, the winning tickets’ generalization performance can reach an optimal point, enabling them to become “super tickets.”
Previous studies have shown that winning tickets can transfer across tasks and datasets and can be identified when finetuning the pretrained models on downstream tasks. However, these studies tended to focus on searching for a highly compressed subnetwork with performance comparable to the full model, and neglected the behaviour of the winning tickets in lightly compressed subnetworks.
In the new paper, the researchers study the behaviour of the winning tickets (especially on lightly compressed subnetworks) in pretrained language models. They first demonstrate how to identify winning tickets in Google’s large language model BERT through structured pruning of attention heads and feed-forward layers. They adopt the importance score (the expected sensitivity of the model outputs with respect to the mask variables) as their gauge for pruning. The importance score can be interpreted as an indicator of expressive power: a low importance score, for instance, indicates that the corresponding structure contributes little to the output. After pruning the heads and feed-forward layers with the lowest importance scores, the team obtained winning tickets at different compression ratios. The super tickets were identified as the winning tickets with the best validation performance after rewinding.
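The pruning step described above can be sketched in a few lines of Python. This is a toy illustration rather than the paper's implementation: it assumes a tiny linear "model" whose output is the sum of per-structure contributions, a squared-error loss, and an importance score estimated as the mean absolute gradient of that loss with respect to each mask variable (evaluated with all masks set to 1). The function names `importance_scores` and `prune_lowest` are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (features, target)


def importance_scores(weights: Sequence[float], samples: Sequence[Sample]) -> List[float]:
    """Estimate one importance score per structure (toy stand-in for a head).

    Assumed toy model: output = sum_h mask_h * weights[h] * features[h],
    loss = (output - target) ** 2, with every mask_h equal to 1.
    Then d(loss)/d(mask_h) = 2 * (output - target) * weights[h] * features[h],
    and the score is the mean absolute value of that gradient over the samples.
    """
    totals = [0.0] * len(weights)
    for features, target in samples:
        contributions = [w * x for w, x in zip(weights, features)]
        error = sum(contributions) - target
        for h, contribution in enumerate(contributions):
            totals[h] += abs(2.0 * error * contribution)
    return [total / len(samples) for total in totals]


def prune_lowest(scores: Sequence[float], n_prune: int) -> List[int]:
    """Return a 0/1 mask that zeroes the n_prune structures with the lowest scores."""
    ranked = sorted(range(len(scores)), key=lambda h: scores[h])
    pruned = set(ranked[:n_prune])
    return [0 if h in pruned else 1 for h in range(len(scores))]


# Illustrative values only; they are not taken from the paper.
toy_weights = [1.0, 0.0, 2.0]
toy_samples = [([1.0, 1.0, 1.0], 1.0), ([2.0, 1.0, 0.5], 0.0)]

scores = importance_scores(toy_weights, toy_samples)
mask = prune_lowest(scores, n_prune=1)
print(scores)  # [8.0, 0.0, 7.0]
print(mask)    # [1, 0, 1]
```

In this toy case the second structure never affects the output, so its score is zero and it is the first to be pruned. Repeating the pruning step with larger `n_prune` values would give candidate tickets at different compression ratios.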
The team used multitask learning to evaluate the generalization ability of these super tickets, and showed that the shared models in this setting are usually highly over-parameterized. To mitigate the redundancy of these shared models, the researchers sought to identify task-specific super tickets, and proposed a novel ticket-sharing algorithm to update the parameters of the multitask model.
The team breaks down the idea behind their ticket-sharing algorithm as follows: If a certain network structure (e.g., an attention head) is identified as part of a super ticket by multiple tasks, its weights are jointly updated by these tasks. If it is selected by only one task, its weights are updated by that task only. If no task selects it, its weights are pruned entirely.
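The three cases above can be expressed as a small routing rule. The sketch below is a minimal illustration under stated assumptions, not the authors' code: structures are identified by hypothetical names such as `head_0`, each task's super ticket is given as a set of those names, gradients from co-owning tasks are simply summed (the article does not say how joint updates are combined), and a plain gradient-descent step stands in for the real optimizer.

```python
from typing import Dict, Iterable, List, Set


def ticket_sharing_plan(
    task_tickets: Dict[str, Set[str]], structures: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each structure to the tasks whose super ticket contains it.

    Several tasks -> jointly updated; one task -> updated by that task only;
    empty list -> selected by no task, so the structure is pruned.
    """
    return {
        s: sorted(task for task, ticket in task_tickets.items() if s in ticket)
        for s in structures
    }


def shared_update(
    weights: Dict[str, float],
    task_grads: Dict[str, Dict[str, float]],
    plan: Dict[str, List[str]],
    lr: float = 0.1,
) -> Dict[str, float]:
    """Apply one toy gradient step that follows the sharing plan.

    Assumption: gradients from all owning tasks are summed before the step.
    """
    updated = {}
    for s, w in weights.items():
        owners = plan[s]
        if not owners:
            updated[s] = 0.0  # pruned: no task selected this structure
        else:
            updated[s] = w - lr * sum(task_grads[task][s] for task in owners)
    return updated


# Illustrative setup; task names and ticket contents are made up for the example.
structures = ["head_0", "head_1", "head_2", "head_3"]
task_tickets = {"mnli": {"head_0", "head_1"}, "rte": {"head_1", "head_2"}}

plan = ticket_sharing_plan(task_tickets, structures)
print(plan)
# {'head_0': ['mnli'], 'head_1': ['mnli', 'rte'], 'head_2': ['rte'], 'head_3': []}
```

Here `head_1` would receive updates from both tasks, `head_0` and `head_2` from a single task each, and `head_3` would be pruned.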
The researchers conducted extensive experiments on the General Language Understanding Evaluation (GLUE) benchmark. They first finetuned pretrained BERT models with task-specific data, comparing ST-DNN_BASE/LARGE (initialized with BERT-base/large) and SuperT_BASE/LARGE (initialized with the chosen set of super tickets in BERT-base/large). They then conducted five trials of pruning and rewinding experiments to evaluate the generalization performance of the super tickets.
The team summarizes the results as:
- Across all tasks, SuperT consistently achieves better generalization than ST-DNN. The task-averaged improvement is around 0.9 over ST-DNN_BASE and 1.0 over ST-DNN_LARGE.
- The super tickets’ performance gains are more significant on small tasks.
- Performance of the super tickets is related to model size.
The team also summarizes their observations with regard to phase transitions:
- The winning tickets are indeed the “winners.”
- Phase transition is pronounced over different tasks and models. Accuracy of the winning tickets increases until a certain compression ratio. Beyond that threshold, accuracy decreases until it intersects with that of the random tickets.
- Phase transition is more pronounced in large models and small tasks.
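The phase-transition observations suggest a simple selection rule: track validation accuracy at each compression ratio, take the peak as the super ticket, and note where the winning tickets stop beating random tickets. The sketch below assumes the curves are available as dictionaries keyed by the fraction of structures pruned. The function names are hypothetical and the numbers are invented purely for illustration; they are not results from the paper.

```python
from typing import Dict, Optional


def find_super_ticket(winning: Dict[float, float]) -> float:
    """Return the compression ratio with the highest validation accuracy."""
    return max(winning, key=lambda ratio: winning[ratio])


def crossover_ratio(
    winning: Dict[float, float], random_tickets: Dict[float, float]
) -> Optional[float]:
    """Return the first ratio past the peak where the winning tickets no longer
    beat random tickets, or None if that never happens in the data."""
    peak = find_super_ticket(winning)
    for ratio in sorted(winning):
        if ratio > peak and winning[ratio] <= random_tickets[ratio]:
            return ratio
    return None


# Made-up accuracy curves: {fraction of structures pruned: validation accuracy}.
winning = {0.0: 0.80, 0.1: 0.82, 0.2: 0.81, 0.5: 0.70, 0.8: 0.55}
random_tickets = {0.0: 0.80, 0.1: 0.78, 0.2: 0.75, 0.5: 0.66, 0.8: 0.55}

print(find_super_ticket(winning))                # 0.1
print(crossover_ratio(winning, random_tickets))  # 0.8
```

With these toy curves, accuracy rises under light compression, peaks at the super ticket, and then falls until it meets the random-ticket curve.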
Overall, this work validates that model generalization can be improved through structured pruning, and super tickets can be used to help improve model generalization.
The paper Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang