Content provided by Wangchunshu Zhou, co-first author of the paper BERT-of-Theseus: Compressing BERT by Progressive Module Replacing.
What’s New: In this paper, we propose a novel model compression approach that compresses BERT through progressive module replacing. Compared with previous knowledge distillation approaches to BERT compression, our approach uses only one loss function and one hyper-parameter, which reduces the human effort spent on hyper-parameter tuning. It outperforms existing knowledge distillation approaches on the GLUE benchmark and offers a new perspective on model compression.
How It Works: Our approach progressively substitutes modules of BERT with modules that have fewer parameters. It first divides the original BERT into several modules and builds a compact substitute for each one. Then, during training, we randomly replace original modules with their substitutes so that the compact modules learn to mimic the behavior of the originals. We progressively increase the replacement probability over the course of training. In this way, our approach brings a deeper level of interaction between the original and compact models, and smooths the training process.
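The replacement procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `replacement_probability` and `mixed_forward`, the linear schedule, and the starting probability of 0.5 are all assumptions made for the example, and simple numeric functions stand in for the original and compact BERT modules.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A "module" here is any function mapping an input to an output.
# In practice these would be groups of Transformer layers (assumption).
Module = Callable[[float], float]


def replacement_probability(step: int, total_steps: int, start: float = 0.5) -> float:
    """Hypothetical linear schedule: grow the replacement probability
    from `start` at step 0 to 1.0 at `total_steps`."""
    if total_steps <= 0:
        return 1.0
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (1.0 - start) * frac


def mixed_forward(
    x: float,
    original_modules: Sequence[Module],
    compact_modules: Sequence[Module],
    p: float,
    rng: random.Random,
) -> Tuple[float, List[bool]]:
    """Run one forward pass, independently swapping each original module
    for its compact substitute with probability `p`."""
    if len(original_modules) != len(compact_modules):
        raise ValueError("each original module needs exactly one substitute")
    used_compact: List[bool] = []
    for original, compact in zip(original_modules, compact_modules):
        use_compact = rng.random() < p  # Bernoulli draw for this module
        x = compact(x) if use_compact else original(x)
        used_compact.append(use_compact)
    return x, used_compact


if __name__ == "__main__":
    rng = random.Random(0)
    originals = [lambda v: v + 1.0 for _ in range(3)]  # stand-ins for BERT modules
    compacts = [lambda v: v + 10.0 for _ in range(3)]  # stand-ins for substitutes
    for step in (0, 50, 100):
        p = replacement_probability(step, total_steps=100)
        _, choices = mixed_forward(0.0, originals, compacts, p, rng)
        print(step, p, choices)
```

As the probability approaches 1, every forward pass runs through the compact modules only, which is the point at which the compact model can stand on its own.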
Key Insights: Jointly training with module replacement may be a promising approach for compressing large neural network models.
Anything else: Applying this model compression approach to ResNet-like models would be an interesting direction.
The paper BERT-of-Theseus: Compressing BERT by Progressive Module Replacing is on arXiv.
Meet the authors Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei and Ming Zhou from Wuhan University, Beihang University and Microsoft Research Asia.