As language models continue to grow in size, reducing their serving cost has become an important research area. Knowledge distillation has emerged as a promising and effective method for model compression, but existing distillation methods struggle with model serving in today's massive datacenters, where they must handle fast-evolving models, account for serving performance, and optimize for multiple objectives.
To deal with these issues, a research team from the University of Illinois Urbana-Champaign and Google has introduced AutoDistill, an end-to-end fully automated model distillation framework that integrates model architecture exploration and multi-objective optimization for building hardware-efficient pretrained natural language processing (NLP) models.
The team summarizes their main contributions as:
- We propose an end-to-end framework for fully automated model distillation, which satisfies user-defined metrics and constraints by delivering optimized pretrained models distilled from large NLP models. It can be easily extended to new search spaces and objectives, thereby eliminating the need for distillation experts. It helps solve the most critical problem of productionizing large-scale model distillation in datacenters.
- We use Bayesian Optimization (BO) (Golovin et al., 2017; Snoek et al., 2012) to conduct multi-objective NAS for student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on the target serving hardware. This is the first time BO has been adopted in a NAS and distillation framework to deliver hardware-efficient large-scale NLP pretrained models.
- Enabled by AutoDistill, experiments on TPUv4i identify seven model architectures with up to 3.2% higher pretrained accuracy and up to 1.44× latency speedups compared to MobileBERT (Sun et al., 2020). Four of them achieve higher GLUE average scores (up to 81.69) than BERTBASE (Devlin et al., 2018), DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020), and MobileBERT. Two models are smaller and achieve higher SQuAD accuracy than DistilBERT, TinyBERT, and NAS-BERT (Xu et al., 2021).
AutoDistill is an end-to-end solution designed to generate optimized task-agnostic pretrained language models for target hardware configurations. AutoDistill takes user requirements, objectives, and constraints as inputs, covering key components such as pretraining tasks, model design spaces, target hardware, and evaluation metrics.
The overall AutoDistill flow comprises three major stages: model exploration, flash distillation, and evaluation. Model exploration searches for better compressed models given the design space, evaluation metrics, and user-specified constraints. Flash distillation then grows the most promising candidate into a student model that learns from both the pretraining datasets and the teacher model; this stage also handles regular distillation with the same teacher model but different training setups. The flash-distilled student model is then evaluated on the target tasks and hardware for prediction accuracy, next-sentence-prediction accuracy, and hardware performance. Once all desired metrics are collected, the information is passed back to the model exploration stage, where the search engine selects the optimal model for the next iteration.
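The three-stage loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the design space, function names, and the toy accuracy and latency formulas are all assumptions made for demonstration, with random sampling standing in for the actual search algorithm.

```python
import random

# Hypothetical design space; the paper's actual search space differs.
DESIGN_SPACE = {
    "num_layers": [6, 12, 18, 24],
    "hidden_size": [256, 384, 512],
    "num_heads": [4, 8, 12],
}

def propose_candidate(history):
    """Model exploration: pick the next student architecture to try.
    AutoDistill uses Bayesian Optimization over past trials; uniform
    random sampling stands in for it in this sketch."""
    return {name: random.choice(opts) for name, opts in DESIGN_SPACE.items()}

def flash_distill(candidate):
    """Flash distillation: a short, cheap distillation run that yields an
    early quality estimate for the candidate (stubbed with a toy formula)."""
    return 0.75 + 0.005 * candidate["num_layers"]

def evaluate(candidate, accuracy):
    """Evaluation: collect all desired metrics, including latency measured
    on the target hardware (stubbed with a toy cost model)."""
    latency_ms = candidate["num_layers"] * candidate["hidden_size"] / 512.0
    return {"accuracy": accuracy, "latency_ms": latency_ms}

def search(num_trials=10):
    """Run the exploration -> flash distillation -> evaluation loop,
    feeding each trial's metrics back to the search stage."""
    history = []
    for _ in range(num_trials):
        candidate = propose_candidate(history)   # stage 1: model exploration
        accuracy = flash_distill(candidate)      # stage 2: flash distillation
        metrics = evaluate(candidate, accuracy)  # stage 3: evaluation
        history.append((candidate, metrics))
    return history
```

Each iteration closes the loop: the metrics gathered in the evaluation stage become the feedback the exploration stage uses to propose the next candidate.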
Notably, AutoDistill formulates student-model architecture search as a black-box optimization problem, integrating the Bayesian Optimization algorithm and Vizier (Golovin et al., 2017), a cloud-based black-box optimization service, into its search engine. Because the fully automated evaluation stage measures the student model directly on the target hardware and datacenter software environment, the framework captures valid and precise hardware feedback.
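Because the search optimizes two objectives at once (higher accuracy, lower latency), candidate comparison is naturally Pareto-style: a trial is kept only if no other trial beats it on both fronts. The sketch below illustrates that selection rule; it is an assumed stand-in for demonstration, not the Vizier algorithm itself.

```python
def dominates(a, b):
    """True if trial `a` is at least as good as `b` on both objectives
    (higher accuracy, lower latency) and strictly better on at least one."""
    return (a["accuracy"] >= b["accuracy"]
            and a["latency_ms"] <= b["latency_ms"]
            and (a["accuracy"] > b["accuracy"]
                 or a["latency_ms"] < b["latency_ms"]))

def pareto_front(trials):
    """Return the trials not dominated by any other trial."""
    return [t for t in trials
            if not any(dominates(other, t) for other in trials
                       if other is not t)]
```

For example, a model with 0.82 accuracy at 8 ms dominates one with 0.80 accuracy at 10 ms, but neither dominates a 0.85-accuracy, 12 ms model, so the latter two both survive on the front.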
AutoDistill has several advantages over previous differentiable neural architecture search (DNAS) methods: 1) it avoids the enormous up-front effort of training a large supernet on NLP pretraining tasks; 2) it scales better to much larger design spaces; and 3) it is easily extended to new objectives and new models with different architecture configurations.
The team conducted extensive experiments to evaluate AutoDistill. On the General Language Understanding Evaluation (GLUE) benchmark, which comprises nine downstream natural language understanding tasks, AutoDistill achieved higher average scores than BERTBASE, DistilBERT, TinyBERT6 and MobileBERT with significantly smaller model sizes. In experiments on Google's TPUv4i hardware, AutoDistill-generated models achieved up to 3.2 percent higher pretrained accuracy and up to 1.44× latency speedups compared to MobileBERT.
Overall, AutoDistill improves both prediction accuracy and serving latency on target hardware, indicating its promise and potential for building next-generation hardware-efficient pretrained NLP models.
The paper AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen