The continuing success of deep learning models has been largely due to their scalability, which allows them to handle large-scale data and billions of model parameters. Deploying such huge models on resource-limited devices, however, remains a challenge for the research community.
A variety of model compression and acceleration techniques have been developed to address this issue, and one of the most popular is knowledge distillation, which effectively learns a small student model from a large teacher model. Knowledge distillation seems a practical and effective solution, but just how well does it really work?
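Concretely, knowledge distillation typically trains the student to match the teacher's temperature-softened output distribution alongside the usual supervised loss. The following NumPy sketch of the standard Hinton-style objective is illustrative only; function names and default hyperparameters here are assumptions, not details from the paper:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T produces a softer distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft term: KL divergence between the softened teacher and student
    # distributions, scaled by T^2 as in the original distillation recipe.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean() * T * T
    # Hard term: ordinary cross-entropy on the ground-truth labels.
    p = softmax(student_logits)
    ce = -np.log(p[np.arange(len(labels)), labels]).mean()
    return alpha * kl + (1 - alpha) * ce
```

When the student's logits exactly equal the teacher's, the KL term vanishes and only the weighted cross-entropy remains, which is one way to see that the soft term specifically rewards matching the teacher.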
In the paper Does Knowledge Distillation Really Work?, a research team from New York University and Google Research shows that there are often surprisingly large discrepancies between the predictive distributions of teacher and student models and that this can lead to very different predictions even when student models have the capacity to perfectly match their teachers.
Previous research on knowledge distillation has shown that despite the smaller student networks’ lack of inductive biases for learning representations from training data, they can still effectively represent the solutions to problems. This has led to the assumption that the knowledge acquired by a large teacher model can be effectively transferred to a single smaller student model.
The NYU and Google researchers, however, suggest that knowledge distillation techniques often fail to transfer sufficient knowledge from teacher to student. While previous studies have focused on improving students' generalization (their predictive power on unseen, in-distribution data), they have not distinguished generalization from fidelity (the ability of a student to match its teacher's predictions). The team demonstrates that in many cases it is surprisingly difficult to obtain good student fidelity.
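Fidelity can be measured directly as agreement between teacher and student predictions, for instance via top-1 agreement and the average KL divergence between their predictive distributions. A minimal sketch of two such metrics (the names and exact definitions here are illustrative):

```python
import numpy as np

def top1_agreement(student_probs, teacher_probs):
    # Fraction of inputs on which student and teacher predict the same class.
    return float(np.mean(student_probs.argmax(-1) == teacher_probs.argmax(-1)))

def avg_predictive_kl(student_probs, teacher_probs, eps=1e-12):
    # Average KL(teacher || student); zero means the predictive
    # distributions match exactly on every input.
    t = np.clip(teacher_probs, eps, 1.0)
    s = np.clip(student_probs, eps, 1.0)
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))
```

Note that generalization (accuracy against ground-truth labels) and fidelity (agreement with the teacher) are computed against different targets, which is exactly why the two can diverge.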
The researchers first explore failures in large-network distillation by distilling an ensemble of three ResNet-56 models into a single student, exposing a significant fidelity gap between teacher and student. They explain that knowledge distillation is, in general, a means of transferring representations discovered by large black-box models into simpler, more interpretable models. Whether knowledge has been effectively transferred from teacher to student is judged in terms of fidelity and generalization, qualities that are also foundational for understanding how knowledge distillation works and how it can be leveraged across a variety of applications.
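When the teacher is an ensemble, the distribution the student is asked to match is the average of the members' predictive distributions, not the softmax of their averaged logits. A small sketch of this (array shapes and the function name are illustrative assumptions):

```python
import numpy as np

def ensemble_predictive(member_logits):
    # member_logits: array of shape (n_members, batch, n_classes).
    # The ensemble's predictive distribution is the mean of the members'
    # softmax outputs; this is what a distilled student would target.
    z = member_logits - member_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.mean(axis=0)
```

Because the average of several softmax outputs is generally softer and better calibrated than any single member, a single student network matching it exactly is a demanding target.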
The team then investigates whether applying data augmentation strategies during distillation can improve student fidelity. Although extensive augmentation proves useful in this regard, the improvements are so small that the researchers consider it very unlikely that insufficient teacher labels are the main obstacle to higher fidelity. They also find that modifying the distillation data can be beneficial, but again conclude that these small improvements alone cannot significantly affect overall model fidelity.
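One common way to enlarge a distillation set is to synthesize new inputs by mixing existing ones, MixUp-style, and then query the teacher on the mixed inputs for fresh soft targets. The sketch below is a generic illustration of that idea, not the specific augmentation recipe from the paper:

```python
import numpy as np

def mixup_batch(x, rng, alpha=0.2):
    # Convex-combine each input with a randomly chosen partner from the
    # same batch; the teacher would then be queried on the mixed inputs
    # to produce soft targets for the student.
    lam = rng.beta(alpha, alpha, size=(len(x), 1))
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm]
```

Every mixed input lies between the originals elementwise, so the augmented data stays close to the training distribution while still forcing the student to track the teacher on points it never saw labeled.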
Finally, the researchers turn their attention to student behaviour on the distillation data itself, investigating whether students match their teachers on the very data they are trained to match them on. They conduct a simplified distillation experiment with a ResNet-20 student on CIFAR-100 using a baseline data augmentation strategy, and find that the student is unable to match the teacher even with these basic augmentations. They also explore whether modifications to the problem might produce a high-fidelity student, but knowledge distillation still fails to converge to optimal student parameters even when the answer is known and a good initialization directs the student toward an optimum.
Overall, the team summarizes their key findings as:
- Good student accuracy does not imply good distillation fidelity: even outside of self-distillation, the models with the best generalization do not always achieve the best fidelity.
- Student fidelity is correlated with calibration when distilling ensembles: although the highest-fidelity student is not always the most accurate, it is always the best calibrated.
- Optimization is challenging in knowledge distillation: even in cases when the student has sufficient capacity to match the teacher on the distillation data, it is unable to do so.
- There is a trade-off between optimization complexity and distillation data quality: enlarging the distillation dataset beyond the teacher training data makes it easier for the student to identify the correct solution, but also makes an already difficult optimization problem harder.
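Calibration here refers to how well a model's confidence matches its actual accuracy, commonly summarized by the expected calibration error (ECE). A minimal ECE sketch (the equal-width binning scheme and names are illustrative assumptions):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    # Bin predictions by confidence and compare average confidence with
    # accuracy inside each bin; lower ECE means better calibration.
    conf = probs.max(axis=-1)
    pred = probs.argmax(axis=-1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return float(ece)
```

This makes the second finding concrete: a high-fidelity student of an ensemble inherits the ensemble's softened, well-calibrated confidences even when its raw accuracy is not the best of the cohort.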
The paper Does Knowledge Distillation Really Work? is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang