AI Machine Learning & Data Science Research

Microsoft & Xiamen U’s Progressive Distillation Method Sets New SOTA for Dense Retrieval

In the new paper Progressive Distillation for Dense Retrieval, a research team from Xiamen U and Microsoft Research presents PROD, a progressive distillation method for dense retrieval that achieves state-of-the-art performance on five widely used benchmarks.

Knowledge distillation is a classic approach for transferring knowledge from a powerful teacher model to a smaller student model. While it might be assumed that a stronger teacher model would naturally produce a stronger student model, this is not always the case, especially when the teacher-student gap is large. As the Xiamen University and Microsoft researchers put it, "the university professor may not be more suitable than a kindergarten teacher to teach a kindergarten student."
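As background, here is a minimal sketch of a generic distillation objective in PyTorch (not PROD's exact loss): the student is trained to match the teacher's softened output distribution via a KL term alongside standard cross-entropy on hard labels. The temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label distillation: match the teacher's softened distribution
    (KL term) while still fitting the hard labels (CE term)."""
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random logits over 4 candidate passages per query.
student_logits = torch.randn(8, 4)
teacher_logits = torch.randn(8, 4)
labels = torch.zeros(8, dtype=torch.long)  # positive passage at index 0
print(distillation_loss(student_logits, teacher_logits, labels))
```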

The Xiamen U and Microsoft Research team explores this issue in their new paper Progressive Distillation for Dense Retrieval, proposing PROD, a progressive distillation method for dense retrieval (matching queries to documents) that achieves state-of-the-art performance on five widely used benchmarks.

PROD aims to gradually close the gap between a trained teacher model and the target student model via two sequential mechanisms: Teacher Progressive Distillation (TPD), which steps up teacher capability across stages so the student can learn progressively; and Data Progressive Distillation (DPD), in which the student initially trains on all available data and later steps focus on samples it handles poorly. DPD is designed to surface knowledge that is neither too easy nor too hard for the student, much as a tutor would. A regularization loss is also applied at each progressive step to avoid catastrophic forgetting of previously learned knowledge.
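A rough sketch of how DPD-style sample filtering and the per-step regularization might look is below; the margin rule, the MSE form of the regularizer, and the `reg_weight` value are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def select_hard_samples(student_scores, margin=0.1):
    """Illustrative DPD-style filter: keep queries where the student's score
    for the positive passage (column 0) does not beat its best negative
    (remaining columns) by at least `margin`."""
    pos = student_scores[:, 0]
    best_neg = student_scores[:, 1:].max(dim=1).values
    return (pos - best_neg) < margin  # boolean mask over the batch

def stage_loss(student_logits, teacher_logits, prev_student_logits,
               reg_weight=0.1, T=1.0):
    """Distill from the current teacher while regularizing toward the frozen
    previous-stage student to limit catastrophic forgetting (assumed form)."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean")
    reg = F.mse_loss(student_logits, prev_student_logits.detach())
    return kd + reg_weight * reg
```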

The PROD framework employs three teacher models of increasing capability: a 12-layer dual encoder (DE), a 12-layer cross encoder (CE), and a 24-layer CE, used in sequence to gradually boost a 6-layer DE student model.
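The staged training could be organized roughly as in the sketch below; `load_teacher`, `distill_stage`, and `filter_hard` are hypothetical placeholders supplied by the caller, not functions from the authors' code, and the checkpoint identifiers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    name: str        # human-readable teacher description
    teacher_id: str  # hypothetical checkpoint identifier

# Teacher progression described in the paper: increasingly capable teachers.
SCHEDULE: List[Stage] = [
    Stage("12-layer dual encoder", "de-12"),
    Stage("12-layer cross encoder", "ce-12"),
    Stage("24-layer cross encoder", "ce-24"),
]

def run_prod(student, data, load_teacher: Callable, distill_stage: Callable,
             filter_hard: Callable, schedule: List[Stage] = SCHEDULE):
    """Illustrative outer loop: each stage distills the 6-layer DE student from
    the next teacher (TPD), then narrows the training data to samples the
    student still handles poorly (DPD)."""
    for stage in schedule:
        teacher = load_teacher(stage.teacher_id)
        student = distill_stage(student, teacher, data)
        data = filter_hard(student, data)
    return student
```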

In their empirical study, the team conducted experiments on five widely used benchmark datasets — MS MARCO Passage, TREC Passage 19, TREC Document 19, MS MARCO Document and Natural Questions — with PROD achieving state-of-the-art results for dense retrieval across all datasets.

Overall, the paper validates PROD as a promising distillation approach for dense retrieval, and the researchers hope their work will inspire further research in this area.

The paper Progressive Distillation for Dense Retrieval is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


