Large Language Model (LLM) providers often build entire families of models from scratch, each varying in size. However, training multiple multi-billion-parameter models from the ground up is highly demanding in terms of time, data, and computational resources. To address this challenge, recent research has shown that combining weight pruning with knowledge distillation can dramatically reduce the cost of training LLM families.
Following this path, in a new paper LLM Pruning and Distillation in Practice: The Minitron Approach, an NVIDIA research team presents the Minitron compression strategy, which effectively produces a robust 4B model from Llama 3.1 8B and a cutting-edge Mistral-NeMo-Minitron-8B model derived from Mistral NeMo 12B.


At a high level, the process begins with light fine-tuning of the teacher model on the target dataset, a step the researchers call teacher correction. The model is then pruned to reduce its size, followed by knowledge distillation to restore any lost accuracy.
To initiate pruning, the team calculates the importance of each layer, neuron, head, and embedding dimension. They rank these elements based on importance, using a purely activation-based estimation method. This method captures sensitivity data for all axes—depth, neuron, head, and embedding channels—through a small calibration dataset and simple forward passes.
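To make the activation-based estimation more concrete, here is a minimal PyTorch sketch of how per-neuron importance could be scored from activation magnitudes gathered over a small calibration set with forward passes only. It is an illustrative assumption rather than the paper's exact implementation, and the `mlp.up_proj` module name follows common Llama-style naming conventions.

```python
import torch

def mlp_neuron_importance(model, calib_loader, device="cpu"):
    """Estimate per-neuron importance for each MLP block from activation
    magnitudes collected over a small calibration set (forward passes only)."""
    scores, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # Aggregate |activation| over batch and sequence dims -> one score per neuron.
            act = output.detach().abs().float()
            score = act.mean(dim=tuple(range(act.dim() - 1)))
            scores[name] = scores.get(name, 0) + score
        return hook

    # Attach hooks to the up-projection of every MLP block
    # (the name pattern is an assumption about Llama-style models).
    for name, module in model.named_modules():
        if name.endswith("mlp.up_proj"):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in calib_loader:  # small calibration dataset
            model(batch["input_ids"].to(device))

    for h in hooks:
        h.remove()
    return scores  # higher score = more important neuron
```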

In the model trimming phase, the researchers rank the elements of each axis by importance and prune the corresponding weight matrices. For neuron and head pruning, they trim weights in the MLP and MHA layers, respectively. When pruning embedding channels, they adjust the embedding dimensions of the weight matrices in the MLP, MHA, and LayerNorm layers.
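The sketch below illustrates how such a ranking could be turned into trimmed weight matrices for a single Llama-style MLP block. The `up_proj`, `gate_proj`, and `down_proj` names and the slicing details are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

def prune_mlp_neurons(mlp, neuron_scores: torch.Tensor, keep: int):
    """Keep the `keep` highest-scoring intermediate neurons of a Llama-style MLP
    block by slicing the corresponding rows/columns of its weight matrices."""
    idx = torch.topk(neuron_scores, keep).indices.sort().values  # preserve original order

    def slice_linear(linear: nn.Linear, dim: int) -> nn.Linear:
        # dim=0 slices output neurons (rows), dim=1 slices input neurons (columns).
        w = linear.weight.data.index_select(dim, idx)
        new = nn.Linear(w.shape[1], w.shape[0], bias=linear.bias is not None)
        new.weight.data.copy_(w)
        if linear.bias is not None:
            b = linear.bias.data.index_select(0, idx) if dim == 0 else linear.bias.data
            new.bias.data.copy_(b)
        return new

    mlp.up_proj = slice_linear(mlp.up_proj, dim=0)      # rows: intermediate neurons
    mlp.gate_proj = slice_linear(mlp.gate_proj, dim=0)
    mlp.down_proj = slice_linear(mlp.down_proj, dim=1)  # columns: intermediate neurons
    return mlp
```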

The retraining phase considers two strategies: conventional retraining with ground-truth labels, and knowledge distillation, in which the pruned model (student) learns from the unpruned model (teacher). During distillation, the researchers use a forward KL divergence loss computed solely on the logits of the teacher and student models.
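A minimal sketch of what such a logit-only forward KL distillation loss can look like in PyTorch is shown below; the optional temperature term and the `batchmean` reduction are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def forward_kl_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 1.0) -> torch.Tensor:
    """Forward KL divergence KL(teacher || student), computed only on logits."""
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities for the input (student) and, with
    # log_target=True, log-probabilities for the target (teacher) as well.
    return F.kl_div(s_log_probs, t_log_probs,
                    log_target=True, reduction="batchmean") * temperature ** 2

# Usage: the teacher runs in no-grad mode; only the student receives gradients.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = forward_kl_distillation_loss(student(input_ids).logits, teacher_logits)
```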

Empirical results show that the Minitron compression strategy delivers a state-of-the-art 8B model (MN-Minitron-8B), which surpasses similarly sized models across common language modeling benchmarks. The Llama-3.1-Minitron-4B model also demonstrates impressive accuracy, closely matching the performance of its teacher, the Llama 3.1 8B, and outperforming the previous-generation Minitron-4B. Additionally, the MN-Minitron-8B achieves an average speedup of 1.2× compared to the Mistral NeMo 12B teacher, while the Llama-3.1-Minitron-4B models provide speedups of 2.7× and 1.8× for their depth- and width-pruned variants, respectively, compared to the Llama 3.1 8B teacher.
Overall, the Minitron approach exemplifies a practical and efficient method for compressing LLMs while preserving or enhancing their performance across key benchmarks.
The paper LLM Pruning and Distillation in Practice: The Minitron Approach is on arXiv.
Author: Hecate He | Editor: Chain Zhang
