The pretraining of BERT-type large language models, which can scale up to billions of parameters, is crucial for obtaining state-of-the-art performance on many natural language processing (NLP) tasks. This pretraining process, however, is expensive and has become a bottleneck hindering the industrial application of such models.
In the new paper Token Dropping for Efficient BERT Pretraining, a research team from Google, New York University, and the University of Maryland proposes a simple but effective “token dropping” technique that significantly reduces the pretraining cost of transformer models such as BERT, without degrading performance on downstream fine-tuning tasks.

The team summarizes their main contributions as:
- We show that BERT models can be pretrained with only a subset of the layers focusing on important tokens. Even though the model is trained on sub-sequences of important tokens only, it generalizes well to full sequences during fine-tuning on downstream tasks.
- We identify important tokens through the pretraining process by exploring the training dynamics, with minimal computational overhead and without modifying the model architecture.
- We show that our token dropping strategy can save 25% of pretraining time while achieving similar performance on downstream tasks.
Current transformer models generally allocate the same amount of computation to each token in a given sequence, an approach that wastes much of the training cost on less informative tokens. The researchers' proposed token-dropping strategy for BERT pretraining addresses this issue by removing tokens that are redundant or less informative to training, boosting efficiency by training the model on only the most informative tokens in each input sequence.


While previous related studies have allocated less compute to easy-to-predict tokens or performed pooling on the embeddings of nearby tokens, directly dropping tokens is a relatively new approach. In the 2021 paper Faster Depth-Adaptive Transformers, Liu et al. identify important tokens either through mutual-information-based estimation between tokens and predefined labels, or via a separate BERT model that computes the masked language model (MLM) loss for each token.
The proposed method instead aims to accelerate the task-agnostic pretraining phase without requiring any labels or computation by a separate language model. It classifies important tokens as those the model itself finds hard to predict, as measured by their loss during training, an approach that adapts to the training process and adds practically no computational overhead. Concretely, the method identifies the tokens in each sequence with the smallest historical MLM loss as unimportant and removes them from the intermediate layers of the BERT model during training, dramatically reducing compute and memory costs.
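The flavour of this loss-based importance scoring can be conveyed with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the importance score is an exponentially smoothed running average of each vocabulary token's MLM loss, and the names `running_mlm_loss`, `update_token_scores`, `split_sequence`, `KEEP_RATIO`, and `EMA_DECAY` are illustrative placeholders.

```python
import numpy as np

VOCAB_SIZE = 30522   # BERT WordPiece vocabulary size
KEEP_RATIO = 0.5     # fraction of tokens kept in the intermediate layers (assumption)
EMA_DECAY = 0.9      # smoothing factor for the running loss (assumption)

# Running estimate of how hard each vocabulary token is to predict.
running_mlm_loss = np.zeros(VOCAB_SIZE)

def update_token_scores(token_ids, mlm_losses):
    """Update the per-token running MLM loss after a training step.

    token_ids  : 1-D array of masked-token vocabulary ids in the batch
    mlm_losses : 1-D array of the corresponding per-token MLM losses
    """
    for tid, loss in zip(token_ids, mlm_losses):
        running_mlm_loss[tid] = (EMA_DECAY * running_mlm_loss[tid]
                                 + (1.0 - EMA_DECAY) * loss)

def split_sequence(sequence_ids):
    """Split one input sequence into 'important' and 'unimportant' positions.

    Positions whose tokens have the smallest historical MLM loss are treated
    as unimportant and can be dropped from the intermediate layers.
    """
    scores = running_mlm_loss[sequence_ids]
    n_keep = max(1, int(KEEP_RATIO * len(sequence_ids)))
    order = np.argsort(-scores)            # highest loss (hardest) first
    important = np.sort(order[:n_keep])    # keep original positions in order
    unimportant = np.sort(order[n_keep:])
    return important, unimportant

# Toy example: score a short sequence after a few simulated loss updates.
rng = np.random.default_rng(0)
update_token_scores(rng.integers(0, VOCAB_SIZE, 128), rng.random(128))
seq = rng.integers(0, VOCAB_SIZE, 16)
imp, unimp = split_sequence(seq)
print("kept positions:", imp, "dropped positions:", unimp)
```

In an actual training loop, the per-token losses produced at each MLM step would feed the score update, and only the positions returned as important would be routed through the expensive intermediate layers.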
Further, the proposed token dropping strategy can be adopted without modifying the original BERT architecture or training settings, as it only requires training the intermediate layers on a subset of important tokens. The researchers show that this simple approach also generalizes well to diverse downstream tasks with full sequences.
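How the dropped tokens bypass the middle of the network can also be sketched. The toy PyTorch snippet below is a hedged illustration of this layer layout rather than the authors' code: the layer counts, dimensions, and the fixed `important_idx` mask are arbitrary placeholders, and in practice the kept positions would come from the loss-based scoring described above.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEAD, SEQ_LEN = 128, 4, 16  # toy sizes, not the paper's configuration

def make_layers(n):
    return nn.ModuleList(
        [nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True) for _ in range(n)]
    )

full_layers_in = make_layers(2)    # early layers see the full sequence
short_layers = make_layers(8)      # middle layers see only the kept tokens
full_layers_out = make_layers(2)   # final layers see the full sequence again

def forward_with_token_dropping(x, important_idx):
    """x: (batch, seq, d_model); important_idx: positions kept in the middle layers."""
    for layer in full_layers_in:
        x = layer(x)
    kept = x[:, important_idx, :]          # drop the unimportant positions
    for layer in short_layers:
        kept = layer(kept)                 # cheaper: shorter sequence
    merged = x.clone()                     # dropped tokens keep their last full-layer states
    merged[:, important_idx, :] = kept     # merge the updated important tokens back in
    for layer in full_layers_out:
        merged = layer(merged)
    return merged

# Toy forward pass that keeps every other position in the middle layers.
x = torch.randn(2, SEQ_LEN, D_MODEL)
important_idx = torch.arange(0, SEQ_LEN, 2)
print(forward_with_token_dropping(x, important_idx).shape)  # torch.Size([2, 16, 128])
```

Because the dropped positions retain their representations from the last full-sequence layer and are merged back before the final layers, the model still produces an output for every token, which is what allows it to handle full sequences at fine-tuning time.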

In their empirical evaluations, the team compared their approach with baseline BERT pretraining methods. The results show that the proposed token dropping method can reduce BERT pretraining cost by 25 percent while maintaining similar overall fine-tuning performance on standard downstream tasks.
In future work, the team plans to extend token dropping to the pretraining of transformer models that can process much longer contexts, and to apply the algorithm to additional transformer-based tasks such as translation and text generation.
The code is available on the project’s GitHub. The paper Token Dropping for Efficient BERT Pretraining is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
