Microsoft’s DeBERTa (Decoding-enhanced BERT with disentangled attention) is regarded as the next generation of BERT-style self-attention transformer models. It has surpassed human performance on natural language processing (NLP) tasks and topped the SuperGLUE leaderboard. This week, Microsoft released DeBERTaV3, an updated version that leverages ELECTRA-style pretraining with gradient-disentangled embedding sharing to achieve better pretraining efficiency and a significant jump in model performance.
The Microsoft Azure AI and Microsoft Research team introduces two methods for improving DeBERTa. First, they combine DeBERTa with ELECTRA-style training, which significantly boosts model performance. Second, they employ a gradient-disentangled embedding sharing approach as a DeBERTaV3 building block to avoid “tug-of-war” issues and achieve better pretraining efficiency.
Following the ELECTRA-style training paradigm, the team replaces DeBERTa’s masked language modelling (MLM) objective with a more sample-efficient pretraining task, replaced token detection (RTD), in which the model is trained as a discriminator to predict whether each token in the corrupted input is original or has been replaced by a generator.
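To make the RTD setup concrete, here is a minimal sketch of how one training example could be assembled. It is an illustration only, not the team's code: `make_rtd_example`, `toy_generator`, the tiny vocabulary and the 30 percent masking rate are all hypothetical, and a random sampler stands in for a trained generator. The labelling rule (a position counts as original if the generator happens to propose the original token) follows the ELECTRA convention.

```python
import random


def make_rtd_example(tokens, mask_prob, sample_replacement, rng):
    """Build one replaced-token-detection example (illustrative sketch).

    tokens: list of original token strings
    mask_prob: chance that a position is handed to the generator
    sample_replacement: hypothetical stand-in for the generator; called as
        sample_replacement(tokens, position, rng) and returns a token
    Returns (corrupted_tokens, labels), where label 1 = replaced, 0 = original.
    """
    corrupted = list(tokens)
    labels = [0] * len(tokens)
    for i, original in enumerate(tokens):
        if rng.random() < mask_prob:
            proposal = sample_replacement(tokens, i, rng)
            corrupted[i] = proposal
            # If the generator proposes the original token, the position
            # still counts as "original" for the discriminator.
            labels[i] = int(proposal != original)
    return corrupted, labels


def toy_generator(tokens, position, rng):
    # Assumption: a uniform sampler over a tiny vocabulary replaces the
    # small MLM generator that would be trained alongside the discriminator.
    vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
    return rng.choice(vocab)


rng = random.Random(0)
tokens = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = make_rtd_example(tokens, 0.3, toy_generator, rng)
print(corrupted, labels)
```

Because every position in the corrupted input receives a label, the discriminator gets a training signal from all tokens rather than only the masked subset, which is the usual explanation for RTD's sample efficiency.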
In ELECTRA, the discriminator and the generator share the same token embeddings. This mechanism can, however, hurt training efficiency, as the training losses of the discriminator and the generator tend to pull token embeddings in different directions. While the generator’s MLM objective tries to pull semantically similar tokens closer to each other, the discriminator’s RTD objective works to discriminate semantically similar tokens, pushing their embeddings as far apart as possible to optimize binary classification accuracy. This results in inefficient “tug-of-war” dynamics.
While it is natural to imagine this problem could be solved by using separate embeddings for the generator and the discriminator, such an approach results in significant performance degradation when the model is transferred to downstream tasks. The researchers thus propose a trade-off: a novel gradient-disentangled embedding sharing (GDES) method in which the generator shares its embeddings with the discriminator, but gradients from the discriminator are stopped from backpropagating to the generator embeddings. This effectively avoids the tug-of-war dynamics.
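The difference between plain embedding sharing and GDES can be sketched with a deliberately tiny example: one scalar "embedding" and hand-written gradients. Everything here is a hypothetical illustration. The quadratic losses, the targets `target_g` and `target_d`, and the function `train` are invented for the demonstration, and the zero-initialised residual `e_delta` added on the discriminator side is an assumption based on how the technique is usually described, not something stated in this article. In an autodiff framework the same effect would typically come from a stop-gradient operation on the shared embedding.

```python
def train(steps, gdes, lr=0.1, target_g=1.0, target_d=-1.0):
    """Toy comparison of shared embeddings with and without GDES.

    The generator loss (e_g - target_g)^2 wants the embedding near target_g,
    while the discriminator loss (e_d - target_d)^2 wants it near target_d,
    so the two objectives disagree by construction.
    """
    e_g = 0.5        # generator embedding (shared with the discriminator)
    e_delta = 0.0    # discriminator-only residual, starts at zero
    for _ in range(steps):
        e_d = e_g + e_delta                 # what the discriminator sees
        grad_g = 2.0 * (e_g - target_g)     # d(generator loss)/d(e_g)
        grad_d = 2.0 * (e_d - target_d)     # d(discriminator loss)/d(e_d)
        if gdes:
            # Stop-gradient: the discriminator loss never touches e_g and
            # updates only its own residual.
            e_g -= lr * grad_g
            e_delta -= lr * grad_d
        else:
            # Plain sharing: both losses update the same embedding.
            e_g -= lr * (grad_g + grad_d)
    return e_g, e_g + e_delta


shared_g, shared_d = train(200, gdes=False)
gdes_g, gdes_d = train(200, gdes=True)
print("plain sharing:", shared_g, shared_d)
print("GDES:", gdes_g, gdes_d)
```

In this toy setting the plainly shared embedding settles between the two targets and satisfies neither objective, while under GDES the generator embedding reaches its own target and the residual absorbs the discriminator's preference.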
The team pretrained three DeBERTaV3 model variants (DeBERTaV3 Large, DeBERTaV3 Base and DeBERTaV3 Small) and evaluated them on various representative natural language understanding (NLU) benchmarks.
The DeBERTaV3 Large model achieved a 91.37 percent average score across eight tasks on the GLUE benchmark, topping DeBERTa by 1.37 percentage points and ELECTRA by 1.91 percentage points. The team also pretrained a multilingual mDeBERTa Base model, which achieved 79.8 percent zero-shot cross-lingual accuracy on the XNLI dataset, a 3.6 percentage point improvement over XLM-R Base and a new state of the art (SOTA). Overall, the results demonstrate that the improved DeBERTaV3 can significantly boost pretraining efficiency and model performance across a range of NLU benchmarks.
Author: Hecate He | Editor: Michael Sarazen