SuperGLUE met its match this week when, for the first time, a new model surpassed human baseline performance on the challenging natural language understanding (NLU) benchmark.
Dubbed DeBERTa (Decoding-enhanced BERT with disentangled attention), the breakthrough Transformer-based neural language model was initially introduced by a team of researchers from Microsoft Dynamics 365 AI and Microsoft Research in June of last year. Recently scaled up to 1.5 billion parameters, DeBERTa “substantially” outperformed the previous SuperGLUE leader — Google’s 11 billion parameter T5 — and surpassed the human baseline with a score of 89.9 (vs. 89.8).
In the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention, researchers detail the new DeBERTa, which improves on the BERT and RoBERTa models using two novel techniques. Introduced by Google AI in 2018, BERT is a Transformer model that pretrains deep bi-directional representations from unlabelled text and has redefined the SOTA across NLP tasks. Facebook AI's BERT-based RoBERTa, meanwhile, employs an improved training methodology to boost downstream task performance.
The first of the new techniques is a proposed disentangled self-attention mechanism. Much of the success of Transformer-based deep learning language models such as BERT has been attributed to their self-attention mechanisms, which enable each token in an input sequence to attend independently to all other tokens in the sequence. In standard models, each word in an input is represented using a single vector that is the sum of its word (content) embedding and its position embedding. The researchers, however, point out that this gives the self-attention mechanism no natural way to separate the contribution of content from that of position. DeBERTa addresses this by representing each word with two vectors, which encode its content and its position respectively, and by computing attention weights from disentangled matrices over both.
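To make the idea concrete, here is a minimal single-head sketch of disentangled attention in NumPy. It is not the paper's implementation; all names (`Wq_c`, `Wk_r`, etc.) are illustrative, and the value projection and multi-head machinery are omitted. The attention score between tokens i and j is the sum of a content-to-content term, a content-to-position term, and a position-to-content term, where positions are relative distances clipped to a window of size k:

```python
import numpy as np

def disentangled_attention_weights(H, P, Wq_c, Wk_c, Wq_r, Wk_r):
    """Sketch of disentangled attention for one head (names illustrative).

    H: (n, d) content vectors for the n input tokens.
    P: (2k, d) relative-position embeddings for distances in [-k, k).
    The four (d, d) matrices project contents (c) and positions (r)
    into queries (q) and keys (k).
    """
    n, d = H.shape
    k = P.shape[0] // 2
    Qc, Kc = H @ Wq_c, H @ Wk_c      # content projections
    Qr, Kr = P @ Wq_r, P @ Wk_r      # relative-position projections

    # delta(i, j): relative distance i - j, shifted and clipped to [0, 2k)
    idx = np.clip(np.arange(n)[:, None] - np.arange(n)[None, :] + k,
                  0, 2 * k - 1)

    c2c = Qc @ Kc.T                                        # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, idx, axis=1)       # content-to-position
    p2c = np.take_along_axis(Kc @ Qr.T, idx, axis=1).T     # position-to-content

    scores = (c2c + c2p + p2c) / np.sqrt(3 * d)
    # row-wise softmax over the n tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)
```

The three score terms are what "disentangled" refers to: because content and position live in separate vectors, their interactions can be modelled explicitly instead of being fused into a single summed embedding.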
The second novel technique is designed to address a limitation of relying solely on relative positions, as the standard BERT model effectively does under the disentangled scheme. The Enhanced Mask Decoder (EMD) approach incorporates absolute positions in the decoding layer to predict the masked tokens during model pretraining. For example, if the words store and mall are masked for prediction in the sentence "A new store opened beside the new mall," a model using only relative positions cannot easily distinguish them, since both follow the word new. The EMD enables DeBERTa to obtain more accurate predictions, as the syntactic roles of the words also depend heavily on their absolute positions in a sentence.
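The shape of the idea can be sketched as follows. This is a deliberately simplified toy, not the paper's decoder: the actual EMD uses additional Transformer layers whose queries carry absolute-position information, whereas here the absolute position embeddings are simply added to the hidden states just before the masked-token prediction head. All names are illustrative.

```python
import numpy as np

def emd_logits(H, abs_pos, W_vocab):
    """Toy sketch of the Enhanced Mask Decoder idea (names illustrative).

    H:        (n, d) hidden states from layers that saw only relative positions.
    abs_pos:  (n, d) absolute-position embeddings, injected only here,
              near the output, rather than at the input layer.
    W_vocab:  (vocab_size, d) masked-token prediction head.
    """
    Q = H + abs_pos          # hidden states now also carry absolute positions
    return Q @ W_vocab.T     # (n, vocab_size) logits for masked-token prediction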
In experiments on the NLU benchmark SuperGLUE, a DeBERTa model scaled up to 1.5 billion parameters outperformed Google's 11 billion parameter T5 language model by 0.6 percent and was the first model to surpass the human baseline. Moreover, compared to the robust RoBERTa and XLNet models, DeBERTa demonstrated stronger performance on NLU and NLG (natural language generation) tasks while also being more efficient to pretrain.
Reporter: Fangyu Cai | Editor: Michael Sarazen