NLP researchers already know that Google’s hugely popular BERT (Bidirectional Encoder Representations from Transformers) language model, trained on large amounts of data, performs very well on syntactic grammaticality judgment tasks even though it has no explicit knowledge of hierarchical syntactic structures. But can it do even better? That’s what researchers from DeepMind and the University of California, Berkeley set out to discover in a new study that adds syntactic biases to BERT to determine whether and where they can help it achieve better understanding.
The approach was inspired by knowledge distillation (KD) procedures that improve the syntactic competence of scalable language models (LMs) by using recurrent neural network grammars (RNNGs) as teachers. Since RNNGs are hierarchical syntactic LMs that predict words from left to right, integrating them into BERT, which predicts words from bidirectional context, is challenging. The researchers therefore created a new pretraining setup that distills the RNNG’s marginal distribution over words in context while remaining fully compatible with BERT’s bidirectional prediction, and kept the rest of BERT unchanged to maintain its scalability.
The proposed structure-distilled BERT models have four variants:
- Only distill the left-to-right RNNG (“L2R-KD”)
- Only distill the right-to-left RNNG (“R2L-KD”)
- Distill the RNNG’s approximated marginal under the bidirectional context, with uniform distribution (“UF-KD”)
- Distill the RNNG’s approximated marginal under the bidirectional context, with unigram distribution (“UG-KD”)
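To make the distillation idea concrete, here is a minimal sketch of what a per-token distillation objective could look like at a single masked position. It is not the authors’ implementation: the function names, the interpolation weight `alpha`, and the toy four-word vocabulary are all assumptions made for illustration, and the teacher distribution is simply passed in rather than computed by an actual RNNG.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy of predicted_probs against a (possibly soft) target."""
    return -sum(
        t * math.log(p)
        for t, p in zip(target_probs, predicted_probs)
        if t > 0.0
    )


def distillation_loss(student_logits, teacher_probs, gold_index, alpha=0.5):
    """Hypothetical per-token loss for one masked position.

    alpha weights the soft teacher target against the usual one-hot
    masked-word target; alpha is an illustrative assumption, not a value
    taken from the paper.
    """
    student_probs = softmax(student_logits)
    one_hot = [1.0 if i == gold_index else 0.0 for i in range(len(student_logits))]
    kd_term = cross_entropy(teacher_probs, student_probs)   # match the teacher
    mlm_term = cross_entropy(one_hot, student_probs)        # match the gold word
    return alpha * kd_term + (1.0 - alpha) * mlm_term


# Toy example over a four-word vocabulary (all numbers are made up).
teacher = [0.7, 0.1, 0.1, 0.1]     # stand-in for the RNNG's word distribution
student = [2.0, 0.5, 0.1, -1.0]    # stand-in for BERT's logits at the mask
print(distillation_loss(student, teacher, gold_index=0, alpha=0.5))
```

With `alpha=0.0` the sketch reduces to ordinary masked-word cross-entropy, while `alpha=1.0` trains the student purely to match the teacher’s distribution. The four variants above would differ in how the teacher distribution is obtained, which this sketch deliberately leaves out.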
The researchers evaluated their structure-distilled BERTs on six diverse structured prediction tasks, spanning syntactic and semantic tasks and coreference resolution, as well as on the popular GLUE (General Language Understanding Evaluation) benchmark.
The results show that all four structure-distilled BERT models consistently outperform the standard BERT baseline, reducing relative error by 2 to 21 percent.
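For readers unfamiliar with the metric, relative error reduction measures how much of the baseline’s remaining error a new model removes, rather than the raw gain in accuracy. The short sketch below shows the arithmetic; the function name and the accuracy figures are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline's error that the new model eliminates."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no error left to reduce")
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers: a baseline at 90% accuracy and a new model at 92%.
reduction = relative_error_reduction(0.90, 0.92)
print(round(reduction, 3))  # 0.2, i.e. a 20 percent relative error reduction
```

A two-point accuracy gain therefore counts as a 20 percent relative error reduction when the baseline is already at 90 percent, which is why the metric is popular for comparing strong models.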
The findings suggest that syntactic inductive biases can be beneficial for a diverse range of structured prediction tasks, including those that are non-syntactic, and that these biases can also improve fine-tuning sample efficiency on downstream tasks.
The researchers suggest that future research could explore easily scalable models that integrate stronger notions of structural bias.
The paper Syntactic Structure Distillation Pretraining For Bidirectional Encoders is on arXiv.
Author: Reina Qi Wan | Editor: Michael Sarazen