Google’s new “ALBERT” language model has achieved state-of-the-art results on three popular benchmark tests for natural language understanding (NLU): GLUE, RACE, and SQuAD 2.0. ALBERT is a “lite” version of Google’s 2018 NLU pretraining method BERT. Researchers introduced two parameter-reduction techniques in ALBERT to lower memory consumption and increase training speed.
The associated paper, ALBERT: A Lite BERT For Self-Supervised Learning of Language Representations, is now under double-blind review for the leading AI conference ICLR 2020, which will be held next April in the Ethiopian capital Addis Ababa.
Why does this research matter? The development of the Transformer architecture and BERT has demonstrated the efficacy of large-scale pretrained models for tackling NLP tasks such as machine translation and question answering. Researchers usually train a full network in the pretraining stage and then distill it down into smaller task-specific models for downstream applications.
Current SOTA language models, however, have hundreds of millions or even billions of parameters. Attempts to scale such models further are restricted by the memory limitations of compute hardware such as GPUs and TPUs. In addition, researchers have found that simply increasing the hidden size of the BERT-large model can lead to worse performance.
These obstacles motivated Google to take a deep dive into parameter-reduction techniques that could shrink models without hurting their performance.
What is BERT? Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based language network architecture that has revolutionized pretraining methods for natural language understanding. Google researchers masked a random selection of input tokens and trained the model to predict them from the surrounding context on both sides, an objective referred to as the Masked Language Model (MLM), which yields a deep bidirectional representation. In the last 12 months a majority of NLU research studies have been built on top of BERT.
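The masking step described above can be sketched in a few lines of Python. This is a simplified illustration rather than BERT's actual preprocessing: the function name `mask_tokens` and the `seed` parameter are hypothetical, the 15% default follows the rate reported in the BERT paper, and the sketch always substitutes `[MASK]`, whereas BERT sometimes keeps the original token or swaps in a random one.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens for a masked-language-model objective.

    Returns (masked, labels). labels[i] holds the original token where
    position i was masked and None elsewhere, so a model would only be
    scored on the positions it has to reconstruct.
    """
    rng = random.Random(seed)  # hypothetical seed, just for repeatability
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # simplified: always replace with [MASK]
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

sentence = "the model reads context on both sides of a hidden word".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(labels)
```

Because the hidden word can sit anywhere in the sequence, the model has to use tokens to its left and right at once, which is where the "bidirectional" in BERT comes from.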
Core innovations: Google researchers introduced three standout innovations with ALBERT.
- Factorized embedding parameterization: Researchers decoupled the size of the hidden layers from the size of the vocabulary embeddings. Instead of projecting one-hot vectors directly into the hidden space, ALBERT first projects them into a lower-dimensional embedding space and then into the hidden space. This makes it easier to increase the hidden layer size without significantly increasing the number of vocabulary embedding parameters.
- Cross-layer parameter sharing: Researchers chose to share all parameters across layers to prevent the parameter count from growing with the depth of the network. Together with the factorized embeddings, this gives an ALBERT configuration comparable to BERT-large about 18x fewer parameters.
- Inter-sentence coherence loss: In the BERT paper, Google proposed a next-sentence prediction technique to improve the model’s performance in downstream tasks, but subsequent studies found this to be unreliable. ALBERT instead uses a sentence-order prediction (SOP) loss to model inter-sentence coherence, which enables the new model to perform more robustly in multi-sentence encoding tasks.
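The three ideas above can be made concrete with some back-of-the-envelope Python. This is a rough sketch under stated assumptions, not the paper's code: the function names are hypothetical, the sizes (30,000-token vocabulary, hidden size 1024, embedding size 128, 24 layers) are illustrative values in the range of BERT-large and ALBERT, and the per-layer estimate of 12 × hidden² counts only the attention and feed-forward weight matrices, ignoring biases and layer normalization. The SOP helper reflects the ALBERT paper's description as we understand it: two consecutive segments in their original order form a positive example, and the same segments swapped form a negative one.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Vocabulary embedding parameters.

    Without factorization: one vocab_size x hidden_size matrix.
    With factorization: a vocab_size x embed_size matrix followed by
    an embed_size x hidden_size projection.
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size

def encoder_params(num_layers, hidden_size, share_layers=False):
    """Approximate Transformer encoder weights (assumption: 12 * H^2 per layer)."""
    per_layer = 12 * hidden_size ** 2
    # With cross-layer sharing, every layer reuses one set of weights,
    # so the total no longer grows with depth.
    return per_layer if share_layers else num_layers * per_layer

def sop_examples(segment_a, segment_b):
    """Sentence-order prediction pairs from two consecutive segments.

    Label 1 = original order, label 0 = swapped order.
    """
    return [((segment_a, segment_b), 1), ((segment_b, segment_a), 0)]

V, H, E, L = 30_000, 1024, 128, 24  # illustrative sizes, not quoted from the article
print("embeddings, unfactorized:", embedding_params(V, H))
print("embeddings, factorized:  ", embedding_params(V, H, E))
print("encoder, unshared:       ", encoder_params(L, H))
print("encoder, shared:         ", encoder_params(L, H, share_layers=True))
print(sop_examples("He opened the door.", "Then he walked in."))
```

Under these assumptions the factorized embedding cost is dominated by the vocab × embedding term, so the hidden size can grow while adding only embedding × hidden parameters, and the shared encoder costs the same whether the network has 12 layers or 24.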
Dataset: For pretraining baseline models, researchers used BookCorpus and English Wikipedia, which together contain around 16GB of uncompressed text.
Experiment results: The ALBERT model significantly outperformed BERT on the language benchmark tests SQuAD1.1, SQuAD2.0, MNLI, SST-2, and RACE.
In addition, both the single-model and ensemble versions of ALBERT improved on previous state-of-the-art results on three benchmarks, producing a GLUE score of 89.4, a SQuAD 2.0 test F1 score of 92.2, and a RACE test accuracy of 89.4.
The paper ALBERT: A Lite BERT For Self-Supervised Learning of Language Representations is available on OpenReview.
Journalist: Tony Peng | Editor: Michael Sarazen