The transformer-based architecture BERT has in recent years demonstrated the efficacy of large-scale pretrained models for natural language processing (NLP) tasks such as machine translation and question answering. BERT’s large size and complex pretraining process, however, raise usability concerns for many researchers.
In a new paper, a pair of Amazon Alexa researchers extract an optimal subset of architectural parameters for the BERT architecture by applying recent breakthroughs in algorithms for neural architecture search. The proposed optimal subset, “Bort,” is just 5.5 percent the effective size of the original BERT-large architecture (not counting the embedding layer), and 16 percent of its net size.
Many attempts have been made to extract a simpler sub-architecture of BERT that maintains performance similar to the original while simplifying the pretraining process and shortening inference time. Yet such sub-architectures still trail the original implementation in accuracy, the researchers say, and the choice of architectural parameters in these works often appears to be arbitrary.
“We consider the problem of extracting the set of architectural parameters for BERT that is optimal over three metrics: inference latency, parameter size, and error rate,” the researchers explain. “We were able to extract the optimal subarchitecture set for the family of BERT-like architectures, as parametrized by their depth, number of attention heads, and sizes of the hidden and intermediate layer.”
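To make the idea of being "optimal over three metrics" concrete, here is a minimal Python sketch of a brute-force Pareto filter over candidate architectures. It is not the researchers' algorithm, which the article describes only as drawing on recent neural architecture search methods. The `Candidate` type, the function names, and every number in `toy_candidates` are made-up placeholders for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    # Architectural parameters named in the quote above.
    depth: int
    heads: int
    hidden: int
    intermediate: int
    # The three metrics to minimize (units are up to the caller).
    latency: float
    params: float
    error: float


def metrics(c: Candidate) -> tuple:
    """Return the three metrics of a candidate as a tuple."""
    return (c.latency, c.params, c.error)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on every metric and strictly better on one."""
    pairs = list(zip(metrics(a), metrics(b)))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Hypothetical toy values, NOT measurements from the paper.
toy_candidates = [
    Candidate(2, 4, 256, 512, latency=1.0, params=50.0, error=0.20),
    Candidate(6, 8, 512, 1024, latency=3.0, params=110.0, error=0.15),
    Candidate(6, 4, 512, 2048, latency=3.5, params=120.0, error=0.25),
]

for c in pareto_front(toy_candidates):
    print(c)
```

In this toy set the third candidate is dropped because the others match or beat it on all three metrics, while the first two survive because each wins on at least one metric. Exhaustive filtering like this only works for small candidate sets, which is one reason a dedicated search algorithm is needed for a real architecture family.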
The researchers note that Bort's pretraining time is far shorter than that of its larger counterparts: 288 GPU hours versus 1,153 for BERT-large and 24,576 for RoBERTa-large, on the same hardware and with datasets of equal or larger size.
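The scale of those savings can be checked directly from the quoted figures. The following is a minimal Python sketch that uses only the GPU-hour numbers reported above; the dictionary and function names are illustrative.

```python
# GPU hours for pretraining, as quoted in the article.
gpu_hours = {
    "Bort": 288,
    "BERT-large": 1153,
    "RoBERTa-large": 24576,
}


def speedup_over(baseline: str, model: str = "Bort") -> float:
    """Return how many times more GPU hours `baseline` needs than `model`."""
    return gpu_hours[baseline] / gpu_hours[model]


for name in ("BERT-large", "RoBERTa-large"):
    print(f"{name} needs {speedup_over(name):.1f}x the GPU hours of Bort")
```

By these numbers, Bort pretrains in roughly a quarter of the GPU hours of BERT-large and about 1/85 of those of RoBERTa-large.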
The researchers also evaluated Bort on public natural language understanding (NLU) benchmarks including GLUE, SuperGLUE, and RACE, where it obtained improvements on almost all tasks. For example, compared to BERT-large on GLUE (General Language Understanding Evaluation), Bort improves by between 0.3 percent and 31 percent on every task except QQP and QNLI.
The researchers point out that Bort's faster pretraining and efficient fine-tuning would not have been possible without a highly optimized BERT variant such as RoBERTa. But given the cost of training such models, they believe it is worth investigating whether large, highly optimized models can be avoided in favor of smaller data representations obtained through more rigorous algorithmic techniques.
The paper Optimal Subarchitecture Extraction For BERT is on arXiv, and the code is on GitHub.
Reporter: Yuan Yuan | Editor: Michael Sarazen