In deep learning today, the main ways to improve model accuracy come down to increasing model size, dataset size, or the number of training steps. These methods, however, require large and very expensive compute resources. When such resources are limited, optimizing compute efficiency becomes a key goal for researchers: how can higher accuracy be achieved with limited hardware and training time?
To address this issue, researchers from the Berkeley Artificial Intelligence Research (BAIR) Lab at UC Berkeley explored the effect of Transformer model size on training and inference efficiency. Their new paper shows that, with limited resources, both training and inference efficiency can be improved by significantly increasing the size of Transformer models and then heavily compressing them.
The researchers conducted several experiments and found that, within a fixed training time, deeper RoBERTa models (RoBERTa is an optimized BERT pretraining approach) reached lower perplexity than models with fewer layers. Wider RoBERTa models also reached lower perplexity. For perplexity, lower is better.
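For readers unfamiliar with the metric, the following is a minimal sketch of how perplexity is commonly computed: the exponential of the average negative log-likelihood the model assigns to the evaluated tokens. The function name and the probability values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean(log p)) for a list of natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities that two models assign to the same four tokens.
confident = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
uncertain = [math.log(p) for p in (0.25, 0.25, 0.25, 0.25)]

print(perplexity(confident))  # approximately 2
print(perplexity(uncertain))  # approximately 4
```

A model that assigns higher probability to the correct tokens gets a lower perplexity, which is why the metric can be used to compare models trained for the same amount of time.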
The researchers also evaluated the validation BLEU score of models of different sizes when training an English-French Transformer machine translation model. BLEU is an automatic evaluation metric for machine translation (the higher, the better). Given the same training time, deeper and wider models outperformed smaller ones. The researchers also found that increasing either model width or depth resulted in faster training for RoBERTa pretraining, and that wider models work better for machine translation.
Although training a larger model can deliver higher training efficiency, it also raises the computation and memory cost of inference, and in most practical applications the total cost of inference is much higher than the cost of training. The “Train Large, Then Compress” approach addresses this problem. The researchers used compression techniques such as quantization and pruning, both of which can reduce inference latency and memory requirements.
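To make the two compression techniques concrete, here is a minimal sketch in plain Python. It assumes magnitude pruning (zeroing the weights with the smallest absolute values) and symmetric uniform quantization to 8-bit integers; these are common variants chosen for illustration and are not necessarily the exact procedures used in the paper. All function names and weight values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |value|."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize(weights, bits=8):
    """Map floats to signed integers using one shared scale (symmetric)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights for a single tiny layer.
weights = [0.02, -1.27, 0.50, -0.03, 0.90, 0.01]

pruned = magnitude_prune(weights, sparsity=0.5)  # half the weights become zero
q, scale = quantize(pruned, bits=8)              # integer codes plus one scale
restored = dequantize(q, scale)                  # approximation of `pruned`

print(pruned)
print(q, scale)
print(restored)
```

Pruning saves memory and compute because zeroed weights can be skipped or stored sparsely, while quantization stores each remaining weight in fewer bits at the cost of a small rounding error (at most half of `scale` per weight in this sketch).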
In the case of RoBERTa, the researchers first pretrained RoBERTa models of different sizes for the same amount of time, then fine-tuned these models on a downstream text classification task and applied pruning or quantization for compression. They found that, for a given test-time budget, increasing model size and then applying heavy compression worked best.
The researchers describe their investigation as preliminary and limited to natural language processing, and say their conclusions could be further explored in other fields in the future.
The paper Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers is on arXiv.
Author: Herin Zhao | Editor: Michael Sarazen