In the current state of deep learning, improving model accuracy largely comes down to increasing model size, dataset size, or the number of training steps. These approaches, however, demand large and very expensive compute resources, and optimizing compute efficiency has become a key goal when such resources are limited. How can higher accuracy be achieved with limited hardware and training time?
To address this issue, researchers from the Berkeley Artificial Intelligence Research (BAIR) Lab at UC Berkeley explored the effect of Transformer model size on training and inference efficiency. Their new paper shows that, under limited resources, training and inference efficiency can be improved by significantly increasing the size of Transformer models and then heavily compressing them.

The researchers conducted several experiments and found that, for a given training time, deeper RoBERTa models (RoBERTa is a robustly optimized BERT pretraining approach) reached lower perplexity than models with fewer layers. Wider RoBERTa models likewise reached lower perplexity.
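Perplexity here is simply the exponential of a language model's average cross-entropy on held-out tokens (lower is better). Below is a minimal illustrative sketch in PyTorch; the tensors are random stand-ins, not data or models from the paper:

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean cross-entropy in nats) over held-out tokens."""
    # logits: (num_tokens, vocab_size), targets: (num_tokens,)
    loss = F.cross_entropy(logits, targets, reduction="mean")
    return math.exp(loss.item())

# Toy usage with random numbers, just to show the shape contract.
logits = torch.randn(8, 50265)            # 8 tokens, RoBERTa-sized vocabulary
targets = torch.randint(0, 50265, (8,))   # hypothetical gold token ids
print(perplexity(logits, targets))
```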
The researchers also evaluated the validation BLEU score of models of different sizes when training English-French Transformer machine translation models. BLEU is an automatic evaluation metric for machine translation (the higher, the better). For the same training time, deeper and wider models outperformed smaller models. The researchers also found that increasing model width or depth resulted in faster training for RoBERTa pretraining, and that wider models worked better on the machine translation task.
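As an illustration of how such a score is computed in practice, here is a minimal sketch using the sacreBLEU library with made-up sentences; it does not reproduce the paper's actual evaluation setup:

```python
import sacrebleu

# Hypothetical system outputs and references, one sentence each.
hypotheses = ["the cat sits on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream

# Corpus-level BLEU: n-gram precision with a brevity penalty; higher is better.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```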

Although training a larger model can deliver higher efficiency, it also raises the computation and memory cost of inference, and in most practical applications the total cost of inference far exceeds the training cost. The "Train Large, Then Compress" approach addresses this problem: the researchers used compression techniques such as quantization and pruning, both of which reduce inference latency and memory requirements.
In the case of RoBERTa, the researchers first pretrained RoBERTa models of different sizes for the same amount of time, then fine-tuned these models on a downstream text classification task and applied pruning or quantization for compression. They found that, for a given inference budget, increasing model size and then applying heavy compression worked best.
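As a rough illustration of what such a compression step can look like in code, the sketch below applies unstructured magnitude pruning and post-training dynamic quantization to a stand-in model using standard PyTorch utilities. The model, layer sizes, and 60% sparsity level are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a fine-tuned Transformer's feed-forward layers.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 768),
)

# 1) Magnitude pruning: zero out the smallest 60% of weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")  # make the pruning permanent

# 2) Dynamic quantization: store Linear weights in int8 for cheaper inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```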
The researchers describe their work as a preliminary investigation limited to the field of natural language processing, and say their conclusions could be further explored in other fields in the future.
The paper Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers is on arXiv.
Author: Herin Zhao | Editor: Michael Sarazen