Google Brain researchers have proposed LAMB (Layer-wise Adaptive Moments optimizer for Batch training), a new optimizer that cuts the training time of Google's NLP pre-training model BERT (Bidirectional Encoder Representations from Transformers) from three days to just 76 minutes.
BERT is a popular, large-scale pre-trained language model that Google AI released last November. It is remarkably good at extracting textual information and applying it to a variety of language tasks, but training it requires significant compute and time.
The BERT baseline model is typically trained with AdamW, a variant of the Adam optimizer with weight decay. Another adaptive optimizer that has proven successful in large-batch convolutional neural network training is LARS (Layer-wise Adaptive Rate Scaling). These two optimizers inspired the Google researchers to develop LAMB, which can scale the batch size to 64k without compromising accuracy. LAMB is a general-purpose optimizer for both small and large batches, and it needs no tuning other than the learning rate. It supports adaptive element-wise updating and accurate layer-wise correction.
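The combination of element-wise updating and layer-wise correction can be illustrated with a minimal NumPy sketch of one LAMB-style step for a single layer. This is not the researchers' implementation: the function name `lamb_step`, the default hyperparameters, and the fallback trust ratio of 1.0 are assumptions made for illustration.

```python
import numpy as np

def lamb_step(w, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for the weights `w` of a single layer.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Element-wise part (as in Adam): running moments of the gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** step)   # bias correction
    v_hat = v / (1 - beta2 ** step)
    # Adam-style direction plus decoupled weight decay (as in AdamW).
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise part (as in LARS): rescale the step by ||w|| / ||update||.
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    w = w - lr * trust_ratio * update
    return w, m, v

# Toy usage on a four-element "layer".
w0 = np.array([2.0, 2.0, 2.0, 2.0])
g0 = np.array([0.1, -0.2, 0.3, -0.4])
w1, m1, v1 = lamb_step(w0, g0, np.zeros(4), np.zeros(4), step=1)
```

In this sketch the size of each layer's step is tied to the norm of that layer's weights rather than to the raw gradient scale, which is the intuition behind layer-wise scaling for very large batches.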
In their experiments, the researchers used TPUv3 Tensor Processing Units, Google’s in-house AI hardware for training and inference. Each TPUv3 pod has 1,024 chips and can deliver more than 100 petaflops of mixed-precision compute (see results in the table below). The baseline F1 score was taken from the corresponding score of the BERT-Large pre-training model. The researchers used the same pre-training datasets as the open source BERT model, specifically the Wikipedia dataset containing 2.5B words and the BooksCorpus dataset containing 800M words.
BERT pre-training consists of two phases. The first 90 percent of training epochs use a sequence length of 128, while the last 10 percent use a sequence length of 512.
In regular training, the researchers ran 15,625 iterations at a batch size of 32k (14,063 iterations at sequence length 128 in phase 1, and 1,562 iterations at sequence length 512 in phase 2), reaching an F1 score of 91.460. The experiment achieved a weak scaling efficiency of 76.7 percent (a 49.1-fold speedup from 64 times the computational resources).
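The phase split quoted above follows from the 90/10 schedule. A short sketch of the arithmetic, assuming the phase 1 count is rounded up to a whole iteration (the helper name `split_phases` is hypothetical):

```python
def split_phases(total_iterations):
    """Split a run into phase 1 (90 percent) and phase 2 (the remainder).

    Assumption: phase 1 is rounded up to a whole iteration, which
    reproduces the counts reported for the 32k-batch run.
    """
    phase1 = (total_iterations * 9 + 9) // 10  # integer ceiling of 0.9 * total
    phase2 = total_iterations - phase1
    return phase1, phase2

print(split_phases(15_625))  # (14063, 1562)
```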
In mixed-batch training, the researchers completed BERT training in 8,599 iterations, using a batch size of 64k in phase 1 and 32k in phase 2. As a result, training time fell to just 76 minutes, a weak scaling efficiency of 101.8 percent (a 65.2-fold speedup from 64 times the computational resources).
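The efficiency figures appear to be the measured speedup divided by the increase in computational resources. A minimal sketch of that calculation, assuming this definition (the function name is illustrative):

```python
def weak_scaling_efficiency(speedup, resource_multiple):
    """Speedup achieved per unit of added resources (1.0 = perfect scaling).

    Assumed definition, inferred from the figures quoted in the article.
    """
    return speedup / resource_multiple

regular = weak_scaling_efficiency(49.1, 64)  # 32k batch throughout
mixed = weak_scaling_efficiency(65.2, 64)    # 64k batch, then 32k

print(f"regular: {regular:.1%}")  # regular: 76.7%
print(f"mixed exceeds perfect scaling: {mixed > 1.0}")  # True
```

A value above 1.0, as in the mixed-batch run, means the speedup was larger than the increase in hardware.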
The researchers believe large-batch techniques hold the key to accelerating deep neural network training, and are now working on a theoretical analysis of the LAMB optimizer. The first author of the paper is Google Brain intern Yang You, who is also a PhD student in the Department of Computer Science at UC Berkeley.
The paper Reducing BERT Pre-Training Time from 3 Days to 76 Minutes is on arXiv.
Author: Reina Qi Wan | Editor: Michael Sarazen