In a paper currently under review for ICLR 2022, a Facebook AI Research team introduces NormFormer, a modification to the Pre-LN transformer architecture designed to improve pretraining perplexity and downstream task performance for both causal and masked language models with negligible extra compute cost.
Since their introduction in 2017, transformers have become a leading deep learning architecture. The original transformer applies layer normalization after each sublayer's residual connection (Post-LN) to reduce the variance of the inputs to the following sublayer. The widely used Pre-LayerNorm (Pre-LN) variant instead moves layer normalization to the input of each sublayer. This setup, however, suffers from a gradient magnitude mismatch: the gradients received at early layers are much larger than those at later layers.
The proposed NormFormer alleviates this issue by applying three modifications to the Pre-LN transformer: a Layer Norm after self-attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first fully connected layer. These modifications add a small number of learnable parameters that provide a cost-effective way for each layer to change the magnitude of its features, significantly improving pretraining perplexity and downstream task performance while adding negligible compute cost. The study shows that NormFormer can boost GPT3-Large (1.3B) zero-shot performance as well as fine-tuned performance on GLUE (General Language Understanding Evaluation) tasks.
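To make the three modifications concrete, here is a minimal NumPy sketch of a single NormFormer-style layer. It is an illustration under stated assumptions, not the researchers' implementation: the names (`init_params`, `normformer_layer`, `head_scale` and so on) are hypothetical, linear-layer biases, dropout and the causal attention mask are left out for brevity, the activation is a plain ReLU, and the extra Layer Norm in the feed-forward block is placed after that activation.

```python
import numpy as np


def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def init_params(d_model, n_heads, d_ff, seed=0):
    """Hypothetical parameter container for one layer (biases omitted)."""
    rng = np.random.default_rng(seed)

    def weight(rows, cols):
        return rng.normal(0.0, 0.02, size=(rows, cols))

    def ln(dim):
        return (np.ones(dim), np.zeros(dim))  # (gain, bias)

    return {
        "n_heads": n_heads,
        "wq": weight(d_model, d_model),
        "wk": weight(d_model, d_model),
        "wv": weight(d_model, d_model),
        "wo": weight(d_model, d_model),
        "w1": weight(d_model, d_ff),
        "w2": weight(d_ff, d_model),
        # Standard Pre-LN norms at the input of each sublayer.
        "ln_attn_in": ln(d_model),
        "ln_ffn_in": ln(d_model),
        # The three NormFormer additions described above.
        "ln_attn_out": ln(d_model),      # Layer Norm after self-attention
        "head_scale": np.ones(n_heads),  # one learnable scalar per head
        "ln_ffn_mid": ln(d_ff),          # Layer Norm after the first FC layer
    }


def split_heads(t, n_heads):
    """(seq_len, d_model) -> (n_heads, seq_len, d_head)."""
    seq_len, d_model = t.shape
    return t.reshape(seq_len, n_heads, d_model // n_heads).transpose(1, 0, 2)


def normformer_layer(x, p):
    """Apply one layer to x of shape (seq_len, d_model); no attention mask."""
    seq_len, d_model = x.shape
    n_heads = p["n_heads"]
    d_head = d_model // n_heads

    # Self-attention sublayer with Pre-LN on the input.
    y = layer_norm(x, *p["ln_attn_in"])
    q = split_heads(y @ p["wq"], n_heads)
    k = split_heads(y @ p["wk"], n_heads)
    v = split_heads(y @ p["wv"], n_heads)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v                             # (n_heads, seq_len, d_head)
    heads = heads * p["head_scale"][:, None, None]  # head-wise scaling
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    x = x + layer_norm(merged @ p["wo"], *p["ln_attn_out"])

    # Feed-forward sublayer with Pre-LN on the input.
    y = layer_norm(x, *p["ln_ffn_in"])
    hidden = np.maximum(y @ p["w1"], 0.0)           # ReLU, chosen for brevity
    hidden = layer_norm(hidden, *p["ln_ffn_mid"])
    return x + hidden @ p["w2"]


def extra_param_count(d_model, n_heads, d_ff):
    """Parameters the three additions contribute to one layer in this sketch."""
    return 2 * d_model + n_heads + 2 * d_ff
```

In this sketch, a layer with `d_model=768`, 12 heads and `d_ff=3072` (sizes picked purely for illustration) gains 7,692 parameters from the three additions, which is tiny next to the millions of weights in the layer's projection matrices.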
In their first experiment, the Facebook researchers pretrained Causal Language Models (CLMs) at four different sizes: Small (125M parameters), Medium (355M), Large (1.3B) and XL (2.7B). They also trained three large-scale models with 2.7B parameters: GPT3-2.7B with GELU activations, and two variants of GPT3-2.7B with squared ReLU (ReLU²) activations.
A second experiment adopted the RoBERTa-base architecture with Pre-LN and fine-tuned both the baseline Masked Language Model (MLM) and its NormFormer counterpart, reporting the best validation-set performance on each task in the GLUE benchmark.
In the CLM experiments, NormFormer outperformed the GPT-3 baselines at all sizes in zero-shot accuracy, reaching GPT3-Large (1.3B) zero-shot performance 60 percent faster. The NormFormer MLM models meanwhile bettered their Pre-LN counterparts on every task, improving fine-tuned GLUE performance by 1.9 percent on average.
The researchers conclude that adding a small number of learnable parameters in the right places can alleviate certain issues and boost performance in current state-of-the-art networks, and suggest that future studies could explore whether other similarly efficient modifications might deliver comparable improvements.
Author: Hecate He | Editor: Michael Sarazen