A Microsoft Research team has introduced a “simple yet effective” method that dramatically improves training stability in transformer models with a change of only a few lines of code.
Large-scale transformers have achieved state-of-the-art performance on a wide range of natural language processing (NLP) tasks, and in recent years have also demonstrated impressive few-shot and zero-shot learning capabilities, making them a popular architectural choice for machine learning researchers. However, despite soaring parameter counts that now reach billions and even trillions, the layer depth of transformers remains limited by training instability.
In their new paper DeepNet: Scaling Transformers to 1,000 Layers, the Microsoft team proposes DeepNorm, a novel normalization function that improves the stability of transformers to enable scaling that is an order of magnitude deeper (more than 1,000 layers) than previous deep transformers.
The paper first explores the root causes of instability in deep transformers, concluding that the exploding model update problem is largely responsible. Motivated by their own observations and by previous work suggesting that better initialization methods can stabilize transformer training, the team conducted an in-depth analysis of the Post-LN (post-layer-normalization) training process with and without proper initialization. By down-scaling the lower layers, they decouple the effect of the gradient scale from that of the model update, yielding a Post-LN-init variant that relieves the gradient-vanishing issue and makes optimization more stable.
The researchers introduce DeepNet, a family of extremely deep transformers built on the vanilla transformer architecture with the addition of their novel DeepNorm normalization function, which they show comes with a theoretical guarantee: it bounds the magnitude of model updates by a constant, thereby stabilizing optimization.
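The article does not spell out the DeepNorm function itself; per the paper, it replaces the standard Post-LN residual connection LN(x + G(x)) with LN(α·x + G(x)), where α is a depth-dependent constant (e.g. α = (2N)^(1/4) for an N-layer encoder-only model), alongside a matching down-scaling of sub-layer weights at initialization. The following is a minimal NumPy sketch of that residual rule; variable names and shapes are illustrative, not from the paper:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standard LayerNorm over the last axis (learned gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def deepnorm(x, sublayer_out, alpha):
    """DeepNorm residual connection: LN(alpha * x + G(x)).

    Up-weighting the residual stream by the constant alpha (> 1) is what
    bounds the size of each layer's model update, allowing training to
    remain stable at extreme depth.
    """
    return layer_norm(alpha * x + sublayer_out)

# For an N-layer encoder-only model the paper sets alpha = (2N)^(1/4);
# sub-layer weights are additionally down-scaled at initialization.
N = 200
alpha = (2 * N) ** 0.25

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 512))             # residual stream
sublayer_out = rng.standard_normal((2, 512))  # e.g. attention/FFN output
out = deepnorm(x, sublayer_out, alpha)        # same shape, normalized
```

The one-line change relative to a vanilla Post-LN transformer (multiplying the residual by α) is what makes the method a "few lines of code" fix.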
In their empirical study, the team compared DeepNet with state-of-the-art deep transformer models such as DLCL (Wang et al., 2019), NormFormer (Shleifer et al., 2021), ReZero (Bachlechner et al., 2020) and Admin (Liu et al., 2020). They conducted their experiments using popular machine translation benchmarks such as the IWSLT-14 German-English (De-En) dataset and the WMT-17 English-German (En-De) dataset.
DeepNet achieved impressive results on all tasks, with the proposed 3.2B parameter 200-layer model bettering the state-of-the-art by 5 BLEU (bilingual evaluation understudy) points on a subset of the Flores massively multilingual machine translation benchmark with 7,482 translation directions.
Overall, the results validate the effectiveness of the proposed DeepNet across various benchmarks. The team plans to extend DeepNet to support more diverse tasks, including language model pretraining, protein structure prediction, and BEiT (Bao et al., 2022; Wang et al., 2021) vision pretraining.
The paper DeepNet: Scaling Transformers to 1,000 Layers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen