Microsoft Improves Transformer Stability to Successfully Scale Extremely Deep Models to 1000 Layers

A Microsoft Research team has introduced a “simple yet effective” method that dramatically improves the training stability of transformer models and requires changing only a few lines of code.

Large-scale transformers have achieved state-of-the-art performance on a wide range of natural language processing (NLP) tasks, and in recent years have also demonstrated their impressive few-shot and zero-shot learning capabilities, making them a popular architectural choice for machine learning researchers. However, despite soaring parameter counts that now reach billions and even trillions, the layer depth of transformers remains restricted by problems with training instability.

In their new paper DeepNet: Scaling Transformers to 1,000 Layers, the Microsoft team proposes DeepNorm, a novel normalization function that stabilizes transformer training well enough to scale models to 1,000 layers, an order of magnitude deeper than previous deep transformers.

The paper first explores the root causes of instability in deep transformers, concluding that exploding model updates are largely responsible. Motivated by their own observations and by previous work suggesting that better initialization can stabilize transformer training, the team conducted an in-depth analysis of the Post-LN (post-layer normalization) training process with and without proper initialization. They scaled down the weights of the lower layers to separate the effect of the gradient scale from that of the model update. The resulting Post-LN-init model relieves the gradient vanishing issue and makes optimization more stable.

The researchers introduce DeepNet, a family of extremely deep transformers based on the vanilla transformer architecture with the addition of their novel DeepNorm normalization function. They show theoretically that DeepNorm stabilizes optimization by keeping model updates within a constant upper bound.
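To make the idea concrete, here is a minimal pure-Python sketch of a DeepNorm-style residual connection. It is an illustration under stated assumptions, not the team's implementation: the constants are taken from the settings published in the paper for a single encoder-only or decoder-only stack of N layers (the encoder-decoder case uses different constants and is not covered), the helper names are hypothetical, and the layer normalization omits learned scale and bias.

```python
import math


def deepnorm_constants(num_layers):
    """Return (alpha, beta) for one encoder-only or decoder-only stack.

    Assumed from the paper's published settings, not from this article:
    alpha = (2N)^(1/4) scales the residual branch, and
    beta = (8N)^(-1/4) is the gain applied to selected weights at initialization.
    """
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta


def layer_norm(x, eps=1e-5):
    """Plain layer normalization over one vector, without learned scale or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_residual(x, sublayer_out, alpha):
    """DeepNorm-style residual: LayerNorm(alpha * x + sublayer(x))."""
    return layer_norm([alpha * xi + si for xi, si in zip(x, sublayer_out)])


alpha, beta = deepnorm_constants(1000)
x = [0.5, -1.0, 2.0, 0.25]
sublayer_out = [0.1, 0.3, -0.2, 0.0]  # hypothetical attention/FFN output
y = deepnorm_residual(x, sublayer_out, alpha)
```

Under these assumptions, a 1,000-layer stack gives alpha of roughly 6.69 and beta of roughly 0.106. The residual input is weighted more heavily than in standard Post-LN, where alpha is effectively 1, and the affected weights start out smaller.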

In their empirical study, the team compared DeepNet with state-of-the-art deep transformer models such as DLCL (Wang et al., 2019), NormFormer (Shleifer et al., 2021), ReZero (Bachlechner et al., 2020) and Admin (Liu et al., 2020). They conducted their experiments using popular machine translation benchmarks such as the IWSLT-14 German-English (De-En) dataset and the WMT-17 English-German (En-De) dataset.

DeepNet achieved impressive results on all tasks, with the proposed 3.2B parameter 200-layer model bettering the state-of-the-art by 5 BLEU (bilingual evaluation understudy) points on a subset of the Flores massively multilingual machine translation benchmark with 7,482 translation directions.

Overall, the results validate the effectiveness of the proposed DeepNet across various benchmarks. The team plans to extend DeepNet to support more diverse tasks, including language model pretraining, protein structure prediction, and BEiT (Bao et al., 2022; Wang et al., 2021) vision pretraining.

The paper DeepNet: Scaling Transformers to 1,000 Layers is on arXiv.

Author: Hecate He | Editor: Michael Sarazen
