Optimizing Transformers: Microsoft & RUC’s ResiDual Solves Gradient Vanishing and Representation Collapse Issues

As evidenced by the rapid rise of powerful large language models like GPT, transformer-based neural networks are revolutionizing the field of natural language processing. One of the promising research avenues for further improvement of such models involves residual connections, which enable the direct propagation of data from earlier to later network layers.

There are two popular variants of residual connections in transformers: post-layer normalization (Post-LN) and pre-layer normalization (Pre-LN). While each approach has advantages, neither is ideal, as the former can suffer from gradient vanishing issues and the latter from representation collapse.

In the new paper ResiDual: Transformer With Dual Residual Connections, a team from Microsoft Research, Microsoft Azure Translation, and Renmin University of China proposes ResiDual, a novel transformer architecture that fuses the connections in Post-LN and Pre-LN to exploit the benefits of both while also addressing their respective limitations.

The team summarizes their study’s main contributions as follows:

We present ResiDual, a simple yet potent variation of the Transformer architecture, which tackles both the gradient vanishing problem in Post-LN and the representation collapse issue in Pre-LN transformer models.
Our theoretical analysis demonstrates that this new design can leverage the strengths of both variants while avoiding their weaknesses.
Our experimental results provide further evidence of the effectiveness of our approach, as it achieves superior performance compared to both the Post-LN and Pre-LN transformer models across multiple datasets.

The key difference between Post-LN and Pre-LN lies in where layer normalization (LN) is applied in each residual block. In Post-LN, LN is applied after the residual addition, so the outputs of lower blocks (those closer to the input) are normalized multiple times, causing the gradient norm to decay exponentially with depth and eventually vanish in lower layers. Pre-LN mitigates this vanishing problem because LN is applied only to the sublayer input, leaving a path along which the gradient can flow directly to each block. The Pre-LN approach, however, can lead to representation collapse, as the hidden representations in higher blocks (those closer to the output) become similar to one another and thus contribute little to model capacity.
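The contrast between the two placements can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper or its repository: `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer, `layer_norm` omits the learned scale and shift, and the names `post_ln_block`, `pre_ln_block`, and `run_stack` are assumptions made for this example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward layer."""
    return [0.5 * math.tanh(v) for v in x]

def post_ln_block(x, f=toy_sublayer):
    # Post-LN: add the residual first, then normalize the sum.
    # Every block therefore re-normalizes everything that came before it.
    return layer_norm([a + b for a, b in zip(x, f(x))])

def pre_ln_block(x, f=toy_sublayer):
    # Pre-LN: normalize only the sublayer input.
    # The residual path itself (x) passes through untouched.
    return [a + b for a, b in zip(x, f(layer_norm(x)))]

def run_stack(x, block, depth):
    """Apply the same kind of block `depth` times."""
    for _ in range(depth):
        x = block(x)
    return x

x0 = [1.0, -2.0, 0.5, 3.0]
post_out = run_stack(x0, post_ln_block, depth=6)
# Pre-LN stacks usually apply one final LN to the accumulated residual stream.
pre_out = layer_norm(run_stack(x0, pre_ln_block, depth=6))
```

In the Post-LN stack the input `x0` passes through six normalizations before reaching the output, whereas in the Pre-LN stack it reaches the final normalization through plain additions, which is the structural difference behind the gradient behavior described above.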

To reap the benefits of both Post-LN and Pre-LN methods while also addressing their limitations, the researchers propose ResiDual, a novel transformer architecture that fuses the connections in Post-LN and Pre-LN. Their resulting Pre-Post-LN (PPLN) approach leverages a Post-LN-like residual to sustain representation diversity and a Pre-LN-like residual to prevent gradient vanishing.

The proposed dual residual mechanism thus avoids the vanishing problem: even if the gradients coming from the Post-LN-like residual vanish, the model can still receive gradients through the Pre-LN-like residual. And because the Pre-LN-like residual only affects the model output, the input to each block remains intact, preserving the representation capacity. The researchers note that because the final output is the sum of the two residual connections, the output representation will not collapse either.
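A minimal sketch of the dual residual idea, written from the description above rather than from the authors' code, might look as follows. The function name `residual_dual_forward`, the toy sublayer, and the simplified `layer_norm` are all assumptions for illustration; details such as how the two streams are initialized should be checked against the paper.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward layer."""
    return [0.5 * math.tanh(v) for v in x]

def residual_dual_forward(x, depth, f=toy_sublayer):
    """Illustrative dual-residual forward pass (assumed formulation)."""
    x_post = list(x)  # Post-LN-like stream: this is what every block reads
    x_dual = list(x)  # Pre-LN-like stream: accumulates raw sublayer outputs
    for _ in range(depth):
        out = f(x_post)
        # Post-LN-like update: add, then normalize. Keeps block inputs diverse.
        x_post = layer_norm([a + b for a, b in zip(x_post, out)])
        # Pre-LN-like update: plain addition, so gradients reach every block
        # through this stream without passing through repeated normalizations.
        x_dual = [a + b for a, b in zip(x_dual, out)]
    # The dual stream only touches the final output, never a block's input.
    return [a + b for a, b in zip(x_post, layer_norm(x_dual))]

y = residual_dual_forward([1.0, -2.0, 0.5, 3.0], depth=4)
```

Note that `x_dual` is never passed to `f`: it is read only once, at the output, which mirrors the point that the Pre-LN-like residual leaves each block's input intact.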

The team’s empirical study compared their approach with Post-LN and Pre-LN transformers. In the experiments, ResiDual surpassed both on all small-scale (IWSLT), mid-scale (WMT), and large-scale (OPUS) datasets.

Overall, this work demonstrates the proposed ResiDual transformer’s ability to leverage the benefits of both Post-LN and Pre-LN variants to improve performance. The team hopes ResiDual can serve as a foundational architecture for other AI models and that their findings will inspire future research and progress in this field.

The code is available on the project’s GitHub. The paper ResiDual: Transformer With Dual Residual Connections is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

4 comments on “Optimizing Transformers: Microsoft & RUC’s ResiDual Solves Gradient Vanishing and Representation Collapse Issues”

Judeth

2025-08-18

This is such an interesting step forward in transformer research. The idea of combining Post-LN and Pre-LN into a dual residual connection feels like a smart way to address long-standing issues like gradient vanishing and representation collapse. I also came across Reviews IT recently where similar AI advancements are discussed, and it’s exciting to see how these innovations might shape the next generation of models.

kimson

2025-09-25

The best research papers don’t just propose a new idea; they provide a rigorous analysis to back it up. The fact that the researchers conducted both theoretical proofs (e.g., proving a lower bound on the gradient) and empirical experiments on real-world machine translation benchmarks adds significant credibility to their claims. The results showing that ResiDual outperforms both Post-LN and Pre-LN across different network depths and data sizes are a strong endorsement of the architecture’s effectiveness.
