
Optimizing Transformers: Microsoft & RUC’s ResiDual Solves Gradient Vanishing and Representation Collapse Issues

In the new paper ResiDual: Transformer With Dual Residual Connections, a team from Microsoft Research, Microsoft Azure Translation, and Renmin University of China proposes ResiDual, a novel transformer architecture that fuses the connections in post-layer normalization and pre-layer normalization to exploit the benefits of both while also addressing their limitations.

As evidenced by the rapid rise of powerful large language models like GPT, transformer-based neural networks are revolutionizing the field of natural language processing. One of the promising research avenues for further improvement of such models involves residual connections, which enable the direct propagation of data from earlier to later network layers.

There are two popular variants of residual connections in transformers: post-layer normalization (Post-LN) and pre-layer normalization (Pre-LN). While each approach has advantages, neither is ideal, as the former can suffer from gradient vanishing issues and the latter from representation collapse.


The team summarizes their study’s main contributions as follows:

  1. We present ResiDual, a simple yet potent variation of the Transformer architecture, which tackles both the gradient vanishing problem in Post-LN and the representation collapse issue in Pre-LN transformer models.
  2. Our theoretical analysis demonstrates that this new design can leverage the strengths of both variants while avoiding their weaknesses.
  3. Our experimental results provide further evidence of the effectiveness of our approach, as it achieves superior performance compared to both the Post-LN and Pre-LN transformer models across multiple datasets.

The key difference between Post-LN and Pre-LN lies in where layer normalization (LN) is applied to each residual block’s outputs. In Post-LN, the outputs of lower blocks (those closer to the input) are normalized repeatedly on their way to the output, causing the gradient norm to decay exponentially with depth and eventually vanish in the lower layers. Pre-LN mitigates this vanishing problem, as the gradient can flow directly to every block. The Pre-LN approach, however, can lead to representation collapse: the hidden representations in higher blocks (those closer to the output) become very similar to one another and thus contribute little to model capacity.
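For reference, the two variants can be written in their standard forms (notation ours, with f_k denoting the k-th block’s sub-layer, e.g. self-attention or feed-forward):

\text{Post-LN:} \quad x_{k+1} = \mathrm{LN}\big(x_k + f_k(x_k)\big)
\text{Pre-LN:} \quad x_{k+1} = x_k + f_k\big(\mathrm{LN}(x_k)\big)

In Post-LN the normalization sits on the main residual path, so every lower block’s contribution is re-normalized by all subsequent layers; in Pre-LN the identity path bypasses normalization entirely, which is what keeps the gradient signal intact.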

To reap the benefits of both Post-LN and Pre-LN while addressing their limitations, the researchers propose ResiDual, a novel transformer architecture that fuses the connections in Post-LN and Pre-LN. The resulting Pre-Post-LN (PPLN) approach leverages a Post-LN-like residual to sustain representation diversity and a Pre-LN-like residual to prevent gradient vanishing.
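To make this description concrete, below is a minimal PyTorch-style sketch of such a dual-residual block, keeping one Post-LN-like stream and one Pre-LN-like stream; the class names, the sublayer argument, and the final merge rule are illustrative assumptions based on this description, not the authors’ released implementation.

import torch
import torch.nn as nn


class ResiDualBlock(nn.Module):
    """One dual-residual block: a Post-LN-like stream (x_post) that is
    re-normalized after every sub-layer, and a Pre-LN-like stream (x_dual)
    that accumulates raw sub-layer outputs without normalization."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer          # e.g. self-attention or feed-forward
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x_post: torch.Tensor, x_dual: torch.Tensor):
        out = self.sublayer(x_post)       # sub-layer sees the normalized stream
        x_post = self.norm(x_post + out)  # Post-LN-like residual: keeps block inputs diverse
        x_dual = x_dual + out             # Pre-LN-like residual: identity path for gradients
        return x_post, x_dual


class ResiDualEncoder(nn.Module):
    """A stack of dual-residual blocks whose two streams are merged at the end."""

    def __init__(self, d_model: int, sublayers):
        super().__init__()
        self.blocks = nn.ModuleList(ResiDualBlock(d_model, s) for s in sublayers)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_post, x_dual = x, x
        for block in self.blocks:
            x_post, x_dual = block(x_post, x_dual)
        # Merge the two residual streams into the final representation
        # (merge rule assumed from the description above).
        return x_post + self.final_norm(x_dual)

Passing, say, sublayers=[nn.Linear(512, 512) for _ in range(6)] as a stand-in for real attention and feed-forward sub-layers makes the sketch run end-to-end and shows how the un-normalized x_dual stream carries gradients past every LN.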

The proposed dual residual mechanism thus avoids the vanishing problem: even if the gradients flowing through the Post-LN-like residual vanish, the model can still receive gradients through the Pre-LN-like residual. And because the Pre-LN-like residual only affects the model output, the input to each block remains intact, preserving representation capacity. The researchers note that since the final output is the sum of the two residual connections, the output representation will not collapse either.
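Written out in the same notation as the sketch above (inferred from this description rather than copied from the paper), each block k updates the two streams and the final output merges them:

x^{\mathrm{post}}_{k+1} = \mathrm{LN}\big(x^{\mathrm{post}}_k + f_k(x^{\mathrm{post}}_k)\big), \qquad
x^{\mathrm{dual}}_{k+1} = x^{\mathrm{dual}}_k + f_k(x^{\mathrm{post}}_k), \qquad
y = x^{\mathrm{post}}_N + \mathrm{LN}\big(x^{\mathrm{dual}}_N\big)

Since x^{\mathrm{dual}} is never normalized inside the stack, the path from the output y back to every block is identity-like, so gradients reach the lower layers even when the x^{\mathrm{post}} path decays; and since x^{\mathrm{dual}} only enters at the output, block inputs are still taken from the repeatedly normalized x^{\mathrm{post}} stream, which keeps them from collapsing.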

The team’s empirical study compared their approach with Post-LN and Pre-LN transformers. In the experiments, ResiDual surpassed both across the small-scale (IWSLT), mid-scale (WMT), and large-scale (OPUS) datasets.

Overall, this work demonstrates the proposed ResiDual transformer’s ability to leverage the benefits of both Post-LN and Pre-LN variants to improve performance. The team hopes ResiDual can serve as a foundational architecture for other AI models and that their findings will inspire future research and progress in this field.

The code is available on the project’s GitHub. The paper ResiDual: Transformer With Dual Residual Connections is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


