Optimizing Transformers: Microsoft & RUC’s ResiDual Solves Gradient Vanishing and Representation Collapse Issues
In the new paper ResiDual: Transformer With Dual Residual Connections, a team from Microsoft Research, Microsoft Azure Translation, and Renmin University of China proposes ResiDual, a novel transformer architecture that fuses the residual connections of post-layer-normalization (Post-LN) and pre-layer-normalization (Pre-LN) transformers, aiming to retain the benefits of both variants while avoiding their respective weaknesses: the vanishing gradients that afflict Post-LN and the representation collapse that afflicts Pre-LN.
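To make the "dual residual" idea concrete, the sketch below shows one way such a block could be wired up in PyTorch: each sublayer feeds two residual streams, a Post-LN stream that is normalized after every addition and a Pre-LN-style stream that accumulates raw sublayer outputs and is normalized only once before the two streams are fused. This is a minimal illustrative reading of the architecture described above, not the authors' reference implementation; the module names, the feed-forward sublayer used for brevity, and the final fusion step are assumptions for illustration.

```python
import torch
import torch.nn as nn


class DualResidualBlock(nn.Module):
    """One sublayer wrapped with two residual streams (illustrative sketch):
    - a Post-LN stream, normalized after each residual addition;
    - a Pre-LN-style stream that accumulates raw sublayer outputs, unnormalized.
    """

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.post_norm = nn.LayerNorm(d_model)

    def forward(self, x_post: torch.Tensor, x_dual: torch.Tensor):
        out = self.sublayer(x_post)             # sublayer reads the Post-LN stream
        x_post = self.post_norm(x_post + out)   # Post-LN style: add, then normalize
        x_dual = x_dual + out                   # Pre-LN style: accumulate, normalize later
        return x_post, x_dual


class DualResidualEncoder(nn.Module):
    """Stack of dual-residual blocks; the two streams are fused at the end."""

    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        # For brevity each "sublayer" here is a feed-forward network; a full
        # transformer block would interleave self-attention and FFN sublayers.
        self.blocks = nn.ModuleList([
            DualResidualBlock(
                d_model,
                nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.ReLU(),
                    nn.Linear(4 * d_model, d_model),
                ),
            )
            for _ in range(num_layers)
        ])
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_post, x_dual = x, x
        for block in self.blocks:
            x_post, x_dual = block(x_post, x_dual)
        # Fuse the streams: the Post-LN output plus the once-normalized
        # Pre-LN-style accumulator.
        return x_post + self.final_norm(x_dual)


# Quick shape check on random input (batch, sequence, hidden).
encoder = DualResidualEncoder(d_model=64, num_layers=4)
print(encoder(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

The intuition behind this wiring is that the Post-LN stream keeps activations well-scaled at every layer, while the unnormalized accumulator preserves a direct, unattenuated path for gradients from the output back to every sublayer.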