Since their introduction three years ago, transformer architectures have become the de facto standard for natural language processing (NLP) tasks and are now also being applied in areas such as computer vision. Although many modifications to the transformer architecture have been proposed, they have not transferred across implementations and applications as easily as hoped, which has limited their wider adoption.
In a bid to understand why most widely used transformer applications shun these modifications, a team from Google Research comprehensively evaluated them in a shared experimental setting. They were surprised to discover that most of the architecture modifications they examined do not meaningfully improve performance on downstream NLP tasks.
The researchers began by reimplementing and evaluating a variety of transformer variants on the tasks where they are most commonly applied. As a baseline, they used the original transformer model with two modifications: applying layer normalization before the self-attention and feedforward blocks instead of after, and using relative attention with shared biases instead of sinusoidal positional embeddings. They evaluated this setup against transformer architectures with modifications that included the following:
- Transparent Attention, which creates weighted residual connections along the encoder depth to facilitate gradient flow
- Evolved Transformer, designed via evolution-based architecture search where the initial population is seeded with the original transformer
- Synthesizer variants, where self-attention is replaced with “synthetic attention” patterns
- Funnel Transformer, which progressively reduces sequence length to efficiently encode input sequences
- Sparse Expert Transformers, which replace the feedforward network with sparsely activated expert layers
- Universal Transformer, which repeatedly applies the same transformer “block” to input sequences
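The first of the two baseline changes described above, moving layer normalization from after each sublayer to before it, can be illustrated with a minimal sketch. This is not the researchers' code: the `layer_norm` here omits learned scale and bias parameters, and `toy_sublayer` is a hypothetical stand-in for a self-attention or feedforward block.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (simplified: no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Original ordering: apply the sublayer, add the residual, then normalize."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def pre_ln_block(x, sublayer):
    """Baseline ordering: normalize the sublayer input, then add the residual."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def toy_sublayer(x):
    """Hypothetical stand-in for self-attention or a feedforward network."""
    return [0.5 * v for v in x]

x = [1.0, 2.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
print("post-LN:", post_out)
print("pre-LN: ", pre_out)
```

In the pre-normalization version the input `x` passes through the residual connection unchanged, whereas in the original ordering the normalization is applied to the sum of the input and the sublayer output.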
The researchers employed two experimental settings to evaluate each modification’s performance: transfer learning based on T5, and supervised machine translation on the WMT’14 English-German translation task.
The results indicate that the modifications that did lead to significant performance improvements tended to fall into one of three buckets: relatively minor changes; those that increased parameter count or were slower; and those that were originally invented in the Mesh TensorFlow codebase. Few of the architectural modifications produced improvements, a finding that largely contradicts the experimental results presented in the papers that originally proposed them.
The researchers further investigated possible explanations for the underwhelming performance of these transformer modifications, concluding that they typically fail to effectively transfer across implementations and applications.
Finally, the team offered suggestions for improving the robustness of future architectural modifications. They suggest that researchers test proposed modifications on multiple completely disparate codebases; apply the modifications to a wide variety of downstream applications; keep the hyperparameters fixed as much as possible when evaluating performance; and follow best-practice reporting of results by including the mean and standard deviation across multiple trials.
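The last recommendation, reporting the mean and standard deviation across multiple trials, might look like the following in practice. This is only a sketch: the helper `summarize_trials` and the scores are hypothetical placeholders for illustration, not figures from the paper.

```python
import statistics

def summarize_trials(scores):
    """Return (mean, sample standard deviation) for repeated runs of one configuration."""
    if len(scores) < 2:
        raise ValueError("need at least two trials to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical scores from five random seeds; these are NOT results from the paper.
baseline_scores = [26.0, 27.0, 26.5, 26.0, 27.0]
variant_scores = [26.5, 27.5, 26.0, 27.0, 26.5]

for name, scores in [("baseline", baseline_scores), ("variant", variant_scores)]:
    mean, std = summarize_trials(scores)
    print(f"{name}: {mean:.2f} ± {std:.2f} (n={len(scores)})")
```

Reporting the spread alongside the mean makes it easier to judge whether a gap between a modification and the baseline is larger than the run-to-run variation.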
The paper "Do Transformer Modifications Transfer Across Implementations and Applications?" is on arXiv.
Author: Hecate He | Editor: Michael Sarazen