Since their introduction three years ago, transformer architectures have become the de facto standard for natural language processing (NLP) tasks and are now also being applied in areas such as computer vision. Although many modifications to the transformer architecture have been proposed, these have not proven as easily transferable across implementations and applications as hoped, which has limited their wider adoption.
In a bid to understand why most widely used transformer applications shun these modifications, a team from Google Research comprehensively evaluated them in a shared experimental setting and was surprised to discover that most of the architecture modifications examined do not meaningfully improve performance on downstream NLP tasks.

The researchers began by reimplementing and evaluating a variety of transformer variants on the tasks where they are most commonly applied. As a baseline, they used the original transformer model with two modifications: applying layer normalization before the self-attention and feedforward blocks instead of after, and using relative attention with shared biases instead of sinusoidal positional embeddings (a sketch of this pre-layer-norm block appears after the list below). They evaluated this baseline against transformer architectures with modifications that included the following:
- Transparent Attention, which creates weighted residual connections along the encoder depth to facilitate gradient flow
- Evolved Transformer, designed via evolution-based architecture search where the initial population is seeded with the original transformer
- Synthesizer variants, where self-attention is replaced with “synthetic attention” patterns
- Funnel Transformer, which progressively reduces sequence length to efficiently encode input sequences
- Sparse Expert Transformers, which replace the feedforward network with sparsely activated expert layers
- Universal Transformer, which repeatedly applies the same transformer “block” to input sequences
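
To make the baseline concrete, the following is a minimal sketch, assuming PyTorch, of a single pre-layer-norm encoder block: layer normalization is applied before the self-attention and feedforward sublayers rather than after. The relative attention biases used in the paper are omitted for brevity, with standard multi-head attention standing in, and the names (e.g. `PreLNEncoderBlock`) are hypothetical rather than taken from the authors' code.

```python
import torch
import torch.nn as nn

class PreLNEncoderBlock(nn.Module):
    """One encoder block with pre-layer-norm residual sublayers."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN ordering: normalize first, transform, then add the residual.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.ff(self.ff_norm(x))
        return x

# Usage: a batch of 2 sequences of length 10 with model dimension 512.
block = PreLNEncoderBlock()
print(block(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```

Normalizing before each sublayer, rather than after as in the original transformer, is widely reported to stabilize training, which is why the study adopts it as part of its baseline.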
The researchers employed two experimental settings to evaluate each modification’s performance: transfer learning based on T5, and supervised machine translation on the WMT’14 English-German translation task.

The results indicate that the modifications that did yield significant performance improvements tended to fall into one of three buckets: relatively minor changes; changes that increased parameter count or slowed the model down; and changes that originated in the same Mesh TensorFlow codebase the team used for its evaluation. Few of the architectural modifications produced improvements, a finding that largely contradicts the experimental results presented in the papers that originally proposed them.
The researchers further investigated possible explanations for the underwhelming performance of these transformer modifications, concluding that the changes typically fail to transfer effectively across implementations and applications.
Finally, the team offered suggestions for making future architectural modifications more robust. They suggest researchers test proposed modifications on multiple, completely disparate codebases; apply the modifications to a wide variety of downstream applications; keep hyperparameters fixed as much as possible when evaluating performance; and follow best-practice reporting of results, including the mean and standard deviation across multiple trials.
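
To make that last suggestion concrete, here is a minimal sketch in plain Python of the reporting practice the team recommends; the scores are invented placeholders, not results from the paper.

```python
# Hypothetical illustration of best-practice reporting: aggregate a metric
# over several independent trials (e.g., random seeds) and report the mean
# and standard deviation rather than a single best run.
import statistics

trial_scores = [26.1, 25.8, 26.4, 25.9, 26.2]  # placeholder BLEU over 5 seeds

mean = statistics.mean(trial_scores)
std = statistics.stdev(trial_scores)  # sample standard deviation

print(f"BLEU: {mean:.2f} ± {std:.2f} over {len(trial_scores)} trials")
```

Reporting variability across trials makes it possible to judge whether a modification's apparent gain exceeds run-to-run noise.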
The paper *Do Transformer Modifications Transfer Across Implementations and Applications?* is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
