Transformers have become a preferred architecture in the machine learning community, and building deeper models is the common approach for improving their performance. But is deeper necessarily better?
In the new paper Wide Attention Is The Way Forward For Transformers, a research team from the University of Cambridge, Imperial College London, and the University of Oxford challenges the commonly held belief that deeper is better for transformer architectures, demonstrating that wider layers result in superior performance on natural language processing (NLP) tasks.
The team summarizes their main contributions as follows:
- We demonstrate that wider and shallower models can equal or beat the accuracy of deeper models when there is no pretraining of weights or embeddings. Across all 4 tasks, the average accuracy for the vanilla Transformer increases by 0.4% between normal deep models and our single-layer wide models.
- We show that our results are consistent across a variety of different attention mechanisms and input sequence lengths, and thus there is a general design equivalence in increasing the depth of a Transformer model vs increasing the width. Averaged across all non-vanilla attention types and tasks, accuracy increases by 0.3% from deepest to widest.
- We show that widening the models by fixing the attention computation size results in less overall parameters and faster inference. We show that wider models are on average 1.4× smaller and have 3.1× faster inference latency on a CPU and 1.9× on a GPU, compared to deep models.
- We demonstrate how single-layer networks can have more interpretable predictions by inspecting the attention weights of each head in a single layer.
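The interpretability point in the last bullet can be illustrated with a toy example. The sketch below is not the authors' code: it uses only the Python standard library, random vectors stand in for learned query and key projections, and the sizes (`seq_len`, `head_dim`, `num_heads`) are hypothetical. It shows that in a single-layer model each head yields one attention matrix over the raw input positions, which can be read directly.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention_weights(queries, keys):
    """Scaled dot-product attention weights (seq_len x seq_len) for one head."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]

def random_vectors(rng, count, dim):
    """Stand-in for learned projections: `count` random vectors of size `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
seq_len, head_dim, num_heads = 5, 4, 3  # hypothetical toy sizes

# One attention matrix per head, all in the same (single) layer.
per_head_weights = [
    head_attention_weights(
        random_vectors(rng, seq_len, head_dim),
        random_vectors(rng, seq_len, head_dim),
    )
    for _ in range(num_heads)
]

# For each head, find the input position that token 0 attends to most.
top_positions = [
    max(range(seq_len), key=lambda j: w[0][j]) for w in per_head_weights
]
for h, pos in enumerate(top_positions):
    print(f"head {h}: token 0 attends most to position {pos}")
```

With only one layer, every row of every matrix refers to positions in the original input rather than to outputs of earlier layers, which is what makes per-head inspection straightforward.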
The paper first evaluates the impact of model aspect ratio (the ratio of layers to heads) on model accuracy, runtime performance, model size, and interpretability. Unlike conventional approaches, which focus on finding more efficient attention styles or using neural architecture search (NAS) to find optimal combinations of operators, the team considers a more coarse-grained design space by changing the model aspect ratio. This enables them to evaluate novel architectures, such as a single-layer model with many parallel heads.
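To see why trading depth for width can shrink a model, consider a rough parameter count. This is a sketch under stated assumptions, not the paper's exact accounting: it assumes a standard encoder layer (multi-head attention, a two-layer feed-forward block, and two layer norms), holds the total number of heads and the per-head dimension constant, and ignores embeddings and the classifier head. The function name `encoder_params` and all sizes are hypothetical.

```python
def encoder_params(num_layers, num_heads, head_dim, model_dim, ffn_dim):
    """Rough parameter count for a stack of standard Transformer encoder layers."""
    attn_dim = num_heads * head_dim
    # Query, key and value projections (weights + biases).
    qkv = 3 * (model_dim * attn_dim + attn_dim)
    # Output projection back to the model dimension (weights + biases).
    out = attn_dim * model_dim + model_dim
    # Two-layer feed-forward block (weights + biases).
    ffn = (model_dim * ffn_dim + ffn_dim) + (ffn_dim * model_dim + model_dim)
    # Two layer norms per layer, each with a scale and a shift vector.
    layer_norm = 2 * (2 * model_dim)
    counts = {
        "attention": num_layers * (qkv + out),
        "ffn": num_layers * ffn,
        "layer_norm": num_layers * layer_norm,
    }
    counts["total"] = sum(counts.values())
    return counts

# Hypothetical configurations with the same total number of heads (6 * 8 == 1 * 48).
deep = encoder_params(num_layers=6, num_heads=8, head_dim=64, model_dim=512, ffn_dim=2048)
wide = encoder_params(num_layers=1, num_heads=48, head_dim=64, model_dim=512, ffn_dim=2048)

for name, counts in (("deep 6x8", deep), ("wide 1x48", wide)):
    print(name, counts)
```

Under these assumptions the attention parameters are almost identical in both configurations, while the wide model carries one feed-forward block instead of six, which is where the saving comes from. The actual size ratio in any real model also depends on the embedding table and other components left out here.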
The researchers performed experiments on four text classification tasks: sentiment analysis on the IMDb dataset at both the token and byte level, Listops 10-way classification, and byte-level document matching. They also investigated how widening the attention layer would affect ten different types of transformer attention mechanisms.
The researchers summarize the empirical results as follows:
- Wide transformer networks typically offer equal or greater accuracy on a range of classification tasks with different sequence lengths.
- Some tasks significantly benefit from going wide. Listops had an average accuracy increase of 1.5% when going wide, whereas all the others had changes of < 0.5%.
- The attention mechanism has some effect on whether wider or deeper is better, with Longformer and Sinkhorn being more sensitive to the model aspect ratio.
Overall, this work shows that the proposed wide transformer networks can achieve performance comparable to or better than deep transformers. The researchers conclude that wider and shallower models are thus a “viable and desirable alternative” for transformers when there is no pretraining of weights or embeddings.
The paper Wide Attention Is The Way Forward For Transformers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang