While large-scale transformer architectures have significantly advanced the state of the art on most natural language processing (NLP) tasks, in many real-life applications these models are prohibitively expensive to train and undesirably slow to decode.
In the new paper Sparse is Enough in Scaling Transformers, a research team from the University of Warsaw, Google Research and OpenAI proposes Scaling Transformers, a novel family of transformers that leverage sparse layers to scale efficiently and perform unbatched decoding much faster than original transformers, enabling fast inference on long sequences even with limited memory.
To avoid having a transformer model’s non-sparse parts dominate decoding time and become a bottleneck, the team first explores how to completely sparsify a transformer by introducing sparse equivalents for the feedforward blocks, the dense QKV (query, key, value) and output layers in attention, and the final dense layer before the softmax and loss.
The team proposes a dynamic sparsity strategy to sparsify the feedforward layers. Unlike previous techniques that prune weights or blocks from weight matrices (static sparsity), this method trains a full weight matrix but activates only a fraction of its parameters for each input token during decoding.
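The idea can be illustrated with a minimal NumPy sketch. Here the hidden layer of a feedforward block is split into fixed-size blocks, and only one unit per block is kept active for each token; a simple argmax over activations stands in for the paper's learned low-rank controller, so the function names and shapes below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sparse_ffn(x, W1, W2, block_size=4):
    """Dynamic-sparsity feedforward sketch: the full weight matrices W1 and
    W2 are trained, but for each token only the strongest activation in each
    block of the hidden layer stays nonzero, so only a fraction of W2's rows
    contribute to the output at decoding time.

    x: (tokens, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model)
    """
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden activations, shape (tokens, d_ff)
    tokens, d_ff = h.shape
    blocks = h.reshape(tokens, d_ff // block_size, block_size)
    # Keep only the argmax unit per block (stand-in for a learned controller).
    mask = (blocks == blocks.max(axis=-1, keepdims=True)).astype(h.dtype)
    h_sparse = (blocks * mask).reshape(tokens, d_ff)
    return h_sparse @ W2
```

With `block_size=4`, at most a quarter of the hidden units fire per token, which is where the decoding-time savings come from.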
To sparsify the dense QKV layers in attention, the proposed approach subdivides the layers’ dimensionality into several modules and then processes these modules with a convolutional layer that has fewer weights and computes faster. The team also develops a multiplicative layer that can represent an arbitrary permutation yet has fewer parameters and a lower computation time than a dense layer.
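A multiplicative layer of this kind can be sketched in a few lines of NumPy. The sketch below follows the paper's description of splitting the model dimension into S modules of size M and combining two small factor matrices instead of one large dense weight; the exact shapes and variable names are assumptions for illustration:

```python
import numpy as np

def multiplicative_layer(x, D, E):
    """Multiplicative-layer sketch: map an input x of size d_model to an
    output of S modules x M features via two small factors,
        y[s, m] = sum_i x[i] * D[i, s] * E[i, m],
    using d_model * (S + M) parameters instead of the d_model * S * M
    a dense layer of the same output size would need.

    x: (d_model,), D: (d_model, S), E: (d_model, M) -> y: (S, M)
    """
    # einsum sums over the shared input index i for every (s, m) pair.
    return np.einsum('i,is,im->sm', x, D, E)
```

Because each output entry mixes every input coordinate through a product of two learned factors, the layer is expressive enough to represent an arbitrary permutation of the input while staying far cheaper than a dense projection.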
To sparsify the final dense layer before the loss, the researchers likewise replace it with a multiplicative layer, yielding faster decoding with comparable perplexity.
Model scaling is, however, not the only source of high computation costs: on long sequences, the complexity of the attention operation can also dominate decoding time. To solve this problem, the team adopts the LSH (Locality-Sensitive Hashing) attention approach from previous studies, integrating this sparse attention mechanism along with recurrent blocks into a Scaling Transformer to yield their final model, which they call Terraformer.
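The intuition behind LSH attention is that tokens whose query vectors point in similar directions can be grouped into buckets, and attention is then computed only within each bucket rather than over the entire sequence. A minimal NumPy sketch of the random-rotation hashing step, with illustrative (assumed) function and parameter names:

```python
import numpy as np

def lsh_buckets(queries, n_buckets, seed=0):
    """LSH bucketing sketch: hash each query vector to one of n_buckets
    angular buckets via a random projection, so that nearby vectors tend
    to share a bucket and attention can be restricted to bucket members.

    queries: (seq_len, d) -> bucket ids: (seq_len,)
    """
    rng = np.random.default_rng(seed)
    d = queries.shape[-1]
    # Project onto n_buckets // 2 random directions; taking the argmax over
    # [proj, -proj] yields an angular hash in {0, ..., n_buckets - 1}.
    R = rng.standard_normal((d, n_buckets // 2))
    proj = queries @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)
```

Since the hash depends only on a vector's direction, rescaling a query does not change its bucket, and attention cost drops from quadratic in sequence length to roughly quadratic in bucket size.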
Compared with the original transformer’s decoding time of 0.061s on long-sequence-processing tasks, Terraformer decodes in 0.086s while achieving similar perplexity. Terraformer also matches the accuracy of the original transformer model on several downstream tasks from the GLUE benchmark. Notably, when the model is scaled up to 17B parameters, Terraformer achieves a 37x decoding speedup.
The results show that the proposed sparse models can match the performance of their dense counterparts while decoding many times faster, with the benefits of sparsity growing as models scale up. The team believes this can make large transformer models more useful and sustainable, and hopes machine learning researchers will take inspiration from Scaling Transformers and tune them for their own needs.
The paper Sparse is Enough in Scaling Transformers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen