ETH Zurich’s UltraFastBERT Realizes 78x Speedup for Language Models

The ever-expanding scale of language models, now reaching tens of billions of parameters, has undeniably enhanced performance across diverse tasks. However, the accompanying surge in computation costs poses a significant hurdle for real-world applications, and researchers are therefore working to improve the efficiency of large language models (LLMs).

Recent studies have spotlighted a crucial observation: the majority of parameters in these expansive language models reside in their feedforward layers. Intriguingly, not every neuron in these layers needs to fire for any given input during inference, which presents an opportunity to make their computation far more efficient.

In a new paper Exponentially Faster Language Modelling, an ETH Zurich research team introduces UltraFastBERT, a variant of the BERT architecture. UltraFastBERT takes a revolutionary approach by replacing feedforward layers with fast feedforward networks, resulting in an impressive 78x speedup over the optimized baseline feedforward implementation.

The main contributions of the research team can be summarized as follows:

  1. UltraFastBERT Architecture: A BERT-like model whose feedforward layers hold 4095 neurons each yet engage only 12 of them (0.3%) during inference.
  2. Performance Parity: After fine-tuning UltraFastBERT for standard downstream tasks, it performs on par with its BERT counterparts.
  3. Efficient Implementation: The researchers provide a naive implementation of the conditional matrix multiplication underlying fast feedforward network inference, yielding a remarkable 78x speedup over the natively optimized dense matrix multiplication baseline (a minimal sketch of the idea follows this list).
  4. Demonstrating Potential: Through UltraFastBERT and existing speedups by simple Fast FeedForward (FFF) implementations, the research showcases the substantial potential of conditional neural execution in language modeling.
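
To make conditional matrix multiplication concrete, here is a minimal single-tree sketch in plain NumPy. It is not the authors' released code: the weight shapes, the heap-style node indexing, and the tanh-approximated GeLU are illustrative assumptions. The key point is that a forward pass descends a single root-to-leaf path, so a depth-11 tree with 2^12 - 1 = 4095 neurons touches only 12 of them.

```python
import numpy as np

def gelu(x):
    """GeLU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fff_forward(x, w_in, w_out, depth=11):
    """Single-tree fast feedforward inference (illustrative sketch).

    x     : (d_model,) input vector
    w_in  : (n_nodes, d_model) per-node input weights
    w_out : (n_nodes, d_model) per-node output weights
    The tree is stored heap-style: children of node n are 2n+1 and 2n+2.
    """
    y = np.zeros_like(x)
    node = 0  # start at the root
    for _ in range(depth + 1):
        logit = x @ w_in[node]           # one weight row, not the full matrix
        y += gelu(logit) * w_out[node]   # this node's contribution to the output
        node = 2 * node + (1 if logit > 0 else 2)  # branch on the sign
    return y

# Depth 11 gives 2**12 - 1 = 4095 neurons; each forward pass uses just 12.
d_model, depth = 768, 11
n_nodes = 2 ** (depth + 1) - 1
rng = np.random.default_rng(0)
w_in = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
w_out = rng.standard_normal((n_nodes, d_model)) / np.sqrt(d_model)
y = fff_forward(rng.standard_normal(d_model), w_in, w_out, depth)
```

Because each step multiplies the input by a single weight row rather than a full weight matrix, the work per token grows with the tree depth, i.e., logarithmically in the layer width rather than linearly.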

UltraFastBERT’s architecture draws inspiration from crammedBERT but distinguishes itself by replacing the feedforward networks of the intermediate layers with fast feedforward networks. The simplifying changes applied to these networks include using the same activation function across all nodes, equipping every node with output weights, removing all output biases, fixing the leaf size to 1, and allowing multiple FFF trees to run in parallel.
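
A batched layer matching those simplifications might look like the following PyTorch sketch; the class name, initialization, and tensor layout are assumptions for illustration, not the paper's implementation. Every sample is routed down its own path in each of the parallel trees, the same GeLU is applied at every node, all nodes carry output weights with no output biases, and the per-tree outputs are summed.

```python
import math
import torch

class FFFLayer(torch.nn.Module):
    """Illustrative simplified fast feedforward layer: uniform GeLU,
    output weights on all nodes, no output biases, leaf size 1, and
    multiple trees evaluated in parallel with their outputs summed."""

    def __init__(self, d_model: int, depth: int, n_trees: int):
        super().__init__()
        self.depth = depth
        n_nodes = 2 ** (depth + 1) - 1  # 4095 nodes for depth 11
        # One input and one output weight vector per node, per tree.
        self.w_in = torch.nn.Parameter(
            torch.randn(n_trees, n_nodes, d_model) / math.sqrt(d_model))
        self.w_out = torch.nn.Parameter(
            torch.randn(n_trees, n_nodes, d_model) / math.sqrt(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
        y = torch.zeros_like(x)
        # Current node per (tree, sample); every path starts at the root.
        node = torch.zeros(self.w_in.shape[0], x.shape[0],
                           dtype=torch.long, device=x.device)
        for _ in range(self.depth + 1):
            # Gather the weight rows of the nodes each sample is visiting.
            idx = node.unsqueeze(-1).expand(-1, -1, x.shape[-1])
            w_in = torch.gather(self.w_in, 1, idx)    # (trees, batch, d_model)
            w_out = torch.gather(self.w_out, 1, idx)
            logit = (x.unsqueeze(0) * w_in).sum(-1)   # (trees, batch)
            y = y + (torch.nn.functional.gelu(logit).unsqueeze(-1) * w_out).sum(0)
            node = 2 * node + 1 + (logit <= 0).long()  # 2n+1 if logit > 0, else 2n+2
        return y

# Example: a single tree of depth 11.
layer = FFFLayer(d_model=768, depth=11, n_trees=1)
out = layer(torch.randn(4, 768))  # -> (4, 768)
```

Read this way, the "1×11" in UltraFastBERT-1×11-long plausibly denotes one FFF tree of depth 11, though the exact parameterization is defined in the paper.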

During the training phase, the team follows the final training procedure of crammedBERT, while in the evaluation phase they fine-tune UltraFastBERT models on the tasks of the GLUE benchmark. Remarkably, the results demonstrate that UltraFastBERT variants trained for just one day on a single A6000 GPU retain at least 96.0% of the GLUE downstream predictive performance of the original BERT-base model. UltraFastBERT-1×11-long even performs on par with the original BERT-base model while utilizing a mere 0.3% of its neurons.

In addition to these achievements, the team provides high-level CPU code showcasing a 78x speedup over the optimized baseline feedforward implementation, along with a PyTorch implementation delivering a 40x speedup over the equivalent batched feedforward inference.
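
A back-of-the-envelope count (with illustrative dimensions, not measured figures) helps put these numbers in perspective: swapping a dense hidden layer of width 4095 for a depth-11 tree reduces the multiply-accumulate work per token by a factor of roughly 341, so the measured 78x (CPU) and 40x (PyTorch) speedups still leave considerable headroom for better conditional-execution implementations.

```python
# Rough MAC count: dense inference touches every neuron, while
# conditional execution touches one neuron per tree level.
# Dimensions are illustrative (BERT-base hidden size, depth-11 FFF).
d_model, width, depth = 768, 4095, 11
dense_macs = 2 * d_model * width        # in- and out-projections for all 4095 neurons
fff_macs = 2 * d_model * (depth + 1)    # only the 12 nodes on the selected path
print(f"theoretical reduction: {dense_macs / fff_macs:.0f}x")  # ~341x
```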

In conclusion, this work not only demonstrates the striking efficiency of UltraFastBERT but also aims to inspire the integration of primitives for conditional neural execution into device programming interfaces. The hope is that such efforts will pave the way for leaner, faster language models in the future.

The paper Exponentially Faster Language Modelling is available on arXiv.


Author: Hecate He | Editor: Chain Zhang


