There was a nineteenth-century saying that mocked the use of “a sledgehammer to crack a peanut.” Google AI researcher Tal Schuster echoes this sentiment in introducing the new paper Confident Adaptive Language Modeling. While acknowledging the tremendous power of transformer-based large language models (LLMs), Schuster notes that many of the predictions these models make “require only minimal effort.” Running the entire LLM in such cases amounts to sledgehammer-style overkill.
LLMs’ ever-increasing computation costs and the inference slowdowns they cause are major bottlenecks impeding practical deployment. Developed by a Google and MIT team, the proposed Confident Adaptive Language Modeling (CALM) framework addresses these issues by dynamically allocating different amounts of compute to each input and generation timestep. CALM achieves speedups of up to 3x on natural language processing (NLP) tasks while maintaining high model performance.
The team summarizes their main contributions as:
- A framework (CALM) for reliably accelerating transformer-based LLM generations.
- A systematic analysis of the token-wise early-exit mechanism, motivating a simple but effective class of confidence measures and threshold functions used within the CALM framework.
- An empirical demonstration of CALM’s efficiency gains on three diverse generation datasets.
The proposed framework is based on a saturation phenomenon: an LLM’s top-ranked prediction often stops changing after some intermediate layer and is simply propagated upward through the remaining layers. The number of layers the model uses can therefore be decided dynamically for each input.
Following this idea, the team develops an adaptive compute approach that dynamically allocates computational resources per input, reducing computation while maintaining good performance. This method is also referred to as “early exiting.”
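The early-exit idea can be sketched as follows. This is an illustrative toy, not the paper’s implementation: the per-layer logits, the top-probability confidence measure, and the fixed threshold are hypothetical stand-ins for CALM’s learned components.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_predict(layer_logits, threshold=0.9):
    """Return (predicted token id, layer at which we exited).

    layer_logits: hypothetical per-layer logit vectors, as if each
    transformer layer could already produce a token prediction.
    We exit at the first layer whose top probability (a simple
    confidence measure) clears the threshold; otherwise we fall
    back to the final layer.
    """
    for depth, logits in enumerate(layer_logits, start=1):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            return best, depth  # confident enough: skip deeper layers
    return best, depth  # no early exit: use the last layer's prediction

# Toy example: three "layers" over a 4-token vocabulary.
layers = [
    [1.0, 1.1, 0.9, 1.0],  # early layer: near-uniform, low confidence
    [0.5, 4.0, 0.2, 0.1],  # mid layer: confident, exits here
    [0.4, 5.0, 0.1, 0.0],  # final layer: never reached in this example
]
token, exit_layer = early_exit_predict(layers, threshold=0.9)
```

An easy token exits after two layers here, while a harder one would use all three, which is the source of the compute savings.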
Building on their analysis of the early-exiting paradigm, the team developed CALM as a principled method for increasing model efficiency. CALM leverages a distribution-free risk control technique for calibrating local, per-token exit decisions, such that model performance is provably maintained with arbitrarily high probability. CALM can dynamically allocate different amounts of compute per generated token, following explicitly defined tolerance levels based on the full generation output.
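To give a flavour of how such a calibration might work, here is a deliberately simplified sketch: choose the smallest confidence threshold from a candidate grid whose early-exit predictions disagree with the full model on at most a tolerated fraction of a calibration set. This is only a stand-in for CALM’s distribution-free risk control (it omits the statistical correction that makes the guarantee provable), and all data structures are hypothetical.

```python
def early_pred(per_layer, lam):
    """Early-exit prediction for one example, given per-layer
    (confidence, prediction) pairs and a threshold lam."""
    for conf, pred in per_layer:
        if conf >= lam:
            return pred
    return per_layer[-1][1]  # no layer confident: use the last layer

def calibrate_threshold(calib, tolerance=0.1,
                        candidates=(0.5, 0.7, 0.9, 0.99)):
    """Pick the smallest threshold whose early-exit outputs disagree
    with the full model on at most `tolerance` of calibration examples."""
    for lam in sorted(candidates):
        risk = sum(early_pred(per_layer, lam) != full_pred
                   for per_layer, full_pred in calib) / len(calib)
        if risk <= tolerance:
            return lam
    return max(candidates)  # fall back to the strictest threshold

# Toy calibration set: each example is ([(conf, pred) per layer], full model pred).
calib = [
    ([(0.6, "a"), (0.95, "b")], "b"),  # exiting too early here is wrong
    ([(0.8, "x"), (0.90, "x")], "x"),  # early exit is already correct
]
lam = calibrate_threshold(calib, tolerance=0.1)
```

With a threshold of 0.5 the first example exits at layer one and disagrees with the full model, so the risk check fails; 0.7 is the smallest candidate that keeps empirical disagreement within tolerance.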
In their empirical study, the team implemented CALM on top of the T5 encoder-decoder model and evaluated text-generation performance on three datasets: CNN/DM (summarization), WMT EN-FR (machine translation), and SQuAD (question answering). The results show that CALM can reduce the model’s compute burden and deliver speedups of up to 3x while maintaining high performance.
The paper Confident Adaptive Language Modeling is on arXiv.
Author: Hecate He | Editor: Michael Sarazen