Transformer-based large language models (LLMs) are rapidly expanding in both their applications and size. OpenAI’s GPT series, for example, has ballooned from 117 million parameters in the original 2018 GPT to 175 billion in 2020’s GPT-3. LLMs encode input sequences via self-attention and pass the representations through feed-forward layers, decoding outputs one token at a time. Although scaling architectures to extremely large sizes has proven an effective way to boost performance, the decoding process for such large-scale transformers is costly and inefficient.
In the new paper Accelerating Large Language Model Decoding with Speculative Sampling, a DeepMind research team presents SpS (Speculative Sampling), an algorithm that achieves 2–2.5x decoding speedups on a 70 billion parameter Chinchilla LLM. The novel approach maintains sample quality and does not require any modifications to model parameters or architecture.
The team summarizes the SpS pipeline as follows:
- Generating a short draft of length 𝐾. This can be attained with either a parallel model (Stern et al., 2018) or by calling a faster, auto-regressive model 𝐾 times. We shall refer to this model as the draft model, and focus on the case where it is auto-regressive.
- Scoring the draft using the larger, more powerful model from which we wish to sample. We shall refer to this model as the target model.
- Using a modified rejection sampling scheme, accept a subset of the 𝐾 draft tokens from left to right, recovering the distribution of the target model in the process.
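The three steps above can be sketched end to end with toy categorical "models." The `draft_dist` and `target_dist` functions below are hypothetical stand-ins (real models return logits from a forward pass), not the paper's code; this is a minimal illustration of the draft → score → accept loop.

```python
import numpy as np

VOCAB = 4  # toy vocabulary size

def draft_dist(seq):
    # Hypothetical small draft model: next-token distribution over VOCAB.
    d = np.ones(VOCAB)
    d[len(seq) % VOCAB] += 2.0
    return d / d.sum()

def target_dist(seq):
    # Hypothetical large target model, deliberately close to the draft.
    d = np.ones(VOCAB)
    d[len(seq) % VOCAB] += 3.0
    return d / d.sum()

def speculative_step(seq, K, rng):
    # 1) Draft: call the small model K times auto-regressively.
    draft, draft_probs = list(seq), []
    for _ in range(K):
        p = draft_dist(draft)
        draft_probs.append(p)
        draft.append(int(rng.choice(VOCAB, p=p)))
    # 2) Score: one target call yields q at every draft position in
    #    parallel (simulated here by evaluating each prefix).
    target_probs = [target_dist(draft[:len(seq) + i]) for i in range(K + 1)]
    # 3) Accept/reject the K draft tokens left to right.
    out = list(seq)
    for i in range(K):
        x = draft[len(seq) + i]
        p, q = draft_probs[i], target_probs[i]
        if rng.random() < min(1.0, q[x] / p[x]):
            out.append(x)  # token accepted: distribution matches target
        else:
            # Rejected: resample from the normalized residual max(0, q - p),
            # then stop; later draft tokens are discarded.
            residual = np.maximum(q - p, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(VOCAB, p=residual)))
            return out
    # All K accepted: sample one bonus token from the target's extra position.
    out.append(int(rng.choice(VOCAB, p=target_probs[K])))
    return out
```

Each call to `speculative_step` thus emits between 1 and K+1 tokens for a single (simulated) target-model scoring pass, which is where the speedup comes from.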
In conventional transformer decoding, samples are drawn using autoregressive sampling (ArS), a memory-bandwidth-bound method that generates only a single token per model call for every sequence in the batch and therefore cannot make effective use of modern hardware accelerators such as GPUs and TPUs.
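For contrast, the ArS pattern looks like the loop below, where the hypothetical `model_dist` stands in for one full forward pass of the target model; with a large model, every iteration is dominated by streaming weights from memory, so latency grows linearly with the number of generated tokens.

```python
import numpy as np

VOCAB = 4  # toy vocabulary size

def model_dist(seq):
    # Hypothetical stand-in for one forward pass of a large target model.
    d = np.ones(VOCAB)
    d[len(seq) % VOCAB] += 3.0
    return d / d.sum()

def autoregressive_sample(prompt, n_tokens, rng):
    """One full model call per generated token: the memory-bandwidth-bound
    pattern that speculative sampling is designed to amortize."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(int(rng.choice(VOCAB, p=model_dist(seq))))
    return seq
```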
The team’s approach for improving sample-drawing efficiency involves generating multiple tokens every time the target model is called, which is possible only if there is strong agreement between the draft and target models’ distributions on a given token or sub-sequence of tokens.
The proposed SpS is based on the team’s observation that computing the logits of a short continuation of 𝐾 tokens in parallel (generated by a faster but less powerful draft model) has similar latency to sampling a single token from a larger target model. They also introduce a novel modified rejection sampling scheme which recovers the distribution of the target model from the draft model samples within hardware numerics.
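The modified rejection rule itself is compact: accept a draft token x (drawn from draft distribution p) with probability min(1, q(x)/p(x)), where q is the target distribution; on rejection, resample from the normalized residual max(0, q - p). A minimal NumPy sketch of this rule in isolation:

```python
import numpy as np

def accept_or_resample(x, q, p, rng):
    """Accept draft token x with probability min(1, q[x] / p[x]);
    on rejection, resample from the residual max(0, q - p), normalized.
    Over many draws this recovers exact samples from the target q.
    Returns (token, accepted_flag)."""
    if rng.random() < min(1.0, q[x] / p[x]):
        return x, True
    residual = np.maximum(q - p, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(q), p=residual)), False
```

Note that when the draft and target distributions agree exactly, the acceptance probability is 1, which is why strong draft–target agreement translates directly into longer accepted runs and larger speedups.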
In their empirical study, the team used a 70B Chinchilla LLM to compare the proposed SpS with ArS on the XSum (Narayan et al., 2018) and 100-shot HumanEval (Chen et al., 2021) benchmarks. In the evaluations, SpS achieved 2–2.5x decoding speedups on both benchmarks while maintaining sample quality and without any parameter or architecture modifications.
The paper Accelerating Large Language Model Decoding with Speculative Sampling is on arXiv.
Author: Hecate He | Editor: Michael Sarazen