While today’s powerful large language models (LLMs) have found applications in many real-world scenarios, the high compute cost of their autoregressive decoding process remains a bottleneck to deployment at scale. This has prompted machine learning researchers and developers to explore various approaches for improving LLM inference efficiency.
In the new paper Inference with Reference: Lossless Acceleration of Large Language Models, a Microsoft research team proposes LLMA, a novel inference-with-reference decoding mechanism that achieves up to 2x lossless speed-ups, producing generation results identical to standard decoding, by exploiting the overlaps between an LLM's outputs and its references, e.g., retrieved documents.
The team first notes that an LLM's output tokens often come from its context, which includes relevant documents retrieved from external reference sources, and that its outputs thus tend to contain text spans that "overlap" with spans present in those retrieved documents.
Motivated by this observation, LLMA exploits these overlaps between LLM outputs and their reference documents. LLMA first selects a text span from the reference cache, copies its tokens into the LLM decoder, and then checks each copied token's acceptability against the model's output probabilities. Because the copied tokens are verified in parallel within a single decoder pass, decoding is accelerated while the generated results remain identical to those of a vanilla greedy decoding method.
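The copy-then-verify idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`toy_greedy_next`, `find_reference_span`, `llma_decode`) and the span-matching heuristic are illustrative assumptions, and verification is done sequentially here, whereas the real system checks all copied tokens in one parallel decoder pass. The key property the sketch preserves is losslessness: every drafted token is accepted only if it matches what greedy decoding would have produced.

```python
def toy_greedy_next(prefix):
    """Stand-in for an LLM's greedy next-token choice.

    Deterministically emits a fixed target sequence so the lossless
    property of reference-guided decoding can be checked end to end.
    """
    target = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]
    return target[len(prefix)]

def find_reference_span(output_so_far, reference, match_len=2, copy_len=4):
    """If the last `match_len` generated tokens occur in the reference,
    return the next `copy_len` reference tokens as a candidate draft."""
    if len(output_so_far) < match_len:
        return []
    tail = output_so_far[-match_len:]
    for i in range(len(reference) - match_len):
        if reference[i:i + match_len] == tail:
            return reference[i + match_len:i + match_len + copy_len]
    return []

def llma_decode(reference, max_tokens=16):
    """Greedy decoding accelerated by copying spans from a reference.

    Drafted tokens are verified against the model's own greedy choices,
    so the output is always identical to plain greedy decoding; only
    the number of sequential model calls changes.
    """
    out = []
    while len(out) < max_tokens:
        draft = find_reference_span(out, reference)
        accepted_any = False
        for tok in draft:                    # verify each drafted token
            if toy_greedy_next(out) == tok:  # accept only exact matches
                out.append(tok)
                accepted_any = True
                if tok == "<eos>":
                    return out
            else:
                break                        # first mismatch: discard rest
        if not accepted_any:                 # fall back to a normal step
            tok = toy_greedy_next(out)
            out.append(tok)
            if tok == "<eos>":
                return out
    return out
```

With a reference that overlaps the output, several tokens are accepted per drafting round; with an unrelated reference, the loop degenerates to ordinary one-token greedy steps, and either way the final sequence is the same.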
In their empirical study, the team applied their approach to open-source LLaMA language models in both retrieval-augmented and cache-assisted scenarios. LLMA achieved better than 2x speed-ups in both experiments, with generation results identical to greedy decoding methods.
Overall, this work demonstrates the effectiveness of the proposed LLMA mechanism in significantly accelerating LLM inference times without sacrificing the quality of the generated results.
The paper Inference with Reference: Lossless Acceleration of Large Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.