
Microsoft’s LLMA Accelerates LLM Generations via an ‘Inference-With-Reference’ Decoding Approach

In the new paper Inference with Reference: Lossless Acceleration of Large Language Models, a Microsoft research team proposes LLMA, an inference-with-reference decoding mechanism that achieves up to 2x lossless speed-ups with identical generation results by exploiting the overlaps between LLM outputs and references.

While today’s powerful large language models (LLMs) have found applications in many real-world scenarios, the high compute cost of their autoregressive decoding process remains a bottleneck to deployment at scale. This has prompted machine learning researchers and developers to explore various approaches for improving LLM inference efficiency.

In the new paper Inference with Reference: Lossless Acceleration of Large Language Models, a Microsoft research team proposes LLMA, a novel inference-with-reference decoding mechanism that achieves up to 2x lossless speed-ups in LLMs with identical generation results by exploiting the overlaps between their outputs and references, e.g., retrieved documents.

The team first notes that an LLM's output tokens often come from its context, which may include relevant documents retrieved from external reference sources; its outputs therefore tend to contain text spans that overlap with spans present in those retrieved documents.

Motivated by this observation, the proposed LLMA exploits these overlaps between LLM outputs and their reference documents. LLMA first selects a text span from the reference, copies its tokens into the LLM decoder, and then checks the copied tokens against the decoder's output probabilities. This verification is performed in parallel within a single decoding step, enabling accelerated decoding while guaranteeing that the generated results are identical to those of vanilla greedy decoding.
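To make the copy-and-verify idea concrete, here is a minimal Python sketch under simplifying assumptions: tokens are plain strings, `greedy_next` stands in for one greedy decoding step of the LLM, and the per-token verification loop stands in for the batched parallel forward pass used in the actual system. The function names and parameters (`find_reference_span`, `match_len`, `copy_len`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of copy-then-verify ("inference with reference") decoding.
# All names and parameters here are illustrative, not the paper's code.
from typing import Callable, List, Sequence


def find_reference_span(context: Sequence[str], reference: Sequence[str],
                        match_len: int, copy_len: int) -> List[str]:
    """If the last `match_len` context tokens appear in the reference,
    return the following `copy_len` reference tokens as copy candidates."""
    if len(context) < match_len:
        return []
    suffix = list(context[-match_len:])
    for i in range(len(reference) - match_len):
        if list(reference[i:i + match_len]) == suffix:
            return list(reference[i + match_len:i + match_len + copy_len])
    return []


def decode_with_reference(greedy_next: Callable[[Sequence[str]], str],
                          prompt: List[str], reference: List[str],
                          max_new: int = 30, match_len: int = 2,
                          copy_len: int = 5) -> List[str]:
    """Greedy decoding that copies spans from a reference but keeps only
    the copied tokens the model itself would have produced, so the output
    matches vanilla greedy decoding exactly."""
    output: List[str] = []
    while len(output) < max_new:
        candidates = find_reference_span(prompt + output, reference,
                                         match_len, copy_len)
        accepted = 0
        for tok in candidates:
            # In the real system this check is one batched forward pass
            # over all copied tokens; here we verify them one at a time.
            if greedy_next(prompt + output) == tok:
                output.append(tok)
                accepted += 1
            else:
                break
        if accepted == 0:
            # Fall back to an ordinary greedy decoding step.
            output.append(greedy_next(prompt + output))
    return output[:max_new]


# Toy usage: a stand-in "model" that simply continues the reference text.
ref = "the quick brown fox jumps over the lazy dog".split()
toy_model = lambda ctx: ref[len(ctx)] if len(ctx) < len(ref) else "<eos>"
print(decode_with_reference(toy_model, ["the"], ref, max_new=8))
```

Because a copied token is kept only when it matches what the model would have generated anyway, the result is identical to vanilla greedy decoding; the speed-up in the real system comes from verifying all copied tokens in a single parallel decoder pass rather than one pass per token.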

In their empirical study, the team applied their approach to open-source LLaMA language models in both retrieval-augmented and cache-assisted scenarios. LLMA achieved better than 2x speed-ups in both settings, with generation results identical to those of greedy decoding.

Overall, this work demonstrates the effectiveness of the proposed LLMA mechanism in significantly accelerating LLM inference times without sacrificing the quality of the generated results.

The paper Inference with Reference: Lossless Acceleration of Large Language Models is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


