The computational hurdles associated with Large Language Model (LLM) inference continue to impede their widespread deployment, particularly as prompt lengths increase. Current methods for speeding up pre-filling often struggle to maintain acceptable levels of accuracy or efficiency when applied to long-context LLMs.
To address this issue, in a new paper MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, a research team from Microsoft Corporation and the University of Surrey introduces MInference (Million-tokens Inference), a dynamic sparse attention method designed to accelerate the pre-filling stage of long-sequence processing. It reduces inference latency by up to 10 times on an A100 GPU while preserving accuracy.

Previous studies have indicated that attention matrices in LLMs are highly sparse, but attention distributions vary significantly across different inputs. This variability has hindered the direct application of prior sparse methods to long-context LLMs without costly training or fine-tuning. However, if dynamic sparse attention patterns could be efficiently predicted online, the pre-filling latency of long-context LLMs could be significantly reduced by computing only the most critical attention weights.
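To make the idea concrete, the toy sketch below (a hypothetical PyTorch illustration, not MInference's actual algorithm) shows what "computing only the most critical attention weights" means: rank the attention scores and keep only a small fraction per query. Note that this toy version still materializes the full score matrix; the real speedup comes from predicting the sparse pattern online so the discarded entries are never computed at all.

```python
import torch

def topk_sparse_attention(q, k, v, keep_ratio=0.05):
    """Toy dynamic sparse attention: keep only the top `keep_ratio` fraction
    of attention weights per query row and drop the rest (hypothetical helper)."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5               # (n, n) attention logits
    k_keep = max(1, int(keep_ratio * scores.shape[-1]))     # entries kept per row
    thresh = scores.topk(k_keep, dim=-1).values[..., -1:]   # per-row cut-off value
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1024, 64)
out = topk_sparse_attention(q, k, v)  # only ~5% of the weights contribute
```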

Building on this idea, MInference cuts 95% of the FLOPs (floating-point operations) in the attention computation, significantly accelerating the pre-filling stage of long-context LLM inference via dynamic sparse attention. Unlike existing dynamic sparse attention methods, which estimate attention patterns through low-rank hidden dimensions and thereby introduce substantial computational overhead, MInference is designed specifically for long-context scenarios and keeps the estimation overhead minimal.
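As a rough back-of-the-envelope illustration (our own arithmetic with assumed LLaMA-3-8B-like shapes, not figures from the paper), the two attention matmuls alone account for hundreds of petaFLOPs at a 1-million-token context, which is why removing 95% of them matters:

```python
# Assumed shapes: 1M-token context, head dim 128, 32 heads, 32 layers.
n, d, heads, layers = 1_000_000, 128, 32, 32
dense_flops = 4 * n * n * d * heads * layers   # ~2*n^2*d FLOPs each for QK^T and PV
sparse_flops = 0.05 * dense_flops              # keeping ~5% of the attention FLOPs
print(f"dense attention:  {dense_flops:.2e} FLOPs")
print(f"sparse attention: {sparse_flops:.2e} FLOPs")
```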
The researchers conducted extensive analyses and identified three general patterns of sparse attention in long-context LLMs: the A-shape pattern, the Vertical-Slash pattern, and the Block-Sparse pattern. Based on these findings, they developed a kernel-aware search method to assign the optimal attention pattern for each head.
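The sketch below (an illustration with assumed window, column, and block sizes, not the paper's kernel parameters) shows what boolean masks for the three patterns might look like:

```python
import torch

def a_shape_mask(n, sink=64, window=256):
    """A-shape: attend to the first `sink` tokens plus a local causal window."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return (j < sink) | ((i - j >= 0) & (i - j < window))

def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Vertical-Slash: a few globally attended key columns plus diagonal 'slash' lines."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, vertical_cols] = True                 # vertical lines
    for off in slash_offsets:                     # slash (diagonal) lines
        mask |= (i - j) == off
    return mask & (j <= i)                        # keep it causal

def block_sparse_mask(n, block=64, keep_blocks=()):
    """Block-Sparse: only selected (query-block, key-block) pairs are computed."""
    nb = n // block
    block_mask = torch.zeros(nb, nb, dtype=torch.bool)
    for qb, kb in keep_blocks:
        block_mask[qb, kb] = True
    return block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)

mask = vertical_slash_mask(512, vertical_cols=[0, 1, 2, 3], slash_offsets=[0, 1, 128])
```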
Notably, rather than using fixed attention masks as in previous studies, the researchers implemented an efficient online approximation to construct a dynamic sparse mask for each head according to its assigned pattern and specific inputs. After obtaining the dynamic sparse mask, three optimized GPU kernels—based on the dynamic sparse compilers PIT, Triton, and FlashAttention—enable extremely efficient computation of dynamic sparse attention.
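Below is a hedged sketch of what this online approximation could look like for a Vertical-Slash head: a handful of the last queries probe the keys to guess which key columns and diagonal offsets carry most of the attention mass. The probe size and the helper name `estimate_vertical_slash` are our assumptions, not the paper's implementation.

```python
import torch

def estimate_vertical_slash(q, k, last_q=64, top_v=128, top_s=32):
    """Use the last `last_q` queries to pick heavy vertical columns and
    slash (diagonal) offsets for one attention head (hypothetical helper)."""
    n, d = q.shape
    rows = torch.arange(n - last_q, n).unsqueeze(1)           # global query positions
    cols = torch.arange(n).unsqueeze(0)
    scores = q[-last_q:] @ k.transpose(-2, -1) / d**0.5       # (last_q, n) probe scores
    scores = scores.masked_fill(rows < cols, float("-inf"))   # keep the probe causal
    attn = torch.softmax(scores, dim=-1)
    vertical_cols = attn.sum(0).topk(top_v).indices           # heaviest key columns
    offsets = rows - cols                                     # diagonal offset per entry
    causal = offsets >= 0
    diag_scores = torch.zeros(n).scatter_add_(0, offsets[causal], attn[causal])
    slash_offsets = diag_scores.topk(top_s).indices           # strongest slash lines
    return vertical_cols, slash_offsets

cols, slashes = estimate_vertical_slash(torch.randn(4096, 128), torch.randn(4096, 128))
```

In MInference itself, the estimated indices are handed to the pattern-specific sparse kernels rather than expanded into a dense mask.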


In their empirical study, the team evaluated various long-context LLMs on benchmarks such as InfiniteBench and RULER, with context lengths of up to 1 million tokens. The results showed that MInference speeds up the pre-filling stage by up to 10 times for 1-million-token contexts with LLaMA-3-8B on a single A100 GPU, reducing latency from 30 minutes to 3 minutes per prompt while maintaining or improving accuracy.
The paper MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention is on arXiv.
Author: Hecate He | Editor: Chain Zhang
