Overcoming Computational Challenges in Large Language Model Inference with MInference 1.0
A research team from Microsoft and the University of Surrey introduces MInference (Million-tokens Inference), which employs a dynamic sparse attention approach designed to speed up the pre-filling stage of long-sequence processing. It can reduce pre-filling latency by up to 10x on an A100 GPU while preserving accuracy.
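To illustrate the core idea of sparse pre-filling, the sketch below computes attention only over a selected subset of (query, key) blocks, skipping the rest of the attention matrix entirely. This is a simplified, hypothetical NumPy illustration of block-sparse attention, not MInference's actual kernels; the function name, the block pattern (an attention-sink block plus local diagonal blocks), and all parameters are assumptions for demonstration.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Causal attention computed only over selected (query, key) blocks.

    keep_blocks: set of (query_block, key_block) index pairs to compute;
    all other attention scores are skipped (treated as -inf).
    Hypothetical sketch of block-sparse attention, not a real library API.
    """
    n, d = q.shape
    scores = np.full((n, n), -np.inf)
    for qb, kb in keep_blocks:
        qs, ks = qb * block_size, kb * block_size
        # Only these blocks of the n x n score matrix are ever materialized.
        scores[qs:qs + block_size, ks:ks + block_size] = (
            q[qs:qs + block_size] @ k[ks:ks + block_size].T / np.sqrt(d)
        )
    # Causal mask: no attending to future positions.
    scores[np.triu_indices(n, k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d, bs = 8, 4, 2
q, k, v = rng.standard_normal((3, n, d))
# Example pattern: a global "sink" column of blocks plus local diagonal blocks.
keep = {(i, i) for i in range(n // bs)} | {(i, 0) for i in range(n // bs)}
out = block_sparse_attention(q, k, v, bs, keep)
print(out.shape)
```

Because only the kept blocks are computed, the cost scales with the number of selected blocks rather than the full quadratic attention matrix, which is the source of the pre-filling speedup the paragraph above describes.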
