Overcoming Computational Challenges in Large Language Model Inference with MInference 1.0
A research team from Microsoft and the University of Surrey introduces MInference (Million-tokens Inference), which employs a dynamic sparse attention approach designed to speed up the pre-filling stage of long-sequence processing. It can reduce pre-filling latency by up to 10x on an A100 GPU while preserving accuracy.
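To illustrate the core idea of sparse pre-filling, the sketch below computes attention only over a selected subset of (query, key) blocks, skipping the rest of the attention matrix entirely. This is a simplified, hypothetical NumPy illustration of block-sparse attention, not MInference's actual kernels; the function name, the block pattern (an attention-sink block plus local diagonal blocks), and all parameters are assumptions for demonstration.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Causal attention computed only over selected (query, key) blocks.

    keep_blocks: set of (query_block, key_block) index pairs to compute;
    all other attention scores are skipped (treated as -inf).
    Hypothetical sketch of block-sparse attention, not a real library API.
    """
    n, d = q.shape
    scores = np.full((n, n), -np.inf)
    for qb, kb in keep_blocks:
        qs, ks = qb * block_size, kb * block_size
        # Only these blocks of the n x n score matrix are ever materialized.
        scores[qs:qs + block_size, ks:ks + block_size] = (
            q[qs:qs + block_size] @ k[ks:ks + block_size].T / np.sqrt(d)
        )
    # Causal mask: no attending to future positions.
    scores[np.triu_indices(n, k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d, bs = 8, 4, 2
q, k, v = rng.standard_normal((3, n, d))
# Example pattern: a global "sink" column of blocks plus local diagonal blocks.
keep = {(i, i) for i in range(n // bs)} | {(i, 0) for i in range(n // bs)}
out = block_sparse_attention(q, k, v, bs, keep)
print(out.shape)
```

Because only the kept blocks are computed, the cost scales with the number of selected blocks rather than the full quadratic attention matrix, which is the source of the pre-filling speedup the paragraph above describes.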
