Large language models (LLMs) have significantly transformed language processing, achieving remarkable results across a wide range of applications. However, deploying LLMs on edge devices such as mobile phones remains challenging, particularly with respect to memory footprint, energy consumption, and computational demands. These constraints hinder the widespread adoption of LLMs on such devices.
One promising approach to overcoming these challenges is to reduce the bit-width of weights and activations; 8-bit activations in particular are an attractive option for on-device deployment, because they allow LLMs to take full advantage of hardware designed for mobile devices.
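To make the idea concrete, here is a minimal sketch of symmetric per-tensor 8-bit activation quantization in NumPy. The function names and the max-based calibration are illustrative assumptions, not the paper's actual scheme; in practice the scale would come from a calibration step.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float tensor to int8.

    The scale is derived from the tensor's observed range for simplicity;
    a real deployment would calibrate it offline.
    """
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to float32 to measure the rounding error."""
    return q.astype(np.float32) * scale

# Quantize a toy activation tensor and check how much precision is lost.
acts = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_int8(acts)
print("max abs error:", np.abs(dequantize(q, s) - acts).max())
```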
Building on this concept, in a new paper MobileQuant: Mobile-friendly Quantization for On-device Language Models, a research team from Samsung AI Center makes a first attempt to facilitate LLM deployment on edge devices using integer-only quantization. The proposed solution, MobileQuant, is a straightforward post-training quantization technique that reduces both inference latency and energy consumption while preserving accuracy levels comparable to those achieved with 16-bit activations.
MobileQuant effectively addresses the traditional trade-off in quantization between accuracy and efficiency while remaining fully compatible with existing mobile hardware. Building on current state-of-the-art methods, and motivated by their limitations when applied to edge devices, the framework introduces three key methodological enhancements.
These enhancements are: (1) applying weight equivalent transformations across all applicable layers, (2) learning the optimal quantization ranges for activations, and (3) jointly optimizing all weight transformation and range parameters end to end. MobileQuant combines per-tensor and per-channel weight quantization at 4-bit or 8-bit with per-tensor activation quantization at 8-bit or 16-bit, using fixed-point integer representations for all operations.
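The distinction between per-tensor and per-channel weight quantization can be illustrated with a short NumPy sketch. The helper below is a simplified stand-in (max-based symmetric scales, hypothetical function name), not MobileQuant's implementation; it only shows why keeping one scale per output channel typically loses less precision at 4 bits than a single scale for the whole matrix.

```python
import numpy as np

def quantize_weights(w, bits=8, per_channel=False):
    """Symmetric weight quantization, per-tensor or per-channel.

    w has shape (out_channels, in_channels). Per-channel quantization keeps
    one scale per output row; per-tensor uses a single scale for the matrix.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    if per_channel:
        scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    else:
        scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q, scale

w = np.random.randn(8, 32).astype(np.float32)
for per_channel in (False, True):
    q, s = quantize_weights(w, bits=4, per_channel=per_channel)
    err = np.abs(q * s - w).max()
    print(f"per_channel={per_channel}: max abs error={err:.4f}")
```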
MobileQuant offers several advantages over previous methods. First, it quantizes weights to 4-bit or 8-bit and activations to 8-bit integers with minimal performance degradation, making full use of equivalent transformation-based methods that enable linear-invariant weight equalization. Second, its end-to-end optimization benefits from larger numbers of calibration and training samples, as demonstrated in the ablation study. Finally, unlike other learning-based quantization methods such as quantization-aware training (QAT), MobileQuant preserves the model's generalizability, since the model remains mathematically equivalent to its unquantized version.
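The "mathematically equivalent" property rests on the equivalent transformation: dividing each activation channel by a scale and multiplying the matching weight rows by the same scale leaves the full-precision output unchanged, so only the subsequent rounding to integers introduces error. A toy NumPy check of this identity (with arbitrary scales rather than the learned ones) is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))        # activations: (batch, in_features)
w = rng.normal(size=(16, 16))       # weights:     (in_features, out_features)
s = rng.uniform(0.5, 2.0, size=16)  # one positive scale per input channel

# Fold the scales into both sides: divide activation channels, multiply
# the corresponding weight rows. The product is unchanged in full precision.
y_original    = x @ w
y_transformed = (x / s) @ (w * s[:, None])

print(np.allclose(y_original, y_transformed))  # True
```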
The research team conducted an extensive evaluation of MobileQuant on edge devices, assessing model accuracy, inference latency, and energy consumption. The results show that MobileQuant can reduce both inference latency and energy usage by 20% to 50%, all while maintaining accuracy comparable to models utilizing 16-bit activations.
In conclusion, MobileQuant represents a significant advancement in the development of energy- and compute-efficient quantized LLMs with minimal performance loss. This framework is fully compatible with current edge device hardware and low-level runtimes, making it a practical solution for deploying LLMs on mobile devices.
The paper MobileQuant: Mobile-friendly Quantization for On-device Language Models is on arXiv.
Author: Hecate He | Editor: Chain Zhang

