Retrieval augmentation is a widely used and effective way to improve the factual knowledge of language models, and memory-based variants speed up inference by pre-computing representations of the retrieval corpus. That speedup comes at a price, however: storing the precomputed representations requires a substantial amount of storage.
To address this issue, a Google research team has proposed a solution in their new paper “MEMORY-VQ: Compression for Tractable Internet-Scale Memory.” The method, MEMORY-VQ, substantially reduces the storage requirements of memory-based techniques while retaining high performance, achieving a 16x compression rate on the KILT benchmark.
The work is presented as the first to compress pre-encoded token memory representations. MEMORY-VQ combines product quantization with the VQ-VAE method to achieve its primary objective: reducing storage requirements for memory-based methods while largely preserving quality.
The core idea is to use vector quantization to replace the original memory vectors with integer codes, which can be efficiently decoded back into vectors when needed. The researchers apply this approach to LUMEN, a memory-based technique that pre-computes token representations for retrieved passages to significantly speed up inference, and the result is the LUMEN-VQ model.
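To make the idea of swapping memory vectors for integer codes more concrete, here is a minimal product-quantization sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy dimensions, and the codebook size are all hypothetical, and plain k-means stands in for the VQ-VAE-style codebook training that MEMORY-VQ uses.

```python
import random

def split(vector, num_subspaces):
    """Cut a vector into equal-length contiguous sub-vectors."""
    size = len(vector) // num_subspaces
    return [vector[i * size:(i + 1) * size] for i in range(num_subspaces)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(sub, codebook):
    """Index of the codebook entry closest to `sub`."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(sub, codebook[k]))

def train_codebook(subvectors, num_codes, iterations=10, seed=0):
    """Plain k-means (assumption: a stand-in for VQ-VAE codebook learning)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(subvectors, num_codes)]
    for _ in range(iterations):
        buckets = [[] for _ in range(num_codes)]
        for v in subvectors:
            buckets[nearest(v, codebook)].append(v)
        for k, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if nothing was assigned to it
                codebook[k] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return codebook

def encode(vector, codebooks):
    """Replace each sub-vector with the integer index of its nearest code."""
    subs = split(vector, len(codebooks))
    return [nearest(s, cb) for s, cb in zip(subs, codebooks)]

def decode(codes, codebooks):
    """Look up each code and concatenate the entries into an approximate vector."""
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out

# Toy data: 200 random 8-dimensional "memory" vectors (hypothetical sizes).
rng = random.Random(42)
memory = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]

NUM_SUBSPACES, NUM_CODES = 4, 16  # illustrative values, not from the paper
codebooks = [
    train_codebook([split(v, NUM_SUBSPACES)[i] for v in memory], NUM_CODES)
    for i in range(NUM_SUBSPACES)
]

codes = encode(memory[0], codebooks)   # what would be stored
approx = decode(codes, codebooks)      # what is recovered at inference time
print("codes:", codes)
print("squared reconstruction error:", squared_distance(memory[0], approx))
```

In this toy setup each vector of eight floats is stored as four small integers plus a shared set of codebooks. The compression rate in practice depends on the vector dimension, the float precision, the number of subspaces, and the codebook size, none of which are taken from the paper here.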
In their empirical study, the team compared LUMEN-VQ against naïve baselines such as LUMEN-Large and LUMEN-Light on a subset of knowledge-intensive tasks from the KILT benchmark. LUMEN-VQ achieved a 16x compression rate with only a limited loss in quality.
In summary, the research shows that MEMORY-VQ is an effective way to compress memory, which makes memory augmentation a practical option for substantially faster inference over very large retrieval corpora.
The paper MEMORY-VQ: Compression for Tractable Internet-Scale Memory is on arXiv.
Author: Hecate He | Editor: Chain Zhang