Predictive models and lossless compressors have long been known to be two sides of the same coin: a good predictor can be turned into a good compressor, and vice versa. Recently, the remarkable success of large pre-trained Transformers, often referred to as foundation models, in a diverse range of predictive tasks has positioned them as potent candidates for the role of powerful compressors.
In a groundbreaking research paper titled “Language Modeling Is Compression,” a collaborative team from Google DeepMind, Meta AI, and Inria delves into the lossless compression capabilities of foundation models, unveiling their achievement of state-of-the-art compression rates across various data types. This feat is accomplished by harnessing their contextual understanding to adapt a general-purpose compressor to excel in specific tasks.
The team summarizes their main contributions as follows:
- Empirical Investigation: The team conducts a thorough empirical examination of the lossless compression capabilities of foundation models.
- General-Purpose Compressors: Foundation models, primarily trained on textual data, emerge as versatile compressors due to their adeptness in contextual learning.
- Scaling Insights: A fresh perspective on scaling laws is presented, revealing that dataset size imposes a hard limit on the model size that still pays off in compression performance. It underscores that scaling is not a panacea.
- Compression-Prediction Duality: The research leverages the equivalence between compression and prediction to employ compressors as generative models, demonstrating their effectiveness through visual representations.
- Tokenization Clarification: Tokenization, viewed as a form of pre-compression, is shown to generally not enhance compression performance. Instead, it allows models to enrich the information content within their context, thereby generally improving prediction performance.
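The compression–prediction duality mentioned above rests on a simple fact: an arithmetic coder driven by a predictive model spends roughly −log₂ p bits on each symbol, where p is the probability the model assigned to it. The sketch below (a toy illustration, not the paper's setup) measures this ideal code length under a simple adaptive byte-frequency model with Laplace smoothing — the better the model predicts, the fewer bits the data costs.

```python
import math
from collections import Counter

def ideal_code_length_bits(data: bytes) -> float:
    """Total bits an arithmetic coder would spend when driven by an
    adaptive byte-frequency model (Laplace smoothing over 256 symbols).
    Better prediction -> shorter code: the compression-prediction duality."""
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for b in data:
        # Probability the model assigns to the next byte, before seeing it.
        p = (counts[b] + 1) / (seen + 256)
        total_bits += -math.log2(p)  # ideal code length for this symbol
        counts[b] += 1               # update the model after the fact
        seen += 1
    return total_bits

text = b"abracadabra abracadabra abracadabra"
bits = ideal_code_length_bits(text)
print(f"{bits / 8:.1f} bytes (model) vs {len(text)} bytes (raw)")
```

Repetitive input becomes cheap to encode as the model's predictions sharpen; a stronger predictor (such as a large language model) drives the same machinery to much shorter codes.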
This work advocates for the utilization of (lossless) compression techniques as a means to scrutinize foundation models. The rationale behind this approach lies in the ready availability of these models for compression tasks, eliminating the need for additional training overhead.
To substantiate their findings, the researchers compare their arithmetic coding-based language model compressors with two prominent general-purpose lossless compressors: gzip and its stronger counterpart, LZMA2. Additionally, specialized lossless compressors tailored for image and audio data, namely PNG and FLAC, respectively, are considered. The evaluation covers language models of different sizes, all paired with arithmetic coding.
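The general-purpose baselines used in the comparison are easy to reproduce in miniature. A minimal sketch (using Python's standard-library bindings; the test string is an illustrative assumption, not data from the paper) reports the compression ratio — compressed size as a percentage of raw size, lower is better — for gzip and LZMA:

```python
import gzip
import lzma

def compression_ratio(compressed_len: int, raw_len: int) -> float:
    """Compressed size as a percentage of the raw size (lower is better)."""
    return 100.0 * compressed_len / raw_len

# Highly repetitive toy data compresses well under both baselines.
data = b"the quick brown fox jumps over the lazy dog " * 200

for name, compress in [("gzip", gzip.compress), ("lzma", lzma.compress)]:
    out = compress(data)
    print(f"{name}: {compression_ratio(len(out), len(data)):.1f}%")
```

The paper applies the same ratio metric to the language-model compressors, which is what makes the cross-family comparison with PNG and FLAC possible.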
The results decisively establish the prowess of large language models as versatile predictors and unveil fresh insights into scaling laws, tokenization, and in-context learning. Notably, Chinchilla 70B, primarily trained on textual data, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, surpassing domain-specific compressors like PNG (58.5%) and FLAC (30.3%), respectively.
In summary, this work not only highlights the significance of the compression viewpoint but also contributes novel insights into scaling laws by recognizing the inextricable connection between optimal model size and dataset size, dispelling the notion that limitless scaling is attainable.
The paper Language Modeling Is Compression is available on arXiv.
Author: Hecate He | Editor: Chain Zhang