A new paper from researchers at Google DeepMind demonstrates that large language models like Chinchilla are not just adept at generating human-like text - they are also strong general-purpose compressors. Used as predictive models inside a lossless coder, they can compress many types of data, including text, images, and audio, at rates that match or beat general-purpose compressors like gzip and specialized formats like PNG.
Why Should We Care About Compression?
Data compression is a fundamental capability in computing and AI. Compressing data means we can store and transmit it using less memory, disk space, and bandwidth. This saves costs and allows systems to scale.
But more importantly, good compression also indicates a deep understanding of the structure and patterns in data. To compress well, an algorithm needs to spot redundancies and exploit statistical regularities. So compression capability acts as a benchmark for how much knowledge an AI system has learned.
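As a rough illustration of what "statistical regularities" means here, the sketch below computes the empirical entropy of a byte string, which is the average number of bits per byte that an ideal coder would need if it only knew how often each byte value occurs. This is not taken from the paper, and the function name is made up for the example. It also ignores the order of the bytes, which is precisely the structure that stronger predictive models exploit.

```python
import math
from collections import Counter


def empirical_entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 Shannon entropy of the byte histogram, in bits per byte."""
    n = len(data)
    counts = Counter(data)
    # Each byte value with relative frequency p contributes -p * log2(p) bits.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# One repeated byte is perfectly predictable; 256 equally frequent byte
# values are the least predictable case for a frequency-only model.
redundant = empirical_entropy_bits_per_byte(b"a" * 1024)
uniform = empirical_entropy_bits_per_byte(bytes(range(256)) * 4)
print(f"redundant: {redundant:.3f} bits/byte, uniform: {uniform:.3f} bits/byte")
```

The lower the entropy under a given model, the fewer bits a coder driven by that model needs.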
The fact that huge natural language models can compress varied data types so efficiently has major implications:
- It demonstrates they have learned general abilities beyond just processing language.
- Their skill at compression suggests they capture statistical structure in images and audio, not only in text.
- There is potential to apply them to practical compression tasks.
How Was the Research Conducted?
The DeepMind researchers tested the compression capabilities of different-sized language models on three different 1GB datasets:
- Text - The first 1 billion bytes of Wikipedia.
- Images - Roughly 488,000 grayscale 32x64px patches (2,048 bytes each, about 1GB in total) extracted from ImageNet.
- Audio - Speech samples from the LibriSpeech dataset.
They compared the models against standard compressors: the general-purpose gzip and LZMA, and the specialized lossless formats PNG (for images) and FLAC (for audio).
The language models compress using arithmetic coding, a technique that turns any predictive model into a lossless compressor. A symbol that the model assigns probability p costs about -log2(p) bits, so the more accurately a model can predict the next byte in a file, the fewer bits it needs to encode the data.
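To make the link between prediction and compression concrete, here is a minimal sketch of arithmetic coding in Python. It is not the paper's implementation: a simple adaptive frequency model (the hypothetical `AdaptiveModel` class) stands in for the language model, exact fractions replace the fixed-precision arithmetic a practical coder would use, and the decoder is assumed to know the message length in advance.

```python
import math
from fractions import Fraction


class AdaptiveModel:
    """Toy predictor: symbol probabilities from running counts (add-one smoothing)."""

    def __init__(self, alphabet):
        self.alphabet = sorted(alphabet)
        self.counts = {s: 1 for s in self.alphabet}

    def intervals(self):
        """Map each symbol to its slice [low, high) of the unit interval."""
        total = sum(self.counts.values())
        low, table = Fraction(0), {}
        for s in self.alphabet:
            high = low + Fraction(self.counts[s], total)
            table[s] = (low, high)
            low = high
        return table

    def update(self, symbol):
        self.counts[symbol] += 1


def encode(message, alphabet):
    """Narrow [0, 1) once per symbol, then name a point inside with few bits."""
    model = AdaptiveModel(alphabet)
    low, width = Fraction(0), Fraction(1)
    for s in message:
        lo, hi = model.intervals()[s]
        low, width = low + width * lo, width * (hi - lo)
        model.update(s)
    nbits = 0
    while True:  # shortest binary fraction k / 2**nbits inside the interval
        k = math.ceil(low * 2**nbits)
        if Fraction(k, 2**nbits) < low + width:
            return k, nbits
        nbits += 1


def decode(k, nbits, length, alphabet):
    """Replay the same model updates to recover the symbols."""
    model = AdaptiveModel(alphabet)
    x = Fraction(k, 2**nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        target = (x - low) / width
        for s, (lo, hi) in model.intervals().items():
            if lo <= target < hi:
                out.append(s)
                low, width = low + width * lo, width * (hi - lo)
                model.update(s)
                break
    return "".join(out)


message = "abracadabra" * 3
alphabet = set(message)
code, nbits = encode(message, alphabet)
restored = decode(code, nbits, len(message), alphabet)
print(f"{nbits} bits encoded vs {8 * len(message)} bits raw")
```

Likely symbols get wide slices, so they shrink the interval less and cost fewer bits. Swapping in a better predictor shortens the output without changing the coder.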
They tested three main types of compressors:
- Smaller Transformer models trained from scratch on Wikipedia text.
- Larger foundation models like Chinchilla-70B pretrained on huge text datasets.
- As a baseline, general-purpose compressors like gzip and LZMA.
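For a sense of how the classical baselines are measured, the sketch below computes a compression rate (compressed size divided by raw size, lower is better) using Python's standard `gzip` and `lzma` modules. The sample inputs are invented for illustration and are not the paper's datasets, and the helper name `compression_rate` is an assumption of this example.

```python
import gzip
import lzma
import os


def compression_rate(raw: bytes, compress) -> float:
    """Compressed size as a fraction of raw size (lower is better)."""
    return len(compress(raw)) / len(raw)


# Highly repetitive text versus random bytes of the same length.
sample = b"the quick brown fox jumps over the lazy dog. " * 200
noise = os.urandom(len(sample))

compressors = {"gzip": gzip.compress, "lzma": lzma.compress}
rates = {
    name: (compression_rate(sample, fn), compression_rate(noise, fn))
    for name, fn in compressors.items()
}
for name, (text_rate, noise_rate) in rates.items():
    print(f"{name}: repetitive text {text_rate:.1%}, random bytes {noise_rate:.1%}")
```

Random bytes contain no pattern to exploit, so no compressor can shrink them. The repetitive text shrinks dramatically.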
Key Technical Findings
The experiments yielded several insightful results:
- Despite being trained primarily on text, the foundation models compressed all modalities better than methods specialized for each domain. For example, Chinchilla-70B compressed ImageNet patches to 43.4% of their raw size, compared with 58.5% for PNG.
- Scaling has limits once model size is counted: Bigger models compressed better, but only up to a point. When the size of the model's parameters is added to the compressed output, very large models cost more space than they save on a 1GB dataset.
- The best model size depends on the size of the data being compressed. Larger datasets justify larger models, because the fixed cost of the parameters is spread over more bytes.
- Tokenization like BPE, while useful for language tasks, generally decreased compression performance slightly. A larger vocabulary shortens the sequence, but it also makes each prediction harder because the model must choose among more symbols.
- Longer contexts improved compression, as models could exploit more sequential dependencies.
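The model-size tradeoff in the findings above can be sketched with a few lines of arithmetic. The example below assumes an "adjusted" compression rate that adds the model's parameter bytes to the compressed output, with each parameter stored in 2 bytes (float16). The function names are made up for illustration, and the break-even estimate further assumes the raw rate stays the same as the dataset grows, which is a simplification.

```python
def adjusted_compression_rate(compressed_bytes: float, model_bytes: float,
                              raw_bytes: float) -> float:
    """Compression rate when the model must be shipped along with the data."""
    return (compressed_bytes + model_bytes) / raw_bytes


def break_even_dataset_bytes(raw_rate: float, model_bytes: float) -> float:
    """Dataset size at which the adjusted rate drops to 1.0 (no net saving).

    Assumes the raw rate does not change as the dataset grows.
    """
    return model_bytes / (1.0 - raw_rate)


raw_bytes = 1e9                 # one 1GB dataset
raw_rate = 0.434                # Chinchilla-70B on ImageNet patches, from the text
model_bytes = 70e9 * 2          # 70B parameters at an assumed 2 bytes each

adjusted = adjusted_compression_rate(raw_rate * raw_bytes, model_bytes, raw_bytes)
break_even = break_even_dataset_bytes(raw_rate, model_bytes)
print(f"adjusted rate: {adjusted:.1%}")
print(f"break-even dataset size: {break_even / 1e9:.0f} GB")
```

Under these assumptions the parameters dwarf a 1GB dataset, which is why the best model size depends on how much data there is to compress.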
These findings have significant implications:
- They demonstrate language models have learned very general capabilities beyond just text. Their versatility likely stems from pretraining on vast datasets.
- The models' strong compression across modalities reflects an understanding of images, audio and more at a deep statistical level.
- There are inherent tradeoffs between model scale, dataset size, and compression performance. Bigger datasets justify bigger models, but the model must be sized to the data it compresses.
- The results provide a new perspective on model scaling laws: unlike log loss, compression accounts for the size of the model itself, so scaling eventually hits a limit.
- The equivalence between prediction and compression means these models could have practical applications for compressing images, audio and other data. However, model size may be prohibitive compared to current methods.
- The compression viewpoint offers new insights into model generalization, failure modes, tokenization, and other aspects of deep learning.
In summary, this research shows large language models have become adept general-purpose learners. Their exceptional compression capabilities demonstrate an expansive understanding of patterns in textual, visual and audio data. There is still progress to be made, but these models show increasing competence as general systems for automating prediction and compression across modalities.