Large transformer decoders have demonstrated game-changing performance on short-sequence processing (up to several thousand tokens of context), but they scale poorly to images, books, and videos, where sequences can run into the millions of bytes. This limitation has become a bottleneck for many real-world transformer applications.
In the new paper MegaByte: Predicting Million-Byte Sequences with Multiscale Transformers, a Meta AI research team presents MegaByte, a multiscale decoder architecture that enables million-byte sequence modelling.
MegaByte comprises three main components: 1) a patch embedder that encodes a patch by concatenating embeddings of each byte; 2) a large global transformer that contextualizes patch representations via a self-attention mechanism; and 3) a smaller local transformer that takes the patch representations as inputs and autoregressively predicts the next patch.
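For readers who want a concrete picture of how the three components fit together, below is a minimal PyTorch sketch. The class and parameter names, the layer sizes, and the use of nn.TransformerEncoder with causal masks as a stand-in for the paper's decoder stacks are our illustrative assumptions, not the authors' released code.

```python
# Minimal, illustrative sketch of the three MegaByte components.
# All names and sizes are assumptions for readability, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MegaByteSketch(nn.Module):
    def __init__(self, vocab=256, patch=8, d_global=512, d_local=128, max_len=1024):
        super().__init__()
        self.patch = patch
        d_byte = d_global // patch
        # 1) Patch embedder: embed each byte, then concatenate the byte
        #    embeddings within a patch into a single patch representation.
        self.byte_emb = nn.Embedding(vocab, d_byte)
        self.pos_emb = nn.Embedding(max_len, d_byte)
        # 2) Large global transformer contextualizing patch representations.
        self.global_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, nhead=8, batch_first=True), num_layers=6)
        # 3) Smaller local transformer predicting the bytes of the next patch.
        self.g_to_l = nn.Linear(d_byte, d_local)
        self.local_emb = nn.Embedding(vocab, d_local)
        self.local_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_local, vocab)

    def forward(self, x):                       # x: (B, T) byte ids, T % patch == 0
        B, T = x.shape
        K, P = T // self.patch, self.patch
        pos = torch.arange(T, device=x.device)
        h = self.byte_emb(x) + self.pos_emb(pos)              # (B, T, d_byte)
        patches = h.reshape(B, K, P * h.size(-1))             # concat -> (B, K, d_global)
        # Shift right by one patch so patch k only conditions on patches < k.
        patches = F.pad(patches, (0, 0, 1, 0))[:, :-1]
        mask = nn.Transformer.generate_square_subsequent_mask(K).to(x.device)
        g = self.global_tf(patches, mask=mask)                # (B, K, d_global)
        # Split each contextualized patch back into P per-byte slots and add
        # the embedding of the previous byte (shifted right by one).
        g = self.g_to_l(g.reshape(B, K, P, -1))               # (B, K, P, d_local)
        prev = F.pad(x, (1, 0))[:, :-1]                       # previous-byte ids
        l_in = (g + self.local_emb(prev).reshape(B, K, P, -1)).reshape(B * K, P, -1)
        l_mask = nn.Transformer.generate_square_subsequent_mask(P).to(x.device)
        out = self.local_tf(l_in, mask=l_mask)                # patches run in parallel
        return self.head(out).reshape(B, T, -1)               # next-byte logits
```

Note how the local transformer operates on each patch independently (the B * K batch dimension), which is what makes parallel decoding of patches possible.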
The team summarizes MegaByte’s three major architectural improvements over transformers as follows:
- Sub-quadratic self-attention: MegaByte decomposes a long sequence into two shorter sequences, and with an optimal patch size the self-attention cost drops from O(N²) to O(N^(4/3)), which remains tractable even for very long sequences (a worked example follows below).
- Per-patch feedforward layers: MegaByte uses large feedforward layers per-patch rather than per-position, enabling much larger and more expressive models for the same cost.
- Parallelism in decoding: by generating the representations for patches in parallel, MegaByte allows greater parallelism during generation.
These improvements make it possible to train larger, better-performing models at the same compute cost, to scale to extremely long sequences, and to speed up generation during deployment.
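To make the attention savings concrete, here is a quick back-of-the-envelope calculation of our own, using the cost decomposition described above, for a million-byte sequence:

```python
# Rough self-attention cost (number of query-key pairs) for a 1M-byte
# sequence. The O(...) decomposition follows the paper; the script itself
# is only an illustration.
T = 1_000_000                        # sequence length in bytes
P = round(T ** (1 / 3))              # near-optimal patch size, P ~ T^(1/3) = 100
vanilla = T ** 2                     # single transformer: O(T^2)
global_cost = (T // P) ** 2          # attention over T/P patches: O((T/P)^2)
local_cost = (T // P) * P ** 2       # attention within each patch: O(T * P)
print(f"vanilla:  {vanilla:.2e}")                    # 1.00e+12
print(f"megabyte: {global_cost + local_cost:.2e}")   # 2.00e+08, ~5000x fewer
```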
In their empirical study, the team compared MegaByte with a standard decoder-only transformer and the autoregressive, modality-agnostic PerceiverAR architecture on a range of long-sequence modelling tasks spanning text, images, and audio. In the experiments, MegaByte performed competitively with subword models on long-context text, achieved state-of-the-art perplexities for density estimation on ImageNet, and enabled efficient audio modelling from raw files.
This work demonstrates MegaByte’s ability to effectively process million-byte sequences. The team believes their approach could enable byte-level models to replace tokenization in autoregressive long-sequence modelling, and suggests future work should explore scaling MegaByte to much larger models and datasets.
The paper MegaByte: Predicting Million-Byte Sequences with Multiscale Transformers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen