Synced

Meta AI’s MegaByte Scalable Architecture for Long Sequence Modelling Outperforms Existing Byte-Level Models

Large transformer decoders have demonstrated game-changing performance on short-sequence processing (up to several thousand tokens of context), but they scale poorly to images, books and videos, where sequences can run into the millions of bytes. This limitation has become a bottleneck for many real-world transformer applications.

In the new paper MegaByte: Predicting Million-Byte Sequences with Multiscale Transformers, a Meta AI research team presents MegaByte, a multiscale decoder architecture that enables million-byte sequence modelling.

MegaByte comprises three main components: 1) a patch embedder that encodes a patch by concatenating embeddings of each byte; 2) a large global transformer that contextualizes patch representations via a self-attention mechanism; and 3) a smaller local transformer that takes the patch representations as inputs and autoregressively predicts the next patch.
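The patch-embedding step described above can be sketched in a few lines of numpy. This is a minimal illustration, not Meta AI's implementation; the sizes (`SEQ_LEN`, `PATCH`, `D_BYTE`) are made-up toy values chosen only to show the shapes involved.

```python
import numpy as np

# Toy sizes for illustration only (not from the paper).
SEQ_LEN = 24               # total bytes T in the sequence
PATCH = 8                  # bytes per patch P
D_BYTE = 4                 # per-byte embedding dimension
D_GLOBAL = PATCH * D_BYTE  # global model dim = concatenated byte embeddings

rng = np.random.default_rng(0)
byte_ids = rng.integers(0, 256, size=SEQ_LEN)     # a raw byte sequence
embed_table = rng.standard_normal((256, D_BYTE))  # byte embedding table

# 1) Patch embedder: embed each byte, then concatenate the embeddings
#    of the P bytes in each patch into one patch representation.
byte_embs = embed_table[byte_ids]                        # (T, D_BYTE)
patches = byte_embs.reshape(SEQ_LEN // PATCH, D_GLOBAL)  # (T/P, P*D_BYTE)

# 2) The global transformer would contextualize these T/P patch tokens
#    with causal self-attention; 3) the smaller local transformer would
#    then autoregressively predict the P bytes of each next patch.
print(patches.shape)  # (3, 32)
```

Because the global model sees only `T/P` patch tokens rather than `T` byte tokens, the expensive contextualization step operates on a much shorter sequence.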

The team summarizes MegaByte’s three major architectural improvements over transformers as follows:

  1. Sub-quadratic self-attention: MegaByte decomposes long sequences into two shorter sequences, and with optimal patch sizes the self-attention cost drops from quadratic to roughly O(N^(4/3)), which remains tractable even for very long sequences.
  2. Per-patch feedforward layers: MegaByte uses large feedforward layers per-patch rather than per-position, enabling much larger and more expressive models for the same cost.
  3. Parallelism in Decoding: By generating representations for patches in parallel, MegaByte allows greater parallelism during generation.
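The sub-quadratic attention claim in point 1 can be checked with a back-of-the-envelope calculation: the global model attends over T/P patches (cost (T/P)^2) and the local model attends within each of the T/P patches of P bytes (cost (T/P) * P^2). The sketch below compares this decomposed cost with full byte-level self-attention; the cost model ignores constant factors and is an illustration, not a measurement from the paper.

```python
# Rough operation counts for causal self-attention over T bytes.
def full_attention_cost(T):
    # A single byte-level transformer attends over all T positions.
    return T ** 2

def megabyte_attention_cost(T, P):
    # Global model: (T/P)^2 over patches; local model: P^2 within
    # each of the T/P patches.
    num_patches = T // P
    return num_patches ** 2 + num_patches * P ** 2

T = 1_000_000            # a million-byte sequence
P = round(T ** (1 / 3))  # patch size near T^(1/3), close to the optimum

print(full_attention_cost(T))        # 1_000_000_000_000
print(megabyte_attention_cost(T, P)) # 200_000_000, thousands of times smaller
```

Choosing P near T^(1/3) balances the global and local terms, which is where the O(N^(4/3)) overall scaling comes from.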

These improvements enable training larger and better-performing models with the same compute cost, scaling to extremely long sequences, and generation speed-ups during deployment.

In their empirical study, the team compared MegaByte with a standard decoder-only transformer and the autoregressive, modality-agnostic PerceiverAR architecture on a range of long-text datasets. In the experiments, MegaByte performed competitively with subword models, achieved state-of-the-art perplexities for density estimation on ImageNet, and enabled efficient audio modelling from raw files.

This work demonstrates MegaByte’s ability to effectively process million-byte sequences. The team believes their approach could enable byte-level models to replace tokenization in autoregressive long-sequence modelling, and suggests future work should explore scaling MegaByte to much larger models and datasets.

The paper MegaByte: Predicting Million-Byte Sequences with Multiscale Transformers is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
