Large transformer decoders have demonstrated game-changing performance on short-sequence processing (up to several thousand tokens of context), but they scale poorly to images, books, and videos, where sequences can climb into the millions of bytes. This limitation has become a bottleneck for many real-world transformer applications.
In the new paper MegaByte: Predicting Million-Byte Sequences with Multiscale Transformers, a Meta AI research team presents MegaByte, a multiscale decoder architecture that enables million-byte sequence modelling.
MegaByte comprises three main components: 1) a patch embedder that encodes a patch by concatenating embeddings of each byte; 2) a large global transformer that contextualizes patch representations via a self-attention mechanism; and 3) a smaller local transformer that takes the patch representations as inputs and autoregressively predicts the next patch.
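The three-stage data flow above can be sketched with toy stand-ins. This is a minimal illustration, not the paper's implementation: the patch size, embedding dimension, and the single-matrix "attention" function are all hypothetical simplifications chosen only to show how bytes become patch representations that a global model contextualizes.

```python
import numpy as np

def patch_embed(byte_ids, emb_table, patch_size):
    # Component 1: embed each byte, then concatenate the embeddings
    # of the bytes inside a patch into one patch vector.
    emb = emb_table[byte_ids]                                # (T, D)
    return emb.reshape(len(byte_ids) // patch_size,
                       patch_size * emb.shape[1])            # (T/P, P*D)

def toy_self_attention(x):
    # Stand-in for a transformer block: softmax(x x^T / sqrt(d)) x.
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
P, D = 4, 8                                  # hypothetical patch size / byte-embedding dim
emb_table = rng.normal(size=(256, D))        # one embedding per possible byte value

byte_ids = rng.integers(0, 256, size=16)     # a 16-byte sequence -> 4 patches
patches = patch_embed(byte_ids, emb_table, P)   # Component 1: (4, 32)
global_out = toy_self_attention(patches)        # Component 2: global model over patches
# Component 3 (omitted): a smaller local model would consume each
# contextualized patch row and autoregressively predict its P bytes.
print(patches.shape, global_out.shape)          # (4, 32) (4, 32)
```

The key structural point is that the expensive global model sees a sequence of length T/P rather than T; the byte-level work is pushed into the cheap local model.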
The team summarizes MegaByte’s three major architectural improvements over transformers as follows:
- Sub-quadratic self-attention: MegaByte decomposes a long sequence into two much shorter sequences, and with an optimal patch size the self-attention cost remains tractable even for very long sequences.
- Per-patch feedforward layers: MegaByte uses large feedforward layers per-patch rather than per-position, enabling much larger and more expressive models for the same cost.
- Parallelism in decoding: By generating representations for patches in parallel, MegaByte allows greater parallelism during generation.
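The sub-quadratic claim in the first bullet can be made concrete with a back-of-the-envelope count. A vanilla decoder pays roughly N² attention operations on a length-N sequence; MegaByte instead pays (N/P)² for the global model over patches plus (N/P)·P² for the local model within each patch. The paper notes that a patch size on the order of N^(1/3) brings the total to O(N^(4/3)). The arithmetic below is only an operation-count sketch, not a wall-clock benchmark:

```python
def attention_cost(n, p):
    # Global model attends over n/p patches; local model attends
    # within each of the n/p patches of length p.
    k = n // p
    return k * k + k * p * p          # (n/p)^2 + (n/p) * p^2

n = 1_000_000                          # a million-byte sequence
full = n * n                           # vanilla quadratic attention
best_p = round(n ** (1 / 3))           # optimal patch size ~ n^(1/3), here 100
print(full // attention_cost(n, best_p))   # → 5000
```

At a million bytes, the patched cost is roughly 5,000× smaller than full quadratic attention, which is why byte-level modelling at this scale becomes feasible at all.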
These improvements enable training larger and better-performing models with the same compute cost, scaling to extremely long sequences, and generation speed-ups during deployment.
In their empirical study, the team compared MegaByte with a standard decoder-only transformer and the autoregressive, modality-agnostic PerceiverAR architecture on a range of long-text datasets. In the experiments, MegaByte performed competitively with subword models, achieved state-of-the-art perplexities for density estimation on ImageNet, and enabled efficient audio modelling from raw files.
This work demonstrates MegaByte’s ability to effectively process million-byte sequences. The team believes their approach could enable byte-level models to replace tokenization in autoregressive long-sequence modelling, and suggests future work should explore scaling MegaByte to much larger models and datasets.
The paper MegaByte: Predicting Million-Byte Sequences with Multiscale Transformers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.