
Microsoft’s LongNet Scales Transformer to One Billion Tokens

In a new paper, LongNet: Scaling Transformers to 1,000,000,000 Tokens, a Microsoft research team presents LongNet, a Transformer variant that scales sequence length to more than one billion tokens while maintaining strong performance and linear computational complexity.

Scaling sequence length is of paramount importance for large language models, as it brings significant benefits. These include a larger memory and receptive field for more effective interaction with humans, more intricate causality and reasoning pathways that models can exploit in the training data, and the potential to overcome the limitations of in-context learning.

In their recent paper LongNet: Scaling Transformers to 1,000,000,000 Tokens, a Microsoft research team introduces LongNet. This Transformer variant scales sequence length to more than one billion tokens while maintaining strong performance and linear computational complexity.

The challenge in scaling up sequence length is to strike a balance between computational complexity and model expressivity. The team's solution is LongNet, which replaces the attention of vanilla Transformers with dilated attention, a novel component that splits the query, key, and value inputs into equal segments of a given segment length. Each segment is then sparsified along the sequence dimension by selecting rows at a fixed interval, and the sparsified segments are fed into attention in parallel. Finally, the results are scattered back to their original positions and concatenated to form the output.
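The segment, sparsify, attend, and scatter steps described above can be illustrated with a minimal sketch in plain Python. It handles a single segment length and a single dilation rate, selects every `dilation`-th row starting at each segment's first position, and leaves unselected positions as zero rows. These choices, along with the function names `attention` and `dilated_attention`, are assumptions made for illustration and are not taken from the LongNet code base.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Dense scaled dot-product attention over lists of row vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def dilated_attention(q, k, v, segment_len, dilation):
    """Illustrative dilated attention for one (segment_len, dilation) pair.

    Assumption: rows that the dilation skips keep an all-zero output.
    """
    n = len(q)
    out = [[0.0] * len(v[0]) for _ in range(n)]
    for start in range(0, n, segment_len):
        end = min(start + segment_len, n)
        # Gather: keep every `dilation`-th row of this segment.
        idx = list(range(start, end, dilation))
        seg_out = attention([q[i] for i in idx],
                            [k[i] for i in idx],
                            [v[i] for i in idx])
        # Scatter: write the results back to their original positions.
        for i, row in zip(idx, seg_out):
            out[i] = row
    return out


# Toy input: 8 positions, 2 features each.
x = [[float(i), 1.0] for i in range(8)]
print(dilated_attention(x, x, x, segment_len=4, dilation=2))
```

With `segment_len` equal to the sequence length and `dilation=1`, the function reduces to ordinary dense attention, which makes for a simple sanity check.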

With this design, dilated attention can be expressed as dense attention placed between a gathering operation over the input and a scattering operation over the output, so it can directly reuse optimizations developed for vanilla attention. It also significantly reduces the computation cost relative to vanilla attention, bringing it down to linear complexity in the sequence length.
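To see where the linear complexity comes from, it helps to count how many query-key pairs are scored. The sketch below assumes a single segment length and dilation rate, with the sequence length divisible by the segment length and the segment length divisible by the dilation rate. The function names are hypothetical, and the counts ignore constant factors such as the hidden dimension.

```python
def dense_pairs(n):
    """Query-key pairs scored by dense attention over n tokens."""
    return n * n


def dilated_pairs(n, segment_len, dilation):
    """Query-key pairs scored by dilated attention for one (segment_len, dilation).

    Assumes n % segment_len == 0 and segment_len % dilation == 0.
    """
    segments = n // segment_len        # number of independent segments
    rows = segment_len // dilation     # rows kept per segment after sparsifying
    return segments * rows * rows      # dense attention inside each segment


# Doubling n quadruples the dense count but only doubles the dilated count.
for n in (4096, 8192, 16384):
    print(n, dense_pairs(n), dilated_pairs(n, segment_len=2048, dilation=4))
```

For a fixed segment length and dilation rate, the dilated count equals `n * segment_len / dilation**2`, so it grows linearly with `n`.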

Next, the team takes advantage of LongNet's linear computational complexity for distributed training. They first distribute training across two GPU devices and then scale it to an arbitrary number of devices. Notably, in contrast to vanilla attention, the sizes of both the key and the value are independent of the sequence length, so the communication cost is constant.

In their empirical study, the team recorded the runtime of vanilla attention and the proposed dilated attention. The results show that dilated attention scales up the sequence length with almost constant latency, which supports the feasibility of scaling to one billion tokens.

The researchers also compared LongNet with both vanilla and sparse Transformers. LongNet consistently outperforms these baseline models on both short and long sequences while reducing the computational complexity from quadratic to linear.

The code is available on the project's GitHub. The paper LongNet: Scaling Transformers to 1,000,000,000 Tokens is on arXiv.

Author: Hecate He | Editor: Chain Zhang
