AI Machine Learning & Data Science Research

Microsoft’s LongNet Scales Transformer to One Billion Tokens

In a new paper LongNet: Scaling Transformers to 1,000,000,000 Tokens, a Microsoft research team presents LongNet, a Transformer variant that successfully scales sequence length to more than one billion tokens while maintaining strong performance and linear computational complexity.

Scaling sequence length is of paramount importance for large language models, as it brings significant benefits. These advantages include a large memory and receptive field for more effective human communication, intricate causality and reasoning pathways to leverage the training data, and the potential to overcome the limitations of in-context learning.

To this end, in the paper LongNet: Scaling Transformers to 1,000,000,000 Tokens, the Microsoft research team introduces LongNet, a Transformer variant that scales sequence length to more than one billion tokens while maintaining strong performance and linear computational complexity.

The challenge in scaling up sequence length is to strike a balance between computational complexity and model expressivity. This work's solution is LongNet, which replaces the attention of vanilla Transformers with dilated attention, a novel component that splits the input query-key-value pairs into equal segments of a given segment length. Each segment is then sparsified along the sequence dimension, and the sparsified segments are fed into attention in parallel. Finally, the outputs are scattered back and concatenated.
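To make this concrete, below is a minimal, single-branch sketch of the idea in PyTorch, assuming one (segment length w, dilation rate r) pair and inputs shaped (batch, sequence, dim). The names, shapes, and hyperparameters are our own illustration rather than the authors' reference implementation; the actual LongNet mixes several (w, r) pairs with different offsets.

```python
# Minimal sketch of one dilated-attention branch: split into segments,
# sparsify each segment along the sequence dimension, attend in parallel,
# then scatter the outputs back to their original positions.
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, w=2048, r=4):
    b, n, d = q.shape
    assert n % w == 0 and w % r == 0
    # 1) Split the sequence into segments of length w: (b, n // w, w, d)
    q, k, v = (x.view(b, n // w, w, d) for x in (q, k, v))
    # 2) Sparsify each segment, keeping every r-th row: (b, n // w, w // r, d)
    q, k, v = (x[:, :, ::r, :].contiguous() for x in (q, k, v))
    # 3) Dense attention inside every sparsified segment, computed in parallel.
    out = F.scaled_dot_product_attention(q, k, v)
    # 4) Scatter outputs back to their original positions and concatenate.
    full = torch.zeros(b, n // w, w, d, dtype=out.dtype, device=out.device)
    full[:, :, ::r, :] = out
    return full.reshape(b, n, d)
```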

With this design, dilated attention can be expressed as dense attention sandwiched between a gathering operation over the input and a scattering operation over the output. Existing optimizations for vanilla attention can therefore be reused directly, while the computation cost is significantly reduced, from quadratic to linear in the sequence length, by a constant factor relative to vanilla attention.
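As a back-of-the-envelope illustration of that factor (our own arithmetic, not figures from the paper): the score computation of dense attention costs roughly N² · d operations, while one dilated branch costs (N/w) · (w/r)² · d = N · w · d / r², which is linear in N for fixed w and r.

```python
# Rough FLOP comparison between dense and dilated attention scores.
# The constants (w=2048, r=4, d=1024) are illustrative assumptions.
def dense_attn_flops(n, d):
    return n * n * d                      # QK^T scores dominate: N^2 * d

def dilated_attn_flops(n, d, w=2048, r=4):
    segments = n // w                     # N / w segments
    rows = w // r                         # w / r rows kept per segment
    return segments * rows * rows * d     # (N / w) * (w / r)^2 * d = N * w * d / r^2

for n in (2**17, 2**20, 2**23):
    # The speedup ratio grows linearly with sequence length.
    print(n, dense_attn_flops(n, d=1024) / dilated_attn_flops(n, d=1024))
```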

Next, the team takes advantage of LongNet's linear computation complexity for distributed training. They distribute training across two GPU devices and then scale it to an arbitrary number of devices. Notably, in contrast to vanilla attention, the sizes of both keys and values are independent of the sequence length, so the communication cost remains constant.
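The constant communication cost can be illustrated with a conceptual, single-process simulation (plain tensors stand in for GPUs; real training would use distributed collectives such as all_gather). Because the dilation rate grows with the segment length in LongNet, each device's sparsified key/value contribution keeps a fixed number of rows regardless of how many tokens it holds locally. The sizes below are illustrative assumptions.

```python
# Two "devices" each hold a local chunk of the sequence. For a segment that
# spans both devices, the dilation rate r is scaled with the segment length,
# so the sparsified key/value block each device would communicate has a
# fixed number of rows, independent of the per-device sequence length.
import torch

batch, dim = 1, 64
seq_per_device = 1 << 16                              # tokens held by each device
devices = [torch.randn(batch, seq_per_device, dim) for _ in range(2)]

r = 2 * seq_per_device // 2048                        # dilation rate scaled with the global segment
kv_pieces = [chunk[:, ::r, :] for chunk in devices]   # local sparsification before the exchange
global_kv = torch.cat(kv_pieces, dim=1)               # stand-in for the all_gather step
print([p.shape for p in kv_pieces], global_kv.shape)  # each piece stays (1, 1024, 64)
```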

In their empirical study, the team recorded the runtime of vanilla attention and the proposed dilated attention. The results show that dilated attention scales up the sequence length with nearly constant latency, verifying the feasibility of scaling to one billion tokens.

The researchers also compared LongNet with both vanilla Transformers and sparse Transformers. LongNet consistently surpasses these baselines on both short and long sequences while reducing the computation complexity from quadratic to linear.

The code is available on the project's GitHub. The paper LongNet: Scaling Transformers to 1,000,000,000 Tokens is on arXiv.


Author: Hecate He | Editor: Chain Zhang

