AI Machine Learning & Data Science Research

Microsoft’s LongNet Scales Transformer to One Billion Tokens

In a new paper LongNet: Scaling Transformers to 1,000,000,000 Tokens, a Microsoft research team presents LongNet, a Transformer variant that successfully scales sequence length to more than one billion tokens while maintaining strong performance and linear computational complexity.

Scaling sequence length is of paramount importance for large language models, as it brings significant benefits. These advantages include a large memory and receptive field for more effective human communication, intricate causality and reasoning pathways to leverage the training data, and the potential to overcome the limitations of in-context learning.

To this end, in the paper LongNet: Scaling Transformers to 1,000,000,000 Tokens, the Microsoft research team introduces LongNet, a Transformer variant that scales sequence length to more than one billion tokens while maintaining strong performance and linear computational complexity.

The challenge in scaling up sequence length is to strike a balance between computational complexity and model expressivity. This work's solution is LongNet, which replaces the attention of vanilla Transformers with dilated attention, a novel component that splits the input query-key-value pairs into equal segments of a given segment length. Each segment is then sparsified along the sequence dimension, and the sparsified segments are fed into attention in parallel. Finally, the outputs are scattered back and concatenated.
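To make this concrete, below is a minimal, single-branch sketch of the idea in PyTorch, assuming one (segment length w, dilation rate r) pair and inputs shaped (batch, sequence, dim). The names, shapes, and hyperparameters are our own illustration rather than the authors' reference implementation; the actual LongNet mixes several (w, r) pairs with different offsets.

```python
# Minimal sketch of one dilated-attention branch: split into segments,
# sparsify each segment along the sequence dimension, attend in parallel,
# then scatter the outputs back to their original positions.
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, w=2048, r=4):
    b, n, d = q.shape
    assert n % w == 0 and w % r == 0
    # 1) Split the sequence into segments of length w: (b, n // w, w, d)
    q, k, v = (x.view(b, n // w, w, d) for x in (q, k, v))
    # 2) Sparsify each segment, keeping every r-th row: (b, n // w, w // r, d)
    q, k, v = (x[:, :, ::r, :].contiguous() for x in (q, k, v))
    # 3) Dense attention inside every sparsified segment, computed in parallel.
    out = F.scaled_dot_product_attention(q, k, v)
    # 4) Scatter outputs back to their original positions and concatenate.
    full = torch.zeros(b, n // w, w, d, dtype=out.dtype, device=out.device)
    full[:, :, ::r, :] = out
    return full.reshape(b, n, d)
```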

With this design, dilated attention can be expressed as dense attention sandwiched between a gathering operation over the input and a scattering operation over the output. Existing optimizations for vanilla attention can therefore be reused directly, while the computation cost is significantly reduced, from quadratic to linear in the sequence length, by a constant factor relative to vanilla attention.
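As a back-of-the-envelope illustration of that factor (our own arithmetic, not figures from the paper): the score computation of dense attention costs roughly N² · d operations, while one dilated branch costs (N/w) · (w/r)² · d = N · w · d / r², which is linear in N for fixed w and r.

```python
# Rough FLOP comparison between dense and dilated attention scores.
# The constants (w=2048, r=4, d=1024) are illustrative assumptions.
def dense_attn_flops(n, d):
    return n * n * d                      # QK^T scores dominate: N^2 * d

def dilated_attn_flops(n, d, w=2048, r=4):
    segments = n // w                     # N / w segments
    rows = w // r                         # w / r rows kept per segment
    return segments * rows * rows * d     # (N / w) * (w / r)^2 * d = N * w * d / r^2

for n in (2**17, 2**20, 2**23):
    # The speedup ratio grows linearly with sequence length.
    print(n, dense_attn_flops(n, d=1024) / dilated_attn_flops(n, d=1024))
```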

Next, the team takes advantage of LongNet's linear computation complexity for distributed training. They distribute training across two GPU devices and then scale it to an arbitrary number of devices. Notably, in contrast to vanilla attention, the sizes of both keys and values are independent of the sequence length, so the communication cost remains constant.
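The constant communication cost can be illustrated with a conceptual, single-process simulation (plain tensors stand in for GPUs; real training would use distributed collectives such as all_gather). Because the dilation rate grows with the segment length in LongNet, each device's sparsified key/value contribution keeps a fixed number of rows regardless of how many tokens it holds locally. The sizes below are illustrative assumptions.

```python
# Two "devices" each hold a local chunk of the sequence. For a segment that
# spans both devices, the dilation rate r is scaled with the segment length,
# so the sparsified key/value block each device would communicate has a
# fixed number of rows, independent of the per-device sequence length.
import torch

batch, dim = 1, 64
seq_per_device = 1 << 16                              # tokens held by each device
devices = [torch.randn(batch, seq_per_device, dim) for _ in range(2)]

r = 2 * seq_per_device // 2048                        # dilation rate scaled with the global segment
kv_pieces = [chunk[:, ::r, :] for chunk in devices]   # local sparsification before the exchange
global_kv = torch.cat(kv_pieces, dim=1)               # stand-in for the all_gather step
print([p.shape for p in kv_pieces], global_kv.shape)  # each piece stays (1, 1024, 64)
```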

In their empirical study, the team recorded the runtime of vanilla attention and the proposed dilated attention. The results show that dilated attention scales up the sequence length with nearly constant latency, verifying the feasibility of scaling to one billion tokens.

The researchers also compared LongNet with both vanilla Transformers and sparse Transformers. LongNet consistently surpasses these baselines on both short and long sequences while reducing the computation complexity from quadratic to linear.

The code is available on the project's GitHub. The paper LongNet: Scaling Transformers to 1,000,000,000 Tokens is on arXiv.


Author: Hecate He | Editor: Chain Zhang

