The rapid progress of large language models (LLMs) has greatly influenced natural language processing (NLP), driving advancements across numerous applications. However, LLM training is typically restricted to relatively short context lengths, such as 8K or 32K tokens. Extending this context length is challenging, as the memory required for storing activations and intermediate buffers grows proportionally with the context size.
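As a rough illustration of why this matters, the sketch below estimates per-layer activation memory as a function of sequence length, assuming a 4,096-dimensional hidden state, fp16 activations, batch size 1, and roughly eight stored intermediates per layer (illustrative assumptions, not figures from the paper):

```python
# Back-of-envelope estimate of per-layer activation memory for a Transformer.
# All constants below are illustrative assumptions, not values from the paper.
def activation_bytes(seq_len, hidden_dim=4096, batch=1,
                     bytes_per_elem=2, buffers_per_layer=8):
    # Each stored intermediate is roughly batch x seq_len x hidden_dim elements.
    return batch * seq_len * hidden_dim * bytes_per_elem * buffers_per_layer

for s in (8_192, 32_768, 2_000_000):
    print(f"{s:>9} tokens -> ~{activation_bytes(s) / 1e9:.1f} GB per layer")
```

Even under these simplified assumptions, the per-layer activation footprint grows from roughly half a gigabyte at 8K tokens to well over 100 GB at 2 million tokens, far beyond the capacity of a single GPU once all layers are counted.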
In a new paper Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer, a Microsoft research team introduces the Fully Pipelined Distributed Transformer (FPDT) to address the difficulties of training long-context LLMs. This approach leverages the multiple memory hierarchies available in modern GPU clusters, enhancing hardware efficiency and cost-effectiveness while achieving exceptionally high Model FLOPs Utilization (MFU).

The team begins with a comprehensive analysis of the memory footprint associated with LLM training, identifying memory spikes in commonly used Transformer architectures. They focus on reducing redundant intermediate buffers during both the forward and backward passes.
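A minimal way to observe such spikes is PyTorch's built-in allocator statistics; the sketch below is an illustration of the general approach, not the authors' profiling tooling, and it assumes a Hugging Face-style model that returns a loss:

```python
import torch

def measure_peak_memory(model, batch):
    # Reset the allocator's high-water mark, run one training step,
    # and report the peak GPU memory reached during forward + backward.
    torch.cuda.reset_peak_memory_stats()
    loss = model(**batch).loss   # assumes a Hugging Face-style model output
    loss.backward()
    torch.cuda.synchronize()
    print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```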
Building on this analysis, they develop a fully pipelined distributed transformer, built on DeepSpeed Ulysses and designed specifically for LLMs with sequence lengths reaching millions of tokens. The design uses both GPU and host CPU memory, together with prefetching, to achieve near-zero-overhead training.
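The sketch below shows the general pattern of offloading chunks of a long sequence to pinned host memory and bringing them back asynchronously. It is a simplified illustration of the idea, not the FPDT implementation itself:

```python
import torch

def offload_kv_chunks(kv, num_chunks):
    """Split packed key/value states along the sequence dim and park them in pinned host memory."""
    host_chunks = []
    for chunk in kv.chunk(num_chunks, dim=0):
        host = torch.empty(chunk.shape, dtype=chunk.dtype, pin_memory=True)
        host.copy_(chunk, non_blocking=True)      # asynchronous GPU -> host copy
        host_chunks.append(host)
    torch.cuda.synchronize()                      # make sure all copies have landed
    return host_chunks

def prefetch_chunk(host_chunk, stream):
    """Copy one chunk back to the GPU on a side stream so it can overlap with compute."""
    with torch.cuda.stream(stream):
        return host_chunk.to("cuda", non_blocking=True)
```

Because only the chunks currently being processed need to reside on the GPU, the bulk of the multi-million-token sequence can sit in the much larger host memory.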

The researchers also introduce a double-buffering scheme that overlaps nearly all prefetching with computation. As a result, the inner-loop attention computation only has to wait for the next query chunk to be fetched, rather than for both key and value prefetching, while keeping the GPU memory footprint low.
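A minimal sketch of this double-buffering pattern, written as an illustration under an assumed tensor layout rather than taken from the FPDT source, could look as follows: while attention runs on the resident key/value chunk, the next chunk is copied in on a separate CUDA stream.

```python
import torch

copy_stream = torch.cuda.Stream()   # side stream dedicated to host -> GPU copies

def attention_over_chunks(q, host_kv_chunks, attn_fn):
    # host_kv_chunks: pinned CPU tensors, each packed as [2, chunk_len, heads, dim] (assumed layout)
    n = len(host_kv_chunks)
    buf = [host_kv_chunks[0].to("cuda", non_blocking=True), None]   # two alternating GPU slots
    outputs = []
    for i in range(n):
        if i + 1 < n:
            copy_stream.wait_stream(torch.cuda.current_stream())    # slot no longer read by compute
            with torch.cuda.stream(copy_stream):
                buf[(i + 1) % 2] = host_kv_chunks[i + 1].to("cuda", non_blocking=True)
        k, v = buf[i % 2].unbind(0)          # current chunk's keys and values
        outputs.append(attn_fn(q, k, v))     # compute while the next chunk streams in
        torch.cuda.current_stream().wait_stream(copy_stream)        # next chunk ready before reuse
    return outputs
```

Because the copy for chunk i+1 is issued before the attention on chunk i runs, the transfer and the computation proceed concurrently on different streams, and only the very first chunk is fetched on the critical path.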


When applied to GPT and Llama models, FPDT achieves a 16-fold increase in the sequence length that can be trained on the same hardware compared with current state-of-the-art methods. Thanks to its dedicated sequence-chunk pipeline design, FPDT can train an 8-billion-parameter LLM on sequences of 2 million tokens using only 4 GPUs while sustaining over 55% MFU. The researchers believe this work will benefit the community by enabling further exploration of LLM capabilities in long-context scenarios.
The code is available on the project's GitHub. The paper Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer is on arXiv.
Author: Hecate He | Editor: Chain Zhang
