Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) with their remarkable ability to handle complex tasks. Trained on massive datasets with immense computational power, these models showcase impressive long-context capabilities.
To date, however, such long-context capabilities have largely been confined to proprietary LLM APIs, and there has been no open recipe for building comparable long-context models with similar downstream performance. Moreover, existing open-source long-context models are often evaluated only on language modeling loss and synthetic tasks, neglecting the need to maintain strong performance on standard short-context tasks.
In a new paper Effective Long-Context Scaling of Foundation Models, a Meta AI research team presents a series of long-context LLMs built through continual pretraining from LLAMA 2. These models support effective context windows of up to 32,768 tokens and outperform all existing open-source long-context models.
The proposed models are constructed through continual pretraining from LLAMA 2 checkpoints on an additional 400 billion tokens assembled into long training sequences. Remarkably, the team preserves the core LLAMA 2 architecture, making only one crucial modification: the change to the positional encoding necessary for the model to handle longer contexts.
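Assembling a token stream into long training sequences is commonly done by packing tokenized documents back to back into fixed-length chunks. The sketch below illustrates that idea; the function name, the separator-token scheme, and the exact chunking behavior are illustrative assumptions, not the paper's actual data pipeline.

```python
from typing import Iterable, Iterator, List


def pack_into_sequences(
    docs: Iterable[List[int]], seq_len: int, sep_id: int
) -> Iterator[List[int]]:
    """Concatenate tokenized documents (with a separator token between
    them) and emit fixed-length training sequences of seq_len tokens.

    Documents shorter than seq_len get packed together; documents longer
    than seq_len are split across consecutive sequences.
    """
    buf: List[int] = []
    for doc in docs:
        buf.extend(doc)
        buf.append(sep_id)
        while len(buf) >= seq_len:
            yield buf[:seq_len]
            buf = buf[seq_len:]
    # Any remainder shorter than seq_len is dropped in this sketch;
    # a real pipeline might pad it instead.


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_into_sequences(docs, seq_len=4, sep_id=0))
# packed == [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Packing avoids wasting compute on padding and is what lets a continual-pretraining run see genuinely long sequences even when individual source documents are short.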
For positional encoding (PE), the researchers introduce a minimal yet vital modification to RoPE: decreasing the rotation angle (equivalently, increasing RoPE's base frequency). This mitigates RoPE's decaying effect for distant tokens, enhancing the model's ability to attend over longer contexts.
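The modification can be sketched numerically: in RoPE, each pair of channels rotates query/key vectors by an angle proportional to the token position, and raising the base frequency shrinks those angles. The NumPy sketch below is an illustration of this mechanism, not Meta's implementation; the specific base values compared (10,000 versus 500,000) are shown only as an example of the adjustment.

```python
import numpy as np


def rope_angles(head_dim: int, positions: np.ndarray,
                base: float = 10_000.0) -> np.ndarray:
    """Per-dimension RoPE rotation angles.

    Channel pair i at position p rotates by p * base**(-2i/head_dim);
    raising `base` shrinks the angles, slowing the decay of attention
    between distant tokens.
    """
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)  # shape (len(positions), head_dim // 2)


def apply_rope(x: np.ndarray, base: float = 10_000.0) -> np.ndarray:
    """Rotate query/key vectors x of shape (seq_len, head_dim) with RoPE."""
    seq_len, head_dim = x.shape
    angles = rope_angles(head_dim, np.arange(seq_len), base)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


# A larger base yields strictly smaller rotation angles in every
# non-constant dimension, so distant tokens are rotated less.
a_small = rope_angles(128, np.array([4096]), base=10_000.0)
a_large = rope_angles(128, np.array([4096]), base=500_000.0)
assert np.all(a_large[:, 1:] < a_small[:, 1:])
```

Because RoPE is applied per attention head with no learned parameters, changing the base this way requires no architectural changes, which is why continual pretraining from existing LLAMA 2 checkpoints remains possible.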
The team also explores different strategies for improving long-context abilities. Surprisingly, they find that for continual pretraining the quality of the data matters more than the sheer length of the texts, highlighting the importance of data curation for strong long-context performance.
In the realm of instruction tuning, the researchers employ a simple and cost-effective approach. They leverage a pre-existing, large, and diverse short-prompt dataset and augment it with synthetic self-instructed long data generated by LLAMA 2 CHAT. This strategy allows the model to acquire a diverse set of skills from the extensive RLHF dataset and transfer that knowledge to long-context scenarios via self-instructed data.
The research team conducts an extensive evaluation encompassing language modeling, synthetic context probing tasks, and a wide range of research benchmarks. In these evaluations, the proposed models consistently outperform LLAMA 2 on most standard tasks and exhibit significant improvements on long-context tasks.
In summary, this work demonstrates the strong performance of the Meta AI team's series of long-context LLMs. Their approach has the potential to democratize access to long-context LLMs, opening doors for further advances in NLP and empowering researchers and developers to tackle more complex and nuanced language understanding tasks.
The paper Effective Long-Context Scaling of Foundation Models is available on ai.meta.com.
Author: Hecate He | Editor: Chain Zhang