AI Machine Learning & Data Science Research

Microsoft’s phi-1.5 Challenges the LLM Scaling Law, Showcases the Crucial Role of a ‘Textbook Quality’ Dataset

A Microsoft research team introduces phi-1.5, a 1.3 billion parameter model trained on 30 billion tokens that delivers performance rivaling models five times its size. Moreover, it outperforms most non-frontier LLMs on intricate reasoning tasks.

Large Language Models (LLMs) have undoubtedly showcased remarkable performance in the field of natural language processing. Beyond these accomplishments, they are exerting a profound and far-reaching impact on the economic landscape, and they may yet redefine our frameworks for artificial intelligence and even cognition itself.

On the flip side, the improvement of LLMs appears to be driven predominantly by their sheer scale. Many of today’s state-of-the-art models run to hundreds of billions of parameters and are trained on trillions of tokens, demanding substantial resources for training, deployment, and maintenance. This relentless growth in scale inevitably leads to soaring costs.

Hence, a pressing question emerges: “How compact can an LLM be while retaining its capabilities?” In the paper “Textbooks Are All You Need II: phi-1.5 Technical Report,” a Microsoft research team sets out to answer this question. The team introduces phi-1.5, a 1.3 billion parameter model trained on 30 billion tokens that delivers performance rivaling models five times its size and outperforms most non-frontier LLMs on intricate reasoning tasks.

This endeavor builds upon the foundation laid by phi-1, a Transformer-based model introduced by Microsoft in June of this year. With a mere 1.3 billion parameters, phi-1 managed to surpass the formidable GPT-3.5 on Python coding benchmarks, thanks to its high-quality “textbook” training data. The phi-1.5 model preserves the exact architecture of phi-1: 24 layers, 32 attention heads, and a head dimension of 64. Additional design choices include rotary embeddings, flash-attention for faster training, and the codegen-mono tokenizer.
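To make those numbers concrete, here is a minimal sketch of the reported architecture hyperparameters as a Python config; the class name, field names, and the derived hidden size are our own illustration based on the figures above, not code from the paper.

```python
from dataclasses import dataclass

# Illustrative summary of the phi-1.5 architecture described above
# (same shape as phi-1); the class and field names are our own.
@dataclass
class Phi15Config:
    n_layers: int = 24               # Transformer decoder layers
    n_heads: int = 32                # attention heads per layer
    head_dim: int = 64               # dimension of each attention head
    rotary_embeddings: bool = True   # rotary positional embeddings
    flash_attention: bool = True     # flash-attention kernels for faster training
    tokenizer: str = "codegen-mono"  # tokenizer used by phi-1 and phi-1.5

    @property
    def hidden_size(self) -> int:
        # 32 heads x 64 dimensions per head = 2048-dimensional hidden states
        return self.n_heads * self.head_dim


print(Phi15Config().hidden_size)  # 2048
```

The 2048-dimensional hidden size follows directly from the head count and head dimension quoted in the report.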

The training data for phi-1.5 combines phi-1’s training data with new synthetic “textbook-like” data, crafted to strengthen common-sense reasoning and general knowledge. The researchers trained phi-1.5 from random initialization with a constant learning rate of 2e−4 and a weight decay of 0.1, using the Adam optimizer and fp16 with DeepSpeed ZeRO Stage 2.
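As a rough sketch of that training recipe, the snippet below wires the reported hyperparameters (constant learning rate 2e−4, weight decay 0.1, Adam, fp16, ZeRO Stage 2) into a DeepSpeed configuration; the placeholder model and the batch size are our own assumptions, not details from the report.

```python
import torch
import deepspeed  # pip install deepspeed

# Placeholder module standing in for a randomly initialized 1.3B-parameter
# Transformer with the architecture sketched earlier.
model = torch.nn.Linear(2048, 2048)

ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # illustrative value, not from the report
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 2e-4, "weight_decay": 0.1},  # values quoted in the report
    },
    # No LR scheduler entry, so the learning rate stays constant at 2e-4.
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {"stage": 2},  # DeepSpeed ZeRO Stage 2
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Training then proceeds with engine.backward(loss) and engine.step()
# inside the usual data loop.
```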

In their empirical study, the research team evaluated phi-1.5 on standard natural language benchmarks covering common sense reasoning, language skills, and multi-step reasoning. The results show that phi-1.5 performs on par with significantly larger models and even surpasses them on complex reasoning tasks.
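For a hands-on feel of those reasoning abilities, the released checkpoint can be probed with a short multi-step prompt; the Hugging Face model id below and the toy prompt are our assumptions for illustration, not part of the paper’s evaluation suite.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model id for the released checkpoint; some
# transformers versions may additionally need trust_remote_code=True.
model_id = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A toy multi-step reasoning prompt in the spirit of the benchmarks above.
prompt = (
    "Alice has 3 apples and buys 2 bags with 4 apples each. "
    "How many apples does she have now? Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```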

Collectively, this groundbreaking work challenges the prevailing belief that the prowess of LLMs hinges primarily on their scale. Instead, it underscores the pivotal role played by data quality, suggesting that it may hold the key to unlocking the true potential of these transformative models.

The paper Textbooks Are All You Need II: phi-1.5 Technical Report is available on arXiv.


Author: Hecate He | Editor: Chain Zhang


