AI Machine Learning & Data Science Research

Microsoft’s Crafted “Textbook Quality” Data Are All You Need to Train 10× Smaller Yet Strong Language Model for Code

In a new paper Textbooks Are All You Need, a Microsoft research team crafts ‘textbook quality’ data for training a large language model for code. The resulting phi-1 model outperforms state-of-the-art large language models (LLMs) with a mere 1.3 billion parameters.

Training large artificial neural networks is an art. It has long been known that high-quality training data has a significant impact on the performance of large-scale models and can even alter the scaling laws that relate model and data size.

Following this approach, a Microsoft research team, in their new paper Textbooks Are All You Need, crafts ‘textbook quality’ data to train a large language model for code. The resulting phi-1 model surpasses state-of-the-art large language models (LLMs) with a mere 1.3 billion parameters.

This work focuses on training a language model for code and aims to show that the power of high-quality data can break existing scaling laws.

The team first shows how to craft high-quality data for training better language models at a much smaller scale. Specifically, they use GPT-4 to annotate the educational value of a subset of publicly available Python code from The Stack and StackOverflow, and then train a classifier on these annotations to filter the corpora for code with educational value for a student acquiring basic coding skills. They also inject randomness into the prompts used to generate data, ensuring the crafted dataset is diverse and non-repetitive.
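For a concrete picture of what classifier-based filtering looks like, the sketch below illustrates the general idea under stated assumptions: it is not the team's actual pipeline, the labeled examples merely stand in for GPT-4-annotated samples, and TF-IDF features plus a scikit-learn random forest substitute for the code embeddings used in the paper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny labeled set standing in for GPT-4-annotated samples (1 = educational value, 0 = not).
labeled_snippets = [
    ("def factorial(n):\n    # return n! computed iteratively\n    result = 1\n"
     "    for i in range(2, n + 1):\n        result *= i\n    return result", 1),
    ("x=open('f').read();exec(x)  # quick hack", 0),
]
codes, labels = zip(*labeled_snippets)

# Character n-gram TF-IDF features stand in for learned code embeddings.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(codes)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

def is_textbook_quality(snippet: str) -> bool:
    """Predict whether a raw code file should be kept in the training corpus."""
    return bool(clf.predict(vectorizer.transform([snippet]))[0])

# Stand-in for unlabeled Python files pulled from The Stack / StackOverflow.
raw_corpus = ["def add(a, b):\n    return a + b", "print(1)"]
filtered = [code for code in raw_corpus if is_textbook_quality(code)]
```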

Next, they use a decoder-only transformer as the base model. The resulting 1.3B-parameter phi-1 model consists of 24 layers, a hidden dimension of 2048, an MLP inner dimension of 8192, and 32 attention heads. In particular, phi-1-base was trained on the CodeTextbook dataset, and phi-1 is obtained by fine-tuning phi-1-base on the CodeExercises dataset.
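As a rough illustration of these sizes, the snippet below instantiates a generic decoder-only transformer with the reported dimensions using Hugging Face's GPT-2 architecture. It is not phi-1's actual implementation and omits the model's architectural details; the vocabulary size is an assumption, but the parameter count lands close to the reported 1.3B.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Generic decoder-only configuration matching the reported phi-1 dimensions.
config = GPT2Config(
    n_layer=24,        # 24 transformer layers
    n_embd=2048,       # hidden dimension of 2048
    n_inner=8192,      # MLP inner dimension of 8192
    n_head=32,         # 32 attention heads
    n_positions=2048,  # context length (assumption for this sketch)
    vocab_size=50304,  # assumed vocabulary size, not taken from the paper
)
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```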

Furthermore, they show that fine-tuning yields substantial improvements in the model’s understanding and in its ability to use external libraries such as Pygame and Tkinter.
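As a purely illustrative usage sketch, the snippet below prompts a code model for a small Tkinter program. It assumes a phi-1 checkpoint is publicly available under the Hugging Face Hub id microsoft/phi-1; the paper itself does not specify a release location, so treat the model id as an assumption.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1"  # assumed checkpoint id, not confirmed by the paper
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code may be required depending on the transformers version.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Docstring-style prompt asking for a small Tkinter program.
prompt = '"""Create a Tkinter window with a button that prints Hello when clicked."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```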

The empirical results show that phi-1 achieves 50.6% pass@1 accuracy on HumanEval and 55.5% on MBPP, surpassing almost all open-source models on coding benchmarks despite having only 1.3B parameters and being trained on a dataset roughly 100× smaller.
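For context, pass@1 here follows the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): generate n samples per problem, count the c that pass the unit tests, and average the per-problem probability that at least one of k drawn samples passes. A minimal sketch with made-up example counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c passing."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 problems, 10 samples each, with 6, 0 and 2 passing samples respectively.
results = [(10, 6), (10, 0), (10, 2)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(f"pass@1 = {score:.3f}")  # mean per-problem pass probability
```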

Overall, this work verifies that developing methods for creating high-quality datasets is a crucial research direction for large language model training, and that such methods have great potential to unlock more coding capabilities in language models.

The paper Textbooks Are All You Need is on arXiv.


Author: Hecate He | Editor: Chain Zhang


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
