Large language models (LLMs) pretrained on massive data now power countless real-world applications. However, as computer scientists have known for decades, not all data is equal. This holds for the composition of LLM pretraining corpora, which are typically sourced from publicly available domains such as Wikipedia, books, and web text.
Most current data selection strategies for LLMs determine the mixture of domains (i.e., their relative weights) either by intuition, which is usually suboptimal, or by tuning against a set of downstream tasks, which can require training thousands of models with different domain weights and carries a high risk of overfitting to those specific tasks.
In the new paper DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining, a research team from Google and Stanford University introduces Domain Reweighting with Minimax Optimization (DoReMi), a domain weight optimization strategy that leverages distributionally robust optimization (DRO) to substantially speed up effective language model pretraining without any knowledge of downstream tasks.
The DoReMi process first trains a small reference language model in a conventional manner, then uses this reference model to train a small proxy model under the distributionally robust optimization framework (DRO-LM), minimizing the worst-case excess loss (proxy loss minus reference loss) across all domains. Finally, the researchers train an 8B parameter LLM on a new dataset whose domain composition is determined by the weights produced during proxy training.
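Once the optimized domain weights are in hand, assembling the final pretraining corpus amounts to sampling domains in proportion to those weights. A minimal sketch of that resampling step (function and variable names are assumptions for illustration, not the authors' code):

```python
import random

def sample_mixture(domain_datasets, domain_weights, n_examples, seed=0):
    """Assemble a corpus by sampling domains according to optimized weights.

    domain_datasets: dict mapping domain name -> list of examples
    domain_weights:  dict mapping domain name -> probability (sums to 1)
    """
    rng = random.Random(seed)
    domains = list(domain_weights)
    probs = [domain_weights[d] for d in domains]
    corpus = []
    for _ in range(n_examples):
        # Pick a domain with probability proportional to its weight,
        # then draw an example uniformly from that domain.
        d = rng.choices(domains, weights=probs, k=1)[0]
        corpus.append(rng.choice(domain_datasets[d]))
    return corpus
```

In practice the large model's data loader would stream documents with these sampling probabilities rather than materializing a fixed corpus, but the weighting logic is the same.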
The DRO-LM framework dynamically updates the domain weights based on each domain's excess loss, rescaling the training objective to emphasize the worst-performing domains. Because the weights emerge from DRO training alone, no knowledge of specific downstream tasks is required.
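The dynamic reweighting can be sketched as an exponentiated-gradient update on the per-domain excess losses. The following is an illustrative simplification under stated assumptions (step size, clipping, and smoothing values are placeholders), not the authors' implementation:

```python
import numpy as np

def update_domain_weights(weights, proxy_losses, ref_losses,
                          step_size=1.0, smoothing=1e-3):
    """One DoReMi-style domain-weight update (illustrative sketch).

    weights:      current domain weights, a probability vector
    proxy_losses: per-domain loss of the proxy model at this step
    ref_losses:   per-domain loss of the fixed reference model
    """
    # Excess loss: how much worse the proxy is than the reference on
    # each domain (clipped at 0 so well-learned domains are not boosted).
    excess = np.maximum(proxy_losses - ref_losses, 0.0)

    # Exponentiated-gradient ascent: upweight high-excess-loss domains.
    new_w = weights * np.exp(step_size * excess)
    new_w /= new_w.sum()

    # Mix with the uniform distribution for stability.
    k = len(weights)
    return (1 - smoothing) * new_w + smoothing * np.ones(k) / k
```

For example, if the proxy lags the reference only on the first of three domains, that domain's weight grows while the others shrink proportionally; averaging these weights over training yields the final mixture.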
In their empirical study, the team evaluated 8B DoReMi models trained on The Pile and the GLaM dataset. On The Pile, DoReMi reduced perplexity across all domains relative to the baseline domain weights, improved average downstream accuracy on generative few-shot tasks by 6.5 percent, and reached the baseline accuracy 2.6x faster. On GLaM, DoReMi achieved performance comparable to domain weights tuned directly on downstream tasks.
This work confirms DoReMi’s effectiveness at optimizing domain mixtures to speed up language model pretraining. The team believes further research on such data-centric approaches can improve language model training efficiency even more.
The paper DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining is available on arXiv.
Author: Hecate He | Editor: Michael Sarazen