Large language models (LLMs) pretrained on massive data now power countless real-world applications. However, as computer scientists have known for decades, not all data is equal. This holds for the composition of LLM pretraining corpora, which are typically sourced from publicly available domains such as Wikipedia, books, and web text.
Most current data selection strategies for LLMs determine the mixture of domains (i.e., their relative weights) either by intuition, which is usually suboptimal, or by tuning against a set of downstream tasks, which can require training thousands of models with different domain weights and carries a high risk of overfitting to those specific tasks.
In the new paper DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining, a research team from Google and Stanford University introduces Domain Reweighting with Minimax Optimization (DoReMi), a domain weight optimization strategy that leverages distributionally robust optimization (DRO) to substantially speed up effective language model pretraining without any knowledge of downstream tasks.
The DoReMi process first trains a small reference language model in a conventional manner, then uses this reference model to train a small proxy model under the distributionally robust optimization framework (DRO-LM), minimizing the worst-case excess loss (proxy loss minus reference loss) across all domains. Finally, the researchers train an 8B parameter LLM on a new dataset whose domain composition is determined by the weights produced during proxy training.
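Once the optimized domain weights are in hand, assembling the final pretraining corpus amounts to sampling domains in proportion to those weights. A minimal sketch of that resampling step (function and variable names are assumptions for illustration, not the authors' code):

```python
import random

def sample_mixture(domain_datasets, domain_weights, n_examples, seed=0):
    """Assemble a corpus by sampling domains according to optimized weights.

    domain_datasets: dict mapping domain name -> list of examples
    domain_weights:  dict mapping domain name -> probability (sums to 1)
    """
    rng = random.Random(seed)
    domains = list(domain_weights)
    probs = [domain_weights[d] for d in domains]
    corpus = []
    for _ in range(n_examples):
        # Pick a domain with probability proportional to its weight,
        # then draw an example uniformly from that domain.
        d = rng.choices(domains, weights=probs, k=1)[0]
        corpus.append(rng.choice(domain_datasets[d]))
    return corpus
```

In practice the large model's data loader would stream documents with these sampling probabilities rather than materializing a fixed corpus, but the weighting logic is the same.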
The DRO-LM framework dynamically updates the domain weights based on each domain's excess loss, rescaling the training objective to emphasize the worst-performing domains. Because the weights emerge from DRO training alone, no knowledge of specific downstream tasks is required.
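The dynamic reweighting can be sketched as an exponentiated-gradient update on the per-domain excess losses. The following is an illustrative simplification under stated assumptions (step size, clipping, and smoothing values are placeholders), not the authors' implementation:

```python
import numpy as np

def update_domain_weights(weights, proxy_losses, ref_losses,
                          step_size=1.0, smoothing=1e-3):
    """One DoReMi-style domain-weight update (illustrative sketch).

    weights:      current domain weights, a probability vector
    proxy_losses: per-domain loss of the proxy model at this step
    ref_losses:   per-domain loss of the fixed reference model
    """
    # Excess loss: how much worse the proxy is than the reference on
    # each domain (clipped at 0 so well-learned domains are not boosted).
    excess = np.maximum(proxy_losses - ref_losses, 0.0)

    # Exponentiated-gradient ascent: upweight high-excess-loss domains.
    new_w = weights * np.exp(step_size * excess)
    new_w /= new_w.sum()

    # Mix with the uniform distribution for stability.
    k = len(weights)
    return (1 - smoothing) * new_w + smoothing * np.ones(k) / k
```

For example, if the proxy lags the reference only on the first of three domains, that domain's weight grows while the others shrink proportionally; averaging these weights over training yields the final mixture.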
In their empirical study, the team evaluated 8B DoReMi models trained on The Pile and the GLaM dataset. On The Pile, DoReMi reduced perplexity across all domains relative to the baseline domain weights, improved average downstream accuracy on generative few-shot tasks by 6.5 percent, and reached the baseline accuracy 2.6x faster. On GLaM, DoReMi achieved performance comparable to domain weights tuned directly on downstream tasks.
This work confirms DoReMi’s effectiveness at optimizing domain mixtures to speed up language model pretraining. The team believes further research on such data-centric approaches can improve language model training efficiency even more.
The paper DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining is available on arXiv.
Author: Hecate He | Editor: Michael Sarazen