Anyone who has ever assembled IKEA-style flat-pack furniture knows that step-by-step instructions make the task much easier. Recent studies have shown that large language models (LLMs) also benefit from instructions, and they are now commonly trained or fine-tuned on open-domain instruction-following data provided by human annotators. This approach, however, introduces a bottleneck: manually creating such instructions is time-consuming and labour-intensive.
In the new paper WizardLM: Empowering Large Language Models to Follow Complex Instructions, a research team from Microsoft and Peking University presents Evol-Instruct, a novel approach that uses LLMs to automatically generate large amounts of instruction data at varying levels of complexity. In human evaluations, the instructions produced by Evol-Instruct were judged superior to human-created instruction datasets.


The Evol-Instruct pipeline comprises three steps: 1) Instruction evolving, 2) Response generation for the newly evolved instructions, and 3) Elimination evolving.

Starting from a simple initial instruction, Evol-Instruct randomly selects one of two strategies: In-depth Evolving, which upgrades the given instruction into a more complex one via one of five operations (adding constraints, deepening, concretizing, increasing reasoning steps, and complicating the input); or In-breadth Evolving, which creates an entirely new instruction based on the given one to broaden topic coverage. The final Elimination Evolving step serves as a filter that discards failed evolutions.
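Conceptually, the pipeline fits in a few dozen lines. Below is a minimal Python sketch of one way to implement it; the `call_llm` helper, the prompt templates, and the filtering heuristics are illustrative assumptions rather than the paper's exact prompts.

```python
import random

# Hypothetical helper: send a prompt to any instruction-following LLM (e.g.,
# via an API) and return its completion. Plug in a real client to run this.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect this to an LLM API of your choice")

# The five In-depth Evolving operations named in the paper; the template
# wording below is an illustrative paraphrase, not the paper's exact prompts.
IN_DEPTH_TEMPLATES = [
    "Add one more constraint or requirement to this instruction:\n{instr}",
    "Deepen this instruction by increasing its depth and breadth:\n{instr}",
    "Concretize this instruction by replacing general concepts with specific ones:\n{instr}",
    "Rewrite this instruction so it explicitly requires multi-step reasoning:\n{instr}",
    "Complicate the input of this instruction (e.g., add a table, code, or data):\n{instr}",
]

# In-breadth Evolving creates an entirely new instruction on a related topic.
IN_BREADTH_TEMPLATE = (
    "Create a brand-new instruction in the same domain as, but different "
    "from, the following instruction:\n{instr}"
)

def evolve(instruction: str) -> str | None:
    """One evolution round: evolve an instruction, then filter failures."""
    # Step 1: instruction evolving -- randomly pick In-depth or In-breadth.
    if random.random() < 0.5:
        template = random.choice(IN_DEPTH_TEMPLATES)
    else:
        template = IN_BREADTH_TEMPLATE
    evolved = call_llm(template.format(instr=instruction)).strip()

    # Step 3: elimination evolving -- discard evolutions that failed, e.g.
    # the instruction did not change or the model refused to answer
    # (these filtering heuristics are simplified assumptions).
    if not evolved or evolved == instruction or evolved.startswith("Sorry"):
        return None
    return evolved

def build_dataset(seeds: list[str], rounds: int = 4) -> list[dict]:
    """Grow the instruction pool over several rounds, then collect
    (instruction, response) pairs for fine-tuning (Step 2)."""
    pool = list(seeds)
    for _ in range(rounds):
        pool += [e for e in (evolve(i) for i in pool) if e is not None]
    return [{"instruction": i, "response": call_llm(i)} for i in pool]
```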

In their empirical study, the team had Evol-Instruct generate instructions at different complexity levels, then used a mixture of all the generated instruction data to fine-tune a LLaMA model, producing WizardLM. They compared WizardLM against strong baselines such as ChatGPT, Alpaca and Vicuna.
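For context, supervised fine-tuning on such (instruction, response) pairs typically looks like the Hugging Face sketch below; the checkpoint name, the Alpaca-style prompt format, and the hyperparameters are assumptions for illustration, not the paper's training recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumed checkpoint; any causal LM on the Hugging Face Hub works the same way.
MODEL_NAME = "huggyllama/llama-7b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# A toy (instruction, response) pair standing in for the Evol-Instruct output.
pairs = [
    {"instruction": "Explain why the sky is blue.",
     "response": "Air molecules scatter shorter blue wavelengths of sunlight "
                 "more strongly than longer ones, so the sky appears blue."},
]

def to_text(example):
    # Alpaca-style prompt format (an assumption for illustration).
    return {"text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Response:\n{example['response']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=2048)

dataset = Dataset.from_list(pairs).map(to_text)
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wizardlm-sft", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=2e-5),
    train_dataset=dataset,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```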
The team summarizes the empirical results as follows:
- Instructions from Evol-Instruct are superior to those from the human-created ShareGPT dataset. Using the same amount of Evol-Instruct data, WizardLM significantly outperforms Vicuna, with a 12.4 percentage point higher win rate (41.3% vs. 28.9%).
- Labellers prefer WizardLM outputs over ChatGPT outputs under complex test instructions. On the high-difficulty section of the test set (difficulty level ≥ 8), WizardLM outperforms ChatGPT, with a 7.9 percentage point higher win rate (42.9% vs. 35.0%).
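For readers unfamiliar with the metric, a pairwise win rate is simply the fraction of blind labeler verdicts favouring one model, with ties absorbing the remainder. The short sketch below computes it; the verdict format is an assumption, and the counts are chosen only to echo the reported WizardLM-vs-Vicuna numbers.

```python
from collections import Counter

def win_rates(verdicts: list[str]) -> dict[str, float]:
    """Turn blind labeler verdicts ('a', 'b', or 'tie') into percentages.
    The verdict format here is an assumption for illustration."""
    counts = Counter(verdicts)
    return {k: 100 * counts[k] / len(verdicts) for k in ("a", "b", "tie")}

# Counts chosen to echo the reported WizardLM (a) vs. Vicuna (b) comparison.
verdicts = ["a"] * 413 + ["b"] * 289 + ["tie"] * 298
print(win_rates(verdicts))  # {'a': 41.3, 'b': 28.9, 'tie': 29.8}
```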
Overall, this work demonstrates that the proposed Evol-Instruct approach of AI-evolved instructions can significantly enhance LLM performance, enabling models to better handle difficult and complex instructions such as solving math problems, writing code, and multi-step reasoning.
The code and generated data are available on the project’s GitHub. The paper WizardLM: Empowering Large Language Models to Follow Complex Instructions is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.