The impressive generative capacity of large-scale pretrained language models (PLMs) has inspired machine learning researchers to explore methods for generating model training examples via PLMs and data augmentation procedures, i.e. dataset generation.
A novel contribution in this research direction is proposed in the new paper ZeroGen: Efficient Zero-shot Learning via Dataset Generation, from researchers at the University of Hong Kong, Shanghai AI Lab, Huawei Noah’s Ark Lab and the University of Washington. The team describes their proposed ZEROGEN as an “extreme instance” of dataset generation via PLMs for zero-shot learning.

ZEROGEN is a framework that builds on prompt-based zero-shot learning (PROMPTING). Unlike existing approaches that rely on gigantic PLMs during inference, ZEROGEN introduces a more flexible and efficient approach for conducting zero-shot learning with PLMs.
The ZEROGEN process involves three sequential stages: pseudo dataset generation, pseudo-supervised training, and zero-shot evaluation. The pseudo dataset generation stage exploits the generative power of PLMs to synthesize a dataset to solve a downstream task. Given a PLM and a text classification task, ZEROGEN instantiates a prompt and outputs a natural language sequence to be completed by the PLM. With carefully designed prompts and a powerful PLM, the generated dataset is believed to incorporate rich task-specific knowledge.
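The generation stage can be pictured with a minimal sketch. It is not the authors' implementation: the prompt templates, the `toy_plm` stand-in (which returns canned text instead of sampling from a real PLM) and the `generate_pseudo_dataset` helper are all hypothetical names introduced here for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical label-conditioned prompt templates for a sentiment task
# (illustrative only; not taken from the paper).
PROMPTS: Dict[str, str] = {
    "positive": 'The movie review in positive sentiment is: "',
    "negative": 'The movie review in negative sentiment is: "',
}


def toy_plm(prompt: str, rng: random.Random) -> str:
    """Stand-in for a real PLM sampling call; returns a canned continuation."""
    pool = {
        "positive": ["A warm and funny film.", "I loved every minute."],
        "negative": ["A dull and lifeless film.", "I hated every minute."],
    }
    key = "positive" if "positive" in prompt else "negative"
    return rng.choice(pool[key])


def generate_pseudo_dataset(
    generate: Callable[[str, random.Random], str],
    prompts: Dict[str, str],
    n_per_label: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Build (text, label) pairs by completing one prompt per label."""
    rng = random.Random(seed)
    data: List[Tuple[str, str]] = []
    for label, prompt in prompts.items():
        for _ in range(n_per_label):
            text = generate(prompt, rng).strip()
            if text:  # skip empty generations
                data.append((text, label))
    rng.shuffle(data)  # mix the labels before training
    return data


pseudo_dataset = generate_pseudo_dataset(toy_plm, PROMPTS, n_per_label=3)
```

In a real setting, `toy_plm` would be replaced by a call that samples a continuation from a large PLM. The label attached to each example comes from the prompt that produced it, so no human annotation is needed.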
Using the pseudo dataset synthesized as above, a tiny task model (TAM) is then trained to solve the task. This step is flexible: researchers can use any model architecture, loss function or training strategy. Because the trained TAM is orders of magnitude smaller than the PLM, it can perform extremely efficient inference on the target task. No human annotations are involved at any point in the process, making the evaluation setting purely zero-shot.
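The training stage might look like the sketch below. Since the TAM can be any architecture, this example swaps in a bag-of-words naive Bayes classifier purely to stay dependency-free; it is an assumption made for illustration, not the model used in the paper. The `TinyTaskModel` class and the hard-coded `pseudo_dataset` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace after dropping basic punctuation."""
    return text.lower().replace(".", " ").replace(",", " ").split()


class TinyTaskModel:
    """Multinomial naive Bayes over bag-of-words; a stand-in for the TAM."""

    def fit(self, data: List[Tuple[str, str]]) -> "TinyTaskModel":
        self.label_counts = Counter(label for _, label in data)
        self.word_counts = defaultdict(Counter)
        for text, label in data:
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text: str) -> str:
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            # Laplace (add-one) smoothing over the training vocabulary.
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Synthetic (text, label) pairs standing in for PLM-generated data.
pseudo_dataset = [
    ("A warm and funny film.", "positive"),
    ("I loved every minute.", "positive"),
    ("A dull and lifeless film.", "negative"),
    ("I hated every minute.", "negative"),
]

tam = TinyTaskModel().fit(pseudo_dataset)
print(tam.predict("funny and warm"))
```

The model is trained only on synthesized pairs and then applied to unseen inputs, which mirrors the pseudo-supervised training and zero-shot evaluation stages at toy scale.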

The team evaluated the proposed ZEROGEN on natural language processing (NLP) tasks, including text classification, question answering, and natural language inference. They used six datasets in their experiments and compared ZEROGEN to PROMPTING and Supervised baselines.


The team summarizes their key findings from the empirical results as:
- The zero-shot performance of the ZEROGEN framework’s TAM significantly surpasses its PLM counterparts (which often serve as teacher models under the knowledge distillation context), with only ∼0.4 percent of the number of parameters.
- In some low-resource settings, TAM trained with synthesized data even outperforms the same model trained with human annotations in a fully supervised manner.
- The quality of text generated by known models and algorithms is well reflected in downstream task performance, and decoding strategies that encourage more diversity also introduce more noise.
- Prompt engineering is challenging: more instructive or natural-language-style prompts perform inconsistently across tasks.
Overall, the results validate ZEROGEN as a practical and promising approach for flexible and efficient zero-shot learning in NLP, with the potential to also serve as a data-free, model-agnostic knowledge distillation and unreferenced text evaluation method.
The paper ZeroGen: Efficient Zero-shot Learning via Dataset Generation is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
