While contemporary deep learning models continue to achieve outstanding results across a wide range of tasks, these models are known to have huge data appetites. The emergence of large-scale pretrained language models such as OpenAI’s GPT-3 has helped reduce the need for task-specific labelled data in natural language processing (NLP), as the models’ learned contextualized text representations can be fine-tuned for specific downstream tasks using relatively small amounts of training data. More recently, these powerful large language models have also shown the ability to generate answers for unseen NLP tasks via few-shot inference.
Motivated by this development, a new Google AI study explores zero-label learning (training with synthetic data only) in NLP, proposing Unsupervised Data Generation (UDG), a novel training data creation procedure designed to synthesize high-quality training data without any human annotations.
The mechanism behind few-shot inference on NLP tasks is to leverage a large language model to infer the correct label from a manually crafted input prompt comprising a task description and a few sample input-label pairs. Recent studies on large-scale GPT-3 models indicate that, with limited task-specific data and no gradient updates, few-shot inference can achieve performance comparable to traditional fine-tuning methods.
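To make the prompt structure concrete, here is a minimal Python sketch of how such a few-shot inference prompt might be assembled. The function name `build_inference_prompt` and the `Input:` / `Label:` template are illustrative assumptions, not the format used by GPT-3 or by the paper.

```python
from typing import List, Tuple


def build_inference_prompt(
    task_description: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble a few-shot inference prompt (hypothetical template).

    The prompt holds a task description, a few input-label pairs,
    and finally the unlabelled query whose label the model should complete.
    """
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final label is left blank for the language model to fill in.
    lines.append(f"Input: {query}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_inference_prompt(
    "Classify the sentiment of each movie review.",
    [("A moving, beautifully shot film.", "positive"),
     ("Two hours I will never get back.", "negative")],
    "The cast is excellent and the script is sharp.",
)
```

The resulting string would be sent to a language model as-is, and the text it produces after the final `Label:` would be read as the prediction. No model weights are updated at any point.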
The performance of few-shot inference with large language models, however, still lags behind state-of-the-art fine-tuned models on many NLP tasks. The researchers suggest one possible reason is that the language models were never explicitly trained to perform inference directly. They therefore propose using the models for few-shot generation: instead of predicting output labels, the models generate the inputs, with the goal of formulating prompts that are more likely to occur naturally in the training corpus.
Unlike few-shot inference, the proposed UDG framework requires only unlabelled few-shot examples, which makes it effectively a zero-label learning setting. In a departure from existing synthetic data generation methods, it also requires no fine-tuning of the generative model and uses only unsupervised data.
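The generation idea can be sketched in Python as follows. This is a rough illustration under stated assumptions rather than the paper's implementation: the prompt template, the function names `build_generation_prompt` and `synthesize_dataset`, and the `placeholder_generate` stand-in for a real language model call are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def build_generation_prompt(unlabelled_examples: List[str], label_description: str) -> str:
    """Build a few-shot generation prompt (hypothetical template).

    Unlabelled samples show the model what task inputs look like; the last
    line names the desired label so the model writes a matching input.
    """
    lines = [f"Sample text: {text}" for text in unlabelled_examples]
    lines.append(f"{label_description} text:")
    return "\n".join(lines)


def synthesize_dataset(
    label_descriptions: Dict[str, str],
    unlabelled_examples: List[str],
    generate: Callable[[str], str],
    samples_per_label: int,
) -> List[Tuple[str, str]]:
    """Create (generated_input, label) pairs without any human labels."""
    dataset = []
    for label, description in label_descriptions.items():
        prompt = build_generation_prompt(unlabelled_examples, description)
        for _ in range(samples_per_label):
            # The label is known by construction: it is the one requested in the prompt.
            dataset.append((generate(prompt).strip(), label))
    return dataset


def placeholder_generate(prompt: str) -> str:
    """Stand-in for a language model call; returns a dummy continuation."""
    requested = prompt.splitlines()[-1]
    return f"<model continuation for '{requested}'>"


synthetic = synthesize_dataset(
    {"positive": "Positive movie review", "negative": "Negative movie review"},
    ["The plot follows a retired detective.", "It runs just under two hours."],
    placeholder_generate,
    samples_per_label=2,
)
```

In a real setup, `placeholder_generate` would be replaced by a call to a large language model, and the resulting pairs would be used to train a task-specific model in place of human-labelled data.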
To test the performance of their method, the researchers conducted experiments on text classification and complex language understanding tasks.
UDG achieved results better than or comparable to strong baseline models trained on human-labelled data, and also proved to be a highly effective data augmentation method when combined with labelled data, achieving new state-of-the-art results on the SuperGLUE benchmark.
Overall, this work demonstrates that NLP models can obtain strong results without any human-annotated labels, opening a promising new direction for future transfer learning research in NLP.
The paper Towards Zero-Label Language Learning is on arXiv.
Author: Hecate He | Editor: Michael Sarazen