The paradigms for natural language processing (NLP) are rapidly evolving: from fully supervised learning, to pretraining and fine-tuning, and, most recently, to pretraining followed by prompt-based prediction. The exciting progress and real-world applicability of NLP systems have motivated more AI researchers to explore innovations in this field.
In the new paper reStructured Pre-training, a Carnegie Mellon University research team proposes reStructured Pre-training (RST), a novel NLP paradigm that pretrains models over valuable restructured data. The team’s RST-based QIN system scores 40 points higher than the student average on the English test of China’s Gaokao National College Entrance Examination and 15 points higher than GPT-3 while using only 1/16 as many parameters.
The 111-page paper opens with a quote from scientist Clifford Stoll, “data is not information,” underscoring the authors’ assertion that pretraining over information, i.e., restructured data, is more effective than simply pretraining on raw data. The team summarizes their study’s main contributions as:
- Evolution Hypothesis: This paper attempts to establish a “Hypothesis of NLP Technique Evolution” from a global perspective by exploring the internal connections among the developments of modern NLP technologies.
- New Paradigm: We propose a new paradigm for modelling NLP: reStructured Pre-training. This paradigm regards model pre-training/tuning as a data storing/accessing process and claims that a good storage mechanism should make expected data easily accessible.
- AI for Gaokao: We develop QIN, the first deep learning-based AI system for the Gaokao-English examination.
- Rich Resources: We release the Gaokao Benchmark to track progress towards human-level intelligence, and set up an interactive leaderboard for it using ExplainaBoard.
- Inspiring Evidence: The success of AI on the Gaokao English examination offers fresh perspective: AI technology can empower education and help solve a range of problems in teaching and learning. The impressive performance on more than 50 datasets from a variety of NLP tasks demonstrates the value of data-centric pretraining and invites further exploration.
Unlike current NLP paradigms that focus on model architecture/structure, the proposed RST seeks to maximize the utility of available data by having it cover as many types of signals as possible and by providing a precise mechanism for accessing those signals according to the requirements of downstream tasks.
The RST method comprises three steps: restructure, pretrain, and fine-tune. Existing data signals in diverse formats are first restructured into a unified form; a pretraining architecture is then selected and trained over this restructured data; finally, the model is fine-tuned on restructured labelled data to further improve downstream performance.
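To make the restructure step concrete, here is a minimal Python sketch of how heterogeneous supervision signals might be cast into a single text-to-text format. The function names and prompt templates are our own illustration of the idea, not code from the paper:

```python
# Illustrative sketch of the "restructure" step: signals from different
# tasks (here, question answering and sentiment classification) are cast
# into one unified (input, target) text form that a single seq2seq model
# can be pretrained on. All names and templates below are hypothetical.

def restructure_qa(context: str, question: str, answer: str) -> tuple[str, str]:
    """Recast a question-answering signal as an (input, target) text pair."""
    source = f"question: {question} context: {context}"
    return source, answer

def restructure_sentiment(review: str, label: str) -> tuple[str, str]:
    """Recast a sentiment label into the same unified text-to-text form."""
    source = f"What is the sentiment of the following review? {review}"
    return source, label

# After restructuring, heterogeneous signals form one homogeneous corpus.
corpus = [
    restructure_qa(
        context="RST pretrains models over restructured data.",
        question="What does RST pretrain models over?",
        answer="restructured data",
    ),
    restructure_sentiment(review="A thorough, inspiring paper.", label="positive"),
]

for source, target in corpus:
    print(f"INPUT:  {source}\nTARGET: {target}\n")
```

In this unified form, signals as different as QA pairs and class labels become interchangeable training examples, which is what lets a single pretrained model “store” many signal types and later “access” the relevant ones for a given downstream task.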
The paper also introduces QIN, which the researchers believe is the first dedicated deep learning-based AI system for China’s Gaokao English College Entrance Examination.
In their empirical study, the team evaluated the proposed RST on a variety of NLP tasks, where it outperformed baseline models such as GPT-3 and T0pp on 52 of the 55 surveyed datasets. The QIN AI system also achieved outstanding results on the Gaokao exam, scoring 40 points higher than the average student result and 15 points higher than GPT-3 while using only 1/16 as many parameters.
Overall, this work argues that, in NLP, “blindly sticking with supervised or unsupervised, pre-training or fine-tuning, few-shot, or zero-shot makes little sense. In practice, all that matters is how we make the best use of the information from data that we can get from the world.”
Author: Hecate He | Editor: Michael Sarazen