The paradigms for natural language processing (NLP) are rapidly evolving: from fully supervised learning, to pretraining followed by fine-tuning, and, more recently, to pretraining with prompt prediction. The exciting progress and real-world applicability of NLP systems have motivated more AI researchers to explore innovations in this field.
In the new paper ReStructured Pre-training, a Carnegie Mellon University research team proposes reStructured Pre-training (RST), a novel NLP paradigm that pretrains models over valuable restructured data. The team’s RST-based QIN system scores 40 points higher than the average student on the English test of the Gaokao (China’s National College Entrance Examination) and 15 points higher than GPT-3 while using only 1/16 as many parameters.

The 111-page paper opens with a quote from scientist Clifford Stoll, “data is not information,” underscoring the authors’ assertion that pretraining over information, i.e., restructured data, will be more effective than simply pre-training on raw data. The team summarizes their study’s main contributions as:
- Evolution Hypothesis: This paper attempts to establish a “Hypothesis of NLP Technique Evolution” from a global perspective by exploring the internal connections among the developments of modern NLP techniques.
- New Paradigm: We propose a new NLP paradigm: reStructured Pre-training. This paradigm regards model pretraining/tuning as a data storing/accessing process and claims that a good storage mechanism should make expected data easily accessible.
- AI for Gaokao: We develop QIN, the first deep learning-based AI system for the Gaokao-English examination.
- Rich Resources: We release the Gaokao Benchmark to track how well we make progress towards human-level intelligence, and set up an interactive leaderboard for the Gaokao Benchmark using ExplainaBoard.
- Inspiring Evidence: The success of AI on the Gaokao English examination offers new insight: AI technology can empower education and help solve a series of problems in education and teaching. The impressive performance on more than 50 datasets from a variety of NLP tasks shows the value of data-centric pretraining and inspires further exploration.
Unlike current NLP paradigms that focus on model architecture/structure, the proposed RST seeks to optimize the utility of available data by having it cover as many types of signals as possible and by providing a precise access mechanism for these signals based on the requirements of downstream tasks.
The RST method comprises three steps: restructure, pretrain, and fine-tune. Existing data signals in diverse formats are first restructured into a unified form for model pretraining, and a pretraining architecture is then selected and trained over this restructured data. Finally, the model is further fine-tuned on restructured labelled data for improved performance.
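To make the restructuring step concrete, the sketch below shows one plausible way heterogeneous signals (here, a named-entity annotation and a question-answer pair) could be rewritten into a single text-to-text format suitable for pretraining a seq2seq model. This is only an illustrative assumption, not the authors’ implementation; the function names and prompt templates are hypothetical.

```python
# Minimal, hypothetical sketch of the "restructure" step:
# heterogeneous data signals are rewritten into a unified
# (input, output) text-to-text form for pretraining.

def restructure_ner(sentence, entities):
    """Turn a named-entity annotation into a prompted input/output pair."""
    return {
        "input": f"List the named entities in: {sentence}",
        "output": "; ".join(f"{span} ({label})" for span, label in entities),
    }

def restructure_qa(context, question, answer):
    """Turn a question-answer signal into the same unified form."""
    return {
        "input": f"Answer the question given the passage.\n"
                 f"Passage: {context}\nQuestion: {question}",
        "output": answer,
    }

# Both signal types end up in one homogeneous pretraining corpus.
corpus = [
    restructure_ner("Turing was born in London.",
                    [("Turing", "PERSON"), ("London", "LOC")]),
    restructure_qa("RST was proposed by a CMU team.",
                   "Who proposed RST?", "a CMU team"),
]
for record in corpus:
    print(record["input"], "->", record["output"])
```

Under this view, downstream tasks can later “access” the stored signals simply by issuing inputs in the same unified format.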

The paper also introduces QIN, which the researchers believe is the first dedicated deep learning-based AI system for China’s Gaokao English College Entrance Examination.


In their empirical study, the team evaluated the proposed RST on a variety of NLP tasks, where it outperformed baseline models such as GPT-3 and T0pp on 52 of the 55 surveyed datasets. The QIN AI system also achieved outstanding results on the Gaokao exam: scoring 40 points higher than the average student result and 15 points higher than GPT-3 with 1/16 of the parameters.
Overall, this work argues that, in NLP, “blindly sticking with supervised or unsupervised, pre-training or fine-tuning, few-shot, or zero-shot makes little sense. In practice, all that matters is how we make the best use of the information from data that we can get from the world.”
The Gaokao Benchmark for AI is available on the project’s GitHub. The paper ReStructured Pre-training is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
