The paradigms for natural language processing (NLP) are evolving rapidly: from fully supervised learning, to pretraining and fine-tuning, and more recently to pretraining with prompt-based prediction. The exciting progress and real-world applicability of NLP systems have motivated more AI researchers to explore innovations in this field.
In the new paper ReStructured Pre-training, a Carnegie Mellon University research team proposes reStructured Pre-training (RST), a novel NLP paradigm that pretrains models over valuable restructured data. The team’s RST-based QIN system scores 40 points higher than the student average on the Gaokao-English National College Entrance Examination and 15 points higher than GPT-3 while using 1/16 of its parameters.

The 111-page paper opens with a quote from scientist Clifford Stoll, “data is not information,” underscoring the authors’ assertion that pretraining over information, i.e., restructured data, will be more effective than simply pretraining on raw data. The team summarizes their study’s main contributions as:
- Evolution Hypothesis: This paper attempts to establish a “Hypothesis of NLP Technique Evolution” from a global perspective by exploring the internal connection between the development of modern NLP technology.
- New Paradigm: We propose a new paradigm for modelling NLP: reStructured Pre-training. This paradigm regards model pre-training/tuning as a data storing/accessing process and claims that a good storage mechanism should make expected data easily accessible.
- AI for Gaokao: We develop QIN, the first deep learning-based AI system for the Gaokao-English examination.
- Rich Resources: We release the Gaokao Benchmark to track how well we make progress towards human-level intelligence, and set up an interactive leaderboard using ExplainaBoard as a Gaokao Benchmark.
- Inspiring Evidence: The success of AI in English for the Gaokao Examination has provided us with much new thinking: AI technology can empower education and help solve a series of problems in education and teaching. The impressive performance on more than 50 datasets from varieties of NLP tasks shows the value of data-centric pre-training and inspires more future exploration.
Unlike current NLP paradigms that focus on model architecture/structure, the proposed RST seeks to maximize the utility of available data: it has the data cover as many types of signals as possible, and it offers a precise mechanism for accessing those signals according to the requirements of downstream tasks.
The RST method comprises three steps: restructure, pretrain, and fine-tune. Existing data signals in diverse formats are first restructured into a unified form for model pretraining. A pretraining architecture is then selected and trained over this restructured data. Finally, the model is further fine-tuned on restructured labelled data for improved performance.
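To make the first of these steps more concrete, here is a minimal Python sketch of what restructuring heterogeneous signals into one unified text-to-text form could look like. It is an illustration only, not the authors’ implementation: the `Example` class, the `TEMPLATES` prompts, the signal names, and the `restructure` and `build_corpus` helpers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Example:
    """One restructured pair in a unified text-to-text form (hypothetical)."""
    signal: str  # illustrative signal name, e.g. "sentiment"
    source: str  # prompt text the model reads
    target: str  # text the model is expected to produce


# Hypothetical prompt templates, one per signal type. The templates used in
# the paper are not reproduced here.
TEMPLATES: Dict[str, str] = {
    "sentiment": "Review: {text}\nQuestion: Is this review positive or negative?",
    "entity": "Text: {text}\nQuestion: What type of entity is \"{mention}\"?",
}


def restructure(signal: str, fields: Dict[str, str], answer: str) -> Example:
    """Map one raw signal record into the unified (source, target) form."""
    template = TEMPLATES[signal]
    return Example(signal=signal, source=template.format(**fields), target=answer)


def build_corpus(records: List[Tuple[str, Dict[str, str], str]]) -> List[Example]:
    """Restructure every raw record; the result is what a model would train on."""
    return [restructure(signal, fields, answer) for signal, fields, answer in records]


# Toy records in two different original formats (made up for illustration).
raw_records = [
    ("sentiment", {"text": "Great battery life."}, "positive"),
    ("entity", {"text": "Pittsburgh is in Pennsylvania.", "mention": "Pittsburgh"}, "location"),
]

pretraining_corpus = build_corpus(raw_records)
for example in pretraining_corpus:
    print(example.source, "->", example.target)
```

In this sketch, the pretraining and fine-tuning steps would both consume the same `(source, target)` pairs, with fine-tuning using restructured labelled data from the downstream task. The prompt template then doubles as the way a downstream task accesses a particular signal.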

The paper also introduces QIN, which the researchers believe is the first dedicated deep learning-based AI system for China’s Gaokao English College Entrance Examination.


In their empirical study, the team evaluated the proposed RST on a variety of NLP tasks, where it outperformed baseline models such as GPT-3 and T0pp on 52 of the 55 surveyed datasets. The QIN AI system also achieved outstanding results on the Gaokao English exam, scoring 40 points higher than the average student result and 15 points higher than GPT-3 while using 1/16 of its parameters.
Overall, this work argues that, in NLP, “blindly sticking with supervised or unsupervised, pre-training or fine-tuning, few-shot, or zero-shot makes little sense. In practice, all that matters is how we make the best use of the information from data that we can get from the world.”
The Gaokao Benchmark for AI is available on the project’s GitHub. The paper ReStructured Pre-training is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

