GPT Understands, Too! Tsinghua & MIT’s P-Tuning Boosts Performance on NLU Benchmarks

Tsinghua & MIT researchers break the stereotype that GPTs can generate but not understand language, showing that GPTs can compete with BERT models on natural language understanding tasks using a novel P-tuning method that can also improve BERT performance in both few-shot and supervised settings.

The GPT-3 large language model has demonstrated an unprecedented ability to generate human-like text passages, from creating imaginary conversations between historical figures to summarizing movies and writing code. However, although its output is grammatically correct and even idiomatically impressive, GPT-3’s comprehension of the world is often seriously lacking.

A research team from Tsinghua & MIT has dispelled the stereotype that GPT models can generate but not understand language, demonstrating that GPTs can compete with Google’s BERT models on natural language understanding (NLU) tasks by employing a proposed P-tuning method. The team says P-tuning can also improve BERT performance in both few-shot and supervised settings.

The researchers summarize their contributions as follows:

  1. Show that with P-tuning, GPTs can be as competitive as BERTs in NLU (and sometimes even better), and that P-tuning can boost pretrained language model performance in general. This reveals that the potential of GPT-style architectures for NLU has been underestimated.
  2. Show that P-tuning is a general method to improve GPTs and BERTs in both few-shot and fully supervised settings. The proposed approach outperforms state-of-the-art methods on LAMA knowledge probing and few-shot SuperGLUE, indicating that language models have grasped more world knowledge and prior-task knowledge during pretraining than previously thought.

Giant models like GPT-3 tend to suffer from poor transferability, and fine-tuning these models, which can have hundreds of billions of parameters, for downstream tasks is impractical. Instead, GPT-3 leverages handcrafted prompts to adapt the model to downstream applications. Such handcrafted prompt searching, however, relies heavily on impractically large validation sets, and a single word change in a prompt can cause drastic performance swings.
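
To illustrate this brittleness, below is a minimal sketch, assuming the Hugging Face transformers library and a BERT model as the probed language model (the prompts and query are illustrative, not taken from the paper): two handcrafted cloze templates that differ in a single word can rank candidate answers quite differently.

```python
from transformers import pipeline  # assumes Hugging Face transformers is installed

# Two handcrafted cloze prompts for the same query, differing in one word.
# With discrete prompts, even such a small edit can shift the model's top
# predictions, which is the brittleness P-tuning is designed to avoid.
fill = pipeline("fill-mask", model="bert-base-cased")
for prompt in (
    "The capital of Great Britain is [MASK].",
    "The capital city of Great Britain is [MASK].",
):
    top = fill(prompt, top_k=3)
    print(prompt, "->", [(p["token_str"], round(p["score"], 3)) for p in top])
```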

The proposed P-tuning approach addresses these issues. It automatically searches for prompts in a continuous embedding space, bridging the gap between GPTs and downstream NLU tasks and bringing substantial improvements in NLU performance.

The P-tuning architecture itself is relatively simple. Given a pretrained language model, a sequence of discrete input tokens is mapped to input embeddings by the pretrained embedding layer. The function of a prompt p is to organize the context x, the target y and itself into a template t. Rather than restricting prompts to discrete vocabulary tokens, P-tuning represents the prompt tokens as trainable continuous embeddings, generated by a small prompt encoder, and optimizes them directly through the downstream loss function.
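
As a concrete illustration, here is a minimal PyTorch sketch of the core idea, not the authors' released implementation: a few pseudo-tokens in the template are represented by trainable continuous embeddings, passed through a small bidirectional-LSTM-plus-MLP prompt encoder as described in the paper, and spliced into the language model's input embeddings. All sizes and names here are illustrative.

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Produces continuous prompt embeddings for P-tuning's pseudo-tokens.

    A bidirectional LSTM followed by an MLP models dependencies between
    prompt embeddings instead of treating them as independent parameters.
    """

    def __init__(self, num_prompt_tokens: int, hidden_size: int):
        super().__init__()
        self.raw_embeds = nn.Parameter(torch.randn(num_prompt_tokens, hidden_size))
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                 nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self) -> torch.Tensor:
        out, _ = self.lstm(self.raw_embeds.unsqueeze(0))  # (1, P, H)
        return self.mlp(out).squeeze(0)                   # (P, H)

hidden = 768                                   # e.g. BERT-base hidden size
encoder = PromptEncoder(num_prompt_tokens=4, hidden_size=hidden)
prompts = encoder()                            # (4, 768) trainable prompt vectors
context = torch.randn(10, hidden)              # stand-in for embedded context x
inputs = torch.cat([prompts, context], dim=0)  # template: [p_0..p_3 ; x ; ...]
print(inputs.shape)                            # torch.Size([14, 768])
```

In training, these spliced embeddings would be fed to the pretrained model and the prompt encoder updated by backpropagating the downstream loss, leaving the prompt search entirely to gradient descent rather than manual trial and error.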

The team conducted extensive experiments on the popular LAMA knowledge probing and SuperGLUE NLU benchmarks to evaluate P-tuning performance.

LAMA knowledge probing evaluates how much knowledge language models have gained from pretraining. The results show that P-tuning significantly boosts knowledge-probing performance, from 43.3 percent to 50.6 percent on LAMA-34k and from 45.2 percent to 64.2 percent on LAMA-29k, indicating that merely finding a better prompt, without any fine-tuning of the language model, can elicit far more knowledge than researchers previously believed. P-tuning also outperforms previous discrete prompt searching approaches such as AutoPrompt and LPAQA.
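
For reference, the LAMA numbers above are Precision@1: the fraction of cloze queries for which the model's single top-ranked prediction exactly matches the gold answer. Here is a tiny sketch of the metric (function name and example values are illustrative):

```python
# Minimal sketch of Precision@1 as used in LAMA-style knowledge probing:
# a query counts as a hit only if the model's top prediction exactly
# matches the gold answer.
def precision_at_1(top_predictions, gold_answers):
    hits = sum(pred == gold for pred, gold in zip(top_predictions, gold_answers))
    return 100.0 * hits / len(gold_answers)

print(precision_at_1(["London", "Paris", "Rome"],
                     ["London", "Lyon", "Rome"]))  # ~66.67 (2 of 3 correct)
```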

In the SuperGLUE experiments, the team considered both a fully supervised setting and a few-shot setting on NLU tasks that included question answering (BoolQ and MultiRC), textual entailment (CB and RTE), co-reference resolution (WSC), causal reasoning (COPA), and word sense disambiguation (WiC).

Under the fully supervised setting, for both BERT-base-cased and BERT-large-cased models, the P-tuning approach outperformed all other BERT-based models on almost all tasks. P-tuning also achieved promising results with both GPT-2-base and GPT-2-medium models.

Under the few-shot learning setting, P-tuning consistently outperformed PET (Ddev32) and PET-best (Ddev32), which use manual prompts, on all tasks; and compared to GPT-3, P-tuning improved performance on six out of seven tasks. This indicates that P-tuning can find far better prompts than manual approaches and significantly improve few-shot task performance.

Overall, the study demonstrates the competitive potential of P-tuning in prompting larger-scale pretrained models that are difficult to fine-tune.

The paper GPT Understands, Too is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

