A new Google Brain and New York University study argues that the current evaluation techniques for natural language understanding (NLU) tasks are broken, and proposes guidelines designed to produce better NLU benchmarks.
Contemporary NLU studies tend to focus on improving results on benchmark datasets whose training, validation, and test sets are drawn from roughly the same distribution (i.e., are independent and identically distributed, IID). The researchers, however, say such benchmark-driven NLU research has become problematic, as “unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements.”
A recent trend that addresses this issue is the abandonment of IID benchmarks in favour of adversarially constructed, out-of-distribution test sets. However, this method is out of step with the higher purpose of benchmarks: helping to improve models. Instead, it focuses on collecting dataset examples that current models fail on, an approach the researchers say is neither necessary nor sufficient to create a useful benchmark.
In the paper What Will it Take to Fix Benchmarking in Natural Language Understanding?, the Google & NYU team sets out to restore a healthy NLU evaluation ecosystem by identifying four criteria it believes benchmarks should meet.

NLU models have achieved impressive results in recent years. Leaderboard scores have approached or exceeded human performance on all nine tasks in the popular GLUE (General Language Understanding Evaluation) benchmark, while model performance on the SQuAD 2.0 English reading comprehension leaderboard has long surpassed that of human annotators.
Pouring cold water on these exciting results, however, is the uncomfortable reality that even state-of-the-art models often fail dramatically on simple test cases. While true upper bounds on performance cannot be reliably measured, the apparently superhuman scores suggest there is not much headroom left on these benchmarks. What’s more, the presence of socially relevant biases in top models makes their deployment difficult in many applications, and there are as yet no effective methods for discouraging such harmful biases.
In consideration of these NLU benchmark shortcomings, the researchers lay out four criteria they believe can facilitate the building of machines that demonstrate a comprehensive and reliable understanding of everyday natural language text in specific, well-posed tasks and across language varieties and topic domains.

The first criterion is validity, which asserts that if a tested system outperforms another on some benchmark task, that result should reliably indicate that the tested system is actually better at the task. The team identifies the minimal requirements for a benchmark to meet this criterion:
- An evaluation dataset should reflect the full range of linguistic variation—including words and higher-level constructions—that is used in the relevant domain, context, and language variety.
- An evaluation dataset should have a plausible means by which it tests all of the language-related behaviours that we expect the model to show in the context of the task.
- An evaluation dataset should be sufficiently free of annotation artifacts that a system cannot reach near-human levels of performance by any means other than demonstrating the required language-related behaviours (a simple artifact check is sketched after this list).
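One common probe for the third requirement is a partial-input baseline: if a classifier that never sees the full input (for example, a hypothesis-only model on an NLI-style task) still approaches the accuracy of full models, the dataset likely contains annotation artifacts. Below is a minimal sketch of such a check; the file names and JSONL schema are assumptions for illustration, not details from the paper.

```python
# A minimal sketch, not from the paper: a hypothesis-only baseline for an
# NLI-style benchmark. High accuracy here means labels are partly predictable
# without reading the premise, i.e. the dataset contains annotation artifacts.
# File names and the JSONL schema below are assumptions for illustration.
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def load_split(path):
    """Load a JSONL split with 'premise', 'hypothesis' and 'label' fields."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    return [r["hypothesis"] for r in rows], [r["label"] for r in rows]


train_texts, train_labels = load_split("train.jsonl")  # hypothetical paths
test_texts, test_labels = load_split("test.jsonl")

# Simple bag-of-words classifier over hypotheses only.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

accuracy = accuracy_score(test_labels, classifier.predict(vectorizer.transform(test_texts)))
print(f"Hypothesis-only accuracy: {accuracy:.3f}")
```

If this partial-input score sits far above the majority-class rate, the benchmark risks rewarding shortcut learning rather than the language-related behaviours the task is meant to test.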
The second criterion is reliable annotation, which simply means the labels used on test examples should be correct. Here, the team lists three failure cases that must be avoided (a simple agreement check is sketched after the list):
- Examples that are carelessly mislabelled.
- Examples that have no clear correct label due to unclear or underspecified task guidelines.
- Examples that have no clear correct label under the relevant metric due to legitimate disagreements in interpretation among annotators.
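To illustrate how such failure cases might be surfaced, here is a small sketch (not from the paper) that computes inter-annotator agreement and flags items where annotators disagree; the toy labels are invented for illustration.

```python
# A minimal sketch, not from the paper: measure inter-annotator agreement and
# flag test items without a unanimous label. The toy annotations are invented.
from collections import Counter

from sklearn.metrics import cohen_kappa_score

# One row per example, one column per annotator (hypothetical labels).
annotations = [
    ["entailment", "entailment", "entailment"],
    ["neutral", "contradiction", "neutral"],
    ["entailment", "entailment", "neutral"],
    ["contradiction", "contradiction", "contradiction"],
]

# Pairwise Cohen's kappa between the first two annotators.
first = [row[0] for row in annotations]
second = [row[1] for row in annotations]
print("Cohen's kappa (annotators 1 vs 2):", round(cohen_kappa_score(first, second), 3))

# Items where annotators disagree are candidates for relabelling, removal,
# or scoring against the full label distribution rather than a single gold label.
for i, row in enumerate(annotations):
    label, count = Counter(row).most_common(1)[0]
    if count < len(row):
        print(f"Example {i}: annotators disagree: {row}")
```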
The third criterion is statistical power, meaning evaluation datasets should be large and discriminative enough to detect qualitatively relevant performance differences between models with confidence. This criterion introduces a trade-off: if benchmark datasets are both reliable and difficult for current models, a moderate dataset size will suffice; but if they are too easy, much larger evaluation sets are required to reach adequate statistical power.
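As a rough illustration of this trade-off (not a calculation from the paper), the sketch below estimates how many evaluation examples are needed to separate two models with a given true accuracy gap, treating the accuracies as independent proportions; a paired test such as McNemar’s on a shared test set would typically need somewhat fewer.

```python
# A back-of-the-envelope sketch, not from the paper: approximate evaluation-set
# size needed for a two-sided two-proportion z-test to separate two models.
from math import sqrt
from statistics import NormalDist


def required_test_size(acc_a, acc_b, alpha=0.05, power=0.8):
    """Examples needed to detect the accuracy gap at the given significance and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (acc_a + acc_b) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(acc_a * (1 - acc_a) + acc_b * (1 - acc_b))) ** 2
    return numerator / (acc_a - acc_b) ** 2


# A 1-point gap near a 90% ceiling vs. a 5-point gap on a harder benchmark.
print(round(required_test_size(0.90, 0.91)))  # ~13,500 examples
print(round(required_test_size(0.70, 0.75)))  # ~1,250 examples
```

Under this approximation, a one-point gap near a 90 percent ceiling needs roughly ten times more test examples than a five-point gap on a harder task, which is why easy, near-saturated benchmarks demand very large evaluation sets.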
The last criterion is disincentives for biased models, i.e., a satisfactory benchmark should favour models with fewer socially relevant biases. As current benchmarks are mostly built from naturally occurring or crowdsourced text, many fail this test. The team notes that adequately enumerating the social attributes on which we might want to evaluate bias is difficult, especially across different cultural contexts, and that attitudes on sensitive issues such as race, gender, sexual orientation and disability are constantly changing. Meeting this criterion is thus challenging, and issues in this area will likely continue to concern both research communities and the public.
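As one narrow illustration of what a bias disincentive could look like in practice (the paper does not prescribe a specific test), the sketch below runs a counterfactual template probe: it swaps demographic terms in otherwise identical sentences and reports the largest score gap produced by the system under test. The templates, term pairs, and stand-in scorer are all hypothetical.

```python
# A minimal sketch, not from the paper: a counterfactual template probe that
# measures how much a system's score changes when only a demographic term is
# swapped. Templates, term pairs and the stand-in scorer are hypothetical.
from itertools import product

TEMPLATES = ["{} is a brilliant engineer.", "{} was rude to the waiter."]
TERM_PAIRS = [("He", "She"), ("John", "Aisha")]


def score(text: str) -> float:
    """Stand-in for the system under test; replace with a real model's score."""
    return float(len(text))  # placeholder signal only


def max_counterfactual_gap(scorer) -> float:
    gaps = []
    for template, terms in product(TEMPLATES, TERM_PAIRS):
        scores = [scorer(template.format(term)) for term in terms]
        gaps.append(max(scores) - min(scores))
    return max(gaps)


# A benchmark that disincentivises biased models would penalise large gaps here.
print("Largest counterfactual score gap:", max_counterfactual_gap(score))
```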
The team also traces possible research directions that could lead to improvements on each criterion. These include involving experts in crowdsourced dataset creation to mitigate annotation artifacts, treating ambiguously labelled examples the same way as mislabelled examples, and so on. The researchers hope their work and such efforts will help build a healthier NLU benchmark ecosystem.
The paper What Will it Take to Fix Benchmarking in Natural Language Understanding? is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
