In a bid to better track progress in natural language generation (NLG) models, a global project involving 55 researchers from more than 40 prestigious institutions has proposed GEM (Generation, Evaluation, and Metrics), a “living benchmark” environment for NLG with a focus on evaluation.
NLG models take non-linguistic or textual representations of information as input and automatically generate understandable text. Natural language processing (NLP) benchmarks such as GLUE (General Language Understanding Evaluation) have already been applied to NLG and natural language understanding (NLU) models. Although such benchmarks aggregate multiple tasks under a unified evaluation framework and have helped researchers efficiently compare models, they also risk reducing model performance to a single number on a benchmark leaderboard. As the research team notes in the paper The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics, a single metric cannot properly characterize system performance, as considerations such as model size and fairness are overlooked.
The paper’s first author, Google AI Language Research Scientist Sebastian Gehrmann, explains that the researchers plan to conduct a shared-task workshop at this summer’s ACL 2021: “Since data, models, and evaluation evolve together, benchmarks need to be up to date on all of them. As a ‘living’ benchmark, GEM does not have a fixed set of metrics or leaderboard. Instead, we aim to discover shortcomings and opportunities for progress. To do so, the shared task will have two parts: modeling and eval. First, we ask for challenge set submissions for 11 datasets and 7 languages in various NLG challenges. In the second part, participants will analyze the outputs.”
Automated metrics tend to perform differently across tasks, setups, and languages. It has been common practice among NLG researchers to assess how well human ratings and automated metrics correlate with task-based evaluations. The GEM workshop will help NLG researchers explore shared tasks and develop GEM into a benchmark environment for NLG. By having human annotators evaluate model outputs, the researchers also hope to establish repeatable and consistent human annotation practices for future NLG research. Moreover, “to discourage hill-climbing,” Gehrmann tweeted, “we are developing a result analysis tool that will help gain insights from the evaluation without a focus on which result is SotA.”
The GEM project’s ultimate goal is to enable in-depth analysis of data and models rather than a focus on a single leaderboard score. By measuring NLG progress across 13 datasets spanning many NLG tasks and languages, the team hopes the GEM benchmark can also set standards for future evaluation of generated text using both automated and human metrics.
The researchers have opened the project to the NLG research community, and senior members will be available to help newcomers contribute. All data is available via Hugging Face Datasets, and the GEM benchmark lives at gem-benchmark.com.
The paper The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics is on arXiv.
Reporter: Fangyu Cai | Editor: Michael Sarazen