For humans, using language to describe scenes or exchange opinions comes almost without effort. Natural language generation (NLG) models are the go-to tools for mimicking such behaviour, and have been applied to translation, summarization, structured data-to-text generation and image captioning. Over the past few years, the performance of NLG models has improved dramatically thanks to continually evolving machine learning techniques.
As NLG models continue leaping forward and producing text with unprecedented naturalness and accuracy, many existing evaluation methods have struggled to keep pace. Surveying human evaluators remains one way to ensure the quality of new NLG models, but it is generally slow and carries heavy labour costs. The alternative is automatic metrics such as the popular BLEU (Bilingual Evaluation Understudy) score, but these are often unreliable compared to human evaluation because they reward surface word overlap rather than meaning.
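To make BLEU's limitation concrete, here is a minimal, simplified sentence-level BLEU in plain Python (real BLEU is corpus-level and smoothed; this sketch only illustrates the n-gram-overlap idea the metric is built on):

```python
from collections import Counter
import math

def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of n-gram
    precisions with a brevity penalty. Illustrative only."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)       # avoid log(0)
    # Brevity penalty discourages very short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# An exact copy scores perfectly...
print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
# ...but a valid paraphrase with little word overlap scores near zero,
# even though the meaning is preserved -- the weakness BLEURT targets.
print(simple_bleu("a feline rested upon the rug", "the cat sat on the mat"))
```

Because the score depends only on shared n-grams, paraphrases and fluent rewordings are punished, which is why learned, semantic-level metrics are attractive.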
The challenge is to develop an automatic metric robust and reliable enough to match the evaluation quality that human annotation can deliver. In the recently published paper BLEURT: Learning Robust Metrics for Text Generation, a Google Research team proposes BLEURT (Bilingual Evaluation Understudy with Representations from Transformers), an automatic metric built on Google's highly successful BERT language model and designed to model such human assessments.
The researchers introduce BLEURT as a machine-learning-based automatic metric trained largely on the WMT Metrics Shared Task dataset, which, at some 260k human ratings, is the most extensive public collection of its kind. Although the WMT Metrics benchmark enables comparisons of how well metrics imitate human assessments, its scope is limited to the news domain. The researchers therefore set out to train a metric that performs well across a wide range of tasks and domains, in order to evaluate its ability to generalize.
The project employs a novel pretraining scheme that leverages millions of synthetic examples to help the model generalize. The researchers first used BERT, Google's state-of-the-art Transformer network, to learn contextual representations of text sequences, warming the model up before fine-tuning on rating data. To expose BLEURT to as many different sentence pairs as possible, and to the kinds of errors and alterations that appear in real NLG output, they chose a low-cost automatic approach to generating synthetic sentence pairs.
The team generated the synthetic sentence pairs by randomly perturbing 1.8 million segments from Wikipedia, pretraining on these pairs before fine-tuning the model on human ratings. They observed that the combined steps improved BLEURT's agility and generalization ability, especially when training data was incomplete.
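The paper's actual perturbations include BERT mask-filling and backtranslation; as a rough illustration of the cheap-perturbation idea, the sketch below (stdlib only, with hypothetical parameter names) derives a synthetic "candidate" from a reference segment by randomly dropping and swapping words:

```python
import random

def perturb(segment, drop_prob=0.15, swap_prob=0.1, seed=None):
    """Create a synthetic candidate sentence from a reference segment
    by randomly dropping words and swapping adjacent words. A toy
    stand-in for BLEURT's mask-filling/backtranslation perturbations."""
    rng = random.Random(seed)  # seeded for reproducible pairs
    words = segment.split()
    # Randomly drop words (keep at least one so the output is non-empty).
    kept = [w for w in words if rng.random() > drop_prob] or words[:1]
    # Randomly swap adjacent words to simulate word-order errors.
    for i in range(len(kept) - 1):
        if rng.random() < swap_prob:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
    return " ".join(kept)

segment = "BLEURT is a learned metric for evaluating text generation models"
# Each (reference, candidate) pair becomes a synthetic training example.
for s in range(3):
    print((segment, perturb(segment, seed=s)))
```

Pairs like these, labeled with cheap automatic signals, let the model see far more variation than any human-rated dataset could provide before fine-tuning begins.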
In experiments, BLEURT achieved state-of-the-art performance on both the WMT Metrics Shared Task and the WebNLG Competition dataset. The researchers see BLEURT as a valuable addition to the language evaluation toolkit, one that could support future studies on multilingual NLG evaluation and hybrid methods involving both humans and classifiers, while offering machine learning engineers more flexible, semantic-level metrics.
The paper BLEURT: Learning Robust Metrics for Text Generation is available on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen