This is a boom time for commercial applications of Neural Machine Translation (NMT), with the performance of multilingual systems rapidly advancing both in terms of translation quality and robustness to input perturbations such as spelling or grammatical errors. A new paper argues, however, that the current NMT research focus on performance and robustness can result in “hallucinations”: translation outputs that are not faithful to the source and can even be factually incorrect.
The paper, Sometimes We Want Translationese, comes from a McGill University, Quebec AI Institute and Facebook AI research team. It proposes a series of novel metrics and perturbation functions designed to detect, quantify, and compare trade-offs between robustness and faithfulness in NMT systems, both at the corpus level and for individual examples.
SOTA translation systems have evolved to be more robust in dealing with noisy inputs, and in doing so typically assume that orthographic, lexical or grammatical variants in the input are mistakes that should be corrected in the translation. The team cautions that this assumption overlooks use cases where systems should instead prioritize faithfulness to the input text — in other words, where stylistically awkward but semantically faithful “translationese” outputs are better.
The paper analyses how existing NMT systems deal with word order permutations in the inputs. Previous studies have shown that some word-order agnostic models such as recurrent neural networks trained in dependency parsing can translate from source languages to distantly related languages better than word-order sensitive models.
While randomly shuffling linguistic elements is an established method for evaluating NLP model performance, NMT systems used for automated tutoring, for conveying argument structure information, or for poetry need to preserve word order and local syntax, i.e. to be faithful to the input.
The researchers introduce three metrics to evaluate the translation robustness and faithfulness of SOTA NMT models when dealing with word-order permuted source sentences. The first metric measures robustness to perturbation by scoring the similarity between the translation of a perturbed source sentence and the gold target sentence. The second measures faithfulness by scoring the similarity between the translation of a perturbed source sentence and the target sentence with the same perturbation applied to it. The last metric measures standard translation performance on any given unperturbed source-target sentence pair.
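The relationship between the three scores can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `similarity`, `evaluate`, `word_by_word` and `always_fluent` are hypothetical, the similarity function is a simple order-sensitive token match from the standard library standing in for a metric such as BLEU, and the two "translation systems" are toy functions.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

# Hypothetical type aliases: a translation system and a perturbation function.
Translate = Callable[[str], str]
Perturb = Callable[[str], str]


def similarity(hypothesis: str, reference: str) -> float:
    """Order-sensitive token similarity in [0, 1] (a stand-in for BLEU)."""
    return SequenceMatcher(None, hypothesis.split(), reference.split()).ratio()


def evaluate(translate: Translate, perturb: Perturb,
             source: str, target: str) -> Dict[str, float]:
    """Compute the three scores for one source-target sentence pair."""
    translated_perturbed = translate(perturb(source))
    return {
        # Translation of the perturbed source vs. the gold target.
        "robustness": similarity(translated_perturbed, target),
        # Translation of the perturbed source vs. the perturbed target.
        "faithfulness": similarity(translated_perturbed, perturb(target)),
        # Translation of the clean source vs. the gold target.
        "performance": similarity(translate(source), target),
    }


# Toy setup (assumption): a three-word lexicon and a word-reversal perturbation.
LEXICON = {"le": "the", "chat": "cat", "dort": "sleeps"}


def reverse_words(sentence: str) -> str:
    return " ".join(reversed(sentence.split()))


def word_by_word(sentence: str) -> str:
    """A maximally faithful toy system: translates tokens in place."""
    return " ".join(LEXICON[word] for word in sentence.split())


def always_fluent(sentence: str) -> str:
    """A maximally robust toy system: ignores the input word order."""
    return "the cat sleeps"


faithful_scores = evaluate(word_by_word, reverse_words,
                           "le chat dort", "the cat sleeps")
robust_scores = evaluate(always_fluent, reverse_words,
                         "le chat dort", "the cat sleeps")
```

In this toy setting both systems reach the same performance on the clean sentence pair, but the word-by-word system scores highest on faithfulness while the order-ignoring system scores highest on robustness, which is the trade-off the metrics are meant to expose.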
The team proposes 16 functions to perturb the structure of an input sentence, which fall into three categories: dependency tree based, PoS-tag based, and random shuffles. These functions vary in complexity and linguistic sophistication and are designed to reveal whether a model performs faithful translations or stays robust to the perturbed inputs.
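To make the three categories concrete, here is one illustrative function per category. These are hypothetical examples and not any of the paper's 16 functions; the function names, the `ADJ`/`NOUN` tag labels and the head-index encoding (`-1` for the sentence root) are assumptions, and the PoS tags and dependency heads are taken as pre-computed inputs rather than produced by a tagger or parser.

```python
import random
from typing import List, Sequence


def random_shuffle(tokens: Sequence[str], seed: int = 0) -> List[str]:
    """Random category: shuffle all tokens with a fixed seed."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def swap_adj_noun(tokens: Sequence[str], tags: Sequence[str]) -> List[str]:
    """PoS-tag category: swap every adjacent adjective-noun pair."""
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if tags[i] == "ADJ" and tags[i + 1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


def subtree_indices(heads: Sequence[int], root: int) -> List[int]:
    """Indices of `root` and all its descendants.

    heads[i] is the index of token i's head, or -1 for the sentence root.
    Assumes `heads` encodes a valid tree (no cycles).
    """
    members = []
    for i in range(len(heads)):
        j = i
        while j != -1 and j != root:
            j = heads[j]  # walk up towards the sentence root
        if j == root:
            members.append(i)
    return members


def reverse_subtree(tokens: Sequence[str], heads: Sequence[int],
                    root: int) -> List[str]:
    """Dependency-tree category: reverse the tokens of one subtree."""
    idx = subtree_indices(heads, root)
    out = list(tokens)
    for src, dst in zip(idx, reversed(idx)):
        out[dst] = tokens[src]
    return out


# Toy annotated sentence (assumption): tags and heads written by hand.
tokens = ["the", "black", "cat", "sleeps"]
tags = ["DET", "ADJ", "NOUN", "VERB"]
heads = [2, 2, 3, -1]  # "the" and "black" attach to "cat"; "cat" to "sleeps"

pos_perturbed = swap_adj_noun(tokens, tags)
tree_perturbed = reverse_subtree(tokens, heads, root=2)
shuffled = random_shuffle(tokens)
```

The linguistically informed functions change the sentence in a controlled, local way, whereas the random shuffle can destroy any structure, which is one way such functions can differ in complexity and sophistication.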
To evaluate their NMT system analysis method, the researchers conducted experiments using SOTA transformer translation models, pairing English with French, German, Russian, Japanese, Chinese, Spanish and Italian.
The team summarizes their observations as:
- State-of-the-art NMT systems tend towards producing translations that are unaffected by the noisy source (more robust).
- The accuracy (word overlap BLEU score) correlates with model robustness.
- Certain perturbations involving part-of-speech-based word reordering tend to further encourage robustness.
- Results vary by target language, with the Japanese model producing translations that are most robust but less faithful.
The researchers conclude that an over-emphasis on system performance and robustness may be limiting the richer development and broader usefulness of NMT systems.
The paper Sometimes We Want Translationese is on arXiv.
Author: Hecate He | Editor: Michael Sarazen