Neural machine translation (NMT) systems have achieved promising results in recent years, but numerical mistranslation remains a pervasive issue found even in major commercial systems and state-of-the-art research models. A single mistranslated digit can have severe consequences, especially in systems deployed in the financial and medical fields.
To facilitate the discovery of numerical errors in NMT systems, a research team from the University of Melbourne, Facebook AI, and Twitter Cortex has proposed a black-box test method for assessing and debugging the numerical translation of NMT systems in a systematic manner. The approach reveals novel types of errors that are general across multiple state-of-the-art translation systems.
The researchers drew inspiration for their proposed NMT evaluation suite from Ribeiro et al.’s 2020 CheckList, designed for behavioural testing of NLP models. The team first explores four capabilities that demonstrate the expected translation abilities of systems on common numerical text: integers, decimals, numerals and separators. Integers were tested as digit sequences of varying length, and decimals as floating-point numbers with varying levels of precision. The results reveal that NMT systems tend to malfunction when translating larger integers and decimals with longer fractional parts.
The team also evaluated NMT systems’ ability to translate numbers written as words (numerals), and numbers with separators, generally commas or periods used to mark decimal points or thousands groupings.
The researchers examined numerical formats of various lengths and decimal-point positions, and tested them across all four capabilities. They note that the numbers they created for a given format can also be seen as a set of “adversarial” examples, as they are small perturbations of each other.
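The idea of generating many numbers that share a format but differ by small digit perturbations can be illustrated with a short sketch. This is our own minimal illustration, not the paper's released code; the function name and the choice of ten samples per format are assumptions:

```python
import random

def make_test_numbers(n_digits, n_fraction=0, seed=0):
    """Generate numeric test inputs of a fixed format (integer length and
    fractional precision). Numbers sharing a format differ only by small
    digit perturbations, so they act as 'adversarial' variants of each other."""
    rng = random.Random(seed)
    cases = []
    for _ in range(10):
        integer_part = str(rng.randint(10 ** (n_digits - 1), 10 ** n_digits - 1))
        if n_fraction:
            fraction = "".join(str(rng.randint(0, 9)) for _ in range(n_fraction))
            cases.append(f"{integer_part}.{fraction}")
        else:
            cases.append(integer_part)
    return cases
```

Each call fixes one format (say, five-digit integers, or three digits with two decimal places) and yields a batch of near-identical inputs, so a failure on one member of the batch but not another points to brittleness rather than a systematic gap.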
The team conducted their evaluations in both high-resource (HR) and low-resource (LR) scenarios, where the HR set included English-German and English-Chinese pairs, and the LR set contained English-Tamil and English-Nepali pairs. The behavioural tests were conducted on popular commercial translation systems, with the team using pass rate (PR) — the fraction of inputs where the system translates the numerical component perfectly — as their evaluation metric.
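A pass-rate metric of this kind can be sketched in a few lines. This is a simplified reading of the metric, not the authors' implementation: it treats a test case as passed when the multiset of numbers extracted from the source matches the one extracted from the translation, and the simple regex deliberately ignores separator and numeral conventions:

```python
import re

NUM = re.compile(r"\d+(?:\.\d+)?")

def pass_rate(sources, translations):
    """Fraction of sentence pairs whose numeric tokens survive translation
    intact: every number found in the source must reappear in the output."""
    passed = 0
    for src, hyp in zip(sources, translations):
        if sorted(NUM.findall(src)) == sorted(NUM.findall(hyp)):
            passed += 1
    return passed / len(sources)
```

For example, a system that drops a zero from "1000" fails that input outright, which is exactly the all-or-nothing behaviour a pass rate is meant to capture.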
The results show that numerical translation is a systemic problem existing across all three tested SOTA systems, with four major error types: decimal/thousands separators, cardinal numerals, digits and units. The researchers propose several strategies to mitigate such errors, including separate treatment of numbers, data augmentation, tailoring BPE segmentation and sanity checks.
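The sanity-check idea could, for instance, take the form of a lightweight post-hoc filter that flags translations whose digits disagree with the source. The sketch below is our own illustration of that strategy (the helper names are assumptions); stripping separator punctuation before comparing lets it tolerate the comma/period swap between, e.g., English and German conventions:

```python
import re

NUM = re.compile(r"\d[\d.,]*\d|\d")

def digits_only(token):
    """Strip separator punctuation so '1,234.50' and '1.234,50' compare equal
    (German swaps the roles of comma and period relative to English)."""
    return "".join(ch for ch in token if ch.isdigit())

def sanity_check(source, translation):
    """Return digit strings present in the source but missing from the
    translation; an empty list means the numbers survived intact."""
    missing = [digits_only(t) for t in NUM.findall(source)]
    for h in (digits_only(t) for t in NUM.findall(translation)):
        if h in missing:
            missing.remove(h)
    return missing
```

A deployment could route any flagged sentence to a fallback such as copying the source number verbatim, which is one way to realise the "separate treatment of numbers" strategy the authors mention.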
Overall, the study reveals novel types of errors that are present across multiple SOTA translation systems. The team believes their findings can help improve numerical translation quality and reduce numerical misinformation in NMT systems.
The paper As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical Translation is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang