Machine learning has significantly boosted automatic translation in recent years. Systems such as Google’s 2016 Neural Machine Translation system (GNMT) have improved translation quality across more than 100 languages. Even today’s most advanced translation systems, however, still lag behind human performance in almost all respects. For widely used languages, plentiful data resources make it relatively easy to improve translation quality; improvements for low-resource languages, though, have not come nearly as quickly.
In a recent Google AI blog post, researchers report on their latest efforts and progress in language translation, especially for resource-poor languages. Overall, the quality improvements average about five points on the BLEU (Bilingual Evaluation Understudy) metric across more than 100 languages. The researchers also show how to apply their techniques at scale to the noisy data mined from the web.
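For readers unfamiliar with the metric, BLEU scores a candidate translation by its n-gram overlap with a reference translation, scaled by a brevity penalty; reported scores (like the five-point gains above) are usually on a 0–100 scale. The following is a simplified, single-reference sketch of the idea, not the exact formulation used in evaluation toolkits:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    # Brevity penalty discourages trivially short candidates
    bp = (1.0 if len(candidate) >= len(reference)
          else math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat".split()
print(bleu("the cat sat on the mat".split(), reference))  # perfect match: 1.0
```

A perfect match scores 1.0 (100 on the reported scale); partial overlaps score proportionally lower, which is why even a few BLEU points represent a meaningful quality gain.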
Google researchers used a hybrid model architecture to advance both high- and low-resource language translation, replacing the original RNN-based GNMT system with a transformer encoder paired with an RNN decoder, implemented in Lingvo, a TensorFlow framework. Results show that the final hybrid model achieves a high level of performance with better stability and lower latency.
To improve web crawling, the researchers replaced the previous data collection system with a new data miner that emphasizes precision over recall, enabling higher-quality training data to be collected from the public web. They also changed the miner from a dictionary-based model to an embedding-based model, which increased the amount of collected data by an average of 29 percent without reducing precision.
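The core idea behind embedding-based mining is that a multilingual encoder maps a sentence and its translation to nearby vectors, so candidate sentence pairs can be kept or discarded by similarity score. The sketch below uses toy hand-written vectors and a made-up threshold purely for illustration; in a real system the embeddings come from a trained multilingual model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "sentence embeddings": in practice a multilingual encoder maps
# translations of the same sentence close together in vector space.
source_sents = {"en_hello": [0.9, 0.1, 0.0], "en_weather": [0.1, 0.8, 0.3]}
target_sents = {"es_hola": [0.88, 0.12, 0.05], "es_tiempo": [0.15, 0.75, 0.35]}

THRESHOLD = 0.95  # precision-oriented: keep only very close pairs
pairs = [(s, t) for s, sv in source_sents.items()
                for t, tv in target_sents.items()
                if cosine(sv, tv) >= THRESHOLD]
print(pairs)  # matching pairs only: hello↔hola, weather↔tiempo
```

A high threshold trades recall for precision, which matches the article's point that the new miner prioritizes clean pairs over sheer volume.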
For low-resource languages, the researchers use back translation and M4 modelling. Back translation augments parallel training data with synthetic parallel data. By integrating back translation into Google Translate, the researchers were able to train models on far more monolingual text for low-resource languages, which significantly improved the fluency of the model output.
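Concretely, back translation runs a reverse-direction (target-to-source) model over plentiful monolingual target-language text to manufacture synthetic source sentences, then mixes the resulting pairs into the genuine parallel data. The sketch below stands in for that pipeline; the reverse model is a hypothetical stub and the sentences are toy placeholders:

```python
# Back translation: turn monolingual target-language text into synthetic
# (source, target) training pairs using a reverse-direction model.
# In practice reverse_translate is a trained target->source NMT system;
# here it is a lookup stub for illustration only.
def reverse_translate(target_sentence):
    stub = {"bonjour le monde": "hello world"}
    return stub.get(target_sentence, "<unk>")

real_parallel = [("good morning", "bonjour")]   # scarce genuine pairs
monolingual_target = ["bonjour le monde"]       # abundant target-side text

synthetic_pairs = [(reverse_translate(t), t) for t in monolingual_target]
training_data = real_parallel + synthetic_pairs  # mix real and synthetic
print(training_data)
```

Because the target side of each synthetic pair is genuine human text, the forward model learns to produce fluent output even when authentic parallel data is scarce.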
M4 modelling is also particularly useful for languages with scarce resources. It uses a single large model to translate between all languages and English, enabling large-scale transfer learning: low-resource languages are co-trained alongside many other languages, providing useful signal for the model.
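In multilingual NMT of this kind, a single model serves every language pair, and a target-language token prepended to the source sentence tells the model which language to produce. A minimal sketch of how training examples might be prepared, with an illustrative tag format:

```python
# One shared model for all language pairs: a target-language tag on the
# source side selects the output language. The "<2xx>" tag format here
# follows a common multilingual-NMT convention but is illustrative.
def make_example(src_text, tgt_text, tgt_lang):
    return (f"<2{tgt_lang}> {src_text}", tgt_text)

batch = [
    make_example("How are you?", "¿Cómo estás?", "es"),
    make_example("How are you?", "Wie geht es dir?", "de"),
    # a low-resource pair co-trained in the same batch benefits from
    # transfer off the higher-resource examples above
    make_example("How are you?", "Kei te pēhea koe?", "mi"),
]
print(batch[0][0])  # "<2es> How are you?"
```

Because all pairs share one set of parameters, patterns learned from data-rich languages transfer to the low-resource ones trained alongside them.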
Incorporating these new advances, Google’s latest model scores an average five-point BLEU increase over the previous GNMT model, with the group of 50 low-resource languages achieving an average seven-point boost.
The researchers note that there is still much room for improvement in automatic translation quality for languages with limited resources, as models still struggle with typical machine translation shortcomings. With these updates, however, the new system delivers relatively consistent performance even when resources are scarce. Given meaningless input, the previous model would produce nonsensical “translations”: feeding in the Telugu characters “ష ష ష ష ష ష ష ష ష ష ష ష ష ష ష,” for example, caused the old model to output “Shenzhen Shenzhen Shaw International Airport (SSH),” while the new model instead learns to produce the more reasonable “Sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh.”
For more details on the recent advances in Google Translate, check out the official blog.
Author: Herin Zhao | Editor: Michael Sarazen; Yuan Yuan