Established in 2006, the WMT Conference on Machine Translation has grown into one of the world's leading machine translation competitions, attracting top international enterprises, universities and research institutions such as Microsoft, Facebook, Tencent, ByteDance and Baidu.
The WeChat AI team’s neural machine translation (NMT) entry in the WMT21 shared news translation task achieved the highest BLEU (bilingual evaluation understudy) scores among all submissions for English→Chinese, English→Japanese and Japanese→English, while its performance on the English→German task bettered all constrained submissions.
Last week, WeChat AI and Beijing Jiaotong University system developers released the paper WeChat Neural Machine Translation Systems for WMT21, revealing the architecture behind their novel NMT system as well as the strategies they adopted to achieve such impressive performance in the WMT21 competition.
The team first exploited several novel transformer variants to improve their model performance and diversity. To alleviate the gradient vanishing problem of the post-norm transformer, they chose a pre-norm transformer as their baseline model. They also used the fast and straightforward average attention network (AAN) to replace the self-attention module in the decoder without any loss in performance.
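The post-norm/pre-norm distinction comes down to where layer normalization sits relative to the residual connection. A minimal numpy sketch (the sublayer here is a simplified stand-in, not the team's actual implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize the last dimension to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def post_norm_block(x, sublayer):
    # Original Transformer ordering: normalize AFTER the residual addition,
    # so gradients must pass through every layer norm on the way down.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm ordering: normalize the sublayer's input and add the
    # residual directly, leaving an identity path for gradients.
    return x + sublayer(layer_norm(x))
```

The identity residual path in the pre-norm block is what mitigates vanishing gradients in deep stacks.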
The researchers combined AAN with multi-head attention to produce three novel variants that deliver stronger and more diverse performance: an Average First Transformer, an Average Bottom Transformer and a Dual Attention Transformer. They also adopted a talking-heads attention mechanism in the encoders and decoders to improve information interaction between the attention heads.
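At its core, average attention replaces the decoder's learned self-attention weights with a uniform cumulative average over all previous positions, which is why it is cheap at decoding time. A minimal sketch (the gating and feed-forward components of the full AAN are omitted):

```python
import numpy as np

def average_attention(x):
    # x: (seq_len, d_model). Position t attends uniformly to positions
    # 1..t, i.e. the output is a running (cumulative) average, so no
    # attention weights need to be computed or stored.
    cumsum = np.cumsum(x, axis=0)
    counts = np.arange(1, x.shape[0] + 1)[:, None]
    return cumsum / counts
```

Because each step only updates a running sum, decoding is O(1) per position rather than O(t) as in standard self-attention.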
For synthetic data generation, the team employed a large-scale back-translation method to utilize the target-side monolingual data, sequence-level knowledge distillation to leverage the source side of the bilingual data, forward-translation via ensemble models to obtain general-domain synthetic data, and iterative in-domain knowledge transfer to generate in-domain data. Finally, they applied data augmentation methods such as token-level noise injection and dynamic top-p sampling to improve model robustness.
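Top-p (nucleus) sampling, the basis of the dynamic variant mentioned above, keeps only the smallest set of tokens whose cumulative probability exceeds a threshold p before sampling. A minimal sketch (the schedule that makes p "dynamic" in the paper is not shown here):

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    # Nucleus (top-p) sampling: keep the smallest set of tokens whose
    # cumulative probability exceeds p, renormalize, then sample.
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # number of tokens kept
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))
```

Truncating the low-probability tail this way injects controlled diversity into synthetic translations without admitting implausible tokens.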
The team’s training strategies included applying confidence-aware scheduled sampling, target denoising, and a graduated label-smoothing method for in-domain finetuning.
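Graduated label smoothing can be sketched as standard label smoothing whose strength is chosen per token from the model's own confidence; the thresholds and smoothing values below are illustrative assumptions, not the team's exact settings:

```python
import numpy as np

def smoothed_targets(target_idx, vocab_size, eps):
    # Standard label smoothing: move eps of the probability mass from
    # the gold token to a uniform distribution over the other tokens.
    t = np.full(vocab_size, eps / (vocab_size - 1))
    t[target_idx] = 1.0 - eps
    return t

def graduated_eps(confidence, strong=0.3, weak=0.1):
    # Graduated variant (sketch): tokens the model already predicts with
    # high confidence receive stronger smoothing; low-confidence tokens
    # receive none. Thresholds here are illustrative assumptions.
    if confidence > 0.7:
        return strong
    if confidence > 0.3:
        return weak
    return 0.0
```

The intuition is to penalize overconfidence only where it actually occurs, which is useful when finetuning on small in-domain sets.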
For their model ensemble, the team combined Self-BLEU and validation-set BLEU to derive a Boosted Self-BLEU-based Ensemble (BSBE) algorithm. They then applied a greedy search strategy over the Self-BLEU scores between candidate models, selecting high-potential ensemble members that balance performance (BLEU on the validation set) against diversity (Self-BLEU against the other models).
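The greedy selection can be sketched as follows; the simple quality-minus-similarity score below is an illustrative assumption, as the paper defines its own boosted combination of the two signals:

```python
import numpy as np

def greedy_ensemble_select(valid_bleu, self_bleu, k):
    # valid_bleu: (n,) BLEU of each candidate on the validation set.
    # self_bleu:  (n, n) Self-BLEU between candidates' outputs
    #             (higher = more similar, hence less diverse).
    # Greedy sketch: start from the best single model, then repeatedly
    # add the candidate with the best trade-off of quality minus mean
    # similarity to the models already chosen.
    n = len(valid_bleu)
    chosen = [int(np.argmax(valid_bleu))]
    while len(chosen) < k:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in chosen:
                continue
            similarity = np.mean([self_bleu[i][j] for j in chosen])
            score = valid_bleu[i] - similarity  # illustrative weighting
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```

Scoring only the candidates against the current ensemble, rather than every possible subset, is what lets such a greedy search approach brute-force quality at a fraction of the cost.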
The proposed constrained system achieved the highest case-sensitive BLEU scores on English→Chinese (36.9), English→Japanese (46.9), Japanese→English (27.8) and English→German (31.3). The BSBE search algorithm also achieved the same BLEU score as a brute force search, with search-time savings of around 95 percent.
The paper WeChat Neural Machine Translation Systems for WMT21 is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang