## Introduction

In this paper, the authors aim to handle rare and unknown words in the bilingual training data for machine translation. Their solution is to augment the training data with translation lexicons that are rarely or never seen there, so as to guarantee the repeated occurrence of the rare words during training. As a result, the bilingual dictionaries are transformed into adequate sentence pairs, from which the Neural Machine Translation (NMT) system can learn the latent bilingual mappings. More concretely, one method uses a mixed character/word model to alleviate the out-of-vocabulary (OOV) problem, and the other synthesizes pseudo-parallel sentences to guarantee the repeated occurrence of the rare words. The authors claim that the proposed model provides a significant improvement in translation quality and that most rare words obtain correct translations.

## Stacked NMT Math Recap

The figure above shows the experimental NMT architecture used in the paper. On the encoder side, the context vectors C = (h_1^m, h_2^m, …, h_{T_x}^m) are generated by m stacked LSTM layers, where h_j^k denotes the j-th hidden state at the k-th layer and is calculated as:

If k = 1, then h_j^1 is the feature embedding of x_j.
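The stacked-LSTM recurrence described above can be sketched as follows; this is a reconstruction assuming standard stacked LSTMs, not necessarily the paper's exact notation:

```latex
h_j^{k} = \mathrm{LSTM}\left(h_{j-1}^{k},\; h_j^{k-1}\right), \qquad k > 1,
```

with the k = 1 case grounded in the word embedding of x_j as stated above.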

On the decoder side, the objective is:
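Assuming the usual autoregressive NMT objective, the decoder factorizes the output probability over target positions:

```latex
p\left(y \mid x\right) = \prod_{i=1}^{T_y} p\left(y_i \mid y_{<i},\, C\right)
```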

The conditional probability given the previous target words and the encoder context vectors C can be reformulated as p(y_i | y_{<i}, c_i), where c_i is the context vector at decoder step i. The attention output \hat{z}_i can be calculated as:
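Assuming Luong-style attention with a top-layer decoder state z_i^m, one standard form of the attention output and the resulting word distribution is:

```latex
\hat{z}_i = \tanh\left(W_c \left[z_i^{m};\, c_i\right]\right), \qquad
p\left(y_i \mid y_{<i},\, c_i\right) = \mathrm{softmax}\left(W_s\, \hat{z}_i\right)
```

Here W_c and W_s are learnable projection matrices (the symbol names are an assumption for this sketch).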

where c_i is the weighted sum of the source-side context vectors:
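In the standard attention formulation, this weighted sum runs over the top-layer encoder states:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j^{m}
```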

where alpha_{ij} is a normalized alignment weight between the encoder hidden state h_j^m and the decoder hidden state z_i^m:
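The weights are typically obtained by a softmax over alignment scores; a standard form (with `score` denoting the chosen alignment function, e.g. dot product or a bilinear form) is:

```latex
\alpha_{ij} = \frac{\exp\left(\mathrm{score}\left(z_i^{m},\, h_j^{m}\right)\right)}
{\sum_{j'=1}^{T_x} \exp\left(\mathrm{score}\left(z_i^{m},\, h_{j'}^{m}\right)\right)}
```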

and z_i^k can be computed from z_{i-1}^k and z_i^{k-1}:

When k = 1, z_i^1 is the decoder hidden state at the first decoding layer, which can be formulated as:
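A sketch of the decoder recurrence, assuming input feeding of the previous attention output at the first layer (a common design, not confirmed by the review text):

```latex
z_i^{k} = \mathrm{LSTM}\left(z_{i-1}^{k},\; z_i^{k-1}\right), \quad k > 1;
\qquad
z_i^{1} = \mathrm{LSTM}\left(z_{i-1}^{1},\; \left[e(y_{i-1});\, \hat{z}_{i-1}\right]\right)
```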

The overall stacked LSTM objective is to maximize the log-likelihood given the previously generated target sequences, source sentences and learnable parameters theta:
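Written out, this is the standard maximum-likelihood objective over the parallel corpus:

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(x,\, y)} \sum_{i=1}^{T_y}
\log p\left(y_i \mid y_{<i},\, x;\; \theta\right)
```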

## The Pseudo-Mixed Model

The figure above shows the combined model the authors propose in this paper; the goal is to make full use of the bilingual dictionaries, especially the entries covering rare or OOV words. The left part of the figure is the mixed word/character model, and the right part is the pseudo-data synthesis model.

For the mixed word/character model, the authors adopt the idea from [1] and re-label the rare words in the training data and the bilingual dictionary as character sequences. **For example, if oak is a rare word, it is re-labelled as:**

`<B>` is the beginning token, `<M>` is the middle token and `<E>` is the end token. In this way, rare words are split into characters, which occur far more frequently in the data, making their representations easier to learn.
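A minimal sketch of this re-labelling scheme; the tokens follow the description above, while the handling of one- and two-character words is an assumption for illustration:

```python
def relabel(word):
    """Split a rare word into characters tagged with positional tokens:
    <B> marks the beginning character, <M> the middle ones, <E> the end."""
    if len(word) == 1:
        return "<B>" + word  # assumed convention for single-character words
    chars = list(word)
    tagged = ["<B>" + chars[0]]
    tagged += ["<M>" + c for c in chars[1:-1]]
    tagged.append("<E>" + chars[-1])
    return " ".join(tagged)

print(relabel("oak"))  # -> <B>o <M>a <E>k
```

Because the positional tags keep the character sequence unambiguous, the original word can be deterministically recovered after translation.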

For the pseudo-data synthesis model, the authors propose the following algorithm:

The procedure is straightforward:

- Train an SMT system on the bilingual data and the bilingual dictionary
- Select the rare translation lexicons in D_{ic}
- Retrieve the top-K monolingual sentences containing each selected source-side entry D_{ic_x}
- Translate these monolingual sentences conditioned on the translation rule D_{ic_x} → D_{ic_y}
- Add the resulting pseudo pairs to the training data
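The loop above can be sketched as follows. Both the retrieval step (a simple containment match here) and the SMT decoder (the `translate_with_rule` callback) are hypothetical stand-ins for the paper's components:

```python
def synthesize_pairs(rare_lexicons, monolingual, translate_with_rule, k=5):
    """For each rare lexicon (src_word, tgt_word), retrieve up to k
    monolingual sentences containing src_word, then translate them while
    forcing the rule src_word -> tgt_word, yielding pseudo-parallel pairs."""
    pseudo_pairs = []
    for src_word, tgt_word in rare_lexicons:
        # Retrieval: token containment stands in for the paper's
        # retrieval component (an assumption for illustration).
        hits = [s for s in monolingual if src_word in s.split()][:k]
        for sentence in hits:
            target = translate_with_rule(sentence, src_word, tgt_word)
            pseudo_pairs.append((sentence, target))
    return pseudo_pairs

# Toy usage with a trivial word-substitution "translator":
def toy_translate(sentence, src, tgt):
    return " ".join(tgt if w == src else w for w in sentence.split())

mono = ["the old oak fell", "a tall tree"]
pairs = synthesize_pairs([("oak", "EICHE")], mono, toy_translate, k=1)
print(pairs)  # -> [('the old oak fell', 'the old EICHE fell')]
```

The key design point is that every synthesized pair is guaranteed to contain the rare lexicon, so its occurrence count in the augmented training data grows by up to K per entry.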

In this way, the rare or unknown translation pairs (D_{ic_x}, D_{ic_y}) become abundant after the pseudo-synthesized pairs are added.

## Experiments

The experiments are performed on the Chinese-to-English (ZH-EN) task; NIST03 is used as the development set, and NIST04-08 are used as test sets. The results are shown in the following table:

The authors design three experiments:

- To see if mixed word/character model can make an improvement
- To see if pseudo synthesis model can make an improvement
- To see if the combination of both can make an improvement

As shown in the table above, the mixed model alone achieves 34.19 BLEU, compared to the RNN baseline (32.98) and PBMT (28.55). When the bilingual dictionary is added as training data, it improves further, to a moderate 34.70 BLEU.

The data synthesis model also brings improvements. As shown in lines 5-12 of the table above, we can observe the following:

- Synthesized pairs alone bootstrap model performance (e.g. 34.50 in line 5)
- Synthesized pairs plus translation lexicons (the bilingual dictionary) improve results further (e.g. 35.53 in line 6)
- The larger the retrieval number K, the better the results, though the gains diminish (e.g. 36.39 in line 12)

Finally, the combination of both models got the best result, with 37.60 BLEU.

## Conclusion

In this paper, the authors concentrate on translation lexicons that are rarely or never seen in the bilingual training data. They provide two methods focused on data transformation, producing massive, repeated occurrences of the rare translation lexicons that help the softmax layer learn a more balanced probability distribution. Extensive experiments show that their method improves translation accuracy on the ZH-EN task with the NIST datasets.

## Reference

[1] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558, 2016. URL http://arxiv.org/abs/1611.04558.

**Author**: Shawn Yan | **Editor**: Haojin Yang | **Localized by Synced Global Team**: Xiang Chen
