Neural Machine Translation (NMT) has gained considerable momentum in recent years. It has substantially improved on traditional statistical machine translation, and achieved state-of-the-art performance on translation tasks across many language pairs.
However, NMT suffers from both over-translation and under-translation: it sometimes translates the same words repeatedly, and sometimes omits words entirely. This is because NMT models are often treated as a black box; we do not know exactly how they convert source sentences into target sentences.
To address this problem, Tu et al. (2017) proposed an 'encoder-decoder-reconstructor' framework for NMT, which uses back-translation to improve translation accuracy. The paper reviewed here implements this framework for the English-Japanese translation task.
The paper also points out that the framework cannot achieve satisfactory performance unless the forward translation model is first trained like a conventional attention-based NMT model, a procedure known as pre-training.
Traditional Attention-based NMT model
The traditional attention-based NMT model proposed by Bahdanau et al. (2015) is shown below.
The encoder converts the source sentence into a fixed-length vector C as a context vector. A bidirectional RNN is used, and at each time step t the hidden state h_t of the encoder is the concatenation of a forward and a backward state:

$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$

where the forward state and the backward state are computed respectively as:

$\overrightarrow{h}_t = r(x_t, \overrightarrow{h}_{t-1})$ and $\overleftarrow{h}_t = r'(x_t, \overleftarrow{h}_{t+1})$

r and r' are both nonlinear functions. Then the context vector C becomes:

$C = q(\{h_1, \dots, h_T\})$

where q is also a nonlinear function.
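The bidirectional encoder above can be sketched in a few lines of NumPy. This is a minimal illustration only, assuming a simple tanh (Elman) RNN cell standing in for r and r', with randomly initialized weights rather than trained ones:

```python
import numpy as np

def rnn_step(W, U, x, h_prev):
    # One step of a simple tanh (Elman) RNN cell, standing in for r / r'.
    return np.tanh(W @ x + U @ h_prev)

def bidirectional_encode(xs, Wf, Uf, Wb, Ub, hidden_size):
    # Forward pass over the source embeddings x_1 ... x_T.
    h, fwd = np.zeros(hidden_size), []
    for x in xs:
        h = rnn_step(Wf, Uf, x, h)
        fwd.append(h)
    # Backward pass over the reversed sequence.
    h, bwd = np.zeros(hidden_size), []
    for x in reversed(xs):
        h = rnn_step(Wb, Ub, x, h)
        bwd.append(h)
    bwd.reverse()
    # h_t is the concatenation of the forward and backward states.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
emb, hid, T = 4, 3, 5
xs = [rng.standard_normal(emb) for _ in range(T)]
Wf, Wb = rng.standard_normal((hid, emb)), rng.standard_normal((hid, emb))
Uf, Ub = rng.standard_normal((hid, hid)), rng.standard_normal((hid, hid))
hs = bidirectional_encode(xs, Wf, Uf, Wb, Ub, hid)
print(len(hs), hs[0].shape)
```

Each h_t has twice the hidden size, since it concatenates the two directions.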
In a classical encoder-decoder model, the context vector C calculated by the encoder is directly 'decoded' into the target sentence by the decoder. But since the decoder has to work from this single vector, information from early in the sentence can be overwritten by information processed later. Hence, the longer the source sentence, the more likely the model is to lose important information. This is why the attention mechanism is introduced: at each step it focuses on a certain part of the encoder states, to guarantee sufficient information.
At each time step i, the conditional probability of the output word can be computed as:

$p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$

where s_i is the hidden state of the decoder, computed by:

$s_i = f(s_{i-1}, y_{i-1}, c_i)$

From this equation, we can see that the hidden state s_i at time step i is calculated from the hidden state and the target word at the previous time step i-1, together with a context vector c_i.

Unlike the long fixed-length vector C mentioned above, the context vector c_i is a weighted sum of the hidden states h_j of the encoder, computed by:

$c_i = \sum_j \alpha_{ij} h_j, \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$

where the alignment score $e_{ij} = a(s_{i-1}, h_j)$ is produced by an 'alignment model' that scores how well the inputs around position j match the output at position i, and α_i can be understood as an 'attention allocation' vector.
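The attention computation can be made concrete with a small NumPy sketch. Here the alignment model a is assumed to be the usual single-layer MLP form with parameters Wa, Ua, va, which are illustrative placeholders rather than trained weights:

```python
import numpy as np

def attention_context(s_prev, hs, Wa, Ua, va):
    # Alignment scores e_ij = va . tanh(Wa s_{i-1} + Ua h_j).
    e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in hs])
    # Softmax over the scores gives the attention weights alpha_ij.
    exp_e = np.exp(e - e.max())
    alpha = exp_e / exp_e.sum()
    # Context vector c_i: the alpha-weighted sum of encoder states.
    c = np.sum(alpha[:, None] * np.stack(hs), axis=0)
    return c, alpha

rng = np.random.default_rng(0)
enc, dec, att, T = 6, 3, 4, 5
hs = [rng.standard_normal(enc) for _ in range(T)]   # encoder states h_j
s_prev = rng.standard_normal(dec)                   # decoder state s_{i-1}
Wa = rng.standard_normal((att, dec))
Ua = rng.standard_normal((att, enc))
va = rng.standard_normal(att)
c, alpha = attention_context(s_prev, hs, Wa, Ua, va)
print(c.shape, round(float(alpha.sum()), 6))
```

The weights α always sum to one, so c_i stays in the same space as the encoder states while emphasizing the source positions most relevant to step i.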
Finally, the objective function is defined by:

$J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log P(y^{(n)} \mid x^{(n)}; \theta)$

where N is the number of training sentence pairs, and θ denotes the model parameters.
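In code, this objective is just the average of the sentence-level log-likelihoods. A tiny sketch with hypothetical log-probability values:

```python
import numpy as np

def mle_objective(sentence_logprobs):
    # J(theta) = (1/N) * sum_n log P(y^(n) | x^(n); theta)
    return float(np.mean(sentence_logprobs))

# Three hypothetical sentence-level log-likelihoods (N = 3).
print(mle_objective([-2.0, -4.0, -3.0]))  # -3.0
```

Training maximizes this quantity (equivalently, minimizes its negation as a loss).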
Encoder-decoder Reconstructor Framework
The encoder-decoder-reconstructor framework for NMT proposed by Tu et al. (2017) adds a new 'reconstructor' structure to the original NMT model. The reconstructor translates the hidden states of the decoder back into the source sentence; comparing this reconstruction with the original input provides a signal for improving translation accuracy. The new structure is described as follows:
At each time step j, the conditional probability of the output 'source word' is computed as:

$p(x_j \mid x_1, \dots, x_{j-1}, s) = g'(x_{j-1}, s'_j, c'_j)$

The hidden state s'_j is computed in a similar way to the previous decoding process:

$s'_j = f'(s'_{j-1}, x_{j-1}, c'_j)$

Note that c'_j, here called the 'inverse context vector', is computed as:

$c'_j = \sum_i \alpha'_{ji} s_i$

where s_i is each hidden state of the decoder (in the forward translation). Similarly, α' is calculated by:

$\alpha'_{ji} = \frac{\exp(e'_{ji})}{\sum_k \exp(e'_{jk})}, \quad e'_{ji} = a'(s'_{j-1}, s_i)$

The objective function is defined by:

$J(\theta, \theta') = \frac{1}{N} \sum_{n=1}^{N} \left[ \log P(y^{(n)} \mid x^{(n)}; \theta) + \lambda \log P(x^{(n)} \mid s^{(n)}; \theta, \theta') \right]$
Note that this objective function contains two parts: the forward translation term and the back-translation term. The hyperparameter λ specifies the weight between forward translation and back-translation.
According to this paper, the forward translation part measures translation fluency, and backward measures translation adequacy. In this manner, the new structure is able to enhance overall translation quality.
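The two-part objective can be sketched as follows, again with hypothetical sentence-level log-probabilities rather than real model outputs:

```python
import numpy as np

def reconstructor_objective(fwd_logprobs, rec_logprobs, lam=1.0):
    # Forward term log P(y|x; theta) rewards fluent target output;
    # reconstruction term log P(x|s; theta, theta') rewards adequacy,
    # weighted by the hyperparameter lambda.
    fwd = np.asarray(fwd_logprobs, dtype=float)
    rec = np.asarray(rec_logprobs, dtype=float)
    return float(np.mean(fwd + lam * rec))

# Two hypothetical sentence pairs; lambda = 1 as in the experiments.
print(reconstructor_objective([-2.0, -4.0], [-1.0, -3.0]))  # -5.0
```

With λ = 0 this reduces to the plain attention-based NMT objective; larger λ pushes the model to keep enough source information in its decoder states to reconstruct the input.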
The paper uses two English-Japanese parallel corpora: the Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al., 2016) and the NTCIR PatentMT Parallel Corpus (Goto et al., 2013).
The RNN model used in the experiments has 512 hidden units, 512 embedding units, a vocabulary size of 30,000 and a batch size of 64, and is trained on a GeForce GTX TITAN X GPU.
The normal attention-based NMT is used as a baseline NMT model.
Note that the hyperparameter λ is set to 1 in the experiments.
Some examples from the English-Japanese translation tasks are shown below. Note that 'jointly-training' refers to the encoder-decoder-reconstructor without pre-training.
Tables 2 and 3 show the translation accuracy in BLEU scores, the p-value of the significance test by bootstrap resampling (Koehn, 2004) and training time in hours until convergence.
The results show that the new encoder-decoder-reconstructor framework takes longer to train than the baseline NMT, but significantly improves translation accuracy, by 1.01 BLEU points on ASPEC and 1.37 points on NTCIR in English-Japanese translation. However, it does not yield such an improvement in the Japanese-English translation task. Moreover, the jointly trained model performs even worse than the baseline model.
Furthermore, the paper tests whether the new model better addresses the over-translation and under-translation problems mentioned above. For example, Figure 3 shows that the baseline model failed to output '乱流と粘性の数値的粘性の関係を基に', while the proposed model translated it successfully. Figure 4 shows that '新生児' and '三十歳以上' were translated repeatedly by the baseline model, while the proposed model performed better.
Figure 3: The attention layer in Example 1 : Improvement in under-translation
Figure 4: The attention layer in Example 2 : Improvement in over-translation.
In this paper, the newly proposed encoder-decoder-reconstructor framework is analyzed on English-Japanese translation tasks. The authors show that the encoder-decoder-reconstructor offers a significant improvement in BLEU scores and alleviates the problems of repeated and missing words in English-Japanese translation. In addition, they evaluate the importance of pre-training by comparing the framework with a jointly trained model of forward translation and back-translation.
Back translation has always served as a useful method for translation studies, or for human translators to check whether they have accurately translated or not. The use of this traditional translation method in machine translation tasks is quite an amazing idea.
In the future, a closer combination of linguistic knowledge with Natural Language Processing may become a new way of thinking for improving the performance of language processing tasks such as machine translation, especially for languages like Japanese, which features many 'grammar templates' ('文法', Japanese for 'grammar').
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of the 3rd International Conference on Learning Representations (ICLR), pages 1–15.
Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural Machine Translation with Reconstruction. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 3097–3103.
Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 388–395.
Source paper: https://arxiv.org/pdf/1706.08198.pdf
Paper authors: Yukio Matsumura, Takayuki Sato, Mamoru Komachi
Tokyo Metropolitan University Tokyo, Japan
Author: Kejin Jin | Editor: Joni Chung | Localized by Synced Global Team: Xiang Chen