Exploiting Source-side Monolingual Data in Neural Machine Translation


Recently, NMT (Neural Machine Translation) has become the state of the art in machine translation, built on its “encoder-decoder with attention” architecture. Instead of exploring target-side monolingual data for NMT, the author of this paper proposes two novel methods for exploiting source-side monolingual data. The first method, called “self-learning”, generates target sentences for the source-side monolingual data and then combines these synthetic pairs with the bilingual data for training. The second method is multi-task learning, where one task learns to reorder source sentences into target-language word order, and the other task is standard source-to-target sequence-to-sequence translation. Extensive experiments show that using source-side monolingual data can significantly improve translation quality on the NIST Chinese-to-English (zh-en) translation task.

NMT Recap

image (1).png

The figure above illustrates standard encoder-decoder NMT with attention, where X = (x_1, x_2, ..., x_{Tx}) are the word embeddings of the source sentence; h = (h_1, h_2, ..., h_{Tx}) are the encoder hidden states; Y = (y_1, y_2, ..., y_i) are the predicted words, each chosen by maximizing the probability p(y_i | y_{<i}, c_i); c_i is the context vector, which encodes the connection between the i-th decoder hidden state and the source sentence; and y_{<i} are the previously predicted words.

With a bi-directional encoder, there are two sets of hidden states, namely h_forward and h_backward, where h_forward_j = RNN(h_forward_{j-1}, x_j) reads the sentence left to right and h_backward_j = RNN(h_backward_{j+1}, x_j) reads it right to left. Combining both directions lets every encoder state summarize the words before and after position j, so that a longer context is available throughout the translation.
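To make the two reading directions concrete, here is a minimal sketch in plain Python. It is not the paper's encoder: the scalar `rnn_step` cell, its weights `w_h` and `w_x`, and the zero initial state are all assumptions chosen to keep the example tiny.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0):
    """Toy scalar RNN cell (illustrative weights): h = tanh(w_h * h_prev + w_x * x)."""
    return math.tanh(w_h * h_prev + w_x * x)

def bidirectional_encode(xs):
    """Return one (forward, backward) state pair per source position."""
    forward, h = [], 0.0
    for x in xs:                 # left to right: state j has seen x_1..x_j
        h = rnn_step(h, x)
        forward.append(h)
    backward, h = [], 0.0
    for x in reversed(xs):       # right to left: state j has seen x_j..x_Tx
        h = rnn_step(h, x)
        backward.append(h)
    backward.reverse()           # realign so backward[j] belongs to position j
    return list(zip(forward, backward))

states = bidirectional_encode([0.1, -0.3, 0.7])
for j, (h_fwd, h_bwd) in enumerate(states, start=1):
    print(f"position {j}: forward={h_fwd:+.3f} backward={h_bwd:+.3f}")
```

Each pair plays the role of the concatenated encoder state for one position: the forward half depends only on the words up to that position, the backward half only on the words from that position onward.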

For a bi-directional encoder, h_j is generally the concatenation of h_forward_j and h_backward_j. Once h is computed, the context vector c_i at each decoder time step is c_i = sum_j (alpha_ij * h_j), where alpha_ij is the weighting factor for encoder state h_j at decoder step i. It is derived from the score <W_z * z_i, W_h * h_j>, normalized over j (typically with a softmax) so that the weights sum to 1.
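The context vector computation can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the projection matrices W_z and W_h are taken to be the identity, the scores are normalized with a softmax (the usual choice in attention models), and the helper names `dot`, `softmax` and `context_vector` are mine.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(z_i, encoder_states):
    """Return (weights, c_i) where c_i = sum_j weight_j * h_j.

    Assumption: W_z and W_h are identity, so the score is just <z_i, h_j>.
    """
    scores = [dot(z_i, h_j) for h_j in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    c_i = [sum(a * h[k] for a, h in zip(alphas, encoder_states)) for k in range(dim)]
    return alphas, c_i

alphas, c_i = context_vector([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
print("weights:", [round(a, 3) for a in alphas])
print("context:", [round(v, 3) for v in c_i])
```

The encoder state that best matches the decoder state receives the largest weight, so the context vector leans toward the source positions most relevant to the word being generated.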

Then, the prediction conditional probability can be computed as:

image (2).png

where z_i is the i-th decoder hidden state; c_i is the i-th context vector (derived above); y_{i-1} is the previously predicted word; and g(.) is a function that can be modeled with a fully connected network. Predicting each word thus becomes a classification task over the vocabulary.
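As a rough illustration of word prediction as classification, the sketch below models g(.) as a single linear layer followed by a softmax over a toy vocabulary. The vocabulary, the weights and the one-dimensional inputs are all made up for the example; a real model would use learned, high-dimensional parameters.

```python
import math

VOCAB = ["the", "cat", "sat", "</s>"]   # hypothetical toy vocabulary

def g(y_prev_emb, z_i, c_i, weights, bias):
    """Toy g(.): linear layer over [y_{i-1}; z_i; c_i], then a softmax."""
    features = y_prev_emb + z_i + c_i    # list concatenation of the three inputs
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative parameters: one weight row per vocabulary entry.
weights = [[0.0, 0.0, 0.0],
           [1.0, 1.0, 1.0],
           [0.0, 0.0, 0.0],
           [-1.0, -1.0, -1.0]]
bias = [0.0, 0.0, 0.0, 0.0]

probs = g([1.0], [0.5], [0.2], weights, bias)
predicted = VOCAB[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

The output is a probability distribution over the vocabulary, and the predicted word is simply the entry with the highest probability.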

The overall objective can be written as:

image (3).png

Given the source sentence X, the previously predicted words y_{<i} and the parameters theta, the goal is to maximize the conditional log-likelihood of the sentence-aligned bilingual data.
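Written out with the symbols from the text, one common form of this objective is sketched below. N (the number of sentence pairs), T_y (the target sentence length) and the 1/N averaging are assumptions on my part and may differ from the paper's exact notation.

```latex
L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{T_y}
    \log p\left(y_i^{(n)} \mid y_{<i}^{(n)}, X^{(n)}, \theta\right)
```

Training searches for the theta that makes this sum of per-word log probabilities as large as possible.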


Self-learning

The first method proposed in this paper is the self-learning algorithm. The main idea is to use a pre-trained network to predict target sentences for the source-side monolingual data, and then combine the resulting pairs with the bilingual data to form a larger bilingual corpus for training. The author calls the data produced by self-learning “synthetic parallel data”.

Besides the sentence-aligned source and target corpora, we assume a large-scale source-side monolingual corpus that is related to the bitext. The overall goal is to generate more bilingual data from this source-side monolingual data. The pipeline is as follows:

  1. Establish a baseline with labelled data
  2. Use the baseline to predict labels for the unlabelled data
  3. Combine the newly labelled data with the original labelled data to make a larger labelled set
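The three steps can be sketched as a small function. This is a toy outline rather than the paper's implementation: `train` and `translate` are hypothetical callables, and the word-by-word lookup "model" below stands in for a real NMT system purely so the example runs.

```python
def self_learning(bitext, mono_source, train, translate):
    """bitext: list of (source, target) pairs; mono_source: list of source sentences.

    train and translate are hypothetical callables supplied by the caller.
    """
    baseline = train(bitext)                              # step 1: baseline on labelled data
    synthetic = [(src, translate(baseline, src))          # step 2: label the monolingual data
                 for src in mono_source]
    return bitext + synthetic                             # step 3: larger training corpus

# Toy stand-ins (assumptions for illustration only).
def toy_train(pairs):
    table = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            table.setdefault(s, t)
    return table

def toy_translate(model, src):
    return " ".join(model.get(word, "<unk>") for word in src.split())

bitext = [("wo ai ni", "I love you"), ("ta ai mao", "she loves cats")]
combined = self_learning(bitext, ["ta ai ni"], toy_train, toy_translate)
print(combined[-1])
```

The synthetic pair at the end of `combined` is only as good as the baseline that produced it, which is why the quality of the synthetic data matters.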

Note that the synthetic data may negatively affect decoder performance. One reason is that adding more unrelated monolingual data may decrease translation quality. Another is that the syntactic and semantic correctness of huge amounts of synthetic data cannot be guaranteed. So in practice, the author freezes the decoder parameters for synthetic data, and only the encoder parameters are updated during synthetic-data training. Moreover, the source-side monolingual data should share the same source-language vocabulary as the bilingual corpus, so that no new words are introduced.
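One way to picture the frozen decoder is an update rule that skips decoder parameters whenever the training pair is synthetic. The sketch below assumes plain SGD and a flat parameter dictionary whose keys start with "encoder" or "decoder"; both are my assumptions for illustration, not details from the paper.

```python
def sgd_step(params, grads, lr, is_synthetic):
    """Apply one SGD update; leave decoder parameters untouched for synthetic pairs."""
    updated = {}
    for name, value in params.items():
        frozen = is_synthetic and name.startswith("decoder")
        updated[name] = value if frozen else value - lr * grads[name]
    return updated

params = {"encoder.w": 1.0, "decoder.w": 2.0}   # toy scalar parameters
grads = {"encoder.w": 0.5, "decoder.w": 0.5}

after_synthetic = sgd_step(params, grads, lr=0.1, is_synthetic=True)
after_bilingual = sgd_step(params, grads, lr=0.1, is_synthetic=False)
print("synthetic pair:", after_synthetic)
print("bilingual pair:", after_bilingual)
```

With a synthetic pair only the encoder moves; with a genuine bilingual pair both encoder and decoder are updated.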

Multi-task learning

image (4).png

In multi-task learning, there is a shared encoder but two different decoders, one per task. The first task (top left in the figure above) is source sentence reordering, and the second (top right in the figure above) is translation. In the reordering task, the goal is to predict the reordered source sentence, whose word order should be close to that of the target language. This way, the encoder learns representations that reflect target-language word order, which makes the translation clearer and more fluent.

The overall objective of multi-task learning can be written as follows:

image (5).png

which is the sum of the log probabilities of machine translation and sentence reordering. The optimization alternates between the two tasks as follows:

  1. Optimize the encoder-decoder parameters on the reordering task for 1 epoch
  2. Use the learnt encoder parameters to initialize the encoder for the translation task
  3. Learn the encoder-decoder parameters on the translation task for 4 epochs
  4. Use the newly learnt encoder to initialize the encoder for the reordering task
  5. Repeat until convergence (a maximum number of iterations is reached or the parameters stop changing)
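The alternating schedule above can be sketched as a loop in which the encoder is handed back and forth between the two tasks. This is a simplified outline: `train_epoch` is a hypothetical callable, the convergence check is replaced by a fixed number of rounds, and the toy epoch below only counts how often each component is trained.

```python
def train_multitask(state, train_epoch, rounds=2, reorder_epochs=1, translate_epochs=4):
    """state holds 'encoder', 'reorder_decoder' and 'translate_decoder'.

    train_epoch(encoder, decoder, task) -> (encoder, decoder) is a hypothetical
    callable; a fixed number of rounds stands in for the convergence test.
    """
    for _ in range(rounds):
        for _ in range(reorder_epochs):        # steps 1-2: reordering, then hand over encoder
            state["encoder"], state["reorder_decoder"] = train_epoch(
                state["encoder"], state["reorder_decoder"], "reordering")
        for _ in range(translate_epochs):      # steps 3-4: translation, then hand back encoder
            state["encoder"], state["translate_decoder"] = train_epoch(
                state["encoder"], state["translate_decoder"], "translation")
    return state

log = []

def toy_epoch(encoder, decoder, task):
    log.append(task)
    return encoder + 1, decoder + 1            # count the epochs each component has seen

final = train_multitask(
    {"encoder": 0, "reorder_decoder": 0, "translate_decoder": 0}, toy_epoch, rounds=2)
print(final)
print(log)
```

Because the encoder is shared, it is trained in every epoch of both tasks, while each decoder only sees its own task.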


Experimental Results

The self-learning and multi-task learning methods are evaluated on Chinese-to-English translation. The author used a small dataset (0.63M sentence pairs and 6.5M monolingual sentences) and a large dataset (2.1M sentence pairs and 12M monolingual sentences). NIST 2003 (MT03) is used as the development set, and NIST 2004 (MT04), NIST 2005 (MT05) and NIST 2006 (MT06) are used as test sets. Chinese sentences are segmented with the Stanford Word Segmenter and preordered with a syntax-based reordering method (using the Berkeley parser).


image (6).png

As shown above, Moses is the state-of-the-art SMT system, and RNNSearch is the NMT baseline model. Rows 3-6 show the BLEU scores when applying the self-learning algorithm to the source-side monolingual data. Most of the self-learning configurations outperform the NMT baseline. The best performance is obtained when the top 50% of the monolingual data is used for self-learning. The largest improvement is 4.05 BLEU points (32.43 vs. 28.38 on MT03).

Rows 7-10 use the multi-task learning framework to incorporate source-side monolingual data. The translation quality is again better than the baseline NMT. For example, RNNSearch-Mono-MTL using the top 50% of the monolingual data improves by up to 5.0 BLEU points over the RNNSearch baseline (33.38 vs. 28.38 on MT03). It also performs significantly better than the state-of-the-art phrase-based SMT system Moses. Note that RNNSearch-Mono-MTL is slightly better than the RNNSearch-Mono-SL approach, but it takes more time to train (due to the iterative optimization approach).
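For readers who want to check the quoted gains, the arithmetic is reproduced below using only the MT03 scores mentioned in the text. The variable names are mine.

```python
baseline_mt03 = 28.38        # RNNSearch baseline, as quoted above
self_learning_mt03 = 32.43   # best self-learning score quoted above
multi_task_mt03 = 33.38      # RNNSearch-Mono-MTL score quoted above

sl_gain = round(self_learning_mt03 - baseline_mt03, 2)
mtl_gain = round(multi_task_mt03 - baseline_mt03, 2)
print(f"self-learning gain: {sl_gain:.2f} BLEU")
print(f"multi-task gain:    {mtl_gain:.2f} BLEU")
```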


image (7).png

As shown in the table above, translation results improve with the larger dataset, and the closely related source-side monolingual data (top 50%) can still boost translation quality (the second row). The gains are smaller than on the small dataset; as the author explains, this may be because the large-scale training data has already stabilized the encoder-decoder parameters.

Reviewer’s Thoughts

In this paper, the author adopted a self-learning approach and a multi-task learning framework to make better use of source-side monolingual data. I think it is also necessary to investigate the use of both source-side and target-side monolingual data for further improvement. For example, is there a more effective way to make full use of target-side monolingual data than back-translation? We might be able to integrate a language model trained on target-side monolingual data into NMT training. Furthermore, besides monolingual data, we could also extend the bilingual vocabulary conditioned on these monolingual data to enable low-shot or zero-shot translation.



Author: Shawn Yan | Reviewer: Haojin Yang
