Paper author: Minh-Thang Luong and Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305
The paper proposes a hybrid neural machine translation model that operates on both words and characters, whose main contribution is the translation of rare and unknown words. The authors use long short-term memory (LSTM) networks combined with a global attention mechanism to achieve high training efficiency and translation quality.
Neural Machine Translation
Compared with earlier statistical machine translation approaches, Neural Machine Translation (NMT) has made great improvements by using recurrent neural networks (RNNs), an architecture well suited to modeling sequential data and to directly computing the probability of a target sentence conditioned on a source sentence, p(y|x). NMT achieves state-of-the-art translation results by adopting a framework called encoder-decoder. The basic idea is quite simple: a first RNN, the encoder, runs over the source sentence and ‘encodes’ it as a vector of real numbers. A second RNN, the decoder, then uses this vectorized source sentence to produce the corresponding target sentence; the model is trained by minimizing the negative log-likelihood of the target sentences.
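The training objective above can be made concrete with a tiny sketch: the negative log-likelihood of a target sentence is just the sum of the negative log-probabilities the decoder assigns to each target token. The per-token probabilities below are hypothetical, standing in for the output of an actual decoder.

```python
import math

def sentence_nll(token_probs):
    """Negative log-likelihood of a target sentence, given the model's
    probability for each target token (hypothetical decoder outputs)."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities p(y_t | y_<t, x):
probs = [0.9, 0.5, 0.8]
loss = sentence_nll(probs)  # training minimizes this, summed over the corpus
```

Confident predictions (probabilities near one) drive the loss toward zero, while low-probability target tokens are penalized heavily.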
However, NMT is most commonly implemented at the word level, and the vocabulary must be kept to a manageable size, so many words in the training and test data do not appear in it. These out-of-vocabulary words are called “unknown words”. The most common solution is to replace every unknown word with a universal symbol, <unk>, which discards the valuable information carried by the source word. This is why the paper proposes a hybrid model that treats unknown words separately, character by character.
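The baseline <unk> strategy that the hybrid model improves on is simple to sketch; the function name and toy vocabulary below are illustrative, not from the paper.

```python
def replace_unknowns(tokens, vocab, unk="<unk>"):
    """Map every out-of-vocabulary token to a universal <unk> symbol,
    losing whatever information the original word carried."""
    return [t if t in vocab else unk for t in tokens]

vocab = {"a", "cat"}
replace_unknowns(["a", "cute", "cat"], vocab)  # -> ['a', '<unk>', 'cat']
```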
Hybrid Neural Translation Model
Suppose that ‘cute’ is an unknown word in the English phrase ‘a cute cat’, and we want to translate the sentence into Czech. First, the hybrid model handles the words that exist in the vocabulary, just as in the usual word-based NMT. Then a deep LSTM runs over the unknown word separately, at the character level.
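The character-level pass can be sketched as composing a representation of the unknown word from its characters. The paper uses a deep LSTM; the single-layer tanh RNN below, with random toy weights, only illustrates the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
alphabet = "abcdefghijklmnopqrstuvwxyz"
d = 8  # toy hidden size; the paper uses deep LSTM layers instead

W_xh = rng.normal(scale=0.1, size=(d, len(alphabet)))
W_hh = rng.normal(scale=0.1, size=(d, d))

def char_encode(word):
    """Run a toy tanh RNN over the word's characters and return the
    final hidden state as the word's character-level representation."""
    h = np.zeros(d)
    for ch in word:
        x = np.zeros(len(alphabet))
        x[alphabet.index(ch)] = 1.0  # one-hot character input
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

h_cute = char_encode("cute")  # representation of the unknown word 'cute'
```

The resulting vector takes the place of the word embedding that an in-vocabulary word would have, so the rest of the encoder can proceed unchanged.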
For the regular NMT model, we optimize by minimizing the cross-entropy loss function (D refers to the parallel corpus):

J = \sum_{(x,y) \in D} -\log p(y|x)
In the proposed hybrid model, the loss function becomes:

J = J_w + \alpha J_c
where J_w refers to the previous word-level decoding loss and J_c to the new character-level decoding loss (α is set to one in this paper).
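The combined objective is a straightforward weighted sum, sketched below with hypothetical per-token probabilities standing in for the word- and character-level decoders' outputs.

```python
import math

def nll(probs):
    """Cross-entropy (negative log-likelihood) over predicted tokens."""
    return -sum(math.log(p) for p in probs)

def hybrid_loss(word_probs, char_probs, alpha=1.0):
    """J = J_w + alpha * J_c: word-level loss plus the character-level
    loss from decoding rare words (alpha = 1 in the paper)."""
    return nll(word_probs) + alpha * nll(char_probs)

# Hypothetical probabilities from the word and character decoders:
J = hybrid_loss([0.9, 0.7, 0.8], [0.6, 0.5, 0.9, 0.4])
```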
In the decoding step, in particular, the hybrid model uses a global attention mechanism. First, it produces a context vector c_t, computed from the alignment between the current target hidden state h_t and the source hidden states. Then it uses this context vector to compute a new attentional hidden state:

\tilde{h}_t = \tanh(W[c_t; h_t])
This new state is then used to compute the probability of generating a target word:

p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s \tilde{h}_t)
The basic idea of attention is to represent each source word as its own vector rather than compressing the whole sentence into one, which saves memory and computation time and helps prevent over-fitting by reducing the number of parameters. By running an RNN in both directions, we obtain a bidirectional representation of each word. Then c_t can be computed as:

c_t = H a_t
where H is the matrix whose columns are the vectorized representations of all the words in the input sentence, and a_t is the attention vector, whose elements take values between zero and one and sum to one.
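The whole attention step can be sketched end to end. This toy version uses simple dot-product alignment scores and random weights; it is a shape-level illustration of c_t = H a_t and the attentional state, not the paper's exact scoring function.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: non-negative weights that sum to one."""
    e = np.exp(z - z.max())
    return e / e.sum()

def global_attention(H, h_t, W):
    """Global attention sketch: score each source state against h_t,
    normalize to weights a_t, form the context c_t = H @ a_t, then
    compute the attentional state tanh(W [c_t; h_t])."""
    scores = H.T @ h_t                     # alignment score per source word
    a_t = softmax(scores)                  # attention weights
    c_t = H @ a_t                          # context vector
    h_tilde = np.tanh(W @ np.concatenate([c_t, h_t]))
    return h_tilde, a_t

rng = np.random.default_rng(1)
d, n = 4, 3                                # toy hidden size, source length
H = rng.normal(size=(d, n))                # columns: source hidden states
h_t = rng.normal(size=d)                   # current target hidden state
W = rng.normal(size=(d, 2 * d))            # toy projection weights
h_tilde, a_t = global_attention(H, h_t, W)
```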
The newly created vector is then used to initialize the character-level decoder, because taking the context of the source sentence into account helps the model predict the target word. Rather than reusing the word path's attentional vector directly, the paper proposes a different approach called separate-path target generation, which creates a counterpart vector computed in a similar way but with its own weights:

\breve{h}_t = \tanh(\breve{W}[c_t; h_t])
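The separate path is the same functional form with its own weight matrix, so the word path's state is left untouched. A minimal sketch with random toy weights:

```python
import numpy as np

def separate_path_state(c_t, h_t, W_sep):
    """Separate-path target generation: a counterpart attentional vector
    computed with its own weights W_sep, used only to initialize the
    character-level decoder for <unk> targets."""
    return np.tanh(W_sep @ np.concatenate([c_t, h_t]))

rng = np.random.default_rng(2)
d = 4
c_t, h_t = rng.normal(size=d), rng.normal(size=d)
W_sep = rng.normal(size=(d, 2 * d))  # weights separate from the word path
h_sep = separate_path_state(c_t, h_t, W_sep)
```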
The hybrid neural machine translation model achieved gains of +2.1 to +11.4 BLEU over other NMT models on the WMT’15 English-Czech translation task, and the best system reached 20.7 BLEU, the highest score. It also achieved a higher chrF3 score.
Even when the vocabulary size is relatively small, hybrid NMT already outperforms regular word-based NMT. As the vocabulary size increases, the experiments show that hybrid NMT continues to perform well and obtains the highest BLEU score.
The character-based model, which replaces the pure <unk> technique, contributes most to the BLEU score: +2.1 BLEU with a 10k vocabulary. The separate-path method is also significant, improving the score by 1.5 BLEU.
The hybrid architecture proposed in this paper combines a word-based model, which is fast and easier to train, with a character-based model, which handles unknown words very well, to produce higher-quality translations. The paper also demonstrates the potential of purely character-based models despite their very low training speed; further work could therefore focus on making character-based models faster.
Code, data & models
Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models.
Minh-Thang Luong and Christopher D. Manning, Computer Science Department, Stanford University, Stanford, CA 94305
Neural Machine Translation and Sequence-to-sequence Models: A Tutorial.
Graham Neubig, Language Technology Institute, Carnegie Mellon University
Author: Kevin Jin | Reviewer: Zhen Gao