Introduction
In this paper, the authors run massive experiments on the WMT English-to-German translation task in order to analyze NMT hyperparameters. They claim that, although no innovation to the NMT architecture is introduced, a classic NMT baseline system with carefully tuned hyperparameters can still achieve results comparable to the state of the art.
NMT Baseline Math Recap
The classic Encoder-Decoder model is shown in the figure below:
As shown, the encoder is a bi-directional RNN. It takes the input sequence of source tokens x = (x_1, x_2, …, x_m) and produces a sequence of encoder hidden states. Each encoder hidden state is the concatenation of the states from both directions, i.e. h_i = [h_i^forward; h_i^backward]. The decoder is also an RNN. Its input y is the previously predicted word (or the ground-truth target word during training); it is fed forward through the hidden layer to produce the decoder state s_i, which is further transformed and used for target prediction via a softmax function. Note that the probability of the current target word is conditioned on the encoder states h (modeled through a context vector c_i), the current decoder state s_i, and the previous words y_{<i}. The training objective is to maximize P(y_i | y_{<i}, c_i, s_i). The context vector c_i is also called the attention vector and is calculated as follows.
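In the usual attention formulation (reconstructed here from the description that follows), the context vector takes the form

c_i = \sum_{j=1}^{m} a_{ij} h_j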
This is a weighted sum of the source hidden states h_j, where a_{ij} is the attention weight, calculated as follows.
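The weights are presumably the softmax-normalized attention scores:

a_{ij} = \frac{\exp(\mathrm{att}(s_i, h_j))}{\sum_{k=1}^{m} \exp(\mathrm{att}(s_i, h_k))}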
Here, att(·, ·) is the attention function, which can be any scoring function; s_i is the i-th decoder state and h_j is the j-th encoder state. In this paper, the authors compare two attention functions: an additive attention function and a dot-product attention function, shown below.
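Given the role of v, W_1, and W_2 described next, these should correspond to the familiar additive (Bahdanau-style) and multiplicative (Luong-style) forms:

att_add(s_i, h_j) = v^T \tanh(W_1 h_j + W_2 s_i)

att_dot(s_i, h_j) = (W_1 h_j)^T (W_2 s_i)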
Here, v is a learnable vector used for calculating the score, and W1 and W2 transform the source hidden states and target hidden states into the same dimensionality. Overall, the probability of the decoder output over a vocabulary V can be calculated as follows.
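Assuming the usual projection-plus-softmax parameterization over the concatenated decoder state and context vector, this is

P(y_i | y_{<i}, x) = softmax(W [s_i; c_i] + b)

where W projects onto the vocabulary V and b is a bias term.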
That is the decoder output distribution. Additionally, it can be conditioned on more signals, for example softmax(W[s_i; c_i; f_i]), where f_i is the embedding of the current decoder input, or softmax(W[s_i; c_i; f_i; t_i]), where t_i is a topic vector that biases the predicted word towards a certain range.
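To make the recap concrete, here is a minimal NumPy sketch of additive attention plus the softmax output layer for a single decoder step. All names (W1, W2, v, W_out) and dimensions are made up for illustration; this is not the authors' implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def additive_attention(s_i, H, W1, W2, v):
    """Additive attention: score_j = v^T tanh(W1 h_j + W2 s_i) for every encoder state h_j."""
    scores = np.array([v @ np.tanh(W1 @ h_j + W2 @ s_i) for h_j in H])
    a_i = softmax(scores)                 # attention weights a_{ij}
    c_i = (a_i[:, None] * H).sum(axis=0)  # context vector c_i = sum_j a_{ij} h_j
    return c_i, a_i

def output_distribution(s_i, c_i, W_out, b_out):
    """P(y_i | ...) = softmax(W [s_i; c_i] + b) over the target vocabulary."""
    return softmax(W_out @ np.concatenate([s_i, c_i]) + b_out)

# Toy dimensions: 4 source positions, states of size 8, attention dim 6, vocabulary of 10.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))                                 # encoder hidden states h_1..h_4
s_i = rng.normal(size=8)                                    # current decoder state
W1, W2 = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))   # project into the attention space
v = rng.normal(size=6)
W_out, b_out = rng.normal(size=(10, 16)), np.zeros(10)      # output projection and bias

c_i, a_i = additive_attention(s_i, H, W1, W2, v)
p_y = output_distribution(s_i, c_i, W_out, b_out)           # distribution over the 10 "words"
```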
Data Preprocessing
The authors run massive experiments on the WMT English-German dataset, which includes 4.5M training sentence pairs. The dev set is newstest2013, and the test sets are newstest2014 and newstest2015. After tokenization with the Moses tools, they pre-process both the source and target corpora with BPE (Byte Pair Encoding), creating 32,000 new symbols (each representing a character n-gram). The resulting vocabulary size is around 37k.
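For intuition, BPE starts from characters and repeatedly merges the most frequent adjacent symbol pair until the requested number of merge operations is reached. Below is a toy sketch of that merge-learning loop (a simplified illustration of the algorithm, not the actual preprocessing pipeline used in the paper):

```python
import collections

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent symbol pair with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = dict(word_freqs)
    merges = []
    for _ in range(num_merges):
        pairs = collections.Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges

# Tiny corpus: each word is split into characters plus an end-of-word marker.
corpus = {
    tuple('low') + ('</w>',): 5,
    tuple('lower') + ('</w>',): 2,
    tuple('newest') + ('</w>',): 6,
    tuple('widest') + ('</w>',): 3,
}
print(learn_bpe(corpus, num_merges=10))
```

Each learned merge corresponds to one of the new symbols (character n-grams) mentioned above; in the paper, 32,000 such operations are learned, which together with the base characters gives the roughly 37k vocabulary.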
Training Configuration
The baseline training uses batch size 128, beam size 10, and length penalty 0.6. The baseline model uses the standard 2-layer stacked bidirectional encoder and a 2-layer decoder. They use GRU cells of size 512 for both the encoder and the decoder, dropout of 0.2 between layers, and an input/output embedding size of 512.
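For reference, these baseline settings can be summarized in a small configuration sketch (the key names below are purely illustrative; they are not the configuration format actually used by the authors):

```python
# Baseline hyperparameters as described in the post (key names are illustrative only).
baseline_config = {
    "batch_size": 128,
    "beam_width": 10,
    "length_penalty": 0.6,
    "cell_type": "GRU",
    "cell_size": 512,                 # both encoder and decoder cells
    "embedding_size": 512,            # input and output embeddings
    "encoder": {"type": "bidirectional", "num_layers": 2},
    "decoder": {"num_layers": 2},
    "dropout": 0.2,                   # applied between layers
}
```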
Massive Experiments
Embedding Dimensionality
As shown in the figure above, the embedding size does not make much difference. Even 128-dimensional embeddings work quite well, while converging almost twice as fast. Overall, 2048-dimensional embeddings work best, but only by a small margin.
RNN Cell Exploration
As shown in the figure above, the LSTM cell performs better than the GRU cell. The authors note that their training data is not very large and that the highest time cost comes from the softmax operation, so they do not observe an obvious training speed difference between LSTM and GRU. Note that both LSTM and GRU are better than the vanilla decoder, which means the decoder does pass important information through its own hidden states, instead of depending only on the context vector and the current decoder input.
Encoder and Decoder Depth
As shown in the table above, the baseline bi-directional encoder-decoder already works reasonably well. The authors also explore uni-directional encoders of varying depth, with and without reversed input. The results show that a simple combination of the input and its reverse always performs better than the input alone, which suggests that the encoder can create a richer representation for early input words this way. Note that the benefit from increasing encoder depth is not obvious; a 2-layer encoder is good enough.
Attention Function
As mentioned in the NMT math recap section, the authors use two attention functions (dot product and additive) to calculate the attention score. They refer to the dimensionality of W1h_j and W2s_i as the "attention dimensionality" and vary it from 128 to 1024. As shown in the table above, the additive attention function is quite stable as the attention dimensionality changes, and its best result (22.47) is achieved with a 512-dimensional vector. Compared to addition, the dot-product attention function is not very stable, and its performance decreases beyond 256 dimensions. Note that the best results of the additive and dot-product variants are quite similar (22.33 vs. 22.47), and both are far better than "without attention" (9.98) and "without input" (11.57). In summary, the attention mechanism is important for good translation results.
Beam Size
The authors also investigate the beam size, varying it from 1 to 100:
As shown in the table above, B1 corresponds to beam size 1, i.e. greedy search. Its performance is worse than beam search (B3, B5, B10, B25, B100, B10-LP-0.5, B10-LP-1.0). Across the different beam sizes, the results are very similar, so we cannot conclude that a larger beam size helps translation. Additionally, the length penalty consistently improves the results, for example B10 (21.57) vs. B10-LP-0.5 (21.71) vs. B10-LP-1.0 (21.80).
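The length penalty here presumably follows the length normalization of GNMT [1], which divides each candidate's log-probability by

lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}

where |Y| is the length of the candidate translation and \alpha is the penalty strength (0.5 and 1.0 in the rows above); a larger \alpha favors longer translations.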
Optimal Hyperparameters
The authors give their optimal experimental parameters for the attention-based encoder-decoder. It achieves quite good results (22.19 on newstest14 and 25.23 on newstest15) compared to many NMT models; only GNMT [1] (24.61 on newstest14) is better than their well-tuned model:
As shown in the table above, even a well-tuned baseline model with good initialization can surpass many benchmarks.
Some Reviewer's Thoughts
In this paper, the authors perform a large-scale analysis of neural machine translation. It yields some clear conclusions, such as that the attention mechanism consistently helps, and that a bidirectional encoder (or a combination with reversed input) is better than a plain uni-directional encoder. I think future research should also investigate various decoding techniques and hybrid systems with combined SMT features (language models based on monolingual data, translation lexicons from bilingual dictionaries, hierarchical reordering, phrase tables, etc.) to help encoder learning.
Reference
[1] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
Author: Shawn Yan | Reviewer: Haojin Yang