This talk focuses on engineering techniques for large-scale NMT systems, and will help us understand how GPU works, and how GPU computing makes machine translation faster. For example, how did engineers reduce the training time for a state-of-the-art NMT system from 15 days to 3 days on a single GPU?
Summary of Each Key points
- History of neural machine translation:
- 2003, Large-scale neural networks started to be used for NLP based on Bengio’s feed-forward neural language model
- 2014, Sequence-to-Sequence by Sutskever et al, we used to process source sentences with RNN, get a hidden state, and use it to run another RNN language model to reproduce one word each time step on the target side
- 2015, Attention model by Bahdanau et al , which is the basis of the most current NMT implementation
- A common problem for all aforementioned methods is that training takes long time (several weeks with multiple GPUs)
- The training time makes it very difficult to do research with such long turn-around time.
- RNN structures and GPU optimization background
- Let’s take a look at the structure of RNN
- The element-wise function are implemented as CUDA kernels, so that the same kernel code is executed on many CUDA cores in parallel
- The large the batch size, the more efficiency you get (matrix multiplication with small batch sizes are very inefficient). Because larger matrix can make better use of GPU cores and RAM
- The problem is that we cannot simply use a large batch size (for example 1024)
- For RNN, minibatch is a collection of sequences
- Large batch of sequences probably won’t fit in memory, and will cause training inefficiencies
- How to optimize RNN with GPU?
- GEMM fusion: combine all matrix multiplication into a single call
- one type of GEMM fusion is fusing all weight matrices together (fuse w)
- another type of GEMM fusion is fusing across time steps (fuse x_i,…,x_n)
- if we do weights fusion and time steps fusion together, we get better efficiency
- Element-wise fusion: combine all element-wise functions
- GEMM fusion: combine all matrix multiplication into a single call
- fuse all element-wise functions into a single kernel, which would run the element-wise functions in sequence
- Variable-length sequences
- pad all sequences to the maximum size (very inefficient)
- terminate the recurrent loop early
- Use buckets, i.e. different bucket can contain different network graphs with different maximum sequence lengths
- Bucketing neural network
- different buckets can have different batch sizes. Just need to make sure total words number is fixed. e.g. batch size = 4000 words, bucket [7,11] has 384 sentences, bucket [40,50] has 64 sentences, notice that the sentences number is better rounded to multiples of 64
- fixed batch size to ensure the magnitude of gradients are similar, thus make training more stable
- Large output vocabulary
- A typical vocabulary size for NMT is 100,000; a typical hidden layer size is 1024. This means the output layer is an [n * 1024 * 100000] GEMM, where n is the batch size. Computing the full softmax every time is painful. One solution is that we can sub-sample the output layer, and compute part of the softmax.
- The best sub-sampling method that can be implemented with GPU is Importance Sampling (Bengio & Senecal, 2008). Other sub-sampling methods include: Noise Contrastive Estimation, BlackOut layer, Hierarchical Softmax.
- Importance Sampling.
- The softmax function convert logit x_i to probabilities over all words in the vocabulary. Importance sampling aims to approximate Z using a portion of the vocabulary.
- The importance sampling formulation: always include top m most common words and n random words from the vocabulary. Sampled vocabulary shared across minibatch
- Approximate softmax function is shown below, notice that we need to multiply the randomly sampled n words’ scores after exponentiated them by a factor of α, α = (|V| – m ) / n
- Why multiply α? If we don’t do this, rare words will have inflated probabilities. For example. if we have a vocabulary size of 100,000, we first get the top m = 8,000 most common words. Then randomly sample n = 8,000 words from 92,000 remaining words, which means each words in the random sample has been sampled 92,000/8,000 ≈ 11.5 times. Multiplying α to the rare word sample makes estimated Z bigger, pushing down the probability of rare words.
- The following table shows the loss, test set BLEU score, and training time for full softmax and sampled softmax
- Training on multiple GPUs
- Train on 2-4 GPUs using asynchronous SGD. Each GPU processes a different minibatch in parallel, updates a set of “master” weights independently of one another
- The gradients/weights must be transferred from the GPU to/from the parameter server (GPU or CPU)
- Difference between training and inference: training is on GPU only, inference is on CPU or GPU; training uses large batch, inference uses small batch; during training, weights change every batch, during inference they are fixed; at training time, the source and target networks are equally expensive, at inference time, the target network is b times of the source network (b associated with the beam search window size)
- large vocabulary problem: computing the full softmax over 100,000 words is intractable, we want the test time to be much shorter than the training time. Therefore, we prune the vocabulary size to a few hundred words which are plausible for the current sentence. We usually run a standard word alignment (“IBM Alignment”) to produce a lookup table. (blue cells have the source words, light blue cells are the lookup tables for each source words, combine all words in the lookup table and get the shortlist vocabulary that really goes into softmax)
- Efficient matrix multiplication on CPU
- small batch size (1 to 8)
- Intel MKL and OpenBLAS have very optimized floating point matrix multiplication implementations
- Model selection:
- The target is much more expansive than source (because of beam search)
- The first layer is 1/2 the cost of all other layers (because of pre-computation)
- The cost of a layer is quadratic with the size ( i.e. 1024-dim is 4 times the cost of 512-dim）
- GRU is 25% computationally cheaper than LSTM
- Model ensemble can be worthwhile, but has diminishing returns
- IBM alignment model: https://en.wikipedia.org/wiki/IBM_alignment_models
- MKL : https://github.com/01org/mkl-dnn
- Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation https://arxiv.org/pdf/1609.08144.pdf
- In importance sampling, why multiply α to the randomly sampled n words? what will happen if we don’t do that?
- How to choose m and n? empirically?
Analyst: Yuiting Liu | Localized by Synced Global Team : Xiang Chen