Recurrent Neural Networks (RNN) are widely used to solve a variety of problems, and as the quantity of data and the amount of available computing power have increased, so have model sizes. The number of parameters in recent state-of-the-art networks makes them hard to deploy, especially on mobile phones and embedded devices. The challenge is due to both the size of the model and the time it takes to evaluate it.
In order to deploy these RNNs efficiently, we propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network. At the end of training, the parameters of the network are sparse, while accuracy is still close to the original dense neural network. The network size is reduced by 8 times, and the time required to train the model remains constant.
Additionally, we can prune a larger dense network to achieve better than baseline performance while still reducing the total number of parameters significantly. Pruning RNNs reduces the size of the model, and can also help achieve significant inference time speed-up using sparse matrix multiply. Benchmarks show that, using our technique, model size can be reduced by 90% and speed-up is around 2 to 7 times.
REVIEW STARTS BELOW
Before I review the contents of this paper, I will start by giving a brief review on Recurrent Neural Networks (RNN). The standard version of a RNN is a computational simplification of networks of neurons and their action potentials (spikes) in the biological brain. Graphically, each neuron is a node and each synapse is an edge. Every neuron has a time varying activation potential (assumed to be in discrete time and synchronous clocking). Each synapse represents the transmission of activation value (data) from an upstream neuron to a downstream neuron. Some useful terminology that may be useful when reading literature on RNNs are listed.
- observed neurons: neurons that have observed activations
- dependent neurons: neurons that determine their activation by taking a linear combination of the activations it receives from its upstream neurons, followed by an activation function
- output neurons: dependent neurons whose activations have a target output
- hidden neurons: neurons which are latent but not output, e.g., their activation is not immediately clear
The transmission of activation along the edges of the network can be discretized into iterations, hence the activation of a dependent neuron at each iteration is always calculated based on the activations of its upstream neighbors in the last iteration. The weight matrix (usually denoted as W), defines how the activations transmit. This sequential transmission of information induces memory in the neural network, and is the core idea behind RNNs. RNNs are called reccurent because they perform the same task for every element of an information sequence, with the output depending on previous computations. For example, if the network was to predict the next word in a sentence, it better know which words came before it!
Training RNNs are similar to training a classic neural network using back propagation algorithm. However, because the parameters are shared by all time steps in the network, the gradient at each output depends on calculations at the current iteration as well as those from previous iterations.
Sparsity in Recurrent Neural Networks
Deep neural networks for acoustic modeling had approximately 11 million parameters which grow to approximately 67 million for bidirectional RNNs. in language modeling the size of these parameter sets are even bigger. As mentioned previously, the synapses (edges) that connect the network are defined by a weight matrix W. For complicated tasks such as NLP, W is often large. These large models face two significant challenges in practice. Modern portable devices have limited memory and storage, and the evaluation of large models requires significant computation time.
This paper proposes a method to reduce the number of weights in RNNs. This is done by progressively setting more weights to zero using a monotonically increasing threshold during training. Defining and controlling the shape of a function that maps iteration count to threshold value, they can control how sparse the final weight matrix W becomes. This paper trims all the weights of a recurrent layer while keeping other layers with small number of parameters untouched. Each layer can have its own threshold function. The paper achieves sparsity of 90% with a small loss in accuracy. Their results are shown to work with Gated Recurrent Units as well as vanilla RNNs (both are just different classes of RNNs).
The method of which the paper proposes to prune the weights is by maintaining a set of masks. These masks are determined by a monotonically increasing threshold and a set of hyper parameters that are used to determine the threshold. The steps to pruning follow the steps below. The hyper-parameters are given in the table below along with their descriptions.
- During model initialization, create a binary mask for each weight in the network. Set these binary masks to one initially
- At each iteration, the masks are updated by setting all parameters that are lower than the threshold epsilon to zero
- After each optimizer update step, each weight is multiplied with its corresponding mask
Many of these hyper-parameters are determined by heuristics. Once these hyper-parameters have been determined, the pruning algorithm can be executed as shown below.
Since the threshold is calculated via the heuristic approach, it’s hard to really have an intuition of why this algorithm works the way it does. Instead, I will only provide here how the algorithm works. The theta parameter is calculated from equation (1), and the threshold epsilon is updated under the following conditions.
- if the current iteration is within the iterations of which pruning is allowed, AND
- if the current iteration is an update iteration (iteration mod frequency of update is zero), AND
- if the current iteration does not require an increase in the rate of pruning, OR
- if the current iteration requires an increase in the rate of pruning
Recall that this paper only prunes the weights of the recurrent and linear layers, but not the biases or batch norm parameters. This is because the weights in the recurrent and linear layers are much larger in number and therefore, are worth making sparse.
The paper uses the Deep Speech 2 model for their experiments. The model has 2 convolution layers, followed by 7 bidirectional recurrent layers and a CTC cost layer. The layers of the model is shown below.
As mentioned previously, only the bidirectional recurrent linear layers will be subject to pruning as the number of parameters in those layers are large. The experimental results show that sparsity comes at a cost of accuracy of the model. Depending on the performance requirements of the application, one can determine the hyper-parameters to adjust the percentage of pruning. The experimental results of the Deep Speech 2 model is shown in Table 3 below.
The conclusion is that pruning works better on larger models than on smaller ones. For the same size RNNs (only the baseline case 1760 is shown), pruning the weight matrix to achieve sparsity of around 90% results in approximately 20% decrease in performance. If we extrapolate the performance numbers (as a comparison between dense medium and sparse medium is not explicitly shown), we can see that sparsity comes at an approximately 10% decrease in performance. It is also concluded that by having a gradual pruning model, rather than one that masks the weights at fixed intervals, will produce better results. This is observed by comparing RNN sparse 1760 performance with the results of Table 4. Aside from smaller memory space, the advantage of having a sparse matrix is the speed up of which calculations are carried out. It can been seen from their results that it’s possible to achieve speed ups up to $6$ times with 95% sparsity. For all the results, please find the paper @ https://arxiv.org/abs/1704.05119.
This paper demonstrated that by pruning the weights of RNNs during training, they can find sparse models that are accurate (especially in large, dense models), while significantly reducing model size. I would say the drawback is that this is backed up only by empirical results and lacks theory. Therefore, if any intuition that could be provided, it will be this method of achieving sparsity by zeroing out weights is similar to adding (non-random) noise to the network. The advantages of doing so is in the increased computation efficiency and data storage. Future goals of such techniques should be directed at achieving sparsity without sacrificing too much accuracy. In addition, research should also be directed towards improving the accuracy of small models when using a sparse matrix.
Author: Joshua Chou