
Accelerating Convolutional Neural Networks on Raspberry Pi

This method, with considerably reduced computation time and a controllable model size, is particularly simple and very useful for implementations on mobile devices such as phones and drones.

Koustubh Sinhal was a co-founder and CTO of iLenze, a computer vision and visual commerce company based in India; he is now the CTO of AIMonk. In this blog, Koustubh applied various methods to speed up CNNs and reduce their memory and computational requirements.

To begin, he identified the top three problems that limit the deployment of neural networks for image recognition:

  1. Low speed – cannot run real-time applications on embedded devices such as the Raspberry Pi and NVIDIA Jetson
  2. Very large model size – models are too large to fit into small devices
  3. Large memory footprint – requires too much RAM

Therefore, those models are usually deployed on servers with large GPUs.

The reason CNNs are so slow is that they spend most of their computation time in convolutional layers and most of their storage in fully connected layers. Although some newer models reduce or remove the fully connected layers, CNNs still need large GPUs to accelerate the underlying matrix multiplications. Small devices like the Raspberry Pi (2 GB of RAM) are therefore unable to run very deep CNN models with high recognition rates.

Koustubh Sinhal’s suggestion for accelerating CNNs is to decompose the convolutional layers so as to exploit the redundancy in the parameters and response maps of deep networks, instead of decomposing the weight matrix, which only works on shallow models. He claimed that this strategy achieves a 4x speedup on convolutional layers with only a 0.3% increase in top-5 error for the VGG model on ImageNet.

His main idea is to find a lower-rank subspace of the convolutional layer's responses using PCA. The principle is illustrated in Figure 1.

Assume we have points in a 2-dimensional space, as shown in Figure 1. This 2-D space is overkill if all the points lie close to the blue straight line: although the points sit in a 2-D space, they can be approximately represented in a 1-D space. Finding that line with PCA is easy (a minimal sketch follows Figure 1), and the same idea extends to a d-dimensional space, which is exactly our case.

Figure 1.
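
To make the Figure 1 idea concrete, here is a minimal PCA sketch in NumPy (the data is synthetic and purely illustrative): points scattered near a line in 2-D are projected onto the top principal direction and reconstructed in 1-D with little loss.

```python
import numpy as np

# Synthetic 2-D points that lie close to the line y = 2x.
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=200)
points = np.stack([t, 2.0 * t], axis=1)
points += rng.normal(scale=0.05, size=points.shape)  # small off-line noise

# PCA via SVD of the mean-subtracted data.
mean = points.mean(axis=0)
_, s, vt = np.linalg.svd(points - mean, full_matrices=False)
direction = vt[0]  # top principal direction (the fitted line)

# Represent each point by a single coordinate along that direction,
# then map back to 2-D.
coords_1d = (points - mean) @ direction
reconstructed = mean + np.outer(coords_1d, direction)

print("singular values:", s)  # the first dominates the second
print("mean reconstruction error:",
      np.linalg.norm(points - reconstructed, axis=1).mean())
```

The first singular value dominates, so one direction captures almost all of the variance; replacing 2-D with d dimensions and one direction with r directions gives the construction used below.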

In the case of decomposing the convolutional matrix, assuming the response maps are highly redundant and lie in a lower-dimensional subspace, we can follow two steps so that the error introduced at each layer is compensated.

Step 1. Constructing Convolutional Layers

Assume a convolutional layer with a weight tensor of size d*k*k*c, where k is the spatial size of the convolutional kernel, c is the number of input channels, and d is the number of filters in the layer. Consider a single filter of size k*k*c: folding in the bias term gives an associated weight vector w of dimension k*k*c + 1 and an input feature vector z of the same dimension, so the filter's output is the inner product of w and z, as shown in Figure 2.

Figure 2.

Now consider all d kernels applied to the input. The output response map is then y = Wz, where the rows of W are the flattened filters; this is shown in Figure 3, and a small numeric check of this matrix view follows the figure.
Figure 3.
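
As a sanity check on the matrix view, here is a minimal NumPy sketch (all sizes are illustrative assumptions, not values from the blog) showing that, once the biases are folded into the weights, a filter bank applied to one k*k*c input patch is exactly the product y = Wz:

```python
import numpy as np

d, k, c = 8, 3, 16  # illustrative: 8 filters of spatial size 3x3 on 16 channels
rng = np.random.default_rng(1)

filters = rng.normal(size=(d, k, k, c))
biases = rng.normal(size=d)
patch = rng.normal(size=(k, k, c))  # one k x k x c input patch

# Fold the biases into the weights: W is d x (k*k*c + 1),
# z is the flattened patch with a trailing 1 for the bias.
W = np.concatenate([filters.reshape(d, -1), biases[:, None]], axis=1)
z = np.concatenate([patch.ravel(), [1.0]])

y = W @ z  # the d-dimensional response at this spatial position

# Same result as applying each filter to the patch directly.
y_direct = np.einsum('dijc,ijc->d', filters, patch) + biases
assert np.allclose(y, y_direct)
```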

Hence, y is a d-dimensional vector, and d is very large in deep networks. Nevertheless, in a practical CNN only part of the activations in the output vector affects the final prediction (e.g., classification), so the rest of the vector is redundant. To remove this redundancy, the idea is to map the high-dimensional response vectors to a low-dimensional space.

Step 2. Decomposition

The decomposition step is calculated as follows. Assume the number of most effective orthogonal directions is r (r < d). The approximation of the mean-subtracted output response can be obtained by:

ỹ - y_m = V V^T (y - y_m)

where V is the d x r matrix formed by arranging the top r direction vectors column-wise, the subscript m indicates the mean value, and the tilde (~) sign denotes the approximation.
Then, substituting y = Wz, the approximated response vector based on the approximated mean-subtracted vector is given by:

ỹ = V V^T W z + b,  where b = y_m - V V^T y_m

If a new matrix L is defined as the matrix product L = V^T W, the equation above can be rewritten as:

ỹ = V L z + b

Therefore, the original single convolutional layer is decomposed into two layers, shown in Figure 4, with weight tensors L and V, and b can be treated as the bias term of the second convolutional layer.

Figure 4.
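
Putting the two steps together, here is a minimal NumPy sketch of the whole decomposition (the sizes and the synthetic low-rank W are illustrative assumptions, not values from the blog): sample responses y = Wz, estimate V by PCA, form L = V^T W and the bias b, and verify that V L z + b approximates the original responses.

```python
import numpy as np

d, k, c, r = 64, 3, 16, 8  # illustrative sizes, with r << d
rng = np.random.default_rng(2)

# For the demo, make W close to rank r so its responses really lie
# near an r-dimensional subspace.
W = rng.normal(size=(d, r)) @ rng.normal(size=(r, k * k * c + 1))
W += 0.01 * rng.normal(size=W.shape)

# Sample responses y = Wz over many input patches (trailing 1 = bias).
Z = np.concatenate([rng.normal(size=(k * k * c, 1000)),
                    np.ones((1, 1000))], axis=0)
Y = W @ Z                              # d x 1000 response samples
y_m = Y.mean(axis=1, keepdims=True)    # mean response

# PCA: top-r principal directions of the mean-subtracted responses.
U, _, _ = np.linalg.svd(Y - y_m, full_matrices=False)
V = U[:, :r]                           # d x r

# Decompose into two layers: L (r filters of size k*k*c), then V (1x1, d filters).
L = V.T @ W                            # r x (k*k*c + 1)
b = (y_m - V @ (V.T @ y_m)).ravel()    # bias of the second layer

Y_approx = V @ (L @ Z) + b[:, None]
rel_err = np.linalg.norm(Y - Y_approx) / np.linalg.norm(Y)
print(f"relative approximation error: {rel_err:.4f}")  # small, at the noise level
```

In a real network, V and L would replace the layer's weights and the nonlinearity (handled by the asymmetric reconstruction of [1]) would sit on top; this sketch only checks the linear algebra.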

Now the parameter (and per-position computation) count has been reduced from d(k*k*c + 1) to r(k*k*c + 1) + rd, where r << d, which saves significant computation time; a quick arithmetic check follows.
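
As a quick check of that count, take illustrative VGG-like sizes (d = 512 filters, k = 3, c = 512 channels, r = 128; these particular numbers are assumptions, not from the blog):

```python
d, k, c, r = 512, 3, 512, 128

original = d * (k * k * c + 1)             # one layer of d filters (+ biases)
decomposed = r * (k * k * c + 1) + r * d   # layer L (r filters) + 1x1 layer V

print(original, decomposed, round(original / decomposed, 2))
# 2359808 655488 3.6
```

With these sizes the decomposed pair stores about 3.6 times fewer parameters, and the multiply-adds per output position scale the same way, in line with the roughly 4x speedup quoted earlier.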

Technical Comments:

The paper [1] that Koustubh cited and summarized proposes a method to accelerate the test-time computation of convolutional layers and their subsequent non-linear units, without retraining by stochastic gradient descent. The basic idea, as shown above, is that the responses can be approximated with a low-rank decomposition and the error later compensated, in a manner similar to GSVD (Generalized Singular Value Decomposition). This reduces the accumulated error when optimizing multiple layers (e.g., >10). The acceleration method is very efficient on very deep networks because it applies a nonlinear, asymmetric reconstruction.

Recently, a number of scientific-computing techniques have been applied to improve the computational efficiency of deep networks. Apart from the acceleration of convolutional layers and their subsequent nonlinearities shown above (cf. [2]), other kinds of methods have been proposed, for instance performing low-rank decomposition directly on the convolution kernels [3], or sparsifying the connections of a deep ConvNet [4]. Compared to state-of-the-art papers that chase higher recognition rates, such methods may result in more practical engineering solutions.

People can try this method at home to test its performance, and there are plenty of small devices suitable for computer vision use cases. You can choose one from: http://www.learnopencv.com/embedded-computer-vision-which-device-should-you-choose/

References:

[1] Zhang, Xiangyu, Jianhua Zou, Kaiming He, and Jian Sun. “Accelerating very deep convolutional networks for classification and detection.” IEEE Transactions on Pattern Analysis and Machine Intelligence. 2016.
[2] Zhang, Xiangyu, et al. “Efficient and accurate approximations of nonlinear convolutional networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[3] Denton, Emily L., et al. “Exploiting linear structure within convolutional networks for efficient evaluation.” Advances in Neural Information Processing Systems. 2014.
[4] Sun, Yi, Xiaogang Wang, and Xiaoou Tang. “Sparsifying neural network connections for face recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.


Author: Bin | Editor: Joni Chung | Localized by Synced Global Team: Xiang Chen
