Paper Source: https://arxiv.org/abs/1609.04112

With increasingly sophisticated and effective neural network architectures emerging, Convolutional Neural Networks (CNNs) have outperformed traditional digital image processing methods such as SIFT and SURF. In computer vision, researchers have shifted their focus to CNNs, and many believe that CNNs are the future of the field. However, there is little understanding of why they succeed empirically, so explaining what happens inside these networks is an active research topic. There are three mainstream approaches: the optimization perspective, the approximation perspective, and the signal perspective. The first two focus mainly on mathematical analysis of neural networks’ statistical properties and convergence, while the third tries to address the following questions:

- Why is a nonlinear activation function essential at the filter output of all intermediate layers?
- What is the advantage of the two-layer cascade system over the one-layer system?

**Introducing REctified COrrelations on a Sphere (RECOS)**

It is well known that a Feedforward Neural Network (FNN) is a universal approximator: given a single hidden layer containing a finite number of neurons, it can approximate any continuous function on a compact domain to arbitrary accuracy. What makes an FNN powerful is the nonlinear activation in its neurons. Neural nets can be wide and deep, but without nonlinear activations their complex architectures behave like a single-layer linear model that maps inputs into another output space, because a composition of linear maps is itself a linear map. More specifically, a nonlinear activation function provides a new set of learned representations of the inputs, which is better suited to real-world problems.
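The collapse of a deep linear network can be checked numerically. The following is a minimal sketch in plain Python; the weight matrices `W1` and `W2` and the input `x` are hypothetical values chosen only for illustration, and ReLU is assumed as the activation.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a by matrix b (both given as lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(v):
    """Clip negative entries of a vector to zero."""
    return [max(0.0, x) for x in v]

# Hypothetical weights, not taken from any trained network.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 1.0]]   # layer 1: 2 inputs -> 3 hidden units
W2 = [[1.0, 1.0, -1.0], [0.0, 2.0, 1.0]]      # layer 2: 3 hidden units -> 2 outputs
x = [3.0, 1.0]

deep_linear = matvec(W2, matvec(W1, x))       # two layers, no activation
single_linear = matvec(matmul(W2, W1), x)     # one layer with merged weights W2 W1
deep_relu = matvec(W2, relu(matvec(W1, x)))   # two layers with ReLU in between

print(deep_linear, single_linear, deep_relu)
```

The first two results coincide because the two linear layers merge into the single matrix `W2 W1`, while the rectified network produces a different output that no single merged matrix reproduces for all inputs.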

A CNN is just another type of FNN or MLP (Multi-Layer Perceptron). To analyze the nonlinearity in CNNs, the author proposes a mathematical model of their behavior. A CNN is viewed as a network formed by basic operational units that conduct “REctified COrrelations on a Sphere (RECOS)”, hence the name RECOS model. During training, the kernel weights are first initialized and then adjusted by gradient descent and the back-propagation algorithm. In the RECOS model, the weights are called anchor vectors to reflect their role in clustering the input data. That is, we compute the correlation between an input vector and each anchor vector and use it as a measure of their similarity.
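A single RECOS unit can be sketched as three steps: place the input on the unit sphere, correlate it with each anchor vector, and rectify. The sketch below is an illustration rather than the paper's implementation; the function names are hypothetical, and normalizing the anchor vectors as well is an assumption made here so that each correlation equals a cosine similarity.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (a point on the unit sphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def recos(x, anchors):
    """Rectified correlations of x with each anchor vector.

    Assumption: both x and the anchors are normalized, so each correlation
    is the cosine of the angle between them. Negative values are clipped to 0.
    """
    x = normalize(x)
    return [max(0.0, sum(a_i * x_i for a_i, x_i in zip(normalize(a), x)))
            for a in anchors]

# Hypothetical anchor vectors in a 2-D input space.
anchors = [[1, 0], [0, 1], [-1, 0]]
responses = recos([3, 4], anchors)
print(responses)
```

The third anchor points away from the input, so its negative correlation is clipped to zero, while the first two anchors keep their positive similarities.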

**Why Nonlinear Activation?**

Instead of considering the interactions of all pixels in one step as the MLP does, the CNN decomposes an input image into smaller patches, known as the receptive fields of nodes at a given layer. It gradually enlarges the receptive field to cover a larger portion of the image. A neuron computes the correlation between an input vector and its anchor vector to measure their similarity, and there are K neurons in one RECOS unit. We model the system as **Y = AX**, where **X** is the input vector, **Y** is the output vector, and **A** is the anchor matrix (the weight matrix of the kernel filters, holding one anchor vector per neuron). From this equation, we can see that a CNN maps its input into another space. In the RECOS model, the learned kernel weights tend to map similar objects into the same region. For example, if x_i and x_j are close in Euclidean distance, the corresponding outputs y_i and y_j are close in the new space as well. For a filter that captures the features of a cat, a cat vector **X_cat** is mapped by the learned anchor matrix **A** into **Y_cat**, while other vectors such as **X_dog** or **X_car** should fall outside this region. This is why a CNN can effectively recognize different objects.

But why must we have a nonlinear activation? Consider the two images above: the original cat image (left) and its negative (right). To a human, these two images can be regarded as the same or as different. We can say that they show the same cat, and we can also say that they are negatively correlated, because the black cat is obtained simply by subtracting each pixel value of the white cat from 255. How will the CNN interpret the two cats?

The figure above hints at the need for rectification. x is the input vector, and *a_1*, *a_2* and *a_3* are different learned anchor vectors. In the RECOS model, the linear operation Y = AX measures the similarity between the input and the anchor vectors. For anchor vectors a_1 and a_3, the correlations with x are equal in magnitude but opposite in sign. At this point, the two cats are different to the CNN. But take LeNet-5, which has two convolutional layers, as an example: if the raw input x passes through both layers without rectification, the final result is ambiguous. A system without rectification cannot differentiate the following two cases: a positive response at the first layer followed by a negative filter weight at the second layer, and a negative response at the first layer followed by a positive filter weight at the second layer. By using nonlinear activations, however, the CNN clips negative responses to zero and removes this ambiguity, which leads to a more robust system.
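The sign ambiguity can be reproduced with a toy cascade of one neuron per layer. This is a minimal sketch with hypothetical vectors and weights, assuming ReLU as the rectifier; it is not a model of LeNet-5 itself.

```python
def two_layer(x, anchor, w2, rectify):
    """Correlate x with a first-layer anchor, then scale by a second-layer weight."""
    r1 = sum(a_i * x_i for a_i, x_i in zip(anchor, x))
    if rectify:
        r1 = max(0.0, r1)  # ReLU: discard negative correlations
    return w2 * r1

x = [0.6, 0.8]
a_pos = [0.6, 0.8]     # anchor aligned with x: positive first-layer response
a_neg = [-0.6, -0.8]   # opposite anchor: negative first-layer response

# Case A: positive response, negative second-layer weight.
# Case B: negative response, positive second-layer weight.
case_a_linear = two_layer(x, a_pos, -1.0, rectify=False)
case_b_linear = two_layer(x, a_neg, +1.0, rectify=False)
case_a_relu = two_layer(x, a_pos, -1.0, rectify=True)
case_b_relu = two_layer(x, a_neg, +1.0, rectify=True)

print(case_a_linear, case_b_linear)  # identical without rectification
print(case_a_relu, case_b_relu)      # different with rectification
```

Without rectification both cases yield the same negative output, so later layers cannot tell them apart. With rectification, case B is zeroed at the first layer and the two cases separate.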

Moreover, the author conducted an interesting experiment, and the result is quoted below:

> We trained the LeNet-5 using the MNIST training dataset, and obtained a correct recognition rate of 98.94% for the MNIST test dataset. Then, we applied the same network to gray-scale-reversed test images as shown in Fig. 5. The accuracy drops to 37.36%. Next, we changed all filter weights in C1 to their negative values while keeping the rest of the network the same. The slightly modified LeNet-5 gives a correct recognition rate of 98.94% for the gray-scale-reversed test dataset but 37.36% for the original test dataset.

We can see the symmetric result after negating all filter weights in the first convolutional layer. This result shows that rectification discards negative correlations, and that if we double the number of anchor vectors so that the gray-scale-reversed features are learned as well, we can achieve high recognition performance on both test sets.
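The symmetry in this experiment can be illustrated on a single patch. The sketch below assumes that patches are mean-centered before filtering, which the quoted experiment does not state; under that assumption, gray-scale reversal (255 minus each pixel) simply flips the sign of the centered patch. The 4-pixel patch and the filter are hypothetical.

```python
def center(patch):
    """Subtract the mean so the patch has zero average intensity."""
    mean = sum(patch) / len(patch)
    return [p - mean for p in patch]

def rectified_response(anchor, patch):
    """ReLU of the correlation between an anchor vector and a centered patch."""
    return max(0.0, sum(a * p for a, p in zip(anchor, center(patch))))

patch = [0, 0, 255, 255]                     # hypothetical dark-to-bright edge
reversed_patch = [255 - p for p in patch]    # gray-scale reversal

anchor = [-1.0, -1.0, 1.0, 1.0]              # hypothetical first-layer filter
negated = [-a for a in anchor]               # same filter with negated weights

# Responses to (original, reversed) for each filter.
original_filter = (rectified_response(anchor, patch),
                   rectified_response(anchor, reversed_patch))
negated_filter = (rectified_response(negated, patch),
                  rectified_response(negated, reversed_patch))

# Doubling the anchor set: keep both the filter and its negative.
doubled = [[rectified_response(a, p) for a in (anchor, negated)]
           for p in (patch, reversed_patch)]

print(original_filter, negated_filter, doubled)
```

The original filter fires only on the original patch, the negated filter fires only on the reversed patch, and the doubled anchor set gives a strong response in one of its two channels for either input.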

**Advantages of Cascaded Layers?**

Generally speaking, as a CNN goes deeper, each kernel builds its own abstract features based on the outputs of all previous kernels, so deep layers can capture global semantics and high-level features. Seen as a RECOS model, the CNN applies a sequence of rectified transforms, each equivalent to a similarity measurement, to cluster similar input data layer by layer. The output layer predicts the likelihood of all possible decisions (e.g., object classes). The training samples provide a relationship between an image and its decision label, and they guide the CNN to form more suitable anchor vectors (and thus better clusters) and to connect the clustered data with decision labels.

The figure above shows the effectiveness of a deep network. The details of the experiment are quoted below:

> We use an example to illustrate this point. First, we modify the MNIST training and testing datasets by adding ten different background scenes randomly to the original handwritten digits in the MNIST dataset. For the three rows, we show three input digital images in the leftmost column, the six spectral output images from the convolutional layer and the ReLU layer in the middle column and the 16 spectral output images in the right two columns. It is difficult to find a good anchor matrix of the first layer due to background diversity. However, background scenes in these images are not consistent in the spatial domain while foreground digits are.

Despite the different distorted backgrounds, the CNN successfully captures the representative patterns. Notice that there is a lot of redundant and unwanted information at the first layer; by applying another round of feature extraction, the CNN learns global patterns rather than local details. That is, for an input vector x, the RECOS transform generates a set of K non-negative correlation values as an output vector of dimension K. This representation enables repeated clustering layer by layer. Finally, the labels guide the CNN to find the same pattern across different settings.

From the analysis above, we can see that the convolutional layer is a useful model for automatic feature selection. Without any human effort, it measures similarities and clusters the input data into different regions. But what is the role of the fully connected layers?

It is typical to decompose a CNN into two sub-networks: the feature extraction (FE) subnet and the decision making (DM) subnet. The FE subnet consists of multiple convolutional layers, while the DM subnet is composed of a couple of fully connected layers. Roughly speaking, the FE subnet conducts clustering aiming at a new representation through a sequence of RECOS transforms. The DM subnet links data representations to decision labels, which is similar to the classification role of MLPs.
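The FE/DM split can be sketched as a cascade of RECOS layers followed by a decision layer. This is a simplified illustration under several assumptions: the layers are dense rather than convolutional, the DM subnet is reduced to one linear layer plus softmax, and every weight is hand-picked rather than learned. All function names are hypothetical.

```python
import math

def recos_layer(x, anchors):
    """Normalize x to unit length, correlate with each anchor, clip negatives."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0  # avoid dividing by zero
    return [max(0.0, sum(a * v / norm for a, v in zip(anchor, x)))
            for anchor in anchors]

def softmax(scores):
    """Turn decision scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cnn_forward(x, fe_layers, dm_weights):
    """FE subnet: cascade of RECOS layers. DM subnet: linear layer plus softmax."""
    for anchors in fe_layers:
        x = recos_layer(x, anchors)
    scores = [sum(w * v for w, v in zip(row, x)) for row in dm_weights]
    return softmax(scores)

# Hypothetical, hand-picked weights (not learned).
fe_layers = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]],  # 3 -> 4
    [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]],                            # 4 -> 2
]
dm_weights = [[4.0, 0.0], [0.0, 4.0]]  # 2 features -> 2 class scores

probs = cnn_forward([2.0, 1.0, 0.0], fe_layers, dm_weights)
print(probs)
```

Each FE layer outputs non-negative correlations that become the input of the next layer, and the DM layer only has to map the final clustered representation to class probabilities.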

So far, we can conclude that CNNs have a clear advantage over classic machine learning algorithms in computer vision: a CNN can both extract features automatically and learn to classify inputs based on those features, while classic algorithms such as random forests (RF) and support vector machines (SVM) rely heavily on feature engineering, which is often hard to perform.

**Conclusion**

In summary, the RECOS model provides a signal analysis perspective on Convolutional Neural Networks. From this perspective, we can see why activation and deep architectures are effective. However, more effort needs to be devoted to the following areas: network architecture design, weakly supervised learning, robustness to wrong labels, dataset bias, and overfitting.

**Author:** *Arac* | **Editor:** *Junpei Zhong* | **Localized by Synced Global Team:** *Xiang Chen*
