Deep learning model performance has taken huge strides, allowing researchers to tackle tasks that were simply not possible for machines less than a decade ago. Nevertheless, the theoretical framework supporting these improvements hasn’t advanced as much as the models’ empirical performance, and pesky questions remain, particularly: What exactly happens inside a deep neural network during training? In the paper Opening the Black Box of Deep Neural Networks via Information, Schwartz-Ziv and Tishby leverage Information Theory to explore Deep Neural Network training.
Synced invited Joaquin Alori, a Machine Learning Research Engineer at Tryolabs with a focus on object tracking, pose estimation, and person re-id problems, to share his thoughts on this paper.
How would you describe this paper?
In Opening the Black Box of Deep Neural Networks via Information, Schwartz-Ziv and Tishby provide insights into the process of Deep Neural Network training by looking at it through the lens of Information Theory.
For their analysis they take small, fully connected neural networks and consider each whole layer as a single random variable. They then calculate the mutual information of each layer with respect to the input data to the network, and with respect to the label data the network is fitting. They plot these two numbers in a 2D diagram they call the information plane.
The colors correspond to the layer each point belongs to in the first plot, and to the epoch each point belongs to in the second. From these plots they draw some significant insights.
First, there are two main distinct phases a neural network goes through during supervised training: an initial phase called Empirical Error Minimization and a subsequent phase called Representation Compression.
During Empirical Error Minimization, each layer increases its mutual information with respect to both the inputs and the labels. This seems quite intuitive, and the authors don’t spend much time analyzing this phase. Once this phase is over, however, the network enters a new, much longer phase called Representation Compression, in which the layers continue to increase their mutual information with respect to the labels but start decreasing their mutual information with respect to the inputs. This is quite astonishing: it shows not only that it is important for layers to ignore unimportant information encoded in the inputs they receive, but also that this compression of irrelevant data occurs later in training and can be clearly seen by drawing simple plots.
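To make the two coordinates of the information plane more concrete, here is a minimal sketch of how they could be estimated for one layer. It assumes bounded activations (for example tanh outputs in [-1, 1]) that are discretized into equal-width bins so that the whole layer becomes a single discrete symbol per sample. The function names, the default bin count, and the use of a sample ID to stand in for the input are illustrative assumptions, not the authors' code.

```python
import numpy as np


def discrete_mutual_information(a, b):
    """Estimate I(A;B) in bits from two equal-length sequences of hashable symbols."""
    n = len(a)
    joint, count_a, count_b = {}, {}, {}
    for x, y in zip(a, b):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        count_a[x] = count_a.get(x, 0) + 1
        count_b[y] = count_b.get(y, 0) + 1
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * np.log2(c * n / (count_a[x] * count_b[y]))
    return float(mi)


def layer_symbols(activations, n_bins=30, low=-1.0, high=1.0):
    """Bin every unit's activation, then treat each row (one sample) as one symbol."""
    edges = np.linspace(low, high, n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])  # interior edges only
    return [tuple(row) for row in binned]


def information_plane_point(sample_ids, labels, activations, n_bins=30):
    """Return the pair (I(X;T), I(T;Y)) for a single layer T."""
    t = layer_symbols(activations, n_bins=n_bins)
    return (discrete_mutual_information(sample_ids, t),
            discrete_mutual_information(t, labels))


# Toy example (hypothetical data): four distinct inputs, one hidden unit.
sample_ids = [0, 1, 2, 3]
labels = [0, 0, 1, 1]
activations = np.array([[-0.9], [-0.8], [0.8], [0.9]])
i_xt, i_ty = information_plane_point(sample_ids, labels, activations, n_bins=2)
```

Computing this pair for every layer at every epoch gives the points that are drawn on the information plane. Note that binning-based estimates like this one are only practical for small networks and depend on the chosen bin count.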
Second, to gain more insight into the two training phases, the authors plot the normalized mean and standard deviation of the network’s gradients for every layer as a function of the training epochs:
Again, there are two clearly demarcated phases: an initial phase in which the gradient means are much larger than their standard deviations, indicating small gradient stochasticity, and a subsequent phase in which the gradient means are very small compared to their batch-to-batch fluctuations, with the gradients behaving like Gaussian noise with very small means. They call the initial phase the Drift Phase and the second phase the Diffusion Phase. Interestingly, the transition between these two phases corresponds to the transition between the Empirical Error Minimization and Representation Compression phases mentioned previously. The authors claim that the noise introduced in the second phase leads to the more compressed representations of the input data that we see in each layer during the Representation Compression phase.
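The drift/diffusion distinction can be illustrated with a small sketch that compares the size of the mean gradient to the size of its batch-to-batch fluctuations for one layer within one epoch. The exact normalization used in the paper is not reproduced here; the function names and the threshold of 1.0 are hypothetical choices made only for illustration.

```python
import numpy as np


def gradient_signal_and_noise(batch_grads):
    """Return (norm of the mean gradient, norm of the per-weight std) across batches.

    batch_grads: array-like of shape (n_batches, ...), one gradient per mini-batch.
    """
    g = np.asarray(batch_grads, dtype=float)
    g = g.reshape(g.shape[0], -1)  # flatten each batch's gradient into a vector
    mean_norm = float(np.linalg.norm(g.mean(axis=0)))
    std_norm = float(np.linalg.norm(g.std(axis=0)))
    return mean_norm, std_norm


def training_phase(batch_grads, threshold=1.0):
    """Label an epoch 'drift' if the mean dominates the fluctuations, else 'diffusion'."""
    mean_norm, std_norm = gradient_signal_and_noise(batch_grads)
    if std_norm == 0.0:
        return "drift"  # no fluctuation at all: every batch agrees on the gradient
    return "drift" if mean_norm / std_norm > threshold else "diffusion"


# Hypothetical gradients for a layer with two weights, three mini-batches each.
early_epoch = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])      # consistent direction
late_epoch = np.array([[1.0, -1.0], [-1.0, 1.0], [0.01, 0.01]])   # mostly noise
early_phase = training_phase(early_epoch)
late_phase = training_phase(late_epoch)
```

Tracking this ratio per layer over the training epochs would give a curve analogous to the one described above, with the crossover marking the move from the Drift Phase to the Diffusion Phase.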
What impact might this research bring to the research community?
This new way of thinking about neural network training can be used to jumpstart several new areas of research. Just as a side note, in this paper the authors deduce that:
- Adding hidden layers dramatically reduces the number of training epochs needed for good generalization, or in other words, the Representation Compression phase becomes much shorter.
- The compression phase of each layer is shorter when it starts from a previous compressed layer.
- The compression occurs faster in the deeper layers.
Can you identify any bottlenecks in the research?
The authors tested their results on two very particular neural network architectures. It is still unknown whether they generalize to other architectures such as convnets, recurrent networks, or even non-DNN models, though it seems likely. Also, the authors indirectly verified their findings on the MNIST dataset. It remains to be confirmed whether they generalize to larger datasets such as ImageNet, though again, this seems very likely.
Can you predict any potential future developments related to this research?
The most important open question for this area of research is what the practical implications of the findings are. The authors explain that they are currently working on new algorithms that incorporate their findings. They argue that SGD seems like overkill during the diffusion phase, which consumes most of the training epochs, and that much simpler optimization algorithms may be more efficient.
The paper Opening the Black Box of Deep Neural Networks via Information is on arXiv.
About Joaquin Alori
Joaquin Alori is a Machine Learning Research Engineer at Tryolabs, where he currently works on object tracking, pose estimation, and person re-id problems. Tryolabs is a machine learning consulting firm that companies partner with to create data-powered solutions that produce results. Counting 150 customers served over ten years, Tryolabs is experienced in consulting, development and deployment of custom machine learning systems, using techniques in the fields of computer vision, natural language processing and predictive analytics. As part of the international AI community, Tryolabs hosts machine learning talks and workshops at conferences around the globe and shares its experience on the Tryolabs Blog.