Bayes’ theorem is one of the most important formulae in mathematical statistics and probability. It is used to calculate the probability of an event based on relevant existing information. Bayesian inference, in turn, uses Bayes’ theorem to update the probability of a hypothesis as additional data becomes available. How can Bayesian inference benefit deep learning models? New York University Assistant Professor Andrew Gordon Wilson addresses this question in his recent paper The Case for Bayesian Deep Learning.
Paper Abstract: The key distinguishing property of a Bayesian approach is marginalization instead of optimization, not the prior, or Bayes rule. Bayesian inference is especially compelling for deep neural networks. (1) Neural networks are typically underspecified by the data, and can represent many different but high performing models corresponding to different settings of parameters, which is exactly when marginalization will make the biggest difference for both calibration and accuracy. (2) Deep ensembles have been mistaken as competing approaches to Bayesian methods, but can be seen as approximate Bayesian marginalization. (3) The structure of neural networks gives rise to a structured prior in function space, which reflects the inductive biases of neural networks that help them generalize. (4) The observed correlation between parameters in flat regions of the loss and a diversity of solutions that provide good generalization is further conducive to Bayesian marginalization, as flat regions occupy a large volume in a high dimensional space, and each different solution will make a good contribution to a Bayesian model average. (5) Recent practical advances for Bayesian deep learning provide improvements in accuracy and calibration compared to standard training, while retaining scalability. (arXiv)
Synced invited Dr. Hao Wang, a Postdoctoral Associate at the MIT Computer Science & Artificial Intelligence Lab (CSAIL) who works on statistical machine learning and deep learning, to share his thoughts on the paper The Case for Bayesian Deep Learning.
What are Bayesian neural networks (BNN) and Bayesian deep learning (BDL)?
As clearly defined in this report by Andrew Wilson, Bayesian neural networks (BNN) usually refer to a Bayesian treatment of neural networks. Specifically, rather than committing to a single network, the goal is to consider many networks p(y|x,w), where x, y, and w are the input, the output, and the network parameters, respectively. Each parameter configuration w has a posterior probability p(w|D) given the training data D, which indicates the importance of that configuration. A BNN then makes predictions by marginalization: p(y|x,D) = ∫ p(y|x,w) p(w|D) dw. This can be seen as a ‘clever’ approach to ensembling, with p(w|D) as the weights. In this report, Bayesian deep learning essentially refers to Bayesian neural networks.
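To make the marginalization formula concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used in the paper: the "network" is a hypothetical one-parameter classifier p(y=1|x,w) = sigmoid(w·x), the prior on w is a standard Gaussian, the data set is made up, and the integral over w is approximated by a sum over a uniform grid (feasible only because w is one-dimensional).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(w, data):
    """p(D|w) for a toy one-parameter classifier p(y=1|x,w) = sigmoid(w*x)."""
    prob = 1.0
    for x, y in data:
        p1 = sigmoid(w * x)
        prob *= p1 if y == 1 else 1.0 - p1
    return prob

def grid_posterior(data, grid, prior_std=1.0):
    """Normalized posterior weights p(w|D) on a uniform grid (Gaussian prior on w)."""
    unnorm = [likelihood(w, data) * math.exp(-0.5 * (w / prior_std) ** 2)
              for w in grid]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def bma_predict(x, grid, post):
    """Approximate p(y=1|x,D) = integral of p(y=1|x,w) p(w|D) dw by a grid sum."""
    return sum(sigmoid(w * x) * pw for w, pw in zip(grid, post))

# Hypothetical toy data set D of (x, y) pairs.
data = [(1.0, 1), (2.0, 1), (-1.0, 0), (0.5, 0)]
grid = [-5.0 + 0.01 * i for i in range(1001)]  # w in [-5, 5]
post = grid_posterior(data, grid)

# Optimization-style alternative: plug in the single most probable w (MAP).
w_map = grid[max(range(len(grid)), key=lambda i: post[i])]

x_new = 3.0
p_bma = bma_predict(x_new, grid, post)   # marginalizes over all w
p_map = sigmoid(w_map * x_new)           # uses one w only
print(f"Bayesian model average: {p_bma:.3f}, MAP plug-in: {p_map:.3f}")
```

The two printed numbers differ because the plug-in prediction ignores every parameter setting except the most probable one, while the model average lets all plausible settings vote in proportion to p(w|D). For real networks the integral has no such closed grid form and must be approximated by other means.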
It is worth noting that Bayesian deep learning (BDL) in a broader sense also includes methods that unify probabilistic graphical models and deep neural networks to achieve better reasoning performance.
What is the history and development of BNN and BDL?
The study of BNN dates back to the 1990s, with notable works from Hinton and van Camp, Denker and LeCun, Radford Neal, and David MacKay. Over the years, a large body of work has emerged to enable substantially better scalability and to incorporate recent advances in deep neural networks, including our paper on exploring different distributions in BNN and a series of interesting works on marginalization and ensembling in BNN from Andrew’s group.
What are the key points in this research?
Andrew’s report clarifies some important issues around Bayesian neural networks and shares a lot of valuable insights. Besides the main point on marginalization as the key property of Bayesian neural networks, one other interesting and insightful point in my opinion is the connection between deep ensembles and BNN.
At a higher level, they are both trying to train a set of neural networks and produce final predictions using some form of model averaging. The differences are (1) deep ensembles train these networks separately with different initializations, while BNN directly trains a distribution of networks under Bayesian principles; (2) deep ensembles directly average the predictions from different networks, while BNN computes a weighted average using the posterior of each network as weights. The implication behind this point is that BNN actually subsumes deep ensembles in a sense, since the latter is an approximate Bayesian model average. Therefore, the success of deep ensembles actually brings both encouragement and additional insights to BNN.
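The difference in how the two approaches average predictions can be sketched in a few lines of Python. This is a simplified illustration, not an implementation from the paper: the three sets of class probabilities and the log posterior scores are made-up numbers, and a real BNN posterior is a distribution over continuous weights rather than over three fixed networks.

```python
import math

def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def model_average(member_probs, weights):
    """Weighted average of per-network class probabilities."""
    n_classes = len(member_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, member_probs))
            for c in range(n_classes)]

# Hypothetical softmax outputs of three trained networks for one input.
member_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
]

# Deep ensemble: every network gets the same weight.
uniform_weights = normalize([1.0] * len(member_probs))
ensemble_pred = model_average(member_probs, uniform_weights)

# Posterior-weighted average: weights proportional to p(D|w) p(w).
# These log scores are hypothetical; subtracting the max avoids underflow.
log_post = [-10.2, -10.9, -14.0]
max_lp = max(log_post)
posterior_weights = normalize([math.exp(lp - max_lp) for lp in log_post])
bma_pred = model_average(member_probs, posterior_weights)

print("ensemble:", [round(p, 3) for p in ensemble_pred])
print("posterior-weighted:", [round(p, 3) for p in bma_pred])
```

With equal weights, the third network pulls the average toward the second class as strongly as the others pull away from it. With posterior weights, its low posterior score shrinks its influence. Setting all log scores equal recovers the deep ensemble average, which is one way to read the claim that deep ensembles are a special, approximate case of Bayesian model averaging.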
Can you predict any potential future developments in BNN or BDL in general?
In the past, the main obstacles to the wide adoption of BNN and BDL included computational efficiency and community support (e.g., publicly available packages). Exciting recent developments have gone a long way toward clearing these obstacles, e.g., a myriad of works that speed up computation, and packages such as Edward that are specifically designed for probabilistic modelling and inference.
In the future, we can expect significant progress in BNN for learning with limited data, ensemble learning, and model compression/pruning. In a broader sense, there will also be much more work based on the philosophy of BDL (i.e., harnessing the reasoning ability of probabilistic graphical models for deep learning) in various areas such as computer vision, natural language processing, health care, and data mining.
The paper The Case for Bayesian Deep Learning is on arXiv.
Dr. Hao Wang is currently a Postdoctoral Associate at the Computer Science & Artificial Intelligence Lab (CSAIL) of MIT. He received his PhD degree from the Hong Kong University of Science and Technology, as the sole recipient of the School of Engineering PhD Research Excellence Award in 2017. He has been a visiting researcher in the Machine Learning Department of Carnegie Mellon University. His research focuses on statistical machine learning, deep learning, and data mining, with broad applications in health care, recommender systems, computer vision, social network analysis, and text mining. He has published in top venues including NIPS, ICML, ICLR, KDD, CVPR, AAAI, and IJCAI. His work on Bayesian deep learning and its application to personalized modelling has been well received and was the most cited paper at KDD 2015. In 2015, he was awarded the Microsoft Fellowship in Asia and the Baidu Research Fellowship for his innovation in Bayesian deep learning and its applications to data mining.