Introduced by Google in 2017, Federated Learning (FL) “enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on the device, decoupling the ability to do machine learning from the need to store the data in the cloud.” This encrypted, distributed machine learning (ML) technology lets multiple parties build models jointly without revealing their underlying data. Privately owned data stays local, and only encrypted parameters are exchanged with outside parties, which can improve the performance of ML models without any exchange of data. FL thus addresses the fundamental problems of “data silos” and “data privacy” in large-scale industrial use of AI. Initially, FL was used mainly for privacy protection and joint modeling across the mobile devices of individual users.
Two years have passed, and several new research papers have proposed novel systems to boost FL performance. This March, for example, a team of researchers from Google proposed a scalable production system for FL that can handle growing workloads through the addition of resources such as compute, storage, and bandwidth. Synced also noted that, under the leadership of renowned AI scientist Professor Qiang Yang, the WeBank AI group proposed a generalized solution for large-scale enterprise-level AI collaboration that breaks down domain and algorithmic limitations. The team also proposed Federated Transfer Learning (FTL), which brings transfer learning into the method to further improve data utilization and model performance.
WeBank is also popularizing the method by releasing open-source tools and applications, setting standards, and hosting international academic seminars. In 2018, WeBank initiated the Federated Learning International Standard (IEEE P3652.1) project. As of November 2019, the Standards Working Group has held four meetings and is expected to draft relevant benchmarks in early 2020. To lower the barrier to using federated learning and attract contributors, WeBank launched Federated AI Technology Enabler (FATE), the world’s first industrial open-source FL framework, in February 2019. FATE gives companies a ready-to-use FL framework for working together. Companies including Tencent, Huawei, Alibaba, JD.com, and Intel have all joined the FL ecosystem.
Earlier this month, NeurIPS 2019 in Vancouver hosted the workshop Federated Learning for Data Privacy and Confidentiality, where academic researchers and industry practitioners discussed recent and innovative work in FL, open problems, and relevant approaches. WeBank co-organized the FL workshop with Google, CMU, and NTU, and 400 scholars joined the discussion.
The workshop organizing committee announced the Distinguished Paper awards and other awards.
- Distinguished Paper
  - Private Federated Learning with Domain Adaptation (Oracle Labs)
  - FedMD: Heterogeneous Federated Learning via Model Distillation (Harvard University & Yale University)
- Distinguished Student Paper
  - MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling (Carnegie Mellon University & Bosch Center for Artificial Intelligence)
  - Think Locally, Act Globally: Federated Learning with Local and Global Representations (Carnegie Mellon University & University of Tokyo)
Professor Dr. Max Welling is the research chair in Machine Learning at the University of Amsterdam and VP Technologies at Qualcomm. Welling is known for his research in Bayesian inference, generative modeling, deep learning, variational autoencoders, and graph convolutional networks.
Below are excerpts from the workshop talk Dr. Welling gave on Ingredients for Bayesian, Privacy Preserving, Distributed Learning, where the professor shares his views on FL, the importance of distributed learning, and the Bayesian aspects of the domain.
Why do we need distributed learning in the first place?
“The question can be separated into two parts. Why do we need distributed or federated inferencing? Maybe that is easier to answer. We need it because of reliability. If you are in a self-driving car, you clearly don’t want to rely on a bad connection to the cloud in order to figure out whether you should brake. Latency: if you have your virtual reality glasses on and you have just a little bit of latency, you’re not going to have a very good user experience. And then there’s, of course, privacy: you don’t want your data to get off your device. Also compute, maybe, because it’s close to where you are, and personalization: you want models to be suited for you.
Why is distributed learning so important?
It took a little more thinking to work out why distributed learning is so important, especially within a company: how are you going to sell something like that? Privacy is the biggest factor here. There are many companies and factories that simply don’t want their data to go off site; they don’t want it to go to the cloud. And so you want to do your training in-house. But there’s also bandwidth. You know, moving data around is actually very expensive, and there’s a lot of it. So it’s much better to keep the data where it is and move the computation to the data. And also, personalization plays a role.
There are many challenges when you want to do this. The data could be extremely heterogeneous, so you could have a completely different distribution on one device than you have on another device. Also, the data sizes could be very different. One device could contain 10 times more data than another device. And the compute could be heterogeneous: you could have small devices with a little bit of compute that now and then you can’t use because the battery’s down. There are other, bigger servers that you also want to have in your distribution of compute devices.
The bandwidth is limited, so you don’t want to send huge amounts of even parameters. Let’s say we don’t move data, but we move parameters. Even then you don’t want to move loads and loads of parameters over the channel. So you maybe want to quantize them. For this, I believe Bayesian thinking is going to be very helpful. And again, the data needs to be private, so you wouldn’t want to send parameters that contain a lot of information about the data.
What is the solution?
So first of all, of course, we’re going to move model parameters, we’re not going to move data. We have data stored at places and we’re going to move the algorithm to that data. So basically you get your learning update, maybe privatized, and then you move it back to your central place where you’re going to update it. And of course, bandwidth is another challenge that you have to solve.
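The loop described here can be sketched in a few lines of Python. This is a minimal illustration added for clarity, not code from the talk: the one-parameter model `y = w * x`, the function names `local_update` and `server_round`, the learning rate, and the toy data are all assumptions, and the server-side step uses a simple size-weighted mean of the client updates.

```python
from typing import List, Tuple

# Each client's (x, y) pairs stay on the client; only parameters cross the network.
Dataset = List[Tuple[float, float]]


def local_update(w: float, data: Dataset, lr: float = 0.1, steps: int = 5) -> float:
    """Run a few gradient steps for the model y = w * x on local data.

    Returns the parameter change (delta), never the data itself.
    """
    w_local = w
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * (w_local * x - y) * x for x, y in data) / len(data)
        w_local -= lr * grad
    return w_local - w


def server_round(w: float, clients: List[Dataset]) -> float:
    """Broadcast w, collect one delta per client, apply their size-weighted mean.

    In this simulation the server calls local_update directly; in a real
    deployment that call would run on the device that holds the data.
    """
    total = sum(len(data) for data in clients)
    delta = sum(len(data) * local_update(w, data) for data in clients) / total
    return w + delta


# Two hypothetical devices with different amounts of data.
clients = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(0.5, 1.0), (1.5, 2.9), (3.0, 6.2)],
]

w = 0.0
for _ in range(20):
    w = server_round(w, clients)
print(f"shared parameter after 20 rounds: {w:.3f}")
```

Only `w` and the returned deltas move between the server and the clients; a privatization step, if used, would be applied to the delta before it is sent back.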
We have these heterogeneous data sources and we have a lot of variability in the speed at which we can sync these updates. Here I think the Bayesian paradigm is going to come in handy because, for instance, if you have been running an update on a very large dataset, you can shrink the posterior over your parameters to a very narrow one. Whereas on another device, you might have much less data, and you might have a very wide posterior distribution for those parameters. Now, how to combine that? You shouldn’t average them; it’s silly. You should do a proper posterior update where the one that has a narrow, peaked posterior has a lot more weight than the one with a very wide posterior. Also, uncertainty estimates are important in that respect.
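One simple way to see why a plain average is the wrong tool is to assume that each device reports a Gaussian posterior over a single parameter and that the prior is flat. Under those assumptions (made here for illustration only, they are not stated in the talk), multiplying the posteriors gives a precision-weighted mean. The function name `combine_gaussian_posteriors` and the numbers are hypothetical.

```python
from typing import List, Tuple


def combine_gaussian_posteriors(posteriors: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Combine per-device Gaussian posteriors given as (mean, variance) pairs.

    Assumes independent data on each device and a flat prior, so the combined
    posterior is the product of the Gaussians: precisions (1 / variance) add,
    and the mean is the precision-weighted average of the device means.
    """
    precisions = [1.0 / var for _, var in posteriors]
    total_precision = sum(precisions)
    mean = sum(p * m for p, (m, _) in zip(precisions, posteriors)) / total_precision
    return mean, 1.0 / total_precision


# Device A saw a lot of data (narrow posterior); device B saw very little (wide posterior).
device_a = (1.0, 0.01)
device_b = (3.0, 1.0)

combined_mean, combined_var = combine_gaussian_posteriors([device_a, device_b])
plain_average = (device_a[0] + device_b[0]) / 2.0

print(f"plain average:      {plain_average:.3f}")
print(f"precision-weighted: {combined_mean:.3f} (variance {combined_var:.4f})")
```

The combined estimate stays close to the confident device, while the plain average lands halfway between the two and ignores how much evidence each device had.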
The other thing is that with a Bayesian update, if you have a very wide posterior distribution, then you know that parameter is not going to be very important for making predictions. And so if you’re going to send that parameter over a channel, you will have to quantize it, especially to save bandwidth. The ones that are very uncertain anyway you can quantize at a very coarse level, and the ones which have a very peaked posterior need to be encoded very precisely, and so you need much higher resolution for those. So there too, the Bayesian paradigm is going to be helpful.
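A toy version of this idea: pick the quantization step for each parameter from its posterior standard deviation, so that uncertain parameters get a coarse grid and few bits. The rule "step equals one standard deviation", the fixed value range of 8.0, and the function name `quantize_by_uncertainty` are assumptions chosen for illustration, not the scheme used in any particular paper.

```python
import math
from typing import Tuple


def quantize_by_uncertainty(mean: float, std: float, value_range: float = 8.0) -> Tuple[float, int]:
    """Quantize a posterior mean on a grid whose step equals the posterior std.

    Returns the quantized value and the number of bits needed to index a grid
    of that step size across `value_range` (an assumed bound on the parameter).
    """
    step = std
    quantized = round(mean / step) * step
    bits = max(1, math.ceil(math.log2(value_range / step)))
    return quantized, bits


# Same posterior mean, very different uncertainty.
confident_value, confident_bits = quantize_by_uncertainty(mean=0.7312, std=0.01)
uncertain_value, uncertain_bits = quantize_by_uncertainty(mean=0.7312, std=1.0)

print(f"peaked posterior: send {confident_value:.2f} using {confident_bits} bits")
print(f"wide posterior:   send {uncertain_value:.2f} using {uncertain_bits} bits")
```

In this sketch the quantization error is never more than half a standard deviation, so the rounding stays small relative to what the posterior already says it does not know.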
In terms of privacy, there is this interesting result that if you have an uncertain parameter and you draw a sample from that parameter’s posterior, then that single sample is more private than providing the whole distribution. There are results that show that you can get a certain level of differential privacy by just drawing a single sample from that posterior distribution. So effectively you’re adding noise to your parameter, making it more private. Again, Bayesian thinking is synergistic with this sort of Bayesian federated learning scenario.
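The mechanics of sending one sample instead of the whole distribution are simple to sketch. The snippet below assumes a Gaussian posterior over a single parameter and uses a hypothetical helper, `sample_for_upload`. It only shows the sampling step: whether a given sample meets a formal differential privacy guarantee depends on conditions on the model and data that are not checked here.

```python
import random


def sample_for_upload(mean: float, std: float, rng: random.Random) -> float:
    """Return one draw from a Gaussian posterior instead of its (mean, std).

    The draw equals the posterior mean plus noise whose scale is the
    posterior standard deviation, so wider posteriors are masked more.
    """
    return rng.gauss(mean, std)


rng = random.Random(0)  # seeded only to make the illustration repeatable

# A device with little data (wide posterior) uploads a noisier value than
# a device with a lot of data (narrow posterior).
wide_upload = sample_for_upload(mean=0.5, std=0.2, rng=rng)
narrow_upload = sample_for_upload(mean=0.5, std=0.001, rng=rng)

print(f"upload from wide posterior:   {wide_upload:.4f}")
print(f"upload from narrow posterior: {narrow_upload:.4f}")
```

The server only ever sees the uploaded number, not the mean and standard deviation that produced it.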
What are the key takeaways?
We can do MCMC (Markov chain Monte Carlo) and variational-based distributed learning. And there are advantages to doing that, because it makes the updates more principled and you can combine things where one of them might be based on a lot more data than another.
Then we have private and Bayesian: privatizing the updates of a variational Bayesian model. Many people have worked on many of the other intersections, so we have deep learning models which have been privatized, and we have quantization, which is important if you want to send your parameters over a noisy channel. And it’s nice because the more you quantize, the more private things become. You can compute the level of quantization from your Bayesian posterior, so all these things are very nicely tied together.
People have looked at the relation between quantized models and Bayesian models: how can you use Bayesian estimates to quantize better? People have looked at quantized versus deep: to make your deep neural network run faster on a mobile phone, you want to quantize it. People have looked at distributed versus deep, that is, distributed deep learning. So many of these intersections have actually been researched, but they haven’t been put together. This is what I want to call for. We can try to put these things together, and at the core of all of this is Bayesian thinking; we can use it to execute better on this program.”
Journalist: Fangyu Cai | Editor: Michael Sarazen