Ep59 – WTM: Federated Learning
[00:00:00] Announcer: Welcome to What That Means, companion episodes to the Cybersecurity Inside podcast. In this series, Camille asks top technical experts to explain, in plain English, commonly used terms in their field, then dives deeper, giving you insights into the hottest topics and arguments they face. Get the definition directly from those who are defining it. Now here is Camille Morhardt.
[00:00:37] Camille Morhardt: Hi, and welcome to this episode of Cybersecurity Inside. Today, we’re going to do What That Means: Federated Learning. And I’m really interested in this topic. I have with me today Olga Perepelkina, who’s a Product Manager at Intel for Federated Learning, and she’s doing some really amazing work. She’s joining from Moscow, actually, right now in Russia. So interesting, uh, cross globe. And she has a PhD from Moscow State University in Clinical Psychology and she also has a Master’s in Computer Science.
We’re going to get the definition from her and then dig a little deeper, specifically into the open platform that she’s worked on, as well as looking at the application in healthcare that she helped to drive.
Welcome to the show Olga.
[00:01:23] Olga Perepelkina: Thank you, Camille for this introduction. Nice to meet you.
[00:01:27] Camille Morhardt: It is really nice to meet you. Before we even get into the definition of federated learning, can you start by just telling us why it matters? What’s kind of the human side? It’s pretty new, so 10 years from now, when we’ve worked through all the kinks we’re about to talk about, how do you envision this changing the world?
[00:01:48] Olga Perepelkina: Oh, this is a very good question. Uh, so, I think we can dramatically improve AI in healthcare. Right now we have only about 10 or 20 FDA-cleared models, not so many, right? Because it’s still very complicated and very expensive and not so many companies can do it. But when they can use federated learning, we can dramatically change the situation. Because in general, we have a lot of data in the world, right? But at the same time, we have a lot of concerns about using this data: not only GDPR or HIPAA compliance stuff, but also, for example, in, um, areas like face recognition or emotion recognition, people, uh, don’t want to share the information because they are afraid of doing that.
Or for example, such scandals and conflicts at Facebook, for example, when, uh, people became aware that their personal information is being used for ad technologies, right? Uh, they don’t want to do it, but as researchers, we want to use these models, right, to improve products, to build new technologies, completely new technologies. But at the same time, we need to protect private information of people, and federated learning can help to do that, uh, to provide access to data and to protect privacy of people. And I think this is very clear why we should do it.
[00:03:23] Camille Morhardt: Really, really interesting. Do you think you can give us just a couple minute definition of what federated learning is? Why it exists?
[00:03:32] Olga Perepelkina: So in general, what is federated learning? We don’t need to collect all the data centrally. We can train locally and send only model updates, only the weights of the model, for example, to one server that is called the aggregation server. So in federated learning, we keep data private on the local devices where it was born. And we only send updates of the model to one server and aggregate this model and then send this aggregated model back to the local devices.
For example, collaborators that participate in this federation, they will get a more advanced model because this model saw different examples, but we didn’t collect this data in a central server, right? So we do not go beyond, like, some privacy. We do not. Um, how to say, I’m sorry…
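The round described here (local training, sending only weights to the aggregation server, averaging, and sending the result back) can be sketched in plain Python. This is a minimal illustration and not OpenFL code: the one-weight model y = w * x, the function names `local_update`, `aggregate`, and `federated_round`, and the toy data are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical toy setup: each site holds private (x, y) pairs that never leave it.
Dataset = List[Tuple[float, float]]


def local_update(weights: List[float], data: Dataset,
                 lr: float = 0.01, epochs: int = 5) -> List[float]:
    """Train the one-weight model y = w * x on a single site's own data."""
    w = weights[0]
    for _ in range(epochs):
        for x, y in data:
            error = w * x - y
            w -= lr * error * x  # gradient step for a squared-error loss
    return [w]


def aggregate(updates: List[List[float]], sizes: List[int]) -> List[float]:
    """Aggregation server: average the weights, weighted by local dataset size."""
    total = sum(sizes)
    return [
        sum(update[i] * n for update, n in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    ]


def federated_round(global_weights: List[float], sites: List[Dataset]) -> List[float]:
    """One round: every site trains locally; only weights travel to the server."""
    updates = [local_update(global_weights, data) for data in sites]
    return aggregate(updates, [len(data) for data in sites])


# Two collaborators whose data both follow y = 2x (toy values).
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]
weights = [0.0]
for _ in range(20):
    weights = federated_round(weights, sites)
print(weights)  # the single weight moves toward 2.0
```

Note that the server only ever sees the lists returned by `local_update`; the `(x, y)` pairs stay inside each site's call.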
[00:04:33] Camille Morhardt: I think I understand what you mean. So traditionally (I mean, I don’t know if there is a traditional when it comes to machine learning or deep learning), but I guess before 2016 or 17, the data that was being collected, in this case from patients, since we’re talking about healthcare, would need to be sent to a central server where the central model then updates on all of the data from people. The problem is privacy then, or the privacy concern that patient data is being sent somewhere else.
So you’re saying federated learning kind of reverses this and the data stays local to wherever the patient is um, and instead the model itself ships out to wherever the patient is. And the only data that gets shipped back to the model are updates that, you know, the individual–I guess we’ll just say patients for now–are sending back.
So that’s interesting to me because you’re leaving the data local. So I expect when you’re dealing with Europe and GDPR privacy laws, and I guess other privacy laws in other parts of the world (HIPAA, for example, in the United States), you’re removing that problem of having the patient data sent out.
What about the problem that you’re then introducing of shipping the model out? So now the model’s distributed all over the place. So do you have IP concerns or security concerns about anybody targeting the model itself through a remote or distributed collaborator or node?
[00:06:05] Olga Perepelkina: Uh, so that’s correct. So in a classical way, in “vanilla” federated learning, uh, we still have some IP concerns, IP issues. So in classical federated learning any of the participants can steal the model, right? It’s a huge issue in medical applications. So for example, some organization would like to develop such a model; it’s super expensive. Usually they pay money for data collection, for annotation and for model development themselves. And they also need to have this FDA clearance to approve, uh, this model to be used in healthcare. And this is a highly expensive process and they would like to protect the model that is being developed.
And in a classical approach, it’s really complicated because every participant can save the model and use it, for example, in their startup. Right. And this is a huge challenge with IP protection of the model. Intel’s technologies like SGX for hardware isolation can potentially solve this problem. And, uh, in, uh, our team at Intel, uh, we work on this problem with integration of federated learning with such hardware.
[00:07:19] Camille Morhardt: Okay. So in some cases you’re integrating hardware into the security of federated learning?
[00:07:26] Olga Perepelkina: In this case, uh, this problem of, uh, protection of the model, uh, will be solved.
[00:07:32] Camille Morhardt: There’s so much I want to ask. I know that you’re also doing open source, which is also new in federated learning. Is it not?
[00:07:41] Olga Perepelkina: Not new. Actually, there are some, uh, existing open source projects in federated learning, though federated learning is a completely new area of research in machine learning, and even some professionals in machine learning are not aware of the federated learning approach, uh, since it was proposed only several years ago, in 2017.
And there are some existing projects in federated learning, uh, and Intel also is developing an open source platform, which is called OpenFL, Open Federated Learning. This project was born several months ago. Probably I should start with the research project. After Google proposed federated learning for mobile phone usage, Intel also started to do this research in medical imaging. And Intel Labs, in collaboration with the University of Pennsylvania (UPenn), has developed an approach in medical imaging with federated learning.
And we published a Nature paper a couple of years ago. Uh, it was the first paper in medical imaging with federated learning. In this paper, uh, we compared several collaborative learning approaches. So for example, federated learning versus different collaborative machine learning approaches like institutional incremental learning or cyclic institutional incremental learning. What does it mean? So in federated learning, uh, we train models in parallel. And in these two other approaches, uh, we do it sequentially, so not in parallel. Uh, we compared federated learning with centralized learning and with the two different collaborative learning approaches. And we showed that federated learning performed better.
The task was a brain tumor segmentation based on MRIs.
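The difference between training in parallel and training sequentially can be sketched as three training schedules. This is a toy illustration under stated assumptions, not the code from the paper: the one-weight model y = w * x, the helper `train_locally`, and the data values are hypothetical, and the sketch only shows how the schedules differ, not which one performs better on real medical images.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # private (x, y) pairs held by one institution


def train_locally(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Fit the toy model y = w * x on one institution's private data."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * (w * x - y) * x
    return w


def federated(w: float, sites: List[Dataset], rounds: int) -> float:
    """Parallel: every site starts from the same weights, then results are averaged."""
    for _ in range(rounds):
        local = [train_locally(w, data) for data in sites]
        w = sum(local) / len(local)
    return w


def institutional_incremental(w: float, sites: List[Dataset]) -> float:
    """Sequential: the model visits each site once, one after another."""
    for data in sites:
        w = train_locally(w, data)
    return w


def cyclic_institutional_incremental(w: float, sites: List[Dataset], cycles: int) -> float:
    """Sequential, but the pass over all sites is repeated for several cycles."""
    for _ in range(cycles):
        w = institutional_incremental(w, sites)
    return w


# Toy data: both institutions follow y = 2x.
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]
w_fl = federated(0.0, sites, rounds=20)
w_iil = institutional_incremental(0.0, sites)
w_ciil = cyclic_institutional_incremental(0.0, sites, cycles=10)
```

In the sequential schedules the model handed to the next site carries only what the previous sites taught it, while in the federated schedule every site contributes to each averaged update.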
[00:09:36] Camille Morhardt: How many different medical facilities were involved in that?
Olga Perepelkina: 10 institutions.
Camille Morhardt: Are they all around the world or were they just in Russia and the United States?
[00:09:46] Olga Perepelkina: Uh, not Russia. Uh, it was in the United States. Different areas of the United States. Yeah. And, uh, after that Intel Labs developed a proof of concept of, uh, federated learning in medical imaging. And then Intel decided to make an open source library based on this research. And now we have OpenFL. OpenFL is a Python library. Uh, it can work with different hardware. So it is hardware agnostic; it can work with GPU, CPU. Uh, and it also can work with different deep learning frameworks like PyTorch, Keras, TensorFlow.
Now we have several releases of this library already. So first of all, you can, you can go to our GitHub repository and get it from there. Also, we have a Docker image for this library and you can use it, uh, in, in your research or even in production.
[00:10:43] Camille Morhardt: And I assume you can use it in contexts outside of medical imaging. Is that right?
[00:10:48] Olga Perepelkina: That’s correct. Uh, basically it works better for computer vision applications, or even for NLP.
[00:10:55] Camille Morhardt: NLP, natural language processing?
[00:10:58] Olga Perepelkina: Natural language processing. But first of all, it can be used for deep learning.
[00:11:04] Camille Morhardt: So, can you tell me a little bit more about how this original study worked, um, that you, we have the link for it, of course, below to the paper, but you were doing the brain scanning segmentation, uh, off of the MRI machines. You were actually sending the model to the patients and then they were making updates locally and then re-aggregating that. Can you kind of walk through what happens in that scenario?
[00:11:32] Olga Perepelkina: In federated learning, we have different participants, collaborators, or medical sites or institutions in which we collect data and annotate them. And we have one aggregator or model owner who will get this model. Uh, so in, uh, clinical sites, first of all, we, we annotate the data. We ?remake? it, then—
[00:11:57] Camille Morhardt: You say “annotate” the data, you said it’s off MRI scans. So what’s being annotated and who’s doing it?
[00:12:05] Olga Perepelkina: So, uh, usually, uh, the data is annotated by professionals, by neurologists, uh, by medical doctors. In centralized training, uh, usually people who want to build this model, they pay money for that. Uh, and they provide some instructions for annotation, uh, to make it consistent, right? They can check the quality of the annotation process. In federated learning, uh, we still have some problems with that because people who will create this model, they can’t directly observe the quality of annotation.
Uh, and this is one more major issue and challenge for federated learning. And we also try to solve it in our product.
[00:12:48] Camille Morhardt: Uh, just to extrapolate, I guess, from that, then you either–like in the case of Google, when they were looking at phones–you either need so many different collaborator nodes that you can cancel out kind of any, any of the tail ends, right? Any, any bad annotations, I guess we’ll hopefully get lost because you have so much data coming in, most of it’s good; or, I guess, you need to be very specific about who’s doing the annotations–like in, in the case where you’re talking about 10 different medical institutions, you’re having neurologists do it.
And I suppose even if you have sort of one bad neurologist, you’ve got 10 facilities and professionals making the annotations. Would that be fair? You’re either trading like quantity or, or, or you’re opting for kind of quality in terms of the annotators?
[00:13:41] Olga Perepelkina: We haven’t solved that yet. Let’s imagine that we have only one collaborator with very high-quality data that we can check. And the second scenario, where we have 10 collaborators and, uh, we can’t directly review the annotation, but we trust the partners and we provide high-quality instructions for the neurologists on how to annotate our data, how to do that in a consistent way, but we can’t check it.
And our suggestion is that the second case is better because we will have much more data in comparison with the first case, right? Right now we can’t prove that. We still need to test it more in different use cases, but we believe in this case that, uh, if you have more data with trusted or semi-trusted partners, it will be better than if you have only one node, only one institution that will provide data for you.
[00:14:41] Camille Morhardt: That makes sense. I mean, that just makes sense, logically. You would rather have multiple inputs than, you know, a single input for any kind of, to avoid any kind of bias, I suppose.
[00:14:52] Olga Perepelkina: I would like to make a disclaimer. So we did not state that federated learning is better than centralized learning. If you can collect all your data in one place and build a centralized model, to conduct classical machine learning research in a centralized way, you can do it. Because in federated learning, you still have… You can’t directly observe your data. You can’t, uh, experiment on that.
[00:15:21] Camille Morhardt: You don’t have the raw data. You’re, you’re relying on it being updated. Yeah.
[00:15:25] Olga Perepelkina: Yeah, that’s correct. And, uh, in our future products, uh, we will add, uh, some monitoring tools, not to observe your data, but to collect some, uh, statistics to help you, uh, in your research and your experiments.
Uh, so federated learning makes sense if you have privacy issues or if you have a data silo problem. For example, you are a huge company in the financial or retail industry; in this case, you maybe don’t care about privacy. But you have a lot of silos with data and, uh, you have this communication efficiency problem, uh, because you don’t want to build a central server, because it’s expensive. You need to have a lot of memory. Uh, you need to send your data from different locations, from different countries, for example, to one central server, and then to train, uh, locally in one place. So there are two major problems that federated learning can help to solve. So the first is privacy and the second is communication efficiency, or the data silo problem.
[00:16:34] Camille Morhardt: When we’re looking at it from kind of a cyber security perspective, where do you think that attacks are going to occur in general in federated learning? Are they going to be targeted at– well, first of all, will they be targeted? And where are the most likely attacks going to happen? Is it going to be at the collaborator nodes or is it going to try to attack the model, which is, I guess, on the aggregator.
[00:17:00] Olga Perepelkina: So we suppose both, actually, so attacks can be in any of the nodes of this network. Right? So first of all, some untrusted partners can try to, uh, steal data or to try to find some leakage of data privacy. And one important, uh, research topic is to understand, when we have an aggregated model, are we sure that it still doesn’t, uh, have some private information from our data?
One more approach to solve this is differential privacy, when we try to add noise to our data. And this is still in research, not only by us, Intel Labs, for example, but in, uh, different, uh, research groups. But when you add noise to your raw data, uh, the performance of your model will drop. So you asked me which components can be attacked. So first of all, it can be raw data, and we still may have some leakage of, uh, private data, private information. And the second one is this problem that we already discussed, IP protection of the model itself.
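The noise-adding idea can be sketched as follows. Note the swap: instead of adding noise to raw data as described in the conversation, this sketch uses a commonly used variant in which each collaborator clips its model update and adds Gaussian noise before sending it. The names `clip_update` and `privatize` are hypothetical, and the noise level is an arbitrary assumption that is not calibrated to any formal privacy guarantee.

```python
import math
import random
from typing import List


def clip_update(update: List[float], max_norm: float) -> List[float]:
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def privatize(update: List[float], max_norm: float,
              noise_std: float, rng: random.Random) -> List[float]:
    """Clip the update, then add independent Gaussian noise to every coordinate."""
    clipped = clip_update(update, max_norm)
    return [v + rng.gauss(0.0, noise_std) for v in clipped]


# Toy update from one collaborator (illustrative values only).
update = [3.0, 4.0]
rng = random.Random(0)  # seeded so the sketch is repeatable
private_update = privatize(update, max_norm=1.0, noise_std=0.5, rng=rng)
```

A larger `noise_std` hides more about any single collaborator's contribution but also distorts the aggregated model more, which is the performance drop mentioned above.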
[00:18:14] Camille Morhardt: Okay. And so the IP protection of the model itself is you’re saying one of the areas that you’re looking at is how do you secure actually at the hardware layer of the model so that it can’t be breached. And then at the individual nodes, you don’t have control over that hardware necessarily if you’re the model owner, because they’re in distributed places. So how are you, how do you look at securing those?
[00:18:39] Olga Perepelkina: So the idea is to have this hardware protection on all the participants. Because people who are developing, uh, this model, they really care about the protection of this model because it’s super expensive; in medical imaging, for example, it’s super expensive. And in this case, they may choose for this collaboration only a participant who has this specific hardware to protect their model, and the data as well.
[00:19:05] Camille Morhardt: When you’re choosing your model, do you have to pick one kind of homogeneous input? And I don’t mean necessarily for the project you did, but in general, in federated learning, does it all have to be like everybody’s using the same kind of phone, everybody’s using the same MRI machine? or can you have different kinds of inputs coming into the same model?
[00:19:30] Olga Perepelkina: This is a very good question. So let’s imagine we would like to train a very good model for some healthcare problem, for example, brain tumor segmentation. What do we want? We would like that this model will perform very well in different conditions with different MRI scans, uh, from different locations, for example, in US, in Russia, in Europe, with different populations, right? With different people.
So in this case, we need very diverse data. And in this case, we absolutely would like to have different types of, uh, inputs, as much as possible, from different locations, different populations, uh, people of different, like, I don’t know, age, uh, gender and others, with different possible, I don’t know, diseases, right? So not only with the target diseases, but with additional ones. And also we would like to have different equipment, for example, different MRI scanners.
And only in this case, we believe that this model will work better in new cases, because when we have only a model that, uh, saw only the local data, when, for example, we change equipment, the performance will dramatically drop, right? Uh, because we change a major thing. And when we have this diverse dataset, we believe that, uh, this model will be much more generalizable in this case.
[00:20:59] Camille Morhardt: Well, I can obviously see it in healthcare, but I can imagine like natural language processing where you need all kinds of different accents and ages and, you know, allergies or no allergies, or, you know, different kinds of things that affect it. Yeah.
[00:21:16] Olga Perepelkina: So this is actually the problem of biases in machine learning. Right? So biases, uh, come when we have a biased data set, right? Uh, with representation of only one type of population, for example. And federated learning also can help to solve this problem, because in this case we have more diverse data.
[00:21:38] Camille Morhardt: Hmm. So if this is a new, relatively new field, um, you know, we’re talking like five years old or something, not even, I guess, if you were gonna give advice to grad students now, what areas of federated learning would you want them to look into?
[00:21:54] Olga Perepelkina: Communication efficiency. So for example, when we have right now in medical imaging, we have only 10 collaborators. Our open-source platform can work pretty well with them, but if we have, I don’t know, hundreds of thousands of participants, we will have some specific communication issues. And so this is one topic that, uh, can be solved. Uh, the same with the leakage of private information from models or with protection of models.
So one way to solve it is hardware, but, uh, sometimes it’s complicated to do so. For example, some medical institutions are not able to buy some specific hardware, but we still need to protect the model and the data. So how to solve it?
[00:22:47] Camille Morhardt: Is there anything else I should be asking you about federated learning?
[00:22:52] Olga Perepelkina: So I think that probably I should mention our first challenge in federated learning. So Intel Labs and, uh, University of Pennsylvania, we host the first federated learning challenge in medical imaging. The purpose of this challenge is to, uh, propose new aggregation methods for the models. So in a classical way, in classical federated learning, uh, we just calculate the average, the mean, of the weights in the model. But we can suggest more robust ways to aggregate model weights.
And the goal of this challenge is, uh, to create these new, uh, aggregation mechanisms, uh, and about 15 teams contributed to this challenge. The results of the challenge will be pretty soon. So I also will share the link about the challenge.
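To make the contrast between plain averaging and a more robust aggregation concrete, here is a small sketch. The coordinate-wise median is used only as one well-known example of a robust aggregator; it is not presented as a method from the challenge, and the function names and update values are made up for illustration.

```python
from statistics import mean, median
from typing import List


def mean_aggregate(updates: List[List[float]]) -> List[float]:
    """Classical aggregation: average each weight across collaborators."""
    return [mean(column) for column in zip(*updates)]


def median_aggregate(updates: List[List[float]]) -> List[float]:
    """One robust alternative: take the median of each weight instead."""
    return [median(column) for column in zip(*updates)]


# Three well-behaved collaborators and one with a wildly different update.
updates = [
    [1.0, 2.0],
    [1.1, 1.9],
    [0.9, 2.1],
    [50.0, -40.0],
]
averaged = mean_aggregate(updates)
robust = median_aggregate(updates)
```

With these toy values the single outlier drags the averaged weights far away from the other three collaborators, while the median stays close to them.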
[00:23:51] Camille Morhardt: Okay. Okay. That’s really interesting. Um, can you, before we go, can you just explain what a weight is? When you’re sending the weights back to the model from the collaborator nodes, what is that weight?
[00:24:05] Olga Perepelkina: Model weights are, like, the model parameters of some model. So we don’t send, uh, uh, raw data, but we send some model parameters.
[00:24:17] Camille Morhardt: Okay. Um, Olga, thank you for your time today. It’s been really interesting and I, you know, I’ve read your paper and I’m not a scientist, but I thought it was fascinating. Really interesting and actually written in a way that was pretty easy to understand, even if you’re not, I might have skipped just a little bit of the coding and math part of the paper, but the rest of the paper, the layout and how everything works together, uh, was very easy for me to understand.
And I really appreciate that. So, um, I do recommend it. If anybody wants to click and read a little bit more. It’s easy to grasp the way it’s written down.
[00:24:54] Olga Perepelkina: Thank you. Thank you so much.
[00:24:55] Announcer: Stay tuned for the next episode of Cybersecurity Inside. Follow @TomMGarrison and Camille @Morhardt on Twitter to continue the conversation. Thanks for listening.
The views and opinions expressed are those of the guests and author, and do not necessarily reflect the official policy or position of Intel Corporation.