Camille Morhardt 00:12
Welcome to InTechnology Podcast. I’m your host, Camille Morhardt, and today we’re going to have a conversation about artificial intelligence, specifically federated learning and its relationship with health, with biopharmaceuticals, with life sciences.
Now this, to me, poses a bit of a conundrum, right? Because while I’m very enthusiastic and excited about what AI as a technology can do for us in this space, which is obviously one of the most important aspects of our lives, I, like many other people, am concerned about the privacy of personal information.
So today, in order to have the conversation, I brought in two people who are experts in both technology, confidential computing, and life sciences and biopharma. I’m going to introduce co-host Prashant, and then he’s going to introduce our guest. Prashant Shah is Intel’s CTO for federated artificial intelligence products. He founded OpenFL, a federated machine learning library that facilitates development of AI models across private data silos, and we’ll talk a little bit more about what that means. He has spent over two decades in health and life sciences, driving scalable AI architectures. And beyond his life at Intel, he’s an advisor to the National Institutes of Health’s All of Us Precision Medicine Research Program. Welcome to the podcast, Prashant.
Prashant Shah 01:41
Thanks, Camille. And I’m going to introduce Abhishek Pandey, who’s a friend and a dear colleague of mine from AbbVie. Abhishek Pandey is a global lead and principal research scientist at AbbVie. He specializes in machine learning and drug discovery, and leads a team of machine learning scientists focused on developing advanced machine learning methods for target identification and prioritization, ADME, de novo molecule design, toxicity, and preclinical and clinical precision medicine. His team is responsible for developing state-of-the-art machine learning algorithms for the Discovery Development Sciences and AbbVie-Calico Alliance departments of AbbVie. Prior to joining AbbVie, Abhishek was a founding member of the AI team at Tempus Labs, now known as Tempus AI, where he helped transform the company in the field of precision medicine and establish its market value. Abhishek holds a PhD in electrical and computer engineering from the University of Arizona and previously worked as an embedded systems software engineer at Toshiba Semiconductor in the mobile multimedia division. So welcome, Abhishek. Nice to have you on this podcast with us.
Abhishek Pandey 02:56
Thanks for the kind introduction. Thanks for inviting me.
Camille Morhardt 03:00
Maybe we could just start with what is some of this opportunity that is driving all the excitement around AI? Understand that federated learning is one implementation of AI, but why AI at all in this space? What kinds of things are we looking for?
Abhishek Pandey 03:17
Yeah, very fair question, and this is a question that almost every pharma company, and frankly every healthcare company, is dealing with right now. I’ll speak especially for pharma, but a lot of this holds true for healthcare in general. Most of the drug discovery that has happened until now, and I’m not just talking about AbbVie, I’m talking about all of pharma, has been a pipeline sort of process: you do target identification, then you find a molecule that will bind to the target, and you hope that molecule has all the right properties, ADME, toxicity, and so on. And that molecule you push toward the preclinical and then the clinical stage to take it forward.
Now, this is a process that takes around 16 years on average; bringing one single drug to market takes 16 years. The investment it requires is approximately $4 billion. And only one in ten candidates makes it from stage one of clinical trials to the last stage. So drug discovery is a process that needs change, and it has to happen now, because frankly, we are standing at a place where there are a lot of diseases we will never be able to cure, because we will never be able to reach them. Rare diseases, forget about it, because it’s simply not economically viable to make a $4 billion investment and not find enough patients to treat.
So the problem is that drug discovery needs a change, and AI and machine learning have true potential there, because if you think about it, the speed of the internet, the cost of compute, the amount of compute you have at your disposal, all of this is improving dramatically. It’s almost an explosion, if there is one. And what we have to do is offset these two curves: the increase in the price of drug discovery against the decrease in the price of compute and memory, all the exciting things we are seeing in technology. Put together, I think it’s a perfect solution to the problems we are facing in this area.
Prashant Shah 05:41
There’s a paper published a few years ago that called out something called Eroom’s Law, which is essentially Moore’s Law spelled backwards. Moore’s Law is all about the number of transistors doubling every two years, and that’s driving efficiency in compute. But Eroom’s Law observes that the cost of drug R&D keeps going higher while the number of drugs coming to market, to Abhishek’s point, is fewer and fewer. So efficiency is actually decreasing in terms of the number of drugs coming to market per billion dollars of R&D spent. One hope that Intel has in partnering with companies like AbbVie is: can we bridge that? Can we accelerate the rate at which we create new drug candidates or leads through AI? And Abhishek, I don’t know if you can speak to this, but in the last 10 years at least, I’ve seen an explosion of genomics, high-content imaging, and some of these really rich data types, where there is so much data that it becomes almost impossible for a human to process all of these datasets.
So can you speak to some of the data types, the new rich data types that are coming out that are fueling sort of drug discovery process?
Abhishek Pandey 07:10
Yeah, absolutely, and I think this is the exciting part. People used to talk about big data, but big data is just a term now; we need as much data as possible. We live in the age of the foundation model: bring in your data and I’ll make it work, essentially. And I think this is exactly what should be, and is, happening on the drug discovery side as well. There was the explosion of genomics first, which led to all the work people have done in precision medicine, especially on the oncology side, but in other areas of medicine as well. And what people have realized is that these complementary sets of data, which were not properly utilized earlier, bring in so much information that if we don’t use them, we are almost doing a disservice when we are designing our processes, when we are designing our drugs.
So genomics; high-content imaging, the most popular form of which in the pharma world is called Cell Painting, and there’s a whole company built around it; plus the clinical data, plus the molecular data that exists out there. There’s a huge variety of multimodal data. Now, the idea is: how do we put all of these things together? There is no manual way to do this. In fact, there should not be a manual way to do this, because it would take you forever to extract insight from it. You’re better off using computational, AI-based methods that can develop insights for you and bring you to the next stage of this process, which is using those insights.
Camille Morhardt 08:58
Who’s putting together the models then?
Abhishek Pandey 09:01
Well, this is interesting, because there are multiple levels of model development happening, especially in the pharma world. There are teams like mine, or any other team in big pharma, developing models internally for their own perspective on drug discovery. But then there are technology companies developing models for drug discovery as well, and in fact increasing their ambitions in this space. One example is Alphabet spinning out Isomorphic Labs from the Nobel Prize-winning discovery of AlphaFold and saying, we have AlphaFold 3 now, which is better than the Nobel Prize-winning version, and we can do drug discovery better. So the point is that big pharma and big tech are all in the same place. The only difference is that big pharma is dealing with drug discovery day in, day out, while big tech is dealing with the technology, forming alliances with large, medium, or small pharma companies and trying to develop algorithms that can hopefully change drug discovery as we know it.
Now, one of the key features here is that there has to be a collaboration piece, because big tech, for example, does not have the insight into how this is actually done. And as I said, it’s not the kind of startup you can launch saying, “Oh, I will be really good because I have the best technology in the world,” because the foundation is still that 16-year cycle of drug discovery. Imagine this: you have a startup, you invest $4 billion into it, and you wait 16 years for only one in ten candidates to actually make it to a drug. That’s a really bad investment for a startup of that size, and nobody would give you that kind of leeway. That’s why collaborations are important: let big tech do what it does best, which is develop new algorithms and new methodologies, and let pharma do what it does best, which is put drugs out there and take on some of the risk involved, because that’s how our industry is designed.
Camille Morhardt 11:24
But one of the problems everybody’s facing, be it pharma or tech, in processing this information is that you have to have access to health data, right? Very personal information. And as you’re saying, you want to be grabbing it from as many sources as possible to train a model. And then we have regulations, and general decency, getting in the way of collecting all this data. So what is the approach that you’re taking within AI to handle this?
Abhishek Pandey 11:56
Yeah, this hits the problem at its most important point, because every piece of data we have is classified data; let’s just put it that way. Be it patient data, or molecular data in the early discovery process, that is data that is our IP. We cannot share it with anybody. For the earlier part, it’s more of an IP problem; for the later part, it’s a problem of privacy, you cannot share it for privacy reasons. But either way, we cannot share any data, frankly, and that is the main bottleneck we are dealing with right now. We have data, and all large pharma companies are probably specialized in one field or another and have a lot of data in that particular field.
But unfortunately, we cannot help each other in any form or way, in spite of the fact that we can and probably sometimes want to, because these are sensitive data and we cannot bring them together to build an algorithm, or do anything we could do together. Most of the consortiums formed earlier were often unsuccessful because of this very problem: you cannot truly share internal data, only data that looks similar to the public data space. But with the advent of federated learning and techniques like it, I think there is a possibility now, and that’s what is opening a lot of doors.
Camille Morhardt 13:34
So maybe, I mean, Prashant, could you give us kind of an overview of what federated learning is, how it works?
Prashant Shah 13:40
Yeah, sure. In traditional machine learning or deep learning methods, you essentially collect all the data in one data silo, then you train, and out comes a model. But as Abhishek pointed out, it’s very hard to get people to pool datasets, especially when you are dealing with high-intellectual-property data or private data. If it’s private data, the state-of-the-art techniques are to go through a de-identification process: you anonymize and scrub the data of any patient identifiers or personal health information before you can move those datasets. And if it’s high-IP data, you can’t even anonymize it; it’s something that cannot be moved at all. So that’s limiting what types of AI models can be built.
It’s also limiting the robustness of these AI models, because it’s not just about the quantity of data, it’s also about the diversity of data: the more diverse your data, the more robust your models are going to be. Otherwise, your models are going to be biased; they’ll work for a narrow data space and then fail on data outside the space they were trained on. So federated learning takes a different approach. It essentially says: keep the data where it is. You don’t need to move the datasets around. Instead, you send the model to where the data are, so you’re moving the model to the data instead of moving the data to the model. Every data collaborator, every data silo, keeps full custody of its data. The model is sent to compute where the data are.
The model trains on those local datasets concurrently across all the different data silos. Then the updated models, or the parameters of those models, are shared back to a central location that aggregates the learnings from all the different data collaborators. That aggregate model is then sent back for further training to each of the data silos. That’s essentially how federated learning works. It allows the data custodians to keep full custody of their data; they don’t have to share the datasets. There are certain privacy and security risks that open up with federated learning that we can talk about subsequently, but that’s the general principle.
Keep the data where it is, move the model to where the data is, the model trains locally on each of these data silos, and then the models get aggregated and sent back for further training, and this process continues until you get the model you’re looking for in terms of accuracy, right? That’s the process of federated learning. And we talked about learning, but there’s also the notion of federated validation. You may have trained a model centrally or in a federated fashion, but now you want to see the model’s performance on real-world data, on private data silos. You can use the same architecture to send these models to private data silos, validate them on those datasets, and get back accuracy metrics on how well your model is doing on that real-world data. That gives you an estimate of the overall performance of your model.
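The loop Prashant describes, local training, central averaging of parameters, repeat, can be sketched in a few lines. This is a minimal federated-averaging illustration in plain NumPy under our own assumptions (a linear model, three simulated silos, equal-weight averaging); it is not OpenFL’s actual API.

```python
# Minimal federated averaging sketch: each silo trains locally,
# only model parameters travel, and a central aggregator averages them.
import numpy as np

rng = np.random.default_rng(0)

# Three private "data silos": each holds its own (x, y) pairs for y = 3x + 2 + noise.
silos = []
for _ in range(3):
    x = rng.uniform(-1, 1, size=50)
    y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=50)
    silos.append((x, y))

def local_train(weights, x, y, lr=0.1, epochs=20):
    """Gradient descent on a linear model, entirely inside one silo."""
    w, b = weights
    for _ in range(epochs):
        err = (w * x + b) - y
        w -= lr * 2 * np.mean(err * x)
        b -= lr * 2 * np.mean(err)
    return np.array([w, b])

global_model = np.zeros(2)  # [w, b], starts untrained
for _round in range(10):
    # 1. Send the current global model to every silo; each trains locally.
    updates = [local_train(global_model.copy(), x, y) for x, y in silos]
    # 2. Only updated parameters return; the aggregator averages them.
    global_model = np.mean(updates, axis=0)

print(global_model)  # approaches [3.0, 2.0]
```

Here each silo keeps its raw (x, y) pairs; only the two model parameters ever leave a silo, and the aggregate converges toward roughly the same fit a centralized model would find.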
Camille Morhardt 17:15
What kind of collaborations are you seeing? I’m trying to get a sense also for kind of the business model here. Are all of the contributors of the data then receiving access to the centralized model, or are people getting the inferences out of it, or are they essentially selling access to certain kinds of data?
Abhishek Pandey 17:36
Yeah, I would say that among equals, and when I say among equals, I mean among big pharma companies, the collaboration is: look, we don’t have a lot of data, you don’t have a lot of data for this particular problem, so why don’t we just come together and solve it? Because it’s a common problem for all of us. That’s where the common minimum is set, and that’s where the collaboration takes place.
For example, I’m developing an algorithm, and I can always claim that I have the best algorithm possible for, say, ADME prediction. But how would you ever know that? Nobody ever releases their exact internal architecture. Essentially, that’s the level of trust you have to develop among different pharma companies, among equals at least: I have this, you have that. And then that’s the benchmark we should set. So I think benchmarking is the next step that should come into the picture, and it can be part of the collaboration. Now, there is a third portion to it, which I don’t think we have explored yet.
Most of big pharma doesn’t care about going further than that, in the sense of, why would we give it to anybody else, when we can share it among the people who contributed to this particular task? But there is a possibility that, as we do benchmarking, more common problem statements could become open source down the line. I don’t know if that’s the right way of doing it, but they could become open benchmarks that say: this is the benchmark for this kind of problem statement. And anybody developing a model, big tech for example, has to beat this standard to get to the next one. That will lower the bar of entry in this field for a lot of companies. Right now, the bar of entry is very high, because you first need to have that kind of data, and then once you have it, you need to build a machine learning team that can do that kind of work, which is a very tall ask. I think these benchmarking efforts will hopefully lead to some of those kinds of problem statements and lower the bar of entry in this field especially.
Camille Morhardt 19:53
And the two of you are working together, right? On MLCommons? Prashant, you want to talk a little bit about that?
Prashant Shah 19:59
Yeah, that’s what I was going to mention: that’s the work that Abhishek’s team and my team are doing with the organization called MLCommons. Specifically, there’s a chapter called MedPerf that is being led by Alex. We are figuring out what benchmarks we can create across pharma companies that do what Abhishek said: define the benchmark, set a standard for it, perform the benchmark, and see what kind of results come out of it. That also solves a number of lower-level problems, right? How do you represent the data? How do you format the data? What data standards do we need to have?
So it’s basically standardizing as much of the stack as you can. The higher-level goal here is to be able to benchmark various models, so that you can compare model A’s performance against model B’s and say, “Okay, model A from company X is actually outperforming model B from company Y.” That’s the higher-level goal, but along the way it ends up solving some of those other lower-level problems I mentioned, from data standards to, you know, when you actually start to ask about the metrics, what needs to be measured.
We typically tend to throw out, “Hey, let’s measure X,” but when you actually have to measure it, you really have to define what that X is. And when you start talking to various companies, you realize that everybody has a different definition of X. So even that question starts to get answered, and you start to have a more unified view across all the participants of what the true measurement is, what the best practices are for making that measurement, and whether we can create some sort of standard across the industry. That way, we can actually start speaking the same language and comparing models from a real utility perspective, right? Because that’s where AI is useful: when it’s actually useful. If you’re not able to get at the utility, it’s basically a wasted exercise.
Abhishek Pandey 22:28
And I’ll give you just one example: generative models. Right now, nobody knows exactly what the right parameters are to measure a generative model in the molecular world. Suppose machine learning is generating molecules for you; what metrics should you use to say that this method is better than that method at generating molecules? Right now, of course, we have our loss functions and we say, “Oh, this is better than that because of the loss function,” and we put metrics like synthesizability and novelty around it.
But I think the true test should be that every hundred molecules the algorithm generates have to be made in the lab and then tested to confirm they did what they were supposed to do. Now, that’s a tall ask, frankly, if you ask me. But a benchmarking effort could do that. And that’s what we are looking forward to: understanding and hopefully co-developing some of these benchmarks that do not exist today and are absolutely needed, and that nobody would ever be able to build unless we all come together.
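Until lab-validated benchmarks exist, proxy metrics like the ones Abhishek mentions are computed on the generated structures themselves. Here is a rough sketch of three common ones, validity, uniqueness, and novelty, over SMILES strings. The `is_valid` check is a stand-in (real pipelines parse each SMILES with a chemistry toolkit such as RDKit), and the example molecules are ours.

```python
# Proxy metrics for a molecular generative model, computed over SMILES strings.
def evaluate_generated(generated, training_set, is_valid=lambda smi: bool(smi)):
    valid = [smi for smi in generated if is_valid(smi)]      # parseable molecules
    unique = set(valid)                                      # duplicates removed
    novel = unique - set(training_set)                       # unseen in training data
    n = len(generated)
    return {
        "validity": len(valid) / n if n else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

# Toy example: 4 generated SMILES, one duplicate, one already in the training set.
metrics = evaluate_generated(
    generated=["CCO", "CCO", "c1ccccc1", "CCN"],
    training_set=["CCN", "CCC"],
)
print(metrics)  # validity 1.0, uniqueness 0.75, novelty ~0.67
```

These cheap proxies are exactly why a shared benchmark matters: each company can pick thresholds that flatter its own model, whereas a make-it-in-the-lab benchmark would be a ground truth nobody can game.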
Camille Morhardt 23:43
And we’re at the relatively early stages, correct me if I’m wrong, of embracing federated learning in the health and life sciences space. It feels like when things are early, there’s a bigger sense of collaboration versus protection, since lifting everybody’s boat is better than all running in silos and hoping for a win. So in that spirit, I know, Prashant, you work on OpenFL. Can you talk about what that is and what its goal is?
Prashant Shah 24:16
Yeah, OpenFL is a simple Python library; it’s a wrapper on top of the popular deep learning frameworks, so it supports whatever framework most AI data scientists use. As a wrapper, it handles distributing the model training to the local collaborator side, getting back the weights, aggregating them, and applying some sort of aggregation function on top of that. So it’s a general-purpose library that allows people to federate machine learning models; that’s the summary, essentially. What’s also important in the federated learning space is that, yes, for a science experiment you can send models around, but there are certain privacy and security risks that open up when you federate.
So when you combine federated learning frameworks like OpenFL with confidential computing technologies, with trusted execution environments, that’s where the power of the two comes together. You’re able to provide model protection and data privacy protection, and accomplish this end goal of training models across various data silos so that everybody benefits without having to share their private data with, potentially, even the competition, right? It’s been used in several different real-world federations.
The one we are working on with Abhishek and his team is around drug molecule generation, using a diffusion model that Abhishek can speak to. We are doing a simulation on the QM9 dataset: we create partitions and see whether a centralized model and a federated model reach the same levels of accuracy. If yes, that opens up a whole set of possibilities, because if both are comparable in accuracy, you don’t need to centralize anymore. That then enables us to add more data collaborators, right? You don’t have to share the data, and, by the way, by federating you’re improving the accuracy of the model; that’s the win we are ultimately shooting for. And Abhishek, you probably have a better sense of the diffusion model, what it does, and how the project we’re doing is helping companies like AbbVie, so if you could comment on that a little bit.
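The centralized-versus-federated comparison Prashant describes can be illustrated with a deliberately tiny stand-in for the real QM9 / diffusion-model experiment: partition one dataset, fit per partition, aggregate, and check the result against fitting on the pooled data. The "model" here (a sample mean) and the partition scheme are our simplifications, not the actual experiment.

```python
# Simulating a federation by partitioning one dataset, then comparing
# the aggregated per-partition fit against the centralized fit.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=900)  # stand-in for a pooled dataset

# Centralized "training": fit on everything at once.
centralized = data.mean()

# Federated simulation: split into 3 equal silos, fit locally, aggregate.
silos = np.split(data, 3)
local_fits = [silo.mean() for silo in silos]
federated = np.mean(local_fits)  # size-weighted average (equal sizes here)

print(abs(centralized - federated))  # zero up to floating-point error
```

For this trivially averageable model the two answers match exactly; for a real diffusion model the question is empirical, which is precisely why the partition experiment is run before recruiting external data collaborators.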
Abhishek Pandey 27:10
The teams are developing algorithms that can design molecules for you. Until now, a bunch of computational chemists would design the molecule, a bunch of medicinal chemists would make it, and then you would test it with wet-lab assays to say, “Is this a good one or a bad one?” And then repeat, in a cycle.
A very manual cycle; that’s the right word for it. But now what we are seeing is that you can automate the whole cycle, perhaps keeping a scientist in the loop, but definitely not for every step. So the idea is: can a machine learning algorithm generate molecules for you with all the right properties from the get-go? So that when the medicinal chemists actually make it in the lab, it moves forward in the drug discovery process rather than, “Oh, let’s repeat it.” And there’s a lot to be done, frankly. These are early steps we are taking, because even having designed a molecule with good properties, a bunch of other things still need to be satisfied, and all of that modeling is still not there.
For example, can you come up with a virtual human model? Can you come up with a virtual preclinical model? So that you test your molecule right there, rather than saying, “Oh, this is the molecule my AI generated; just make it and trust me, hopefully it’ll work.” So I think the next big step is building those models, and that will take some science as well, so that we can test those molecules right away and decide whether to make them or not, or do another iteration to generate a batch of molecules that will actually pass those tests.
Prashant Shah 29:03
So Abhishek, you’re referring to in silico testing of these molecules, so that even before you get to the assay portion, you can try out these molecules on some sort of model that tells you whether this is a viable lead or not. And if it’s a viable lead, then you take it further in the process. Is that the main idea?
Abhishek Pandey 29:32
So drug discovery is the design, make, test, analyze cycle: DMTA, that’s what people call it. What I’m proposing is, hopefully, if implemented and done right, a DTA loop with the M at the very end, where you design, test, and analyze in a repeated loop, maybe an active learning loop, and then you make it. That will hopefully carry forward further. What you are doing is de-risking the process, and most of drug discovery is about de-risking. And if you can do that at a very early stage, well, as I said, it’s a funnel, and as funnels behave, the better you do at the beginning, the better you’ll do at the end. That’s the whole point.
Camille Morhardt 30:03
It’s kind of like a two-way funnel, though, right? Because you’re starting by cutting the first part of the funnel, so that you’re using the right molecules, not having to do as many assays, and narrowing down what you’re going to focus on. But at the other end, that opens up your market. Isn’t there then an opportunity for AI to customize and personalize? You could have a model of an individual patient and then tweak the drug at the other end of the funnel.
Abhishek Pandey 30:28
So what I’m envisioning, and this is my perspective, though I think some people agree with it, is that we have to move from precision medicine to the precision molecule. Right now, we are not there yet, and the reason is that precision medicine comes very late in the game. It’s almost like two different worlds: you make your molecule, test it on rats, mice, all of those things, and say, “Oh, it’s doing great, let’s try it in the clinic.” And then you throw away everything you have until now, because the human is a whole new beast, and you do your clinical trials and hopefully make it work at that particular stage.
These two processes are so disconnected that there is nothing you can do about it today. What I’m talking about is bringing this process as early into discovery as possible. So when I talk about the virtual human, there should be a precision medicine model for the virtual human as well: “I have made the molecule using my AI methodology; now test it in a precision-medicine virtual human model and see if it behaves the way it’s supposed to.” Hopefully, down the line, that’s what will happen, and we will move all of this precision medicine thinking as early into the drug discovery process as possible, in much better shape.
Prashant Shah 31:57
Can you speak to the value of what federated learning enables companies like AbbVie to do that wasn’t possible without federated learning?
Abhishek Pandey 32:06
The biggest pain area we have, and that any other large pharma company will have, is the rise of new modalities of medicine. Until now, we were mostly using small molecules to cure diseases. Now, within small molecules, there are PROTAC degraders, drugging the undruggable. Then there is cell and gene therapy, which is making huge inroads into the modalities of medicine. There is antibody design, making huge inroads. And with the advent of GLP-1, peptides are making a return, almost.
What I’m saying is that there are so many modalities coming up every day that, even if you are good at something, everybody is new to these. Nobody has a lot of data, and the only way to make it work is to come together. You generate some data, I generate some data; we pool the power of our knowledge and algorithms in one place, and we build an algorithm trained on all the data together.
And I think that is exactly the future for all of these new modalities. There is no way one single company can claim they will be masters of a particular modality and perfect at building algorithms for it. The only way is this, essentially: federated learning, bringing the data and algorithms together. There is another possible method, which is automated generation of data. But I would say that even automated data generation should, down the line, be combined with data from others to increase diversity, because when you design an assay, you are most probably interested only in certain kinds of projects or certain kinds of antibodies, and you won’t have the diversity that other companies, interested in something else, would bring to your data.
Camille Morhardt 34:03
Well, Abhishek Pandey of AbbVie and Prashant Shah of Intel, thank you very much for the conversation today. It was really fascinating.
Prashant Shah 34:10
Thanks, Camille.
Abhishek Pandey 34:11
Thanks for inviting us.
Prashant Shah 34:13
Thanks, thanks, Abhishek.
Abhishek Pandey 34:14
Thank you, thanks, Prashant. Thanks, Camille.