Ep 68 – WTM: Data Anonymization
[00:00:00] Announcer: Welcome to What That Means with Camille, companion episodes to the Cyber Security Inside podcast. In this series, Camille asks top technical experts to explain, in plain English, commonly used terms in their field, then dive deeper, giving you insights into the hottest topics and arguments they face. Get the definition directly from those who are defining it. Now, here is Camille Morhardt.
[00:00:36] Camille Morhardt: Hi, I’m Camille Morhardt and welcome to this episode of What That Means: Data Anonymization. Today we’re actually going to meet with Kristin Ulrich, who is a Senior Solutions Specialist at SAP for HANA architecture. And I’m going to let her describe what HANA architecture is, but, uh, I want to just say briefly, SAP is a world leader in Enterprise Resource Planning, or ERP, software.
And for those of you who need a brief refresher, ERP–which now scales to almost any size business, but was classically developed for larger enterprises–helps enterprises kind of manage all different kinds of aspects of their company, all different systems. So it can range from everything from payroll to supply chain, to purchase order tracking. And it can actually also even integrate to a degree with customers and vendors so that it can keep track of all of these things. Can also make some automated decisions if you choose to do that. You can see clearly the connection then to the importance of data anonymization, you’re dealing with all kinds of data.
Uh, so we’re going to get into that. So welcome, Kristin. It’s great to have you here all the way from Berlin.
[00:01:53] Kristin Ulrich: Hello everybody. Really glad to be here tonight.
[00:01:57] Camille Morhardt: So I know that, uh, SAP HANA is optimized on Intel architecture, but I’m hoping that you can just describe what HANA is briefly for people.
[00:02:07] Kristin Ulrich: Okay. So HANA in general is an in-memory database and, um, yeah, with that database, you have a lot of capabilities that come along and one of them is actually data anonymization.
[00:02:20] Camille Morhardt: And so now what is data anonymization? Can you give us the base? I know it’s an itty bitty part, as you said, of all of HANA, but what is it, why is it important and what is it actually?
[00:02:34] Kristin Ulrich: Okay. So if we speak about data anonymization, it’s all about like personal data that is in the end, irreversibly altered. And that’s really important because here, the important thing is that the data subject can no longer be identified directly or indirectly.
So let me just give you an example. Um, picture a class of 20 students sitting in a classroom, and you have one student wearing a red shirt. So if the teacher calls out that one particular student with a red shirt, everybody in that group of 20 students will know exactly who that person is. But if we have five students with red shirts and the teacher calls out that one particular person with a red shirt, nobody would actually know which of those five students is actually meant. So we’re hiding one particular individual in a group of five students. And this is exactly what data anonymization is about, if we speak about the concept of K anonymity. So we’re hiding one particular person in a group so that we can no longer identify that particular one person.
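The red-shirt example can be sketched in a few lines of Python. This is only an illustration of the K anonymity idea, not how SAP HANA implements it; the function name `k_anonymity` and the record layout are assumptions made for the sketch. The "k" of a data set is the size of the smallest group of records that share the same visible attributes.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that
    share the same values for the given quasi-identifier columns."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values())

# A class of 20 students: first with one red shirt, then with five.
one_red = [{"shirt": "red"}] + [{"shirt": "blue"}] * 19
five_red = [{"shirt": "red"}] * 5 + [{"shirt": "blue"}] * 15

print(k_anonymity(one_red, ["shirt"]))   # the lone red shirt is a group of 1
print(k_anonymity(five_red, ["shirt"]))  # the red shirts now form a group of 5
```

With one red shirt the smallest group has a single member, so that student can be singled out; with five red shirts every student hides in a group of at least five.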
And, um, these methods that we have applied in HANA data anonymization actually stem from research. There are two methods that we’ve applied. One is K anonymity, which, as I said, means we’re hiding individuals in groups. And then we have the other concept, which is differential privacy. So there we’re applying noise to the data so that we’re no longer able to, for example, know exactly what type of earnings individual people have, what type of salary individual people have. Because in a group of four, for example, where you have salaries of 40,000, 50,000, 60,000 and 70,000, we would alter that.
So that, uh, you would in the end have data such as 10,000 and 100,000. So in that data set as a whole of these four people, we wouldn’t be able to know exactly who earns what. And the data set as a whole would still make a lot of sense.
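The salary example can be sketched with the Laplace mechanism, the textbook way of adding differential-privacy noise. The transcript does not say which mechanism HANA uses, so this is a stand-in; the function names, the salary bounds of 0 to 100,000 and the privacy parameter `epsilon` are all assumptions made for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace distribution centred on 0.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def perturb(values, lower, upper, epsilon, rng):
    """Clip each value to [lower, upper], then add Laplace noise.

    The noise scale is (upper - lower) / epsilon: a smaller epsilon
    means more noise and a stronger privacy guarantee per value.
    """
    scale = (upper - lower) / epsilon
    return [
        min(max(value, lower), upper) + laplace_noise(scale, rng)
        for value in values
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
salaries = [40_000, 50_000, 60_000, 70_000]
noisy = perturb(salaries, lower=0, upper=100_000, epsilon=1.0, rng=rng)
print([round(value) for value in noisy])
```

Each released value may land far from the real salary, and may even fall outside the assumed bounds. Because the noise is centred on zero it tends to cancel out in averages over many records, which is why aggregate analysis can still work; with only four values, as in the example, the aggregate stays quite noisy.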
[00:04:35] Camille Morhardt: So what good does it do the company, if you don’t have access to this specific information?
[00:04:42] Kristin Ulrich: The thing is that under GDPR, for example, we can only, um, use data which is no longer personal. So we can only process it, we can only use it for analytics, we can only use it for machine learning if it’s no longer personal; if the data’s personal, we can’t process it. So if a company has a lot of data, personal data about individuals, they’re not able to use it. So they need to really adopt data anonymization so that the data is no longer personal and we can actually use it for other types of scenarios.
[00:05:16] Camille Morhardt: So you’re looking at analyzing data sets or looking at making modifications of the way that processes are running within the company or across the company. But you can’t pull out very specific identifiable, we’ll call it personal or private, traits. So instead you’re abstracting it. That’s probably not the right word. You’re probably not abstracting it. You’re making it very specific, but hiding the source. Would that be right?
[00:05:42] Kristin Ulrich: Yeah, let’s say we’re generalizing the data or the individuals into groups, if we take the concept of K anonymity, for example.
[00:05:51] Camille Morhardt: So what existed before K anonymity?
[00:05:55] Kristin Ulrich: To my knowledge, K anonymity or differential privacy are, um, concepts that evolved through, um, anonymization. So these concepts really came with, let’s say the uprise of, um, the discussions about anonymization.
[00:06:11] Camille Morhardt: Okay. So this is kind of a new emergence of a new way of doing things, anonymizing data. Before that, companies just had the data, saw the data, tried to protect the data. Now we’re seeing, I guess, an evolution of culture and society saying “that’s not enough. You actually need to anonymize it.”
[00:06:28] Kristin Ulrich: Exactly. Since we have new laws and regulations in place, companies need to adhere to these laws and regulations. And this is exactly the reason why we are now speaking about data anonymization.
[00:06:41] Camille Morhardt: So can you tell us a little bit more about actually how it works? Are there sort of different aspects of data anonymization? Rules or parameters you need to enter? What kind of decisions do you have to make when you’re setting it up?
[00:06:54] Kristin Ulrich: Okay. So it really depends, um, on what kind of use case we’re talking about. So, um, like for example, um, you know, I’ll give you a couple of different use cases and you will see that we’re speaking about completely different datasets, and depending on the use case, we actually alter the data or apply one or the other technology or methodology.
So, one of the use cases that we have been working on is actually a use case from around July of last year. We had a hospital that approached us and they said due to the uprise of the COVID-19 pandemic, um, they had to collect a lot of patient data. And they came to the understanding that, um, some patients were from a specific ethnic background; that some patients were really fit, others weren’t as fit; there were patients with preconditions; and they were taking certain kinds of medications or they weren’t taking any. So all of these different traits were actually really interesting because every patient with certain traits reacted differently to treatment. And the hospital said “it would be fantastic if we can make this data available so that when other people are admitted to hospital with the same disease, with COVID-19, it would be easier to cure because we already have some learnings. We have some understanding of what worked well and what didn’t work so well.”
And, um, this is exactly what we were working on. We were actually working on, let’s say, a shareable COVID-19 database, so that different companies, institutions, healthcare organizations were able to access this data and generate learnings about, um, let’s say COVID-19 and how to better treat the different patients.
And there was also another use case in the traffic planning industry. So when looking into traffic planning, we have a lot of mobile phone providers. Everybody now has a smartphone and as we’re walking around with our smartphones, this point-of-location data is actually sensitive data. So this data needs to be anonymized in order to use this kind of data. Traffic planners were really interested in using this data, because if they know that a lot of people are on a subway at a certain point in time, then “maybe we need to have more subways or, you know, have them run every two minutes instead of every 10 minutes so that we can actually respond to the need of the people”.
And then we also have a use case about company travel expenses. Let’s assume we have a travel agency. And the travel agency is booking company trips and a person between 30 and 35 calls in and needs to travel to Frankfurt. It would be really good for that agent to be able to advise, based on the learnings of past travels, where the best hotel would be, where good restaurants are, where people like to go. Well, all of that data, if we look at one particular individual, is personal data, because it tells us where a person has been. That data is sensitive and needs to be anonymized in order for that travel agent to actually use the data and work with the data and give me, as I’m calling in to book my trip to Frankfurt, good advice on where I should go.
So, um, as you can see, we have really different parameters because of, um, having those different use cases. And every time we speak about data anonymization, it’s really important for us to really get into the depths of the use case and then decide which of the two methodologies should be applied and how we can actually bring the use case to life.
[00:10:29] Camille Morhardt: I think all the use cases are really interesting and I can see the clear benefit to any individual and the clear un-necessity to have the private information. It makes sense to me, like, of course I want to get the best hotel, but I of course don’t want to know who was in it last week or whatever. That’s not my business.
Do you ever worry, though, that especially in compute and with AI out there now, that if you define parameters for one specific use case, very, very carefully that if the data somehow leaked, you could then figure something else out–if you had malicious intent, let’s say? Because the parameters were set to protect one thing, but now somebody is looking for something else.
[00:11:15] Kristin Ulrich: It is definitely a concern. I mean, there’s lots of researchers that state that we don’t have an absolute anonymization. There’s always a certain risk of reverting the data so that you can actually find out that potential individual. But I think in science, and especially from the technology side, we’re doing everything possible so that this is not going to happen. And we do our best so that this particular individual is no longer, um, identifiable and, um, yeah, but there’s always a certain risk, I’d say. Yeah.
[00:11:49] Camille Morhardt: How do you know whether your personal information is being collected and whether that data then is being anonymized? I guess you’re saying if you’re in Europe, the GDPR, which I’m going to forget exactly what it stands for, general data protection. Okay. General data protection and privacy. So that’s kind of like, I guess, America’s healthcare HIPAA, but GDPR is broader than healthcare. So are you guaranteed because of public policy or laws, or do you have to look at individual companies to know who’s doing what? Like, if you’re allowing an app, say, to collect your information on your location, how do you know whether it’s being protected or anonymized?
[00:12:34] Kristin Ulrich: Okay. Every company, when we speak about GDPR, has to comply with GDPR. If there are breaches, those are significant and they’re tracked. So it’s really important that companies that have headquarters or operations in Europe apply, um, GDPR. And with that, we already have a certain amount of standards that are defined, and we don’t actually have a concise definition of anonymization. Um, there’s been lots of interpretation by researchers of what anonymization means, and this is something that’s scientifically approved. And I think from that standpoint, you have those privacy guarantees that your data is not being used differently in different companies.
There’s one standard of how K anonymity, how differential privacy should be applied. You know, if you incorporate that into software, then this should be, let’s say, the same in any kind of application or database.
[00:13:36] Camille Morhardt: Okay. And then what are the people in your field arguing over right now?
[00:13:40] Kristin Ulrich: Um, for example, a good, uh, let’s say point of discussion is currently anonymizing non-structured data. I mean, here, in HANA data anonymization, we’re only using structured data, but then unstructured data would be emails or, um, you know, different types of data sets. Or even social media is also a part of unstructured data. And to anonymize that, that’s something that’s not easy to do. And that’s something that a lot of researchers, but also technologists, are currently looking into in order to find an answer.
[00:14:13] Camille Morhardt: With structured data, you mean like for the example of the hospital data, it’s all entered in a specific way. So you have more control over how you’re tackling it versus unstructured, which is, just as it says, free flowing, so now you’re scanning all kinds of different information?
[00:14:28] Kristin Ulrich: Well, you would get much more information about that one particular individual. So if people were able to have access to emails, to social media, to, you know, different types of data, and accumulate that or add that to that particular individual, you’d have much more knowledge. And, um, you could probably tackle different problems when using unstructured data. But as I said, this is not an easy thing to do and that’s why it’s so widely discussed among researchers and technologists.
[00:14:58] Camille Morhardt: Is there anything else people are arguing over, like how to anonymize or new methods for it?
[00:15:04] Kristin Ulrich: Yeah, I think, um, there is a lot of, let’s say, argumentation around what’s the difference between pseudonymization and anonymization. I mean the clear difference is that if you apply pseudonymization, you’re really just taking, for example, my name Kristin, and you put Mickey Mouse instead of Kristin; but all of the other identifiers, quasi-identifiers as we call them, they remain the same. Whereas in [00:15:30] anonymization, you’re adding noise or you’re generalizing. You’re hiding the individuals in groups so that the person can no longer be identified. Whereas with pseudonymization, you have that possibility. You have that possibility to revert back and identify that one particular person.
[00:15:46] Camille Morhardt: If you were giving advice to, say, a company that was going to work with another company, so some kind of a third-party interaction where some of the data sets were going to be shared, what would you recommend, uh, this company ask the other company? Like to make sure that they were paying attention to this type of thing?
[00:16:06] Kristin Ulrich: I think it’s really important to always understand the use case, because for some scenarios, for example, intra-company scenarios, it’s good enough. Let’s take another example about hardware ordering: you just want to get an understanding from the office in Berlin or the office in Frankfurt of who ordered how many hardware devices, and you just want to share that information within your company. Maybe it’s good enough to just leave out that name and put Mickey Mouse or XXX for everybody in that group you’re looking at; but maybe you already need to meet anonymization standards. So I think it’s always super important to get down to the use case, to really understand what it is the other person wants to get out of the data, and then see if we have the right technology in place and the right people in place to actually do it, to actually bring it to life.
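The difference between the two can be made concrete with a small Python sketch. The record fields (age, postal code, city visited) and both function names are hypothetical, chosen only to mirror the Mickey Mouse example; nothing here reflects an actual SAP HANA interface.

```python
def pseudonymize(record, alias):
    """Replace only the name; every other attribute stays untouched."""
    return {**record, "name": alias}

def generalize(record, age_band=5, zip_digits=2):
    """Drop the name and coarsen the remaining identifying attributes."""
    low = record["age"] // age_band * age_band
    masked = "*" * (len(record["zip"]) - zip_digits)
    return {
        "age": f"{low}-{low + age_band - 1}",
        "zip": record["zip"][:zip_digits] + masked,
        "city_visited": record["city_visited"],
    }

person = {"name": "Kristin", "age": 33, "zip": "10115",
          "city_visited": "Frankfurt"}

print(pseudonymize(person, "Mickey Mouse"))
print(generalize(person))
```

After pseudonymization the exact age and postal code are still present, so anyone who knows them can work back to the person. Generalization only moves toward anonymization: whether the result really hides the individual depends on how many other records end up in the same age band and postal area.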
[00:16:57] Camille Morhardt: And then my final question here is just what other types of technology or concepts do you think if people are trying to really understand data anonymization, what other things do they need to be looking into?
Like, is it Artificial Intelligence or privacy regulations or, you know, what kind of other technologies intersect with that?
[00:17:19] Kristin Ulrich: So, first of all, you have the legal part, which is super interesting because we have so many countries that are just evolving laws with regards to data privacy and data regulations. So there is loads to learn on that side, because if you work with different companies in different countries, well, you need to get a better understanding of what their current legislation is, and then see how you need to apply data anonymization. Or maybe that country only asks for a type of data masking. You just have to understand, first of all, the legal regulation. Then I think we’re getting more and more data, and companies, if they can use that data, you know, they can just [00:18:00] use any type of data, and there’s lots and lots of data that they’re collecting that’s personal. So I think big data is definitely a topic that you have to look into when you speak about data anonymization.
And then the last part is, so you have that data and then what do you want to do with that data? Is it analytics? Is it machine learning? Is it Artificial Intelligence? I mean, if that data’s no longer personal, what are the possibilities? Maybe there are possibilities that we’re currently not even thinking about, but that are going to be, let’s say “the standard” in maybe 2, 3, 4 years’ time. So, um, yeah, I think this is a super interesting field and what’s great is that it’s currently evolving and evolving and evolving. There’s more and more research done in this field. And, um, this is what keeps the topic so up to date and so interesting.
[00:18:48] Camille Morhardt: What do you think is going to evolve out of it in two to three years, if you had to put your finger on something? You think that most people aren’t thinking about that you’re like, “ah, that’s something I’m paying attention to.”
[00:18:57] Kristin Ulrich: I mean, just from my part and the discussions I have with, um, with technologists, I think, um, the most important, let’s say, topic to solve is the unstructured data part, because then we would probably be able to think about use cases that are currently unthinkable. At least that’s my current understanding and my current assumption. But maybe it’s something completely different and something you and me, we’re not even thinking about today.
[00:19:25] Camille Morhardt: Very cool. Kristin, thank you. And I have to say your accent, your, I know that German is your first language, but your English accent is phenomenal. And I’m wondering, it’s like this combination between American and British. Can you tell us why?
[00:19:40] Kristin Ulrich: Well, I used to live in Nebraska for some time and then moved over to the UK to Cambridge and Bristol. So it might be a bit of both.
[00:19:48] Camille Morhardt: It’s really cool. It’s amazing how fluent you are too in a second language. Thank you so much for joining us today. I really appreciate the conversation.
Kristin Ulrich: Thank you
Camille Morhardt: Again. I’ll just say my guest was Kristin Ulrich who’s Senior Solutions Specialist for HANA at SAP.
[00:20:08] Announcer: Stay tuned for the next episode of Cyber Security Inside. Follow @TomMGarrison and Camille @Morhardt on Twitter to continue the conversation. Thanks for listening.
[00:20:22] Announcer: The views and opinions expressed are those of the guests and author, and do not necessarily reflect the official policy or position of Intel Corporation.