[00:00:37] Camille Morhardt: Hi, and welcome to InTechnology. Today we’re going to cover What That Means: synthetic data. I have with me Selvakumar Panneer, and Omesh Tickoo, both Principal Engineers in Intel Labs. Welcome to the show.
[00:00:49] Selvakumar Panneer: Thank you.
[00:00:50] Omesh Tickoo: Thank you Camille.
[00:00:51] Camille Morhardt: Can we start by having one of you identify what is synthetic data? I’m not sure everybody has heard the term.
[00:00:59] Selvakumar Panneer: Yeah, so synthetic data is data that's artificially generated by a computer, and there are two kinds: one generated using programming models, and another generated using AI. Both follow rules while generating the data, so it's not random data; there are ground rules around how to generate it, and it's used in many ways today.
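Selva's first kind, data generated with programming models, can be illustrated with a minimal sketch. Everything here is hypothetical: the sensor fields, value ranges, and fault rule are invented for illustration, not something discussed in the interview.

```python
import random

def make_synthetic_reading(rng: random.Random) -> dict:
    """One fake sensor record built under simple 'ground rules':
    temperature stays in a plausible band, and the fault flag
    follows a rule tying it to high vibration."""
    temperature = round(rng.uniform(20.0, 80.0), 2)   # plausible operating band
    vibration = round(rng.gauss(1.0, 0.4), 3)         # nominal vibration level
    fault = vibration > 1.8                           # rule: high vibration => fault
    return {"temperature_c": temperature,
            "vibration_g": vibration,
            "fault": fault}

def make_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n records reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_synthetic_reading(rng) for _ in range(n)]

if __name__ == "__main__":
    for row in make_dataset(3):
        print(row)
```

The point of the seed is reproducibility: the same rules and seed regenerate the exact same "fake" dataset on demand, which real-world data collection can never guarantee.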
[00:01:23] Camille Morhardt: So to be clear, this is fake data. I understand there’s rules so that we’re trying to mimic reality or realistic conditions, but it actually is fake. We’re generating it, making it up.
[00:01:34] Selvakumar Panneer: That’s right. Yeah, they’re generated.
[00:01:36] Camille Morhardt: What scenarios or what industries is this occurring in and why? Why are we doing it?
[00:01:42] Omesh Tickoo: It's used across many industries. If you look at today's AI systems, they are both consumers of data and producers of data. On the consumer side, AI models are very data hungry, and we can't expect to find real data every time we want to build a new model, so people use fake data, synthetic data, to train these models. On the producer side, if I'm building content for a media house and I want to picture scenarios that may not be realistically possible but have to be as close to realism as possible, that content is itself generated synthetic data, as opposed to synthetic data that was used to build an AI model. So in terms of industries: anywhere we need a lot of data, whether it's industrial applications, autonomous systems like cars or robots, or the media and gaming industries, where we want to use the data for different usages.
[00:02:36] Camille Morhardt: Are we seeing it like … for example, I think of the scenario in industrial use cases where we have a lot of known good data. If you’re looking for defects or something–by “we,” I mean the world–most factories do pretty well. Most of the products come out looking good. And so we have a lot of data of how it’s supposed to look, but we don’t do so well with when there’s a problem. So is that the kind of scenario where you’re generating synthetic data so you can train a model to notice when something is wrong?
[00:03:09] Omesh Tickoo: It's interesting that you brought up defect detection. That's something we're actively working on, and there is a use of synthetic data in exactly the way you said it, because defects can come in all different sizes, shapes and forms, and you really cannot train for every single kind of defect; you don't know all the defects you will see in the real world. So there are two ways to approach this. One is to say, I know all my good data and I'm going to use that information to figure out what's bad. But that only allows you to tell bad from good. It doesn't allow you to say, for example, whether there's a crack, whether there's a break, whether somebody forgot to do something. If you want to get to that level, synthetic data really helps, because you can produce those kinds of augmentations and combinations and train your model so that those things can be detected in the real world without actually needing them to happen when you train your model initially.
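The approach Omesh describes, painting known defect types onto clean images so a model can learn them without waiting for real failures, might look something like this minimal sketch. The crack geometry, image size, and pixel values are invented for illustration:

```python
import numpy as np

def add_synthetic_crack(image: np.ndarray, start: tuple[int, int],
                        length: int, rng: np.random.Generator) -> np.ndarray:
    """Paint a dark, jittered vertical line onto a clean grayscale
    image to simulate a crack defect. Returns a new array."""
    out = image.copy()
    r, c = start
    for _ in range(length):
        if 0 <= r < out.shape[0] and 0 <= c < out.shape[1]:
            out[r, c] = 0                     # dark pixel stands in for the crack
        r += 1                                # cracks run roughly downward...
        c += int(rng.integers(-1, 2))         # ...with random horizontal jitter
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.full((32, 32), 200, dtype=np.uint8)   # uniformly "good" part
    defective = add_synthetic_crack(clean, start=(2, 16), length=25, rng=rng)
    print("pixels changed:", int((defective != clean).sum()))
```

Each call yields a labeled (image, "crack") training pair from a single known-good image, which is exactly the leverage synthetic data gives when real defect examples are scarce.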
[00:03:57] Camille Morhardt: So the second kind of scenario I want to ask about is Selva, I know you’ve been working in film, so can you describe how we’re generating … Again, “we” as the industry more generally, but how it’s being used in the movies?
[00:04:12] Selvakumar Panneer: So in movies, for example, visual effects need a lot of compute power today, and that compute is needed to render things that look realistic. This is an area where the need for compute is much higher, and now we are using AI to see if we can actually render things much faster. These AI models need a lot of data to train on, and in this scenario synthetic data has been used to train them to reduce render time.
[00:04:41] Camille Morhardt: So what are you rendering? A person? A background? What are we looking at?
[00:04:45] Selvakumar Panneer: It could be anything: the background, or the person. And when it comes to a person, the need for data is even higher. This is where we face a lack of data, because we can't really use real people in these data sets. So we need to generate the data using computer graphics and use it to train AI models, so that they can generate a twin of somebody, or characters in a game or in movies.
[00:05:12] Camille Morhardt: So how is a digital twin being used in movies?
[00:05:16] Selvakumar Panneer: The intersection of media, AI and graphics is happening right now. AI is bringing two worlds together. One is the computer vision world, where computers are used to perceive how the real world is, and AI is used to perfect that perception. On the other side is the computer graphics world, which is trying to create a realistic virtual world, and AI is used there too, to bring realism to movies and games.
Similarly, to create a digital twin, we need these two worlds to come together, and AI is the glue that's making the virtual world and the real world converge. That requires a tremendous amount of data to train these AI models to blur the lines between the real world and the virtual world. And this literally brings a lot of innovation to the movie industry, the gaming industry, or the futuristic metaverse we're all heading towards.
[00:06:09] Camille Morhardt: But you have to bottom-line it for me. I know it's being used in stunts. Where stunt people were putting their lives in danger, this is one of the areas where it's being used to sort of mimic what the main character might look like.
[00:06:23] Selvakumar Panneer: Yeah, today VFX artists go through a tremendous process. Everything is done manually, which takes a lot of time and energy to create those stunt sequences. This is where synthetic data, and the advancement in computer graphics and rendering, can really help. We can now render a person such that you can't really tell, looking at two people, whether one is real and the other was rendered or synthesized using AI. The lines are really blurring, which means VFX artists can take a rendering as a base and use AI to make it look realistic in games or in movies.
[00:07:02] Camille Morhardt: What does VFX stand for?
[00:07:04] Selvakumar Panneer: Visual effects.
[00:07:06] Camille Morhardt: Oh, okay. I just want to hit one other use case I've heard of: autonomous driving. How is it being used there?
[00:07:14] Selvakumar Panneer: Autonomous cars need a lot of data to train on so they can perform in all scenarios: weather conditions, environment changes, and so on. Most likely, we don't have enough of that data. We can collect data, but we hit limits, and this is where synthetic data can help, for example in understanding where the potholes are and what action needs to be taken. Like I mentioned, we can render these scenes and use them as data sets, or we can generate visual data using AI and feed it back into these autonomous agents to see what action they need to take when such a condition occurs. There are too many variable factors here, like the weather, road conditions, people, pedestrians, and data is going to help quite heavily in those cases. This is where we need synthetic data.
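One common way to multiply a small set of real driving frames into many weather and lighting conditions is simple parametric augmentation. A minimal sketch, assuming frames arrive as NumPy arrays; the brightness and noise values below are arbitrary illustrations, not anything the speakers specified:

```python
import numpy as np

def simulate_conditions(frame: np.ndarray, brightness: float,
                        noise_sigma: float,
                        rng: np.random.Generator) -> np.ndarray:
    """Derive a new training frame from a clean one: scale brightness
    (dusk, overcast) and add Gaussian sensor noise (rain, low light)."""
    out = frame.astype(np.float32) * brightness
    out += rng.normal(0.0, noise_sigma, size=frame.shape)
    return np.clip(out, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    clean = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    # One clean frame becomes several plausible "conditions".
    for b, s in [(1.0, 0.0), (0.5, 5.0), (0.2, 15.0)]:
        variant = simulate_conditions(clean, b, s, rng)
        print(f"brightness={b}, noise={s}, mean={variant.mean():.1f}")
```

Production pipelines go much further (rendered scenes, physics-based weather, generative models), but the principle is the same: vary the factors you cannot afford to capture exhaustively on real roads.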
[00:08:02] Camille Morhardt: So I want to talk unintended consequences because it’s exciting. I mean, when we talk about generating a human that we can’t tell the difference between the actual human and the digital twin or rendering of a human or same thing for background environments or we’re training safety critical systems on data that isn’t real. What kind of concerns might somebody have about something like that?
[00:08:29] Omesh Tickoo: Every time we build an intelligent system, there are concerns about ethics, bias and responsibility, and this is no different. Our AI systems are highly dependent on the data they get trained with, along with the model architecture, et cetera. We've seen those problems with non-generated data before: if your data set is biased, what you get is a biased model, and that's no different here. If we generate data using models that are biased, we'll get biased generated data, and if we use it for safety-critical systems, or anything that has a societal impact, it's going to show its effects there. So there definitely is a responsibility on the people developing these models to be careful about that. The one silver lining, I would say, is that we do control many variables in how we generate synthetic data, which is different from how we collect data, because sometimes societal bias can creep into where we go to collect the data.
Having said that, similar possibilities exist here, depending on who is doing the job of generating and building these models. So in my opinion, this is no different from training a normal AI model: if you're not careful about how you build the generative model that's producing the data, you will get biased results. One has to be careful to build these models from the ground up in a way that's tested for ethics and bias concerns. These models also have adaptability, because they keep generating more and more data over time. So there is a possibility of feeding that loop back and saying, "Hey, if I can detect some bias, or something that's not right from a responsibility perspective, can I fine-tune the model and keep making it better over time as I see more of these instances?"
And again, I feel the flexibility is there just because it's generated data and we have control over what we generate. We still have to be cognizant of what we're building as we go into using this generated data for more and more applications beyond movies, applications that actually affect autonomous cars or the financial industry.
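The feedback loop Omesh describes, detect bias in the generated data and steer the generator, can be approximated at its simplest as a class-balance check on what the generator produced. This is a hypothetical sketch: the 0.6 threshold and the "regenerate the under-represented classes" policy are illustrative assumptions, not anything from the interview.

```python
from collections import Counter

def balance_report(labels: list[str]) -> dict[str, float]:
    """Share of each class label in a generated dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {k: counts[k] / total for k in counts}

def needs_regeneration(labels: list[str], max_share: float = 0.6) -> list[str]:
    """Flag classes we should generate MORE of: if any single class
    exceeds max_share of the data, every other class is flagged."""
    report = balance_report(labels)
    if max(report.values()) <= max_share:
        return []          # acceptably balanced; nothing to fix
    dominant = max(report, key=report.get)
    return sorted(k for k in report if k != dominant)
```

In a real pipeline the flagged classes would be fed back as generation targets, closing the loop Omesh mentions; class balance is of course only the crudest proxy for bias.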
[00:10:28] Camille Morhardt: Yeah, it's interesting. The other thing it makes me think of, especially when you talk about the entertainment or media industry: if there's an actor or an artist whose image can be generated, then all the implications come up around identity protection, sort of copyrighting your identity, or even your identity at a particular age in time, and the sorts of actions that could be taken with that digital identity.
[00:10:56] Omesh Tickoo: And we've seen some of those examples in the recent past. Not movie actors, but political figures whose statements were slapped onto deepfakes, as an example, and shown as coming from them when they really weren't. That caused some news as well. These things need to be investigated more carefully. There are lots of approaches one can take, and there are platform companies looking at certifying media built on their platform. So if I develop something using a platform, I would also assign a certification to it saying it has not been tampered with. If somebody then switches the face or the voice on a different platform, that certificate no longer propagates with the media, and its absence gives the tampering away. So that's one way to do it.
The other way is to develop tools that can detect such manipulations; depending on what was used to generate the data, you could actually backtrack and see. But the problem you raise is genuine. It has happened, and many researchers are looking at different ways to get around it, from both a mitigation perspective and a detection perspective.
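The certification idea can be sketched with a keyed hash: the platform tags content at creation time, and any later edit invalidates the tag. This is a toy illustration of the concept only; the key and media bytes are made up, and real media-provenance systems are far more involved than a bare HMAC.

```python
import hashlib
import hmac

PLATFORM_KEY = b"demo-platform-secret"   # hypothetical platform signing key

def certify(media_bytes: bytes) -> str:
    """Tag a platform could attach when content is created on it."""
    return hmac.new(PLATFORM_KEY, media_bytes, hashlib.sha256).hexdigest()

def verify(media_bytes: bytes, tag: str) -> bool:
    """Check that the media was not altered since certification."""
    return hmac.compare_digest(certify(media_bytes), tag)
```

Swapping a face or a voice changes the bytes, so `verify` fails, which is the "certificate doesn't propagate with the media anymore" effect Omesh describes.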
[00:11:58] Camille Morhardt: You brought up deepfakes, and I'm wondering if you can talk a little bit about the relationship between deepfakes, generative AI, stable diffusion, whatever other generative portion of AI we have, and synthetic data. One seems to be a particular ability to generate something, and synthetic data seems like a volume kind of vector. So help me understand how those two go together and how they operate.
[00:12:26] Selvakumar Panneer: Yeah, I think the deepfake question is more about whether a specific image or video has been manipulated. If you look at where stable diffusion and all the generative AI is heading, it's able to create things beyond our imagination, and it's able to do that much, much faster. From an artist's point of view, it gives a tremendous advantage: they don't really need to start from the ground up, and they can literally use AI to set the base and then bring their creativity on top of it.
And in a way, I think synthetic data, or the AI helping stable diffusion create these imaginary scenes or videos, can be a big advancement in creativity, where artists can put their effort and creativity where it's needed most, rather than spending their energy bringing the baseline into their art. So generative AI is, in my opinion, helping the creative community get to the next level, just like how the move from analog to digital media made image composition and video editing a lot more interesting for anyone to do. I think in the future, AI is going to help anyone create movies or plays much more easily. It's going to be in everybody's hands in the coming years.
[00:13:48] Omesh Tickoo: Maybe I could add a sentence or two to that. The scope of generative AI, as Selva mentioned, is quite large. It can generate video, text, audio, what have you; that's the field of generative AI. We have models where, if I give a text prompt, it will give me back a video based on that text. So we could use it in movies where we are developing these fantastical worlds, providing content that we couldn't produce otherwise, or we could use it to generate human-like appearances of people speaking words that we put in their mouths.
Both are applications of generative AI. It just turns out that somebody is using it for the specific purpose of generating effects, while others are using it to develop new content for entertainment or to train these autonomous models. So in a way, these are different pillars of the same underlying technology domain.
[00:14:41] Camille Morhardt: I understand. So you’re going to use synthetic data to make the deep fakes better, to make the generative AI better, because it’ll have more data to start looking at in order to get you to that next level. That makes sense.
[00:14:53] Omesh Tickoo: Either way, it is generative AI, right? Because we're combining two different things you may already know of, but generating a third thing out of those two. So it is kind of in the domain of generative AI.
[00:15:04] Camille Morhardt: Okay. Well, we're close now, or we're there now, right? Just to understand from you: people cannot tell the difference between a human that has been generated from synthetic data and a real human?
[00:15:19] Selvakumar Panneer: Yeah, it's very, very close. Today, even computer-rendered images or scenes are much, much closer, if not almost identical, to real captures from cameras. Generative AI is taking that to the next level, where the mix of rendering plus AI gets us there, plus we have control over the content. Like you mentioned, we can de-age a person, or change somebody's hairstyle, or do all the things the entertainment industry needs. I think this is going to catch on.
And today, they're doing it more manually, and it requires a tremendous amount of compute. I think AI is really helping to bridge that gap, so content creation can be done much faster and with a lot of variety for the creator to pick from. They're not limited to just one style; they can go through thousands and thousands of styles to see how they want to bring that entertainment to the audience.
[00:16:19] Camille Morhardt: So what are some of the challenges with synthetic data? We already kind of walked through some of the unintended consequences, but we’re looking at how synthetic data is helping us kind of bridge this gap between not having enough real data to train a model. What are the bottlenecks in synthetic data itself right now?
[00:16:38] Omesh Tickoo: So from a training perspective, the topic you started with, there are multiple challenges. One is the challenge Selva was mentioning about platform-level performance itself. These are pretty heavy algorithms that need a lot of compute and memory thrown at them so they can run in a good-enough, real-time kind of scenario and we can actually get our data when we need it. The second thing is that while they generate synthetic data, they're also built upon something, and that something is generally exposure to the natural world: giving them data about things around them, then letting the uncertainty within the models loose to generate different combinations and different types of scenarios based on the few things you provide the model.
So from that perspective, we're still limited in the sense that we can do a very good job of generating synthetic data for a very specific use case, whether it's a car driving in California, a robot in a factory, or maybe a movie actor. But if we're talking about "I want to build a model that I can give to a customer, and that customer can just take it to a factory floor, record video there, and start generating data for me," we are not there yet. I think we're still working in very domain-specific silos. Wherever it's working, it's working really, really well. But scale-out is a challenge, and so is platform performance.
[00:18:02] Camille Morhardt: Okay, that’s helpful. And how does this intersect with probabilistic research?
[00:18:07] Omesh Tickoo: Oh, good question. We've been doing probabilistic research for a few years now, and the intersection is pretty natural. The natural world is fraught with uncertainties. Whenever we step foot outside our house, we don't know how the world around us will look, or how we'll react to it. Weather happens to us, noise in the environment happens to us, people running in front of our car happens to us. None of these things can be rule-based; they are complete uncertainties in our lives. That's where probabilistic computing shines, because it brings that element of probability and uncertainty awareness to AI systems, which is very important to making AI systems intelligent. Because if there were no uncertainty, we wouldn't really need AI; we could just build a rule-based system and it would be intelligent.
Now, if I go to generative models and I want to generate data that looks realistic, I've got to have that uncertainty in there; otherwise, it's not realistic. So that's where these two intersect: we bring the uncertainty of the real world in through probability measurements and probabilistic computing, and we bring the power of AI to generate these new worlds and new models. The model takes some things it observes, uses the uncertainty, and then produces things that look much more realistic, much closer to the real world. And if I run the generative model twice, I may get two different results, but both of them will be realistic enough that they'll appear to a human as, "Oh, this might be real data."
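The "run the generator twice, get two different but plausible results" behavior Omesh describes comes from sampling distributions whose parameters stand in for uncertainty measured from the real world. A toy sketch; the scene fields and distribution parameters are invented for illustration:

```python
import random

def generate_scene(rng: random.Random) -> dict:
    """Draw one driving 'scene' from simple distributions whose
    parameters stand in for uncertainty measured from real data."""
    return {
        "pedestrians": max(0, int(rng.gauss(2, 1.5))),   # count centered near 2
        "visibility_m": max(10.0, rng.gauss(300, 120)),  # metres of visibility
        "rain": rng.random() < 0.3,                      # ~30% chance of rain
    }

if __name__ == "__main__":
    # Two runs of the same generator: different scenes, both plausible.
    print(generate_scene(random.Random(1)))
    print(generate_scene(random.Random(2)))
```

Every draw respects the learned statistics, so each run is novel yet realistic, which is precisely the property that makes sampled data usable as training data.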
[00:19:27] Camille Morhardt: So Omesh, for you, what is the most exciting aspect of synthetic data, or the most terrifying, take your pick? What do you worry about, or what can't you wait for?
[00:19:39] Omesh Tickoo: Well, the exciting aspects are just getting there. When natural worlds and digital worlds combine, magic happens. We can do so many different things. We started with media, but we can build digital twins for any usage you can think of, because now you have control over dreaming up stuff, taking the real stuff, and bringing them together, working within these twins to experiment before you take things out to the real world, knowing exactly what the consequences would be. Those are really exciting. Looking at it from an AI practitioner's perspective, it's a great litmus test for claiming that AI is truly intelligent: if it can generate data that looks real and can fool a real person, we are almost there, right? So that's very, very exciting.
The same thing is also very terrifying. Looking at it from the perspective of something that has a life of its own, generating worlds around us that look real when we don't have the capability to tell the two apart, one could think of so many things that can go wrong. We have to be very responsible, keep an eye on where we're going, and make sure everything has checks and balances along the way. I think that's true of any new technology, and this is no different; it's just that the stakes seem much higher here. But at the same time, if we do 10% of that in my lifetime, I will be really happy.
[00:21:02] Camille Morhardt: A similar question for you, Selva. I know we were talking about how we have kids the same age, so what would you tell them, knowing that synthetic data is out there?
[00:21:12] Selvakumar Panneer: The merger of these two worlds brings the physical world into the virtual world, and you have tremendous opportunity to do anything you want in the virtual world. Things that aren't possible in the real world, we can actually do in the virtual world. This virtual world is a combination of rendering and AI coming together. And the way I see the next generation, whether it's Gen Z or Gen Alpha, they will see AI as a foundational block for whatever they want to create. Our current generation used SDKs, modules, libraries and plugins, but I think the next generation will use a combination of programming models and AI. To them, AI is just another building block.
And these building blocks really need to be as rock solid as they can be. To do that, I think synthetic data is required. One other point, building on Omesh: these digital twins or characters that get created are immortal; they persist in the world forever. So the next generation may have things that persist forever, and I don't know how that's going to evolve, but I think we are on the verge of creating the foundational blocks for the next generation to build upon.
[00:22:32] Camille Morhardt: That’s an interesting perspective, that we would need to consider things like expiration dates on things that are generated. Yeah. Very interesting. Thank you both. We have again Selvakumar Panneer and Omesh Tickoo with us from Intel Labs, talking about synthetic data.
[00:22:49] Selvakumar Panneer: Thank you.
[00:22:49] Omesh Tickoo: Thank you very much.