Camille Morhardt 00:13
Hi, I’m Camille Morhardt, host of InTechnology podcast. And today we wrap up our “look back” at your favorite podcast topics of 2023. It probably won’t surprise you that AI and machine learning are at the top of the list. 2023 has been the year of ChatGPT and generative AI. And also the setting of some ground rules around it through an Executive Order in the US and the European Union’s AI Act.
We’re going to touch on all these topics today, starting with a conversation I had earlier this year with Andres Rodriguez, a fellow at Intel. Andres is an expert on deep learning, which is a subset of machine learning and was used to create ChatGPT.
Andres Rodriguez 00:54
The way machine learning typically works is you have some input data. For example, let’s say you’re trying to predict the price of a home. So you have a bunch of data about the features of the home, like the number of rooms and the size of the home. You take these features and put them through a machine learning algorithm, and the output is a price: a few transformations of the data produce the output.
In deep learning, you have multiple layers of transformations that you’re applying to the input data. To differentiate this from traditional machine learning, take the problem of image classification. You pass in a set of pixels that correspond to an image. In the past, you required somebody with expertise in computer vision to extract features from the image, and then you would pass those features into the machine learning algorithm. But today, with multiple layers in your deep learning model, you can pass the raw pixels into the model and out comes the class of the image. So, for example, this is a cat or a person or a fruit, et cetera. And this has become possible more recently because of computational advancements, which let you train larger, deeper models, and because of access to much larger datasets, which are needed in order to train the multiple layers that deep learning models have.
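Andres’s picture of “multiple layers of transformations” can be sketched as a toy forward pass. Everything here, the weights, the class names, and the four-pixel “image,” is invented for illustration; the point is only the shape of the idea: raw inputs go through stacked layers and out comes a class.

```python
def relu(v):
    # Simple nonlinearity applied between layers.
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    # One layer of transformation: matrix-vector product plus bias.
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

# Toy "image": four raw pixel values, no hand-crafted features.
pixels = [0.2, 0.9, 0.1, 0.7]

# Two hidden layers of made-up weights, then an output layer with one
# score per class -- the "multiple layers" of a deep model.
h1 = relu(dense(pixels, [[0.5, -0.2, 0.1, 0.3],
                         [0.1, 0.4, -0.3, 0.2],
                         [-0.2, 0.3, 0.5, -0.1]], [0.0, 0.1, -0.1]))
h2 = relu(dense(h1, [[0.3, -0.1, 0.2],
                     [0.2, 0.5, -0.4]], [0.05, 0.0]))
scores = dense(h2, [[0.6, -0.3], [-0.2, 0.4]], [0.0, 0.0])

classes = ["cat", "person"]
print(classes[scores.index(max(scores))])
```

A real model would learn these weights from data rather than hard-coding them; the hard-coded values just make the layering visible.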
Camille Morhardt 02:37
So, what are the limitations to the deep learning model? I think you just said one, right? It requires a lot of data.
Andres Rodriguez 02:45
Large amounts of data, although there are ways around that. Another one is large amounts of computations. But again, there are some ways around that, as well. Nowadays, there are very large models that are being trained from scratch. So, you take a model. Usually, in the beginning, the model is composed of weights that are random. And as you go through multiple iterations of the training, then the model converges. Now, this requires a lot of data and a lot of compute.
But once you have a trained model, let’s suppose you want to use that model for a similar application–not exactly the same one that it was trained for, but one that shares some of the similar characteristics. What you can do is you can take that already pre-trained model and apply a smaller dataset that is specific for your problem that you’re trying to solve, and you can re-train it with the pre-trained model as a starting point. This is often called fine-tuning or transfer learning, and this process requires less compute than when you’re training a model from scratch.
So, the idea that you need humongous amounts of data or humongous amounts of compute, it’s only true when you’re training large models from scratch. There are a number of pre-trained models available in the open-source domains that you can pull from and do fine-tuning on these models for your particular problem that you’re trying to solve.
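The fine-tuning idea Andres describes, keeping a pre-trained model fixed and re-training only a small task-specific part on a small dataset, can be sketched like this. The “pre-trained” feature extractor and the target function are hypothetical stand-ins.

```python
import random

random.seed(0)

# Stand-in for a pre-trained model: a fixed feature extractor whose
# weights we do NOT update during fine-tuning.
def pretrained_features(x):
    return [x, x * x]

# New task-specific head, initialized randomly and fine-tuned on a
# small dataset (here: learn y = 3*x + 2*x^2 from 20 points).
w = [random.random(), random.random()]
data = [(x / 10.0, 3 * (x / 10.0) + 2 * (x / 10.0) ** 2) for x in range(20)]

lr = 0.1
for _ in range(2000):
    for x, y in data:
        f = pretrained_features(x)
        pred = sum(wi * fi for wi, fi in zip(w, f))
        err = pred - y
        # Gradient step on the head only; the extractor stays frozen.
        w = [wi - lr * err * fi for wi, fi in zip(w, f)]

print([round(wi, 2) for wi in w])
```

Because only the small head is trained, far less data and compute are needed than training everything from scratch, which is the point Andres is making.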
Camille Morhardt 04:23
That was Andres Rodriguez, a fellow at Intel and an expert in deep learning.
The issue of data and where to get it was the focus of another popular AI-related episode of InTechnology this year. This one was about synthetic data. I’ll let Selva Panneer and Omesh Tickoo, Principal Engineers at Intel Labs explain.
Selvakumar Panneer 04:45
So synthetic data is artificially generated data created using a computer, and there are two kinds: one generated using programming models and another generated using AI. Both follow rules while generating the data, so it’s not like random data, and this data is used in many ways today.
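A minimal sketch of the first kind, rule-based programmatic generation, using the home-price example from earlier in the episode. The pricing rule and its coefficients are invented for illustration; the point is that the generator encodes domain rules rather than producing random noise.

```python
import random

random.seed(42)

def synthetic_home():
    # Rule-based synthetic sample: features drawn from plausible ranges.
    rooms = random.randint(1, 6)
    sqft = random.randint(500, 4000)
    # The generating rule encodes (hypothetical) domain knowledge
    # plus realistic noise -- this is what makes it non-random.
    price = 50_000 + 30_000 * rooms + 120 * sqft + random.gauss(0, 10_000)
    return {"rooms": rooms, "sqft": sqft, "price": round(price, 2)}

dataset = [synthetic_home() for _ in range(1000)]
print(len(dataset), dataset[0]["rooms"])
```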
Camille Morhardt 05:05
So to be clear, this is fake data. I understand there’s rules so that we’re trying to mimic reality or realistic conditions, but it actually is fake. We’re generating it, making it up.
Selvakumar Panneer 05:16
Yes, that’s right. They are generated.
Omesh Tickoo 05:18
Our AI models are very data hungry, and we don’t expect to find that data every time we want to build a new AI model. So people are using fake data, or synthetic data, to train these models. We also generate this data as content itself; for example, if I’m building content for a media house and I want to picture scenarios that may not be realistically possible but have to be as close to realism as possible, then again it’s generated synthetic data that’s being used, versus synthetic data used to build an AI model.
Camille Morhardt 05:47
Are we seeing it like … for example, I think of the scenario in industrial use cases where we have a lot of known good data. If you’re looking for defects or something–by “we,” I mean the world–most factories do pretty well. Most of the products come out looking good. And so we have a lot of data of how it’s supposed to look, but we don’t do so well with when there’s a problem. So is that the kind of scenario where you’re generating synthetic data so you can train a model to notice when something is wrong?
Omesh Tickoo 06:21
It’s interesting that you brought up defect detection. That’s something we’re actively working on. And there, there is a use of synthetic data in exactly the way you said, because defects can come in all different sizes, shapes, and forms, and you really cannot train for every single kind of defect because you don’t know all the defects that you will see in the real world. So there are two ways to approach this. One is to say, I know all my good data and I’m going to use that information to figure out what’s bad. But that only allows you to tell bad from good. It doesn’t allow you to, for example, say whether there’s a crack, whether there’s a break, whether somebody forgot to do something. If you want to get to that level, synthetic data really helps, because you can actually produce those kinds of augmentations and combinations and train your model so that those things can be detected in the real world without actually needing them to happen when you train your model initially.
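The augmentation idea can be sketched as follows: start from a known-good sample and programmatically inject a labeled defect. The 8x8 “image” and the crack generator are toy stand-ins, not an Intel pipeline.

```python
import random

random.seed(1)

# A clean part as an 8x8 grayscale grid (1.0 = good surface).
clean = [[1.0] * 8 for _ in range(8)]

def add_synthetic_crack(img):
    """Overlay a hypothetical crack: a jagged dark line, one pixel per row."""
    out = [row[:] for row in img]
    col = random.randint(1, 6)
    for row in range(8):
        col = min(7, max(0, col + random.choice([-1, 0, 1])))
        out[row][col] = 0.0  # dark defect pixel
    return out

# Pair each synthetic defect with its label, ready for training a
# detector on defects that were never observed in the real factory.
samples = [(add_synthetic_crack(clean), "crack") for _ in range(100)]
defect, label = samples[0]
print(label, sum(v == 0.0 for row in defect for v in row))
```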
Camille Morhardt 07:09
I just want to hit one other use case that I’ve heard of is autonomous driving. So how is it being used there?
Selvakumar Panneer 07:15
They need a lot of data to train cars to perform in all scenarios: weather conditions, environment changes. And most likely, we don’t have a lot of that data. We can collect data, but we hit limits, and this is where synthetic data can help: understanding where the potholes are, what action needs to be taken. So we can generate this data today. Like I mentioned, we can actually render it and use it as datasets, or we can even generate visual data using AI and feed that data back into these autonomous agents to see what action they need to take when such a condition occurs. Data is going to help quite heavily in those cases, and this is where we need synthetic data.
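One common way to get coverage of rare driving conditions is to enumerate scenario combinations and render each one. This sketch only builds the scenario grid and a hypothetical action label; a real pipeline would feed each scenario to a renderer or simulator.

```python
from itertools import product

# Enumerate conditions we may never capture enough of on real roads;
# each combination can seed a rendered synthetic scene.
weather = ["clear", "rain", "fog", "snow"]
lighting = ["day", "dusk", "night"]
hazard = ["none", "pothole", "pedestrian", "debris"]

scenarios = [
    {"weather": w, "lighting": l, "hazard": h}
    for w, l, h in product(weather, lighting, hazard)
]

def expected_action(s):
    # Hypothetical labeling policy, for illustration only.
    if s["hazard"] == "pedestrian":
        return "brake"
    if s["hazard"] in ("pothole", "debris"):
        return "steer_around"
    return "continue"

labeled = [(s, expected_action(s)) for s in scenarios]
print(len(labeled))
```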
Camille Morhardt 08:00
That was Selva Panneer and Omesh Tickoo of Intel Labs talking to me about synthetic data. Our conversation also touched on using “generative AI” to “generate” people and scenes in films and on “digital twins.” To catch the entire episode, click on the link in the show notes.
As I mentioned at the start of this episode, the release of ChatGPT and similar AI platforms was a watershed moment in 2023. These platforms are what’s known as large language models–or LLMs—and they can help with text-based tasks, from writing a letter to summarizing thousands of pages of contract text. With their arrival there was a lot of excitement, but also a lot of questions about trust and accuracy. One of the people we reached out to, to help us better understand how these models work, is Sanjay Rajagopalan, Chief Design and Strategy Officer at Vianai Systems. Sanjay talks a bit about how they can tend to go off the rails, and also about how and why enterprises are adopting large language models.
Sanjay Rajagopalan 09:10
Of course, it sounds articulate, it sounds believable, it sounds knowledgeable. And it is, but every once in a while it is confidently incorrect. It goes back to how these systems actually work. What these systems are doing is generating the next best word given a sequence of words. And the reason they’re able to do that is they’ve seen a lot of sentences—billions of sentences. They’ve been trained on the entire web, all of Wikipedia, all the electronic books that we have–tens of thousands of them. And so they’re able to very, very rapidly, using an immense amount of compute, predict the best next word given a sequence. Now, if I were to just tell you to fill in the blank, “the sky is ___”, most people would be able to fill it in almost immediately, and most people would say the sky is blue, right? So it turns out this system can do the same thing; it can complete that sentence; it can find the next best word. But if every time I gave it the sentence “the sky is ___” it came back with “blue,” it’s doing the most cliched, boring thing it can do, right; it’s always completing the sentence with the most likely word. And that, it turns out, is actually quite non-human. It’s very robotic.
If instead I said “the sky is ___” and someone was to say, “the sky is an amazing window to the rest of the universe through which you can see the beginning of time,” well, that’s a different completion of “the sky is.” That sounds more human-like. So in order to get the system to start doing those things, you have to introduce a little bit of randomness into the process of predicting the next word. Now, the problem with that type of design is that every once in a while the randomness takes it off the rails, and it goes in a direction which is random and incorrect, versus random and cute. It starts saying things that are completely wrong, because it’s really trying to say something that looks right, versus trying to say things that are correct.
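Sanjay’s “little bit of randomness” is typically implemented as temperature sampling over the model’s next-word scores. The toy scores below are invented; the mechanism is the point: temperature zero always picks the cliche, while higher temperatures occasionally pick something else.

```python
import math
import random

random.seed(7)

# Toy next-word scores for "the sky is ___" (made-up values).
logits = {"blue": 5.0, "falling": 2.0, "a window to the universe": 1.0}

def sample_next_word(logits, temperature):
    """Softmax sampling with temperature: low T is greedy and
    repetitive, higher T allows surprising completions."""
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: the most likely word
    scaled = {w: math.exp(s / temperature) for w, s in logits.items()}
    total = sum(scaled.values())
    r = random.random() * total
    for word, weight in scaled.items():
        r -= weight
        if r <= 0:
            return word
    return word

print(sample_next_word(logits, 0))
print({sample_next_word(logits, 1.5) for _ in range(200)})
```

At temperature 0 this always prints “blue”; at higher temperatures the rarer completions show up occasionally, which is exactly the trade-off between “boring but safe” and “creative but sometimes off the rails.”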
Camille Morhardt 11:29
Sanjay told me we’re still at the point where we need to use additional tools and human review to make sure there’s accuracy with LLMs. But that doesn’t mean, in his opinion, businesses can’t start using them to make many tasks quicker and easier.
Sanjay Rajagopalan 11:44
Everyone’s gotten used to using ChatGPT now: you ask a question and it answers. Well, I could do that on HR documents; I could do that on contracts. I could do that, potentially, on any type of database in the back end. I use the conversational UI not to generate the answer, but to generate the code that can be executed in order to extract the answer from the database, right? So I’m able to ask a question and get some data out of a database by asking the language model to write the code.
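The pattern Sanjay describes, generating code rather than the answer, can be sketched with SQLite from the Python standard library. Here `llm_generate_sql` is a placeholder returning a hand-written query where a real model call would go; nothing about the model itself is shown.

```python
import sqlite3

# In-memory stand-in for an enterprise database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE purchases (item TEXT, qty INTEGER)")
db.executemany("INSERT INTO purchases VALUES (?, ?)",
               [("widget", 120), ("gadget", 30), ("widget", 80)])

def llm_generate_sql(question):
    """Placeholder for a real LLM call: the model would translate the
    question into SQL. The string below is a hypothetical response."""
    return "SELECT SUM(qty) FROM purchases WHERE item = 'widget'"

# The conversational UI generates code, we execute it, and the
# database produces the answer -- the model never invents the number.
sql = llm_generate_sql("How many widgets did we buy?")
(total,) = db.execute(sql).fetchone()
print(total)
```

In practice the generated SQL would itself need validation before execution, which is part of the tooling Sanjay describes later.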
It’s also able to summarize anything, right? So if you have a large document corpus–say I have a 100-page document and only have time to read two paragraphs: “Can you tell me in two paragraphs what this 100-page document is saying? Summarize it, make some bullet points.” It’s able to do that, but sometimes it hallucinates, so you have to go back and check: did that actually exist in the document? So you have to have techniques for pre-processing and post-processing to check that it’s not hallucinating in that kind of scenario.
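One possible post-processing check, a simple sketch and not Vianai’s actual method, is to flag summary lines whose key terms never appear in the source text.

```python
# Flag summary lines with too few key terms grounded in the source.
source = (
    "The contract grants a 5 percent discount once annual widget "
    "purchases exceed 10,000 units, and 8 percent above 25,000 units."
)

summary_lines = [
    "Discounts start at 5 percent above 10,000 units.",
    "Free shipping is included on all orders.",  # hallucinated line
]

def unsupported_lines(summary_lines, source, min_hits=2):
    src = source.lower()
    flagged = []
    for line in summary_lines:
        # Keep only longer words as crude "key terms".
        words = [w.strip(".,") for w in line.lower().split() if len(w) > 4]
        hits = sum(w in src for w in words)
        if hits < min_hits:
            flagged.append(line)
    return flagged

print(unsupported_lines(summary_lines, source))
```

Real systems use stronger checks (entailment models, citation back-links), but even this crude filter catches the invented “free shipping” claim.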
Maybe I can take a specific example that I have worked on. In the enterprise, many times you have to match a piece of data with a piece of text. Imagine you are, let’s say, eligible for a discount if you buy at least a certain volume of widgets. And that discount is contained within some contractual language, which says something like: if you get to this level of sales, you get this much discount; if you get to the next level, you get more discount; and so on and so forth. In many cases, the contractual language that was negotiated gets put into a PDF file in some lawyer’s folder or something like that. If you’re lucky, some of that might be pulled into a pricing system or a payment system. In most cases, it’s forgotten. Companies have tens of thousands of contracts, and they may never know that they’re eligible for some benefits.
Now you have a system to which you could say, “Well, based upon my actual purchase, which is available in a database, check that against what the contract says I’m eligible for, and if I’m eligible, make that the new price that I would pay.” So that kind of chained workflow, where it’s extracting information from a database, comparing it to the language in a contract, and as a result taking action that actually drives business value, these kinds of things start becoming possible. But with a lot of help, not just out of the box: with a lot of tools and components that need to come in to make sure the whole thing is done without hallucination, without errors, and with human oversight, so that if something goes wrong, people are looking at it to make sure it’s not doing something wrong.
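A sketch of that chain with the model step replaced by its hypothetical output: the purchase volume stands in for the database lookup, and the tier list stands in for what an extraction model might pull out of the contract PDF. All numbers are invented.

```python
# Stand-in for the database lookup of actual purchases.
actual_purchase_volume = 12_500

# Stand-in for what an extraction model might pull from contract
# language (hand-written here; a real pipeline would validate this).
contract_tiers = [
    {"min_volume": 10_000, "discount": 0.05},
    {"min_volume": 25_000, "discount": 0.08},
]

def eligible_discount(volume, tiers):
    # Pick the best discount whose volume threshold has been met.
    best = 0.0
    for tier in tiers:
        if volume >= tier["min_volume"]:
            best = max(best, tier["discount"])
    return best

list_price = 40_000.0
discount = eligible_discount(actual_purchase_volume, contract_tiers)
print(discount, list_price * (1 - discount))
```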
Camille Morhardt 14:31
Sanjay Rajagopalan, Chief Design and Strategy Officer at Vianai Systems.
In this last segment on favorite AI episodes of 2023, we’re going to look at AI regulations. You no doubt saw news stories this past year about CEOs from OpenAI, Microsoft, X and other tech companies meeting with lawmakers in the United States to talk about AI and the need for rules and safeguards around its use. And late this fall both President Biden of the United States and members of the European Union released proposed rules and guidelines about AI. I spoke with Chloe Autio, an independent advisor on AI Policy and Governance about where we stand on regulating AI in the United States.
Chloe Autio 15:21
In just two months, ChatGPT became the fastest-growing consumer tool to reach 100 million users, and that’s, that’s crazy, right? Before that we were talking, you know, TikTok and Meta. And so to have these kinds of technologies in the hands of people also raises questions about how they can be used by bad actors, right, malicious actors. And so I think that a lot of offices within the White House and across the Executive Branch have sort of leaned into this discussion and said, “Hey, what do we need to be doing to control these technologies?”
So earlier this year, the White House got about fifteen companies to agree to a number of commitments on the security, the safety, and trust in AI systems, and particularly focused on these really powerful foundation models that form sort of the basis for large language models and the chatbots and really, really powerful models that have captured public attention lately.
Most of the bills I’m seeing are really focused on striking the right balance, threading the needle between protecting rights, values, and civil liberties, and also fostering innovation in the technology, or at least not limiting innovation too much. And an underlying theme that really supports that is national security and competition with China. As you know, China is our global competitor. They’re also a major competitor in AI research, cranking out AI journals, citations, and papers and making contributions to the AI space. And so there’s a lot of concern from US lawmakers about what’s been dubbed the “AI race,” though I’m not enamored of that term, to make sure that the US stays competitive, as a government and as an industry, in developing really powerful AI models, and also to prevent China from getting too powerful with AI development.
But the reality is that most AI development is not happening at that level; it’s in and across the enterprise. It’s not foundation models. It’s not models trained with billions of parameters like these foundation models are. It’s what I would maybe call clunky AI: computer vision models, reference implementations. And that’s not to say these technologies aren’t advanced, but it’s the technologies and the intensive data workflows and applications that are really being used and adopted right now that are creating real harms, like biased algorithms used in hiring contexts, or algorithms used to make decisions about loan eligibility, that sort of thing. But I think what’s missing from this conversation, particularly in policy circles, is a focus on how we can address the harms and concerns with AI that are happening today, and how we can shift the focus back, or at least maintain focus, on the AI that was being used, and that needed governance and regulation, before ChatGPT and foundation models entered the room.
Camille Morhardt 18:11
That’s Chloe Autio, an independent advisor on AI Policy and Governance. Chloe and I tackled many more issues and developments on AI regulation, so check out our entire conversation, and one we had with other AI experts on the President’s Executive Order on Artificial Intelligence.
You’ll find links to both those episodes in the show notes. And as always, you can find more InTechnology episodes on YouTube or wherever you get your audio podcasts.