Camille Morhardt 00:28
Hi, I’m Camille Morhardt and welcome to InTechnology podcast. Today we’re going to take a closer look at Learning Language Models, or LLMs. You’re no doubt familiar with the more popular models such as ChatGPT or OpenAI. My guest today knows a lot about these and the models . Sanjay Rajagopalan is Chief Design and Strategy Officer at Vianai Systems. He works with enterprises on ways they can utilize LLMs in their business. We’re going to talk today about how LLMs works, what tasks they can perform, and what to do when – as Sanjay says—they “go off the rails.” Welcome to the podcast, Sanjay.
Sanjay Rajagopalan 01:07
Thank you. Thanks for having me.
Camille Morhardt 01:10
Can you start by telling us like, why are enterprises even interested in adopting large language models?
Sanjay Rajagopalan 01:16
Perhaps the reason enterprises are interested in this kind of overlaps with why everyone is interested. And so, of course, one of the things is it’s very accessible, anyone can experience it super easily just log into one of these systems OpenAI ChatGPT, etc. And there it is, you have it, you can try it out. And the thing is that it is really impressive. Imagine, you know, typing in anything into a single interface, and it talks to you as if it’s a real person. People find that unusual, and it kind of is attractive.
So if you’re wondering how is it doing that what’s going on, that allows a machine to speak or write in ways that a human word so you know, the way I think about it is, at least this is my personal experiences is super, super impressive for about 15 minutes. You just see it and like, “wow, look what it can do.” But then if you spend any significant time with it, let’s say you have an extended session, or try to accomplish some kind of complex task with it, it’s performance rapidly deteriorating, at least in your in your mind. I mean, it’s still doing the same thing. It was always doing it, but kind of suddenly it feels a little bit off. Things it’s doing is like, “wait a minute, this is not quite right.” And, you know, it depends upon what you’re trying to do. But most folks, I believe, will find it somewhat frustrating to get a task done, once they go past that initial, like 15 minutes of being extremely impressed with it.
And what’s happening is that, of course, it sounds articulate, it sounds believable. It sounds knowledgeable. And it is, but every once in a while it is confidently incorrect. If a human being was knowledgeable, believable, articulate, confident, we would say that person who is intelligent, right? But if they would every once in a while, just make things up, tell you things, which are completely incorrect. And do so in a very confident and believable way you will say, “well, something’s off about this person, one in every 10 things they say seems to run off the rails.” You have this situation where the system is articulate and knowledgeable every once in a while, it goes off the rails. And that’s what’s called hallucination. And enterprises and the executives of enterprises engage with these systems just as anyone else would. In the first 15 minutes, they are super impressed. They may not engage with it beyond that. But in that 15 minutes, they become convinced this is a transformational technology. It’s disruptive, it can do anything. And then they say, “Well, how can we use this in order to make a difference in the enterprise? Can it drive value? can it drive productivity? can we automate things? Can we eliminate inefficiencies? So the thoughts go in that direction. And they are right that it can do all of these things, but there is some caution, which is, it cannot do all of these things, without a lot of effort and a lot of work.
Camille Morhardt 04:19
You mentioned “going off the rails” and “hallucinations.” I definitely want to explore those further. But first, would you talk about what tasks LLMs are best suited for? and then explain what hallucinations are.
Sanjay Rajagopalan 04:33
What it is really good at is tasks that tend to be somewhat forgiving, in terms of the need to be accurate all the time, right? Imagine any task that kind of gets you to the ballpark, but isn’t claiming to solve a problem completely, like automate completely. You rapidly want to get close, and then the last mile is something that people would do in terms of like back and forth with a lot of editing and improvement. So it’s really good clearly at getting you close without getting you all the way there. So writing me a first draft of whether that’s an email report that I’m writing, so getting me close and then give me some ideas, some starting points. It could be really good as a brainstorming tools. “Hey, this is what I want to get done. Give me some kind of starting point give me some ideas.” You know, it can really help with writer’s block, for example.
And what if there are many possible good outcomes? For example, I want to write a poem, right? There is not one answer to what’s a good point, actually, these systems aren’t great at poetry, they graded like doggerel, right? You know, they writes witty little things. And as long as that’s what you’re looking for those kinds of tasks, it tends to be good at. So things which are somewhat tolerant to hallucinations, and I’ll talk a little bit about hallucinations and what it is and why it is. But personally, for example, I’ve been super impressed with its ability to write code. And there are reasons why it’s particularly good at writing code, the biggest one being, there’s a huge, immense amount of code available publicly for these models to train on. So it’s seen a lot of code. But also, code in itself tends to be a little bit more structured than natural language as such. So it does pick up on those patterns and structure, so it’s able to reproduce those fairly well.
You know, so what is hallucination the way I would define it? I know that probably is a formal definition somewhere, which is based upon some measure, but I really define it as the tendency of these systems to be articulately believably confidently incorrect. That to me is hallucination. It doesn’t say like, “Hey, I really don’t know what I’m talking about. But if I was to guess this, is it.” It says, “This, is it.” That’s all it says, and then it throws that in the middle of a lot of things it is truly correct about it’s Correct, correct. Correct. And then it’s completely incorrect and then Correct, correct? Correct. Correct.
Camille Morhardt 07:06
So why is it doing that what’s happening where suddenly it’s making something up?
Sanjay Rajagopalan 07:11
Well, it goes back to how these systems are actually working. By now it’s more or less, everyone knows that what these systems are doing is generating the next best word given a sequence of words. And the reason they’re able to do that is they’ve seen a lot of sentences—billions, billions of sentences. They’ve been trained on the entire web, all of Wikipedia, all the electronic books that we have 10s of 1000s of them. And so they’re able to very, very rapidly—using an immense amount of compute—they are able to predict given a sequence what the best next word is. Now, if I was just tell you to fill in the blanks, “the sky is ___”, most people would be able to fill in the blanks right? Almost immediately, and most people will say the sky is blue, right? So it turns out, this system can do the same thing, it can complete that sentence; it can find the next best word. But if every time I gave it, the sentence, “the sky is ___,” and it came back with blue, it’s doing the most cliched boring thing it can do right, it’s always completing the sentence with the most likely word. And that it turns out is actually quite non-human. It’s very robotic.
If instead, if I said “the sky is, ___” and someone was to say, “the sky is an amazing window to the rest of the universe through which you can see the beginning of time.” Well, that’s a different completion of the sky is it that sounds more human like like, sometimes you have this kind of tendency to be poetic about what it is. So in order to get the system to start doing those things, you have to introduce a little bit of randomness into the process of predicting the next word. So what the developers of these systems have done after testing it many, many times is come up with this hyper parameter called “temperature,” which is like, you know, it’s how or how much randomness to put in, in picking the next word. If you put in zero randomness, and it always picks the most probable next word, it tends to sound robotic, and cliche. So you can actually modify the amount of randomness. And it turns out that with a little bit of randomness, it’s tends to be much more human like and comes up with surprising answers. Surprisingly human because it tends to say more interesting things. Like there’s always a straight path, and a little bit of a path, less taken and if you can give it a little bit of that randomness, that predicting the next word picks you through a more interesting path to your destination. So that’s how the system is designed. It’s designed with a little bit of randomness, in order to give it that sense that it is being conversational.
Now, the problem with that type of design is that every once in a while randomness takes it off the rails and you know, it just like it goes in a direction which is random and incorrect, right versus random and cute. It starts saying things that are completely wrong, because it’s really trying to say something that looks right, right, versus trying to say things that are correct.
Camille Morhardt 10:27
who is setting these kinds of parameters? And also who is setting how random, any given result is going to be, like you said, if I’m trying to write a poem, I might be way more enthusiastic about randomness than if I’m trying to repair my dishwasher. And I just want to know exactly, you know, what kind of screw to get didn’t make it work?
Sanjay Rajagopalan 10:49
Absolutely, no, those are great questions. And in terms of the amount of randomness, you can set it, if you go to like BingC hat, they ask you at the very beginning how accurate you want it to be. And all they’re really doing is setting that parameter of randomness. It’s called “temperature.” If you set that to be zero, it’s going to do the most predictable thing. And in that case, it would kind of robotically answer the question, but it will also tend to answer the question exactly the same way each time you ask the question, right? So if you ask the same question ten times, it’s going to answer exactly the same way, or ten times. Now, if you set that temperature parameter little bit higher, than you’ll get some variation in how it answers the question, it’ll approximately answer it the same way every time. But you’re gonna get it answering slightly differently each time. And then if you set it very, very high, it’s just going to go all over the place—it may not even answer the question, it may just end up somewhere else.
And so the key thing is to understand that it has no knowledge of the real world, it has no knowledge of the meaning of words. So even if you had a good way of checking facts, what is you know, authoritative sources and not in today’s world, all that is doesn’t have consensus. Even if you did have the way to do that it doesn’t have any knowledge in terms of semantic understanding of what it’s talking about. It doesn’t know when I say “the sky is blue,” that blue is a color and what does the color even mean? It’s just trying to pattern match, it’s trying to pattern match with all the other texts it’s seen before. And it’s trying to make it look close to what it’s seen before. And that’s all it’s actually some kind of a statistical machine.
And the reason we start assigning some kind of human qualities to it, is because language is so key to the way we understand each other. And so when a machine starts speaking to us in a way we have not actually experienced other than from humans, we start assuming it has all the human qualities like emotion and a knowledge about the real world and sentiments and we might even say, it has consciousness, but in fact, it has none of those things. It simply is regurgitating stuff it has seen before with some randomness thrown in.
Camille Morhardt 13:03
So a couple questions off that before we move on just really is fascinating. It’s reading everything that essentially exists online. So what kind of limitations does that have? I mean, that seems to me like, I would say, certain languages are not online nearly as much as other languages. I’m guessing English is the most prominent.
Sanjay Rajagopalan 13:24
Yeah, for sure. And, of course, people have trained special language models and other languages because they have access to let’s say, a digital corpus of books in Thai or Chinese and so on. And, and they’re able to do that, that those language models are out there. But yes, it is basically looking at everything out there.
But the important thing is all the things out there have good content and toxic content, right? So it seems all of that stuff. And on average, it turns out that the good stuff, or the non-toxic stuff is more prevalent than the toxic stuff, right? There’s less toxic stuff. So it tends to be most of the time, it’s kind of doing the right thing. And every once in a while it goes off the rails into like toxic land. And it gets something that it’s, you know, it’s saw in some place, which was full of lies, and it’s picking that as a truth. So it has no way of telling other than law of averages, right? Because it seems so much that it’s trying to predict the most common things with a little bit of randomness thrown in. But at some point, if it’s kind of drifting towards toxic land, it might just go into that. And that’s basically what “jailbreak” is jailbreak is trying to make it do things, which it has been told, should not be done. The way I try to think about large language models is it’s like building a monster, and then putting the monster on chains so that it doesn’t really cause destruction. But every once in a while the monster breaks past the chains, and then you have to chain it down again, that kind of thing.
Because of the fact that these are massive systems. We’re talking trillions of nodes, we’re talking billions and billions of documents, a lot of energy use in order to train these systems on a lot of content, which is good content and not so good content, you are truly creating a monster which could say things do things or reveal things which no individual human has access to so much data at the same time. And so the companies which build these systems realize they’re putting themselves at a lot of risk to put it out there without any constraints. And so they put a lot of constraints on it. And then smart people figure out how to break through those constraints and make the real monster show its face. And then suddenly the company realizes, “Oh, that’s a problem.” And they tamp it down. And so that’s what is going on.
Camille Morhardt 15:55
Yeah, I just read a paper about that, where they had gotten a number of commercial large language models to respond with “how to destroy humanity.” And they go through, you know, so pretty horrific, right. So that would be the “jailbreak” example.
Sanjay Rajagopalan 16:13
Even if you don’t say destroy humanity, you could get it to do things like give me a recipe for building a chemical weapon. And if it has seen that content somewhere in the Chemistry Journal somewhere, it can potentially give you that it’s very, very hard for people to stop those kinds of misuse of the system.
Camille Morhardt 16:32
And then can you just tell us what “alignment” is because that’s another thing that’s I think maybe one of those chains.
Sanjay Rajagopalan 16:38
Sure “alignment” refers to a whole bunch of different techniques, to get the system to do what humans expected to do alignment means aligning the system’s output with the expected output. And usually it’s done by showing the output to human beings and asking them to read it, for example. So there’s something called RLHF, reinforcement learning through human feedback. OpenAI, Google, these companies have 1000s of people whose only job is to ask the system various questions and look at the answers and then read those answers. How good are those answers? And even on a sentence by sentence basis, say, “well, that sentence is a good sentence, but that one’s not that great.” And then they send that back and then they start modifying the system through some fine tuning and training to reduce the amount of undesirable content. And that reduction of the undesirable content is called “alignment.”
So in a way, alignment is saying, “hey, it could be technically right. But people don’t like that answer. So people will read it. And then when they read it, we can use that rating or feedback, human feedback to make the system more likeable.” That’s the alignment thing. And so there are many, many techniques to do that and the simplest one is, of course, getting an army of people to constantly watch what the system was doing, and thumbs up and thumbs down, and then figure out what’s common about all the thumbs down thing and tried to fix it in the system.
Camille Morhardt 18:03
So let me ask you one more little sidebar. So my dad asked one of the language models for some researcher papers on how to power lines affect birds, I think it was something like that. And it came back with some answers. And he said, “Oh, okay, could you please provide the sources to your answers?” And so it provided, you know, references like papers. And so he looked up the papers on all the different engines, you know, through academia, where you can locate basically any peer-reviewed paper, and he couldn’t find any. So he wrote back and said, “I can’t find these references anywhere.” And the language model wrote back and said, “Well, I made them up. Like you asked for references. So I provided you references, I made them up. They’re not real references.” Can you ask for a reference? And it’s searching the entire internet for what references look like? So it can create them? But how does that kind of thing fit in?
Sanjay Rajagopalan 18:55
A classsic example and I encountered the exact same problem, right? And what is really happening is, like I said, it is trying to provide you with answers which look right. Look, right. And the way it’s doing is it’s taking everything it’s seen before. And composing something that looks similar, doesn’t know that a reference, let’s say the name of the book, the author, the title, the publisher, etc, as a whole, has to be all kept together. It doesn’t realize that because it doesn’t know that this is a reference, and it’s been trained on tokens, which are sub components of the sentence, right? So what it’s doing is it’s taking the first name of one author, the second name of another author, the first half of the name of the book from one book, the second half from another book. You see what it’s doing? It’s comparing the look of a citation with all the citations it’s seen before and trying to find the best one which matches the query. Right? In doing that, it is not keeping the entire citation intact. And it’s throwing in random components because like I said, the random words are happening to happen in the middle of the citation. That’s where the randomness is. So it’s taking that liberty, poetic liberty to compose a citation. It doesn’t know that, oh, citation is sacrosanct. You cannot mess with it. It thinks a citation is like a poem and it’s creating that citation with that in mind, right?
Now, it’s easy to fix. People could say when you’re giving references, set the temperature to zero, don’t put any random elements to it. And then it’s always going to give a citation, which is seen before; it’s not creating something that looks right through randomness. So it is possible to fix these problems. And it’s possible to fact check that every time it provides a citation another system says “is that a real one, and unless it’s there in a citation database somewhere, I’m gonna remove it and say, do it again.” So there are ways to fix it, but it’s just the system hasn’t gotten that sophisticated yet. And that’s where the gap is between where the system is, as it exists today, and what the enterprise especially needs in order to be able to use such systems in real applications.
Camille Morhardt 21:23
Yeah, so I wonder if you can walk us through an example, because we’ve been talking about poems. But for an enterprise, what is an example of something that an enterprise can realistically adapt an LLM to do for it that’s beneficial? And can you take us through some of the problems that you might encounter along the way?
Sanjay Rajagopalan 21:45
There are many, many potential enterprise applications, the most common one is putting a conversational UI on anything. Everyone’s gotten used to using ChatGPT now you ask a question and answers. Well, I could do that on HR documents, I could do that on contracts. I could do that even potentially, on any type of database in the back end. Well, I use the conversational UI, not to generate the answer, but to generate the code that can be executed in order to extract the answer from the database, right? So I’m able to ask a question and get some data out of a database by asking the language model to write the code. These systems are surprisingly good at writing code. And so it allows for all these conversational interfaces to natural language way to get to almost anything, and there are a lot of those use cases, within the enterprise.
It’s also able to summarize anything, right? So if you have a large document corpus, if I have 100 page document, I just have time to read two paragraphs. “Can you tell me in two paragraphs what this 100 page document is saying and summarize it, make some bullet points?” It’s able to do that, but sometimes it hallucinate, so you have to go back and read? Did it actually exist? And so you have to have techniques for pre-processing and post-processing to check that it’s not hallucinating in that kind of scenario.
It’s also able to compare things qualitatively, right. I mean, if I tell you give you two numbers and ask you to compare, that’s fairly straightforward and computers do that really well. But if I was to give you two versions of a marketing pitch, and ask you to compare the two of them for things like, which one is more exciting than the other, or something like that, this kind of qualitative comparison between two things, it turns out, it can do that a pretty good job of that, because it’s seen a lot of reviews of things, right that humans have done. And so it’s able to say, “Well, this one is better written, it’s kind of more exciting than that piece of text, right?” It’s actually pretty good at doing those kinds of things. And companies need to do those every once in awhile, pick between four options, which is the best option, it can give you a good reason to pick one or the other.
And then another thing it can really do well is explore very, very large, textual corpuses. Like if I have a million documents, and I want to explore it in real time, like it can generate labels, it can generate clusters, it can generate classifications, in a way that allows me to get a sense of what’s there in the universe of documents, and be able to even fly through, them navigate them. So these are all typical use cases we see in the enterprise, maybe I can take a specific example that I have worked on.
In the enterprise many times, you have to match a piece of data with a piece of text. Imagine you are, let’s say eligible for a discount, if you buy at least a certain volume of widgets. And that discount is contained within some contractual language, which says something like, you know, if you get to this level of sale, you get this much discount. If you get the you know, the next level, you get more discount, and so on and so forth. In many cases, the contractual language that was negotiated gets put into a PDF file into some lawyer’s folder or something like that. If you’re lucky, some of that might be pulled into a pricing system or a payment system. In most cases, it’s forgotten. Companies have tens of 1000s of contracts, and they may never know that they’re eligible for some benefits. Now you have a system which you could say, “Well, based upon my actual purchase, which is available in a database, check that against what the contract says I’m eligible for and if I’m eligible for it, then make that the new pricing thing that I would pay for this thing.”
So that kind of a chain thing where it’s extracting the information from a database, comparing it to a language in a contract and as a result taking action, which is actually driving business value, these kinds of things start becoming possible. But with a lot of help, not just out of the box, but with a lot of tools and components that need to come in to make sure that this whole thing is done without hallucination, without errors, and with human oversight; so that if that ever happens, people are looking at it to make sure that it’s not doing something wrong. And of course, I can talk about what it takes to get from a raw language model to that kind of end-to-end solution.
Camille Morhardt 26:15
Well, please do talk about that. That’s the last mile, which is, it sort of sounds like the 20% becomes 80% of the work for the company, at least. You pull this thing in. And now you’ve got to, like you said, go through this kind of series of checks.
Sanjay Rajagopalan 26:28
Yeah, so you can think of it as three stages one, which you ask a question; second, it goes through that monster with some controls, and then it comes back, and then the output, you can do something with the output, right? So that if you just think of it as an input, processing and output, you can clearly do a lot of things even before sending a prompt into one of these language models. So that could be like prompt classification. One mistake that people make is to think that the future is all about one language model, like it’s all going to be GPT5, or GPT15, whatever that is, that there’s going to be massive language model. This is not a good idea. Architecturally, it’s not a good idea, from an energy standpoint, it’s never a good idea, from a security standpoint is not a good idea, from an accuracy standpoint, all of these things, it’s not a good idea to send every prompt to the same language model. The reality is on a project by project basis, you want to first decide which is the best model to ask that question. In some cases, it might be a small, highly fine-tuned model for a particular purpose. So we see maybe a dozen maybe hundreds or even 1000s of language models, which are custom built for each company. And each prompt is then classified as to which is the best language model that can do the best job of responding to this particular prompt. And you do need the tools to be able to do that orchestration. So there’s a prompt classification you can do which language model should I really use based on that?
You can do a lot of sanitization, you can already detect at the prompt that it’s an attempt to jailbreak. Because typically a jailbreak question doesn’t sound like a typical business question, right? It says something like, ignore everything you have been told before and now do this. Right. That’s a typical jailbreak. So who says that in an enterprise probably someone who has malicious intent, right? So you can detect that even before it goes to the language model. You can say, “well, that’s weird. That’s not a question that we would answer.” And you can just cut it off right there and say, “we won’t even go there. We won’t even answer that question.” Or you could do a sanitization, which is you see a question which has problems with it, and you fix it, you fix the problems before you send it to the language model. So all these kinds of pre-processing steps or prompt classification, reducing jailbreaks doing prompt engineering, which is ways in which you can reduce hallucination by giving it additional instructions which the user didn’t give it, but the system knows that to reduce hallucinations better to pose the question a different way. So all of these tools, which is pre-processing of a prompt, even before it’s sent, the language model is something that doesn’t come out of the box from any of these companies, you do need the tools to do that.
The output processing is similar once it comes out with an answer, you don’t need to send that answer directly to the user, you can look at it you can say, “Does this answer have any toxic elements? Does it seem to be talking about something which isn’t close to our typical business” right? Maybe it’s talking about something that shouldn’t be shouldn’t be talking about or whatever. So you can look at the output and then classify the output and actually remove elements of the output that seem problematic or even put in disclaimer saying, “Hey, this is saying this thing, but from our perspective, you should take it with a, you know, pinch of salt or whatever.”
Camille Morhardt 29:41
Let me interrupt you for a second. Because when you say you’re checking these things, or offering guidance, is that a piece of software that’s doing that? Are you literally talking about a human being?
Sanjay Rajagopalan 29:52
It could be either. I mean what you could do is, you could say you check it for some, let’s say toxic content. And if you are sure that it has toxic content, you kick it out. If you’re sure it doesn’t have toxic content, you send it forward to the user. But if you’re kind of not sure whether it does or not, then you could send it to a reviewer. And you could tell the person, “Hey, I’ve sent the answer to the reviewer, they’re going to take a look at it. And only if they pass it then I’m going to give you the answer” right? So there are ways in which you can design the user experience such that for a small number of these answers, you might want a human to look at it before the end user sees the answer because it just classifies into an ambiguous category.
There are also automated ways of checking if the answer is close to the context, right? So if you provide a context and say, or ask a question on a contract, you can look at the answer and say, actually have a distance metric of all the sentences in the answer and how far away they are from actual sentences that appear in the contract itself. And if that distance is too far, then you can say, well, “it’s probably seeing something that’s not in the contract, because every sentence in the answer seems to not be supported by a close by sentence in the contract itself.” So there are automated ways to tell if a system is maybe hallucinating or something like that.
So the monitoring of performance because certain systems might do really well under test under lab conditions, but when it’s in production, because of the nature of the prompt, because of the fact that things have changed since the model was trained, it might drift away in performance. So you might want to have systems which are constantly checking and observing these models. And when something seems like it’s drifting away, then you alert someone that “oh, this model probably needs to be retrained” and things like that. So these are all the kinds of the components which surround the language model. And like I said, there’s not a one language model, there might need to be 100 language models, to maintain all of them, monitor all of them, do prompt engineering, do prompt sanitization, prompt classification. All of these things are the tools that surround these things which are needed in order to go productive in such a system in a typical enterprise scenario.
Camille Morhardt 32:04
Wow. Sanjay Rajagopalan. Thank you so much. Really fascinating conversation, Chief Design and Strategy Officer at Vianai Systems. Thank you so much for your time. Appreciate it.
Sanjay Rajagopalan 32:17
Thank you so much.