Aurora Quinn-Elmore 00:13
Once people actually try to use the tools to solve problems, they’re going to find that the tools can solve a lot of those problems, or at least get them something to react to and help them make progress.
Camille Morhardt 00:27
Hi, I’m Camille Morhardt, host of the InTechnology podcast. Today my guest is Aurora Quinn-Elmore, founder and CEO of Metamorph AI, a company that helps other companies–in particular, small and medium businesses–figure out how they can take advantage of large language models and customize them for use in their own business. Something I really enjoyed about this conversation is that she not only walks us through how large language models are initially trained, she then very specifically walks through how companies can fine-tune or customize those models with their own data and shape the responses they want.
Also, something I particularly love is that she walks through a lot of the pros and cons of the different models that are out there and available to people right now. It turns out a lot of them have some things in common and some things that differ, and it really matters what your criteria are and what you’re trying to do. She walks us through that in a very neutral way, so you can understand which option you might be most interested in pursuing. Fascinating woman, very interesting conversation. I hope you enjoy the podcast.
Welcome to the podcast, Aurora.
Aurora Quinn-Elmore 01:43
Hey, thank you.
Camille Morhardt 01:45
So I wanted to ask you, we had a pretty popular podcast recently about how and why large enterprises are using large language models. I’m very curious to get your perspective on how small- and medium-sized businesses are starting to use large language models.
Aurora Quinn-Elmore 02:03
Yeah, absolutely. That’s been a big focus of mine. My business has been working with CEOs and founders of small and medium-sized businesses who see the potential for AI to transform their business, or who, in some cases, see AI as a risk that could disrupt their business and want to get ahead of that. So they want to find a way to incorporate GenAI’s capabilities into their operations or their core product in order to stay ahead of that.
Camille Morhardt 02:30
I want to understand how they can customize. But before we do that, can you level-set on how these models are trained, and then we can get into how we can tweak them?
Aurora Quinn-Elmore 02:41
A lot of people have probably heard about how ChatGPT is trained on all the information on the internet, or a large chunk of the information on the internet. And Google recently came out with Gemini Ultra, which is a model that has a much larger context window than ChatGPT but is at a similar level of performance. There are also some open-source projects putting out models with impressive performance, but they are behind the state of the art, which makes sense given how expensive it is to train these models.
So as a first step, companies need a huge amount of data. That’s either real-world data scraped from online, or, as some companies and particularly open-source projects are increasingly doing, what’s called synthetic data, where they start with a big chunk of data of different types. So maybe there’s a large amount of data that is conversations on a forum, for example, and maybe there’s a large amount of information from technical white papers describing different technologies. So they’ll have a bunch of different sources of real data from the world. But then to build out the synthetic data, they create additional data on top of that, with a similar shape and a similar set of features as the real data. And it turns out that training on vast amounts of synthetic data, if it’s high quality and generated in the right way, can get you really good performance out of those models.
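As a rough sketch of the synthetic-data idea (the seed examples, prompt, and model name here are invented for illustration, not any lab's actual pipeline), one common pattern is to show a model a few real examples and ask it to generate more with the same shape:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

# A handful of real seed examples; a production pipeline would use far more.
seed_examples = [
    "Q: My router keeps dropping Wi-Fi every few hours. A: Try updating the firmware first...",
    "Q: Which thermal paste should I use for this CPU? A: Any non-conductive paste is fine...",
]

def generate_synthetic_examples(n: int) -> str:
    """Ask the model for new examples that match the shape of the seed data."""
    prompt = (
        "Here are examples of technical-support Q&A pairs:\n\n"
        + "\n".join(seed_examples)
        + f"\n\nWrite {n} new Q&A pairs with the same style and level of detail, "
        "about different hardware problems."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_synthetic_examples(5))
```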
And then there’s a huge amount of compute: you need powerful hardware to train the model, to allow it to ingest all that information from the world. I like to think about it this way: as you and I were growing up as children, we moved through the world and learned so much from it, both what we learned formally in school and just, you know, day to day, communicating in different ways and behaving in different ways, and then seeing how people respond to us. For the AI model, the initial training run is kind of that implicit knowledge we pick up moving through the world, based on the observations we make of, you know, written text or visual input that we receive.
And then there’s this really interesting process that AI companies have to do after that, called reinforcement learning from human feedback. That’s essentially taking the model that has all of this information but doesn’t necessarily know how to behave–you know, it’s perhaps trained on a bunch of Reddit forums, so it has a bunch of examples of people being really rude to each other and not understanding each other’s points. Reinforcement learning from human feedback is the process that AI engineers use to teach a model, “This is how you respond in order to be helpful and harmless.” You know, it teaches them: don’t help someone figure out how to create a chemical weapon, don’t help someone figure out how to commit a crime. So that’s the work that happens behind the scenes to create a product like ChatGPT or Gemini Ultra, to get the model to a place where it’s very knowledgeable and very helpful.
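A toy illustration of the preference-learning idea underneath that human-feedback step: labelers mark which of two candidate responses is better, and a reward model is trained so the chosen response scores higher. The scoring function below is a placeholder for a real learned model, and the example record is invented:

```python
import math

# Each record: a prompt plus a response the labeler preferred and one they rejected.
preference_data = [
    {"prompt": "How do I reset my password?",
     "chosen": "Go to Settings > Account > Reset Password and follow the email link.",
     "rejected": "Figure it out yourself."},
]

def reward(prompt: str, response: str) -> float:
    """Stand-in for a learned reward model scoring how helpful/harmless a response is."""
    return len(response) * 0.01  # placeholder; a real reward model is a trained network

def pairwise_loss(example: dict) -> float:
    """Bradley-Terry style objective: push the chosen response's reward above the rejected one's."""
    margin = reward(example["prompt"], example["chosen"]) - reward(example["prompt"], example["rejected"])
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

print(sum(pairwise_loss(ex) for ex in preference_data))
```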
Camille Morhardt 05:35
And when that goes awry, or when something goes awry, can you describe what happened with Google Gemini and some of the images that were being generated, and how this type of reinforcement training by humans helps with that, or doesn’t always help with that?
Aurora Quinn-Elmore 05:53
So in case people didn’t come across this: people were playing with the Gemini Ultra model, which has image generation capabilities built in, the way ChatGPT does if you’re paying for the premium version. But something people were noticing was that the images it was generating weren’t diverse in a way that represents, you know, the normal distribution in the population; they seemed diverse in a way that didn’t make sense. So one of the things people noticed was that if you asked it to generate an image of a German soldier in World War II, it wouldn’t generate an image of a Nazi–that was kind of on the blacklist of things it wouldn’t do–but for a German soldier in World War II, it would generate a very diverse set of soldiers. The uniforms would be right, but the people would be of all races and all genders. And, you know, there weren’t really Asian Nazis and Black female Nazis, for example.
Similarly, if you asked it to generate an image of the founding fathers, it would generate images that looked very much like the cast of Hamilton, which is a beautiful piece of art, but it’s not historically accurate. So people were really confused about the outputs they were getting. And it turns out, under the hood, what Gemini Ultra was doing was modifying the prompt. So if you asked it to generate an image of the founding fathers, before the AI received the prompt, it would modify the prompt to say, “Give me an image of the founding fathers represented by diverse people of different ethnicities and genders.” There was some additional work being done under the hood which, if you had just asked for an image of firefighters, for example, or a crowd at a concert, would be great. That would make a lot of sense. Because the thing that Google and other AI companies are trying to do is de-bias the model so that it is representative, because these AI models are trained on huge numbers of images that are online, which, as it turns out, are not representative. We can look at Hollywood films, particularly from 50 years ago or so, and look at the number of white people versus BIPOC people, or the number of male characters versus female characters, and it just didn’t represent the population as a whole. So these AI companies are trying to do important work to make sure these image generators don’t produce images of all white male firefighters, for example, because that wouldn’t be a true representation of reality. But they maybe made a mistake in tipping too far, and/or they didn’t take into account historical context.
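A minimal sketch of the general pattern being described, a pre-processing step that silently rewrites the user's prompt before the image model sees it; this is an illustration of the pattern, not Google's actual code, and the injected wording is invented:

```python
def rewrite_image_prompt(user_prompt: str) -> str:
    """Illustrative pre-processing step that silently appends a diversity instruction."""
    return user_prompt + ", depicted as a diverse group of people of different ethnicities and genders"

# Applied to a generic prompt, the rewrite is reasonable...
print(rewrite_image_prompt("firefighters battling a warehouse fire"))

# ...but applied blindly to a historical prompt, it produces the mismatch described above.
print(rewrite_image_prompt("the founding fathers signing the Declaration of Independence"))
```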
And then the models were also doing funny things where, like, you’d ask it to generate a picture of a Black family and it would do that no problem, but if you asked it to generate a picture of a white family, it would refuse. So it didn’t really feel like it was being historically accurate, or that it was letting people generate the kinds of images they wanted to, where, you know, there’s nothing offensive about generating an image of a white family; you should be able to do that. So it’s just a tricky thing that these companies are working through, where they’re trying to get the models to behave in a way that is respectful and representative, and they weren’t thoughtful enough about how to do that, is my belief.
Camille Morhardt 09:00
So if you take one of these models, be it open-source or something like ChatGPT, how can companies then customize that model for their own use?
Aurora Quinn-Elmore 09:13
So some of the folks that I’ve worked with might initially use ChatGPT to prototype a potential solution, where they’re like, “Okay, well, we have this thing that is a core part of our operations.” One of my clients runs a company that trains therapists who are interested in learning how to work with psychedelic medicines. So he’s got this great business where they’re training therapists, who are licensed mental health workers, in the skills they need to work with these substances. But a key bottleneck he found in his business is that not enough senior therapists who know how to work with these substances are able to give the level of detailed feedback that would really benefit students as they’re building these new skillsets: how they establish emotional safety, how they set expectations with the patient about what’s going to be different for this medicine journey they’re guiding them through.
And so for this client, when he came to me, he had already done some work where he had recorded sessions and role-plays between students who were pretending to be the therapist and the patient. He had those transcripts, he fed them into ChatGPT, and he told ChatGPT, “Here’s what I’m looking for: these are examples of the therapist handling things well; these are examples of things where the therapist is not handling it well.” So he was able to verify, before he came to me, that this technology did a really good job of finding those moments of success and failure. And he came to my company to figure out: how do we scale this up? How do we make this something that can work for 100 or 1,000 students moving through a class together, who normally would be able to get some peer feedback, which is useful, but wouldn’t necessarily be able to get the level of detailed, nuanced feedback that a professional therapist with a strong grounding in this area would be able to give? So we were able to raise the quality of feedback his students received by using GenAI in that way.
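A rough sketch of what that kind of prompt can look like in code; the rubric text, examples, and model name are invented for illustration rather than taken from the client's actual system:

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

RUBRIC = """You are reviewing a role-play between a trainee therapist and a patient.
Example of handling it well: the therapist names the emotion and checks consent before proceeding.
Example of handling it poorly: the therapist changes the subject when the patient expresses fear.
For the transcript below, list moments of success and moments to improve, quoting each one."""

def review_roleplay(transcript: str) -> str:
    """Send one role-play transcript and get back structured feedback."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model name
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```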
Camille Morhardt 11:12
That’s really interesting.
Aurora Quinn-Elmore 11:13
Yeah, so there’s a lot of really interesting possibilities there. My team mostly uses the GPT-4 API. So, ChatGPT is the consumer-facing product that’s available, but OpenAI, the company behind it, also makes APIs available, which allow engineers to build on top of that. Just in case some folks listening aren’t as technical: I’m sure they’ve noticed that when they’re buying something online, they often have the option to check out with PayPal. What’s happening there is that PayPal, the company, has made their APIs available, such that an engineer who’s building an e-commerce store–knowing that people will be more comfortable using PayPal than giving their credit card to a random company–can use the PayPal APIs to bring the PayPal functionality into their own website, however big or small that website is.
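For listeners who want to see what "building on the API" means concretely, a minimal call through the OpenAI Python SDK looks roughly like this; the model name, store persona, and order number are illustrative:

```python
from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a support assistant for an e-commerce store."},
        {"role": "user", "content": "Where is my order #1234?"},  # example question
    ],
)

print(response.choices[0].message.content)
```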
And so my team mostly works with the GPT-4 APIs, and one of the things you can do on top of that, to customize the responses you’re getting, is RAG, which a lot of teams are exploring right now—that’s Retrieval-Augmented Generation. I like to think about that as giving the AI model access to a reference library. If you’re trying to build an AI that’s really, really good at answering very nuanced technical questions specific to the hundred hardware products your company sells, that’s not information that would necessarily have been in the training set; it might be too niche, so the model won’t have that detailed understanding. But with RAG you can give it access to that reference library. So if you build an AI chatbot that uses RAG and surface it on your website, and a user asks a technical question about your product–they’re trying to make two pieces work together and figure out what products they should buy to solve this fairly obscure question–then the AI basically has the option to look in that reference library. It can do a fairly standard search of the information that’s available, pull in that information, and then incorporate it as part of its response. That works really well for organizations that have that kind of niche knowledge.
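A minimal sketch of that retrieve-then-generate pattern, using embeddings as the index into the "reference library"; the product documentation, model names, and question are placeholders, and a production system would use a proper vector database rather than a Python list:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # illustrative embedding model name

# The "reference library": niche product documentation that wasn't in the training data.
docs = [
    "The X-200 bracket is only compatible with the V3 mounting rail, not the V2.",
    "The PSU-850 requires the 12-pin adapter cable sold separately as part A-77.",
]

def embed(texts: list[str]) -> np.ndarray:
    data = client.embeddings.create(model=EMBED_MODEL, input=texts).data
    return np.array([d.embedding for d in data])

doc_vectors = embed(docs)

def answer(question: str) -> str:
    # 1. Retrieve: find the most relevant document by cosine similarity.
    q = embed([question])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = docs[int(scores.argmax())]
    # 2. Generate: answer the question using the retrieved context.
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model name
        messages=[
            {"role": "system", "content": f"Answer using this product documentation:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("Will the X-200 bracket fit my V2 rail?"))
```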
And something that’s actually really interesting: we were just talking about Gemini Ultra’s million-token context window, which, for people who are paying attention to this, is enormous. It was groundbreaking maybe six months ago when OpenAI expanded their context window from maybe 6,000 tokens, or maybe it was 12,000–anyway, somewhere in that range–to 100,000. That changed my life in terms of a project we were working on; it allowed us to just execute much better. So 100k tokens six months ago was amazing. This million tokens is crazy. The context window is basically the working memory of the AI–what it can hold in mind in the moment–and with a million-token context window it can now take an entire encyclopedia into account.
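To make "tokens" concrete: a token is roughly three-quarters of an English word, and you can count them locally with OpenAI's tiktoken library. The encoding name below is the one used by GPT-4-era models, and the transcript file is hypothetical:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

text = open("full_transcript.txt").read()   # hypothetical 90-minute role-play transcript
print(len(enc.encode(text)), "tokens")

# Rule of thumb: 100k tokens is on the order of 75,000 English words,
# so a million-token window really is encyclopedia-scale working memory.
```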
So something really interesting that’s happening: my understanding of how we’re getting those massive context windows is that, when you ask the AI a question but load up a ton of information, it’s doing something similar to RAG, in that it’s finding a denser way to store and digest that information and then pull in the relevant bits to answer the question. So that’s RAG. There’s also fine-tuning, but I’ll pause and let you jump in.
Camille Morhardt 14:38
Well, no, I’m actually interested in fine-tuning also. So how does that work?
Aurora Quinn-Elmore 14:43
Let’s say you run a company that’s been operating for a couple of years. You have hundreds or thousands of customers, and you have a customer support team that’s answered maybe hundreds of thousands, or millions, of questions–bigger data sets are better in this case. And then maybe you have data where customers gave one to five stars on how helpful that customer support agent was in solving the problem. And maybe you also have a way to classify those tickets, like a password reset or a request for a refund on a purchase. So you have this data set of customer support tickets that were closed, you have the rating of how good a job the person did of answering the question, and you have the categorization of the question. If I was building out a chatbot for a company that had that kind of data set and they wanted to do fine-tuning on it, then we’d be able to use that data set to train the AI on how to answer questions in a way that customers liked, based on what was successful in the past, and how to avoid problems, based on the things that were given low star ratings.
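A minimal sketch of that workflow with the OpenAI fine-tuning API, assuming you have already exported the ticket data; the field names, star-rating filter, and base model name are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()

closed_tickets = [  # stand-in for your exported support data
    {"category": "password_reset", "stars": 5,
     "customer_question": "I can't log in after changing my email.",
     "agent_answer": "I've re-sent the verification link; once you confirm, use Forgot Password to set a new one."},
]

# Keep only tickets that customers rated highly, so the model learns from good answers.
good_tickets = [t for t in closed_tickets if t["stars"] >= 4]

with open("support_finetune.jsonl", "w") as f:
    for t in good_tickets:
        f.write(json.dumps({
            "messages": [
                {"role": "system", "content": f"Ticket category: {t['category']}"},
                {"role": "user", "content": t["customer_question"]},
                {"role": "assistant", "content": t["agent_answer"]},
            ]
        }) + "\n")

# Upload the dataset and kick off the fine-tuning job.
upload = client.files.create(file=open("support_finetune.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-3.5-turbo")  # illustrative base model
print(job.id)
```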
Camille Morhardt 15:46
Can you give us an overview of some of the pros and cons of the different models, or overall of using the open-source versus non-open-source models that are out there?
Aurora Quinn-Elmore 15:58
Yeah, absolutely. Take the example I gave of when OpenAI expanded its context window to 100k. That was really interesting, because I was at the time working on a client project–the one I mentioned with the role-plays from the therapist students who were learning this new set of skills. The first step we built our system to do was a tool where our client could upload the transcript of the role-play session, but sometimes these would be an hour and a half long because the students were taking turns with different role-plays. So first we needed to send the full transcripts to the AI to figure out which sections are role-plays, which sections are discussion, and what kind of role-play is happening, so we could send it on to the next step to do the analysis.
So with the original limitation of OpenAI’s context window, we actually had to use Claude 2 for that first chunking step. And I was, frankly, really struggling to get the level of performance I wanted from Claude 2 on that chunking. But it was literally that week, while I was doing that work, that OpenAI expanded their context window to 100k. So I could suddenly use the GPT-4 API for both the initial chunking and the next step of the analysis. That’s just an example of how there will be some projects where you simply need a certain context window, and you can’t get it done at the level of quality you want with a smaller one. So that’s one key consideration: is the context window large enough? So far, the 100k GPT-4 context window has been fine for all the projects I’m working on, but I could imagine a scenario in the future where we really did need Gemini Ultra. So I think that’s one question for someone who’s building something with these tools: what are the key constraints in terms of context window? But then cost is another consideration.
So there’s another client project that I have, where we’re doing analysis on millions of records of work that was done to repair properties this company manages; it’s low-income housing in the UK. They manage the properties, so they’ll have to come in and replace windows, replace insulation, repair bathrooms, etcetera. And they want us to use AI to turn that qualitative data into quantitative data: at what date was what thing repaired or replaced? I was initially costing this out with the idea that we would use GPT-4, but the AI costs would have eaten up a ton of the contract. So it was a much lower profit-margin project initially, but, you know, it was an interesting project and we wanted to get it done. But then Claude 3 comes out, and they have three levels of models. The most expensive one has better performance than GPT-4, and their middle one is somewhere between GPT-4 and 3.5. And something I found from experimenting with it was that the mid-size model, which was cheaper than GPT-4, was able to give me the quality of analysis I needed for this project. So, for projects with huge amounts of data, where I need to run it over a million lines, a model that’s a third or a fourth of the cost makes a huge difference.
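The cost math is simple enough to sanity-check in a few lines. The per-token prices below are placeholders (they change often) and the model names are generic, so plug in the current rates from the providers' pricing pages:

```python
# Back-of-the-envelope API cost comparison for a large batch job.
# Prices are PLACEHOLDERS per 1M tokens; check current pricing pages before relying on them.
PRICE_PER_1M_INPUT = {"big-model": 10.00, "mid-model": 3.00}
PRICE_PER_1M_OUTPUT = {"big-model": 30.00, "mid-model": 15.00}

records = 1_000_000          # repair records to analyze
tokens_in_per_record = 400   # prompt + record text (rough estimate)
tokens_out_per_record = 100  # structured answer (rough estimate)

for model in PRICE_PER_1M_INPUT:
    cost = (records * tokens_in_per_record / 1e6) * PRICE_PER_1M_INPUT[model] \
         + (records * tokens_out_per_record / 1e6) * PRICE_PER_1M_OUTPUT[model]
    print(f"{model}: ~${cost:,.0f}")
```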
Another one is latency. For example, a friend of mine has a company focused on synthetic agents that are able to have voice conversations. He was initially experimenting with both the GPT-4 and GPT-3.5 APIs. And I actually played with it a little bit, where I could dial a phone number and talk to this fake agent who was trying to sell me a car and could talk me through all the pros and cons of different cars. And what he found was that he really liked the intelligence he was able to get with the GPT-4 model, because it was just a lot smarter and able to figure things out better; but from a latency perspective, he realized he really needed to go with the 3.5 model, because the 4 model was just too slow in its responses for a live voice conversation. But 3.5 was fast enough that people could have a slightly delayed but normal-feeling conversation. So latency is that third consideration when looking at different models.
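Latency is easy to measure empirically before committing to a model; a quick sketch of that kind of comparison is below. The model names are illustrative, and for live voice you would care even more about streaming time-to-first-token, which this simple end-to-end timing does not capture:

```python
import time
from openai import OpenAI

client = OpenAI()

def time_reply(model: str, prompt: str) -> float:
    """Measure wall-clock time for a single chat completion."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    return time.perf_counter() - start

prompt = "Summarize the trade-offs between two mid-size sedans in three sentences."
for model in ("gpt-4", "gpt-3.5-turbo"):  # illustrative model names
    print(model, f"{time_reply(model, prompt):.1f}s")
```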
And then intelligence or quality of output is another consideration. So 3.5 is not as smart as 4 is. But if 3.5 is good enough for the task, it will be a lot cheaper to use that.
Camille Morhardt 20:22
How hard is it to, you know, try one and then try another one? Is it better to kind of invest your time in selecting one to the best of your ability and then just becoming really good at it? Or is it fairly easy to go back and forth?
Aurora Quinn-Elmore 20:35
If you’re doing fine-tuning, I think there’s a lot more complexity there. But with the example where we were using Claude 2 for the chunking and then GPT-4 for the analysis, I think it took my CTO half an hour to switch it over; the APIs are pretty well built out. So if you’re using an off-the-shelf commercial model, it is very straightforward. If they have good API documentation, etcetera, then it’s pretty easy to switch over. There are more challenges around that with the open-source models. With the open-source models, the default is that you have to do a bunch of work setting it up to run on your own hardware–whether that’s on your computer, if you have a powerful enough one, or AWS compute that you use to run it. There’s a ton of hardware, like GPUs, needed for the initial training run, but there’s also a need for GPUs or other hardware at what’s called inference–when you’re asking the question and it’s generating the response, that uses some hardware capacity too. So if you’re using an open-source model, the default is that you have to configure it and set it up on your own hardware, whether that’s real hardware you control or AWS cloud compute you’re using. But there are actually some really interesting offerings now through Microsoft. Microsoft Azure has something called, I think, Models-as-a-Service. Llama 2, which Facebook put out, is one of the more powerful open-source models, kind of at the 3.5 level. So you can use the Microsoft Azure Models-as-a-Service API to just directly connect with an instance of Llama 2. They’ve handled all the setup for you, and they’re running it on their own cloud compute infrastructure, so you don’t have to worry about that. But the consideration there is that, obviously, you’ll have to pay them for the compute, and of course there’s going to be some markup compared to running it yourself. So it’s the investment of purchasing the hardware versus using cloud compute for that.
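A sketch of why that switch was cheap: each vendor's SDK sits behind one thin function, so swapping providers is a small code change. The model names and the chunking prompt here are illustrative, not the actual project code:

```python
from openai import OpenAI
import anthropic

openai_client = OpenAI()             # reads OPENAI_API_KEY
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def chunk_transcript(transcript: str, provider: str = "openai") -> str:
    """Ask a model to label which sections of a transcript are role-play vs. discussion."""
    prompt = ("Split this transcript into sections and label each one "
              "'role-play' or 'discussion':\n\n" + transcript)
    if provider == "openai":
        r = openai_client.chat.completions.create(
            model="gpt-4-turbo",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
        )
        return r.choices[0].message.content
    else:
        r = anthropic_client.messages.create(
            model="claude-3-sonnet-20240229",  # illustrative model name
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        return r.content[0].text
```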
Camille Morhardt 22:40
So Aurora, what kinds of things can people or companies do from an AI literacy perspective?
Aurora Quinn-Elmore 22:47
One thing I definitely recommend: if it’s not going to be a huge strain on your budget, pay for the pro version of either ChatGPT or Claude. With Claude 3 and GPT-4, they’re at a similar level at this point. Paying that $20 a month just makes a huge difference, because if people are using the free version, it might be interesting, but it’s just not smart enough to do a lot of useful things.
And then, you know, if you’re trying to think about how to get started, think about the top three challenges you’re having in your professional world or personal life, and just use the chatbot to think through them. Like, say, “Hey, I have these three problems, how could you help me solve those?” and then see what it suggests. Not all the suggestions will be great, but there’s probably going to be one or two that do seem interesting and are worth trying, and then just kind of explore from there. Once people actually try to use the tools to solve problems, they’re going to find that the tools can solve a lot of those problems, or at least get them something to react to and help them make progress.
Camille Morhardt 23:47
So you kind of work with AI like a friend–it’s almost like that movie Her–where you start to interact with it, and it gets to know you a little bit better and you can customize it to your own style. Can you talk a little bit more about that?
Aurora Quinn-Elmore 23:59
It actually does kind of respond to tone. So if you’re speaking to it in a really formal way or an informal way, that’ll influence it. And there might be something like this for Claude–I haven’t looked–but with ChatGPT, there’s a way you can configure your account so you can tell it about yourself. So I’ve told ChatGPT that I’m the founder and CEO of Metamorph AI, I’ve given it a sentence or two about what work we do, and I’ve told it a little bit about my relationship with my boyfriend. And so when I’m asking it a given question, it’ll often have that contextual knowledge to fill in. Or my boyfriend, for example, has configured it to know, like, “I’m very technical. Don’t flatter me, just get to the point,” etcetera–just to tell it how he wants it to communicate with him.
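In the ChatGPT product this lives under custom instructions; when you are building on the API, the rough equivalent is just a standing system message sent with every request. The persona text below is invented for the sketch, and the model name is illustrative:

```python
from openai import OpenAI

client = OpenAI()

CUSTOM_INSTRUCTIONS = (
    "The user is the founder of an AI consultancy. "  # invented persona for the sketch
    "They are technical. Skip flattery, be concise, and get straight to the point."
)

def ask(question: str) -> str:
    """Every call carries the same standing instructions, like ChatGPT's custom instructions."""
    r = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model name
        messages=[
            {"role": "system", "content": CUSTOM_INSTRUCTIONS},
            {"role": "user", "content": question},
        ],
    )
    return r.choices[0].message.content
```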
Camille Morhardt 24:46
It’s not being fed back to the cloud? Or should people have concern, or be aware, if they are feeding that type of potentially very personal information in?
Aurora Quinn-Elmore 24:55
Yeah, so I was just looking at this recently for a project proposal where I was working with the legal team of a large organization; it was at the last stage of approval, where they were trying to see if they could get legal sign-off on it. From what I remember from my research, I think OpenAI, in their privacy documentation and user agreements, make it very clear that they don’t train their model on inputs that are put in either through ChatGPT or the API. I believe Claude had something similar, although I don’t quite remember; it might have just been that the API won’t be used for training–I don’t remember about the chat part. But then Google was a little different. I couldn’t find anything in Google’s documentation that made a strong statement about how they weren’t going to train on your inputs. But I think this was when Gemini was announced but not yet released, so they may have clarified that and taken a similar position. Because this is a huge concern for a lot of enterprise companies, but for a lot of day-to-day individuals also: “Will the model be trained on my inputs?” And a lot of people are quite reasonably not comfortable with that.
Camille Morhardt 26:02
So your advice is to just look into that before you submit your data, and find out what you’re comfortable with and what you’re not.
Aurora Quinn-Elmore 26:09
Yeah.
Camille Morhardt 26:10
So we’re expecting ChatGPT-5 to come out later this summer. What do you expect to see there?
Aurora Quinn-Elmore 26:16
Huh, I am so excited. When we got GPT-4 Turbo–that was when the context window got bigger–I felt like Christmas had come early. So I’m so excited for GPT-5. I think one thing that we’re definitely going to see more of is agentic behavior. You’ve seen some companies and open-source projects playing around with this. I think BabyAGI was one of the open-source projects that got a lot of attention. And the idea was that you could just ask it to do a thing, and it would come up with a plan on how to achieve that, and then it would do that step by step. So, I think we asked it to look for AI conferences happening in the United States in the next three months and identify people we should try to contact and meet with at those conferences, based on specific criteria.
So, it made a pretty good plan: search for conferences, look at the speakers that are listed, look those speakers up on LinkedIn or Google around for them to decide who to talk to, and then find their contact info. So I think it made a pretty good plan, but it wasn’t really able to execute on that plan. This is kind of a classic problem with agentic AI: it’ll often get stuck in loops, where it’ll do the same Google search again and again and again and not really move on to the next task. There’s been a lot of really interesting progress on agentic AI over time, but nothing that blows me away yet. And that’s something that, classically, OpenAI likes to do with their releases. There were moments with the GPT-4 Turbo release–with the extended context window and a bunch of other things they did–where, as Sam Altman was talking, there were, like, dozens of startups getting destroyed every minute, because all of these companies had been trying to solve some of the key problems that made the GPT-4 API hard to work with, and then OpenAI just built those improvements in. So I think that’s something we’ll continue to see from OpenAI: what are the core ways in which the API is hard to work with today? Or what are some use cases that people want to use ChatGPT for but can’t yet? And then building that in.
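A toy sketch of that plan-and-execute loop, including a crude guard against the stuck-in-a-loop failure described above; the prompts and goal are invented, and a real agent would actually execute each action (search, browse, etc.) rather than just record it:

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    """Repeatedly ask the model for the next action toward the goal."""
    history: list[str] = []
    for _ in range(max_steps):
        step = llm(
            f"Goal: {goal}\nSteps taken so far: {history}\n"
            "What single next action should be taken? Reply with the action only, "
            "or DONE if the goal is complete."
        )
        if step.strip() == "DONE":
            break
        if step in history:          # crude guard against repeating the same search forever
            history.append(f"SKIPPED repeated action: {step}")
            continue
        history.append(step)         # a real agent would execute the action here
    return history

print(run_agent("Find AI conferences in the US in the next three months and list speakers to contact."))
```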
Camille Morhardt 28:30
Wow, phenomenal, very interesting. Aurora Quinn-Elmore, founder of Metamorph AI, thank you so much for joining us today.
Aurora Quinn-Elmore 28:38
Yeah, absolutely. This is a lot of fun, thank you.