Mostly Unstructured Podcast ep. 5
00:00:00 Speaker: All right. Welcome back to the Mostly Unstructured Podcast. We're back, Ed and Clay. Back at it again. I say that every time. Yeah, but it's all right. Still true. It's still true. Um, so we're going to talk today about really unpacking a blog post that we wrote around how to train an LLM for enterprise. Yeah. So no jokes today, we're just going to jump right in. Okay, Thomas, I know it's boring. We'll try to bring you something back later. Um, so when we're talking about training an LLM for enterprise, let's throw out some terminology. A lot of times we'll hear people say corporate language model, enterprise language model. Sure. Um, and a lot of our... domain specific, that was where I was going. Yep, yep. A domain-specific language model. DMSL? What is it, DSL? Yeah. Sorry. We've got too many acronyms. Alphabet soup. We've sworn we're going to do an episode on it, just acronym soup. It's coming. We're going to have to do it. Um, but there may be a temptation by some organizations to go, I'll just train my own language model. Right. And if that's part of your plan, I think we would say a hard no. Not a good idea. Right. I mean, it's one of those things where, on paper, people can look at it and say, well, this will make it specific to ourselves. Build it from the ground up. And that's going to help us with accuracy and, obviously, internal knowledge and all those things. But there's a reason we didn't hear about LLMs for so many years while they were still being built. Right, right. Nobody just turned them on. This was a process over a period of years to get the right kind of information, contextualized the right way, searchable the right way, and all those things. And we know, in the history of this company and the content world, the IDP world, with content in particular, everybody thought, well, how hard can that be? Just a database and a file store, just do it. And we've replaced a lot of those attempts. This is obviously a very different class of technology, but the lesson is there: it may look relatively simple. It's not. And the reality is, most enterprises really have no business messing around with that. So I like to think of an analogy. If you're in the web world and you want to build a website, you don't want to have to start from scratch. That's why there are a lot of open source options out there, like WordPress, for example. Or if you're a huge enterprise, Sitecore, for example. Yep. It's the same kind of analogy. You also don't take kids into your organization and raise them up to be an employee. No, you go find a really talented person. Smart. Bring them into the organization, teach them about the organization, and then they help you grow your business. I think that's a good way to look at it. And it starts with what some would term a foundation model, right. Foundation models, which everybody knows: ChatGPT, right. I always say Gronk. Yeah, like Gronkowski, but it's Grok. Yeah, Grok. It's probably a good thing.
Um, yeah, but any of those: Gemini, you name it, Claude, etc. Those are a great place to start, and then you use that to improve on your enterprise. Well, and to your point, first of all, there's no lack of options. Sure. Secondly, understanding the specialization and differentiation between those different options is important. Right. ChatGPT, I think, is probably the best known, or maybe the most used. I don't know the statistics. But there are a lot of companies, particularly in the R&D world, where Claude is amazing, right? Claude Code is amazing. And so it's getting to the point (I'm not saying this is exactly correct, but it's getting to the point) where you want to tap into the more specialized model. But pick a model. Starting from scratch, that's a reach. Yeah. And really, in what we're talking about, training your own LLM, your domain-specific language model, it's not typically the model. It's the context, right, in what you're feeding it. Right. And we talk a lot about how AI can give you really bad results, and the internet is rife with those examples. Even the simple can't-spell-strawberry one that's gone viral in the last several months. I mean, what's really intriguing is that feeding it the right stuff is so critical if you don't want to get the bad out. Right. Garbage in, garbage out. That's right. And it's no different than legacy BI. It's always been the case: if you put bad in, you get bad out. That context you're feeding it can really backfire on you. And we were talking about Air Canada. Yeah. How that happened to them. Right. Yeah. I mean, that Air Canada story broke, I think, in 2024, and it became this teachable moment for people about when you unleash AI without, I don't know if restraint is the right word, but without governance. We talked about governance in the last podcast. This idea that they just fed in lots of policies and information, right, and said, here you go, turn it on, now we've got a chatbot. And I don't know if their motivation was, well, that's going to decrease call volume, which means we can run leaner call centers. There's something behind it, right? Or maybe it's just that people want to self-serve. So the idea behind it, fine. The issue is it wasn't governed. The case study was that this gentleman went on the site, communicated with the chatbot, and wanted to know about their bereavement policy because he had a funeral to attend. The bot said, yes, absolutely, take the flight, come home, submit it, and we'll credit or refund it, or whatever it promised. When he submitted for that credit after the fact, he was told, absolutely not, that's not our policy. The court ruled that you're responsible for the AI, you're responsible for what it's putting out there. And that, you hope, is a lesson for all of us: if we're going to expose a bot, or however we're doing it, and it's going to share information, I'd better be sure it's right. Yeah.
And that has implications in multiple areas. Whether you're in marketing and you're generating something from, sure, copyrighted information (there are tons of articles and information on that, and we'll dive into it, of course), or in your own organization: if you create this model, your corporate model, and you query that model internally and you get bad information back and you go make decisions on that, does that necessarily have the same effect as that public-facing example? No, not necessarily. But think about the ramifications. No, I mean, I think that's a really interesting point, because let's say they hadn't exposed the bot. Let's say the bot was used internally for their customer service center. So this gentleman calls up, and I'm on the phone with him, I'm the service center rep. I, of course, don't know every policy of the organization. So I go in, it comes back, and I verbalize what it's told me. Am I more or less culpable? The same. Right? The same. Because at the end of the day, I gave bad information, and I gave that bad information based on this ungoverned, hey-we'll-just-stand-this-up-and-throw-it-in system. And I think we're all having to learn very much in real time. It's almost like what we did yesterday isn't the same today, because that's how fast it's all evolving. But the lesson in that, back to your garbage-in point, is the criticality of what am I feeding it. And the idea that I'm going to feed it from scratch, that's hard to wrap your head around. And I think it's important to say, this is not fear mongering. We shouldn't be fearful of it, but respectful of it. Again, it's not to be feared, it's to be respected. We use it every day, right? We're using it every day. We're using it to do our jobs. We're helping customers with it. To your point, this is not driven from a point of be afraid of AI. It's be smart about it. Be smart about it. And the reality is there's this rush. I feel it, I think we feel it as an organization. How do we use it to propel faster, to have that competitive advantage, etc.? But you have to juxtapose that against: how do I make sure that what I'm putting in is right? How do I make sure it's secure? How do I make sure it's governed? How do I make sure I'm not seeing what I shouldn't be seeing? How do I manage all that? And I think that's what we're trying to get at: that rush people are feeling has them skipping past good, educated deep dives into how do I do it right. That's right, that's right. Yeah. So one of the temptations could be, why don't I just take this model and refine it? I'm going to fine-tune it. Yeah. Okay. And that's great. It has good results; there's good that can come out of that. One of the things we covered briefly in our post about fine-tuning was that it may tell you what you want to hear, in the tone and the way you want to hear it, but it doesn't always tell you what you really want to know, because it's not always based on the knowledge. It's based on the response. And that's one area of it.
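To make that tone-versus-knowledge point concrete, here is a minimal sketch of what supervised fine-tuning data often looks like: chat-style examples serialized as JSONL. This is a hedged illustration, not any one vendor's spec; the exact schema varies by provider, and the company name, question, and file name are invented.

```python
import json

# A minimal sketch of supervised fine-tuning data: chat-style examples
# serialized as JSONL. The system/user/assistant shape is the common
# pattern; "Acme Corp" and the helpdesk exchange are hypothetical.
training_examples = [
    {
        "messages": [
            {"role": "system",
             "content": "You are Acme Corp's helpdesk assistant. Be concise and friendly."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant",
             "content": "Happy to help! Open the portal, choose 'Forgot password', and follow the emailed link."},
        ]
    },
    # ...hundreds or thousands more examples in the same shape...
]

with open("finetune_train.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")

# The limitation discussed above: examples like these teach the model the
# tone and shape of a good answer. They do not make it a live store of
# facts; if the policy changes tomorrow, the tuned model still answers
# from its frozen training data.
```

In other words, fine-tuning shapes the response, not the knowledge, which is exactly the gap the conversation turns to next.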
Obviously there's more to it than that, but fine-tuning isn't the final answer. We would say, well, if you want to get at this information and you want to be able to query that data, what about all of the information that's coming in in real time? And RAG is one of those great other acronyms out there. Yes. Um, one that people throw around and, I think, don't really explain. That's fine, explain it in layman's terms; we won't go into a deep dive on it right now. But what is RAG, and why is it so important when you're training your LLM? Well, I want to step back for a second on what you were saying, because there are some terms where the terms themselves have been around, but not necessarily in the context of technology. Bias, right? Toxicity. Hallucination. That word has been around a long time. Maybe not never, but these were certainly not words I associated with technology until more recently. One of the things, to your point of, well, I'll just throw some stuff in here into the LLM, is that you can inherently introduce all of those. You could potentially be introducing bias because of what you're feeding it. You could be introducing toxicity. And particularly if you stand it up from scratch: I've just made it, with my bias, any toxicity that's built in, and, hopefully not, hallucinations. I think where RAG comes in is this idea of taking the core LLM and supplementing it with key data to help with decision making (there's a minimal sketch of the pattern below). To your point about a layman's approach: I'm giving it a supplement, as opposed to building from scratch, that is more specific to me, my organization, and that shouldn't, in and of itself, add that bias, that toxicity, etc. Right. We're going to go where we always go, and that is, as an information intelligence company, KeyMark's passion is information and content and data: not just getting it, but making it clean, making it contextual, feeding it into your systems, your data lakes, and fueling enterprise AI through intelligent ingestion. Um, at the end of the day, we're not just processing documents. We're making that information contextual and usable. And that starts with what we've talked about on other podcasts: intelligent document processing, that ability to ingest the information. The vast majority of information in an organization is untapped gold, we like to say, with the caveat that you may not want to ingest all of it; we certainly understand that. But how do you get all of that information out of that unstructured data? It's so important if you're going to build your own LLM. Why? Because that is the true source. And doing that through IDP is, I think, mile number one. Would you agree? I totally agree.
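Here is that RAG sketch: at question time, retrieve the most relevant pieces of your own governed content and hand them to a foundation model as context, instead of retraining or fine-tuning the model itself. The `embed` and `generate` functions below are toy stand-ins for calls to a real embedding model and a hosted LLM (Gemini, ChatGPT, Claude, etc.), and the policy snippets are invented; only the retrieval pattern is the point.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding (bag of characters), purely for illustration."""
    v = np.zeros(128)
    for ch in text.lower():
        v[ord(ch) % 128] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

def generate(prompt: str) -> str:
    """Stand-in for a call to a hosted foundation model."""
    return f"[model answer grounded in a prompt of {len(prompt)} chars]"

# 1. Index your governed, validated enterprise content ahead of time.
documents = [
    "Bereavement fares must be requested BEFORE travel; retroactive claims are not eligible.",
    "Invoices over $10,000 require two approvals before payment.",
]
index = [(doc, embed(doc)) for doc in documents]

# 2. At question time, retrieve the most relevant passages...
def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    scored = sorted(index, key=lambda pair: float(q @ pair[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

# 3. ...and supply them to the model as context, rather than rebuilding
#    or retraining the model itself.
question = "Can I claim a bereavement refund after my flight?"
context = "\n".join(retrieve(question))
answer = generate(f"Answer ONLY from this policy text:\n{context}\n\nQ: {question}")
print(answer)
```

The design point the hosts keep returning to lives in step 1: the model is rented, but the index is yours, so everything about accuracy and governance concentrates on what you let into it.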
Well, and to your point, we talk about unstructured a lot; it's in the podcast name. In my mind, I would divvy unstructured into two parts. We've talked a lot about the content at rest: things I've been storing for years, and unearthing that, because that's the eighty to ninety percent of unstructured information sitting in the organization. There's also all the day-to-day transaction flow, the new information coming into the organization. So how am I ingesting that and parsing that and making use of it in real time as well? That's one of the big questions. So as you think about IDP as a technology, whether you're talking about that inflow of new information or that data that's been at rest: in our organization, our technology category, for as long as I can remember, there was this expectation of absolutely dead-on accuracy at the end. Years and years ago it started as, well, what's your OCR accuracy percentage? And we've talked about the fact that that's not the right question. The right question is, what's the accuracy when it goes into the system? So years ago, in AP, you were doing AP all the time. You can't exactly say it would be okay to read a bunch of information off an AP invoice and go, well, we got about eighty five percent of it right, we're just ripping it into the ERP, and I'm sure that'll be great. That's an absolutely unacceptable scenario. Right. But that's kind of how people are approaching some of this AI and LLM development. It's mostly right. It's directionally correct. Yeah. And in our industry, because the work we do is typically around sensitive, regulated corporate information, directionally correct is not okay. It's got to be spot on. So we think about it through this lens: okay, I'm going to read information. Is there information I could go look up to supplement it, to validate it, to make sure that what I've read is correct? Is there a human-in-the-loop step that can take the combination of those and make sure it's okay? (There's a sketch of that routing below.) And what the evolution of IDP has done is say, we can do that for lots and lots of different use cases now that we didn't used to be able to address. What it does for us is apply that same level of rigor, the this-has-to-be-dead-nuts-accurate-to-go-in-the-ERP rigor from that AP world, much more broadly, so that what goes into your RAG, your retrieval augmented generation (which just rolls off the tongue, it does), is something you can count on. It's not going to lead to that hallucination that should scare the heck out of every executive overseeing this. Right.
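As a concrete sketch of that rigor: each extracted field carries a confidence score, gets cross-checked against a system of record where one exists, and anything uncertain is routed to a human rather than flowing straight into the ERP or the RAG index. The field names, the threshold, and the vendor list here are all hypothetical; the routing logic is the pattern being described.

```python
from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the IDP engine

CONFIDENCE_FLOOR = 0.98        # "directionally correct" is not okay
KNOWN_VENDORS = {"Acme Supply Co", "Globex Industrial"}  # system of record

def validate(field: ExtractedField) -> str:
    # Cross-check against data we can look up, not just the OCR score.
    if field.name == "vendor_name" and field.value not in KNOWN_VENDORS:
        return "human_review"
    if field.confidence < CONFIDENCE_FLOOR:
        return "human_review"
    return "auto_approve"

invoice_fields = [
    ExtractedField("vendor_name", "Acme Supply Co", 0.99),
    ExtractedField("invoice_total", "12,480.00", 0.91),  # too uncertain
]

for f in invoice_fields:
    print(f"{f.name}: {f.value!r} -> {validate(f)}")
    # Only auto-approved values flow into the ERP or the RAG ingestion;
    # everything else lands in a human-in-the-loop queue first.
```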
Which is why it's important to bring in the right organization, the right people, to help you with that. And we did a whole podcast on how to choose that right vendor. Yeah, right. And unfortunately, "that's good enough" is not good enough. We cannot emphasize that enough when you're making decisions. Ninety percent of the way there is not good enough. And that's so important. I mean, think about some of the top use cases that immediately came out when people said, here's how you can use agents and bots with AI behind them. Think about technical support, right? Immediately everybody was like, well, I can stand up a bot. That means fewer calls coming into my call center, which means either I can focus people on the hardest stuff, or however you want to handle that. And that's great. But what are you going to feed it? Well, you're going to feed it your knowledge bases. You're going to feed it, maybe, your support case data. Well, what's in that case data? This goes back to the governance question. Is there customer information in there? Is there actual, like, server information in there? Probably, because these systems were completely internal previously. Now we're saying, let the public ask questions. You're exposing that data outside your walls. Correct. Now, that example might be one where people go, okay, we get it, duh. But it's that sort of oversight that just has to be present (the sketch below shows one such governance gate).
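Here is a sketch of that governance gate for the support-bot example: scrub obvious customer identifiers from case notes before they are ever embedded into a public-facing bot's index. The patterns are illustrative, not a complete PII strategy; real deployments layer dedicated redaction tooling and human policy review on top of anything like this.

```python
import re

# Illustrative redaction pass over support-case text before indexing.
# Order matters: phone numbers are replaced before the card pattern runs.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b(?:\d[ -]*?){13,16}\b"), "[CARD]"),
]

def scrub(case_note: str) -> str:
    for pattern, placeholder in REDACTIONS:
        case_note = pattern.sub(placeholder, case_note)
    return case_note

raw = "Customer jane.doe@example.com (call 555-867-5309) reported login failures on srv-prod-02."
print(scrub(raw))
# -> "Customer [EMAIL] (call [PHONE]) reported login failures on srv-prod-02."
```

Note what survives: the internal server name. Deciding whether that detail may leave your walls is the governance question the hosts are raising, and it is not something a regex settles for you.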
And so, on the accuracy point, the reason I get so excited about IDP is that we apply that rigor, which has been in existence in our world for a long time, and which isn't necessarily being observed by all the organizations we see out there. Well, so at the end of the day, if you're going to train a model, context and content are so critical. We use content and information and data kind of interchangeably, but it's what runs your business, and it's what's going to grow your business. So, any final thoughts if I'm getting ready to do this? Because this is a new thing, and sure, not everybody's jumping on it. There are solutions that can come in and layer over the top and pull some things internally, and you can get a lot of information from that. Those are great, and those have use cases we would support all day long. But if you're really going to get serious about this, any final thoughts on how to build out your own LLM, your own domain-specific language model? Yeah, yeah. Well, nice job on the segue. By the time we're done, I'll have it down. Perfect. Well, I was trying to figure out how to work in this example. I have a friend who likes to play around with these models, and he managed to use RAG to create a model that responds to you in sort of eighties-nineties hip hop vernacular, which is pretty awesome. Completely useful, by the way. Completely useful. Like, it will change the world, without a doubt. Without question. Yeah. More seriously, to your question, the thing I think is happening is this: people got ahold of what we'll call the public models. They did a lot of testing and they decided, well, that's not giving me what I want. It either lacks specificity or it's wrong or whatever. Specificity. So the challenge is (I know, right? My mom would be so proud), the challenge is that people have gone, well, that public model didn't work, so let's swing the pendulum over here: I'll just create my own. Yeah. And what we're saying is, look, there is this good place in the middle, which takes that public model, the one you think fits your business the best. Going back to your point, maybe it's Gemini, maybe it's ChatGPT, maybe it's Claude. But then there's this RAG piece, right? This ability to supplement the model with your business's intelligence, without spinning it out into bias or toxicity or those kinds of things. And the point we're trying to pound home is that that RAG piece, how you feed it, what you feed it, the accuracy of it, the governance of it, the oversight of it, is where you go from "I've got all those efficiencies we're all reading about, the productivity and whatever" to, no, I'm having to defend myself for having put bad information out into the public. Essentially, we like to say this is a great way to get ROI out of your AI. See what I did there? Yeah, yeah. Oh, it's catching on. How long were you working on that? No, I've been saying that for a while. Okay. Yeah. You can ask around. Okay, let's go with it. Yeah. I didn't know. AI did not create that. Wow. Yeah. No, I know, it's created most of everything I've said so far. Yeah. No, that's actually not true. But at the end of the day, are you just getting some AI to get some AI? There's so much money being spent on things that aren't getting a return, and we can't hammer home enough that this is one of those areas that will result in a return. And to round off your point, and give the listener a break: that MIT study that everybody now loves to cite, with the ninety five percent failure rate, speaks to exactly what you were saying about getting AI to get AI, because what it talked about was a focus on superficial outcomes. Right? Yeah. Which is exactly what you just described: I need to tell my boss we've done AI. Yeah. And metrics. And that is not going to pull the organization forward. That's right. I mean, some people may or may not keep their jobs doing that, but be a hero. Yeah. Be a hero. Do it the right way. So, we know that a lot of the things we talk about here, I'm sure, create additional questions. Yeah. And I don't always do this, but I do want to invite any listener to engage with us. You can go to our website, keymarkinc.com, look us up, drop comments on the videos if you're watching through YouTube, and engage with us. There's no way we can fit everything into these podcasts. They're meant to give some information and create some curiosity, some questions, and we would love to engage with people. So that'd be fantastic. And as I usually say at the end, if we've said something brilliant, we planned it. Right, exactly. Yeah. So we'll see you next time on the Mostly Unstructured Podcast. Thanks, everyone.