(Music)
My name's Ed Thompson. I am CTO and co-founder here at Matillion. I'm joined by three guests today to talk about agentic AI. So I'm just going to do some quick introductions. Julian, would you like to introduce yourself?
Yeah, I'm Julian Wiffen. I'm Matillion's Chief of AI and Data Science, and my team work hand in glove with the developers working with Sam.
Excellent, Sam?
Yeah, so I'm a Senior Staff Software Engineer here at Matillion, primarily working on our team of virtual data engineers.
And we're very grateful for our special guest today, which is Sarah Schlobohm. Sarah, would you like to introduce yourself?
Hi, I'm Sarah Schlobohm. I'm a Senior Leader in AI. I've had Head of AI and Chief AI Officer roles in recent years.
Fantastic.
So Sarah, very impressive looking at your LinkedIn profile and your background. But the one thing that really jumped out at me is that you're a goat zoomer. Goat zoomer, yes. What is a goat zoomer? And what does that mean? I've been dying to ask you since you arrived, and I thought, I'll save it for the podcast.
Yeah, it's really cool. So I've been on the charity board for a sustainable hill farm in Lancashire. And when the pandemic hit, the farm had lots of income sources that dried up, from school visits and things like that. So my friend Dot got the brilliant idea of hiring out her goats to join Zoom calls, because everything had gone online. So for a fiver, you could be sitting in your required-fun happy hour with your team and then suddenly this goat would just pop up. Love it. And it would start, like, joining the chat and everything. So when I was helping out, sometimes I'd literally be one of the people in the pen holding the phone, just taking video of this goat joining a meeting.
I know enough about goats to know that they will eat anything. They will. Was there a problem, and we're just gonna talk about the goats by the way, with preventing the goats eating, like, the rest of the office?
They were in their pens. Okay. The goats. It's just the goats. They were joining the—
There was a problem with the goats, like, headbutting you a lot while we were trying to hold the phone steady. I mean, you got some big shoulders just from that.
So we should be less worried about agents joining our Zoom calls and impersonating people, and more worried about goats now. Is that the worry?
You never know what they're up to. They're pretty suspicious. Yes. I was interviewing someone for an AI role once and the cat brought in a mouse that ran up my trouser leg mid-interview, and I had to chase it down. So animals may be an unexpected part of the AI world.
For sure. You can still hire a goat, and you can get ‘goat with a note’ as well, which is fantastic. You write whatever message you want on the edible paper and then the goat just destroys it in front of you. Fantastic.
Well, Sam did his best there to pull us back on topic. Yes.
And, agentic AI. I'm joined by two experts here who are building agentic systems and you've got an enormous sort of depth of background in agentic AI. So I'll be the chump for the call. I'll just ask the questions, but I'm really interested.
Really, we were barely talking about agentic AI perhaps six months ago, and now it's everywhere and we're building software at a breakneck pace. What do you think about how organizations are gonna change with the introduction of agentic workflows right across the lines of business? Are organizations ready?
I mean, I don't think they are. I think the two big questions you have to ask about whether an organization is ready are: is the data ready, and are the people ready? Yeah. And I think a lot of companies, if they're honest with themselves, have to answer no to both of those.
Yeah, I think that's definitely the case. And we've done a lot of work in engineering and we're also doing a lot of work outside of engineering at Matillion to try and look at every single process in the business and the high impact ones.
What would you say we're seeing as the kind of the sticking points? Where do we need to kind of work hard to bring people on the journey?
Well, the first part is just getting people to understand what's possible and then get used to what's possible. The second, I'd say, is getting the folks who've got the process and domain expertise either hands-on with the tools or very closely tied in with the people making it happen. Once you get that, you start to see a kind of cascade of people spotting use cases. But you want the people who own the process today to feel like they're the ones automating it. Once they get a feel for what they can do with it, it gets into their heads: automate the boring parts, automate the parts where you don't feel like you're adding much value. But it needs to be a team doing it themselves with assistance rather than having it done to them.
In some ways the POCs, everyone's experiments with it, are very valuable for this purpose, just for getting the concept into people's heads of what can be achieved.
Sam, we've done a lot of work bringing people into the fold, particularly on agentic engineering, the use of agentic tools like Windsurf and Claude Code and Cursor. And it's been a real challenge for me to bring those to the engineering team and get engineering engaged. Some people have engaged fantastically well; there's definitely some resistance in there. But one of the things that I think we've learned is that it's not a one-and-done. There is a level of mastery in the tools, on a kind of range. How do you feel about that mastery journey for agentic tools?
It depends. It's such a broad spectrum. You've got people who are just using ChatGPT on the side as a second brain; every day they're bouncing ideas off it. And now, with the introduction of these coding tools, it's: how are they starting off their day? Are they using it to implement features directly? Or are they moving towards an entire feature being implemented? And we're seeing such a broad range in engineering. We've got people who are just bouncing ideas off it, as I said, but then on the mastery side there's a smaller set of users who have become superhuman in this space. And now we're setting up internal groups, aren't we, to try to cascade that out into the rest of the team, so that the people who've mastered it can help the others master it, and then slowly but surely they chip away at doing more and more with AI. I think the hard part, as you know, is measuring that. How do we start to measure the output of our engineers?
And bounce that question back to you.
Well, yeah, I mean, I don't know. (Laughs) When someone's got the answer, please come and tell me, because we'll go and make millions of dollars doing very effective productivity measurement. But like the FOMO one is really important, sorry.
I was just gonna say, one place I worked just used a really simple metric on every Jira ticket: did you use AI for this or not? And sometimes the simple metrics just work.
Exactly, exactly. We've done a similar thing, it's a little bit more of a complicated question than that. But yeah, we...
What have you seen in terms of that? It's that journey from first slightly naive prompt to what I see today where engineers, particularly high quality engineers, they're breaking down problems. They tend to be putting in multi-paragraph prompts and then almost going away and making a coffee and letting the machine do a lot of the legwork.
How can we have... Engineers tend to be at the forefront of these things. How can we apply that same learning journey, do you think, to other parts of the organization? Can we do it in sales? Can we do it in support? Can we do it in G&A and finance?
Yeah, absolutely. I mean, I think one of the best things I did in a previous role was to just do AI training for everyone. And I think, especially with the EU AI Act coming out, it's gonna be one of these things like your annual GDPR training, your annual how-not-to-take-a-bribe training. Here's some basic AI training. And I think just getting people to not be afraid of it helps, but then you run targeted sessions with each of these teams once they've had that "it's not scary, it can actually help you" conversation. Okay, how can it actually help me? What part of my job do I really like? For what part of my job do I say, this job would be great apart from...? And every team's gonna have their own "apart from": "Oh, I hate these password resets," says IT or customer service, or "I hate these forms that are filled out wrong every time". But whatever that problem is, that's such a great thing.
Yeah, my advice when I was asked about it previously by Matthew, our CEO, was: hunt for the boring bits, hunt for the work people consider dull and where we don't add value, and that's the way in. But hand in hand with that, getting it into people's hands is the best way to get them to see other things they could go after. We did it with QA: they started by building test plans, and then they realized it could write release notes and it could write acceptance criteria.
I think that's the scenario. I don't know if you see any of those other kind of adjacent use cases coming out with some of your... It's really exciting to see how excited people get about it when it lands with them.
So one team that I worked with had to process a lot of, basically, forms and documentation. And one of the managers there did a great Pareto analysis, and it was basically five things going wrong that caused 70% of the cases they had to go back and forth on. So get the AI to fix those things and suddenly your job gets a whole lot nicer.
Yeah, yeah.
We were looking at this, and there are various studies. It's always really interesting, because we focused on the coding issue primarily in engineering. And I mean, not just at Matillion, but as an industry, the Claude Codes and the Cursors of this world are incredibly popular and growing like crazy.
But the coding problem, even some of the best engineers, they're not spending more than 20% of their time actually writing code. There's a whole load of other parts of the process. And a lot of, I think, some of that stuff is somewhat overlooked in terms of the process, maybe in Jira, the tickets, but then there's kind of the softer things and the things that engineers hate, like sitting in meetings and doing design.
Documentation.
Documentation, all that sort of stuff.
So I think deep research has just become so good now. It's like you say, there's the hands-on, there's the coding, there's using the coding agent, but then there's, oh, now I can deep research five or six things at once through ChatGPT or something like that, which has just changed the game really.
And one of the things we're starting to talk about a lot is documentation in the data engineering space. What we're finding, of course, is that the agents are great for this: I look at my existing workflows, and what does that tell me about the layout of the data? And generating that documentation is then very valuable to the agent on the next pass, for helping the user understand what tables to use and how to connect them up. It makes the agent, in effect, a member of the team that's tapping into the tribal knowledge. So those cases where, as a byproduct of the agents being involved in the process, you let them actually create the documentation and create knowledge are a really interesting area for us to explore. I'm sure there are plenty of other processes too where that comes up.
But I think it's really interesting that you said, but of course the engineers are at the forefront of this, because I don't think that's true in every company. Ah. I've definitely seen a few different use cases, I'm not picking on anyone in particular, but where software engineers have been very resistant and said, ah, but it can never do this, I don't trust it, take it away from my data, no vibe coding allowed.
Okay, what's been your experience with that?
So yeah, I should clarify, we've definitely seen both sides of that. I think there are a few tricks that we employed to try and break that down. First of all, we went in with a strong expectation, an ask really, that everyone tries it. Please don't jump to a conclusion; you're curious, you're an engineer, so use that curiosity to experiment. And we created the space and the opportunity to do that, and we were very open: try different tools, let's see what sticks. So creating that environment, I think, was very important. And then once we started to see some traction, you have individuals that you can point to and say, it's working for them, so figure out why it's working for them. And then there were a couple of tipping points, where there are individuals in the engineering team that we'd identified as, when that person is on the bus, we're winning. And there were some interesting moments where, first of all, they came back and said, hey, we've tried this, but we're struggling with it. And it wasn't, oh, I don't think AI is gonna work. It was, hang on a minute, I realize that there's more to this and I need to learn more before I can master it. And then there was a moment where those people got over the hump and they said, oh yeah, this is like having an extra engineer, an extra pair of hands, I'm more productive because of this now. So it took us on a journey, and there are still some people that naturally swing back to their prior way of doing things. But once you've got some traction, you've got some FOMO within the teams, and you definitely start to see progress. And measurability is the really hard problem, but the best way I've got of measuring it is just the usage of the tools.
The token usage, and the amount of usage you see in Claude and in Windsurf, ticking up. And unless they're just wasting tokens for the sake of it, you can be fairly confident that they're getting value from it.
So, I feel we've spoken a lot about coding agents and stuff like that. Sarah, what other types of operational processes do you think are ripe for agentic AI to be applied to?
Anything where you're on autopilot, right? Anything where you can kind of switch off your brain: I'm just gonna churn through this, this is the part of my job I don't really like, I don't have to pay attention to it. That's the thing to automate. And anything where you've sat there in that situation where you're using ChatGPT as your second brain, right? And you're saying, okay, I've talked through this problem with you, I'm using you for IT support. And you think, can't you just do that for me, then? Yeah. That's a great use for agentic AI.
Yeah, absolutely. We describe it occasionally as the infinite intern: anything you could give a bright intern who doesn't have any prior experience. If you could explain how to do the task with some simple bullet points and leave them churning away, that's always a good use case for it. You could throw a hundred of them at it. A hundred of them at it, like, read through all these documents and find these three facts from each one, that type of thing. Yeah.
So you two guys have built up a good body of experience now taking those agents and giving them tools and letting them do things. And we've obviously applied that to our own tool because we want to build agents that solve data engineering problems. But what have we learned in terms of what are the things you need to think about, the things you need to be careful of, the potential dangers of handing an agent a tool to do something?
Well, I'd talk about it in these terms: everyone in the corporate world talks about thinking outside the box. We want our agent to think inside a box. We want to draw the edges of the box, give it tight boundaries and restrict it. Because they could go off on all sorts of different tangents if you give them the wrong prompts. And burn a lot of tokens. And burn tokens, or just not do what you're looking for. Whereas the more you define the problem and the problem space, the better. So in data transformation, one of the early lessons we learned was that getting it to answer just yes or no was really powerful. When we were first getting started, getting it to come back in JSON format was our first battle, and Sam was firing off something like five queries in parallel to one of the GPT models, because only two or three of them would come back in the format we wanted.
Now it's more about getting them to work within the rules of the system, but that's also the way you give them feedback and intelligence: tools that give them structured responses as they work. So for us, sampling the tables of data really helps them move beyond the use case where you're just chatting with a bot in a chat window, because they're getting that feedback and the agent can then iterate. It doesn't need the human to say, oh, I got this error message, or that didn't work.
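A minimal sketch of the early "fire off several queries and keep the ones that come back in the right format" approach described above. The `fake_model` function is a hypothetical stand-in for a real LLM call, and the `{"answer": "yes" | "no"}` schema is an assumption for illustration, not the format Matillion used.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def fake_model(prompt: str, attempt: int) -> str:
    """Hypothetical stand-in for an LLM call; some replies are malformed."""
    replies = [
        "Sure! The answer is yes.",  # chatty text, not JSON
        '{"answer": "yes"}',         # valid
        '{"answer": "maybe"}',       # valid JSON, but outside the allowed answers
        '{"answer": "no"}',          # valid
        '{"answer": "yes"',          # truncated JSON
    ]
    return replies[attempt % len(replies)]


def parse_yes_no(raw: str):
    """Return 'yes' or 'no' if the reply matches the expected shape, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    answer = data.get("answer")
    return answer if answer in ("yes", "no") else None


def ask_in_parallel(prompt: str, attempts: int = 5):
    """Send the same prompt several times and keep only well-formed answers."""
    with ThreadPoolExecutor(max_workers=attempts) as pool:
        raw_replies = list(pool.map(lambda i: fake_model(prompt, i), range(attempts)))
    parsed = [parse_yes_no(raw) for raw in raw_replies]
    return [answer for answer in parsed if answer is not None]


valid = ask_in_parallel('Is order_id unique? Reply only as {"answer": "yes"} or {"answer": "no"}.')
print(valid)
```

Constraining the reply to a tiny schema is what makes the validation step this simple: anything that does not parse, or that answers outside the allowed set, is discarded rather than interpreted.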
A real key point is the buzzword, human in the loop, isn't it? For any actions that these agents might perform that are destructive or might cost a lot of money, you've got to have some sort of manual gate, at least initially. I mean, if people want to configure those rules a bit and make it more permissive for certain actions, then that's up to them. Maybe, in our case, they're using an isolated data warehouse that's all locked down for just dev use, and they're not too bothered about whether someone runs something. But in a production environment, it's very different. And then commercially, what's the popular example, an agent to book a holiday or something? You don't want it to go and book the wrong holiday, do you? You want an approval process at the end that says, I'm gonna do this, here's the plan, do you approve? And then you say yes, and it goes and books the holiday. Not that I've actually booked a holiday end to end with an AI agent yet; I'm a little bit nervous about that. I might end up with a five-star hotel somewhere, very expensive.
Or your trip's to somewhere at the bargain-basement end. Yeah, yeah, yeah.
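The approval gate discussed above could be sketched roughly as follows. The `Action` fields, the `dev`/`prod` environment names, and the cost limit are all assumptions made for illustration; a real system would have its own policy model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    description: str
    destructive: bool = False
    estimated_cost: float = 0.0


def needs_approval(action: Action, environment: str, cost_limit: float = 100.0) -> bool:
    """Decide whether a human must approve the action before it runs."""
    if environment == "dev":
        # Assumed permissive rule for an isolated, locked-down dev warehouse.
        return False
    return action.destructive or action.estimated_cost > cost_limit


def run_with_gate(action: Action, environment: str, approve: Callable[[Action], bool]) -> str:
    """Run the action only if no approval is needed or the human says yes."""
    if needs_approval(action, environment) and not approve(action):
        return "rejected"
    return "executed"


drop_table = Action("DROP TABLE sales_staging", destructive=True)
print(run_with_gate(drop_table, "prod", approve=lambda a: False))  # human says no
print(run_with_gate(drop_table, "dev", approve=lambda a: False))   # gate not applied in dev
```

Passing the approver in as a callable keeps the policy separate from the user interface, so the same gate could sit behind a chat prompt, a ticket, or a button.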
But it's how we put that structure around it, and again, I'm curious to see what problems you've seen with this, so that we can push that human right to the edge of the process and leave the agent running asynchronously or in parallel to tackle the problem. I don't know if you've seen that.
Well, I think one of the other things that's so important in that process is recording the decisions that the agent is making as well. So especially if you're working in a regulated industry like financial services, you need that full auditability and traceability and everything that you'd have in MLOps world, you need that for your agents.
Yeah, that's certainly something Sam's been tackling, and the team's been working on. We seem to create an enormous amount of information from the agents when they run, but it's all--
But if it's gonna get audited--
It's all gold, yeah. Fortunately, it's also gold for understanding the behavior of the system and analyzing it. One of the things we've been setting up is a framework we call the gym, to be able to say: here's my set of test scenarios, let's run them 10 or 50 times or similar, see how the agentic framework responds, what it does, and whether it succeeds or not, with an AI judge on the end. And that structure then lets us evaluate each variation, whether it's a new model, new wording in the prompts, or different settings. And I'm sure that's probably a relevant technique for all sorts of different spaces, where you can measure and repeatedly experiment, and you can fine-tune and iterate a lot faster.
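As a rough illustration of the "gym" idea described above (run each scenario many times, have a judge mark each run, report a success rate), here is a minimal sketch. The agent and judge are toy stand-ins, and all names are hypothetical; this is not Matillion's framework.

```python
from typing import Callable, Dict


def run_gym(
    scenarios: Dict[str, str],
    agent: Callable[[str, int], str],
    judge: Callable[[str, str], bool],
    runs: int = 10,
) -> Dict[str, float]:
    """Run every scenario `runs` times and return the pass rate per scenario."""
    results = {}
    for name, instructions in scenarios.items():
        passes = 0
        for run_index in range(runs):
            transcript = agent(instructions, run_index)
            if judge(instructions, transcript):
                passes += 1
        results[name] = passes / runs
    return results


def toy_agent(instructions: str, run_index: int) -> str:
    """Toy agent that fails every fourth run to mimic non-deterministic behavior."""
    return "ERROR" if run_index % 4 == 3 else f"done: {instructions}"


def toy_judge(instructions: str, transcript: str) -> bool:
    """Toy judge; in practice this would be an LLM call or a set of state checks."""
    return transcript.startswith("done")


scenarios = {
    "join_orders": "Join orders to customers on customer_id",
    "dedupe_customers": "Remove duplicate customer rows",
}
rates = run_gym(scenarios, toy_agent, toy_judge, runs=8)
print(rates)
```

Because the harness only depends on the `agent` and `judge` callables, swapping in a new model, prompt wording, or setting means re-running the same scenarios and comparing the rates.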
I suspect that is the part where businesses will struggle. It's like they build AI agents, but actually testing and validating and evaluating, benchmarking those agents is actually really hard at the moment. There's no real predefined tooling to do it. You've kind of got to build your own system, right?
I mean, one of the best techniques is just to sit with those subject matter experts. It's a good opportunity to get them involved, but just sit there and mark its homework. And that's as close to ground truth as you can get in a lot of these situations.
Yeah, we've had some fun. We've been making them sit multiple choice tests in some ways. In the early days, because our professional certification is a multiple choice test, we had them sit multiple choice tests, because you can mark them really quickly. So again, it's constraining the agent's output, where the final thing is: are you choosing option A, B, C or D? And then you can evaluate: okay, it said A, and B is the right answer. It's a good way to constrain the scenario.
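Marking a multiple choice test of this kind is quick to automate. The sketch below assumes the agent was told to reply with a single letter; the question IDs, answer key, and replies are invented for illustration.

```python
from typing import Dict, Optional

ALLOWED = {"A", "B", "C", "D"}


def normalise(reply: str) -> Optional[str]:
    """Accept only a bare option letter (optionally with a trailing full stop)."""
    cleaned = reply.strip().strip(".").strip().upper()
    return cleaned if cleaned in ALLOWED else None


def score(answers: Dict[str, str], key: Dict[str, str]) -> float:
    """Fraction of questions where the constrained reply matches the answer key."""
    correct = sum(1 for question, right in key.items()
                  if normalise(answers.get(question, "")) == right)
    return correct / len(key)


answer_key = {"q1": "B", "q2": "D", "q3": "A", "q4": "C"}
agent_replies = {
    "q1": "b",          # correct after normalising case
    "q2": "D.",         # correct after stripping the full stop
    "q3": "C",          # wrong option
    "q4": "I think C",  # not a constrained reply, so marked wrong
}
print(score(agent_replies, answer_key))
```

Treating an unconstrained reply as wrong, rather than trying to interpret it, keeps the marking fast and unambiguous.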
Oh, that's really interesting. I've used AI to help generate the quizzes themselves for e-learning content, which has been quick.
But that let us do things like evaluate the impact of a vector database or a knowledge graph on its understanding of how to use our tool, for example. So it gives you a very measurable way to do it. And the core is just: can I run this many, many times and score it quickly? And then you calibrate that with a subject matter expert to say, okay, do you agree with the way that we're marking this thing? So we're using an AI judge to score it, and the subject matter expert marks the AI judge's homework rather than the original system's.
This is one of the fundamental things that we've learned is, we're so used to building tests in software that are binary, it's like it passes or it fails. And with AI, that's just not the case. It's like Julian described his gym, it's essentially a big system test. And then there is an outcome and you can evaluate the outcome, but there's a subjectiveness to that evaluated outcome. So it becomes about accuracy and a metric rather than a pass or a fail.
But it's also a far heavier investment. We invest heavily in system testing in software, and it's gonna be a much bigger investment to build the people and the LLMOps to build the gym, and then build all of the constantly running evaluations, and of course all the token usage that drives, to constantly assure ourselves that we are building an increasingly high-quality and accurate agentic system and that we're not accidentally going backwards. So that's probably the biggest challenge in taking it beyond proof of concept into productionization.
And that's really interesting, because it touches on one of the previous points: one of the biggest barriers I've seen to this kind of adoption in software engineering is that software engineers like their tests to be binary, they don't like the fuzziness that comes along with AI, and they're not used to thinking probabilistically. I think that may be the most useful thing that we data scientists bring to the table. We're used to probabilities and uncertainties and only having a fraction of the picture. So we're comfortable with that kind of mess.
Yeah. Yeah, and from my point of view, getting the data scientists and the engineers working closely together has been really crucial as our AI team grows and grows. And obviously the more people you have, the more challenging it is to get all of those relationships working well, but.
And the gym, I mean, the gym sounds like amazing, agentic cross-fold validation basically.
Yes, yeah. And the core framework really, and I'm sure you could use this in a lot of systems, is that you have a headless version of the system, so that effectively you can mirror the conversation. This is all Sam's work putting it together. So you have an LLM that pretends to be the human tester, with a conversation going back and forth, and then you have an AI judge evaluating the logs coming out of the bottom. Plus, obviously, a bunch of other telemetry going to systems like Langfuse or similar to measure all the steps and the tool choices. And then we have a battery of test scenarios that can just be written as you would give a human tester instructions: try and get it to do this, and then some success criteria at the end for what the data, the conversation, or the output should look like.
And you're able to do all sorts of interesting things. We've got some placement students from MIT over this year, and one of them, Jessica, has been doing a great job just evaluating a variation of the prompt that says: check to see if you've already got a pipeline that does the thing you're being asked for before you build a new one. And we're able to very mechanically measure the percentage improvement in how often the system uses the tool to evaluate the other pipelines before it gets started, and say, okay, now it's doing that check 80% of the time when it was only doing it 50% of the time before. So it's a small change, a measurable improvement, easy. It's a no-brainer to then put that through.
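Measuring that kind of improvement comes down to counting, per run, whether the check happened before the build. A minimal sketch, assuming each run is logged as an ordered list of tool names; the tool names `list_pipelines` and `create_pipeline` and the run data are hypothetical, chosen to reproduce the 50% and 80% figures mentioned above.

```python
from typing import List


def checked_first(tool_calls: List[str],
                  check_tool: str = "list_pipelines",
                  build_tool: str = "create_pipeline") -> bool:
    """True if the agent looked at existing pipelines before building a new one."""
    if check_tool not in tool_calls:
        return False
    if build_tool not in tool_calls:
        return True  # it checked and never needed to build
    return tool_calls.index(check_tool) < tool_calls.index(build_tool)


def check_rate(runs: List[List[str]]) -> float:
    """Share of runs in which the check happened first."""
    return sum(checked_first(run) for run in runs) / len(runs)


good = ["list_pipelines", "create_pipeline"]
bad = ["create_pipeline"]

baseline_runs = [good] * 5 + [bad] * 5  # original prompt
variant_runs = [good] * 8 + [bad] * 2   # prompt with the extra instruction

before, after = check_rate(baseline_runs), check_rate(variant_runs)
print(f"before={before:.0%} after={after:.0%}")
```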
I think the really difficult part of all of this is, if you were just building a software coding agent, ultimately you've got some source code, it compiles, and there will be tests that run, though the agent wrote the tests. With something like what we're building, you've got external state that you're mutating. So you've got data warehouse tables and stuff like that. So this evaluation agent, or this testing agent that runs on the side, actually has to go away and check the state of lots of different systems, and then eventually come up with some score that you can track over time for accuracy and relevancy. There are some more deterministic checks in there, like, did they pass or fail? But that all gets combined to give you ultimately a score out of a hundred for how good this thing was. It's never, this was 100%. There's always gonna be something that marks it down, probably.
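One simple way to fold deterministic checks and judged qualities into a single score out of a hundred is a weighted average. The check names and weights below are assumptions for illustration only; the source does not say how the real score is combined.

```python
from typing import Dict, Tuple


def overall_score(checks: Dict[str, Tuple[float, float]]) -> float:
    """Weighted average of per-check scores (each 0.0 to 1.0), scaled to 0-100.

    `checks` maps a check name to (score, weight).
    """
    total_weight = sum(weight for _, weight in checks.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(score * weight for score, weight in checks.values())
    return round(100 * weighted / total_weight, 1)


run_checks = {
    "target_table_exists": (1.0, 3),  # deterministic pass/fail check on warehouse state
    "row_count_matches": (1.0, 2),    # deterministic pass/fail check
    "accuracy": (0.8, 3),             # judged quality, hypothetical value
    "relevancy": (0.5, 2),            # judged quality, hypothetical value
}
print(overall_score(run_checks))
```

Tracking this number per scenario over time is what reveals whether a change moved the system forwards or backwards.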
That might, in the same way, get closer to how you measure the quality of human-driven processes: okay, we'll take some random samples off the production line, or I'll do the equivalent of a random mystery shopper to see if my agentic e-commerce site is handling itself properly. And that's always been a problem, even in data science and machine learning before we rebranded it all as AI: people expected perfection. Well, if you had a human handling the process, you didn't have perfection, almost guaranteed.
The benchmark's gotta be realistic. Yeah, maybe that's a key message you've gotta keep hammering: is it perfect? Is it 100%? No, but what success rate does a human get in this space? Or even if you're doing the human-in-the-loop comparison, what percentage of the time will two people agree on this judgment call?
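The "how often do two people agree" baseline can be computed directly and compared with how often the agent agrees with a person. A minimal sketch using simple percent agreement; the labels are made up for illustration.

```python
from typing import Sequence


def agreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of items on which two raters gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both raters must label the same items")
    if not labels_a:
        raise ValueError("need at least one labelled item")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


reviewer_1 = ["pass", "pass", "fail", "pass", "fail"]
reviewer_2 = ["pass", "fail", "fail", "pass", "pass"]
agent = ["pass", "pass", "fail", "fail", "fail"]

human_vs_human = agreement_rate(reviewer_1, reviewer_2)
agent_vs_human = agreement_rate(agent, reviewer_1)
print(human_vs_human, agent_vs_human)
```

If two people only agree 60% of the time on a judgment call, expecting the agent to match either of them 100% of the time is not a realistic benchmark.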
How do we think about managing expectations? Because we've definitely seen scenarios in the business where, you know, you introduce an AI system or an agentic system to an individual, and if you don't manage those expectations carefully, you can get, that's where you can get like pushback or rejection because it's, you know, they immediately try the hardest thing or they immediately try to do something fairly naive with it. How do you, how do you manage that?
I spent a lot of time just saying, well, AI isn't magic. A lot of people would have like a misconception of, oh, the AI will just learn this. Well, it can, but not unless you tell it to. This system isn't gonna, cause you didn't tell it to. I think that's a challenge we're still gonna face. But also things are just moving so quickly. The answer that you give today might be very different from what happens tomorrow.
That's another, I mean, that's another value of the gym, isn't it? Because what we're worried about is, you know, we could spend hundreds of thousands of dollars of the company's money doing reinforcement learning on a large language model to make it 5% or 10% better. But maybe you get that same gain by tweaking the prompt, or maybe someone brings out a new version of the model and we get 10% for free. There are so many variables that go in. The prompt can make a difference of up to 40%.
Yeah. That figure is just wild to me that just changing how you ask the question can have such an impact on the final product. So we spoke a little bit about like systems getting better over time. What can people actually do to build a system that gets better, an agentic system that gets better over time? Some techniques.
Yeah, still human feedback seems like the best one. I mean, the thumbs up, thumbs down.
What's our response rate, right, on the thumbs up, thumbs down?
Yeah, exactly. This comes back to explicit versus implicit feedback. The implicit feedback is far more reliable: are they using it? Do they keep using it? Which transactions do they keep going with? As opposed to explicit, what they actually say. One of the ones we set up was basically to check if they tried to use the AI and then contacted a human afterwards. That was one of the most reliable metrics, because did the AI actually help them with what they needed to do, or did they have to find another way to do it afterwards?
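The "used the AI, then contacted a human afterwards" metric could be sketched as below. The 24-hour window, the user names, and the event data are assumptions for illustration; the source does not specify how the follow-up was detected.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[str, datetime]  # (user, timestamp)


def escalation_rate(ai_sessions: List[Event], human_contacts: List[Event],
                    window: timedelta = timedelta(hours=24)) -> float:
    """Share of AI sessions followed by a human contact from the same user within `window`."""
    escalated = 0
    for user, started in ai_sessions:
        followed_up = any(
            contact_user == user and timedelta(0) <= contacted - started <= window
            for contact_user, contacted in human_contacts
        )
        if followed_up:
            escalated += 1
    return escalated / len(ai_sessions)


ai_sessions = [
    ("alice", datetime(2024, 1, 1, 9, 0)),
    ("bob", datetime(2024, 1, 1, 10, 0)),
    ("carol", datetime(2024, 1, 1, 11, 0)),
    ("alice", datetime(2024, 1, 3, 9, 0)),
]
human_contacts = [
    ("bob", datetime(2024, 1, 1, 12, 0)),    # two hours after bob's AI session
    ("alice", datetime(2024, 1, 2, 10, 0)),  # 25 hours after her first session: outside the window
]
rate = escalation_rate(ai_sessions, human_contacts)
print(rate)
```

A falling escalation rate over time suggests the AI is resolving more requests on its own, without relying on users to click thumbs up or thumbs down.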
Yeah, that's challenging for us. (Laughing) You could try it, I'd take it. So our agentic system is called Maya. And so we have this concept of Maya sessions, but the fundamental question was: was that Maya session good for the user, or was that Maya session bad for the user? If they don't click thumbs up or thumbs down, which they rarely do, unfortunately, it's really challenging. Because we don't have that natural step of, okay, that didn't solve my problem, so I'll-- It's almost like a marketing attribution thing. I see this pipeline was edited by Maya. I also see that that afternoon that pipeline was run and then put into a schedule. Can we claim that Maya built that pipeline, or did they try with Maya, give up in despair and build it all by hand anyway? All we can do is say, okay, the percentage of the time that Maya was involved in building pipelines is this much, and hopefully that's growing, that kind of--
I think tracking over time is the key thing, right? We know for humans, the reward for good work is more work. So that's happening to the agents too. That's good, that's how I allow it to use that. I mean, there's lots to be learned from thinking about them in terms of, okay, how does the human interact with this? How does the human learn from this? That's where I think we've got some exciting work that we're doing at the moment, in terms of how we learn about the data landscape from Maya's interactions. Because the best way to learn about a data set is to work with it and try and build a report or try and do something with it. And that's when you find, hang on a minute, these three columns are empty and these numbers are all junk and all the sales figures end in January, or what have you. And how do we capture that? As a byproduct, it captures really valuable information for informing the system in the future. And I'm sure that's true of almost any AI system that is doing tasks: you can put parallel things in place to say, okay, what have you learned from this to help make it smarter for the future? So I'm not training the model, but I'm updating my notes and my documentation in some way to make the system more intelligent.
And that was one of the things we saw in early coding hallucinations: it would often try and structure things as they should be rather than as they really were. And so if it could, yeah, just repeatedly give advice like, hey, have you considered organizing things this way? Yeah. That might be more helpful in the future.
So Sarah, right, rewinding almost back to the start of the podcast, on the very first question we talked about organizational readiness and you said something that I sort of filed away as we will definitely come back to that, so now is that time. You said that organizations are not necessarily ready organizationally, but you also said that organizations aren't ready in terms of their data. And that's something that's very close to our hearts as lifelong data wranglers. I couldn't agree more, but what have you seen as the challenges that organizations have bringing together agentic AI systems and their organizational data?
A lot of different pockets of that. Some of it is around PII and confidential information and who should have access to that. So using that within the organization, if you want to do something with information on HR systems, making sure you've got all the correct guardrails in place so you never leak the stuff that you ought not to leak. I think that data classification problem, is it confidential, public, super secret, double probation, whatever, that has been a challenge in some places.
But then the other bit is just having it. I think a lot of folks in the data space have long focused on structured data for a very good reason, because this has mattered. And I think shifting that mindset to "now we can handle unstructured data a lot better" has been really interesting to see with this wave of AI. The analogy I always give is, okay, if you could fit it in Excel, then it's still a good use case for traditional data science. If you'd put it in a Word document, then that's good for generative AI. And so a lot of folks have done good work around organizing their structured data, but their unstructured data is still quite obvious.
And then even trying to build something, I think a lot of vendors in particular will say, "Buy our product, you can just throw all your data into this database and then RAG magic, waves hands, AI happens". And there's a lot of confusing stuff in there. And I think again, would a human get confused by this? That's a little bit coming back to our infinite intern problem. What we found, and this is slightly the word from our sponsors, but what we've built tools to do is stuff where you can ask questions about the unstructured data to turn it into something structured. So you go through with classification or decision questions, or can you extract the, here's the company's annual report, can you extract these five facts from it sort of thing and then put them in these boxes? Or can you make a yes or no decision? Like, is this document talking about this topic? We had a really interesting one with getting the effective date out of the document. So not just a date that's on there, but when does this legal contract, for example, become valid? Is it when it was signed? Is it when it's listed in the text? Is it the date that might be stamped on top of it? Almost certainly not that one. But yeah, that was a really interesting challenge.
Yeah, we've seen some customers doing some interesting ones. I think insurers, particularly in the German market where there's still a lot of manual documents or PDF-based pieces, and their extraction was like, we need to tag all the home contents policies that cover basically when the counter is damaged because the dog peed on it. So like, does it meet this criteria? Or does it cover pets? We're trying to turn it into something structured to say, okay, which policies cover these things, which don't? And I've got a whole bunch of categories. And that would involve somebody reading the document two or three times to make each judgment call. That's a really good use case for it because you just have a clear instruction, clear output, push it through. And it's very auditable as well, because you can then go and have your human in the loop take a sample and say, do you agree with the judgment call enough times? Okay, we're confident in it, just the same way you might check the work of a new employee.
Do you think in these examples where it's like legal documents and stuff, there will always be a human there at the end to go, yes, this looks good. No, this looks bad. Or do you think we will get to a future where we will just let these things go through? Like there'll be another AI that checks the work that's just tailored to that. And it's like 99.9% accurate. Or do you think there's always gonna be a requirement in heavily regulated industries and stuff that someone comes and says, yes, that's approved by Sam Perrin, the legal expert.
Well, we get into the self-driving car problem then. Like, who's responsible if the self-driving car gets in an accident? Is it the person driving? Because they're not driving. Is it the programmer who has nothing to do with the situation? It's really complicated. There is no one in those cars. You can get away. We had a great time in some Waymos in San Francisco a few weeks ago. There's no one in there anymore. What is it? It always reminds me of, is it Total Recall? The one where-- Yeah, we did have a scenario, and I'm not gonna name names, where somebody's "oh, just one final thing" conversation went on so long the Waymo got bored and drove off. Oh yeah. (Laughing) Oh, that's brilliant.
Yeah, I mean, I think we are gonna do that in stages. I think it's gonna come slower than we think. So one of the ones that we thought was gonna be complicated in the same document processing issue was signatures. We thought that was gonna be complicated. Turns out, nah, that was dead easy for most of the other. We just like that one. We're just like, yeah, that just never gets it wrong. So go with it. Fascinating. I mean, it's not gonna pick out a forgery, obviously. But then neither is a human. I don't know what your signature looks like.
I guess, roughly, the benefit of AI is that it doesn't need to be reviewed just once. You could fire an agent or an LLM or even a dedicated trained model like 50 times. And is it gonna get it wrong 50 times? It might flake a couple of times, but it becomes a little bit of a shock. It's right, isn't it? Yeah. It's like the weather forecast. It's like a pretty good chance. It's like in 80 out of 100 scenarios, it's gonna rain tomorrow. So we say 80%, and so you're coming back with a model saying, 95 out of 100 agreed with this judgment call.
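The "95 out of 100 agreed" idea can be sketched in a few lines. This is only a minimal illustration, assuming each run of the model returns a single label for the same judgment call; the function name `agreement_rate` and the sample votes are hypothetical and not something described in the conversation.

```python
from collections import Counter

def agreement_rate(votes):
    """Return (majority_label, share of runs that agreed with it)."""
    if not votes:
        raise ValueError("need at least one vote")
    # most_common(1) gives the label that appeared most often and its count
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Hypothetical outcome of asking the same yes/no question 100 times.
votes = ["covers_pets"] * 95 + ["does_not_cover_pets"] * 5
label, rate = agreement_rate(votes)
print(f"{label}: {rate:.0%} of runs agreed")
```

A low agreement share would be a natural signal to route that document to a human reviewer rather than accept the majority answer.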
Sarah, do you think organizations are becoming more comfortable allowing frontier AI models to see their sensitive data? Because one of the things that we wrestle with, because we're trying to solve the problem of data engineering, is a data engineer-- A small task. Yeah, there's more tasks. Solve all data engineering. Just solve data engineering, easy. But one thing I definitely know that data engineers do when they're working with data is they look at it, and they're always eyeballing it and making sure that it's sensible and looking at the quality of it and making lots of human judgment calls as they work through the process. So when we're thinking about making that an agentic process, obviously the agent's gonna need to look at the data and make those same judgment calls, which it can definitely do. But then we inevitably get into that debate of how companies feel about it, because one table could be some innocuous sales data and then another table could be employees and salaries. And are the companies that are gonna win and compete just going to have to get more comfortable with that, do you think?
I think it gets really, when you hear about the leaks and all the hacks and all the cyber hacks that are happening afterwards, yeah, it gets really worrying. But I think to some extent, there's a risk to not doing that so much. And I think as always, it's a human risk, because if you don't give people the tools to do that safely, they will just be using shadow AI. Yeah. I was at a cybersecurity event where someone told me about how they were using ChatGPT on their phone under their desk with potentially sensitive data, you don't know what's in there. So better to have it someplace safe, where you have an audit trail, you have audit logs, you can take some snapshot of that and air gap it and know exactly who's done what when, rather than just trusting people to not take a literal screenshot of their computer and pass that to ChatGPT.
That's really interesting. So yeah, it's not an argument that I've tried with our customers, like, well, it doesn't matter what you do, that'll still happen anyway. It reminds me of the old IT issues about how you must have the world's most complicated password, and then people would put it on a sticky note. I remember it, yeah. Yeah. Because it was that being a similar thing. Yeah.
I think that's why it's maybe important that organizations give their employees multiple options for what tool they can use. Like in coding agents, I know we've spoken about this quite a lot, we have a team that's kind of evaluating them, but ultimately, if you're going to be more productive, we will give you what you need to be more productive. And that's great, but as you start to go into enterprise agreements and you get all the certifications and the SSO logins and all that sort of stuff, well, then those all get expensive and they want annual commitments and all that sort of stuff. So you can see why businesses go, oh, we use Google for everything, we're gonna use Gemini, but then you've got Joe Bloggs on ChatGPT, shadow ChatGPT, because that's the thing he's just good with. Do you think that it's good that orgs should kind of cascade out access to multiple platforms, or do you think it's a trade-off, right?
It's gotta be, and it's gonna vary very much with the role and the job, because it's not just the engineering teams. I mean, for you guys, obviously engineering is the whole business, but if you've got your marketing team working on stuff, they're gonna need potentially a different tool. If you have somebody editing something and they need storyboards, there's a better tool for storyboarding. I think you've got to take a really pragmatic approach to, okay, what's the right tool for the job? In some cases, maybe that's even better, maybe that's even lighter, maybe this team can get together and have a shared team subscription to ChatGPT and that's good enough for your marketing team, right?
Yeah. I think it's probably time we made you get your crystal ball out. We've talked about a lot of exciting things, where we are now, probably the immediate future, and all the things that we're all working on. But where are we going? Are we going to a place where, you know, artificial general intelligence comes along and essentially supersedes everything we're doing, or do we feel like the current branch of technology that we've got is going to continue to give us better models but not going to give us a step change?
So the analogy I always use is with mobile phones. So when we first got mobile phones, we used them as mobile phones. We took phone calls in different locations, right? And now I'm like horrified if anyone actually calls me. Now it's just the supercomputer that lives in my pocket and shouts at me when I need to look at new pictures of cats, right? Like it's out. Or goats. Yeah, and so I think we're just at that tipping point now where we're going from, okay, we're running out of, we are doing the same thing but better use cases. And that's where we should start, absolutely. We should make our lives easier. But pretty soon we're going to start to see what are these amazing new things people are going to be able to do with AI. Another analogy I use is photography. At first we just used it to record people's faces in the same way we did portraiture. But then photography became its own art form. So what kind of art are we going to be able to create with AI? Yes, there's lots of issues around copyright and we need to handle those ethically. But there's also so much room for creativity with what can be done and how that can be democratized to people. So, I mean, I hope we wind up in this like really cool Jetsons scenario of like, I have a sassy robot that lives in my house and does all my chores for me. But I think we have to think carefully about what we want the future to look like.
Yeah, I feel like there's going to be some speed bumps along the way. Definitely, the genie's not getting back in the bottle. But I was really interested in the release of GPT-5 recently. I think there was some expectation that it might be a revolutionary step forward that changed the game. I think the reality is, you know, it is definitely a step forward, but it's a better version of the same thing, one that we can rely on a bit more heavily, that's a bit more efficient and a bit more effective. But it probably doesn't bring in that AGI use case, despite what some might believe. So yeah, getting to that next step, you never know what technology is going to come along tomorrow, but I'm fairly confident that for the foreseeable future, it's going to be about augmentation of the individual as opposed to replacement of the individual.
The big evolution is the move to more asynchronous, longer, more complex tasks being given, less interaction back and forth, more of a, I give it five Jira tickets and the AI does a pull request when it thinks it's ready, rather than us standing over its shoulder every step of the way. Well, agents are in the evolution of this architecture, aren't they? Like the LLM isn't the only thing now, there's all this inference time, this test-time compute that everyone's doing, that's becoming like the smart part now. And I think there's going to be so much more interesting stuff when we get really multimodal. I think, you know, we've had a few failed attempts at glasses that give you, you know, real-time feedback, but I cannot wait for my glasses to be able to put little names over everyone's head like in a video game, because that would help me so much, I'm so bad with names. I think more and more things like that, like, okay, I'm going to look at a person and you're going to tell me their name and you're going to remind me how I know them and you're going to say, okay, now ask them about their kids' summer school or whatever. The Meta glasses on, all in real time, yeah.
All right, well guys, we've looked at the crystal ball. I think that is a good place to stop. We'll look back in a couple of years' time and see how good our predictions are, and hopefully you've got your AI-augmented glasses. I feel like they might not be too far away.
That was a fantastic conversation, thank you very much. Thank you, Sam, thank you, Sarah, thank you, Julian. That is the Agents of Data podcast and we'll see you again next time. Thank you.