Human seeded Evals — Samuel Colvin, Pydantic
Channel: aiDotEngineer
Published at: 2025-07-25
YouTube video id: o_LRtAomJCs
Source: https://www.youtube.com/watch?v=o_LRtAomJCs
I'll assume, given the time we have, that you roughly know who I am and what Pydantic is, so I'll move on. I'm reusing the talk I gave at PyCon, which was "Building AI Applications the Pydantic Way", although, as I say, I'm not going to be able to get to the eval stuff today; I can talk about the other two parts. Everything is changing really fast, as we all get told repeatedly in ever more hysterical terms. Actually, some things are not changing: we still want to build reliable, scalable applications, and that is still hard. Arguably it's actually harder with GenAI than it was before, whether that's using GenAI to build the application or using GenAI within your application. So what I'm going to talk about here are some techniques you can use to build applications quickly, but also somewhat more safely than you might otherwise. I'm a strong believer that type safety is one of the really important parts of that, and not just for avoiding bugs in production. No one starts off building an AI application knowing what it's going to look like, so you are going to end up refactoring your application multiple times. If you build your application in a type-safe way, if you use frameworks that allow it to be type-safe, you can refactor it with confidence much more quickly. If you're using a coding agent like Cursor, it can run type checking to basically mark its own homework and work out what it's doing right, in a way you can't do with a framework like LangChain or LangGraph, which, whether through decision or inability, decided not to build something type-safe. I'll talk a bit about MCP if I have a moment, and I won't talk about how evals fit in, because I don't have time.

Before we look at code: nothing I'm going to say here about what an agent is is controversial; it's reasonably well accepted now by most people as a definition of an agent. The image here is from Barry Zhang's talk at AI Engineer in New York in February. This is his definition, or the Anthropic definition, of what an agent is, now copied by us, by OpenAI, and by Google's ADK; I think it's generally the accepted definition of an agent. The one-line version, although very neat, doesn't really make any sense to me. This fuller version, however, does make sense. What they say is that an agent effectively has an environment; there are some tools, which may have access to the environment; there is a system prompt that describes what it's supposed to do; and then you have a while loop where you call the LLM, get back some actions, run the tools (which updates the state), and then call the LLM again. There is, however, even in those six lines of pseudo code, a bug: there is no exit from that loop. And sure enough, that points towards a real problem, which is that it's not clear when you should exit the loop. There are a number of different things you can do: you can say that when the LLM returns plain text rather than calling a tool, that's the end; or you can have certain tools, what we call final result tools, which trigger the end of the run; or, with models like OpenAI's or Google's that support structured output types, you can use those to end the run. But it's not necessarily trivial to work out when the end is.
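In code, that loop looks something like the sketch below. This is my own illustrative pseudo-Python, not a real API: call_llm, run_tool, and the message shapes are placeholders, and I've added one of the exit conditions (plain text means we're done) that the original pseudo code was missing.

```python
# Illustrative sketch of the agent loop described above. call_llm,
# run_tool, and the message shapes are hypothetical placeholders.

def run_agent(system_prompt: str, user_prompt: str, tools: dict) -> str:
    messages = [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_prompt},
    ]
    while True:
        response = call_llm(messages)  # ask the model what to do next
        if not response.tool_calls:
            # the exit the original pseudo code lacked: plain text
            # (no tool calls) ends the run
            return response.text
        for tool_call in response.tool_calls:
            # run each requested tool and feed the result back, updating
            # the state the model sees on the next iteration
            result = run_tool(tools, tool_call)
            messages.append({'role': 'tool', 'content': result})
```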
So, enough pseudo code; let me run a real, minimal example of Pydantic AI. This is a very simple Pydantic model with three fields, and we're going to use Pydantic AI to extract structured data that fits that Person schema from unstructured data, this sentence here. Obviously, to fit this on screen, it's a very, very simple example, but the input could be an enormous PDF (well, probably not tens of megabytes in context, but definitely an enormous document), and the schema, which is very simple here, could be an incredibly complex nested schema; models are still able to do it. And sure enough, if we run this example, and the gods of the internet are with us, we get the Pydantic model printed out. Some of you will notice that this example is simple enough that we don't actually need an agent or this loop: we're doing one shot. We make one call to the LLM, it returns the structured data (under the hood it calls a final result tool), Pydantic AI performs a validation, and we get back the data. But we don't have to change that example very much to start seeing the value of the agentic loop. Here I'm being a little bit unfair to the model: I've added a field validator to my Person model which says the date of birth needs to be before 1900, while the input itself is ambiguous; it doesn't say which century we're talking about. The model will, for the most part, assume "87" is 1987. It will then get a validation error when we do the validation, and that's where the agentic bit kicks in: we take those validation errors and return them to the model, basically saying "please try again", as I'll show you in a moment, and the model is then able to use the information from the validation error to try again. Obviously, if you were building this case for production, you would add a docstring to the dob field saying it must be in the 19th century. But there are definitely cases where models, even the smartest models, don't pass validation, and being able to use this trick of returning validation errors to the model is a very effective way of fixing a lot of the simplest cases. So, if we run this, you see we had two calls to Gemini here. The other thing you'll see in this example is that we instrumented this code with Logfire, our observability platform, so we can actually go in and see exactly what happened. You'll see our agent run had two calls to the model, in this case Gemini Flash. And if we look at the exchange, you can see what happened (I'll try to make it big enough that you can see it). First we had the user prompt, the description. The model called the final result tool, as you might expect, with the date of birth being 1987. The tool response was the validation error, with "please fix the errors and try again" added on the end. And sure enough, it was then able to call the final result tool correctly, with the right date of birth, and succeed. Cool. I've got five minutes; let's see how fast I can go. Am I on the wrong window? Here we are.
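A minimal sketch of that example, assuming a recent Pydantic AI release (the Agent class with output_type and run_sync, plus the Logfire instrumentation hook); the schema fields, prompt, and model string are my illustrations, not taken verbatim from the talk:

```python
# Minimal sketch of the extraction-with-retry example. The schema,
# prompt, and model string are illustrative; Agent/output_type/run_sync
# and Agent.instrument_all() are assumptions about a recent release.
from datetime import date

import logfire
from pydantic import BaseModel, field_validator
from pydantic_ai import Agent

logfire.configure()      # send traces to the Logfire platform
Agent.instrument_all()   # record agent runs, model calls, and tool calls


class Person(BaseModel):
    name: str
    dob: date
    city: str

    @field_validator('dob')
    @classmethod
    def check_century(cls, dob: date) -> date:
        # the deliberately unfair constraint: the model's first guess
        # (1987) fails validation, the error is returned to the model,
        # and it retries with 1887
        if dob.year >= 1900:
            raise ValueError('date of birth must be before 1900')
        return dob


agent = Agent('google-gla:gemini-2.0-flash', output_type=Person)
result = agent.run_sync('Samuel was born on the 28th of January 87 and lives in London')
print(result.output)        # a real Person instance, validated by Pydantic
print(result.output.name)   # the type checker knows this is a str
# result.output.first_name  # would be flagged as an error by the type checker
```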
The other thing worth saying here, even if I don't have much time, is about the type safety in this example. The way we do this under the hood, Agent is generic in the output type, in this case Person. So when we access result.output, in typing terms it's an instance of Person, and at runtime it's guaranteed by the Pydantic validation to really be an instance of Person. If I access .name, all will be well; if I access .first_name, we get the nice error from the type checker saying this is an incorrect field. That's the very beginning of the value of our static typing support; we go a lot further. Some of you might have noticed there's a second generic parameter on Agent, which is the deps type, and if you register tools with the agent, you can have type-safe dependencies in those tools, which I'll show you in a moment. The other thing you'll notice missing from that example is any tools, so let's look at an example with tools. This is an example of memory, long-term memory in particular, where we're using one tool to record memories and another tool to retrieve them. You'll see we have these two tools here, record memory and retrieve memory; tools are set up by registering them with the agent via a decorator. But this is where the typing, as I say, gets more complex. We've set deps_type when defining the agent, so our agent is now generic in that deps type (the output type is str, because that's the default). When we use the tool decorator, we have to set the first argument to be RunContext, parameterized with our deps type, and so when we access ctx.deps, that is an instance of the deps dataclass you see there. If we access one of its attributes, we get the actual type, and if we change it to be, say, int, we suddenly get an error saying we've used the wrong type. So we get a guarantee that the type here matches the type here matches the attributes you can access here. And when we come to run the agent, our deps need to be an instance of that deps type; again, if we gave it the wrong type, we would get a typing error saying you're using the wrong type. As far as I know, we're the only agent framework that works this hard to be type-safe. It is quite a lot of work on our side and, I'll be honest, a little bit of work on your side as well, in that it's not necessarily trivial to set up, but it makes it incredibly easy to go and refactor your code. So, we run this here... I'm pretty sure I don't have Postgres running. Do I have Docker running? I don't know if I have time to make that work. That's Docker running; I'll just try and run this very quickly: docker run... hopefully that's enough. If I now run this example, what you will see is that it successfully failed. Great. I'll try one more time and see if I get lucky. I don't know quite what was going on there. Well, we can look in Logfire and see what happened to make it fail. I promise you I hadn't set that up to fail the first time to demonstrate the value of observability, but maybe it can help here.
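For reference, the memory example being run here looks roughly like this; a minimal sketch assuming Pydantic AI's deps_type/tool/RunContext API, with an in-memory list as my stand-in for the Postgres store used in the talk:

```python
# Minimal sketch of the long-term-memory example. The Deps shape and
# the in-memory list are stand-ins for the Postgres store in the talk.
from dataclasses import dataclass, field

from pydantic_ai import Agent, RunContext


@dataclass
class Deps:
    memories: list[str] = field(default_factory=list)  # stand-in for Postgres


agent = Agent('google-gla:gemini-2.0-flash', deps_type=Deps)


@agent.tool
def record_memory(ctx: RunContext[Deps], memory: str) -> str:
    # ctx.deps is statically known to be Deps, so the type checker
    # verifies every attribute access here
    ctx.deps.memories.append(memory)
    return 'memory recorded'


@agent.tool
def retrieve_memory(ctx: RunContext[Deps], query: str) -> str:
    # the naive substring match from the talk: a query of "your name"
    # misses "user's name is Samuel", while "name" finds it
    matches = [m for m in ctx.deps.memories if query in m]
    return '\n'.join(matches) or 'no matching memories'


deps = Deps()
agent.run_sync('My name is Samuel; please remember that.', deps=deps)
result = agent.run_sync("What's my name?", deps=deps)
print(result.output)
```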
So if you look at this first agent run, you'll see that it used the tool called record_memory, storing "user's name is Samuel", and then returned finished. The second time, when it called retrieve_memory, the argument it gave was "your name", which is not contained within what was stored the previous time. We're just doing a very simple substring match here, and "your name" is not a substring of "user's name is Samuel"; that's why it failed that time. So this has turned into a very useful example of where Logfire can help. If we look at the second run, you'll see "user's name is Samuel" stored, and when it ran the agent, it just queried for "name". "name" obviously is a substring of "user's name is Samuel", so it got the response "user's name is Samuel" and therefore succeeded. The other thing we get here is the tracing information, so we can see how long each of those calls took, and we also get pricing, both aggregated across the whole trace and on individual spans. I am told that I am running out of time. So, thank you very much.