Human-seeded Evals — Samuel Colvin, Pydantic

Channel: aiDotEngineer

Published at: 2025-07-25

YouTube video id: o_LRtAomJCs

Source: https://www.youtube.com/watch?v=o_LRtAomJCs

I'll assume, given the time we have, that you kind of get who I am and what Pydantic is to some extent, so I will move on. I'm reusing the talk I gave at PyCon, which was about building AI applications the Pydantic way. As I say, I'm not going to be able to get to the eval stuff today, but I can talk about these two parts. So, everything is changing really fast, as we all get told repeatedly in ever more hysterical terms. Actually, some things are not changing: we still want to build reliable, scalable applications, and that is still hard. Arguably it's actually harder with GenAI than it was before, whether that's using GenAI to build the application or using GenAI within it. So what we're trying to talk about here are some techniques you can use to build applications quickly, but also somewhat more safely than you otherwise might.
I'm a strong believer that type safety is one of the really important parts of that, and not just for avoiding bugs in production. No one starts off building an AI application knowing what it's going to look like, so you are going to end up refactoring your application multiple times. If you build your application in a type-safe way, if you use frameworks that allow it to be type safe, you can refactor it with confidence much more quickly. If you're using a coding agent like Cursor, it can run type checking to basically mark its own homework and work out what it's doing right, in a way that you can't if you use a framework like LangChain or LangGraph, which, either through decision or inability, decided not to build something that's type safe. I'll talk a bit about MCP if I have a moment, and I won't talk about how evals fit in, because I don't have time.
Look, nothing I'm going to say here about what an agent is is controversial; this is reasonably well accepted now by most people as a definition of an agent. The image here is from Barry Zhang's talk at AI Engineer in New York in February. It's his definition, or the Anthropic definition, of what an agent is, now copied by us, by OpenAI, and by Google's ADK; I think it's generally the accepted definition of an agent. The diagram, although very neat, doesn't really make any sense to me. The pseudocode, however, does make sense. What they say is that an agent is effectively something that has an environment.
There are some tools, which may have access to the environment. There is a system prompt that describes to the model what it's supposed to do. And then you have a while loop where you call the LLM, get back some actions to run, run the tools, which updates the state, and then call the LLM again. There is, however, even in his six or so lines of pseudocode, a bug: there is no exit from that loop. And sure enough, that points towards a real problem, which is that it is not clear when you should exit the loop. There are a number of different things you can do. You can say that when the LLM returns plain text rather than calling a tool, that is the end; or you can have certain tools, what we call final result tools, which trigger the end of the run; or, if you have models like OpenAI's or Google's which have structured output types, you can use those to end your run. But it's not necessarily trivial to work out when the end is. So, enough pseudocode: let me run a real, minimal example of Pydantic AI.
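That while loop can be sketched in a few lines. This is a minimal, runnable illustration, not Pydantic AI's code: `fake_llm`, `record_memory`, and the message format are all invented, and the exit condition is the first one mentioned above, the model returning plain text instead of a tool call.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    memories: list = field(default_factory=list)

def record_memory(state: State, text: str) -> str:
    # a tool with access to the environment: it mutates the state
    state.memories.append(text)
    return "ok"

TOOLS = {"record_memory": record_memory}

def fake_llm(messages: list) -> dict:
    # stand-in for a real model: request one tool call, then finish
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "record_memory", "args": {"text": "user's name is Samuel"}}
    return {"text": "Done."}

def run_agent(user_prompt: str, state: State) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        response = fake_llm(messages)
        if "text" in response:          # plain text => exit the loop
            return response["text"]
        tool = TOOLS[response["tool"]]  # otherwise run the requested tool
        result = tool(state, **response["args"])
        messages.append({"role": "tool", "content": result})
```

The bug in the slide's pseudocode is fixed here only by that `if "text"` branch; without it, the loop really does have no exit.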
This is a very simple Pydantic BaseModel with three fields, and we're going to use Pydantic AI to extract structured data that fits that Person schema from unstructured data: this sentence here. Now, obviously, to fit this on screen it's a very, very simple example, but the input could be an enormous document (probably not tens of megabytes in context, but definitely very large), and the schema, which is very simple here, could be an incredibly complex nested schema; models are still able to do it. And sure enough, if we run this example and the gods of the internet are with us, we get the Pydantic model printed out. Some of you will notice that this example is simple enough that we don't actually need an agent or this loop; we're doing one shot. We make one call to the LLM, it returns the structured data, under the hood we call a final result tool, Pydantic AI performs a validation, and we get back the data. But we don't have to change that example very much to start seeing the value of the agentic loop. So here I'm being a little bit unfair to the model.
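In plain Python, what "extraction via a final result tool" amounts to can be sketched like this (this is not the real Pydantic AI API): the model is asked to call a tool whose arguments match the Person schema, and the framework validates and constructs the object from that JSON. The fields `name` and `dob` come up in the talk; `city` is an assumption for the third field.

```python
import json
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    dob: str
    city: str

def final_result(tool_args_json: str) -> Person:
    # the model's "tool call" arrives as JSON arguments; validate and build
    args = json.loads(tool_args_json)
    missing = {"name", "dob", "city"} - args.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Person(**args)
```

In the real library, Pydantic's own validation does this work, including nested models and rich error messages.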
I've added a field validator to my Person model which says the date of birth needs to be before 1900. And obviously the actual input here is ambiguous: it doesn't define which century we're talking about, so the model will, for the most part, assume 87 means 1987. It will then get a validation error when we do the validation, and that's where the agentic bit kicks in: we take those validation errors and return them to the model, basically saying "please try again", as I'll show you in a moment, and the model is then able to use the information from the validation error to try again. Obviously, if you were doing this in production, you would add a docstring to the dob field saying it must be in the 19th century. But there are definitely cases where models, even the smartest models, don't pass validation, and being able to use this trick of returning validation errors to the model is a very effective way of fixing a lot of the simplest cases. So, if we run this, you see we had two calls to Gemini here.
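That retry trick can be sketched as follows, with a stub standing in for Gemini; the validator, the stub model, and the message format are all invented for illustration, but the shape is the one described above: a failed validation is sent back to the model with "please fix the error and try again".

```python
def validate_dob(year: int) -> int:
    # stand-in for the Pydantic field validator from the talk
    if year >= 1900:
        raise ValueError(f"date of birth must be before 1900, got {year}")
    return year

def fake_model(messages: list) -> int:
    # first attempt: assume "87" means 1987; once it has seen the
    # validation error, it corrects to 1887
    if any("before 1900" in m for m in messages):
        return 1887
    return 1987

def run_with_retries(prompt: str, max_retries: int = 2) -> int:
    messages = [prompt]
    for _ in range(max_retries + 1):
        answer = fake_model(messages)
        try:
            return validate_dob(answer)
        except ValueError as e:
            # feed the validation error back to the model and retry
            messages.append(f"{e}. Please fix the error and try again.")
    raise RuntimeError("model never produced valid output")
```

Note the bounded retry count: unlike the slide's while loop, this one cannot spin forever on a model that never passes validation.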
The other thing you'll see in this example is that we instrumented this code with Logfire, our observability platform, so we can actually go in and see exactly what happened. You'll see our agent run had two calls to the model, in this case Gemini Flash. And if we go and look at the exchange, you can see what happened here.
Let me just try and make it big enough that you can see it. First of all, we had the user prompt, the description. The model called the final result tool, as you might expect, with a date of birth of 1987. The tool response was the validation error, incorrect, and we add on the end "please fix the error and try again". And sure enough, it was then able to correctly call the final result tool with the right date of birth and succeed. Cool, I've got five minutes; let's see how fast I can go. Am I on the wrong window? I am. Here we are. I think the other thing that's
worth saying here, even if I don't have much time, is type safety, if you take a look at this example. The way we're doing this under the hood, Agent is generic in the output type, in this case Person. So when we access result.output, in typing terms it's an instance of Person, and at runtime it's guaranteed, by the Pydantic validation, that it really will be an instance of Person. So if I access name, all will be well; if I access first_name, we get the nice error from typing saying this is an incorrect field. That's the very beginning of the value of our typing support; we go a lot further. Some of you might have noticed there's a second generic parameter on Agent, which is the deps type. If you register tools with this agent, you can have type-safe dependencies in your tools, which I will show you in a moment.
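A rough sketch of how those two generic parameters fit together. This is not Pydantic AI's real implementation, just plain `typing.Generic` showing why a checker like mypy or pyright can verify both `result.output` and `ctx.deps`; all names here are invented.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

DepsT = TypeVar("DepsT")
OutputT = TypeVar("OutputT")

@dataclass
class RunContext(Generic[DepsT]):
    deps: DepsT  # typed dependencies passed into every tool

@dataclass
class RunResult(Generic[OutputT]):
    output: OutputT  # typed final output of the run

class Agent(Generic[DepsT, OutputT]):
    def __init__(self, output_type: type[OutputT]):
        self.output_type = output_type
        self.tools: dict[str, Callable[..., str]] = {}

    def tool(self, func: Callable[..., str]) -> Callable[..., str]:
        # tools take RunContext[DepsT] first, so a type checker verifies
        # that ctx.deps matches the agent's declared deps type
        self.tools[func.__name__] = func
        return func

    def run(self, prompt: str, deps: DepsT) -> RunResult[OutputT]:
        ctx = RunContext(deps=deps)
        # stand-in for the real model loop: just call one tool
        note = self.tools["record_memory"](ctx, prompt)
        return RunResult(output=self.output_type(note))

@dataclass
class Deps:
    user_id: int

agent = Agent[Deps, str](str)

@agent.tool
def record_memory(ctx: RunContext[Deps], value: str) -> str:
    # ctx.deps.user_id type-checks as int; a typo like ctx.deps.missing
    # would be flagged statically
    return f"recorded for user {ctx.deps.user_id}: {value}"

result = agent.run("my name is Samuel", deps=Deps(user_id=1))
```

Passing `deps=Deps(user_id=1)` satisfies the `DepsT` parameter; passing, say, an `int` there would be a static type error, which is the guarantee being described.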
The other thing you will notice is missing from this example is any tools, so let's look at an example with tools. If I open this example here, we have an example of memory, long-term memory in particular, where we're using one tool to record memories and another tool to retrieve them. You'll see we have these two tools here, record memory and retrieve memory; tools are set up by registering them with the agent's tool decorator. But this is where
the typing, as I say, gets more complex. You will see that we've set the deps type when we've defined the agent, so our agent is now generic in that deps type; the output type is str, because that's the default. So when we apply the tool decorator, we have to annotate the first argument as a RunContext parameterized with our deps type, and so when we access ctx.deps, that is an instance of the deps dataclass you see there, and if we access one of its attributes, we get the actual type. If we change this to int, say, we suddenly get an error saying we've used the wrong type. So we get a guarantee that the type here matches the type here matches the attributes you can access here; and when we come to run the agent, we need our deps to be an instance of that deps type. Again, if we gave it the wrong type, we would get a typing error saying we're using the wrong type. As far as I know, we're the only agent framework that works this hard to be type safe. It is quite a lot of work on our side, and, I'll be honest, a little bit of work on your side as well, in that it's not necessarily trivial to set up, but it makes it incredibly easy to go and refactor your code. And yeah, we run this here and... I'm pretty sure I don't have Postgres running.
Do I have Docker running? I don't know if I have time to make that work. That is Docker running, so I'll just try and run this very quickly. docker run... Hopefully that is enough. If I now come and run this example, what you will see is... it successfully failed. Great. I will try one more time and see if I get lucky; I don't know quite what was going on there. Ah. Well, we can look in Logfire and see what happened to make it fail. I promise you I hadn't set that up to fail the first time to demonstrate the value of observability, but maybe it can help here. So if you look at this
first run, you'll see that we used the record memory tool, with "user's name is Samuel", and then it returned finished. And then the second time, you can see that when it called the retrieve memory tool, the argument it gave was "your name", which is not contained within what was recorded the previous time. We're just doing a very simple LIKE query here, so "your name" is not a substring of "user's name is Samuel".
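That naive retrieval can be sketched in a few lines (names invented, not the demo's actual code): record appends to a store, and retrieve does a plain substring match, which is exactly why the query "your name" misses "user's name is Samuel" while the query "name" hits it.

```python
memories: list[str] = []

def record_memory(value: str) -> str:
    memories.append(value)
    return "memory recorded"

def retrieve_memory(query: str) -> list[str]:
    # naive LIKE-%query%-style substring matching
    return [m for m in memories if query in m]

record_memory("user's name is Samuel")
retrieve_memory("your name")  # no match: "your name" is not a substring
retrieve_memory("name")       # matches "user's name is Samuel"
```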
And so that's why it failed that time. So this has turned into a very useful example of where Logfire can help. And if we look at that second run, you'll see "user's name is Samuel", and then, when it ran the agent, it just asked for "name". "name" is obviously a substring of "user's name is Samuel", so it got the response "user's name is Samuel" and therefore succeeded. The other thing
we get here is, obviously, the tracing information, so we can see how long each of those calls took, and we also get pricing, both aggregated across the whole trace and on individual spans. I am told that I am running out of time. So, thank you very much.