From Stateless Nightmares to Durable Agents — Samuel Colvin, Pydantic

Channel: aiDotEngineer

Published at: 2025-11-24

YouTube video id: flf_IKnFYnE

Source: https://www.youtube.com/watch?v=flf_IKnFYnE

Hi, I'm Samuel from Pydantic, and today I'm going to give a demo of Pydantic AI with Temporal and Pydantic Logfire. I'll also cover Pydantic Evals.

Pydantic AI has support for Temporal and DBOS, two durable execution frameworks, and we're actually adding more: we've had something like five pull requests to add other durable execution or workflow orchestration backends. But at the moment it's these two. I think it's fair to say Temporal are the big incumbent in this space, and they're kind of the leaders in how you do this.

To demonstrate: a simple example, where I ask an LLM a question and it replies, mostly just works, and we don't need the durable execution component. But once we get into longer-running workflows, that's where it really becomes a problem, in particular where we've done enough compute that we don't want to lose it, or we've spent enough time on that compute that we really don't want to make the user start again. I think it's public that OpenAI use Temporal for their deep research, for example, and I think some of the other LLM deep research products do the same thing. So I'll start with a kind of toy example and then move on to a more deep-research-type example, in fact a deep research example.

But before we get into that, let me run this example without durable execution. This is two agents that play 20 questions. Instead of just a yes/no answer, they get to give a bit more detail: yes, kind of, not really, no, completely wrong. That was because I was getting bored of waiting for them to take ages to succeed with just yes/no. We have an answer agent, which runs a relatively small model, Haiku 3.5 — because I didn't know 4.5 came out until about an hour ago. It basically takes a question and answers with one of those options, and the secret object you're looking for, which in this example is a potato, gets added into its context. Then we have the questioner agent, or player agent, which has a bit more context on what it's going to do. You don't need to read through all of this stuff. This code is public now — it's a pull request, but I'll merge it afterwards — but you should get the idea.
The way the questioner agent gets to ask its questions is by calling a tool, ask_question. Inside that tool we run the other agent, the answer agent, to decide the answer to the question, and then we respond. It takes a little bit of time to run. You can see in this case it succeeded pretty quickly. Sometimes it's amazing how, even on these very simple questions, even very intelligent LLMs get themselves completely confused and go down some weird track. But you can see in the last run that it asked a bunch of questions, got down to "is this a fruit or vegetable?", "is it a fruit?" — no, so it knew it was a vegetable — "is it orange?", and it worked out the answer was potato. Here it is running again; I don't know how many steps it's going to take. Sometimes it can take up to 50 steps to get this question right. And obviously the problem is: if this dies, either because we have some unreliable endpoint within our system, or because we're running in the cloud and Kubernetes decides it wants to scale, or whatever it might be, then if we run it again we obviously have to start from scratch.
That is problematic in this case, but you can imagine that as the tasks get longer and longer, just restarting gets more and more problematic. The other thing to say about this 20 questions example is that although it's pretty simple to understand and feels like a toy, it is actually directly equivalent to a deep research case, where effectively the agent is going off on a quest to find an answer to a question, and it needs to ask the troll at the bottom of the garden the question to the next riddle to get to the next endpoint. Deep research is effectively this 20 questions, just with web search or RAG or whatever as your intermediate steps.

So let's turn that 20 questions into a durable agent. This is mostly the same code; for simplicity I've actually copied it here. You see we have our answer agent, and we need to wrap it in this TemporalAgent, which gives us another thing that behaves like a Pydantic AI agent — it's also a subclass of AbstractAgent. We do the same with our questioner agent. To keep things simple, we aren't doing the same stuff about passing around the answer in context; we've just hardcoded the answer into the system prompt for the answer agent. But apart from adding the Temporal wrappers, you can, as I'll show later, just apply durable execution later.

Here's where the Temporal bit comes in. I'm not a salesperson for Temporal, and although the underlying stuff they do is amazing, I do think some of their Python abstractions are kind of ugly. But the principle of Temporal is that you have workflows and activities. Workflows need to be entirely deterministic, and activities then do anything that is non-deterministic — IO in particular. So you can basically do anything inside a workflow other than IO and calling random, and if that's the case, you have a deterministic system. You can think of what Temporal is doing in the background, as it runs that workflow, as recording every activity that runs, along with its inputs and outputs. So if you want to rerun the workflow from the beginning to a certain point, it can basically plug in those answers.
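That record-and-replay idea can be sketched in a few lines of plain Python. This is a toy illustration of the principle, not how Temporal actually stores its event history: every activity call goes through a journal; on replay, results come back from the journal instead of re-executing.

```python
import random

class Journal:
    """Toy event history: records each activity result in order,
    and on replay serves the recorded result instead of re-running."""
    def __init__(self):
        self.history = []   # recorded activity results, in order
        self.cursor = 0     # how far we've replayed
        self.executed = 0   # how many activities actually ran

    def activity(self, fn, *args):
        if self.cursor < len(self.history):
            result = self.history[self.cursor]  # replay: plug in the recorded answer
        else:
            result = fn(*args)                  # first time: really execute
            self.history.append(result)
            self.executed += 1
        self.cursor += 1
        return result

    def replay(self):
        self.cursor = 0  # rerun the workflow from the beginning

def flaky_answer(question):
    # Stand-in for an LLM call: non-deterministic IO lives in activities.
    return f"{question} -> {random.choice(['yes', 'no', 'not really'])}"

def workflow(journal):
    # The workflow itself is deterministic: no IO, no random —
    # all of that goes through journal.activity().
    return [journal.activity(flaky_answer, q) for q in ("Is it alive?", "Is it food?")]

journal = Journal()
first = workflow(journal)
journal.replay()
second = workflow(journal)       # instant: answers come from the journal
assert first == second
assert journal.executed == 2     # the activities only really ran once
```

Because the workflow body is deterministic, replaying it with the journaled results reconstructs exactly the state it was in before a crash — which is the resume behaviour shown later in the demo.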
I'll show what that looks like. This is how we define our workflow. The activities here are implicit: the point is that the TemporalAgent takes care of turning all of the IO you need to do to call an LLM into activities in the background, including tool calls. OpenAI claim to have Temporal support, but they don't support tool calls as activities, which to me makes it slightly a chocolate teapot — there's actually no point in having any of this without tool calling, or little point. We define our workflow like this; for the most part you can just copy-paste their definitions of how to do it.

Here we have our play mechanism. The point is we're going to connect to the Temporal server, which is what records the state of our task — our agent — as it executes, and makes it possible to resume. I have Temporal running locally here; this is just the open source version of Temporal, which runs as a separate process, and I can restart it to kill the state. That's why we're connecting to localhost. In production you'd use Temporal's cloud — that's why they make so much money. And here we run the worker. This is where we actually kick off our workflow. In general, we just kick off our workflow with execute_workflow: we pass in the workflow we want to run and we pass the inputs — there aren't any inputs in this case, because we just start. It will then go and run, and if I run this, you'll see it start to execute.
You'll see it running. The only things to note are a couple of log messages. Ah — and we immediately have this exception, "broken". That's because, to simulate an unreliable system, inside the tool I added a 20% chance that it breaks. What you'll see is that Temporal has immediately taken care of continuing after that: even though this broke, it will continue to run. I may have set 20% too high, because it's now failing all the time, but it will actually continue, deal with those runtime errors, and carry on operating absolutely fine. However, I think I dialed that 20% up too high just before, so let's see if this is going to continue to operate.

Obviously, when you give a demo, everything suddenly grinds to a halt. Someone recently said they hate demos where everything goes to plan, and I said you'll never need to worry about that with me. I don't know why that has ground to a complete halt — I don't know whether it's just repeatedly failing.
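The behaviour being demonstrated — a tool that fails a fraction of the time, with the failure absorbed and the call retried — is easy to sketch without Temporal. This is a toy stand-in for a real retry policy, which would also add backoff, timeouts, and configurable error types:

```python
import random

random.seed(0)  # make the sketch deterministic

attempts = 0

def unreliable_tool():
    """Simulated flaky endpoint: breaks 20% of the time."""
    global attempts
    attempts += 1
    if random.random() < 0.2:
        raise RuntimeError("broken")
    return "answer"

def with_retries(fn, max_attempts=5):
    """Minimal retry loop, standing in for a durable-execution
    framework's per-activity retry policy."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts:
                raise

result = with_retries(unreliable_tool)
print(result)  # "answer" — any failures along the way were retried
```

As the talk says, you could implement this yourself; the value of getting it from the framework is that the retries are recorded in the same durable history as everything else.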
Let me kill the Temporal server here and restart it, so that we don't have the state stored, and I'll clear this and run it again. You should now see it succeeding most of the time and failing 10% of the time. So yeah, you see it now asking questions and occasionally breaking — good timing — but continuing. That's one of the things Temporal does: the retry logic, which you could implement without Temporal, but they do it very nicely. But there are more powerful things. Let's say this process that's in the middle of running gets killed by Kubernetes. So we go across here and we just kill it.
The process gets killed. Now, what I didn't show you is that I also instrumented this with Logfire. So if we look at our workflow, we can see exactly what was going on. We have the different calls to Claude, and inside each of those we're running the activity, which is then running the other agent. In particular, if we go to the top-level start for the workflow, I can take the workflow ID. If we come back over to the code, you'll see I had some code in here to let me continue a workflow with a given resume ID. For the most part you wouldn't have to do this; it's just for the sake of the demo. If I just reran the script, it would kick off this workflow again and run the two in parallel, which would look really confusing. So instead I'm specifically hanging on a particular workflow to finish, so you can see what's going on.

So if I run my script again, but give it the workflow that was ongoing — now you see it's already whizzed forward to question six and it's continuing to operate. We've got it to resume without having to add any resume code anywhere in our actual agent code. We just set up Temporal and it works. And you can see exactly what happened if you look at Logfire: that whole first bunch of calls to the LLM responded in something like 5 milliseconds. They were not actually sent to the LLM; Temporal just returned the cached result it already had for each of those cases. So we're able to effectively zoom forward to the point where it then continues to call the LLM. It's as if you'd gone through everywhere you do IO and set up caching on each individual call. I see some people nodding, which is making me feel a bit better about explaining this. We don't have to do the inference, we don't have to wait; we can run our workflow code — which is generally very fast, because there's no IO, it's just procedural — and it keeps getting results instantly until it reaches the point where it needs to continue.

You can see in this case it's got itself completely confused and it's off wondering whether this thing is a salad bowl. Sometimes the LLM does well, sometimes it does terribly. I'll leave that running — you can see it knows it's related to food, but it's got itself really confused.

As it might interest you: just before this, I was wondering how the different models would perform, so I ran some evals with Pydantic Evals on these cases. You can see here — it's a bit hard to read on the screen — GPT-4.1, Gemini, and Claude Sonnet 4.5, the different assertions for each case, whether they passed or failed, and the average cost. Gemini was way cheaper and way faster. And if we look at an individual case, we should have a metric for how many steps it took to succeed — I think we have to scroll over. Yeah: question count.
You can see here Gemini was way quicker each time. I discovered subsequently, having checked the results, that the reason Gemini is way faster and answers much more quickly is that it just invents an answer that's wrong, and I wasn't checking it. So this is not perfect yet, but it's definitely an interesting case for evals — for working out which model is actually better, because they're definitely not particularly good at this by default. In my naive setup Gemini did way better, but that's not representative.
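The gotcha described here — a model "winning" on speed because nobody asserted on its final answer — is exactly what an assertion in an eval case is for. A minimal, generic harness (plain Python, not the Pydantic Evals API; the model names and stand-in runs are hypothetical) might look like:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    steps: int       # metric: how many questions it took
    answer: str
    correct: bool    # assertion: did it actually get the secret?

def run_eval(model: str, play, secret: str) -> EvalResult:
    """Run one 20-questions game and assert on the final answer,
    not just on how quickly the game terminated."""
    steps, answer = play()
    return EvalResult(model, steps, answer, answer == secret)

# Stand-ins for real model runs: fast-but-wrong vs slow-but-right.
fast_but_wrong = lambda: (3, "carrot")
slow_but_right = lambda: (12, "potato")

results = [
    run_eval("model-a", fast_but_wrong, "potato"),
    run_eval("model-b", slow_but_right, "potato"),
]
# Fewest steps is meaningless unless `correct` is also checked.
best = min((r for r in results if r.correct), key=lambda r: r.steps)
print(best.model)  # → model-b
```

Without the `correct` check, model-a would win on the step-count metric alone — the same trap the Gemini runs fell into.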
Anyway, I'm going to leave that, because it's 46 steps in and still failing to work out that the thing's a potato. I can show the evals case if you want, but I think it might be more interesting to look at a deep research case, which is a more meaningful example of where you'd run durable execution — and it also does stuff in parallel, which is another thing that just works out of the box with Temporal without you having to write any code.

So, this is my very quick, last-night attempt at building deep research. I honestly think it's as good as lots of the actual deep research systems. We define our plan for deep research; this is effectively our deep research plan. It has an executive summary — what you'd effectively pump out to the user about what it's going to do — then a list of web search steps (we have a maximum of five here so it doesn't take forever), and then analysis instructions.

I think this points at one of the big changes in AI this year, answering a bit the question I had on the other Zoom. There are three definitions of agents: the AI definition, which is LLMs calling tools in a loop; the tech definition, which is a microservice; and the business definition, which is something that can replace a human. Ignoring the business definition for a minute, and thinking about the AI and engineering definitions: at the beginning of this year we thought you'd have one AI agent — one LLM calling tools in a loop — within each microservice. I think we've moved more and more to thinking that agents are actually the quantum of development. They are the micro-tasks that you build up to form what most people would think of as an agent: something that actually goes and autonomously completes a task.

And so our deep research agent is actually made up of multiple agents. We have this plan agent, which goes off with a prompt and, via structured data extraction, returns an instance of this Pydantic model — your plan to run. Then there's the search agent, which has access to a search tool (I'll show Tavily being used in the other case), using a faster model, Gemini Flash, here. And then I'm using Claude Sonnet 4.5 for the final analysis stage. I suppose this is a bit of what people are talking about when they lean towards graphs. I haven't built this as a graph, although I could, because it doesn't need one — it's not complex enough. And durable execution is a better way of getting snapshotting, with much more granular support for durable execution. We added a tool that lets the analysis agent do a bit more web search if it really wants to; I don't think it uses it.

But this is the actual deep research code, and you can see how concise it is. We run the plan agent and get back our plan. We run all of the search agents in parallel — we're just using a task group from Python to run them. We get those results, which are the text results of the different bits of search, and we use format_as_xml to basically smash all of that into a massive lump of reasonably readable data for the analysis agent.
Then we go off and run the analysis agent, with the kind of question I'm asking AI relatively regularly for sales: find me a list of hedge funds that write Python in London. If I run this — uv run deep research — we'll see it starting to churn away. We can see it in the terminal with Logfire, but we can also come over to Logfire itself. Let me clear that, go to the bottom, and we can see this run as it's going on. It ran the plan step in nine seconds, and you can see all of the search steps going on in parallel. Once they've finished — you can see the analysis agent has just started. We can look at the individual searches, so you get a pretty good idea of what happened: the question it got asked, the queries it decided to run, a bunch of data from Medium and other sites, structured data — and the agent also bangs in quite a lot of context. This is each individual one of our roughly 10 parallel searches. Now the analysis is going to run with all of that input. You can see we've spent 8 cents so far on this particular run; we'll see what it gets to by the time it finishes. But obviously the problem with this is: if I kill it now, it just dies, and I'd have to restart from the beginning if I wanted to run it again.

So, while that churns away, let me introduce the durable execution example — spoiler, it's going to be pretty similar. I discovered last night that there's a bug with the Vertex SDK that means you can't use it with Temporal right now. I think we should fix that, or at least I'll be whinging at DeepMind today to go and fix it. So I've switched it out to OpenAI responses here, and I'm using Tavily instead of the built-in search. Other than that, this is all pretty similar code — I could probably have imported it from the other module, but I decided to duplicate it to keep things easy. You see again we do the same thing: we wrap our agents in TemporalAgent. This analysis agent can take more than the default activity duration, because it's basically a long build-up of a summary — I think it was taking longer than the two-minute default — so I just gave it an hour, so it's not going to fail.

For me the most powerful bit is that everything inside my workflow looks exactly the same. I don't have to do anything crazy for parallelism: I just use a task group exactly the same; I could use asyncio.gather. It's all just imperative Python code as you'd be used to. If I wanted to run this periodically, I could sleep for seven days in here, and Temporal would take care of pausing everything. Again, I'm not here to be a Temporal salesperson — I don't love everything about what they do — but it's a pretty powerful way of thinking about code: we don't have to do all the infra stuff. Then ultimately, again, I smash all my context into the last agent and run it. There's a bit of plug-in stuff: I have to add some plugins and add the agents as plugins. But again, my code to actually kick it off is just execute_workflow. Simple as that.
And I asked it a slightly more controversial question here: what's the best Python agent framework to use for durable execution and type safety? And we will pray it gives the right answer when we run it in front of everyone. If I kick that off and run this again, we should see it running in Logfire. You can see the stuff related to kicking off the agent — kicking off the workflow, excuse me — here, and we have the searches beginning to happen.

But the powerful bit is, again: imagine we're halfway through running all these searches, we're about to start the final step, and something comes along and kills the process. In general you'd have to completely restart the process and run your deep research all over again. With Temporal, it will just rerun that workflow automatically. In this case I'm restarting it and just running that one workflow, but in general it would automatically be restarted, and the next time Kubernetes comes up, the workflow will run as it would have done before, but it will get answers to each individual question basically instantly.

And so — if it's not going to fail for me, which it seems to be — there we are, it started again. You see the plan took 24 milliseconds, and the searches all took no time at all in the grand scheme of things, because they just got their results back from Temporal immediately. The analysis was the task we hadn't run yet, so obviously that needs to start again — it's an activity, and activities have to run again from scratch. Once that finishes — I think it does take quite a long time. Maybe I can show the previous output. Or did we not get to displaying the previous output? Did it ironically actually fail the time before? But hopefully once this finishes, we should be able to see its analysis, which I think is on a par with what I see from the other deep research tools. Obviously there's some UI work to do to display this in a nice deep research interface.

There we are — it's completed, and its primary recommendation is Pydantic AI with Temporal. So it did what I hoped it would do. You see it's given a reasonable report of the relative trade-offs of the other, inferior, agent frameworks, and it should have done an executive summary at the beginning, with links. Yeah — so it said Pydantic AI; LangGraph, obviously, if you love snapshotting or writing type-unsafe code; or Temporal on its own, which makes sense. So there's the summary.

That is the main stuff I had to show. I will merge the durable execution stuff in here. One other thing I want to say quickly: I can't work out how to post a comment, but you'll find it on Pydantic if you look for it. Oh — I can do that, if anyone wants to take a picture of that QR code. The other thing I wanted to mention: we're about to announce Pydantic AI Gateway, so if anyone wants to try it early, let us know. That platform will be landing soon. You'll be able to use the gateway directly to buy inference from any of the big models, or most of the open source models, with self-hosting for enterprise and all the observability stuff. I'll save you the full spiel, but that's coming soon, and I think some of you will find it interesting.

That's it. Thanks so much for watching. If you want to learn more about Pydantic AI, Pydantic AI Gateway, or Pydantic Logfire, please scan these QR codes. If you have any feedback, please come and talk to us. Thanks so much for listening.