From Stateless Nightmares to Durable Agents — Samuel Colvin, Pydantic
Channel: aiDotEngineer
Published at: 2025-11-24
YouTube video id: flf_IKnFYnE
Source: https://www.youtube.com/watch?v=flf_IKnFYnE
Hi, I'm Samuel from Pydantic, and today I'm going to give a demo of Pydantic AI with Temporal and Pydantic Logfire. I'll also cover Pydantic Evals. In Pydantic AI we have support for Temporal and DBOS, two durable execution frameworks. We're actually adding a bunch more — I think we've had something like five pull requests to add other durable execution or workflow orchestration backends — but at the moment it's these two. I think it's fair to say Temporal are the big incumbent in this space, and they're kind of the leaders in how you do this. To demonstrate: for a simple example, like I ask an LLM a question and it replies, it mostly just works and we don't need the durable execution component. But once we get into longer-running workflows, that's where it really becomes a problem — in particular, where we've done enough compute that we don't want to lose it, or we've spent enough time on that compute that we really don't want to have to start again for the user. That's why, for example — I think it's public that OpenAI use Temporal for their deep research, and I think some of the other LLM deep research products do the same thing. So I'll start with a kind of toy example and then move on to a more deep-research-type example — in fact, a deep research example. But before we get into that, let me run this example without durable execution. This is two agents that play 20 questions. Instead of just a yes/no answer, they get to give a little more detail, like "yes", "kind of", "not really", "no", "completely wrong" — that was because I was getting bored of waiting for them to take ages to succeed with just the yes/no. So we have an answer agent which runs a relatively small model, Haiku 3.5 — because I didn't know 4.5 came out until about an hour ago.
This basically takes a question and answers with one of those answers, and it gets the secret object you're looking for — in this example, a potato — added into its context. Then we have the questioner agent, or the player agent, which has a bit more context on what it's going to go and do. You don't need to read through all of this; this code is public now — it's a pull request, which I'll merge afterwards — but you should get the idea. The way the questioner agent gets to ask its questions is by calling a tool, ask a question. Inside that tool, we run the other agent, the answer agent, to decide the answer to the question, and then we respond. It takes a little bit of time to run. You can see in this case it succeeded pretty quickly. Sometimes it's amazing how, even on these very simple questions, even very intelligent LLMs get themselves completely confused and go down some weird track. But you can see in the last run that it asked a bunch of questions, got down to "is this a fruit or vegetable?", "is it a fruit?" — no, so it knew it was a vegetable — "is it orange?", and it worked out the answer was potato. And here it is running again; I don't know how many steps it's going to take — sometimes it can take up to 50 steps to get this right. Obviously the problem is: if this dies, either because we have some unreliable endpoint within our system or because we're running in the cloud and Kubernetes decides it wants to scale, or whatever it might be, and we run it again, we have to start from scratch. That is problematic in this case, but you can imagine that as the tasks get longer and longer, just restarting gets more and more problematic.
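The two-agent loop described above can be sketched in plain Python. This is only a shape-of-the-thing sketch: the "agents" here are canned functions with a fixed script, whereas the real code runs two Pydantic AI agents and the questioner asks via a tool call — all names (`answer_agent`, `questioner_agent`, `play`) are my own, not the repo's.

```python
# Minimal sketch of the two-agent 20-questions loop. The real code uses
# Pydantic AI agents and an ask-a-question tool; here the "agents" are
# plain functions with canned logic so the control flow is visible.

SECRET = "potato"

def answer_agent(question: str) -> str:
    """Stands in for the small answer model (Haiku in the talk).
    It has the secret object in its context and grades the question."""
    q = question.lower()
    if "vegetable" in q:
        return "yes"
    if "orange" in q:
        return "not really"
    if SECRET in q:
        return "yes"  # the questioner guessed it
    return "no"

def questioner_agent(history: list[tuple[str, str]]) -> str:
    """Stands in for the player model: asks the next question given the
    transcript so far. Here it just follows a fixed script."""
    script = ["is it a vegetable?", "is it orange?", "is it a potato?"]
    return script[len(history)]

def play(max_steps: int = 20) -> int:
    """Run the loop until the secret is guessed; return the step count."""
    history: list[tuple[str, str]] = []
    for step in range(1, max_steps + 1):
        question = questioner_agent(history)
        # In the real code this is the ask-a-question tool, which runs
        # the answer agent inside the tool body.
        answer = answer_agent(question)
        history.append((question, answer))
        if SECRET in question.lower() and answer == "yes":
            return step
    raise RuntimeError("ran out of steps")
```

With the scripted questioner this converges in three steps; with real models, as the talk says, it can wander for up to 50.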
I think the other thing to say about this 20 questions example: although it's pretty simple to understand and feels like a toy, it's actually directly equivalent to a deep research case, where the agent is effectively going off on a quest to find the answer to a question, and it needs to ask the troll at the bottom of the garden the next riddle to get to the next endpoint, right? Deep research is effectively this 20 questions game, just with web search or RAG or whatever it might be as your intermediate steps. So let's turn that 20 questions into a durable agent. This is mostly the same code — for simplicity I've actually copied it here — but you see we have our answer agent, and we need to wrap it in this temporal agent, which gives us another thing that behaves like a Pydantic AI agent; it's also a subclass of abstract agent. We do the same with our questioner agent. To keep things simple, we aren't doing the same stuff about passing the answer around in context; we've just hardcoded the answer into the system prompt for the answer agent. But apart from adding these Temporal wrappers — and as I'll show later, you can just apply durable execution later — here's where the Temporal bit comes in. I'm not a salesperson for Temporal, and although the underlying stuff they do is amazing, I do think some of their Python abstractions are kind of ugly. But anyway, the principle of Temporal is that you have workflows and activities. Workflows need to be entirely deterministic, and activities then do anything that is non-deterministic — IO in particular. So you can do basically anything inside a workflow other than IO and calling random, and if that's the case, you have a deterministic system.
What you can think of Temporal as doing in the background is: as it's running that workflow, it's recording every activity that runs, with both its inputs and its outputs. So if you want to rerun the workflow from the beginning to a certain point, it can basically plug in those answers — and I'll show what that looks like. This is how we define our workflow. The activities here are implicit: the point is that this temporal agent takes care of turning all of the IO you need to do to call an LLM into activities in the background, including tool calls. OpenAI claim to have Temporal support, but they don't support tool calls as activities, which to me makes it slightly a chocolate teapot — there's actually no point in having any of this without tool calling, or little point. But we define our workflow like this; I think you can for the most part just copy-paste their definitions of how to do it. Here we have our play mechanism. The point is that we're going to connect to the Temporal server, which is what records the state of our task — our agent — as it executes, and is able to resume and so on. I have Temporal running locally here; this is just the open source version of Temporal, which runs as a separate process, and I can restart it to kill the state — that's why we're connecting to localhost. In production you'd use Temporal's cloud; that's why they make so much money. And here we run the worker. This is where we actually kick off our workflow: in general, we just kick it off with execute workflow, passing the workflow that we want to run and its inputs — there aren't any inputs in this case, because we just start. It will then go and run, and if I run this, you will see it start to execute.
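The record-and-replay idea behind this can be illustrated without Temporal at all. The toy runner below is not Temporal's API — everything here (`DurableRunner`, `run_activity`, `replay`) is invented for illustration — but it shows the mechanism: every activity's inputs and outputs go into an event history, and on a re-run the recorded results are returned instantly, so a deterministic workflow fast-forwards to wherever it died.

```python
# Toy illustration of the record-and-replay idea behind durable execution.
# This is NOT Temporal's API — just the mechanism in miniature: activity
# results are recorded in an event history; on replay, recorded results
# come back instantly instead of the work being redone.

class DurableRunner:
    def __init__(self) -> None:
        self.history: list = []   # persisted event history: (inputs, output) pairs
        self._cursor = 0

    def run_activity(self, fn, *args):
        """Run a (slow, non-deterministic) activity — or replay its result."""
        if self._cursor < len(self.history):
            recorded_args, result = self.history[self._cursor]
            # This only works because the workflow is deterministic: the
            # same code path asks for the same activity with the same args.
            assert recorded_args == args, "workflow must be deterministic"
            self._cursor += 1
            return result           # replayed: no IO, effectively instant
        result = fn(*args)          # first execution: actually do the work
        self.history.append((args, result))
        self._cursor += 1
        return result

    def replay(self) -> None:
        """Restart the workflow from the top of the recorded history."""
        self._cursor = 0

# Stand-in for an LLM call; we count real invocations to show that a
# replayed run never re-executes it.
calls_made: list[str] = []

def llm_call(prompt: str) -> str:
    calls_made.append(prompt)
    return f"answer to {prompt}"
```

After two activities run and the process "restarts" (`replay()`), re-running the same workflow code returns both results from history without touching `llm_call` again — which is exactly the 5-millisecond "cached" responses shown in Logfire later in the talk.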
A couple of things to note — there are a couple of log messages. Ah, and we immediately have this exception: broken. That's because, to simulate an unreliable system, inside the tool I added a 20% chance that it breaks. What you'll see is that Temporal has immediately taken care of continuing after that. So even though this broke, it will continue to run. I may have set 20% too high, because it's now failing all the time, but it's actually going to continue, deal with those runtime errors, and just carry on operating absolutely fine. However, I think I dialed 20% up too high just before, so I'm going to see if this is going to continue to operate. Obviously, when you give a demo, everything suddenly grinds to a halt. Someone recently said they hate demos where everything goes to plan, and I said you'll never need to worry about that with me. I don't know why that has ground to a complete halt — whether it's just repeatedly failing — so I'm going to kill the Temporal server here and restart it so we don't have the state stored, clear this, and run it again. You should now see it succeeding most of the time and failing 10% of the time. So yeah, you see it now asking questions and occasionally breaking — good timing — but continuing. That's one of the things Temporal does: the retry logic that you could implement without Temporal, but they do it very nicely. But there are more powerful things. Let's say this process that's in the middle of running gets killed by Kubernetes. So we go across here and we just kill it — process gets killed. Now, what I didn't show you is that I also instrumented this with Logfire. So if we look at our workflow, we can see exactly what was going on here.
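The retry behaviour being demonstrated can be sketched in a few lines. Temporal gives you this via configurable retry policies on activities; the helper below (`flaky_tool`, `with_retries` — both hypothetical names, not from the talk's code) just shows the idea of a tool that fails some fraction of the time and a loop that keeps retrying it.

```python
# Sketch of the retry behaviour demoed above: a "tool" that fails some
# fraction of the time, wrapped in a simple retry loop. Temporal does this
# for you with retry policies (including backoff); this shows the idea only.
import random

def flaky_tool(rng: random.Random, failure_rate: float = 0.2) -> str:
    """Simulates the unreliable endpoint: fails failure_rate of the time."""
    if rng.random() < failure_rate:
        raise RuntimeError("broken")
    return "ok"

def with_retries(fn, *args, max_attempts: int = 10):
    """Retry fn until it succeeds; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args), attempt
        except RuntimeError:
            continue  # a real retry policy would back off between attempts
    raise RuntimeError("gave up")
```

The difference in the demo is that Temporal applies this transparently to every activity, so the agent code itself never mentions retries.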
So we have the different calls to Claude, and inside those we're running the activity, which is then running the other agent. In particular, if we come to the top-level start for the workflow, I can take the workflow ID. If we come back over to the code here, you'll see I had some code in here to allow me to continue a workflow given a resume ID. For the most part you wouldn't have to do this; it's just for the sake of the demo. If I just reran the script again, it would kick off this workflow again and run the two in parallel, which would look really confusing. So instead of doing that, I'm specifically waiting on a particular workflow to finish, so you can see what's going on. So if I run my script again but give it the workflow that was ongoing — now you see it's already whizzed forward to question six and it's continuing to operate. We've got it to resume without having to add any resume code anywhere in our actual agent code; we just set up Temporal and it works. And you can see exactly what's happened if you look at Logfire: that whole first bunch of calls to the LLM responded in like 5 milliseconds. These were not actually sent to the LLM — Temporal just returned the cached result it already had for each of them. So we're able to effectively zoom forward to the point where it then continues to call the LLM. It's as if you'd gone through everywhere you're doing IO and set up caching on each individual call so that you can rerun your code. I see some people nodding, which is making me feel a bit better about explaining this. But we don't have to do the inference, we don't have to wait the time; we can basically run our workflow code — which is generally very fast, because it's no IO, just procedural —
and it will just keep getting results instantly until it gets to the point where it needs to continue. And you see in this case it's got itself completely confused and it's off wondering whether this thing is a salad bowl. So you see how sometimes the LLM does well, sometimes it does terribly. I'm going to leave that running — you can see it knows it's related to food, but it's got itself really confused. I will just say, as it might interest you: just before this, I was wondering how the different models would perform, so I ran some evals with Pydantic Evals on these different cases. You can see here — it's a bit hard to read on the screen — GPT-4.1, Gemini, and Claude Sonnet 4.5, and you can see the different assertions for each case, whether they passed or failed, and the average cost. You can see Gemini was way, way cheaper and way, way faster. And if we look at an individual case, we have a metric for how many steps it took to succeed — I think we have to scroll over — yeah, question count. You can see here Gemini was way quicker each time. I discovered subsequently, having checked the results, that the reason Gemini is way faster and answers much more quickly is that it just invents an answer that's wrong, and I wasn't checking it. So this is not perfect yet, but it's definitely an interesting case for evals — working out which model is actually better, because they're definitely not particularly good at this by default. In my naive setup Gemini did way better, but that's not representative. Anyway, I'm going to leave that, because it's 46 steps in and it's still failing to work out that that thing's a potato.
I can show the evals case if you want, but I think it might be more interesting to look at a deep research case, which is a more meaningful example of where you would run durable execution — and also doing stuff in parallel, which is one of the things that just works out of the box with Temporal without you having to write any code. So, this is my very quick, last-night attempt at building deep research. I honestly think it's as good as lots of the actual deep research systems. We define our plan for deep research, and this is effectively our deep research plan: it has an executive summary — what you would pump out to the user about what I'm going to go and do — then a list of web search steps (we have a maximum of five here so it doesn't take forever), and then analysis instructions. And the point is — I think this is one of the big changes in AI this year, answering a bit of the question I had on the other Zoom — there are three definitions of agents. There's the AI definition, which is LLMs calling tools in a loop; there's the tech definition, which is a microservice; and there's the business definition, which is something that can replace a human. Ignoring the business definition for a minute: if you think about the AI and engineering definitions, we thought at the beginning of this year that you would have one AI agent — one LLM calling tools in a loop — within each microservice. I think we've moved more and more to thinking that agents are actually the quantum of development: they're the micro-tasks that you build up to form what most people would think of as an agent — something that actually goes and autonomously completes a task. And so our deep research agent is actually made up of multiple agents.
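The plan structure described — executive summary, at most five web search steps, analysis instructions — might look something like the stand-in below. The talk's code uses a Pydantic model for structured output from the plan agent; this plain dataclass (with field names that are my guesses, not the repo's) just shows the shape without depending on anything outside the standard library.

```python
# Stand-in for the research-plan structure described above. The real code
# is a Pydantic model returned by the plan agent via structured output;
# this dataclass mirrors the three parts named in the talk, with the
# five-search cap enforced by hand.
from dataclasses import dataclass, field

@dataclass
class ResearchPlan:
    executive_summary: str            # surfaced to the user: what we're about to do
    search_steps: list[str] = field(default_factory=list)  # searches to run in parallel
    analysis_instructions: str = ""   # guidance for the final analysis agent

    def __post_init__(self) -> None:
        # Cap the plan at five searches so a run doesn't take forever,
        # mirroring the maximum mentioned in the talk.
        if len(self.search_steps) > 5:
            raise ValueError("at most five search steps")
```

In the real version, the equivalent constraint would live in the Pydantic model itself, so the plan agent's structured output is validated as it comes back from the LLM.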
So we have this plan agent, which goes off with a prompt and — structured data extraction gives you an instance of this Pydantic model — returns your plan to run. Then you have the search agent, which has access to a search tool (I'll show it using Tavily in the other case), using a faster model, Gemini Flash, in this case; and then I'm using Claude Sonnet 4.5 for the final analysis stage. I suppose this is a bit of what people talk about when they're leaning towards graphs. I haven't built this as a graph, although I could, because it doesn't need a graph — it's not complex enough to need one. And durable execution is a way better way of getting snapshotting, with much more granular support. We added a tool that allowed the analysis agent to do a bit more web search if it really wanted to — I don't think it uses it — but this is the actual deep research code, and you can see how concise it is. We run the plan agent and get back our plan. We run all of the search agents in parallel, just using a task group from Python. We get those results, which are the text results of the different bits of search. We use format as XML to smash all of that into one big lump of reasonably readable data for the analysis agent. Then we go off and run the analysis agent. And we run the kind of question I'm asking AI relatively regularly for sales, which is: find me a list of hedge funds that write Python in London. If I go and run this — uv run deep research — we'll see it starting to churn away. We can see it in the terminal with Logfire, but we can also come over here to Logfire. Let me clear that, go to the bottom here, and we can see this run as it's going on: it's run the plan step in nine seconds, and you can see all of the search steps going on in parallel. Once they've finished, it will start.
You can see the analysis agent has just started. We can look at the individual searches, so you get a pretty good idea of what happened: the question it got asked, the queries it decided to run, a bunch of data from Medium and different sites, structured data — and the agent also bangs in quite a lot of context, right? This is each individual one of our ten parallel searches. And now the analysis is going to run with all of that input. You can see so far we've spent 8 cents on this particular run; we'll see what it gets to by the time it finishes. But obviously the problem with this is: if I kill it now, it's just going to die, and I'd have to restart from the beginning if I wanted to run it again. So while that churns away, let me start introducing the durable execution example — spoiler, it's going to be pretty similar. I discovered last night that there's a bug with the Vertex SDK that means you can't use it with Temporal right now. I think we should fix that — or at least I'll be whinging at DeepMind today to go and fix it. So I've switched it out to OpenAI Responses here, and I'm using Tavily instead of the built-in search. Other than that, this is all pretty similar code. I could probably have imported the code from the other module; I just decided to duplicate it to keep things easy. But you see again we do the same thing: we wrap our agents in temporal agent. This analysis one can take more than whatever the default activity duration is, because it's basically a long build-up of a summary — I think it was taking longer than the two-minute default — so I just gave it an hour so it's not going to fail. And then, for me, the most powerful bit: everything here inside my workflow looks exactly the same. I don't have to do any crazy stuff to do parallelism; I just use a task group exactly the same. I could use asyncio.gather.
It's all just imperative Python code as you'd be used to. If I wanted to run this periodically, I could sleep for seven days in here and Temporal would take care of pausing everything. Again, I'm not here to be a Temporal salesperson — I don't love everything about what they do — but it's a pretty powerful way of thinking about code: we don't have to do all the infra stuff. Then ultimately, again, we smash all the context into the last agent and run it. And there's a bit of plug-in stuff — I have to add some plugins to Logfire, add the agents as plugins — but again, my code to actually kick it off is just execute workflow. Simple as that. And I asked it here a slightly more controversial question: what's the best Python agent framework to use for durable execution and type safety? We will pray to God it gives the right answer when we run it in front of everyone. If I kick that off and run this again, we should see it running in Logfire over here. You can see we have the stuff related to kicking off the workflow, excuse me, here, and the searches beginning to happen. But the powerful bit here, again: imagine we're halfway through running all these searches, we're about to start the final step, and something comes along and kills the process. In general you'd have to completely restart the process and run your deep research all over again; with Temporal, it will just rerun that workflow automatically. In this case I'm restarting it and just running that one workflow, but in general it would automatically be restarted, and the next time Kubernetes comes up, the workflow will run as it would have done before — but it will get answers to each individual question basically instantly. And if it's not going to fail for me — which it seems to be — there we are, it's started again.
You see the plan took 24 milliseconds. The searches all took no time at all in the grand scheme of things, because they just got the result back from Temporal immediately. And then the analysis — that was the task we hadn't run yet. Obviously that needs to start again, because it's an activity, and activities have to run again from scratch. Once that finishes — I think it does take quite a long time — maybe I can show the previous output. Or did we not get to displaying the previous output? Did it ironically actually fail the time before? But hopefully once this finishes, we should be able to see its analysis, which I think is on a par with what I see from the other deep research tools. Obviously there's some UI work to do to display this in a nice deep research interface. There we are — it's completed, and its primary recommendation is Pydantic AI with Temporal. So it did what I hoped it would do. You see it's given a reasonable report here of the relative trade-offs of the other, inferior agent frameworks, and it should have done an executive summary at the beginning, with links. Yeah — so it said Pydantic AI; LangGraph, obviously, if you love snapshotting or writing type-unsafe code; Temporal on its own — which makes sense. So there's the summary. That's the main stuff I had to show. I will merge the durable execution stuff in here. One other thing I want to say quickly: I can't work out how to post a comment, but you'll find it on Pydantic if you look for it. Oh — I can do that, if anyone wants to take a picture of that QR code. The other thing I just wanted to mention: we're about to announce Pydantic AI Gateway. If anyone wants to try it early, let us know. That platform will be landing soon.
You'll be able to use Pydantic Gateway directly to buy inference from any of the big models or most of the open source models, with self-hosting for enterprise and all the observability stuff. But I'll save you the full spiel — that's coming soon, and I think some of you will find it interesting. That's it. Thanks so much for watching. If you want to learn more about Pydantic AI, Pydantic AI Gateway, or Pydantic Logfire, please scan these QR codes. If you have any feedback, please come and talk to us. Thanks so much for listening.