Playground in Prod: Optimising Agents in Production Environments — Samuel Colvin, Pydantic
Channel: aiDotEngineer
Published at: 2026-05-07
YouTube video id: A48uhxfxbsM
Source: https://www.youtube.com/watch?v=A48uhxfxbsM
I'm Samuel. I'm probably best known as the creator of Pydantic, the open source library. Now I run Pydantic, the company. We do a bunch of stuff: we still maintain Pydantic validation, of course; we have Pydantic AI, the agent framework; and we have Pydantic Logfire, our observability platform. How many people coming into this have heard the word Pydantic? Okay, that's reassuring. And how many people have at some point played with Pydantic AI and Logfire? Okay, most people.

Logfire is fundamentally, under the hood, a general observability platform: OpenTelemetry logs, metrics, and traces. I don't really believe in "AI observability"; I think it's a feature, not a category, and it will get eaten by either observability or AI at some point. But today we sell AI observability, because that is, understandably, the thing people want. So we go beyond the standard observability of logs, metrics, and traces. We do things like evals, which I'll talk about a bit today, and managed variables, one of the newer features. The step beyond managed variables, which we're working on at the moment, is optimizing your agent autonomously from the platform. That's not what I'm going to demonstrate today. I'm going to talk about GEPA and how you can use it with managed variables in a somewhat Heath Robinson way, until we have the full end-to-end, all-singing-all-dancing "you give us your money and we magically optimize your agent". It's probably also more interesting than the "magically we make your agent better" pitch.

Anyway, there are two core subjects I'm going to talk about, and then I'm going to try to put them together at the end. GEPA is a library. How many people here have heard of GEPA and feel like they understand what it does? Okay, that probably includes me, to an extent. The name comes from "genetic Pareto". It's a genetic algorithm in the sense that it looks for the best value of some kind, and once it's found a good value, it mixes it with other values that appear to be good, produces a new candidate, and tries to optimize that. So it's ultimately an optimization library. It optimizes a string. Now, that string can be a simple text prompt, or it can be some JSON data which ultimately contains whatever you want. The Pareto bit is that it takes candidates from the Pareto frontier of the best examples it has. Imagine you're breeding racehorses: you take the best racehorses and breed them each time. You don't, for the most part, go and take some really slow horse and add it into the mix to see what happens. You take the best horses, breed them, and hope to get better horses. GEPA does a similar thing.

The second subject is managed variables. Lots of platforms, including ours, have prompt management: the idea that you can keep your text prompt in the platform and edit it from there. We take prompt management one step further with managed variables. They don't have to be just text; they can be effectively any object you can define with a Pydantic model, managed inside Logfire. And we're going to try to combine those two things today.
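To make the "genetic Pareto" idea concrete, here is a toy sketch of that select-from-the-frontier-then-breed loop. This is not GEPA's actual API (GEPA tracks per-example scores, not one scalar, and its proposer is an LLM); all names here are made up for illustration:

```python
def optimize(initial: str, propose, score, budget: int = 40) -> str:
    """Toy genetic-Pareto loop: propose(parents) -> new candidate string,
    score(candidate) -> float (e.g. an eval run)."""
    # Frontier of (score, candidate) pairs, seeded with the initial prompt.
    frontier = [(score(initial), initial)]
    for _ in range(budget):
        # Breed from the best candidates, not random ones ("racehorses").
        parents = [c for _, c in sorted(frontier, reverse=True)[:2]]
        child = propose(parents)  # e.g. an LLM mixing components of the parents
        s = score(child)
        # Admit the child only if it's at least as good as the worst member.
        if s >= min(s0 for s0, _ in frontier):
            frontier.append((s, child))
            frontier = sorted(frontier, reverse=True)[:4]  # keep the frontier small
    return max(frontier)[1]
```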
I will say at the beginning: due to some family situations, I wrote most of this talk overnight, so, maybe in keeping with the rest of this conference, it may be slightly chaotic. If what I'm saying doesn't make sense, or I go to sleep at any point in this talk, please throw something at me and I will endeavor to make more sense.

I'm going to talk quickly about the subject we're going to optimize. It has to do with politics. Don't worry, it's not particularly controversial politics, relative to what's going on at the moment. There's a podcast called The Rest is Politics, which I'm a big fan of and a big listener to. Back in April last year, almost exactly a year ago in fact, they were having a conversation about how many politicians basically come from political dynasties, families of politicians. There was a lot of bluster about what the answer might be, and I used Pydantic AI to analyze the Wikipedia article for each MP, looking for references to relations who were also politicians, and was therefore able to come up with a percentage. I think it was 24% of MPs who have some kind of ancestor who was a politician. At the time I basically just ran it with whatever model I could get, got an answer, and submitted my question to The Rest is Politics. They read it out; that was very nice. But I didn't actually go and check how well it had done. So here we're going to take that challenge and try to optimize the prompt that we use.

Let me go into some code here and talk through the task. Hopefully this is at a scale you can read. There's a bunch of fluff here; in particular, there's an archive file that will be automatically downloaded if you run the script, so you don't have to go and do all the scraping from Wikipedia. It contains basically the raw HTML of the Wikipedia pages. We then have a schema for the MP's data, and a schema for their relations: the name, the role they might have had, and in particular the relation itself, whether they're a sibling or a spouse or a grandparent. With regard to the question we're trying to answer about political dynasties, it isn't really relevant if someone has a child or a spouse or a sibling who is also a politician; it's about their parents' generation, or their parents' parents: ancestors. This is something I found really hard to get models to respect. Once you said "relations", they just couldn't help but add a spouse or children or a sister-in-law into the mix. The way I actually solved this last year was to say "include the relation, whatever it is", and then afterwards I scanned through and removed the ones which were obviously the same generation or below. So one of the things we're trying to optimize here is finding a prompt which gets the agent to efficiently discount non-ancestors. In the schema we have a task input, some very basic initial instructions for the agent, and some slightly more advanced instructions, so we can see how the two perform.
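As a rough sketch of the kind of relations schema being described (class and field names here are my assumptions, not necessarily what's in the repo):

```python
from typing import Literal

from pydantic import BaseModel

# Hypothetical reconstruction of the relations schema described above;
# the actual names in the repo may differ.
Relation = Literal[
    'parent', 'grandparent', 'great-grandparent',  # ancestors: relevant
    'sibling', 'spouse', 'child',                  # same or later generation: not
]

class PoliticalRelation(BaseModel):
    name: str           # the relative's name
    role: str           # e.g. "MP for Aberavon", "Prime Minister of Denmark"
    relation: Relation  # how they're related to the MP
```

Including the relation itself in the output is what makes the "extract everything, then strip out non-ancestors afterwards" technique possible.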
And then ultimately we're going to try to optimize to find an even better prompt. The core of this is a Pydantic AI agent. It's actually just using a hard-coded model; it's a variable because in theory you can substitute it on the command line, but I don't think we need to today. The most interesting bit is that this is effectively a structured output question. It's equivalent to any other structured output task: find the address in this email, find the invoice lines in this PDF, whatever it might be. We're using the same idea of structured outputs to find these political relations: we're setting the output type to be a list of political relations, we set the instructions from above, and then when we run the agent, all we're doing is taking the HTML, doing a little bit of processing with Beautiful Soup to strip it down to text, so we don't pass in all the other crap going on in the HTML page, and passing it to the agent. And this works surprisingly well. If you go and read through the cases, you will see it is pretty damn accurate at finding all political relations. Even the relatively dumb models do a pretty good job. The main place they get confused is whether the relations are political versus otherwise public figures, and this idea of a relation needing to be an ancestor seems to really confuse the models.

You will see, if you look in the rest of the files here (sorry if the file explorer is small), once you run it once, you'll have this MPs directory, which contains a JSON file with the MPs, the URLs, and the raw data, and most importantly the pages, the HTML pages; as I say, they'll be downloaded into that directory the first time you run it. And then there's also the golden relations: supposedly the exactly correct answer for each politician. Stephen Kinnock, which anyone who's British will know is a reasonably famous political name, has a bunch of political relations. You see here that it's again included his wife, who was the Prime Minister of Denmark (I'm seeing some questioning looks there; I believe that's true), even though we said don't include those relations. That's why I've used this technique of including all relations and then stripping them out based on what the relation is. And I believe this golden data is pretty accurate. Truth is, we just ran a similar script with Opus 4.6 to get that data, but I've checked it quite a lot and it appears to be pretty much correct.

So the first step here is to run some evals against this and demonstrate the performance relative to the golden data set. But before we do that, I'm just going to run a single case in the terminal and show what's going on, just so you have some feeling for what we're doing here. If I run `uv run task 2` (2 is the ID of the first politician, for some reason: Stephen Kinnock), it will run the model and successfully find the two relations. And in this case, we've prompted it to only include ancestors.
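A minimal sketch of the agent shape just described, assuming the schema sketch above (the model string and prompt text are placeholders; the repo hard-codes its own, with a gateway prefix for Pydantic Gateway):

```python
from bs4 import BeautifulSoup
from pydantic_ai import Agent

# Placeholder model name; in the repo, prefix with the gateway to route
# through Pydantic Gateway, or use a plain provider:model string directly.
MODEL = 'openai:gpt-4.1'

agent = Agent(
    MODEL,
    output_type=list[PoliticalRelation],  # structured output: the list of relations
    instructions='Find the political relations of this MP. Only include ancestors.',
)

async def extract_relations(html: str) -> list[PoliticalRelation]:
    # Strip the Wikipedia page down to text so we don't pay for markup tokens.
    text = BeautifulSoup(html, 'html.parser').get_text()
    result = await agent.run(text)
    return result.output
```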
So it's done that correctly and it's excluded his wife. I guess people would prefer to run along with this and try it themselves, is that true? Okay. To do that, you will ideally need a Logfire account, which will allow you to do the managed variables stuff. You can use a free Logfire account; the free tier is extremely generous, so everything you're doing here is free. The other thing you need is obviously an API key to connect to the models. You have two choices there. In the code we'll basically be using Pydantic Gateway as a simple way of being able to access all of the models; you just see the gateway prefix in front of the model name. If you have your own OpenAI or Anthropic key and you want to use that, just remove the gateway prefix and it will work. If you want to use Pydantic Gateway so that everything just works, I'm going to generate an API key in Logfire and share it with you all (hence that channel), put a limit on it, and hope no one uses it all up before we finish the workshop. So, I'm in Logfire here, and I'm going to go into Gateway. If I had set this up more formally, we could invite you all and you could each generate your own keys with your own limits. I have not prepped that, so I'm just going to generate one key with a $1,000 limit, hope that does the job today, and then delete it after the workshop. I'm going to copy that into the channel here, and you should just be able to export that environment variable. Then when you run your code, for example task 2 again, you should get an output.

The other thing you'll need to do is set up Logfire for when you run the evals. The simplest way of doing that: in this directory (note we're in the subdirectory for this talk) run `uv sync`, and then, to connect it to your project, run `uv run logfire projects use demo`, demo being the name of the project that I want to connect it to. That will connect Logfire in this directory to the demo project, so when I run stuff, it will record traces to Logfire and use the managed variables from there. I'll wait a couple of minutes, because I'm sure people want to get that very simple hello-world thing working. Anyone got any questions while we're getting set up?

>> What does "gateway" mean?

>> AI gateways: we have one, but we're by no means the only ones. The idea is you can basically make requests to most models through our gateway with one API key. And in particular, we add observability, so if you're using the production version, you can see the requests that your team, or your agents, are making. We have things like caching and fallback, so if one model fails, we can fall back to another one, etc. But in this context, it just means you can have one API key and connect to Anthropic or OpenAI or Groq or Gemini models.

>> So the gateway is your own?

>> Yes.
But as I said, you don't have to use it here; it might just make it easier to get all of the models working. I'll wait a couple of minutes, but any questions, or should I go on?

>> Sorry, where's the repo?

>> The repo itself is github.com/pydantic/talks, and then the directory is the 2026 AI Engineer one. Have you logged in with Logfire before?

>> No.

>> Okay, you'll need to do that so you have the machine credentials; I think it's `uv run logfire auth`.

So, I'm going to keep going, just to keep things moving, unless anyone has any questions. We now have the very basic case working. What we want to do is run an eval to see how our model, with this prompt, is performing relative to our golden data set. If you go into the evals file, at the top we have the load dataset function, which creates our dataset, the object we can then call evaluate on. It loads the cases, basically creating one case for each of the MPs we're going to run, and ultimately returns this Dataset object. And we register one custom evaluator here, which in turn generates a whole bunch of metrics and assertions for how accurate a run is. You can go and read through the logic; I'm not going to go through each individual bit, and I'm not claiming these are the perfect set of evaluators, but they're a useful start. The point is, if you look through the docs, we have a bunch of pre-built evaluators, like LLM-as-a-judge, and then we allow you to define your own. And generally, defining your own is far better than LLM-as-a-judge, because LLM-as-a-judge is effectively, I don't know if it's politically acceptable to say, the lunatics running the asylum, and it can lead to slight issues. If you can have a deterministic eval like this, where we're comparing the result against a golden data set, that's much better.

If you look in main, where we're going to run this, main is really just a CLI. We set up a bunch of CLI commands, and in the case of eval, we run the run evaluation function. In there, ultimately, we print some banner stuff and then call this evaluate function. We're using Pydantic AI's override functionality to override the default prompt with the particular prompt we're evaluating, and the model we're evaluating (we're not actually going to change model in this case). Then we call dataset evaluate, which takes the actual function we're going to perform the evaluation on, and that will run all of our cases in parallel, with a max concurrency of five, and we give the run a name so we can find it in Logfire.
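For a flavor of what that dataset code looks like with Pydantic Evals (a sketch with assumed names and simplified scoring, not the repo verbatim):

```python
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

# Sketch of a deterministic custom evaluator comparing the agent's output
# against the golden relations; the repo's evaluator emits several metrics,
# this one just scores name overlap.
@dataclass
class GoldenAccuracy(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> float:
        expected = {r.name for r in ctx.expected_output}
        found = {r.name for r in ctx.output}
        if not expected and not found:
            return 1.0  # correctly found no relations
        # Fraction of golden relations found, penalising spurious extras.
        return len(expected & found) / len(expected | found)

def load_dataset(mps) -> Dataset:
    # One case per MP: the page text in, the golden relations as expected output.
    cases = [
        Case(name=mp.name, inputs=mp.html, expected_output=mp.golden_relations)
        for mp in mps
    ]
    return Dataset(cases=cases, evaluators=[GoldenAccuracy()])
```

The eval run itself is then roughly `dataset.evaluate(task_function, name=..., max_concurrency=5)`.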
Now, you will also see here we've set up Logfire at the top. "If token present" just means it will run without causing an error if you don't have the token set. We set an environment and service name. We're not going to print to the console, because in these complex cases it's actually quite tricky to see what's going on in the terminal; it doesn't really help to have Logfire printing there too. And we've switched off scrubbing, because occasionally something like the word "password" will appear in an output, and we're happy for that to go to Logfire. And then we instrument Pydantic AI. We don't need instrument-print yet, so I won't instrument prints, but when we get to using GEPA, instrumenting print is a nice way of getting GEPA's output into Logfire as well.

As you may have realized, help.md has some useful examples of what we're going to run. So we're going to run `uv run main.py` with the arguments: eval; the split, so we're running just the test set; and the prompt, the initial one, the very simple one-liner description of what the agent should do, rather than the more complex prompt. If we go and run this eval, you won't see that much in the terminal, because we switched off console output, but you will see it running through the cases. And if we go and look in Logfire, you can see those cases running, their costs, etc. We can dive into what an individual case looks like. So if you look here, we're running the agent, and I guess that's probably quite small, but I will try to zoom in: here we have the system prompt, again the simple instructions we talked about, then the input, which is the contents of the Wikipedia page as text, then the final output, and in this particular case it found no relations.

That has now finished, and as well as this view of the individual cases, we can go into Evals in Logfire, where you will see this dataset, and within the dataset, the most recent run from one minute ago. If we go in and look at that, we get a view of what's going on. Again, this is pretty small on the big screen, I guess, but it'll look much clearer on your screen. The most relevant metric is accuracy: you'll see ones where it's correct and zeros where it's incorrect. And you'll see a few values in between, where the output from the model differs slightly from the golden data set. You see here, this guy has a father who is something to do with UKIP; it's got his name correctly, but there's a slight difference in the description, hence it's got a 0.9 score. And you can see the overall performance of 85%: 85% of the time it got all of the stuff right. That's not particularly good. It's also not particularly interesting until we go and compare it to running with a better prompt. So, how many people are up for running that eval? A few people. Okay, I'll wait a couple of minutes. We're 20 minutes in; we've got quite a lot of time.
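While we wait, the Logfire setup just described looks roughly like this (a sketch; the service and environment names are assumed values, the keyword arguments are real Logfire SDK parameters):

```python
import logfire

logfire.configure(
    send_to_logfire='if-token-present',  # run without error if no token is set
    environment='dev',                   # assumed value; the repo sets its own
    service_name='mp-evals',             # assumed value; the repo sets its own
    console=False,                       # terminal output is too noisy here
    scrubbing=False,                     # fine for 'password' etc. to reach Logfire
)
logfire.instrument_pydantic_ai()         # trace every agent run
# Later, for GEPA, the talk also instruments print to capture its progress output.
```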
>> Can you show the command again, please?

>> Yeah, it's this command; look in help.md. We're basically running the eval command.

>> It ran really quickly for me.

>> I think it should take about 30 seconds, so I don't know whether you're on different internet or changed the model.

>> Ah, yeah, that will be much slower.

>> Anthropic was down this morning when I was trying to use it a lot, so it's being really, really slow, and GPT-4.1 is flying, because not many people are using it, and it's doing a reasonable job. That's partly why I'm using one of those models: to show the eval performance, but also, bluntly, to get the answers more quickly.

While people are getting there, I'm just going to run the next one, which is compare. If you go and look at what it's doing, it's basically just running the eval twice, with the bad prompt and the good prompt, and then we'll be able to compare those outputs in Logfire and see where the differences are coming from. And then we can go on to the full GEPA optimization step. I'll just run the compare, which should take a little more time, but still hopefully relatively quick with these models. You see, we're just running the 65 test cases here, rather than the full 650 MPs, as we'll use when we get to the full optimization.

>> If you accidentally run the same thing twice, you'll just create a new run?

>> You'll create a new run; obviously you'll have more spans in Logfire, and you'll get another run in the Logfire evals view. You can archive them when you've got confused by how many different ones you've got. So, if you come over here to Evals, into that most recent dataset, you'll see I've now got a bunch. I could archive the older ones, just to keep things clear. So I've now got two new runs here, two new experiments, and I'm going to select them and compare, and now we have a bunch of different metrics. Accuracy is ultimately the most important one here: slightly higher, 92% versus 87%, for the expert prompt versus the initial prompt. And we have the possibly slightly useless graph of different metrics and how they've performed; with fewer metrics or assertions, that can be more useful. What's probably most interesting, though, is to look into individual cases where the performance differs. So you see here, with this MP, Jo White, the initial prompt found one relation, a spouse, which obviously it should not have found, because we're not looking for spouses, whereas the expert prompt correctly ignored that one. And if you look at this case, Laura Kyle, you will see that the expected output is no matches at all, but both of these most recent runs found a match, because she had some four-times-great-grandfather who was the governor of Hudson Bay, which is not a politician, but is a public figure. So you can start to see where the differences are coming from, and the kinds of things you might want to do in the prompt to improve the performance. That's ultimately the kind of data GEPA is going to use to inform generating the next Pareto-frontier prompt for it to try.

>> I didn't manage to see the graphs.
>> You will see the graphs only if you're comparing two runs; if you open a single run, it doesn't show the graphs. So I ran eval, then I ran compare with this command here, then I selected two runs and hit compare, and then you see the graphs comparing the performance.

So, if you look in the task file, there are two prompts as a kind of starting point. One is a very simple one-liner, and the second is a bit more detailed description of the kinds of things you should be looking for: still not optimized, but what you might get if you as a human wrote out a decent prompt, or asked a model "write me a better prompt". So you can see we've improved the performance from approximately 87% to 92% with this model by doing that, and with GEPA we should be able to do better; I think I got 96% when I was running it earlier. By no means perfect, but there are real examples of using GEPA like this. What people generally do is hold one variable constant, like "we need the following quality, now reduce time or cost", or "we need to drive up quality". There was an example from Shopify using GEPA. They were basically looking at Shopify sites for things like whether they were fraudulent, or which tax category they fell into. And they switched from basically just giving the entire website to GPT-5 and asking "what is this?" to using an agent, a Qwen model, and GEPA to optimize the prompt, and they got the cost down from $5 million a year to about $73,000 a year, and improved performance over time. So that's obviously not just optimizing the prompt; that's some mixture of optimizing the prompt and moving to an agentic setup and so on. But in this case, we're just trying to improve performance whilst keeping the same relatively fast model, for the sake of the demo.

So, now we've done the comparison, I guess it's time to look into the GEPA phase of this, unless anyone wants me to wait a bit longer? I'll take the silence as a signal to keep going. So I'm going to re-enable instrument-print while I remember. And then, if you look at where we run the optimize command, you'll see this run optimization function. It loads in the training and validation data sets, erroring if they're not there. We then create this GEPA adapter. GEPA is, I guess, the state of the art now in agent optimization. Lakshya, who created it, is a first-year PhD student at Berkeley. He is perhaps not the most experienced Python engineer, and so GEPA doesn't do async that nicely, and it's not as type safe as I would like it to be. I'm itching to go and either fix it for him or fork it, bluntly, but I have so far not done so. GEPA has this concept of adapters, which are basically how you define the agent that is going to propose the next candidate. So we have this function to create the adapter, which in turn builds this adapter class we've defined, subclassing their adapter type.
It's a dataclass, so it takes a bunch of arguments, but ultimately the bit that matters is build proposer agent, which is their method name. What we do there is return a Pydantic AI agent in turn. So we're using a Pydantic AI agent to propose new prompts for optimizing a Pydantic AI agent, and we're again using GPT-4.1 for the sake of speed. In this case, there's a slightly annoying characteristic: GEPA is sync, not async, so each time it runs the proposer agent, it runs it in a new async context, and we have to make sure we create a new HTTPX connection, otherwise we get a bunch of errors; hence this slightly weird formulation. Here we have, again, a prompt. Obviously, you can go around in circles optimizing the prompt of the thing that optimizes the prompt, but we're basically saying: you're an expert prompt engineer, improve the system prompt for this agent, here are the things to consider. And then we call the ultimate evaluate function, which generates the cases, builds a Pydantic AI evals run, runs that eval, and that is what we use to test the performance of each newly proposed prompt. This will probably make more sense when I actually run it. We then attach the feedback for the failures, and we return this evaluation batch, which is a summary of how it performed, which is what GEPA then uses to decide whether this is a new Pareto frontier member or whether to propose a new prompt.

So I think the simplest thing to do is just go and run this. It will take a little longer, and it will cost a little more, but that's fine. If you run it with 50 calls, it basically dies before it gets anywhere, but if I run it with 400 calls, you will see GEPA has a similar kind of progress bar. As I said, I've switched off Logfire console printing, so you don't see what's going on in the background here, because it basically just becomes impossible to view. But if you look inside Logfire, you will see this optimization going on already. You can see that we evaluate the agent, and then we call the proposer agent here. If you look at this step, this is the system prompt I mentioned for the proposer agent, and then the input, which is basically GEPA's description of the context, to enable the proposer agent to come up with a new prompt. From there, the proposer agent proposes a new system prompt, and we then evaluate that, see how it performs, and iterate towards finding a better solution. So that's what's going on here. You see why I instrumented print: so that we can get the print output from GEPA nicely viewable inside Logfire. There may be a neater way of doing this, but this worked for me this morning. And it's going to work through; you can see it's a little more expensive, we've already spent $2, so I guess if everyone runs it, we will spend a few hundred dollars. OpenAI will no doubt appreciate it.
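The sync/async wrinkle looks roughly like this (a sketch of the pattern only, not GEPA's actual adapter interface; the proposer prompt is abridged, and the pydantic-ai class names assume a recent version of the library):

```python
import asyncio

import httpx
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

PROPOSER_INSTRUCTIONS = 'You are an expert prompt engineer. Improve the system prompt...'

def propose_new_prompt(context: str) -> str:
    # GEPA calls us synchronously, in a fresh event loop each time, so a
    # cached HTTPX client would be bound to a dead loop. Build a new one
    # per call; hence the "slightly weird formulation" mentioned above.
    async def _run() -> str:
        model = OpenAIChatModel(
            'gpt-4.1',
            provider=OpenAIProvider(http_client=httpx.AsyncClient()),
        )
        agent = Agent(model, instructions=PROPOSER_INSTRUCTIONS)
        result = await agent.run(context)
        return result.output

    return asyncio.run(_run())
```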
And it's working through: you see we're already about 50% of the way through the optimization, and it's proposing these new prompts, evaluating them, and using the feedback from the evaluation to decide where to go next. I'd encourage you, if you have some time, to read through these traces, because they're quite illuminating about how the optimization is working. One thing to note: people love to say that anything to do with AI is incredibly sophisticated and complicated and advanced. This optimization technique, whilst state of the art, is not actually that groundbreaking relative to the complexity of the model itself. It's a relatively crude loop: ask an agent to generate a new prompt; if it does better, take bits of that and put them into a new prompt; keep doing that until you either run out of time or get to some prescribed score.

>> How big are the batches?

>> I think you can see it here where it's running the evals. I think it's running all of the cases in the test data set, but I've seen it run fewer, so I think GEPA is clever enough to alter the number of cases based on how they performed; honestly, I'm not that sure.

>> I was wondering how to use this at scale.

>> I think we will make this much easier, open source. We'll also make it something where you just plug in Logfire and it happens in the background, but we also want to make it super easy to do locally or open source; we don't want to say you must plug into Logfire to get optimization.

>> Right now you just set the max calls, right? Is there a smarter way to do this, like "if you don't improve for 10 iterations, just stop"?

>> I'm sure there is. I haven't looked into it; like I say, I did this in a bit of a hurry. And presumably, for the most part, you're saying "reach this threshold" or, as you say, "run out of optimization budget", whatever it might be.

>> I don't quite understand how it does the prompt optimization.

>> If you look here in the proposer agent run, you will see effectively the input to the agent that is proposing a new prompt, and it is being told: based on this information, propose a new prompt. The principle of the Pareto frontier is that this input contains components of previous prompts that have done well.

>> It's not a diff; it's a whole new prompt?

>> Yes, it proposes a whole new prompt. But I think GEPA operates on the idea of a dict of known keys, and here we only have a dict with one key-value pair, i.e. system prompt and its value. If you have multiple different keys, then it can optimize by basically combining the best values from multiple different keys. You could imagine model and system prompt, or model and some set of tools, or even different lines from a system prompt, where the proposal would just be choosing different values from your set of lines.
That's the kind of standard way of doing this in DSPy: basically, take all of my different examples and choose which ones to include in the prompt.

>> DSPy?

>> DSPy is another agent framework, somewhat similar to Pydantic AI. It comes from a very machine learning background. I find it hideous to use, because it's not type safe, but the caliber of people using it is generally very high, and it has had this concept of optimization for a long time. For a long time it was basically the only agent framework that had this idea. Now GEPA has come along, and it's available within DSPy, but not only within DSPy.

>> What's your personal take on GEPA and these kinds of algorithms? From your personal experience, have they worked well in practice, or have you used more home-built optimizers?

>> I think in the case where you're trying to get a dumber model, or a faster model, or a cheaper model, to be able to do some task, they make a lot of sense. If you take a state-of-the-art model like Opus 4.6 and ask it most questions, if it has all the information it needs, it will just go and get it right, for the most part, and so the optimization is slightly less relevant. I think one of the subtleties here is the statistic that 98% of data is private. I don't know if that number is correct or not, but let's say it's off by miles and it's 50/50. When you have a private set of data the models have not been trained on, some massive internal spec for how you're supposed to operate as a bank, let's say, adding the right bits of context into the system prompt or into the instructions is incredibly valuable, and the optimization really matters. The problem we have when we're trying to give demos like this is that, by definition, there isn't private data we can use, because if someone lets us use their private data, they're not saying "oh yeah, can we go and talk about it at AI Engineer?". So I've got this slightly weird example of MPs, but actually, you'll see in the stuff I'm going to show you in a minute, the model just goes and figures out who the MP for some borough is even if it hasn't looked at the data, because this is all public data that it knows. So I think one of the subtleties is that optimization matters way more where you have large amounts of private data. And honestly, we're still trying to figure this out as a company. One option for us is to use data that is public, but new enough that it's not in the training data. One option would be news, but that could get slightly politically dicey if we were using current news as our source of stuff to optimize on. Another option is what's going on in sports, because whether you find it interesting or not, at least it's not controversial: we would ask which footballer has done well over the last five matches of this Premier League season. That might be public data, but it's equivalent to private in the sense that it does not exist within the model. It's not a really good example, though; it's something we're trying to figure out.

>> I have a question regarding GEPA. Does it...

>> I think your mic isn't working, but if you just speak up, I think we can keep going.
>> Yeah, I'm just trying to understand if there's a way it prioritizes adjusting the system prompt rather than appending to it, because I feel like that's a recurrent issue: you find new edge cases and you keep appending and adding more information. Does it strike that balance of "this we append, and this we adjust"?

>> So, I was using GPT-5 mini this morning, and one of the problems was that it was slower in general, but also it was just producing this enormous system prompt, which then meant the agent was slower. I don't know if it did better or not; I think it did roughly similarly well. So that definitely is a problem. I think the solution to that is the thing I was talking about earlier: ultimately, GEPA wants a dict of keys and values, and then, instead of the proposer agent being told "go generate me a new system prompt", you're basically saying "choose a new subset of our list of available inputs". You would go and split your system prompt up into, say, 200 sentences, and ask which 20 sentences are the best to use, and now, by definition, if it can only choose 20 sentences, it's not going to get verbose. In this case, we're using a Pydantic AI agent to summon up a whole new system prompt, which is probably better at finding the edge cases, but does result in a verbose prompt.

In fact, on that very point, it has now finished. It has given us this optimal prompt, achieving a performance score of 96.7%, so significantly above the 92% we got from our best case before, at the expense of a relatively verbose prompt. You can see it's gone through in relative detail, and in particular, it's banged on about what relations to use, when to include a relation, and what sorts of things count as a relation. Okay, so that is GEPA working. The other bit I was going to talk about is managed variables, so I will switch to that now. But before we go on to that, any other questions on this bit of the talk, or on GEPA?

>> The validation scores are compared to a golden set, right? Is the golden set in this repository?

>> Yep, it is. If you look in the cases directory, you will see golden relations, a JSON file with the supposedly golden set. Now, I'm sure it's not 100% perfect. As I say, we generated it with Opus 4.6 using a similar script, but I've checked it quite a lot.

>> I had a silly thought: if you didn't have that golden set, things would get quite difficult, wouldn't they?

>> Yeah. Evals are much easier if you have some golden reference for what it's supposed to be doing. In general, what people end up doing is having some subset of data that's been human annotated; that works. But you can also have cases where, for example, if you have an agent that's writing code, you go run the code and check whether or not it fails. Or you can have a full feedback loop, basically returning the data to the original source and seeing how accurate it is, something like that. Working out what your judge is, is always the hard bit of evals.
>> But for your example of running the code, it either runs or it doesn't, right? How do you do a percentage then? I'm a little bit confused there.

>> Well, you could check, for example, whether it was going to generate invalid code, or use libraries that weren't available. That would be a first cut of how performance is going. But yes, this is the hard bit of evals: what is "right"? And as models get more intelligent, figuring out what right looks like gets harder and harder. I mean, the ultimate case is an agent whose job it is to persuade people to stop smoking. The ultimate eval is: wait 40 years and see when they died. But obviously you can't have an eval where you wait 40 years, so your eval is probably "does it include the correct words, does it not include some bad words that you definitely don't want it to suggest", like "take up a cigar instead". So "text does not include cigar" would perhaps be one of your evals. That sounds dumb, but that is what an awful lot of evals look like.

>> Can I ask two questions? One is very short. With Pydantic AI you often break up the structure in the prompt; does this just jam everything together so you get one prompt at the end?

>> You always have one set of instructions, yes; it is building one set of instructions.

>> Okay, that's fine. So the real question was: variance is huge with this. I mean, you showed your results; I ran it, I got different results. Is there a way of having the variance be part of the actual optimization?

>> I wouldn't say that exactly. When you run evals (I'm just looking for where we call run), you can basically set the number of times you run each case, and the best way of reducing variance is to run them lots and lots of times. I know a bunch of hedge funds who are spending about $20,000 a night running their whole set of evals, to see if they can improve over time. And if you have your own GPUs, you're like, well, it's sort of free to run them, so we should run everything 100 times and look in the morning. It's definitely one of the challenges.

>> And do you then create a cost function? Can you say "I want you to optimize both accuracy and minimize variance"; do you write your own cost function within it?

>> Yeah. If you look in the adapter code here, you will see that evaluate returns this evaluation batch type, where scores is a list of floats, so ultimately you can return whatever floats you want in there. And just in case people are wondering, trajectories is effectively similar to traces: it is the sequence of steps the agent went through to get to a particular point.
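As a sketch of that idea: since the scores handed back to the optimizer are just floats, you can fold repeat-run variance into them however you like. All names here are assumptions for illustration:

```python
import statistics

# Hypothetical scoring helper: run each case several times and return a
# variance-penalised score, so the optimizer prefers prompts that are both
# accurate and stable. The 0.5 penalty weight is arbitrary.
def stable_score(run_case, case, repeats: int = 5) -> float:
    scores = [run_case(case) for _ in range(repeats)]
    return statistics.mean(scores) - 0.5 * statistics.pstdev(scores)
```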
>> How do you handle systematic errors? I ran the optimization and it got a high accuracy, 96%, but it got aunts and uncles wrong: the final prompt explicitly excludes aunts and uncles, even though they are in the golden relations. Claude Code spotted this immediately. So maybe one could have a harness where there's another check at the end?

>> I think what's happening is that it's using a small set of test cases and a bigger set of evaluation cases, and so the subset of cases the optimizer agent gets exposed to maybe doesn't include an uncle or aunt, and so it was simpler to just exclude them. So one of the other things you end up needing is at least 2x the amount of data, covering the whole space, so that you can have one full set of data to train on and one full set to evaluate on. These are the classic problems of machine learning: you need an awful lot of data to build something really reliable, because ultimately you're already getting to the point where you're kind of encouraging the agent to overfit, and I suspect the uncle-and-aunt thing is it overfitting.

>> Okay, thanks.

>> There's a question behind you.

>> This is probably a very silly question, but surely the prompt only works with the particular model you trained it against?

>> That is people's take, yes: you would need to run this all over again if you changed your model.

>> And that's going to happen all the time, isn't it?

>> Yep, and that's one of the reasons evals are hard, and one of the reasons most people don't run evals and don't run optimization. They write out a decent prompt, they ask their coding agent of choice "does this prompt look good?", and if it says yes, they kind of eyeball it, put it into prod, and worry about other things, because probably the model that comes out in a month's time is going to supersede whatever optimization you did, in some cases. Now, if you are a private equity firm with 200 million invoices from across your portfolio that you want to analyze, it's worth optimizing for a particular version of a small Qwen model to do that job, instead of just throwing GPT-5 at it, because the difference is something like $10 million, and you have your one analyst who earns half a million dollars, you put them on it for three weeks, and you've solved it. So it depends on the question you're trying to solve. But the ultimate example of this is the coding agents, where the trajectories are extremely sparse, as in there's an extremely wide range of things they're doing, and for the most part we're told those companies don't have very many evals, or don't have any evals at all. If you go and ask Boris Cherny how he works on Claude Code, he says mostly vibes: mostly we just see what works and tweak it a bit. So I'm not claiming this is a panacea in all cases, but there are definitely situations where being able to optimize your prompt or your agent choice is valuable. The other thing, which we don't have a demo of here: there's an awful lot more than just the prompt you can optimize. There's the model, there are things like the compaction strategy, there is how you register tools, there are things like including code mode. And the next step, what we want to allow, is for you to optimize across the full range of things you can do within your agent, to find the optimal configuration, where perhaps the particular model choice doesn't matter so much.
>> It just makes me think: you evaluate the prompt here, but could we switch a bit and start evaluating different models, or would that be silly?

>> Yes, that's something you could totally do: try different models. I mean, given that there are maybe 10 models you want to try, you'd probably just run all 10 and see which performs best, or which one optimizes for your particular trade-off of performance, price, and time. I think there was another question behind you.

>> Yeah, thanks, I already got the mic. What you've shown is a task which is relatively narrow in what should be achieved, and I'm wondering how you would approach it if the task is more open-ended.

>> My suspicion is that that's a case where this optimization technique is harder to do. Like I say, in the coding agent case, you could imagine you end up with 5,000 distinct tests for a coding agent, you optimize as much as possible on those 5,000 cases, and it turns out they're a drop in the ocean of the different things people do, and now your agent is over-optimized for your 5,000 cases. I think the bigger the breadth of the task your agent is performing, the harder it is to do this, and the more you end up just wanting a clever agent. The other thing to say is my point earlier: where you have large amounts of private data, and you have to put masses of context into the system prompt for understanding your specific domain, then choosing the right examples from that to put in, which basically cover the full data set, is more and more relevant. Did you have another question?

>> So if I understand correctly, if you're a company and you use this GEPA optimization technique, and you had more money and maybe more expertise, you could decide to fine-tune a model instead. So they are kind of competing strategies?

>> Correct. And in answer to your previous question about "isn't it made obsolete by the next model": the main reason the big model labs say don't bother fine-tuning is that you really are spending tens of thousands of dollars to fine-tune a model, and actually, for the most part, improving the harness, waiting for the next model, showing a nicer loading icon to your users, is probably better than fine-tuning. But that misses the cases, in particular in places like finance, where they have enormous numbers of runs, where they really do care about that optimization, and fine-tuning applies.

>> Thanks.

So, I'm going to go on to talk about managed variables. As I said earlier, it's not going to be a particularly sophisticated answer I've got here, but I'm going to show the example just to explain how it works. We have here a very simple FastAPI web server. You will see that it has ultimately two endpoints: one returns some HTML, and the other is a form submission endpoint. And if I go and run this (it was in fact already running; if you want to run it yourself, you'll probably want to use this command, which I'll put into Slack as well), you'll see Logfire output, which is not so interesting yet, but you will also see the server running locally.
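A minimal sketch of the shape of that server (a toy under assumptions: endpoint paths, HTML, and the agent wiring are placeholders, not the repo's code):

```python
from fastapi import FastAPI, Form
from fastapi.responses import HTMLResponse
from pydantic_ai import Agent

app = FastAPI()
agent = Agent('openai:gpt-4.1', instructions='Answer questions about UK MPs.')

@app.get('/')
async def index() -> HTMLResponse:
    # One endpoint returns the HTML form...
    return HTMLResponse(
        '<form method="post" action="/ask">'
        '<input name="question"><button>Ask</button></form>'
    )

@app.post('/ask')
async def ask(question: str = Form()) -> HTMLResponse:
    # ...the other handles the submission by running the agent.
    result = await agent.run(question)
    return HTMLResponse(f'<p>{result.output}</p>')
```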
And this is just a very simple form where you can ask questions about the same bunch of MP data. I can go and ask a question like "who is the MP for Hammersmith?", where I live; we'll see how well this performs. It will run for quite a long time, because what it's got in the background is basically one tool which lets it grep through the HTML pages, and it has replied relatively quickly on this occasion. If I load Logfire here, you will see the HTTP request coming in, and then you will see the agent run: it ran a good search sequence, got back the data, and was able to correctly identify Andy Slaughter as the MP for Hammersmith, where I live. All very well. But if you look at how this code is defined, we have here this Logfire variable, and this allows us to change the behavior of this agent without redeploying. Now, obviously, redeploying in this particular case, where I'm running locally, is very, very simple. But where I've got some big production CI stack and deployment takes hours, being able to change variables in production or in staging very quickly is extremely valuable.

The way Logfire variables work: we have a type here, a Pydantic model, which in turn has three fields, the instructions, the model, and the max tokens, which is an integer. We've given it a default value: some very simple prompt, and we've chosen the model and the other settings. And what I have run here is `uv run web` with a push command; there's basically a CLI command for this which pushes the variable, and ultimately we're just calling Logfire's push-variables. When I run that, I'll get asked whether I want to overwrite this variable, which I don't, but if you're running it for the first time, you won't be asked, and you'll create a new variable. Now, it's worth saying, we need to make the experience of using these variables slightly easier: at the moment they're not configured by the `logfire projects use` command that you ran before, so you need to set a separate API key to use variables. That will be solved soon, but if you want to do it today, the way to do that is to go into Settings, API keys, within Logfire. I can create a new API key, and I just need to give it the three variables permissions, or whichever ones you want. We're ultimately using that API key to push the variables from local development, from the code definition, into Logfire, and then pulling them to get the updated values. And because I have pushed that, if I go into managed variables here, you will see I have two variables set up. This MP search config is the one we were looking at: you see we have those three fields that we've set, and we have some history of updates. But in particular, what we can do under targeting is define what percentage of calls use which of our different values. So this is very much like A/B testing; in fact, under the hood, our managed variables use OpenFeature, the open standard for doing this, so in theory you can connect to Logfire with anything that speaks OpenFeature. But yes, at the moment we're using the code default.
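To make that concrete, here's a rough sketch of the pattern. The exact Logfire variables API isn't shown in the talk, so every name and signature below is an assumption about the shape, not the real SDK:

```python
import logfire
from pydantic import BaseModel

class MPSearchConfig(BaseModel):
    instructions: str
    model: str
    max_tokens: int

# ASSUMED API SHAPE: register a typed variable with a code default; the
# real Logfire SDK call and argument names may well differ from this sketch.
search_config = logfire.variable(
    'mp_search_config',
    type=MPSearchConfig,
    default=MPSearchConfig(
        instructions='Answer questions about UK MPs.',
        model='gateway/anthropic:claude-sonnet-4.5',  # placeholder model string
        max_tokens=4096,
    ),
)

# At request time the current value is pulled from Logfire, so editing it
# in the UI changes behaviour without a redeploy:
# config = search_config.get()
```

The point of using a Pydantic model rather than a bare string is exactly what's demonstrated later: one managed variable can carry the prompt, the model, and numeric settings together.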
So you saw when I ran this that I got back this answer of Andy Slaughter. But now let's go and update the value here. I'm going to edit it, leave those values the same, and say "reply in French", to make it very clear what's happened. Save that, go into targeting, and dial the use of latest up to 100%, so it should now use that latest value. And now, if I ask the same question again (I stopped the server, which was silly, but you could imagine the server still running: it's going to pull the new variable each time that function is called), if the demo gods are with me and everything is hooked up correctly, it has indeed replied in French. So we've updated the server without redeploying. Obviously, I did in fact restart the server, but I could change it to not need that. So, just to prove that that works, let me come back over here, edit this again, and say "reply in German". And now, if I ask the same question again, that variable defining the system prompt should be updated. It's taking its sweet time, but it is indeed replying in German.

But the nice thing here, coming back to your point about whether it's just a system prompt: we can set up our variables to not just be simple text values, but, for example, a Pydantic model, which allows us to edit multiple different fields. So I can switch here from Anthropic Sonnet 4.5, via the gateway, to OpenAI GPT-4.1; I'll get rid of the other change, because that's going to confuse me, and save that. The point is that we can update multiple different values. So if I ask it again, it says it's now ChatGPT rather than Anthropic. So we can use this to change variables and ultimately experiment with how new prompts, or new models, or temperature, whatever it might be, behave in production.

Now, of course, the ultimate aim here is to take this whole system, wire it up, and, instead of us having to tweak the managed variables manually, get something like self-driving for managed variables: given some set of evals, it can optimize, hill-climb towards some peak, without us having to do anything. That's the feature we're working on. But I can prototype what that might look like here. Let me just reload this. You will see in managed variables I actually have two variables here. The other one is MP relations instructions, which is instructions for the task agent we were running before. You will see we don't have any values in there yet, but in the web server code we have registered two different tools with this main agent. One is the MP search, which it was using before to find out who the MP for Hammersmith was, but it's also got this extract political relations tool, which is basically running the same extraction logic; if you look through, ultimately it's calling that same extract relations function we were optimizing earlier. So this agent can be asked who the political relations of a politician are, and with a bit of luck, it will go and use this tool to do the calculation.
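Since the talk says managed variables speak OpenFeature under the hood, reading a value through the standard OpenFeature Python SDK could look something like this. The SDK calls themselves are real OpenFeature APIs; the idea of a Logfire-backed provider, and the flag name, are assumptions:

```python
from openfeature import api

# Assumes some Logfire-backed OpenFeature provider has been registered,
# e.g. api.set_provider(...); the provider itself is hypothetical here.
client = api.get_client()

# Pull the current instructions; falls back to the code default if unset.
instructions = client.get_string_value(
    'mp_relations_instructions',              # variable name, as in the demo
    'Find the political relations of this MP.',  # code default
)
```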
At the moment it's going to use the simple, dumb prompt, but the point is we can alter that prompt based on the optimized value that we got, and thereby apply our new state-of-the-art prompt without having to redeploy.

So if I come over here, the first thing I'm going to do is dial that previous variable back down to zero, because we want the agent to be told correctly what to do. And just to prove that something here is working, I'm going to ask: who are the political relations of — what was he called? — Stephen Kinnock, because I know he does have some. I'm going to run this case, and we can look in Logfire in a minute and see what it's done. But hopefully it has used that tool as the correct way of identifying who the relations are. So yes, it's found them correctly.

But more importantly, if we look in Logfire here, you'll see that most recent call. And you can see that the search agent run called the extract political relations tool with the input Stephen Kinnock, which is correct, and it then correctly output the value. But if you look inside the tool call, we then ran the other agent nested: it ran and correctly extracted the right data. So we're now using our agent, and we can now basically tweak the value and improve the prompt.

To be fair, I'm not going to have a particularly good way of showing you that it's working better, because it would seem to be working pretty well, unless I can find an example where it wasn't. So let me actually try; I've got a bit of time. Let me see if I can find an eval case where the dumb prompt was doing badly. Oh, we've got loads of runs now. I want this case here, and I want to compare these two. So for example this person here, Paul Maskey, it was getting wrong — no, wrong way around: I want a case where it's done badly, where I can get it to do better. Okay, let's use Andrew Gwynne, where the correct answer is no political relations, but where both prompts suggested that he did have relations. And I'm sure that's because, if we look, it'll be because of their sports-commentator father, which the dumb prompt wrongly suggested was a political relation. So if I run this now, with a bit of luck this is using the dumb, poor prompt, and so it should find those relations; and conversely, if I update it to the correct prompt, it shouldn't. Yes: it's identified the sports commentator as a political relation.

And now, without redeploying, I can come over to managed variables and update the prompt that we're using. I think I may have a mistake here. Let me — you didn't see me just go and tweak the code — I'm going to tweak the code to use this version of the task where we're using the instructions managed variable. Let me restart the server. And now, without redeploying, I promise you, I'm going to create a new value here, and instead of the simple prompt I'm going to put in the best one we got back from running GEPA. Save that. And without further ado, without redeploying, I promise, we run that again, and it'll almost certainly go wrong, because that's what things do. But we can hope. And... it's gone wrong.
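The "change the prompt without redeploying" step hinges on resolving the instructions at run time rather than at import time. A minimal sketch of that wiring: Pydantic AI's dynamic-instructions decorator is real, but the managed-variable lookup below is a hypothetical stand-in for whatever Logfire's variables client actually exposes.

```python
from pydantic_ai import Agent

task_agent = Agent('openai:gpt-4.1')


def get_variable(name: str, default: str) -> str:
    """Hypothetical lookup of a Logfire managed variable.

    Per the talk, the real mechanism resolves the currently targeted
    value (code default vs. 'latest') each time it's called.
    """
    return default  # stand-in


@task_agent.instructions
def relations_instructions() -> str:
    # Re-resolved on every run, so editing the managed variable in
    # Logfire (e.g. pasting in the GEPA-optimized prompt) takes effect
    # without redeploying.
    return get_variable(
        'mp_relations_instructions',
        default='Extract the political relations of the given MP.',
    )
```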
It's again identified it, unfortunately. But hopefully you get the principle here. Yes, there's a question.

>> This web interface where you search and get a response — what do you call this?

>> Well, the form, and then the agent I'm submitting to.

>> Yeah. Do you have any thoughts about collecting feedback from users?

>> Yeah. My number one piece of feedback is: no one clicks them.

>> Or you could be nefarious like myself and just click bad even though it was fine.

>> So my experience is that if you're going to do it, you do only thumbs up / thumbs down. We have an annotation system, which I'm not demoing today, where you can record annotations against prompts. That's another place to build your golden dataset from, or at least a starting point for it. But the best thing you can possibly do is something implicit in the user's interaction. If you have something more advanced than this simple text prompt — if you have a chat — the single best way of evaluating the performance of a given question/response pair is to look at what the user did next. Because if the user says, "You idiot, try again," you have pretty strong evidence that the agent behaved badly. And if the user says, "Thanks, that's great," or goes away, you assume it's right. It's probably not as extreme as "you idiot, try again"; it's probably more like, "No, I meant X." And that's how Google works, right? The way they identified how good a website was, back in the old days when we used Google, was to see how long you spent on a site. If you came straight back and clicked on another link, they assumed the page was bad.

>> And this form here — is it a chat, or...?

>> This is just, honestly, me asking Claude: go build me a single-pane chat form. We have an integration with the Vercel AI protocol, so that you can build a full chat interface on a Pydantic AI agent in a few lines of code, supposedly, which I would try to demo if I had a bit more time. This is just me trying to have the simplest possible example to run. In fact, go on, I'll have one try at doing it now, since that one didn't work.

>> And a chat would mean that it kept its context, right?

>> Yeah. So if I get rid of this single run, and instead I do — what's the agent called? — relations agent dot to web... and then I go and run. Let me kill that web server to avoid confusion. Why is that not running? Maybe I do need to call .run; maybe I'm forgetting exactly how to call it, but the principle is that it's basically one line of code here, if I remember what it is, to turn that into a proper chat interface. Let me just see if the docs will tell me what it is quickly. So I define the app, which gives you a Starlette app, and then I run `uv run uvicorn task:app` — and there's a typo. And that should give me a nice interface now to ask questions, where I can say, like... I don't know if that's going to work, but we will try it. But the point is that you would get back a final result; you could use it like an agent and then ask follow-up questions.
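For what it's worth, a sketch of that "one line of code to get a chat interface" idea. The `to_web` call in the demo isn't verifiable from here; this version uses Pydantic AI's AG-UI adapter, which serves a comparable purpose (an ASGI app you can run under uvicorn) — treat the method as an assumption if your version differs.

```python
# task.py
from pydantic_ai import Agent

relations_agent = Agent(
    'openai:gpt-4.1',
    instructions='Answer questions about the political relations of UK MPs.',
)

# Expose the agent as an ASGI app; the demo used a similar one-liner
# ("to web" in the transcript) to get a chat UI over the agent.
app = relations_agent.to_ag_ui()

# Run with:
#   uv run uvicorn task:app
```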
I mean, this agent is designed to always reply with structured data, so it's not very chatty, but you get the idea. I'm going to stop there. Thank you very much. Happy to answer any questions afterwards.

>> Sorry, did you have a question? Go on.

>> Thank you. Can you hear me?

>> Yep.

>> Do you have cool use cases internally where you use your stack, like agents that you've built? Because I hear a lot of "look how many tokens we're burning", but — you mentioned the private equity firms earlier; those are classic old problems, classifiers — do you have use cases internally where you're really impressed by agents, or where the team is using them to ship faster or create more value for either you or your clients?

>> So there's an agent inside Logfire where you do free-text search and it converts it to SQL, and we've optimized that agent a fair bit. We don't care particularly about token count in that case, because it's not high enough volume; we do care about generating good SQL. The reason we can't use it as a demo is that it very often ends up with private data within the SQL — someone will say "find me results for invoice 12345", and we don't want to show people that. It's one of the reasons we can't use that example in demos. I think there was another question. Does that answer the question?

>> Right. So in Logfire, is it possible to obfuscate or encrypt the input and the output and just evaluate against a golden set? Say you've set up an application to work in a field with sensitive information, like medical.

>> So you can choose not to send the system prompt or anything else to Logfire, and just record, for example, the performance — good or bad — and then you have the metrics. You can choose not to send the raw data; it's obviously less valuable. My somewhat biased answer would be that you can enterprise self-host Logfire and have it inside your VPC; that's probably what people do. But yes, you can do it without recording the actual input and output.

>> Great.

>> What people do in those cases — for example the big legal tech companies, like Legora and Harvey, which are strictly not allowed to exfiltrate the data, because it's their clients' clients' very private data — is record categorical performance. Good, bad, whatever; you could have fifty grades of how it performed. You can exfiltrate that data without exfiltrating any generated content that might include private data. Any other questions? Yes, there's one here.

>> Thank you. This is a tiny bit of an internal question. I noticed that you're using dataclasses throughout. Is that for speed reasons, or any other reason?

>> I think they're just the canonical thing in Python. If you're not trying to perform validation, I think they're generally the canonical choice over Pydantic models. It doesn't make any difference to performance in any of these cases: the Pydantic validation or the dataclass construction time will be a vanishing 0.001% of the total.

>> Thank you.

>> Thanks for the presentation. I had a quick question. So with the prompt optimization — obviously that's a type of context engineering, right?
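A sketch of that "categorical performance only" pattern, using Logfire's standard logging call (which is real API); the grading scheme, attribute names, and example values below are made up for illustration:

```python
import logfire

logfire.configure()


def record_grade(case_id: str, grade: str) -> None:
    """Record how an extraction performed without exfiltrating content.

    Only a case identifier and a categorical grade ('good' / 'bad', or
    finer-grained labels) leave the process -- never the model's input
    or output, which may contain private data.
    """
    logfire.info('extraction graded {grade}', grade=grade, case_id=case_id)


# e.g. the Andrew Gwynne case, where the dumb prompt flagged the
# sports commentator as a political relation:
record_grade('andrew-gwynne', 'bad')
```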
So I was wondering, have you done evals where you've combined that with — because obviously there's context degradation: even though the context windows are so big, after twenty or so calls quality still drops off quite quickly. So combining prompt optimization with spinning up new LLM calls and fresh context restarts, as opposed to, say, compaction, which still degrades as well?

>> The compaction strategy is definitely one of the things you'd want to optimize in a more complex case. Here I've optimized something that everyone can grok in a session, rather than the most complex use case. But yes, that's definitely relevant. So, for example, in the Shopify example of how they were analyzing these sites with GPT-5: you could just give it the entire content of the website and say, "Hey, go figure out whether this is fraudulent." Using a Qwen model, you couldn't do that; you needed to do something more agentic, where it performed a bunch of queries to look for certain terms. And that's a variation on context engineering that basically allows a smaller model to perform the same task and be more deterministic.

>> Great. Thank you.

>> Stop. Apparently I have to stop. Thank you everyone.
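To close, a sketch of that "smaller model plus query tool" pattern from the last answer: instead of dumping a whole site into context, the model issues targeted term searches. Everything here (the tool name, the page store, the model string) is illustrative, not the Shopify setup:

```python
from pydantic_ai import Agent

# Stand-in for the scraped site content; in practice this would be a
# real page store or search index.
PAGES: dict[str, str] = {
    'home': '... page text ...',
    'refund-policy': '... page text ...',
}

# A smaller model that can't usefully take the whole site in context.
fraud_agent = Agent(
    'openai:gpt-4.1-mini',
    instructions=(
        'Decide whether this site looks fraudulent. '
        'Use search_pages to look for specific terms rather than '
        'reading whole pages.'
    ),
)


@fraud_agent.tool_plain
def search_pages(term: str) -> str:
    """Return only the page names mentioning the term, keeping context small."""
    hits = [name for name, text in PAGES.items() if term.lower() in text.lower()]
    return ', '.join(hits) or 'no matches'
```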