Playground in Prod: Optimising Agents in Production Environments — Samuel Colvin, Pydantic

Channel: aiDotEngineer

Published at: 2026-05-07

YouTube video id: A48uhxfxbsM

Source: https://www.youtube.com/watch?v=A48uhxfxbsM

I'm Samuel. I'm probably best known as the creator of Pydantic, the open source library. Now I run Pydantic, the company. We do a bunch of stuff: we still maintain Pydantic validation, of course; we have Pydantic AI, the agent framework; and then we have Pydantic Logfire, our observability platform. How many people coming into this have heard the word Pydantic? Okay, that's reassuring. And how many people have at some point played with Pydantic AI and Logfire?
Okay, so most people. Logfire is fundamentally, under the hood, a general observability platform: OpenTelemetry logs, metrics and traces. I don't really believe in "AI observability"; I think it's a feature, not a category, and it will get eaten by either observability or AI at some point. But today we sell it as AI observability, because that is understandably the thing that people want. We go beyond the standard observability of logs, metrics and traces: we do things like evals, which I'll talk about a bit today, and managed variables, which is one of the newer features. And the step beyond managed variables, which we're working on at the moment, is optimizing your agent autonomously from the platform. That's not what I'm going to demonstrate today. I'm going to talk about GEPA and how you can use it with managed variables, in a somewhat Heath Robinson way, until we have the full end-to-end, all-singing, all-dancing version where you give us your money and we magically optimize your agent. This is probably also more interesting than the "magically we make your agent better" pitch.

Anyway, there are two core subjects I'm going to talk about, and then I'm going to try and put them together at the end. So, GEPA is a library. How many people here have heard of GEPA and sort of feel like they understand what it does? Okay. Well, that probably includes me, to an extent. The name comes from "Genetic Pareto". It's a genetic algorithm in the sense that it looks for the best value of some kind; once it's found a good value, it mixes it with other values that appear to be good, produces a new candidate, and tries to optimize that. So it's ultimately an optimization library, and what it optimizes is a string. That string can be a simple text prompt, or it can be some JSON data which ultimately contains whatever you want. The Pareto bit comes from the fact that it takes candidates from the Pareto frontier of the best examples it has. Imagine you're breeding racehorses: you take the best racehorses and breed them each time. You don't, for the most part, go and take some really slow horse and add it into the mix to see what happens. You take the best performers, breed them, and hope to get even better ones. GEPA does a similar thing.
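To make that concrete, here's a minimal sketch of what driving GEPA looks like; the parameter and attribute names follow my reading of the library's optimize entry point, so treat them as assumptions and check the version you have installed:

```python
import gepa

# Minimal sketch, assuming gepa's optimize() entry point: GEPA evolves
# a "candidate", a dict of named string components. Here there is a
# single component, the system prompt.
result = gepa.optimize(
    seed_candidate={'system_prompt': 'Extract the political relations of this MP...'},
    trainset=train_cases,       # cases used when reflecting on failures
    valset=validation_cases,    # cases used to score candidates (Pareto frontier)
    adapter=my_adapter,         # tells GEPA how to run and score a candidate
    max_metric_calls=400,       # hard budget on evaluation calls
)
print(result.best_candidate['system_prompt'])
```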
And then the second subject is managed variables. Lots of platforms, including ours, have prompt management: the idea that you can have your text prompt in the platform and edit it from there. We take prompt management one step further with managed variables. They don't have to be just text; effectively any object that you can define with a Pydantic model can be managed inside Logfire. And we're going to try to combine those two things today. I will say at the beginning that, due to some family situations, I wrote most of this talk overnight. So, maybe in keeping with the rest of this conference, it may be slightly chaotic. If what I'm saying doesn't make sense, or I go to sleep at any point in this talk, please throw something at me and I will endeavor to make more sense.
So, I'm going to talk quickly about the subject that we're going to optimize. It has to do with politics. Don't worry, it's not particularly controversial politics, relative to what's going on at the moment. There's a podcast called The Rest is Politics, which I'm a big fan of and a big listener to. They were having a conversation back in April last year, almost exactly a year ago in fact, about how many politicians come from political dynasties, families of politicians. There was a lot of bluster about what the answer might be. I used Pydantic AI to analyze the Wikipedia article for each MP, looking for references to relations who were also politicians, and was therefore able to come up with a percentage: I think it was 24% of MPs who have some kind of ancestor who was a politician. But at the time I basically just ran it with whatever model I could get, got an answer, and submitted my question to The Rest is Politics. They read it out; that was very nice. But I didn't actually go and check how well it had done. So here we're going to take that challenge and try to optimize the prompt that we use. Let me go into some code here and talk through the task. Hopefully this is at a scale you can read.
There's a bunch of fluff here. In particular, there's an archive file that will be automatically downloaded if you run the script, so you don't have to do all the scraping from Wikipedia; it contains basically the raw HTML of the Wikipedia pages. We then have a schema for the MPs' data, and a schema for their relations: the name, the role they might have had, and in particular the relation itself, whether they're a sibling or a spouse or a grandparent. With regard to the question we're trying to answer about political dynasties, it isn't really relevant if someone has a child or a spouse or a sibling who is also a politician; it's to do with their parents' generation, or their parents' parents: basically, ancestors. And this is something that I found really hard to get models to respect. Once you said "relations", they just couldn't help but add a spouse or children or a sister-in-law into the mix. The way I actually solved this last year was to say "include the relation, whatever it is", and then afterwards I went through and removed the ones which were obviously the same generation or beneath, as it were. So one of the things we're trying to optimize here is finding a prompt which will get the agent to efficiently discount non-ancestors.
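As a rough sketch, the relations schema described here might look something like this; the exact field names and allowed relation values in the repo may differ, so treat these as illustrative:

```python
from typing import Literal

from pydantic import BaseModel

# A hedged sketch of the relations schema; field names are illustrative.
class PoliticalRelation(BaseModel):
    name: str
    # e.g. "MP for Aberavon" or "Prime Minister of Denmark"
    role: str
    # Only the ancestor generations matter for the dynasties question;
    # same-generation and younger relations get stripped out afterwards.
    relation: Literal[
        'parent', 'grandparent', 'ancestor',
        'sibling', 'spouse', 'child',
    ]
```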
So that's the schema. We've got a task input, which is what's relevant here; some very basic initial instructions for the agent; and some slightly more advanced instructions, so we can see how the two perform. Then ultimately we're going to try to optimize to find an even better prompt.
The core of this is this Pydantic AI agent. It's actually just using a hard-coded model; it's a variable because in theory you can substitute it from the command line, but I don't think we need to today. The most interesting bit is that the way we're doing this is effectively a structured output question. This is equivalent to any other structured output task: find the address in this email, find the invoice lines in this PDF, whatever it might be. We're using the same idea of structured outputs to find these political relations: we're just setting the output type to be a list of political relations. We set the instructions from above, and then, when we run the agent, all we're doing is taking the HTML, doing a little bit of processing with Beautiful Soup to get out the text (so we don't have to take all of the other crap going on in the HTML page), and passing it to the agent. And this works surprisingly well. If you go and read through the cases, you will see that it is pretty damn accurate at finding all political relations. Even the relatively dumb models do a pretty good job. The main place where they get confused is whether their relations are political versus otherwise public figures, and this idea of a relation being an ancestor seems to really confuse the models.
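In Pydantic AI terms, the agent described here looks roughly like this; the model name, instructions and helper function are illustrative rather than the repo's exact code:

```python
from bs4 import BeautifulSoup
from pydantic_ai import Agent

# A sketch of the extraction agent: structured output via output_type.
agent = Agent(
    'openai:gpt-4.1',                     # hard-coded model, as in the talk
    output_type=list[PoliticalRelation],  # the structured output schema
    instructions='Find the political relations of this MP...',
)

async def extract_relations(html: str) -> list[PoliticalRelation]:
    # Strip the markup with Beautiful Soup so the model only sees text.
    text = BeautifulSoup(html, 'html.parser').get_text()
    result = await agent.run(text)
    return result.output
```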
You will see, if you look in the rest of the files here (sorry if the file explorer is small), that once you run it once you'll have this MPs directory, which contains a JSON file with the MPs, the URLs and the raw data, and most importantly of all the pages: the HTML pages which, as I say, will just be downloaded and put into that directory the first time you run it. Then there's also this golden relations file, which is supposedly the exactly correct answer for each politician. So Stephen Kinnock, who anyone who's British will know as a reasonably famous political name, has a bunch of political relations. You see here that again it's included his wife, who was the prime minister of Denmark. I'm seeing some questioning looks there; I believe that's true. Even though we said don't include such relations, that's why I've used this technique of including them and then stripping them out based on what the relation is, etc. And I believe this is pretty accurate. The truth is we just ran a similar script with Opus 4.6 to get that data, but I've checked it quite a lot and it appears to be pretty much correct.
So the first step here is to run some evals against this and demonstrate the performance relative to the golden data set. But before we do that, I'm just going to run a single case in the terminal and show what's going on, just so you have some feeling for what we're doing here. So if I run `uv run task 2` (two being the ID of the first politician for some reason, Stephen Kinnock), we'll go and run the model and successfully find the two relations. In this case we've prompted it and said only include ancestors, so it's done that correctly and excluded his wife.
I guess people would prefer to run along with this and try it themselves; is that true? Okay. To do that, you'll ideally have a Logfire account, which will allow you to do the managed variable stuff. You can use a free Logfire account; the free tier is extremely generous, so everything you're doing there is free. The other thing you need is obviously an API key to connect to the models. You have two choices there. In the code we'll basically be using Pydantic Gateway as a simple way of accessing all of the models, so you'll just see a gateway prefix in front of the model name. If you have your own OpenAI or Anthropic key and you want to use that, just remove the gateway prefix and it will work. If you want to use Pydantic Gateway so that everything just works, I'm going to generate an API key in Logfire and share it with you all, hence that channel, put a limit on it, and hope no one goes and uses it all up before we finish the workshop.
So, I'm in Logfire here, and I'm going to go into Gateway. If I had set this up more formally, we could invite you all, and you could each generate your own keys and have your own limits. I have not prepped that, so I'm just going to generate one key with a $1,000 limit, hope that does the job today, and delete it after the workshop. I'll copy that into this channel here.
So, you should just be able to export that environment variable, and then when you run your code, for example re-running task 2, you should get an output. The other thing you'll need to do is set up Logfire for when you run the evals. The simplest way of doing that: you'll be in this directory (note we're in the subdirectory for this talk), and you'll run `uv sync`. Then, to connect it to your project (you'll see my project is called demo), run `uv run logfire projects use demo`, demo being the name of the project that I want to connect it to. That will connect Logfire in this directory to the demo project, so when I run stuff it will record traces to Logfire and use the managed variables from there. I'll wait a couple of minutes, because I'm sure people want to get set up and get that very simple hello-world thing working, and then we can move on. Anyone got any questions while we're getting that set up?
>> What does gateway mean?
>> So, AI gateways: we have one, but we're by no means the only ones. The idea is that you can make requests to most models through our gateway with one API key. And in particular we have observability, so if you're using the production version you can see requests that your team are making, or that your agents are making. We have things like caching and fallback, so if one model fails we can fall back to another one, etc. But in this context, it just means you can have one API key and connect to Anthropic or OpenAI or Groq or Gemini models.
>> So Gateway is your own?
>> Yes. But as I said, you don't have to use it here; it might just make things easier to get all of the models working. I'll wait a couple of minutes, but any questions, or should I go on?
>> Sorry.
>> So, the repo itself is the pydantic talks repo on GitHub, and the directory is the 2026 AI Engineer one. Have you logged in with Logfire before?
>> No.
>> Okay, so you'll need to do that so that you have the machine key. I think it's `uv run logfire auth`. Sorry, I've done that already.
So, I'm going to keep going just to keep things moving, unless anyone has any questions. We now have the very basic case working. What we want to do is run an eval to see how our model, with our prompt, is performing relative to our golden data set. If we go into evals here, you'll see at the top we have load dataset, which is going to create our data set: the object that we can then call evaluate on. This is going to load the cases, basically creating one case for each of the MPs that we're going to run, and ultimately return this Dataset object. And we're going to register one custom evaluator here, which in turn generates a whole bunch of metrics and assertions for how accurate this run is. You can go and read through the logic; I'm not going to go through each individual bit, and I'm not claiming these are the perfect set of evaluators, but they're a useful start. The point is, if you look through the docs, we have a bunch of pre-built evaluators like LLM-as-a-judge, and then we allow you to define your own. Generally, defining your own is far better than LLM-as-a-judge, because LLM-as-a-judge is effectively (I don't know if it's politically acceptable) the lunatics running the asylum, and it can lead to slight issues. So if you can have a deterministic eval like this, where we're comparing the result against a golden data set, that's much better.
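For a flavor of what a deterministic custom evaluator looks like with pydantic-evals, here's a minimal sketch; the repo's real evaluator produces several metrics and assertions, and `golden` is a hypothetical mapping from MP id to the golden relations, so this just shows the shape:

```python
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

# Minimal sketch: score a run by overlap with the golden relations.
@dataclass
class GoldenAccuracy(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> float:
        got = {(r.name, r.relation) for r in ctx.output}
        want = {(r.name, r.relation) for r in ctx.expected_output}
        return len(got & want) / max(len(got | want), 1)

# One case per MP; inputs here are MP ids.
dataset = Dataset(
    cases=[Case(name='stephen-kinnock', inputs=2, expected_output=golden[2])],
    evaluators=[GoldenAccuracy()],
)
```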
And if you look in main, where we're going to run this: ultimately main is really just a CLI. We set up a bunch of CLI cases, and in the case of eval we run the run evaluation function. If you look in there, we print our banner stuff and then call this evaluate function. We're using Pydantic AI's override functionality to override the prompt that's set by default with the custom prompt we're evaluating, and the model we're evaluating (we're not actually going to change model in this case). So ultimately we're calling dataset evaluate, which takes the actual function that we're going to perform the evaluation on, and that will run all of our cases in parallel with a max concurrency of five, which you will see in a minute. Then we give the run a name, so that we can look at it in Logfire.
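A sketch of what that evaluation call might look like; whether model and instructions can be overridden exactly like this depends on your Pydantic AI version, and `run_case` and `load_page_html` are hypothetical wrappers:

```python
# Hedged sketch: swap the model in for the duration of the eval, then
# run every case with a concurrency cap of five.
async def run_case(mp_id: int) -> list[PoliticalRelation]:
    return await extract_relations(load_page_html(mp_id))  # hypothetical helper

with agent.override(model='openai:gpt-4.1'):
    report = dataset.evaluate_sync(run_case, name='initial-prompt', max_concurrency=5)
report.print()
```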
Now, you'll also see here that we've set up Logfire at the top. "If token present" just means it will run without causing an error if you don't have the token set. We set an environment and a service name. We're not going to print to the console, because in these complex cases it's actually quite tricky to see what's going on in the terminal; it doesn't really help to have Logfire printing there too. And we've switched off scrubbing, because occasionally something like the word "password" will appear in an output, and we're happy for that to go to Logfire. Then we instrument Pydantic AI. We don't actually need instrument print yet, but when we get to using GEPA, instrumenting print is just a nice way of getting the output from GEPA into Logfire as well.
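The Logfire setup described here is roughly the following; these configure options exist in current logfire releases, but the service and environment names are illustrative, so check the docs for the version you're on:

```python
import logfire

logfire.configure(
    send_to_logfire='if-token-present',  # run without error when no token is set
    environment='dev',                   # illustrative environment name
    service_name='mp-relations',         # illustrative service name
    console=False,                       # don't mirror spans to the terminal
    scrubbing=False,                     # e.g. let the word "password" through
)
logfire.instrument_pydantic_ai()
```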
As you may have realized, help.md has some useful examples of what we're going to run. So we're going to run `uv run main.py` with the eval argument and a split: we're splitting based on the test set that we're running, and the prompt is the initial one, the very simple one-liner description of what the agent should do, rather than the more complex prompt. If we run this eval, you won't see that much in the terminal because we switched off console output, but you will see it running through the cases. And if we look in Logfire, you can see those cases running and their costs, etc. We can dive into what an individual case looks like. If you look here, we're running the agent; that's probably quite small, but I'll try to zoom in to let you see what's happening. Here we have the system input, which again is the simple instructions we talked about; then we have the input, which is the contents of the Wikipedia page as text; then we have the final output, and in this particular case it found no relations.
That has now finished, and as well as this view of the individual cases, we can go into evals in Logfire, where you'll see this data set, and within it the most recent run from one minute ago. If we go in and look at that, we get a view of what's going on. Again, this is pretty small on the big screen, I guess, but it'll look much clearer on your screen. The most relevant metric is accuracy, where you'll see ones where it's correct and zeros where it's incorrect, and a few lower values where the output from the golden data set is slightly different from the output of the model we're running here. You see here this guy has a father who was something to do with UKIP; it's got his name correctly, but there's a slight difference in the description, hence the 0.9 score. And you can see the overall performance of 85%: 85% of the time it got all of the stuff right. That's not particularly good. It's also not particularly interesting until we compare it to running with a better prompt. So, how many people are up to being able to run that eval? A few people. Okay, I'll wait a couple of minutes; we're 20 minutes in, and we've got quite a lot of time.
>> Can you show the command again, please?
>> Yeah, it's this command; if you look in help.md, we're basically running the eval command.
>> It ran really quickly for me.
>> I think it should take about 30 seconds, so I don't know whether you're on different internet or changed the model.
>> Ah, yeah, that will be much slower than what I think we're using.
>> Well, Anthropic was down this morning when I was trying to use it a lot, so it's being really, really slow, and GPT-4.1 is flying because not many people are using it, and it's doing a reasonable job. That's partly why I'm using one of those models: to show the eval performance, but also, bluntly, to get the answers more quickly.
While people are getting there, I'm just going to run the next one, which is compare. If you look at what it's doing, it's basically just running the eval twice, with the bad prompt and the good prompt, and then we'll be able to compare those outputs in Logfire and see where the differences are coming from. After that we can go on to the full GEPA optimization step. I'll just run the compare, which should take a little more time, but should still hopefully be relatively quick with these models. You see we're just running the 65 test cases here, rather than the full 650 MPs we'll use when we get to the full optimization.
>> If you accidentally run the same twice?
>> You'll just create a new run: you'll have more spans in Logfire, and you'll get another run in the Logfire evals view. You can archive them when you've got confused by how many different ones you've got. So, if you come over here to evals, into that most recent data set, you'll see I've now got a bunch; I could archive the older ones. Well, do it the other way around, just to keep things clear.
So I've now got two new runs here, two new experiments, and I'm going to select them and compare. Now we have a bunch of different metrics, with accuracy ultimately being the most important one here: slightly higher, 92% versus 87%, for the expert prompt versus the initial prompt. And we have the possibly slightly useless graph of different metrics and how they've performed; with fewer metrics or assertions this can be more useful. What's probably most interesting, though, is to look into individual cases where the performance differs. You see here, with this MP Joe White, that the initial prompt found one match, relation spouse, which obviously it should not have found, because we're not looking for spouses, whereas the expert prompt correctly ignored that one. And if you look at this case, Laura Kyle, you'll see that the expected output is no matches at all, but both of these most recent runs found a match, because she had some four-times-great-grandfather who was the governor of Hudson Bay, which is not a politician but is a public figure. So you can start to see where the differences are coming from, and the kinds of things you might want to do in the prompt to improve the performance. That's ultimately the kind of data GEPA is going to use to inform generating the next Pareto-frontier prompt for it to go and try.
>> I didn't manage to see the graphs.
>> You will see the graphs only if you're comparing two runs; if you open a single run, it doesn't show the graphs. So I've run compare: once I had run eval, I then ran compare with this command here. And then here I've taken two runs.
>> Oh, sorry.
>> And then you hit compare, and you see the graphs comparing the performance.
>> So, if you look in task.py, there are two prompts as a kind of starting point. One is a very simple one-liner, and the second is a bit more detailed description of the kinds of things you should be looking for: still not optimized, but what you might get if you, as a human, wrote out a decent prompt, or asked a model to write you a better one. You can see we've improved the performance from roughly 87% to 92% with this model by doing that, and with GEPA we should be able to do better; I think I got 96% when I was running it earlier. By no means perfect, but there are real examples of using GEPA. What people generally do is hold one variable constant, as in "we need the following quality, and now we want to reduce time or cost", or "we need to drive up the quality". There was an example from Shopify using GEPA. They were looking at Shopify sites for things like whether they were fraudulent, or which tax category they fell into. They switched from basically giving the entire website to GPT-5 and saying "what is this?" to using an agent, a Qwen model, and GEPA to optimize the prompt, and they got the price down from $5 million a year to around $73,000 a year, and improved performance over time. That's obviously not just optimizing the prompt; it's some mixture of optimizing the prompt and moving to an agent, and so on. But in this case, we're just trying to improve performance whilst keeping the same relatively fast model, for the sake of the demo.
So now we've done the comparison, and I guess it's time to look into... does anyone want me to wait a bit longer, or shall I keep going? I'll take no one asking me to wait as a signal to keep going. So we now get to the GEPA phase of this. I'm going to re-enable that while I remember. Ultimately, if you look here at where we're going to run the optimize case, you'll see we have this run optimization function. It's going to load in the training and validation data sets, and error if they're not there. We're then going to create this GEPA adapter. GEPA is, I guess, the state of the art now in agent optimization. Lakshya, who created it, is a first-year PhD student at Berkeley. He is perhaps not the most experienced engineering-minded Python developer, and so GEPA doesn't do async that nicely, and it's not as type-safe as I would like it to be. I'm itching to either fix it for him or, bluntly, fork it, but I have so far not done so. GEPA has this concept of adapters, which are basically how you define the agent that is going to propose the next candidate. So we have this function here to create the adapter, which in turn generates this adapter here, which we've defined, and which subclasses their adapter type. It's a dataclass, so it takes a bunch of arguments, but ultimately the bit that matters is build proposer agent (their method name). What we do there is return a Pydantic AI agent in turn: we're using a Pydantic AI agent to propose new candidates to optimize a Pydantic AI agent. We're going to use GPT-4.1 again for the sake of speed. There is a slightly annoying characteristic here: GEPA is sync, not async, and each time it runs the proposer agent, it runs it in a new async context, so we have to make sure we create a new HTTPX connection, otherwise we get a bunch of errors; hence this slightly weird formulation. Here we again have a prompt. Obviously, you can get a bit "going around in circles" on optimizing the prompt of optimizing the prompt, but we're basically saying: you're an expert prompt engineer, improve the system prompt for this agent, yada yada yada; here are the things to consider. And then we call the ultimate evaluate function, which generates the bunch of cases, generates a Pydantic AI eval, runs that eval, and that's what we use to test the performance of our new proposed prompt.
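Roughly, the adapter shape is this; the class and method names follow my reading of GEPA's adapter protocol plus what's described in the talk, so treat the import path and signatures as assumptions:

```python
from dataclasses import dataclass

from gepa.core.adapter import EvaluationBatch, GEPAAdapter

# Hedged sketch of the adapter: run the Pydantic AI eval with the
# candidate's system prompt, and hand GEPA one float score per case.
@dataclass
class MPRelationsAdapter(GEPAAdapter):
    def evaluate(self, batch, candidate, capture_traces=False) -> EvaluationBatch:
        # run_pydantic_evals is a hypothetical helper wrapping the same
        # dataset.evaluate call used for the manual evals above.
        report = run_pydantic_evals(candidate['system_prompt'], batch)
        return EvaluationBatch(
            outputs=report.outputs,
            scores=report.scores,        # list[float], one per case
            trajectories=report.traces,  # the steps each run took
        )
```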
This will probably make some more sense when I actually go and run it. We then apply skills for the failures, and we return this evaluation batch, which is a summary of how the candidate performed; that's what GEPA then uses to decide whether this is on a new Pareto frontier, or whether to propose a new prompt.
So I think the simplest thing to do is just go and run this. It will take a little bit longer and cost a little bit more, but that's fine. If you run it with 50 calls, it basically dies before it gets anywhere, but if I run this with 400 calls, you'll see GEPA has a similar kind of progress bar. As I said, I've switched off Logfire printing so that you don't see what's going on in the background here, because it basically just becomes impossible to view. But if you look inside Logfire, you'll see this optimization already going on. You can see that we evaluate the agent, and then we call the proposer agent here. The proposer agent, if you look at this step, is what we just had: the system prompt I showed for the proposer agent,
and then the input, which is basically GEPA's description of the context, to enable the proposer agent to come up with a new prompt. From there, the proposer agent proposes a new system prompt; we then evaluate that, see how it performs, and iterate towards finding a better solution. So that's what's going on here. You can see why I did the instrument print: so we can get the print output from GEPA nicely viewable inside Logfire. There may be a neater way of doing this, but it worked for me this morning. It's going to work through; you can see it's a little bit more expensive, we've already spent $2, so I guess if everyone runs it we will spend a few hundred dollars. OpenAI will no doubt appreciate it. You can see we're already about 50% of the way through the optimization, and it's proposing these new prompts, evaluating them, and using the feedback from the evaluation to decide where to go next. I would encourage you, if you have some time, to go and read through these traces, because they're quite interesting; I think they're quite illuminating in terms of how the optimization is working. One thing to note: people love to say that anything to do with AI is incredibly sophisticated and complicated and advanced. This optimization technique, whilst the state of the art, is not actually that groundbreaking relative to the complexity of the model itself. It's a relatively crude loop: ask an agent to generate a new prompt; if it does better, take bits of that and put them into a new prompt; keep doing that until you either run out of budget or get to some prescribed score.
>> Yeah.
>> Questions?
>> How big are the batches in the process?
>> I think you can see it here where it's running the evals. I think it is running all of the cases in the test data set, but I think I've seen it run fewer, so GEPA may be clever enough to alter the number of cases based on how they've performed; I'm honestly not that sure.
>> And in terms of scale, what was the question?
>> I was thinking about how to use this at scale.
>> I think we will make this much easier, open source. We'll also make it something where you just plug in Logfire and it just happens in the background. But we also want to make it super easy to do locally or open source; we don't just want to say "oh, you must plug it into Logfire to get optimization".
>> Right now you just set the max calls, right? Is there a smarter way to do this? Like, if you don't improve for 10 iterations, just stop?
>> I'm sure there is; I haven't looked into it. Like I say, I did this in a bit of a hurry. Totally. And presumably, for the most part, you're saying "reach this threshold", or, as you say, run out of optimization budget, whatever it might be.
>> I don't quite understand how you see the prompt optimization.
>> So, if you look here in the proposer agent run, you'll see the input to the agent that is proposing a new prompt: it's being told, based on this information, propose a new prompt. The principle of the Pareto frontier is that this contains components of previous prompts that have done well.
>> It's not like a diff; it's a whole new one?
>> Yes, it is proposing a whole new prompt. But GEPA operates on the idea of a kind of object, a dict of known keys, and here we only have basically one key-value pair, e.g. "system prompt" and its value. If you have multiple different keys, then it can optimize by combining the best values from multiple different keys. You could imagine model and system prompt, or model and some set of tools, or even different lines from a system prompt, where the proposal would just be choosing different values from your set of lines. That's the standard way of doing this in DSPy: basically, take all of my different examples and choose which ones to include in the prompt.
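As a sketch, a multi-component candidate might look like this; the keys are illustrative, not what the repo does:

```python
# With several named components, GEPA can mix and match the best value
# it has found for each key across candidates on the Pareto frontier.
seed_candidate = {
    'system_prompt': 'Extract the political relations of this MP...',
    'model': 'openai:gpt-4.1',
    'examples': 'Example 1: ...\nExample 2: ...',
}
```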
>> DSPy?
>> DSPy is another agent framework, somewhat similar to Pydantic AI. It comes from a very machine learning background. I find it hideous to use because it's not type-safe, but the caliber of people using it is generally very high. And it has had this concept of optimization for a long time; for a long time it was basically the only agent framework that had this idea. Now GEPA has come along, and it's available within DSPy, but not only available within DSPy.
>> What's your personal take on GEPA and these types of algorithms? From your personal experience, have they worked well in practice? Have you used more home-built optimizers? Curious on your informal take.
>> So I think that in the case where you're trying to get a dumber model, or a faster model, or a cheaper model to be able to do some task, they make a lot of sense. If you take a state-of-the-art model like Opus 4.6 and you ask it most questions, if it has all the information it needs, it will just go and get it right, for the most part, and so the optimization is slightly less relevant. I think one of the subtleties here is the statistic that 98% of data is private. I don't know if that number is correct or not, but let's say it's off by miles and it's 50/50. When you have a private set of data that the models have not been trained on, say some massive internal spec for how you're supposed to operate as a bank, adding the right bits of context into the system prompt or into the instructions is incredibly valuable, and the optimization really matters. The problem we have when we're trying to give demos like this is that, by definition, there isn't private data we can use, because if someone lets us use their private data, they're not saying "oh yeah, can we go and talk about it at AI Engineer?". So I've got this slightly weird example of MPs, but actually, you'll see in the cases I'm going to show you in a minute, the model sometimes just goes and figures out who the MP for some borough is even if it hasn't looked at the data, because this is all public data that it knows. So I think one of the subtleties is that optimization matters way more where you have large amounts of private data. And honestly, we're still trying to figure this out as a company. One option for us is to use data that is public, but new enough that it's not in the training data. One option would be news, but that could get slightly politically dicey if we were using current news as our source of stuff to optimize on. Another option is what's going on in sports, because whether you find it interesting or not, at least it's not controversial: we would ask which footballer has done well in this Premier League season over the last five matches. That might be public data, but equivalent to private in the sense that it does not exist within the model. Still, it's not a really good example; it's something we're trying to figure out.
>> Yes, I have a question regarding GEPA. Does it...
>> I think your mic's not working, but if you just speak up, I think we can keep going.
>> I'm just trying to understand if there is a way it prioritizes optimizing and adjusting the system prompt, rather than appending to it, because I feel like that's a recurrent issue: you find new edge cases and you keep appending and adding more information.
>> So does it strike that balance of "this we append, and this we adjust"? I was using GPT-5 mini this morning, and one of the problems was that (a) it was slower in general, but also it was just producing this enormous system prompt, which then meant that the agent was slower. I don't know if it did better or not; I think it did roughly similarly well. So that definitely is a problem. I think the solution to that is the thing I was talking about earlier: ultimately GEPA wants a key-value store, a dict of keys and values. Instead of the proposer agent being told "go and generate me a new system prompt", you'd basically say "choose a new subset of our list of different available inputs". You would go and split your system prompt up into, say, 200 sentences and ask which are the best 20 sentences to use; now, by definition, if it can only choose 20 sentences, it's not going to get verbose. In this case, we're using a Pydantic AI agent to summon up a whole new system prompt, which is probably better at finding the edge cases, but does result in a verbose prompt.
So, in fact, on that very point: it has now finished. It has given us this optimal prompt, achieving a performance score of 96.7%, significantly above the 92% we got from our best case before, at the expense of a relatively verbose prompt. You can see it's gone through in relative detail, and in particular it's banged on about what relations to use, when to include a relation, and what sorts of things count as a relation.
Okay, so that is GEPA working. The other bit I was going to talk about is managed variables, so I will now switch on to those. But before we go on to that, any other questions on this bit of the talk, or on GEPA?
>> Yep.
>> This is just me being difficult, but you've got the validation scores compared to a golden set.
>> Do you want to take that?
>> The validation scores are compared to a golden set, right?
>> Yep.
>> Is the golden set in this repository? It is here, right?
>> Yep. So, if you look in the cases directory, you will see golden relations, a JSON file with the supposedly golden set. Now, I'm sure it's not 100% perfect; as I say, we generated it with Opus 4.6 using a similar script.
>> I just had a silly thought: if you didn't have that golden set, then things get quite difficult, don't they?
>> Yeah. So evals are much easier if you have some golden reference for what it's supposed to be doing. In general, what people end up doing is having some subset of data that's been human-annotated; that works. But you can also have cases where, for example, if you have an agent that's writing code, you go and run the code and check whether or not it fails. Or you can have a full feedback loop: basically returning that data to the original source and seeing how accurate it is, something like that. Working out what your judge is, is always the hard bit of evals.
>> But just for your example where you're running the code: that would be binary, it either runs or it doesn't, right? How do you do a percentage then? I'm a little bit confused there.
>> Well, you could check, for example, whether or not it was going to generate invalid code, or use libraries that weren't available; that would be a first step for how performance is going. But yes, this is the hard bit of evals: what is "right"? And as models get more intelligent, figuring out what right looks like gets harder and harder. The ultimate case is, say, an agent whose job it is to persuade people to stop smoking. The ultimate eval is to wait 40 years and see when they died, but obviously you can't have an eval where you wait 40 years to see if they died. So your eval is probably: does it include the correct words; does it not include some bad words that you definitely don't want it to suggest, like, you know, "take up a cigar instead". So "text does not include cigar" would perhaps be one of your evals. That sounds dumb, but that is what an awful lot of evals look like.
>> Can I? Sorry, two questions. One is a very short one: with Pydantic AI you kind of break up the structure of the prompt a lot of the time. Does this just jam everything together, so you just get one prompt at the end?
>> So you always have one set of instructions, yeah. It is building one set of instructions.
>> Right, okay, that's fine. So the real question was: variance is huge with this. I mean, you showed your results; I ran it, I got different results. Is there a way of having the variance be part of the actual optimization?
>> I wouldn't say that, exactly. When you run evals (I'm just looking for where we call run) you can basically set the number of times you run each case, and the best way of reducing variance is to run them lots and lots of times. I know a bunch of hedge funds who are spending about $20,000 a night running their whole set of evals to see if they can improve over time. And if you have your own GPUs, you're like, well, it's sort of free to go and run them, so we should run them more and more times: run everything 100 times and look in the morning. It's definitely one of the challenges.
>> And then do you create a cost function, or can you point it at, like, "I want you to optimize both accuracy and minimize variance"? Do you write your own cost function then?
>> Yeah. So, if you look in the adapter code here, you will see that evaluate is returning this evaluation batch type, where scores is a list of floats, so ultimately you can return whatever floats you want in there.
>> Okay.
>> And just in case people are wondering, trajectories are effectively similar to traces: the sequence of different steps that agents went through to get to a particular point.
>> So, how do you handle systematic errors? I ran the optimization and it got a high accuracy value of 96%, but it got aunts and uncles wrong: the final prompt explicitly excludes aunts and uncles, even though they are in the golden relations. So would you suggest running... and, like, Claude Code spotted this immediately.
>> Yeah. So maybe one could have a harness where there is another check at the end. I think what's happening is it's using a small set of test cases and then a bigger set of evaluation cases, and the subset of cases that the optimizer agent gets exposed to maybe doesn't include an uncle or aunt, so it just made it simpler to exclude them. So one of the other things you end up needing is at least 2x the amount of data that covers the whole space, so that you can have one full set of data that you train on and one full set that you evaluate on. This is the classic problem of machine learning: we need an awful lot of data to build something that's really reliable, because ultimately you're already getting to the point with the agent where you're kind of encouraging it to overfit, and I suspect the uncle-and-aunt thing is it overfitting.
>> Okay, thanks.
>> There's a question behind you.
>> This is probably a very silly question, but surely the prompt only works with the particular model that you tuned it against?
>> That is people's take: you would need to run this all over again if you changed your model.
>> That's going to happen all the time, isn't it?
>> Yep, and that's one of the reasons evals are hard, and one of the reasons most people don't run evals and don't run optimization. They write out a decent prompt, they ask their coding agent of choice "does this prompt look good?"; if it says yes, they kind of eyeball it, put it into prod, and worry about other things, because in many cases the model that comes out in a month's time is going to supersede whatever optimization you did. Now, if you are a private equity firm with 200 million invoices from across your portfolio that you want to analyze, it's worth optimizing for a particular version of a small Qwen model to do that job, instead of just throwing GPT-5 at it, because the difference is like $10 million, and it costs you one analyst who earns half a million dollars, put on it for three weeks, and you've solved it. So it depends on the question you're trying to solve. The ultimate example of this is the coding agents, where the trajectories are extremely sparse, as in there's an extremely wide range of things they're doing, and for the most part we're told that those companies don't have very many evals, or don't have any evals at all. If you go and ask Boris Cherny how he works on Claude Code, he says mostly vibes: mostly we just see what works and tweak it a bit.
So I'm not claiming this is a panacea in all cases, but there are definitely situations where being able to optimize your prompt, or your agent choice, is valuable. The other thing, which we don't have a demo of here, is that there's an awful lot more than just the prompt you can go and optimize. There's the model; there are things like the compaction strategy; there's how you register tools; there are things like including code mode. The next step we want to allow is basically for you to optimize across the full range of things you can do within your agent, to find the optimal choice, where perhaps the particular model choice doesn't matter so much.
And it just makes me think: you evaluate the prompt here, but could we just switch a bit and start evaluating different models, or would that be silly?
>> Yes, that is something you could totally go and do: try different models. Given that there are maybe 10 models you want to try, you'd probably just run all 10 and see which one performs best, and pick that one, or whichever one optimizes for your particular mix of performance, price, time, etc. I think there was another question behind you.
>> Yeah, thanks, I've already got the mic. So, what you've shown is a task which is relatively narrow in what should be achieved, and I'm wondering how you would approach it if the task is more open-ended.
>> My suspicion is that that is a case where this optimization technique is harder to do, because, like I say, in the coding agent case you could imagine you end up with 5,000 distinct tests for a coding agent, you optimize as much as possible on those 5,000 cases, and it turns out that they are a drop in the ocean of the different things people do, and now your agent is over-optimized for your 5,000 cases. I think the bigger the breadth of the task your agent is performing, the harder it is to do this, and the more you end up just wanting a clever agent.
>> Thanks.
>> The other thing to say is my point from earlier: where you have large amounts of private data, and so you have to put masses of context into the system prompt for understanding your specific domain, then choosing the right examples from that to put in, ones which basically cover the full data set, is more and more relevant. Did you have another question? Thank you.
>> So, if I understand correctly, if you're a company and you would use this GEPA optimization technique, and you had more money and maybe more expertise, you could decide to fine-tune a model. So they are kind of competing strategies?
>> Correct. And in answer to your previous question about "isn't it made obsolete by the next model?": the main reason the big model labs say don't bother fine-tuning is that there you really are spending tens of thousands of dollars to go and fine-tune a model, and actually, for the most part, improving the harness, waiting for the next model, or showing a nicer loading icon to your users is probably better than fine-tuning. But that misses the cases, particularly in places like finance, where they have enormous numbers of runs, where they really do care about that optimization and fine-tuning applies.
>> Thanks.
So, I'm going to go on to talk about managed variables. As I said earlier, this is not going to be a particularly sophisticated answer I've got here, but I'm going to show the example just to explain how it works. We have here a very simple FastAPI web server. You will see that it has ultimately two endpoints: one returns some HTML, and the other is a form submission endpoint. If I go and run this... where did I have it running before? Here we are; it was in fact running. If you want to run it, you'll probably want to use this command here; I'll put that into Slack as well, in case anyone wants to run it. If you run this web server, you'll see Logfire output, which is not so interesting yet, but you will also see the server running locally. This is just a very simple form where you can ask questions about the same bunch of MP data. I can go and ask a question like (we'll see how well this performs) Hammersmith, where I live. It will run for quite a long time, because what it's got in the background is basically one tool which lets it grep through the HTML pages, though it has replied relatively quickly on this occasion. And if I load this here, you will see the HTTP request coming in in Logfire, and then you'll see the agent run: it ran a rather good sequence of searches, got back the data, and was able to correctly identify Andy Slaughter as the MP for Hammersmith, where I live. All very well; but look at how this code is defined.
We have here this Logfire variable, and this allows us to change the behavior of this agent without redeploying. Now, obviously, redeploying in this particular case, where I'm running locally, is very simple. But where you've got some big production CI stack to run and deployment takes hours, being able to change variables in production, or in staging, very quickly is extremely valuable. The way that Logfire variables work is that we have a type here, a Pydantic model, which in turn has the instructions and the model (both strings) and the max tokens, which is an integer. Then we've given it a default value: some very simple prompt, and we've chosen the model and the temperature.
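A sketch of what declaring that variable might look like; I'm not showing the repo's exact code, and the Logfire variables API may differ from this shape, so treat the names and signatures as assumptions:

```python
import logfire
from pydantic import BaseModel

# Hedged sketch: a managed variable whose value is a whole Pydantic
# model, not just a prompt string.
class MPSearchConfig(BaseModel):
    instructions: str
    model: str
    max_tokens: int

mp_search_config = logfire.variable(  # assumed API shape, check the docs
    name='mp_search_config',
    type=MPSearchConfig,
    default=MPSearchConfig(
        instructions='Answer questions about UK MPs...',
        model='gateway/anthropic:claude-sonnet-4-5',  # illustrative model name
        max_tokens=4096,
    ),
)

# At request time the server reads the current value, so edits and
# targeting changes in Logfire take effect without a redeploy.
config = mp_search_config.get()
```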
What I have run here is `uv run web`, and then there's basically a CLI command for this which just pushes the variable; ultimately we're just calling logfire push variables. When I run that, I'll get asked whether I want to overwrite the variable, which I don't; but if you're running it for the first time, you won't be asked, and you'll create a new variable. Now, it's worth saying that we need to make the experience of using these variables slightly easier: at the moment they're not configured with the `logfire projects use` command that you ran before, so you need to go and set a separate API key to use variables. That will be solved soon, but if you want to do it today, the way to do that is to go into Settings, then API keys, within Logfire. You see I've created one; I could show you the form. I can create a new API key, and I just need to give it the three variables permissions. Well, or whichever ones you want. We're ultimately using that API key to push the variables from local development, from the code definition, into Logfire, and then pulling them to update the values.
And because I have pushed that, if I go into managed variables here, you will see I have two variables set up. This MP search config is the one we were looking at; you see we have those three fields that we've set, and some history of updates. In particular, what we can do under targeting is define what percentage of calls are using which of our different values. This is very much like A/B testing; in fact, under the hood, our managed variables use the OpenFeature open standard for doing this, so in theory you can connect to Logfire with anything that speaks OpenFeature.
but yes so at the moment we're using the
code default. So you saw when I ran this
now uh I got back this answer answer of
Andy Slaughter. But now let's go and
update the value here. Um I want to go
and edit
and I'm going to leave those values the
same and I'm going to say uh reply in
French to make it very clear what's
happened. Save that.
And I'm going to go into targeting and
I'm going to dial up what use the use of
latest to 100%. So it should now use
uh that latest value. And now if I um
come in here and I ask for example the
same question again
...and I stopped the server, which was silly. But you can imagine the server still running: it pulls the new variable each time that function is called. And if the demo gods are with me and everything is hooked up correctly... it has indeed replied in French. So we've updated the server's behavior without redeploying. Obviously, I did in fact restart the server, but I could set it up not to need that. So, just to prove that this works,
let me come back over here, edit this again, and say "reply in German". And... oh, that's very annoying. And now, if I ask the same question again, the variable that defines the system prompt should be updated.
It's taking its sweet time, but it is indeed replying in German. The nice thing here, coming back to your point about whether it's just a system prompt: we can set our variables up to be not just simple text values but, for example, a Pydantic model, which lets us edit multiple different fields. So I can switch here from Anthropic Claude Sonnet 4.5, via the gateway, to OpenAI GPT-4.1.
I'll get rid of that because it's going to confuse me. Save that: GPT-4.1.
And the point is that we can update multiple different values. So if I instead ask it (hopefully it won't bother going off and running the tool), it says it's now ChatGPT rather than Anthropic. So we can use this to change variables and ultimately experiment with how new prompts, new models, temperature, or whatever it might be, behave in production. Now, of course, the ultimate aim here is to take this whole system and wire it up so that, instead of us having to tweak the managed variables manually, we get something like self-driving for managed variables: given some set of evals, it can optimize, hill-climbing towards some peak, without us having to do anything. That's the feature we're working on. But I can prototype what that might look like here. So let me just reload this.
You'll see in managed variables that I actually have two variables here. The other one is MP relations instructions, which is the instructions for the task agent we were running before. You'll see we don't have any values in it yet. But what we have in our code, in the web server here, is that we register two different tools with this main agent: one is the MP search, which it was using before to find out who the MP for Hammersmith was, but it also has this extract political relations tool, which runs the same extraction logic and (I'm just looking through... yes) ultimately calls that same extract relations function that we were optimizing earlier.
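In Pydantic AI terms, the wiring described here looks roughly like this; the names are taken from the demo, not from real source code:

```python
from pydantic_ai import Agent

# Sketch of the demo's wiring: one main agent with two tools.
main_agent = Agent('anthropic:claude-sonnet-4-5')

@main_agent.tool_plain
async def mp_search(constituency: str) -> str:
    """Find the MP for a given constituency."""
    ...  # lookup logic elided

@main_agent.tool_plain
async def extract_political_relations(politician: str) -> str:
    """Run the same extract_relations logic that was optimized earlier."""
    ...  # calls the shared extract_relations function
```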
So we can now use this agent: it can be asked who the political relations of a politician are, and with a bit of luck it will use the tool to do the calculation. At the moment it will use the simple dumb prompt, but the point is that we can alter that prompt based on the optimized value we got, and thereby apply our new state-of-the-art prompt without having to redeploy. So, if I come over here,
the first thing I'm actually going to do is dial that previous variable back down to zero, because we want the agent to be told correctly what to do. So I'll put that to zero. And just to prove that something here is working, I'm going to ask: who are the political relations of... what was he called? Stephen Kinnock, because I know he does have some.
I'm going to run this case, and we can look in Logfire in a minute to see what it's done; but hopefully it has used that tool as the correct way of identifying who the relations are. So yes, it's found them correctly. But more importantly, if we look in Logfire here, you'll see that most recent call.
You can see that the search agent run called the extract political relations tool with the input Stephen Kinnock, which is correct, and it has then correctly output the value. If you look inside the tool call, we then ran the other agent, nested; it ran and correctly extracted the right data. So we're now using our agent, and we can now update the value and improve the prompt. So I can come over here and, to be fair, I'm not going to have a particularly good way of showing you that it's working better, because it seems to be working pretty well, unless I can find an example where it wasn't. So let me actually try; I've got a bit of time. Let me see if I can find an eval case where the dumb prompt was doing badly. Oh, we've got loads of runs now.
Sorry, I want this case here, and I want to compare these two. So if you look, for example, this person here, Paul Maskey, it was getting it wrong; the dumb prompt was... no, wrong way around. I want to find a case where it's done badly, where I can get it to do better.
Okay, let's use Andrew Gwynne, where the correct answer is no political relations, but the two prompts both suggested that he did have relations. And I'm sure that's because, if we look, yes: there's a sports commentator, his father, whom the dumb prompt wrongly suggested was a political relation. So if I run this now, with a bit of luck, this is using the dumb (or poor) prompt, so it should find those relations; and then, conversely, if I update it to the correct prompt, it shouldn't. Yes: it's identified the sports commentator as a political relation.
And now, without redeploying, I can come over to managed variables and update the prompt we're using. I think I may have a mistake here. Let me... you didn't see me just go and tweak the code... I'm going to tweak the code to use the version of the task where we're using the instructions managed variable. Let me restart the server.
And now, without redeploying, I promise you, I'm going to create a new value here, and instead of the simple prompt I'm going to put in the best one we got back from running GEPA. Save that. And without further ado, without redeploying, I promise, we run that again, and it'll almost certainly go wrong, because that's what things do. But we can hope...
And it's gone wrong: it's again identified the relation, unfortunately. But hopefully you get the principle here. Yes, there's a question.
>> This web interface, where you search and get a response from the... what do you call this part?
>> Well, the form, and then the agent I'm submitting to.
>> Yeah.
>> Yeah. Do you have any thoughts about collecting feedback from users, like...
>> Yeah.
>> My number one piece of feedback is that no one clicks them. And so there's a... well, maybe there's a mic, but... also, you could be nefarious, like myself, and just click "bad" even though it was fine.
>> So, my experience is that if you're going to do it, you do only thumbs up and thumbs down. We have an annotation system, which I'm not demoing today, where you can record annotations against prompts; that is another place to build your golden dataset from, or at least a starting point for it. But the best thing you can possibly do is something implicit in the user's interaction. If you have something more advanced than this simple text prompt, if you have a chat, the single best way of evaluating the performance of a given question-response pair is to look at what the user did next. Because if the user says, "You idiot, try again," you have pretty strong evidence that the agent behaved badly. And if the user says, "Thanks, that's great," or just goes away, you assume it was right. In practice it's probably not as extreme as "you idiot, try again"; it's probably more like, "No, I meant X."
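As a sketch of that implicit-feedback idea, you could classify the user's next message with a small judge; the prompt, labels, and model choice here are illustrative assumptions, not anything Logfire ships:

```python
from pydantic import BaseModel
from pydantic_ai import Agent

class ImplicitFeedback(BaseModel):
    satisfied: bool  # did the follow-up imply the previous answer was good?
    rationale: str

# A small judge agent that reads only the user's next message.
judge = Agent(
    'openai:gpt-4.1-mini',
    output_type=ImplicitFeedback,
    instructions=(
        'You are given the next message a user sent after an assistant '
        'answered them. Judge whether the user was satisfied with that answer.'
    ),
)

async def score_follow_up(next_user_message: str) -> ImplicitFeedback:
    result = await judge.run(next_user_message)
    return result.output
```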
And that's how Google works, right? Back in the old days when we used Google, the way they identified how good a website was was to see how long you spent on the site: if you came straight back, searched, and clicked on another link, they assumed the page was bad.
>> And this form here, is it a chat, or is it...?
>> Honestly, this is just me asking Claude, "go build me a single-pane chat." We have an integration with the Vercel AI protocol, so you can build a full chat interface on a Pydantic AI agent in a few lines of code, supposedly, which I would try to demo if I had a bit more time. But this is just me trying to have the simplest possible example to run. In fact, go on, I'll have one try at doing it now, since that one didn't work.
>> And a chat would mean that it kept its context, somewhat, right?
>> Yeah, yeah. So if I get rid of this single run, and instead I do... what's the agent called? Relations agent, dot, to web.
And then I go and run it. Let me kill that web server to avoid confusion. Why is that not running? Maybe I do need to call .run(). Maybe I'm forgetting exactly how to call it, but the principle is that with basically one line of code here, if I remember what it is, you can turn that into a proper chat interface.
Let me just see if the docs will tell me what it is quickly. So: I define the app, which gives you a Starlette app, and then I run "uv run uvicorn task:app"... and there's a typo. So, that should give me a nice interface now to go and ask questions, where I can say... I don't know if that's going to work, but we will try it.
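The pattern being attempted, as a sketch; the to_web() name is as recalled on stage, so treat the exact method as an assumption rather than the confirmed Pydantic AI API:

```python
# task.py: sketch of the pattern attempted above.
from pydantic_ai import Agent

relations_agent = Agent('anthropic:claude-sonnet-4-5')

# Expose the agent as an ASGI (Starlette) app.
# (to_web() is the method name as recalled on stage; the real API may differ.)
app = relations_agent.to_web()

# Then serve it with:
#   uv run uvicorn task:app
```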
But the point is that you would get back a final result; you could use it like an agent and then ask follow-up questions. I mean, this agent is designed to always reply with structured data, so it's not very chatty, but you get the idea. I'm going to stop there. Thank you very much. Happy to answer any questions afterwards.
>> Sorry. Did you have a question?
Go on.
>> Thank you. Can you hear me? Yeah.
>> Yep.
>> Do you have cool use cases internally, where you use your own stack, agents that you've built? Because I hear a lot of "look how many tokens we're burning", but... you mentioned the private equity firms earlier, and those are classic old problems, classifiers. Do you have internal use cases where you've been really impressed by agents, or things the team uses internally to ship faster or create more value for you or your clients?
So, there's an agent inside Logfire where you do a free-text search and it converts it to SQL, and we've optimized that agent a fair bit. We don't care particularly about token count in that case, because the volume isn't high enough; we do care about generating good SQL. The reason we can't use it as an example is that it very often ends up with private data inside the SQL, as in, someone will say "find me results for invoice 12345", and we don't want to show people that. It's one of the reasons we can't use that example in demos. I think there was another question. Does that answer the question?
>> Right. So in Logfire, is it possible to obfuscate or encrypt the input and the output and just evaluate against a golden set? Say you've set up an application working in a field with sensitive information, like medical, or...
>> So, you can choose not to send the system prompt or anything else, and just record, for example, the performance, like good or bad, and then you have the metrics. You can choose not to send the raw data; it's obviously less valuable. My somewhat biased answer would be that you can, of course, enterprise self-host Logfire and have it inside your VPC. That's probably what people do. But yes, you can do it without recording the actual input and output.
>> Great.
>> What people do in those cases, for example the big legal tech companies like Legora and Harvey, which are very strictly not allowed to exfiltrate the data because it is their clients' clients' very private data, is record categorical performance. So: good, bad, whatever; you could have fifty grades of how it performed. You can exfiltrate that data without exfiltrating any generated content that might include private data.
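A minimal sketch of recording only categorical outcomes with Logfire's standard logging call; the grade labels and helper are illustrative:

```python
import logfire

logfire.configure()

# Record only a categorical grade for the run (no prompt, no completion),
# so no generated content, and hence no private data, leaves the environment.
def record_grade(run_id: str, grade: str) -> None:
    logfire.info('agent run graded: {grade}', grade=grade, run_id=run_id)

record_grade('run-123', 'good')
```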
Any other questions? Yeah, there's one
here.
>> Thank you. This is a tiny bit of an internal question. I noticed that you're using dataclasses throughout.
>> Yep.
>> Is that for speed reasons, or any other reason?
>> I think they're just the canonical thing in Python. If you're not trying to perform validation, they're generally the canonical choice over Pydantic models. It makes no difference to performance in any of these cases: the Pydantic validation or dataclass construction time will be a vanishing share, you know, 0.001%, of the overall cost.
>> Thank you.
>> Thanks for the presentation, I had a quick question. Prompt optimization is obviously a type of context engineering, right? So I was wondering, have you done evals where you've combined it with something else? Because there's context degradation: even though context windows are so big, after five, or certainly twenty, calls, quality still drops off very quickly. So: combining prompt optimization with spinning up new LLM calls and fresh context restarts, as opposed to, say, compaction, which still degrades things as well.
>> The compaction strategy is definitely one of the things you want to optimize in a more complex case. Here, I think, we optimized for something that everyone can grok within a session rather than for the most complex use case. But yes, that's definitely relevant. For example, take the Shopify example of how they were analyzing sites with GPT-5: you could just give it the entire content of the website and say, "Hey, go figure out whether this is fraudulent." With a Qwen model, you couldn't do that; you needed to do something more agentic, where it performed a bunch of queries looking for certain terms. That is a variation on context engineering that allows a smaller model to perform the same task, and to be more deterministic.
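As a rough sketch of that smaller-model pattern: give the agent a narrow search tool over the page rather than the whole page in context. The model name and tool here are illustrative:

```python
from pydantic_ai import Agent

# Illustrative: a smaller model with a narrow search tool, instead of
# stuffing the whole site into its context window.
fraud_agent = Agent(
    'openai:gpt-4.1-mini',  # placeholder for a small or local model
    instructions='Decide whether this site looks fraudulent; use search_page.',
)

PAGE_TEXT = '...'  # full site content, held outside the prompt

@fraud_agent.tool_plain
def search_page(term: str, window: int = 200) -> str:
    """Return snippets of the page around each occurrence of `term`."""
    text = PAGE_TEXT.lower()
    snippets, start = [], 0
    while (i := text.find(term.lower(), start)) != -1:
        snippets.append(PAGE_TEXT[max(0, i - window): i + window])
        start = i + 1
    return '\n---\n'.join(snippets) or 'no matches'
```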
>> Great. Thank you.
>> Stop... apparently I have to stop. Thank you, everyone.