Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

Channel: aiDotEngineer

Published at: 2026-04-10

YouTube video id: X4dEHRzBLmc

Source: https://www.youtube.com/watch?v=X4dEHRzBLmc

Hello everyone, and welcome to my talk slash workshop, "Judge the Judge." Today we're going to talk about LLM-as-a-judge. I'm quite sure you know this scenario: you have an agent in production, and someone from the team says, "We need to monitor its reliability." So you go to one of the libraries, maybe grab the hallucination LLM-as-a-judge, put it in production inside your observability platform, and things look fine. But customers are actually saying the agent is not working, and when you look at the traces, it's not working. Now you look under the hood of this hallucination judge, and you'll find a prompt not very far from this one: "You will be given an LLM output. Rate whether it is a hallucination. Make no mistakes." Now, obviously, how would the LLM know whether something is a hallucination? If it could, your app would have worked from day one. So today we're going to talk about how to build a calibrated LLM-as-a-judge that works, where calibrated means calibrated against human annotation. And the way we're going to calibrate it is with optimization, prompt optimization specifically; we're going to use GEPA, quite a good algorithm for optimizing prompts.
Now, why do we want calibrated LLM judges, or good LLM judges, at all? The first reason is offline evaluation. As you know, the way you usually create a good agent or a good prompt is to experiment with the prompt, run your evals, and see whether it improves things or not. If it does, good; if it does not, you go back, improve the harness or the prompt a little, and do it again and again. The speed at which you move to production or add features is really the speed at which you can complete this loop, and the bottleneck in this loop is the evaluation: how fast can you evaluate? The slowest possible evaluation is having a human annotator look at your whole test set and annotate it manually. The quality is quite good, but each iteration takes a long time. You can go faster with an LLM as a judge, but if that judge does not correlate with human annotation, you end up with a useless signal: the loop moves fast but goes nowhere. So a calibrated LLM judge with quality similar to a human annotator makes your development much faster.

The second reason is online evals, like our example from the beginning. If you have an online eval and you want to see whether things in production are improving or not, it's the same story: if your LLM judges are calibrated with your business goals, you can quickly see whether the changes you've made are improvements, or whether there is some shift in the distribution of the data and in how people are interacting with your agent or model, and react rapidly.

And finally, above all of this, the holy grail of AI engineering is really to build the data flywheel: you optimize your harness, observe some traces, add new evals based on those traces and edge cases, and do it again and again. If you have a way to quickly add new evaluations, automatic evaluations derived from the traces, annotations, and data, you can go through this loop faster and faster, to the point where you can think of it as an automatic loop. You can optimize the harness with optimization techniques like GEPA, which we're going to use here, but you can do the same for the eval, and over time your application improves just from new observations. So today we're going to build and optimize LLM judges that are calibrated with human annotations.
But before going there, a small intro about myself. My name is Mahmoud. I'm the co-founder and CEO of Agenta. Agenta is an open-source LLMOps platform, providing all the tools, from observability to prompt management to evaluation, covering the whole life cycle of building reliable agents. My background is in machine learning; I have more than 15 years of experience in it. In a previous life I was in academia, working on machine learning applied to computational biology and protein structure prediction. Right now we're working a lot on these sampling and auto-optimization workflows, so if you're interested in that, please reach out. We'd love to have a conversation and show you what we're building.
So what's the plan for this talk? We're going to work on a practical use case: a customer support agent that we want to evaluate. We're going to build an LLM-as-a-judge that is calibrated with human annotations for that customer support agent. The plan is to go over the whole process of building it, starting with how we design the metrics and how to think about data curation and labeling, but the main focus will be the part about optimizing the LLM judge using GEPA, and then, obviously, validating the results. All the code and data used here can be found on GitHub; you'll find the links in this video and on the last slide.
Let's start with the dataset. We're going to use τ-bench. τ-bench is a benchmark with a large dataset built by Sierra, a customer support scaleup, I think, with multiple benchmarks covering real-world scenarios for customer support agents. One of them is the airline agent, which we're going to use in this example. What they have is an airline customer support agent with access to multiple tools to manage reservations, access flight information, and access user information, and it has to adhere to a quite complex policy: when to change a reservation, when to provide information, and so on, just like a human customer support agent. The data we have is the agent itself and, most importantly, 599 generated conversation traces with annotations.

Now, the original format of the annotations is assertions, but I post-processed the data so that for each trace we have a human-style annotation. For example, in this case you have the conversation you see here, and then an annotation saying the agent is not compliant because it approved the cancellation without verifying that the reservation met the airline's cancellation rules. So the evaluation failed, and the reason is that the agent canceled the reservation without verifying something; it did not adhere to the policy. The data is not very skewed: it's 62% compliant and 38% non-compliant, generated with multiple models and trials. Overall, the problem is quite complex because the policy is quite complex. The data has caveats; honestly, it's not very clean, due in part to how it was generated. But for our purposes, I think it's a very interesting use case to test GEPA and to demo how it would work on a realistic test case.
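To make the shape of the data concrete, here's a minimal sketch of what one post-processed record might look like. The field names and content are my own illustration, not the exact schema from the repository:

```python
# Hypothetical shape of one annotated trace after post-processing.
example_record = {
    "trace_id": "airline_task_042_gpt4o_trial1",
    "conversation": [
        {"role": "user", "content": "I need to cancel my reservation ABC123."},
        {"role": "assistant", "content": "Done! Your reservation is cancelled."},
    ],
    # Binary human-aligned label plus the reasoning behind it.
    "annotation": {
        "compliant": False,
        "reasoning": (
            "The agent approved the cancellation without verifying that "
            "the reservation met the airline's cancellation rules."
        ),
    },
}
```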
The workflow has four steps. First is designing the metrics, that is, deciding what the LLM judge will measure and which axes it will look at. Second is annotating the data. Then we optimize the judge, and finally we validate the results. The most important thing to take away here is that the metrics need to come from the use case itself. It does not make sense to use generic metrics like hallucination when evaluating your AI agent; it really depends on the business use case, and the best people to determine these metrics are the subject matter experts. For example, in the case of a customer support agent, you need a subject matter expert to look at the conversations and provide feedback. The best description of this workflow that I know of is by Hamel, from hamel.dev; I'm going to share his blog in the YouTube video description. He describes this idea of error analysis very well, but I'm going to go over it, and the annotation workflow, very quickly.
The idea is that you provide your subject matter expert with all these traces, the trajectories of the conversations, and they annotate them: first by commenting on what worked or did not work, then by slowly clustering the error types, when it's failing and why. Here I'm showing how it's done in Agenta. Going through these traces, I discovered four error types. Sometimes there are issues with policy adherence, sometimes with response style, sometimes with information delivery (the agent not informing the customer that it made a change, for instance), and finally, sometimes tools are not called correctly. The idea is to take these four error types and build four LLM judges, one for each. It does not make sense to have one LLM judge for a single generic "success" metric
that tries to evaluate all of these at once. That makes it too complex and very hard to learn, and you'll see a bit later that even after we simplify things, it's still hard to optimize a calibrated LLM judge. So it makes a lot of sense to keep the metrics very specific. The second thing is to move away from one-to-five scores or percentages and instead use really binary decisions, like whether the agent adhered to policy or not, with some reasoning attached. The reason, again, is that it's already quite hard to calibrate an LLM judge on a true/false binary classification; adding another layer and asking for a number between one and five makes it even harder. It's hard even for two human annotators to agree on the same score.
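As a concrete illustration of this design choice, here is a minimal sketch of a binary judge output with structured reasoning, using Pydantic for the schema. The class and field names are my own, not taken from the talk's repository:

```python
from pydantic import BaseModel, Field

class JudgeVerdict(BaseModel):
    """Binary LLM-as-a-judge output: one specific metric, true/false plus reasoning."""
    reasoning: str = Field(description="Why the trajectory passes or fails this metric.")
    compliant: bool = Field(description="True if the agent adhered to the policy.")

# One judge per discovered error type, rather than a single catch-all 'success' judge.
METRICS = ["policy_adherence", "response_style", "information_delivery", "tool_usage"]
```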
that we have defined these different
metrics uh start the point of
annotation. Um here again I'm using uh
agenta basically um uh you would take
your traces create an annotation queue
and and kind of specify for your
annotator like uh the name of the um the
feedback or the evaluator policy
adherance here and then providing like
they should provide each one with
whether it adears to the policy whether
it does not and provide the reasoning
and the reasoning here is very important
uh because without the reasoning we will
see uh the optimization algorithm will
need to discover itself like why it
failed and it's going to be very hard
like in unless it's a very specific uh
kind of feedback for example tool um
failure where it can infer things. It's
very hard to see for example from a
conversation why it did not adhere to
policy without someone providing
information in the beginning about it.
So having that reasoning is very
important later for the LLM as a judge
to learn. And again this reasoning as
I've shown previously with the
annotation uh it it kind of describe why
for example the agent here is
non-compliant
uh because uh it approved the
cancellation without before verifying
um that the reservation met the
cancellation rules. So now that we have
So now that we have the annotations, we can go to the optimization. But before going there, I want to add a small note. Although we went very quickly through the first and second steps, these are actually the hardest parts of the problem. In reality, as every data scientist knows, getting your data right is the hardest thing. You need to look at the data and at the annotations, make sure the distribution is good, and make sure the information in the data is enough for the algorithm to learn a meaningful representation of the LLM judge. In our case here, the data is not that good: the number of traces is small, the problem is quite complex, and the traces are not well distributed because of how they were generated. Also, the annotations are actually AI-generated, based on the assertions in the original data; you can see how that's done in the repository. All of that makes this problem quite hard, but it's still quite a good demonstration of how the process works.
So now that we have the annotated data, we can start the optimization, and for that we're going to use the GEPA algorithm. I'm going to explain first how GEPA works, and then we can jump to the Jupyter notebook and start optimizing on the annotated data. It's very important to understand how the algorithm works, because you'll see that in practice you need to play around with the parameters to get it to work, and it's very hard to tune parameters if you don't understand what they do.

The algorithm is similar to a genetic algorithm. The idea is that you start with a seed prompt, and at each iteration you sample new prompts, see which ones work, select the good ones, and improve over time. That's the general shape of the algorithm, and we're going to look at each step. It's three steps, basically: sample new candidates each time, evaluate them to see which ones work well, then do some filtering using the Pareto frontier I'm going to talk about, and repeat.

So here's how it works. First you start with a seed candidate. In our case we're going to use a very simple LLM judge: evaluate whether this customer service agent violated policy, and start by assuming the agent is compliant. Then, in each iteration, we sample new candidates from the filtered candidates of the last iteration. In the first iteration we have only the one seed candidate, but in the next iterations we'll have a larger bag of candidates.
GEPA has two strategies for sampling new candidates: prompt mutation, and merging multiple candidates. Prompt mutation is what we'll use at the beginning, since we only have one candidate. The idea is that you run the current LLM judge on a trajectory, and where it fails, an LLM reflects and proposes a new prompt. There is some kind of reflection step, which means we're using the intelligence of the LLM to improve the prompt: it looks at the inputs, the outputs, and the results, and tries to infer how to make it better. The other strategy is the merge strategy, which we'll use in later iterations: it takes two prompts and puts them together. If you think about an LLM judge, the prompt usually contains guidelines, so what the merge is probably going to do is take guidelines from prompt A and prompt B and combine them.

Now, after we've generated a lot of samples, we need to select which ones are good. The way it works is that GEPA evaluates these new prompts against mini-batches of the eval set, not everything, and if they improve performance compared to the starting point, they're selected and added to our bag of prompts. Then starts the next iteration, which uses the other innovation of this algorithm: the idea of the Pareto frontier.
The way we select which prompts or candidates to use as seeds for the next iteration is not simply picking the ones with the best average score. That would be the trivial solution, right? Look at my prompts, see which ones work best on average over everything, and select those. Instead, GEPA adds diversity by looking at the best candidate per task. You have a set of tasks in your evaluation, in our case a set of trajectories, and for each trajectory you find the best candidate: that set is the Pareto frontier. Then you select from it a set of candidates that, together, covers your whole test set, so that for each test case there is at least one candidate that solves it. The idea is that you build a good Pareto frontier, then start merging things, and at the end of the day you have one prompt that solves everything in your training set. With this filtered set of candidates, we again sample new candidates using the mutation and merge strategies, and we keep doing this until the compute budget runs out.
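To make the loop concrete, here is a heavily simplified sketch of the per-task Pareto filtering and the outer sample-evaluate-filter loop. This is my own illustration of the idea, not GEPA's actual implementation (real GEPA scores on mini-batches, counts budget in metric calls, and is considerably more careful):

```python
import random

def pareto_filter(candidates, scores):
    """Keep candidates that are best on at least one task.

    scores[c][t] is candidate c's score on task t. Returns the indices of
    the per-task winners, i.e. the Pareto frontier over tasks.
    """
    num_tasks = len(scores[0])
    frontier = set()
    for t in range(num_tasks):
        best = max(scores[c][t] for c in range(len(candidates)))
        for c in range(len(candidates)):
            if scores[c][t] == best:
                frontier.add(c)
    return sorted(frontier)

def optimize(seed, evaluate, mutate, merge, budget):
    """Toy GEPA-style loop: mutate/merge, keep improvements, Pareto-filter."""
    pool, scores = [seed], [evaluate(seed)]  # evaluate -> list of per-task scores
    while budget > 0:
        parent = random.choice(pool)
        if len(pool) >= 2 and random.random() < 0.5:
            child = merge(*random.sample(pool, 2))   # merge two candidates' guidelines
        else:
            child = mutate(parent)                   # LLM reflection proposes a rewrite
        child_scores = evaluate(child)
        budget -= 1
        if sum(child_scores) > sum(scores[pool.index(parent)]):
            pool.append(child)                       # only keep improvements
            scores.append(child_scores)
        keep = pareto_filter(pool, scores)           # per-task winners survive
        pool = [pool[i] for i in keep]
        scores = [scores[i] for i in keep]
    return pool
```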
Now, there are a lot of libraries that implement this algorithm. The best known is probably DSPy, which popularized the idea of optimizing prompts and harnesses. But there's also a library by the authors of GEPA themselves, an open-source one called gepa, and in the last month they implemented a new interface, a new API called optimize anything, which is what we're going to use. It can be used not only to optimize prompts but to optimize almost any algorithm using the same idea. It's quite powerful. Let me show you how it works.
The API here is called optimize anything. It's a function that takes a seed candidate. The candidate is the configuration you want to optimize; in our case that's the LLM-as-a-judge prompt. We can even make it a dict if we want, for example the judge prompt plus a temperature, or a chain of prompts, and so on; it's not limited. Then we have the evaluator, which is the thing GEPA uses to optimize. The expectation is that the evaluator runs the system, in our case the LLM judge parameterized by the candidate, and logs not only the output but also diagnostics: the error, the reasoning, and so on. You can add as much as you want, and you'll see this is how we're going to do it for our judge optimization. If you remember, GEPA uses reflection and reasoning to improve the prompt, and that reasoning material is something we build ourselves through this evaluator. Beyond that, the configuration lets you set things like how many calls to make per iteration, and the objective, which provides context to the refinement prompt about how to improve, and so forth. But the core flow is actually quite simple.
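Here's a minimal sketch of the call shape just described. I'm hedging: the parameter names follow the talk's description, not a verified signature, so check the gepa repository for the exact API before using this:

```python
# Hedged sketch of the flow described above; consult the gepa library
# by the GEPA authors for the actual optimize-anything signature.
import gepa  # assumed import

seed_candidate = {
    # The candidate is whatever configuration you want optimized: here a judge
    # prompt, but it could also carry a temperature, a prompt chain, etc.
    "prompt": "Evaluate whether this customer service agent violated policy. "
              "Start by assuming the agent is compliant.",
}

# result = gepa.optimize_anything(
#     seed_candidate=seed_candidate,
#     evaluator=evaluator,   # runs the judge, returns score plus diagnostics
#     max_metric_calls=500,  # compute budget (illustrative number)
#     objective="Calibrate an airline policy-compliance judge to human annotations",
# )
```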
So now let's jump to the Jupyter notebook and look at how to do it step by step.
You'll find this Jupyter notebook in the GitHub repository linked in this video and on the last slide. We start by installing the libraries: litellm and gepa. I'm not installing gepa here because I already ran the optimization; it takes quite a while, a couple of hours, so I'm going to skip that step. But obviously, if you want to run it inside the Jupyter notebook, you should install it too. So we install these and do our imports; there are a couple of helper functions I extracted out that we're going to look at later.
And we start by loading the data. As I mentioned, we're using the data from τ-bench; I just pre-processed it at the beginning to turn the assertions into the annotation format I showed in the presentation.
I've already split the data into a training and a validation set: one to use with GEPA, and the second to validate the results at the end. I did the split based on the different tasks defined in τ-bench. Looking here, we have a training set with 480 traces, of which around two-thirds are compliant, and a validation set with 112 traces. As I mentioned, the data isn't the nicest: there are some redundancies, sometimes the same task run with the same model multiple times. But there are no redundancies between the training and the validation set.
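Here's a sketch of what such a task-grouped split might look like. The file and field names are hypothetical stand-ins for the repo's actual loading code:

```python
# Sketch of a task-grouped train/validation split (hypothetical field names).
# Grouping by task id prevents the same τ-bench task from leaking into both sets.
import json
import random

with open("annotated_traces.json") as f:  # assumed file name
    records = json.load(f)

task_ids = sorted({r["task_id"] for r in records})
random.Random(42).shuffle(task_ids)
cut = int(0.8 * len(task_ids))
train_tasks, val_tasks = set(task_ids[:cut]), set(task_ids[cut:])

train = [r for r in records if r["task_id"] in train_tasks]
val = [r for r in records if r["task_id"] in val_tasks]
print(len(train), len(val))  # the talk's split ends up with 480 / 112 traces
```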
Let's look at how the annotations look. We have an example of a compliant annotation, a trajectory that adheres to the policy our judge should learn, and a non-compliant one. You can see the compliant one says the agent correctly identified the basic economy reservation, while the other one failed to identify the user's membership as regular. This annotation is actually quite important for the LLM judge to learn the policy, especially here: policy adherence is a very complex system for the judge to learn, and without information about why something is correct or not, it would be nearly impossible for the GEPA algorithm to reach a good LLM judge. It would be the same for a human: if you gave me all these trajectories and just told me "this is compliant, this does not conform to policy," without giving me more information, it would be very hard for me to make a judgment or learn how to assess the policy. So having this information, and the quality of the annotations, as I mentioned, the quality of the data, is paramount to being able to learn this. And again, this is a fairly complex LLM judge to learn.
The first thing we start with is a naive judge. This is the seed judge we begin from, and it's actually something I engineered; I'll talk a bit later, with the learnings, about how exactly we reached it. You can see it evaluates whether the customer service agent violated policy, and it tells the model to start by assuming the agent is compliant and only switch to non-compliant if there is a specific reason. Since we're starting with an LLM judge that has no rules yet, the seed judge should, in my opinion, start by saying everything is all right: if I don't have any rules, it should say everything is all right.
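Here's a paraphrased sketch of such a seed prompt, reconstructed from the talk's description rather than copied from the repository:

```python
# Paraphrased seed judge prompt, reconstructed from the talk (not verbatim).
SEED_PROMPT = """\
You are evaluating whether a customer service agent violated airline policy.

Read the conversation trajectory below. Start by assuming the agent is
COMPLIANT. Only switch your verdict to NON-COMPLIANT if you find a specific,
concrete reason in the conversation.

Answer with `compliant: true/false` and a short `reasoning`.

Conversation:
{conversation}
"""
```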
In another example, when I started by creating an LLM judge that just says "check whether the agent violated policy," what you end up with is the LLM, with its own biases, trying to make that decision by itself: "this violates, this doesn't," without having any information. Without telling it at the start to assume compliance (if you have no reason to believe it's non-compliant, it should be compliant), you begin with an essentially random LLM judge that is very hard to fix later, unless you happen to end up with one prompt that discovers this "start from compliance" rule on its own. So I discovered that the initial seed is actually very important in this case. There may be simpler scenarios where you don't need this, but here it's quite important.
If we take this initial LLM judge and run it on the validation set, we find 61% accuracy, but look at the bias: most of the time it says "compliant," which is actually what we want. Saying "compliant" 98% of the time is the unbiased thing to do here, the logical place to start. So, running the metrics at the beginning: 61% accuracy, 98% recall on the compliant class, and very low recall on the non-compliant class.
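These are standard classification metrics, so a quick sketch of how you might compute them with scikit-learn (the `judge` callable and the `val` split are hypothetical stand-ins from earlier):

```python
# Sketch: scoring the judge against human annotations with scikit-learn.
from sklearn.metrics import accuracy_score, recall_score

y_true = [r["annotation"]["compliant"] for r in val]        # human labels
y_pred = [judge(r["conversation"]).compliant for r in val]  # judge verdicts

print("accuracy:", accuracy_score(y_true, y_pred))
# Per-class recall: how well we catch compliant vs. non-compliant traces.
print("recall (compliant):", recall_score(y_true, y_pred, pos_label=True))
print("recall (non-compliant):", recall_score(y_true, y_pred, pos_label=False))
```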
And as I mentioned, this is quite all right: it's biased toward compliance. I had experiments at the beginning where the seed was closer to random, and then it didn't learn at all. We can also look into where it goes wrong, and by looking at those cases we see it says "compliant" when the trace is actually not compliant, simply because it doesn't know the policy. So here we start with the main code for optimizing the judge with GEPA.
What you'll see is that I actually wrote my own reflection template, the prompt GEPA uses to reflect and sample new candidates; I did not use the default. I tried the default reflection prompt in gepa at first, but the results were not as good as I expected; it had a very hard time learning. What I did instead was provide a bias, a prior, inside the reflection template. You can see here, for example, that I state explicitly that the judge is reading an airline customer service conversation and needs to decide on compliance. The template also gets more information: the annotation it sees includes the judge's verdict and the ground truth, which is important information the reflection should look at and improve on. And I explain to the LLM how to do this: it can add rules, restructure existing ones, reword things for clarity, and it should think of itself as reconstructing the real policy rules; it should find the right policies. Adding that was very important for the quality. Otherwise, the default reflection template did not understand that it should try to learn the policy, more or less, through the LLM judge.
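Here's a paraphrased sketch of what such a domain-biased reflection template might look like; this is my reconstruction from the talk's description, not the repository's exact text:

```python
# Paraphrased sketch of a domain-biased reflection template (not the repo's exact text).
REFLECTION_TEMPLATE = """\
You are improving the prompt of an LLM judge that reads airline customer
service conversations and decides whether the agent was POLICY-COMPLIANT.

Below are examples where the current judge was evaluated. Each example shows
the conversation, the judge's verdict, the judge's reasoning, and the human
ground-truth annotation with its reasoning.

{examples}

The annotations implicitly describe the airline's real policy. Your job is to
reconstruct those policy rules inside the judge prompt: add missing rules,
restructure or reword existing ones for clarity, and remove rules contradicted
by the ground truth. Return only the improved judge prompt.

Current judge prompt:
{current_prompt}
"""
```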
So that's the second thing I changed, and honestly I only iterated on it a little. Then we run the optimization. run_optimization is a wrapper around optimize anything that just parameterizes it, since I ran multiple experiments to find the right setup. It will be part of the GitHub repository, and it's something you can play around with and start from when you're exploring the design space of GEPA. You can see it builds the configuration from its parameters, the kwargs, and then calls optimize anything with them. As for the evaluator it constructs, it calls the LLM judge and attaches all the side information: not only the trajectory but also the annotation, which is quite important.
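Here's a sketch of what such an evaluator might look like. `run_judge` and the record fields are hypothetical stand-ins for the repo's helpers:

```python
def make_evaluator(records):
    """Build an evaluator that scores the judge and logs side information.

    The extra fields (judge reasoning, human annotation) are what the
    reflection step reads when proposing improved prompts.
    """
    def evaluate(candidate, idx):
        record = records[idx]
        # Run the judge parameterized by the candidate (hypothetical helper).
        verdict = run_judge(candidate["prompt"], record["conversation"])
        correct = verdict.compliant == record["annotation"]["compliant"]
        return {
            "score": 1.0 if correct else 0.0,
            "judge_verdict": verdict.compliant,
            "judge_reasoning": verdict.reasoning,
            "ground_truth": record["annotation"],  # includes the human reasoning
        }
    return evaluate
```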
Then we run this; as I mentioned, it takes around an hour. And here you can see the results. This is the optimized rubric, and compared to the default we started from, it has learned part of the policy criteria, like flight cancellation and flight modification rules, how to communicate, and so on. If we evaluate with this rubric, we see the accuracy increased from 69% to 74%.
And we removed the bias: what especially changed is the recall and precision for the non-compliant class, which were basically zero at the beginning. The judge now has less bias, at 64%, where previously it said "compliant" 98% of the time, and it has really learned parts of the policy. Looking at the results: on the validation set we got quite a lot of improvement, around 14 points, and the training accuracy improved by nine points. Interestingly, the Pareto-frontier accuracy is now 100%, meaning that for each task there is at least one generated candidate that solves it. The issue the algorithm faced is how to merge all these candidates and all their information into one prompt that solves everything, and it struggled to do that. So at the end of the day we improved the LLM judge and its accuracy, but it's still quite far from the 95%-or-so accuracy of something really well aligned with human judgment. Obviously, I didn't invest an extreme amount of time in it, and as I said at the beginning, the quality of the data would, I'd guess, be better in other cases; this is really a tricky example.
But nevertheless, it took quite a number of iterations to reach this LLM judge, and I think that's the biggest learning: this is not an algorithm you just pick up and have work from day one, except maybe for toy examples. So I want to show, at the end, the experiments I tried at the beginning that failed, and how I thought about fixing them. The first was using smaller or older models, like GPT-4o, for both the refiner and the LLM judge. That was a complete failure: smaller models, at least in this example, are really bad as either a judge or a refiner. For the judge, given all this complicated policy logic, it just failed, and the refiner could not improve it. I tried other models, the mini and nano variants, Gemini, DeepSeek, and the best results came from using Gemini for reflection and Grok for the judge. But I'd say using the mini model for both is actually quite good too, and the results were quite decent.
The other thing was how to debug it. What I tried from the beginning was not to start with big sampling experiments, but to first run small iterations, look at the reasoning of the LLM, look at the candidates, how they improve and how many improved, and understand what's happening. That is what allowed me to think about improving the refinement prompt, adding a prior there to help solve it. Basically, I stopped at the first iteration, found some examples, and then fine-tuned that refinement prompt a bit, in Claude Code, to let it actually improve the candidates. And as always in machine learning, what you try to do first is overfit the training data: not run the whole algorithm end-to-end, but find a setup where it can work at all. What we saw with the Pareto frontier reaching 100% means we can almost overfit the training data; the merge step is where I think there are still things we could improve.
The final thing was the iteration on the seed prompt. I actually iterated over multiple seed prompts, and there were two families: one, as you've seen here, that did not include any information from the agent prompt (we have access to that, and the agent prompt does contain the policy), and one that did have access to the policy, basically the prompt I showed but with the agent's policy copy-pasted in. Interestingly, the prompt without access to the policy did better. My hypothesis is that if you include the agent's policy from the beginning, it's very hard to fine-tune: you're already stuck in a local minimum you cannot improve on. But if you don't have the policy, yet you do have the annotations, which in this case describe all or a large part of the policy, then you're able to explore the space of prompts much better.
And finally, the last point: beware of the cost. Even these small experiments I ran cost something like $200 to $300 in tokens, especially since the trajectories are long, so there are a lot of input tokens in this case. And the models used are actually quite expensive. I tried playing around with GPT-4o a little, but that ate a lot of money, so I stopped that experiment. Even the mini models are somewhat expensive, and if you go down to nano, at least from what I've seen, it doesn't work; go to smaller, cheaper models and it stops working. The usual advice is to use a bigger model for the refinement prompt and a smaller model for the LLM judge. I think that makes sense, especially if you'll be running the judge against a lot of traces, as in online evaluation: it's clearly worthwhile to spend money on the optimization to lower the cost over the long term. And I think there are a lot of use cases where that works.
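To get some intuition for where the money goes, here's a back-of-envelope estimate. Every number below is illustrative and made up for the example; real prices and token counts will differ:

```python
# Back-of-envelope cost estimate with illustrative numbers only.
tokens_per_trace = 8_000              # long multi-turn trajectories are input-heavy
traces_per_batch = 10                 # traces evaluated per candidate
iterations = 300                      # roughly the scale the talk mentions
usd_per_million_input_tokens = 2.50   # hypothetical model price

input_tokens = tokens_per_trace * traces_per_batch * iterations
print(f"~{input_tokens / 1e6:.0f}M input tokens => "
      f"~${input_tokens / 1e6 * usd_per_million_input_tokens:.0f} before output tokens")
# ~24M input tokens => ~$60 before output tokens; several experiments plus
# reflection calls is how this climbs into the $200-$300 range.
```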
So, again: first, overfit to the training data. Start with small iterations, and visualize: instrument the traces (I instrumented them with Agenta in this case), look at them, look at the prompts that have been generated, and understand how the algorithm is working before increasing the sampling. In this case we ran around 200 to 300 iterations per experiment. On top of that, there are a number of parameters, like the batch size, that you need to tune to get the algorithm to work. So that's it.
Thanks a lot for watching. I hope this has been helpful and that you'll build good LLM judges that help you improve your applications. I'd love for you to check out Agenta, our open-source LLMOps platform, and you can follow me on LinkedIn and X. Finally, if you're thinking about or working on auto-optimization, on how to optimize prompts, feel free to reach out or write in the YouTube comments. Have a great day. Thank you.