Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI
Channel: aiDotEngineer
Published at: 2026-04-10
YouTube video id: X4dEHRzBLmc
Source: https://www.youtube.com/watch?v=X4dEHRzBLmc
Hello everyone, and welcome to my talk slash workshop, "Judge the Judge." Today we're going to talk about LLM-as-a-judge. I'm sure you know this scenario: you have an agent in production, and someone from the team says we need to monitor its reliability. So you go to one of the libraries, maybe grab a hallucination LLM-as-a-judge, put it in production within your observability platform, and things look fine. But customers are saying the agent is not working, and when you look at the traces, it's not working. You look under the hood of this hallucination judge and you find a prompt not very far from this one: "You'll be given an LLM output. Rate whether it's a hallucination. Make no mistakes." Now, how would the LLM know whether it's a hallucination? If it could, your app would have worked from day one.

So today we're going to talk about how to build calibrated LLM-as-a-judge evaluators that actually work. Calibrated means calibrated with human annotation, and the way we're going to calibrate them is with prompt optimization, specifically GEPA, which is quite a good algorithm for optimizing prompts.

Why do we want calibrated LLM judges? The first reason is offline evaluation. Usually, the way to create a good agent or a good prompt is to experiment with a prompt, run your evals, and see whether it improves things or not. If it does, good; if it doesn't, you go back, improve the harness or the prompt, and do it again and again. The speed at which you move to production or add features is really the speed at which you can complete this loop, and the bottleneck in this loop is the evaluation: how fast can you evaluate? The slowest possible evaluation is having a human annotator look at your whole test set and annotate it manually. The quality is quite good, but then each iteration takes a long time. You can go faster by using an LLM as a judge, but if that judge does not correlate with human annotation, you end up with a useless signal: the loop moves fast, but it doesn't go anywhere. So a calibrated LLM judge with quality similar to a human annotator makes your development much faster.

The second reason is online evals, like our example from the beginning. If you have an online eval and you want to see whether things are improving in production, the same applies: if your LLM judges are calibrated with your business goals, you can quickly see whether the changes you've made are improving things, whether there is some shift in the data distribution or in how people interact with your agent or model, and react rapidly.

And finally, the holy grail of AI engineering is really to build the data flywheel, where you optimize your harness, observe traces, add new evals based on those traces and the edge cases, and do it again and again. If you have a way to add new evaluations quickly, automatic evaluations derived from the traces, annotations, and data, you can go through this loop faster and faster, to the point where you can think of it as an automatic loop.
Because you can optimize the harness with optimization techniques like GEPA, which is what we're going to use here, and you can do the same thing for the evals. Over time, your application improves just from new observations.

So today we're going to build and optimize LLM judges that are calibrated with human annotations. But before going there, a small intro about myself. My name is Mahmoud. I'm the co-founder and CEO of Agenta. Agenta is an open-source LLMOps platform, providing all the tools from observability to prompt management and evaluation, covering the whole life cycle of building reliable agents. My background is in machine learning; I have more than 15 years of experience in it. In a previous life I was in academia, working on machine learning applied to computational biology and protein structure prediction. Right now we're working a lot on these sampling and auto-optimization workflows, so if you're interested in that, please reach out. We'd love to have a conversation and show you what we're building.

So what's the plan for this talk? We're going to work on a practical use case: a customer support agent that we want to evaluate, and we're going to build an LLM as a judge that is calibrated with the human annotations for that customer support agent. The plan is to go over the whole process of building this, starting with how we design the metrics and how to think about data curation and labeling, but the main focus will be on optimizing the LLM judge using GEPA, and then validating the results. All the code and the data used here can be found on GitHub; you'll find the links in this video and on the last slide.

Let's start with the dataset. We're going to use τ-bench. τ-bench is a benchmark and a large dataset built by Sierra, a customer support scale-up I think, and it contains multiple benchmarks for real-world customer support agent scenarios. One of them is the airline agent, which we are going to use in this example. What they provide is an airline customer support agent that has access to multiple tools to manage reservations, access flight information, and access user information, and that has to follow a quite complex policy: when to change a reservation, when to provide information, and so on, just like a human customer support agent. The data we have is the agent itself and, most importantly, 599 generated conversation traces with annotations. The original format of the annotations is assertions, but I post-processed the data so that each trace has a human-style annotation. For example, in this case you have the conversation you see here, and an annotation saying the agent is not compliant because it approved the cancellation without verifying that the reservation met the airline's cancellation rules. So the evaluation failed, and the reason is that the agent canceled the reservation without verifying something; it did not adhere to the policy. The data is not very skewed: it's 62% compliant and 38% non-compliant, and it's generated with multiple models and trials. Overall, the problem is quite complex because the policy is quite complex.
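To make that post-processed annotation format concrete, here is a minimal sketch of what a single annotated trace record might look like. The field names are my own illustration, not the actual schema used in the repository.

```python
# Illustrative only: a sketch of one post-processed trace record after turning
# tau-bench assertions into human-style annotations. Field names are
# hypothetical, not the actual repository schema.
annotated_trace = {
    "task_id": "airline_task_042",           # tau-bench task the trace came from
    "model": "gpt-4o",                        # model that generated the trajectory
    "messages": [                             # full agent/customer conversation
        {"role": "user", "content": "I need to cancel reservation ABC123."},
        {"role": "assistant", "content": "Your reservation has been cancelled."},
    ],
    "annotation": {
        "policy_adherence": "non_compliant",  # binary label, not a 1-5 score
        "reasoning": (
            "The agent approved the cancellation without verifying that the "
            "reservation met the airline's cancellation rules."
        ),
    },
}
```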
The data has caveats; honestly, it's not very clean, partly because of how it was generated. But for our purposes I think it's a very interesting test case for GEPA and a good demo of how it works on a realistic problem.

The workflow has four steps. First is designing the metrics, that is, deciding what the LLM judge will measure and which axes it will look at. Second is annotating the data. Then we optimize the judge and validate the results. The most important thing here is that the metrics need to come from the use case itself. It does not make sense to use generic metrics like hallucination when you're evaluating your AI agent. It really depends on the business use case, and the best people to determine these metrics are the subject matter experts. For example, in the case of a customer support agent, you need a subject matter expert to look at the conversations and provide feedback. The workflow for doing that is best described by Hamel from hamel.dev; I'll share his blog in the YouTube description. He describes this idea of error analysis very well, but I'm going to go over it, and the annotation workflow, very quickly.

The idea is that you give your subject matter experts all these traces, the trajectories of the conversations, and they annotate them, first by commenting on what worked and what didn't, and then by slowly clustering the error types: when is it failing, and why. Here I'm showing how it's done in Agenta. While going through these traces I discovered four error types: sometimes there are issues with policy adherence, sometimes issues with response style, sometimes issues with information delivery (the agent not informing the customer that it made a change, for example), and finally some tools not being called correctly. The idea is that we take these four error types and build four LLM judges for them (see the sketch below). It does not make sense to have one "success" LLM judge that tries to evaluate all of this; it makes things too complex and very hard to learn, and you'll see a bit later that even with a single, simplified metric it's still hard to optimize a calibrated LLM judge. So it makes a lot of sense to keep the metrics very specific. The second thing is to move away from one-to-five scores or percentages, and instead use binary decisions, like whether the agent adhered to the policy or not, with some reasoning attached. The reason is, again, that it's already quite hard to calibrate an LLM judge on a true/false binary classification; adding another layer and saying it should be a number between one and five is hard, and it's even hard for two human annotators to agree on the same score. Once we have defined these metrics, the annotation starts.
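Before moving on to the annotation workflow, here is a hypothetical sketch of how those four error types could be written down as separate binary judge metrics. The names and descriptions are illustrative, not taken from the talk's repository.

```python
# Hypothetical sketch: one binary LLM judge per error type discovered during
# error analysis, instead of a single catch-all "success" judge.
JUDGE_METRICS = {
    "policy_adherence": "Did the agent follow the airline policy (cancellations, changes, refunds)?",
    "response_style": "Was the tone and style appropriate for customer support?",
    "information_delivery": "Did the agent inform the customer about the changes it made?",
    "tool_usage": "Were the tools called correctly and with the right arguments?",
}

# Each metric gets its own judge prompt and returns a binary verdict plus
# reasoning, e.g. {"verdict": "compliant" | "non_compliant", "reasoning": "..."}.
```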
Here, again, I'm using Agenta. You take your traces, create an annotation queue, and specify for your annotators the name of the feedback or evaluator, policy adherence in this case. For each trace they then indicate whether the agent adheres to the policy or not and provide the reasoning. The reasoning is very important, because without it the optimization algorithm has to discover by itself why something failed, and that is going to be very hard unless it's a very specific kind of feedback, for example a tool-call failure, where it can infer things. It's very hard to tell from a conversation alone why it did not adhere to policy without someone providing that information upfront. So having that reasoning is very important for the LLM judge to learn later. And again, this reasoning, as I showed earlier with the annotation, describes why the agent is non-compliant, for example because it approved the cancellation before verifying that the reservation met the cancellation rules.

So now that we have the annotations, we can go to the optimization. But before going there, I want to add a small note. Although we went very quickly through the first and second steps, these are actually the hardest parts of the problem. In reality, as every data scientist knows, getting your data right is the hardest thing. You need to look at the data and the annotations, make sure the distribution is good, and make sure the annotations and the information within the data are enough for the algorithm to learn a meaningful representation for the LLM judge. In our case the data is not that good: the number of traces is small, the problem is quite complex, the traces are not very well distributed because of how they were generated, and the annotations are actually AI-generated from the assertions in the original data (you can see how that was done in the repository). That makes this problem quite hard, but it's still a good demonstration of how the approach works.

Now that we have the annotated data, we can start the optimization, and for that we're going to use the GEPA algorithm. I'm going to explain first how GEPA works, and then we'll jump into the Jupyter notebook and start optimizing on the annotated data. It's very important to understand how the algorithm works, because you'll see that in practice you need to play around with the parameters to get it to work, and it's very hard to play with parameters if you don't understand what they do. The algorithm is similar to a genetic algorithm. The idea is that you start with a seed prompt, and at each iteration you sample new prompts, see which ones work, select the good ones, and improve over time. That's the general shape of the algorithm, and we're going to look at each step. It's basically three steps: sample new candidates, evaluate them to see which ones work well, do some filtering using the Pareto frontier I'm going to talk about, and repeat. So let's see how it works.
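As a rough mental model before we go step by step, here is a heavily simplified sketch of that three-step loop in Python. The stubs stand in for real LLM calls, and the actual gepa library handles minibatching, budgets, reflection, and merging far more carefully; treat this as an illustration of the control flow only.

```python
import random

def evaluate(prompt: str, task) -> float:
    """Stub: score one judge prompt on one annotated trace (1.0 = agrees with the label)."""
    return random.random()  # placeholder for an actual LLM-judge run

def mutate(prompt: str, failures) -> str:
    """Stub: ask a reflection LLM to rewrite the prompt based on where it failed."""
    return prompt + " [refined]"

def merge(prompt_a: str, prompt_b: str) -> str:
    """Stub: ask an LLM to combine the guidelines of two candidate prompts."""
    return prompt_a + " / " + prompt_b

def gepa_style_loop(seed: str, tasks, iterations: int = 5):
    candidates = [seed]
    for _ in range(iterations):
        # 1. Sample a new candidate by mutating or merging existing ones.
        parent = random.choice(candidates)
        new = mutate(parent, failures=None) if len(candidates) == 1 \
            else merge(parent, random.choice(candidates))
        # 2. Evaluate on a minibatch of tasks; keep the candidate if it improves.
        batch = random.sample(tasks, k=min(4, len(tasks)))
        if sum(evaluate(new, t) for t in batch) >= sum(evaluate(parent, t) for t in batch):
            candidates.append(new)
        # 3. Filter the pool via the Pareto frontier (best prompt per task),
        #    sketched separately below.
    return candidates
```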
The way it works is that you first start with a seed candidate. In our case we use a very simple LLM judge: evaluate whether this customer service agent violated policy, and start by assuming the agent is compliant. At each iteration we sample new candidates from the filtered candidates of the previous iteration. In the first iteration we only have the one seed candidate, but in later iterations we have a larger pool.

GEPA has two strategies for sampling new candidates: prompt mutation and merging multiple candidates. For prompt mutation, which is what happens first since we only have one candidate, the idea is that you run the current LLM judge on some trajectories, and where it fails, an LLM reflects and proposes a new prompt. There is a reflection step, which means we're using the intelligence of the LLM to improve the prompt: it looks at the inputs, the outputs, and the results, and tries to infer how to do better. The other strategy is the merge strategy, used in later iterations: it takes two prompts and puts them together. If you think about an LLM judge, you usually have guidelines, and what the merge will probably do is take guidelines from prompt A and prompt B and combine them.

After we've generated a lot of samples, we need to select which ones are good. The new prompts are evaluated against minibatches of the eval set, not everything, and if they improve performance compared to the starting point, they are selected and added to our pool of prompts. Then comes the other innovation of this algorithm, the idea of the Pareto frontier. The way we select which candidates to use as seeds for the next iteration is not simply taking the ones with the best average score; that would be the trivial solution: look at which prompt works best on average over everything and select those. Instead, diversity is added by looking at the best candidates per task. You have a set of tasks in your evaluation, in our case a set of trajectories, and for each trajectory you look at which candidate is best; that forms the Pareto frontier. Then you select from those, so that at the end you have a set of candidates that covers your whole training set: for each task there is at least one candidate that solves it. The idea is that you get a good Pareto frontier, you start merging things, and at the end you have one prompt that solves everything in your training set. Once you have this filtered set of candidates, you again sample new candidates from them using the mutation and merge strategies, and you keep doing this until the compute budget is exhausted.
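Here is a minimal sketch of that per-task selection idea, assuming each candidate has already been scored on every task. It illustrates the principle, not the library's exact Pareto logic.

```python
# Minimal sketch of Pareto-style selection: instead of keeping only the prompt
# with the best average score, keep every prompt that is the best on at least
# one task, so the surviving pool still covers the whole training set.

def pareto_select(scores: dict[str, list[float]]) -> set[str]:
    """scores maps candidate name -> per-task scores (same task order for all)."""
    n_tasks = len(next(iter(scores.values())))
    keep = set()
    for t in range(n_tasks):
        best = max(scores, key=lambda cand: scores[cand][t])
        keep.add(best)  # the winner on task t stays in the pool
    return keep

# Example: B has the best average but never wins a task, so it is dropped;
# A and C each win at least one task and survive.
scores = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.5, 0.5, 0.5],
    "C": [0.0, 1.0, 1.0],
}
print(pareto_select(scores))  # {"A", "C"}
```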
There are a lot of libraries that implement this algorithm. I think the best known is DSPy, which popularized the idea of optimizing prompts and harnesses. But now there is also a new open-source library by the authors of GEPA, simply called gepa, and in the last month they've implemented a new interface, a new API called optimize anything, which is what we're going to use. It can be used not only to optimize prompts but to optimize almost any system using the same idea, which is quite powerful. Let me show you how it works.

The API here is called optimize anything; you see this function, and what it takes is a seed candidate. The candidate is the configuration you want to optimize; in our case, that is the LLM-as-a-judge prompt. It can even be a dict if we want, for example the judge prompt plus a temperature, or a chain of prompts, and so on; it's not limited. Then we have the evaluator, which is what GEPA uses to optimize. The expectation is that the evaluator runs the system, in our case the LLM judge parameterized by the candidate, and then logs diagnostics: not only the output but also the error, the reasoning, and whatever else you want to add. You'll see how we do this for our LLM judge. If you remember, the algorithm uses reflection and reasoning to improve the prompt, and that side information is something we build ourselves through this evaluator. Beyond that, the configuration lets you set, for example, how many calls to make per iteration, and the objective, which provides context for the refinement prompt on how to improve, and so on. But the core flow is actually quite simple.

Now let's jump into the Jupyter notebook and look at how to do it step by step. You'll find this notebook in the GitHub repository linked in the description and on the last slide. We start by installing the libraries, litellm and gepa. I'm not installing gepa here because I already ran the optimization; it takes quite a while, a couple of hours, so I'm going to skip that step, but if you want to run it inside the notebook, you should install it too. We install, do our imports (there are a couple of helper functions I extracted into a separate file, which we'll look at later), and we start by loading the data. As I mentioned, we're using the data from τ-bench; I just pre-processed it to turn the assertions into the annotation format I showed in the presentation. I've already split the data into a training set and a validation set: one that we'll use with GEPA, and a second one to validate the results at the end. The split is based on the different tasks defined in τ-bench. We end up with a training set of 480 traces, around two-thirds of them compliant, and a validation set of 112 traces. As I mentioned, the data is not very clean, because there is some redundancy: sometimes the same task is run with the same model multiple times.
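The task-based split could look roughly like the following sketch, with hypothetical field names; the actual split code lives in the repository. Grouping by task id keeps all traces of one τ-bench task on the same side of the split, so a task seen in training never leaks into validation even when it was run several times.

```python
import random
from collections import defaultdict

def split_by_task(traces, val_fraction=0.2, seed=42):
    """Sketch: split annotated traces at the task level (hypothetical 'task_id' field)."""
    by_task = defaultdict(list)
    for trace in traces:
        by_task[trace["task_id"]].append(trace)

    task_ids = sorted(by_task)
    random.Random(seed).shuffle(task_ids)
    n_val = max(1, int(len(task_ids) * val_fraction))
    val_tasks = set(task_ids[:n_val])

    train = [t for tid in task_ids if tid not in val_tasks for t in by_task[tid]]
    val = [t for tid in val_tasks for t in by_task[tid]]
    return train, val
```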
So there is a bit of redundancy within each set, but there is no redundancy between the training and the validation set. Here is how the annotations look: we have an example of a compliant annotation and a non-compliant annotation, that is, one trajectory that adheres to the policy (which is what the LLM judge we want to learn should recognize) and another that does not. You can see that the annotation describes, for example, that this one is compliant because the agent correctly identified the basic economy reservation, while this one is not because the agent did not identify the user's membership as regular. This annotation is quite important for the LLM judge to learn the policy, especially here with policy adherence, which is a very complex rule system the judge needs to learn. Without information about why something is correct or not, it would be quite impossible for the GEPA algorithm to reach a good LLM judge. It would be the same for a human: if you gave me all these trajectories and told me this one is correct, this one does not conform to policy, but gave me no further information, it would be very hard for me to make a judgment or learn how to assess the policy. So having this information, and the quality of the annotations, the quality of the data, as I mentioned, is paramount to being able to learn this. And again, this is a fairly complex LLM judge to learn.

The first thing we start with is a naive judge. This is the seed judge we start from, and it's something I actually engineered; I'll talk a bit later about the learnings on how exactly we reached it. You can see that it evaluates whether the customer service agent violated policy, and it says you should start by assuming the agent is compliant and only switch to non-compliant if there is a specific reason. Since we're starting from scratch, the seed judge should, in my opinion, start by saying everything is all right: if I don't have any rules, I should assume everything is all right. In another experiment, I started with an LLM judge that simply said "check whether the agent violated policy," and what you end up with is the LLM, with its own biases, trying to make that decision by itself, saying this violates, this doesn't, without having any information. Without telling it at the start that it should assume compliance unless there's a reason to believe otherwise, you start with a fairly random LLM judge that is very hard to fix later, unless one sampled prompt happens to discover this "start from compliance" idea on its own. So I found that the initial seed is actually very important in this case. There might be simpler scenarios where you don't need this, but here it's quite important.
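For illustration, here is a minimal sketch of such a seed judge, with a shortened prompt and a litellm call. The model name is an arbitrary choice, and the real seed prompt in the repository is longer; this is only meant to show the "default to compliant" framing.

```python
import litellm

# Sketch of the naive seed judge (not the exact prompt from the repository):
# it defaults to "compliant" and only flags non-compliance for a concrete reason.
SEED_JUDGE_PROMPT = """You are evaluating whether an airline customer service agent violated policy.
Start by assuming the agent is COMPLIANT. Only answer NON_COMPLIANT if there is a
specific, concrete reason in the conversation to believe a policy was violated.
Answer with exactly one word: COMPLIANT or NON_COMPLIANT."""

def judge(trajectory_text: str, model: str = "gpt-4o-mini") -> str:
    """Run the seed judge on one conversation trajectory and return its verdict."""
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": SEED_JUDGE_PROMPT},
            {"role": "user", "content": trajectory_text},
        ],
    )
    text = response.choices[0].message.content.strip().upper()
    return "non_compliant" if "NON_COMPLIANT" in text else "compliant"
```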
If we take this initial LLM judge and run it on the validation set, we find 61% accuracy. But if we look at the bias, it says "compliant" most of the time, which is actually what we want: saying it's compliant 98% of the time is, I'd argue, the unbiased thing to do here, the logical place to start. So the starting metrics are 61% accuracy, 98% recall for the compliant class, and very low recall for the non-compliant class. As I mentioned, this is fine; it's biased towards compliance. I had early experiments where the seed judge was almost random, and then it doesn't learn at all. We can also look at where it goes wrong: it says compliant when the trace is actually non-compliant, simply because it doesn't know the policy.

So here starts the main code for optimizing the judge with GEPA, and you'll see that I actually wrote my own reflection template, the prompt GEPA uses to reflect, improve, and sample new candidates. I did not use the default. I tried the default reflection prompt in GEPA at first, but the results were not as good as I expected; it was very hard for it to learn. What I did instead was provide a bias, a prior, inside the reflection template. You can see, for example, that I mention the judge is rating an airline customer service agent and needs to decide compliance, that's the basics, but I also point out that the side information this reflection template sees includes the judge's verdict and the ground-truth annotation, which is important information the reflection should look at and use to improve. I also explain to the LLM how to do this: it can add rules, restructure existing ones, reword things for clarity, and it should think of the reflection as reconstructing the real policy rules; it should find the right policies. Adding that was very important for quality. Otherwise, the default reflection template did not understand that it should, to some degree, try to learn the policy through the LLM judge. That's the second thing I changed, and honestly I only iterated on it a little.

Then we run the optimization. run_optimization is basically a wrapper around optimize anything that just parameterizes it, since I ran multiple experiments; it's part of the GitHub repository and it's something you can play with when exploring the design space with GEPA. You can see it builds the configuration from the parameters, the kwargs, and then calls optimize anything with them. As for the evaluator, it calls the LLM judge and adds all the side information: it provides not only the trajectory but also the annotation, which is quite important. Then we run this; as I mentioned, it takes around an hour.
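Before looking at the results, here is a sketch of the shape of that evaluator: score the candidate's verdict against the human label and return the side information the reflection template reads. The names and the wiring into GEPA are illustrative only; the exact optimize-anything call signature is in the gepa documentation and the talk's repository.

```python
# Illustrative sketch, not the actual gepa API: the evaluator runs the judge
# parameterized by the candidate prompt, scores it against the human label, and
# returns side information for reflection -- the judge's verdict, the ground
# truth, and the annotator's reasoning.

def make_evaluator(judge_fn):
    def evaluate(candidate: dict, trace: dict):
        verdict = judge_fn(prompt=candidate["judge_prompt"], trace=trace)
        label = trace["annotation"]["policy_adherence"]
        score = 1.0 if verdict == label else 0.0
        side_info = {
            "judge_verdict": verdict,
            "ground_truth": label,
            "annotator_reasoning": trace["annotation"]["reasoning"],
        }
        return score, side_info
    return evaluate

# A run_optimization wrapper, as in the repository, would then pass this
# evaluator, the seed candidate, and the custom reflection template to GEPA's
# optimize-anything entry point; see the repo for the exact call.
```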
And here are the results. This is the optimized rubric, and compared to the default we started from, it has learned part of the policy criteria: flight cancellation and refunds, flight modification, how to communicate, and so on. If we evaluate this rubric, accuracy increased from 69% to 74%, and we removed the bias: what especially changed is the recall and precision for the non-compliant class, which were basically zero at the beginning. The LLM judge now has less bias, predicting compliant 64% of the time instead of 98%, and it has really learned parts of the policy.

Looking at the results, the validation set improved by quite a lot, about 14 points, and the training accuracy improved by nine points. Interestingly, the Pareto-frontier accuracy is now 100%, meaning that for each task there is at least one generated candidate that solves it. The issue the algorithm faced is how to merge all these candidates and all that information into one prompt that solves everything, and it struggled to do that. So at the end of the day we improved the LLM judge and its accuracy, but it's still quite far from 95% accuracy or something really well aligned with human judgment. Obviously I didn't invest an extreme amount of time in it, and as I mentioned, the data quality would likely be better in other cases; it's really a tricky example. Nevertheless, it took quite a number of iterations to reach this LLM judge, and I think that's the biggest learning: it's not an algorithm you just take and it works from day one, except maybe for toy examples.

I want to end by showing what experiments I tried at the beginning that failed, and how I thought about fixing them. The first was using smaller or older models, like GPT-4o, for both the refiner and the LLM judge, and that was a complete failure: smaller models, at least in this example, are very bad as either a judge or a refiner. For the LLM judge, inferring all this policy, especially with a lot of complicated logic, just failed, and the refiner could not improve it. I tried other models, mini and nano variants, Gemini, DeepSeek, and the best results came from using Gemini for reflection and Grok as the judge, but I'd say using GPT-4 mini for both also works quite well.

The other thing was how to debug it. What I tried to do from the beginning was not to start with big sampling runs, but to first run small iterations, look at the LLM's reasoning, look at the candidates, how they improve, how many improved, and understand what's happening. That's what allowed me to think about improving the refinement prompt and adding a prior there to help solve it. Basically, I stopped at the first iteration, found some examples, and then spent a bit of time in Claude Code fine-tuning
that refinement prompt, so it could actually improve the candidates. And as always in machine learning, what you try to do first is overfit the training data: not running the whole algorithm end to end, but finding a setup that works at all. I think what we saw with the Pareto frontier reaching 100% is that we almost overfit the training data; for the merge step there are clearly things we could still do to improve.

The final thing was the iteration on the seed prompt. I actually iterated on multiple seed prompts, and there were two families: one, as you've seen here, that did not include any information about the agent prompt (we do have access to it, and the agent prompt contains the policy), and one that did have access to the policy, basically the prompt I've shown but with the agent's policy copy-pasted in. Interestingly, the prompt that did not have access to the policy did better. My hypothesis is that if you have access to the agent's policy from the beginning, it's very hard to fine-tune: you're already stuck in a local minimum you cannot improve on. But if you don't have access to the policy, yet you do have access to the annotations, which in this case describe all or a large part of the policy, then you're able to explore the space of prompts much better.

And finally, the last point: beware of the cost. Even the small experiments I've done cost something like $200 to $300 in tokens, especially since the trajectories are long, so there are a lot of input tokens, and the models used are quite expensive. I tried playing around with GPT-4 a little, but that ate a lot of money, so I stopped that experiment. Even GPT-4 mini is somewhat expensive, and if you go down to nano, at least from what I've seen, it doesn't work; smaller, cheaper models don't work here. The usual advice is to use a bigger model for the refinement prompt and a smaller model for the LLM judge. I think that makes sense, especially if you're running the LLM judge against a lot of traces, as in online evaluation: it's clearly worth spending money on the optimization to lower the cost in the long term, and I think there are a lot of use cases where that works.

So again: first, overfit the training data; start with small iterations; and visualize. Instrument the traces (I instrumented them using Agenta in this case), look at them, look at the prompts that have been generated, and understand how the algorithm is working before increasing the sampling. In this case we had around 200 to 300 iterations per experiment. In addition, there are a number of parameters, like the batch size, that you need to tune to get the algorithm to work.

That's it. Thanks a lot for watching. I hope this has been helpful and that you'll build good LLM judges that help you improve your applications. I'd love it if you check out Agenta, our open-source LLMOps platform, and you can follow me on LinkedIn and X.
And finally, if you're thinking about or working on auto-optimization, on how to optimize prompts, feel free to reach out or to write in the YouTube comments. Have a great day. Thank you.