Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith
Channel: aiDotEngineer
Published at: 2025-07-27
YouTube video id: 89NuzmKokIk
Source: https://www.youtube.com/watch?v=89NuzmKokIk
[Music] Hello everybody. Um, my name is Taylor Smith. I am an AI developer advocate slashevangelist slashtechnical marketing and other titles depending on where I am. Um, I work at Red Hat in our AI business unit because yes, we do also do AI too now. Great. Um, I'm really happy to be here. This is my first time at this conference. So, I'm new to the conference, new to speaking at this conference, ready to have a good time and also relax in about an hour and a half and just learn stuff. stoked that this was at 9:00 am. Um, my presentation clicker isn't working, so I'm gonna be a little bit glued over here. But to kind of overview what we're going to do today, I'm going to talk a little bit about kind of the issues of setting up a large language model in production and the reason why you need evaluations and benchmarks and all of these things and why that is so critical. And then I do have some hands-on activities that we'll get to do to use some evaluation um methods and benchmark tools to kind of get a sense of what is out there to use, how we can use those tools, what that might look like to put it all together for an actual production system. So, ready? Are you excited? >> Yay. First thing in the morning. Love that. I'm glad it's 9:00 a.m. because that feels like a reasonable start time. I don't like that 8 am stuff that happens. Um, setting up generative AI tech in production. If you think about it, crazy pants. Okay, there's so many things that could go wrong. This is such a complex type of technology that is very creative. Setting this up to be scalable and reliable and safe is challenging. Okay, that's why we need evaluations. That's why we need all of these tests and we need to be careful. We need to understand the technology we're working with when we're looking to implement this as well. So hard to do most organizations. So they don't typically start off with a multi- aent framework, right? And go crazy. Typically, they're going to have a kind of standard repeatable path to maturity. They might start out with, okay, how can I automate some things with AI? How can I have like a chatbot type of situation? That's kind of the standard that everybody starts out with. Okay, I maybe want to implement a rag setup. All right, let's start to look at agents to for on the Red Hat side. We're still dealing with probably like the first three phases of this maturity with our customers and that's where a lot of people um still are. But we get to talk about all the cool advanced stuff at this conference so we can plan ahead and think about what we want to implement. But enterprises, they need to take an incremental approach to do this successfully. Genai models, they have a number of drawbacks, right? And we all know about these things. Policy restrictions. Typically, if you're a developer at a company or have whatever type of role, you're restricted in the type of AI that you can use. Red Hat was just um allowed to use m Gemini was just made available to us, but before that, we weren't really allowed to use anything officially. Um, so the company policy restrictions of the tools you can use and the AI you can use is pretty locked down. Typically the legal exposures and risks of these models, you know, there was the glue incident. Um, they're going to say crazy stuff sometimes. How do we guard rail against that and how do we protect our customers against that and account for that when that does happen? Because it's hard to um, completely avoid. There's the bias and discrimination issues. Most of the internet data is still largely eurosentric and US-based. So the models trained on this public internet data of course is going to be a little skewed, right? So we need to be aware of it just like in regular everyday life. We know that bias and discrimination exists. How do we account and kind of provide guard rails to make sure that we adjust and prevent what we can? Cost and performance. That's probably kind of a baseline big one. how much these genai models cost to run at scale in production and the performance throughput latency etc that we need to account for when we have these production system set up and then the knowledge cut off as well is another limitation you know these large frontier models have a knowledge cut off because they're not consistently trained so you might be working with a model that was cut off a year ago so it's not going to have that up-to-date information which is why they you know implement rag systems and agent systems to look out into the internet for more up-to-date info. But these are just some of the kind of drawbacks we need to be aware of and account for. Inference at scale, no matter how I'm going to go kind of into these categories a little more in depth. No matter how good your model is, if it's not fast, if it's not reliable, if it's not affordable, you're screwed a little bit from the get-go. Okay, so this graphic, it shows a classic bottleneck type of scenario. You got concurrent user requests which might be represented by those green, yellow, and orange dots flowing into a system. But then traditional inference runtimes typically can't handle that kind of load efficiently. Um to serve real world traffic, whether you're powering a customer support agent, a developer type of co-pilot system, or maybe a rag pipeline, you need an inference engine that's purpose-built for scale. So that's where inference runtimes like TRT SG lang which I know we have a session on or at least a couple sessions on VLLM um that's where Red Hat's kind of focusing that's where those type of production grade inference runtimes really need to be utilized. There's a lot of pain points with inference that I just want to double down on and some of the activity will be benchmarking and evaluating um these types of metrics. model inference performance evaluation under enterprise level workload scenarios. It's very complicated to actually evaluate this appropriately. It requires manual setup of evaluation runs with various parameters you have to test. The compute load just for performance evaluations as well is also pretty taxing. Um you have to make sure the data sets that you're using for benchmarking are compatible with the models that you're using. the resource optimization and identifying sizing so that you're efficiently using your hardware appropriately for whatever model size you're using. That's a big challenge um for enterprises today. Making sure they're efficiently using their GPU investments. And then actually cost estimating is a little bit of a black magic thing. Like it's really hard to do that appropriately. um you have to like backwards mathmap imperence performance to tokens and it's a whole thing. So these are what enterprises are trying to achieve and it's hard. Um just a little kind of more examples of the challenges. So we have you know this is just an example of stable diffusion bias. Like I said most of our data is euroentric. This is going to be there. So how do we what tools can we use to provide guardrails against this type of behavior? the glue incident. This was because there was something on Reddit was like it's like a joke um satire and did this AI overview tool used that information and didn't have the right mitigation techniques in place to identify that satire. So then it came out in that um AI overview uh suggestion and then we have this um like MAD situation where a lot of the we're getting into a lot of synthetic data on the internet and each generation of these AI models that come out are consuming more and more AI generated data which over time is going to get you oh there's music happening um further away from that original human anchored data. So this is going to lead to a loss of output diversity, a loss of pre a loss of precision. This would be an area where you would need to use those kind of accuracy evals to mitigate again and identify that this is occurring. So just to cover of course Google and the stable diffusion project they have introduced additional evaluation frameworks and mitigation techniques to address this. um just like anytime we have any story like that right they are certainly working to make sure that that does not happen again we don't know with the closed source um AI offerings how exactly they're doing that but we can speculate you know the AI overview technology maybe introduced more rag mitigations to where it helps to identify that that was satire or whatever the case is um it has more safeguarding triggers um the stable diffusion model likely introduced is some level of bias mitigation guard rails. So we need so of course that they've been working to fix this but ideally we don't run into this and we prevent it ahead of time before a model release or before an application release. So how do we prevent these kinds of issues at scale in production environments? I want to just look at a couple of definitions because sometimes evaluation and benchmarking are terms used um kind of they kind of conflate a little bit and people kind of use them for whatever they want. So benchmarking is just a subcategory of evaluation. Evaluation is a comprehensive process to assess a model end to end and it could include a lot of different kinds of evaluations about a lot of different components. Benchmarking is very specifically controlled specific data sets and specific tasks typically used to compare models against one another. So this would be like a latency score that compares different hardware setups and different models or like the MMLU benchmark scoring things like that. We'll look at both custom evals that aren't benchmarking and also some benchmarks in their hands-on. These are just some examples of what is typically considered a model evaluation versus a benchmarking specific test just to kind of get a little bit of a sense. Um, but again like there's so many types of evaluations, so many tools, you can customize it in so many different ways, but this kind of helps hopefully a little bit with the definitions. So hopefully kind of seeing all these challenges with Genai. We understand that this is a critical process. Um you need to manage risk for customers like there's less of a concern right when I'm tinkering on my laptop whatever if you know I see something weird who cares. But when we're talking about a production level environment we're serving thousands whatever customers we need to think about these things obviously more in these types of scenarios. the credibility of the company. When those stories come out, that takes a hit for a good chunk of time. Um, and you obviously also need to continuously improve your evaluation frameworks as well because you're not going to catch everything. You need that CI process to make sure you are continuously improving your evaluations and benchmark setups. It's also going to very much depend on the type of system that you have, what you set up. So again, it's very much there's tons of tools. This could look a lot of different ways. We'll get a sense of it today. If you have a Rag setup, you're going to be maybe focused on a the Ragas um benchmark or evaluation tool, agents, you need to look at function tool calling capabilities, etc. you can kind of get a sense there's going to be specific metrics that you need to set up and look for depending on the system that you have which requires a lot of planning in advance and kind of architecture scoping. I'll give an example of a rag use case and also it's incremental too because you could and I'll I'll talk about this a bit like you could literally evaluate every single part of things but that's going to be um time and resource extensive to set up immediately. So you likely want to take an incremental approach with these types of setups. So you might start out with okay I'm just going to evaluate the chunk retrieval my retrieval application in a rag system. I want to set up some kind of evaluation test there. I might just want to set up a latency throughput um benchmark test for my LLM output. You can start with those kind of incremental approaches for specific components and then from there based on priority levels as well branch out into a full system eval that covers all the components the integration layer of how the components work together the UI endto-end experience and kind of have a software engineering kind of test period approach to this where you have that unit test layer kind of approach at the bottom um integration layer in the middle and that UI end to end at the top and you can take that layer by layer as you're building this evaluation framework for your systems. So there we created this or we there is this pyramid also for model evaluation that represents the same kind of setup for um that that software engineering pyramid um represents as well. So the base layer and uh very base layer is the system performance because like I said no matter how good your model is if you don't have fast throughput you aren't able to handle concurrent users you're going to be in a bit of a pickle GPU utilization etc. you need to make sure kind of the basics are handled uh the as the main kind of event. Um and we'll talk about that'll be the first hands-on activity is is evaluating the system performance. Formatting might be making sure it's um religiously giving you JSON output that you need for your application. Something like that. The factual accuracy which we'll also talk about um in one of our hands-on that would be like the MMLU benchmark. So evaluating that it's performing well on various subjects kind of standard large language model accuracy as well as if you've fine-tuned a model potentially making sure that it's accurate based on the information that you fine-tuned that model on and you kind of go up from there into you know safety bias there might be specific custom evaluations that are very specific to your application. So gives you kind of a sense of the tiered approach that can be taken here. So, we're going to talk first system performance and we're going to have our first hands-on around this. We're going to be looking at guide LLM, which is kind of a new project that's associated with the VLM inference runtime project. Um, we're going to use that for system performance benchmarks like latency throughput and you'll get a little bit of hands-on there. The general user flow there is like you you know you select your model you select your particular data set that you want to use to test throughput and to test inter token latency time to first token those types of metrics and then guide LLM allows gives you a nice kind of um internal UI to visualize the results of that. Uh and then once you get kind of the results that you want based on your use case again then you're ready to deploy. We you'll see this in the hands-on as well but you want to test based on the use case and the p one of the primary ways you test via guide LLM is adjusting the input and output tokens. So, if you have a chatbot use case, a rag use case, you can adjust the um input and output token levels based on your use case. And you'll see that in the hands-on and have an opportunity to kind of play around with that depending on what you're most interested in. So, we're going to start with the hands-on. That is the link that red.htvals is the link to the workshop. You will be signing in with your email. I have no marketing game. It just requires you to do that. So, I'm not going to haunt you after this. Um, you'll put in your email and that is the password. And me, hold on. I just want to show you. Well, I don't want to take up one of the systems. Um, otherwise I would show you, but you're going to get instruct once you are in the workshop. You're going to get your instructions on the lefthand side and you're going to have two terminals avail two terminal sessions available to you on the right hand side to the same system. It's a rail system. Um the instructions will overview what the system includes a little bit for you so you get a sense. Um they each have an L4 GPU as an example. Um anything else I want to call out before we get started. That'll be primarily what you use. It has T-Mox enabled if you like to use that to open up different things and gives you some flexibility. We have three different activities. I'm going to pause after each activity so we can have a little bit of a discussion in between. Um, we'll kind of get this my first time running this particular uh activity. So, we'll gauge kind of the time it takes, but I'm going to give it about 15 20 minutes for this first one. Okay. So, I'm also pulling up a system and I'm just going to like walk through some of the stuff. The initial page is just going to give you if of course if it loads Jesus the internet. Um the initial page is just going to give you a little bit of background and preparing your system instructions and then my terminals are on this second tab and everything is going to be glacial pace. Um so the first thing I have to do these systems don't have the container toolkit installed so I just got to do that some uh system logistics because I'm going to be running VLM inference runtime in a container and I need that to work. So that's what that's doing. And then I'm gonna deploy a model with VLM. I So you're gonna have to grab a hugging face token. Probably have gotten there by now. Probably most of us have a hugging face token, but just disclaimer there. So incognito window if you're hitting the hugging face rate limit to grab a new token. So VLM you can also install locally like if you have a Mac or whatever um Linux machine, but we're deploying it as a container here. So I'm going to get that deploy and I'm um you can see the VLM serve command at the end. So that's just the VLM CLI tool and I'm using an IBM granite model because you know Red Hat IBM um so we'll be working with that for a chunk of the activity and it takes a bit for VLM to load the model. So you're going to be waiting for info the words info like four times in green and then it's deployed. What's nice about VLM is that it is um it's compatible with the safe tensor formats. So I has anybody used TRT to load up a model. Okay. Anyway, it's crazy um because it requires you to also convert the model formats initially. Um so there's less kind of configuration steps with VLM. Takes up less space too. So when you do these kinds of system and performance um benchmarks, you can make a lot of adjustments like the um the input and output tokens that I mentioned um for the guide LLM configurations, but there's a lot of also configuration opportunities for the inference runtime itself um depending on what you're trying to do. sometimes will reduce the max um the context window of the model so it runs more quickly because it's if it's a big context window it's going to be pretty beefy. There's a lot of knobs you can use for VLM. We're not really going to touch that um this particular time but just so you're aware. So I have my my three in green infos. So that means it's working and the model is successfully deployed. So I'm going to um get into my virtual environment which is already in place and then I already have guide LLM installed but I'm going to pip install guide LM and these are copy buttons by the way so you can just easily copy paste things over. So once I have that up then this command is set up to just work with the um model deployed by VLM and that I'm just keeping up here in the top terminal. can run this in the background but I'm just not doing that and this is so I have my target um the rate type is a sweep of uh various benchmarks like inter token latency there's various types of benchmarks that it'll run that you'll see in the output um but these are all things that can be adjusted you can run one particular benchmark at a time for instance you can take a look at I think it's uh guide lm-help typical type of commands you can kind of see what the where all the knob are. And the documentation is pretty good, too. So, that'll take a couple of minutes to run because I have it um set at a rate of five to reduce the amount of time that it takes to process. >> So, you can kind of get a sense of the output here once it all processes. And I have explainers on the left hand side of kind of how to read some of this. Um the mean performance for each. is we have the the benchmark info on the top and then benchmark stats on the bottom. Um so the for the constant rate the on the very left hand side so those are the number of requests sent to the model per second at that particular rate. So three the constant at 3.63 6.93 if I did in um the guide LLM command rate and it was five. If I did rate 10, you would see more lines of that at at um more progressive rates >> and like whether or not these numbers are good also totally depend on your use case as well. And I would be comparing I would have a better hardware configuration obviously in production as well because again I'm I'm running an 8 billion parameter size or two two billion parameter size model. Um on an L4, >> which is like okay, but obviously if you're doing anything concurrently and at scale, like that's going to go bonkers pretty quickly. >> So you get like the mean performance, the median performance and P99, which is like the the extreme level um which matters for SLOs's and things like that. And you can also output this into um JSON format as well >> to take a closer look. >> So once you reach this, you can try also with an additional like tweak the parameters and then maybe compare the results and kind of see what what that changed if you do a rag setup. Um so I think we had I forget what we had what the initial command said. Um, but you can adjust those based on a different use case and kind of compare and contrast what the stats look like after because it just takes a couple minutes to run. Who's done with this first exercise? Okay, good. Great. We're a few minutes away from the additional um 10 to 15 systems being ready. So, Just for the sake of time, I am not going to do breaks for kind of discussion in between if everybody's okay with that and we can kind of just converse independently and I'll just awkwardly walk around the room. Um, so if you're done with activity one, move on to activity two because there are three activities and I want to make sure everybody has appropriate time for each that they would like. Um, but we will also have the systems up until probably about noon, I don't know, early afternoon. Um, we'll keep them up if you wanted to also go back and look at it after. So, is that good with everybody? We just kind of steamroll power through. Okay. Yes. >> So, I have a new URL where we have three more available so far, but others are provisioning. And I just wanted to go ahead and put the URL up. It'll come they'll appear here. And it's the same password, right? You said >> it's the same password. It's the same password you said. Okay. Uh, >> okay. So, that's the new URL. Three more systems, but they'll more will appear. >> Um, I put descriptions on the left um to explain that, but I just wanted to heads up. We're kind of moving from system performance to now that kind of factual accuracy part of the pyramid. Um so you get a sense of so we're going to be doing the MMLU pro um in the second activity and then the third activity is going to be focused on safety and bias and more custom evals. So that's the trajectory of activities. Feel free if one is also more interesting than the other feel free to skip around. Totally fine. explain more about the pro. I'm curious about >> or you know like proprietary data sets if that's a thing >> you can c you can basically customize everything because like all these things are open source so you can create a similar type of eval in that multiple choice format that MMLU does with your own data set. Um there's different ways to do custom accuracy evals with your fine-tuned data. So we kind of do it in a we have so as part of one of the products we we incorporate an eval for um our fine-tuned models on your proprietary data and we do like a branch of MMLU essentially. So there's kind of there's a lot of ways to skin a cat even though I love cats um in regards to how to set up the eval tools available. >> Yeah. So yeah. Yes. >> You can just like fork it and ch and change the data sources. Yeah. Yeah. >> Put that link back up. The instructions don't some of the instructions don't look updated as I expected. So, I'm going what I'm going to do um is everybody in the Slack for a engineer worldfare? I created a um Slack channel called work workshop beyond benchmarks. Um and I'm going to put some of the so activity three. Does anybody see activity 3? >> No. >> Okay. I'm putting content in this um in this slack. >> If people can it's a public channel. If people can go there my um you'll see starting I'm the link is at activity 2 but then you'll also see the page for activity three it didn't um approaching activity three and are looking for that the systems for some reason didn't render my latest changes yesterday where I improved on activity three But I put the link to it to my repo in the Slack channel in this channel if you search for it. But there's information about the Slack in the emails that we got about the event, but also on the back of our badges as well to navigate there. And this will be good for after if anybody has any questions and I can send more info about any particular tool here as well. I wanted to have a wrap-up moment because we have we have about eight minutes. So please feel free to continue working. I put the link to the activities in the slack channel. These environments will be available um until the end of the day today. So you have time also to tinker around with whatever you want. Um so you would just to kind of recap we went through the fir sounds funny. Um >> the first was at that system performance latency throughput level. Did everybody kind of get through that successfully? Of course, there's I tried to include reading material and stuff to kind of look more into things after because it is it is a very big topic and there's a lot going on and there's a lot of terms and it is very complicated. Um, so hopefully you can use it as a learning res my GitHub repository as a learning resource to kind of poke around after. But we started there and then we moved into the MMLU Pro with MLE Valh harness um which also allows you to do a lot of other evaluation um benchmarks as a part of that MLE valh harness framework. Um I happen to choose MMLU Pro because it took the least amount of time even though it took still 10 minutes. Um but there's other ones also that you can play around with um within that ML val harness repository. You can see the different um eval there. And then we ended with a safety evaluation with prompt which with prompt fu which um is a tool that allows you to do a lot of customizing um and your own eval like you can do all kinds of custom tests with prompt fu. So I wanted to get you exposed to that tool so that you can start looking around there as well. That repository on github also has a lot of different examples. So we used that particular um safety focused example, but if you look at the promptu repository, it's very easy to play around with other types of examples as well. Um so we kind of moved up the pyramid throughout the activity. So hopefully you get a sense of kind of how you can layer this approach when you're looking at and trying to plan for how to strategically implement evals across your entire system. Um does anybody have any kind of questions or general what they experienced notes of import does this feel like valuable? I'm curious also about use cases and happy to talk to you after too. Yeah. >> So um I'm like super new to evals and stuff. Uh so like when I'm doing evals it's like uh testing my prompts and like my data sets or if I want to switch like models out and stuff. Right now I'm actively thinking about like those eval connecting those actually to like my production like running like use cases to like track that like my real performance matches kind of like the scene. what is like is there like a word for that concept or like >> um so for me what I hear is the kind of CI/CD automation implementation of an evaluation framework just like with software engineering testing is kind of what I hear from that um >> you know yeah so I don't know it it's evaluation but in a more CI/CD format >> like when it's actually running for like customers and stuff >> yeah you should have a CI/CD framework that includes these evaluation tests just like for unit testing setups got Yeah. >> Anybody else? >> Okay. Thank you everybody. I really appreciate it. Of course, feel free to Thank you. Feel free to um continue to use the repo, message me on Slack. Again, the environments will be up until about 6 pm, 5:00 pm tonight. So, thank you everybody. [Music]