Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith

Channel: aiDotEngineer
Published at: 2025-07-27
YouTube video id: 89NuzmKokIk
Source: https://www.youtube.com/watch?v=89NuzmKokIk
[Music]
Hello everybody. Um, my name is Taylor
Smith. I am an AI developer advocate
slashevangelist slashtechnical marketing
and other titles depending on where I
am. Um, I work at Red Hat in our AI
business unit because yes, we do also do
AI too now. Great. Um, I'm really happy
to be here. This is my first time at
this conference. So, I'm new to the
conference, new to speaking at this
conference, ready to have a good time
and also relax in about an hour and a
half and just learn stuff. stoked that
this was at 9:00 am. Um, my presentation
clicker isn't working, so I'm gonna be a
little bit glued over here. But to kind
of overview what we're going to do
today, I'm going to talk a little bit
about kind of the issues of setting up a
large language model in production and
the reason why you need evaluations and
benchmarks and all of these things and
why that is so critical. And then I do
have some hands-on activities that we'll
get to do to use some evaluation um
methods and benchmark tools to kind of
get a sense of what is out there to use,
how we can use those tools, what that
might look like to put it all together
for an actual production system. So,
ready? Are you excited?
>> Yay. First thing in the morning. Love
that. I'm glad it's 9:00 a.m. because
that feels like a reasonable start time.
I don't like that 8 am stuff that
happens. Um, setting up generative AI
tech in production. If you think about
it, crazy pants. Okay, there's so many
things that could go wrong. This is such
a complex type of technology that is
very creative. Setting this up to be
scalable and reliable and safe is
challenging. Okay, that's why we need
evaluations. That's why we need all of
these tests and we need to be careful.
We need to understand the technology
we're working with when we're looking to
implement this as well. So hard to do
most organizations. So they don't
typically start off with a multi- aent
framework, right? And go crazy.
Typically, they're going to have a kind
of standard repeatable path to maturity.
They might start out with, okay, how can
I automate some things with AI? How can
I have like a chatbot type of situation?
That's kind of the standard that
everybody starts out with. Okay, I maybe
want to implement a rag setup. All
right, let's start to look at agents to
for on the Red Hat side. We're still
dealing with probably like the first
three phases of this maturity with our
customers and that's where a lot of
people um still are. But we get to talk
about all the cool advanced stuff at
this conference so we can plan ahead and
think about what we want to implement.
But enterprises, they need to take an
incremental approach to do this
successfully.
Genai models, they have a number of
drawbacks, right? And we all know about
these things. Policy restrictions.
Typically, if you're a developer at a
company or have whatever type of role,
you're restricted in the type of AI that
you can use. Red Hat was just um allowed
to use m Gemini was just made available
to us, but before that, we weren't
really allowed to use anything
officially. Um, so the company policy
restrictions of the tools you can use
and the AI you can use is pretty locked
down. Typically the legal exposures and
risks of these models, you know, there
was the glue incident.
Um, they're going to say crazy stuff
sometimes. How do we guard rail against
that and how do we protect our customers
against that and account for that when
that does happen? Because it's hard to
um, completely avoid. There's the bias
and discrimination issues. Most of the
internet data is still largely
eurosentric and US-based. So the models
trained on this public internet data of
course is going to be a little skewed,
right? So we need to be aware of it just
like in regular everyday life. We know
that bias and discrimination exists. How
do we account and kind of provide guard
rails to make sure that we adjust and
prevent what we can?
Cost and performance. That's probably
kind of a baseline big one. how much
these genai models cost to run at scale
in production and the performance
throughput latency etc that we need to
account for when we have these
production system set up and then the
knowledge cut off as well is another
limitation you know these large frontier
models have a knowledge cut off because
they're not consistently trained so you
might be working with a model that was
cut off a year ago so it's not going to
have that up-to-date information which
is why they you know implement rag
systems and agent systems to look out
into the internet for more up-to-date
info. But these are just some of the
kind of drawbacks we need to be aware of
and account for.
Inference at scale, no matter how I'm
going to go kind of into these
categories a little more in depth. No
matter how good your model is, if it's
not fast, if it's not reliable, if it's
not affordable, you're screwed a little
bit from the get-go. Okay, so this
graphic, it shows a classic bottleneck
type of scenario. You got concurrent
user requests which might be represented
by those green, yellow, and orange dots
flowing into a system. But then
traditional inference runtimes typically
can't handle that kind of load
efficiently. Um to serve real world
traffic, whether you're powering a
customer support agent, a developer type
of co-pilot system, or maybe a rag
pipeline, you need an inference engine
that's purpose-built for scale. So
that's where inference runtimes like TRT
SG lang which I know we have a session
on or at least a couple sessions on VLLM
um that's where Red Hat's kind of
focusing that's where those type of
production grade inference runtimes
really need to be utilized.
There's a lot of pain points with
inference that I just want to double
down on and some of the activity will be
benchmarking and evaluating um these
types of metrics. model inference
performance evaluation under enterprise
level workload scenarios. It's very
complicated to actually evaluate this
appropriately. It requires manual setup
of evaluation runs with various
parameters you have to test. The compute
load just for performance evaluations as
well is also pretty taxing. Um you have
to make sure the data sets that you're
using for benchmarking are compatible
with the models that you're using. the
resource optimization and identifying
sizing so that you're efficiently using
your hardware appropriately for whatever
model size you're using. That's a big
challenge um for enterprises today.
Making sure they're efficiently using
their GPU investments. And then actually
cost estimating is a little bit of a
black magic thing. Like it's really hard
to do that appropriately. um you have to
like backwards mathmap imperence
performance to tokens and it's a whole
thing. So these are what enterprises are
trying to achieve and it's hard.
Um just a little kind of more examples
of the challenges. So we have you know
this is just an example of stable
diffusion bias. Like I said most of our
data is euroentric. This is going to be
there. So how do we what tools can we
use to provide guardrails against this
type of behavior?
the glue incident. This was because
there was something on Reddit was like
it's like a joke um satire and did this
AI overview tool used that information
and didn't have the right mitigation
techniques in place to identify that
satire. So then it came out in that um
AI overview uh suggestion
and then we have this um like MAD
situation where a lot of the we're
getting into a lot of synthetic data on
the internet and each generation of
these AI models that come out are
consuming more and more AI generated
data which over time is going to get you
oh there's music happening um further
away from that original human anchored
data. So this is going to lead to a loss
of output diversity, a loss of pre a
loss of precision. This would be an area
where you would need to use those kind
of accuracy evals to mitigate again and
identify that this is occurring.
So just to cover of course Google and
the stable diffusion project they have
introduced additional evaluation
frameworks and mitigation techniques to
address this. um just like anytime we
have any story like that right they are
certainly working to make sure that that
does not happen again we don't know with
the closed source um AI offerings how
exactly they're doing that but we can
speculate you know the AI overview
technology maybe introduced more rag
mitigations to where it helps to
identify that that was satire or
whatever the case is um it has more
safeguarding triggers um the stable
diffusion model likely introduced is
some level of bias mitigation guard
rails. So we need so of course that
they've been working to fix this but
ideally we don't run into this and we
prevent it ahead of time before a model
release or before an application
release.
So how do we prevent these kinds of
issues at scale in production
environments? I want to just look at a
couple of definitions because sometimes
evaluation and benchmarking are terms
used um kind of they kind of conflate a
little bit and people kind of use them
for whatever they want. So benchmarking
is just a subcategory of evaluation.
Evaluation is a comprehensive process to
assess a model end to end and it could
include a lot of different kinds of
evaluations about a lot of different
components. Benchmarking is very
specifically controlled specific data
sets and specific tasks typically used
to compare models against one another.
So this would be like a latency score
that compares different hardware setups
and different models or like the MMLU
benchmark scoring things like that.
We'll look at both custom evals that
aren't benchmarking and also some
benchmarks in their hands-on.
These are just some examples of what is
typically considered a model evaluation
versus a benchmarking specific test just
to kind of get a little bit of a sense.
Um, but again like there's so many types
of evaluations, so many tools, you can
customize it in so many different ways,
but this kind of helps hopefully a
little bit with the definitions.
So hopefully kind of seeing all these
challenges with Genai. We understand
that this is a critical process. Um you
need to manage risk for customers like
there's less of a concern right when I'm
tinkering on my laptop whatever if you
know I see something weird who cares.
But when we're talking about a
production level environment we're
serving thousands whatever customers we
need to think about these things
obviously more in these types of
scenarios. the credibility of the
company. When those stories come out,
that takes a hit for a good chunk of
time. Um, and you obviously also need to
continuously improve your evaluation
frameworks as well because you're not
going to catch everything. You need that
CI process to make sure you are
continuously improving your evaluations
and benchmark setups.
It's also going to very much depend on
the type of system that you have, what
you set up. So again, it's very much
there's tons of tools. This could look a
lot of different ways. We'll get a sense
of it today. If you have a Rag setup,
you're going to be maybe focused on a
the Ragas um benchmark or evaluation
tool, agents, you need to look at
function tool calling capabilities, etc.
you can kind of get a sense there's
going to be specific metrics that you
need to set up and look for depending on
the system that you have which requires
a lot of planning in advance and kind of
architecture scoping.
I'll give an example of a rag use case
and also it's incremental too because
you could and I'll I'll talk about this
a bit like you could literally evaluate
every single part of things but that's
going to be um time and resource
extensive to set up immediately. So you
likely want to take an incremental
approach with these types of setups. So
you might start out with okay I'm just
going to evaluate the chunk retrieval my
retrieval application in a rag system. I
want to set up some kind of evaluation
test there. I might just want to set up
a latency throughput um benchmark test
for my LLM output. You can start with
those kind of incremental approaches for
specific components and then from there
based on priority levels as well branch
out into a full system eval that covers
all the components the integration layer
of how the components work together the
UI endto-end experience and kind of have
a software engineering kind of test
period approach to this where you have
that unit test layer kind of approach at
the bottom um integration layer in the
middle and that UI end to end at the top
and you can take that layer by layer as
you're building this evaluation
framework for your systems.
So there we created this or we there is
this pyramid also for model evaluation
that represents the same kind of setup
for um that that software engineering
pyramid um represents as well. So the
base layer and uh very base layer is the
system performance because like I said
no matter how good your model is if you
don't have fast throughput you aren't
able to handle concurrent users you're
going to be in a bit of a pickle GPU
utilization etc. you need to make sure
kind of the basics are handled uh the as
the main kind of event. Um and we'll
talk about that'll be the first hands-on
activity is is evaluating the system
performance. Formatting might be making
sure it's um religiously giving you JSON
output that you need for your
application. Something like that. The
factual accuracy which we'll also talk
about um in one of our hands-on that
would be like the MMLU benchmark. So
evaluating that it's performing well on
various subjects kind of standard large
language model accuracy as well as if
you've fine-tuned a model potentially
making sure that it's accurate based on
the information that you fine-tuned that
model on and you kind of go up from
there into you know safety bias there
might be specific custom evaluations
that are very specific to your
application. So gives you kind of a
sense of the tiered approach that can be
taken here.
So, we're going to talk first system
performance and we're going to have our
first hands-on around this. We're going
to be looking at guide LLM, which is
kind of a new project that's associated
with the VLM inference runtime project.
Um, we're going to use that for system
performance benchmarks like latency
throughput and you'll get a little bit
of hands-on there. The general user flow
there is like you you know you select
your model you select your particular
data set that you want to use to test
throughput and to test inter token
latency time to first token those types
of metrics and then guide LLM allows
gives you a nice kind of um internal UI
to visualize the results of that. Uh and
then once you get kind of the results
that you want based on your use case
again then you're ready to deploy.
We you'll see this in the hands-on as
well but you want to test based on the
use case and the p one of the primary
ways you test via guide LLM is adjusting
the input and output tokens. So, if you
have a chatbot use case, a rag use case,
you can adjust the um input and output
token levels based on your use case. And
you'll see that in the hands-on and have
an opportunity to kind of play around
with that depending on what you're most
interested in.
So, we're going to start with the
hands-on.
That is the link that red.htvals
is the link to the workshop.
You will be signing in with your email.
I have no marketing game. It just
requires you to do that. So, I'm not
going to haunt you after this. Um,
you'll put in your email and that is the
password.
And
me, hold on. I just want to show you.
Well, I don't want to take up one of the
systems. Um, otherwise I would show you,
but you're going to get instruct once
you are in the workshop. You're going to
get your instructions on the lefthand
side and you're going to have two
terminals avail two terminal sessions
available to you on the right hand side
to the same system. It's a rail system.
Um the instructions will overview what
the system includes a little bit for you
so you get a sense. Um they each have an
L4 GPU as an example. Um anything else I
want to call out before we get started.
That'll be primarily what you use. It
has T-Mox enabled if you like to use
that to open up different things and
gives you some flexibility. We have
three different activities. I'm going to
pause after each activity so we can have
a little bit of a discussion in between.
Um, we'll kind of get this my first time
running this particular uh activity. So,
we'll gauge kind of the time it takes,
but I'm going to give it about 15 20
minutes for this first one.
Okay. So, I'm also pulling up a system
and I'm just going to like walk through
some of the stuff. The initial page is
just going to give you if of course if
it loads Jesus the internet. Um the
initial page is just going to give you a
little bit of background and preparing
your system instructions and then my
terminals are on this second tab
and everything is going to be glacial
pace. Um
so the first thing I have to do these
systems don't have the container toolkit
installed so I just got to do that some
uh system logistics because I'm going to
be running VLM inference runtime in a
container and I need that to work. So
that's what that's doing. And then I'm
gonna deploy a model with VLM. I So
you're gonna have to grab a hugging face
token. Probably have gotten there by
now. Probably most of us have a hugging
face token, but just disclaimer there.
So incognito window if you're hitting
the hugging face rate limit to grab a
new token.
So VLM you can also install locally like
if you have a Mac or whatever um Linux
machine, but we're deploying it as a
container here.
So I'm going to get that deploy and I'm
um you can see the VLM serve command at
the end. So that's just the VLM CLI tool
and I'm using an IBM granite model
because you know Red Hat IBM
um so we'll be working with that for a
chunk of the activity and it takes a bit
for VLM to load the model. So you're
going to be waiting for info the words
info like four times in green and then
it's deployed.
What's nice about VLM is that it is um
it's compatible with the safe tensor
formats. So I has anybody used TRT to
load up a model. Okay. Anyway, it's
crazy um because it requires you to also
convert the model formats initially. Um
so there's less kind of configuration
steps with VLM. Takes up less space too.
So when you do these kinds of system and
performance um benchmarks, you can make
a lot of adjustments like the um the
input and output tokens that I mentioned
um for the guide LLM configurations, but
there's a lot of also configuration
opportunities for the inference runtime
itself um depending on what you're
trying to do. sometimes will reduce the
max um the context window of the model
so it runs more quickly because it's if
it's a big context window it's going to
be pretty beefy. There's a lot of knobs
you can use for VLM. We're not really
going to touch that um this particular
time but just so you're aware. So I have
my my three in green infos. So that
means it's working and the model is
successfully deployed. So I'm going to
um get into my virtual environment which
is already in place and then I already
have guide LLM installed but I'm going
to pip install guide LM and these are
copy buttons by the way so you can just
easily copy paste things over.
So once I have that up then this command
is set up to just work with the um model
deployed by VLM and that I'm just
keeping up here in the top terminal. can
run this in the background but I'm just
not doing that and this is so I have my
target um the rate type is a sweep of uh
various benchmarks like inter token
latency there's various types of
benchmarks that it'll run that you'll
see in the output um but these are all
things that can be adjusted you can run
one particular benchmark at a time for
instance you can take a look at I think
it's uh guide lm-help typical type of
commands you can kind of see what the
where all the knob are. And the
documentation is pretty good, too.
So, that'll take a couple of minutes to
run because I have it um set at a rate
of five to reduce the amount of time
that it takes to process.
>> So, you can kind of get a sense of the
output here once it all processes. And I
have explainers on the left hand side of
kind of how to read some of this. Um the
mean performance for each. is we have
the the benchmark info on the top and
then benchmark stats on the bottom. Um
so the for the constant rate the on the
very left hand side so those are the
number of requests sent to the model per
second at that particular rate. So three
the constant at 3.63 6.93 if I did in um
the guide LLM command rate and it was
five. If I did rate 10, you would see
more lines of that at at um more
progressive rates
>> and like whether or not these numbers
are good also totally depend on your use
case as well. And I would be comparing I
would have a better hardware
configuration obviously in production as
well because again I'm I'm running an 8
billion parameter size or two two
billion parameter size model. Um
on an L4,
>> which is like okay, but obviously if
you're doing anything concurrently and
at scale, like that's going to go
bonkers pretty quickly.
>> So you get like the mean performance,
the median performance and P99, which is
like the the extreme level um which
matters for SLOs's and things like that.
And you can also output this into um
JSON format as well
>> to take a closer look.
>> So once you reach this, you can try also
with an additional like tweak the
parameters and then maybe compare the
results and kind of see what what that
changed if you do a rag setup. Um so I
think we had I forget what we had what
the initial command said. Um, but you
can adjust those based on a different
use case and kind of compare and
contrast what the stats look like after
because it just takes a couple minutes
to run.
Who's done with this first exercise?
Okay, good. Great. We're a few minutes
away from the additional um 10 to 15
systems being ready. So,
Just for the sake of time, I am not
going to do breaks for kind of
discussion in between if everybody's
okay with that and we can kind of just
converse independently and I'll just
awkwardly walk around the room. Um, so
if you're done with activity one, move
on to activity two because there are
three activities and I want to make sure
everybody has appropriate time for each
that they would like. Um, but we will
also have the systems up until probably
about noon, I don't know, early
afternoon. Um, we'll keep them up if you
wanted to also go back and look at it
after. So, is that good with everybody?
We just kind of steamroll power through.
Okay.
Yes.
>> So, I have a new URL where we have three
more available so far, but others are
provisioning. And I just wanted to go
ahead and put the URL up. It'll come
they'll appear here.
And it's the same password, right? You
said
>> it's the same password. It's the same
password you said. Okay.
Uh,
>> okay. So, that's the new URL. Three more
systems, but they'll more will appear.
>> Um, I put descriptions on the left um to
explain that, but I just wanted to heads
up. We're kind of moving from system
performance to now that kind of factual
accuracy part of the pyramid. Um so you
get a sense of so we're going to be
doing the MMLU pro um in the second
activity and then the third activity is
going to be focused on safety and bias
and more custom evals. So that's the
trajectory of activities. Feel free if
one is also more interesting than the
other feel free to skip around. Totally
fine.
explain more about the pro. I'm curious
about
>> or you know like proprietary data sets
if that's a thing
>> you can c you can basically customize
everything because like all these things
are open source so you can create a
similar type of eval in that multiple
choice format that MMLU does with your
own data set. Um there's different ways
to do custom accuracy evals with your
fine-tuned data. So we kind of do it in
a we have so as part of one of the
products we we incorporate an eval for
um our fine-tuned models on your
proprietary data and we do like a branch
of MMLU essentially. So there's kind of
there's a lot of ways to skin a cat even
though I love cats um in regards to how
to set up the eval tools available.
>> Yeah. So yeah. Yes.
>> You can just like fork it and ch and
change the data sources. Yeah. Yeah.
>> Put that link back up.
The instructions don't some of the
instructions don't look updated as I
expected. So, I'm going what I'm going
to do um is everybody in the Slack for a
engineer worldfare?
I created a um Slack channel called work
workshop beyond benchmarks. Um and I'm
going to put
some of the so activity three. Does
anybody see activity 3?
>> No.
>> Okay. I'm putting content in this um in
this slack.
>> If people can it's a public channel. If
people can go there my um you'll see
starting I'm the link is at activity 2
but then you'll also see the page for
activity three
it didn't
um approaching activity three and are
looking for that the systems for some
reason didn't render my latest changes
yesterday where I improved on activity
three But I put the link to it to my
repo in the Slack channel in this
channel if you search for it. But
there's information about the Slack in
the emails that we got about the event,
but also on the back of our badges as
well to navigate there. And this will be
good for after if anybody has any
questions and I can send more info about
any particular tool here as well.
I wanted to have a wrap-up moment
because we have we have about eight
minutes. So please feel free to continue
working. I put the link to the
activities in the slack channel. These
environments will be available um until
the end of the day today. So you have
time also to tinker around with whatever
you want. Um so you would just to kind
of recap we went through the fir
sounds funny. Um
>> the first was at that system performance
latency throughput level. Did everybody
kind of get through that successfully?
Of course, there's I tried to include
reading material and stuff to kind of
look more into things after because it
is it is a very big topic and there's a
lot going on and there's a lot of terms
and it is very complicated. Um, so
hopefully you can use it as a learning
res my GitHub repository as a learning
resource to kind of poke around after.
But we started there and then we moved
into the MMLU Pro with MLE Valh harness
um which also allows you to do a lot of
other evaluation um benchmarks as a part
of that MLE valh harness framework. Um I
happen to choose MMLU Pro because it
took the least amount of time even
though it took still 10 minutes. Um but
there's other ones also that you can
play around with um within that ML val
harness repository. You can see the
different um eval there. And then we
ended with a safety evaluation with
prompt which with prompt fu which um is
a tool that allows you to do a lot of
customizing um and your own eval like
you can do all kinds of custom tests
with prompt fu. So I wanted to get you
exposed to that tool so that you can
start looking around there as well. That
repository on github also has a lot of
different examples. So we used that
particular um safety focused example,
but if you look at the promptu
repository, it's very easy to play
around with other types of examples as
well. Um so we kind of moved up the
pyramid throughout the activity. So
hopefully you get a sense of kind of how
you can layer this approach when you're
looking at and trying to plan for how to
strategically implement evals across
your entire system. Um does anybody have
any kind of questions or general what
they experienced
notes of import
does this feel like valuable? I'm
curious also about use cases and happy
to talk to you after too. Yeah.
>> So um I'm like super new to evals and
stuff. Uh so like when I'm doing evals
it's like uh testing my prompts and like
my data sets or if I want to switch like
models out and stuff. Right now I'm
actively thinking about like those eval
connecting those actually to like my
production like running like use cases
to like track that like my real
performance matches kind of like the
scene. what is like is there like a word
for that concept or like
>> um so for me what I hear is the kind of
CI/CD automation implementation of an
evaluation framework just like with
software engineering testing is kind of
what I hear from that um
>> you know yeah so I don't know it it's
evaluation but in a more CI/CD format
>> like when it's actually running for like
customers and stuff
>> yeah you should have a CI/CD framework
that includes these evaluation tests
just like for unit testing setups got
Yeah.
>> Anybody else?
>> Okay. Thank you everybody. I really
appreciate it. Of course, feel free to
Thank you.
Feel free to um continue to use the
repo, message me on Slack. Again, the
environments will be up until about 6
pm, 5:00 pm tonight. So, thank you
everybody.
[Music]