How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe

Channel: aiDotEngineer
Published at: 2025-07-22
YouTube video id: coKKKKh8Vns
Source: https://www.youtube.com/watch?v=coKKKKh8Vns
[Music]
Hey everyone. Um hope you are having a
great conference. Um so I'm going to
talk about uh how to run events at scale
and thinking beyond accuracy or
similarity.
Uh so in the last uh presentation we we
learned about like how to art u
architect the AI applications um and
then whys are important. In this
presentation I am going to talk about
like the importance of ewells as well as
what type of ewells we have to choose
when we are crafting an application.
This is a bit about me. Um so I work as
a lead engineer for applied AI for
developer platforms at Adobe. Um I have
also co-authored um CI/CD design
patterns book um and also involved in a
lot of open source work uh across the
communities.
So let's get started.
So how many of you have seen this or are
active on the Twitter right now like
have you seen these kind of patterns
emerging? Um I think this morning there
was a talk where this snapshot was again
surfaced. So the one of the most
important trends in AI application
development is EVs because without
Ewells we can't uh we can't craft any AI
application
then um how many of you are developing
an AI application be it a rag chatbot
agents anything so if you are working on
that you often have come across these
kind of questions like how do I test
applications when outputs are
nondeterministic and require subjective
judgment because we all know in LLM
world uh you can have the different
output for the same set of input. LLMs
are nondeterministic
or how many times you wondering like if
I am changing a prompt what is going to
break or how am I going to test that and
then most importantly when you are
developing an application in order to uh
measure the performance or accuracy you
need to find out what tools to use what
metrics to uh use or what models are
best because models are getting capable
day by day
and the answer is Ewells. So, Ewells is
the fundamental approach where you are
writing sort of test cases to measure
your AI applications.
And why do we why do they matter? Uh
because without measuring something uh
it can have various impacts. It can uh
impact your business. Uh you need to
measure whatever system output is being
produced. How do you align your
application with system goals or one of
the important aspect is how do you keep
getting better because applications are
you are developing applications day by
day and you need to make sure it is
getting better and then trust and
accountability. This is one of the
aspects um uh which is very important
because whenever you are developing
something for a customer uh you need to
make sure uh they trust your application
whatever output is being generated.
Now when we talk about EVLs, one of the
important aspect to focus on is data. So
when we think about EVs, when we think
about the tests, how do we start? So the
very first step is starting with the
data. Now how do you get the data? So
there are a couple of approaches to get
the data. One is you start small and you
start with the synthetic data. Synthetic
data means you can generate the um you
can generate the ancillary data, you can
generate the artificial data and start
validating uh your application's output
against that data. Then it's evalu
uh it's a continuous improvement
process. It means every time you
generate some output you need to observe
the system and then you need to keep on
defining that data set whatever data set
you are procuring
and another aspect is you need to label
your data accordingly. So because data
is fundamental to writing Ewells. So you
need to when generating the data you
need to define your data set in in a way
where it is labeled into different
aspects. It is covering multiple flows
or application prospects. So things like
that and then you need to continuously
refine that. Another another u approach
which I have learned from my experience
is you one data set is never sufficient.
So when you are thinking about ewell you
need to think about multiple data sets
based on the flows based on the
applications and whatever you are trying
to achieve.
Now when we think about evaluation uh
what do we think about? So what do you
want to evaluate? The answer is
everything. But what does that mean? So
you need to defi start by defining your
goals and objectives or what do you want
to evaluate in your system. Then you
need to design in a way where you have
mod modules defined for each of the
components. You need to optimize your
data handling. Um and I I notice I'm
mentioning data again and again but the
point is you need to have different data
sets for different flows. Uh you need to
test your flows outputs and paths. So if
your application involves multiple
flows, multiple path, you need to
evaluate in in all all parts.
Now adaptive evals. So one of the uh
previous presentation talked about like
there is no universal eval and that's
that is again most important thing
because your evals depend upon what you
want what type of application you want
to evaluate. For example, evaluating a
rag application a typical rag
application is different from code
generation.
uh if you are dealing with a typical Q&A
type of application, you can define your
email such as accuracy or similarity or
usefulness versus when you are
generating a code u you want to uh test
the generated code against the actual
code base. So that is where you need to
def uh measure your functional
correctness of the code generated or how
robust that code is generated.
Then uh when you are trying to evaluate
agents. So one of the important aspect
for evaluating agents is trajectory
evaluation because agents can take a
different path and often times you need
to define which path they are taking in
order to execute a flow. There is also
multi-turn sim uh simulation where most
of these agents are complex and you you
need to check like when you are having a
conversation like how do we evaluate
that? Then if you are doing the tool
call then you also need to check the
correctness or test suite or like how
they are how the data is being
generated.
Now another aspect is how do you scale
EL. So one one strategy is you can catch
the intermediate results and regression.
You need to focus on orchestration and
parallelism like how you are how you are
running your EV vals how you are
orchestrating them how you are
paralyzing them. You need to aggregate
the results and then you need the
important aspect here is you need to run
them frequently and then improve upon.
So one of the uh term which is being
used in industry is measure, monitor,
analyze and repeat. So you need to often
measure it, you need to analyze that and
iterate on that. Then you need to
strategize what you want to measure. So
again depending upon the use case there
are different type of matrices or
different type of methodologies you need
to adapt to.
And then again use there is no fixed
strategy to run your ewell. So use what
what fits best. In some cases uh you
want your humans in the loop to be
taking precedence. In some cases you
have u uh automation test automation
ewells running in there is a fine
balance or trade-off between human in
the loop versus automation like whether
you want the high speed versus high
fidelity. So again depending upon what
you want to achieve um you want to take
give a fine balance on that and rely on
process over tools. Reason is because
tools again you cannot automate
everything. So you need to define and
establish the process. How do you want
to run the EVs?
So these are some of the key takeaways
we just talked about. Um so one is EVs
are the most important aspect for AI
application. uh there is a term being
coined now evalu development uh which is
if if you think about typical software
like testdriven development this is the
eval development define ewells based on
the use cases uh you need to focus on
positive as well as negative cases then
focus on the data that is I cannot
emphasize enough on uh on that and then
remember to measure monitor analyze
iterate in a loop continuously and
always take a balance approach in
fidelity versus speed. Uh if you have
any questions, uh there's a barcode. You
can come later and chat with me. Happy
to chat more. And that's all from now.
[Music]