[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)

Channel: aiDotEngineer

Published at: 2025-07-29

YouTube video id: jxrGodnopHo

Source: https://www.youtube.com/watch?v=jxrGodnopHo

Uh, we're a bit early, but I guess everyone's here, so we'll get started. Maybe we'll spend the first few minutes getting people a little oriented. We're actually quite curious what brings people here. Why are you here? What about evals? Maybe by a show of hands: have you done evals before? Okay. And have you struggled with them? Was it hard? Maybe hands up again.
So, for people who have not done evals before, what brings you here? Does anyone want to volunteer? Yeah, feel free to raise your hand.
Sorry, should I move away from the speaker a little bit? Maybe a little. All right, I'll stand over here.
Trying to understand how to approach it. Oh, I see.
Okay. So, trying to understand how to approach evals. For people who have tried before, what have you struggled with? Go ahead.
I think the struggle is defining the metrics, because it's hard in real situations. First, it's hard to define what the correct answer is, because it can look many different ways. And two, they're open-ended business questions. And then, how do you programmatically improve it, versus I design it, I review it, and I do it again? That still feels very manual and labor-intensive to me, right?
Yeah, evals can be labor-intensive. Evals can be hard to set up; they can be painful to get started with. Any other thoughts?
It's like machine learning is dependent on training data. Agents are dependent on evals to provide feedback from the world, to let them know whether or not they're getting it right. And clearly it's a sophisticated and challenging problem. To do science you have to have experiments; this is like that. How do you experiment?
Right. So, I'm just saying it's super important. We are all getting pulled into the world of data science and machine learning in some ways, even if we don't want to. Go ahead.
Yes, sorry. Engineering was part of a quality assurance team, and all these years people weren't happy that testing took, I don't know, 30% of feature development time. Now, with AI, validation takes 80% of feature development time, and people are even less happy about it. We're trying to find shortcuts, but it seems that's not always possible, because each case is so unique that you just can't reuse previous work, and you have to create many things from scratch at different levels. And because of this unclear, probabilistic nature of evaluation, every time it's hard to find an optimized way to do it, right?
So, basically, did you say speak up or speak low? Okay, speak up.
Yeah. So, basically, it's a custom, subjective eval for your specific use case; you can't just copy-paste something that exists, much like tests. We used to do testing: some people did testing, some didn't. But now, in the world of AI, everyone has to do evals. Everyone is spending a lot of time doing evals, and the best practices don't exist. So, go ahead, please.
Oh, I was just going to add that I'm looking forward to understanding how synthetic data plays a role there, especially for novel types of problems where you don't have access to real-world data to use as a baseline. How much can that impact the eval system?
Right. There are two aspects of evals: there's the synthetic data and there are the metrics. The metrics do the checking; the synthetic data is what you test against. What does synthetic data do? How do you generate good synthetic data to test and evaluate with? Go ahead.
Mine goes along with the first comment. We've been trying to use LLMs to evaluate, but it's not accurate, so we have to do this manually, and I'm hoping to learn what the right approaches are.
Right. Automatic evals are something everybody desires, because you don't want human beings to sit and do it, but the options are somewhat limited. Typically, for most things, you can lean on AI to do human tasks; for some reason, evals tend to be hard for LLMs to do. So I guess that makes a lot of sense. We're not going to solve all your eval problems here.
I think this is honestly a very hard research area that we'll probably continue working on over the next few months. But David and I were at Google for over a decade, and we have been doing evals for a long time. We dealt with stochastic applications; we were both building Search. Search was similarly stochastic, and most of the core effort at Google was to do a good job of evaluating where we are and improving it. Some of the techniques we learned at Google we're bringing here. We built a product around it, a bunch of technology around it, but really the biggest takeaway I want people to have is the methodology and the things we have learned, whatever manifestation it takes. So I'm hoping you get to play with some of our stuff, but also learn some good ideas. David, do you want to add?
Yeah, one thing I would add is that at Google we used to call this quality, and eval was part of it. There was this constant idea of benchmarking: exhausting a benchmark and then moving on to the next benchmark. Whenever we work with clients right now, we take a similar approach. Like Achin mentioned, so much of eval is just good methodology: setting up benchmarks, trying to figure out the metrics that work, calibrating metrics with humans, calibrating metrics with user data. What I'm trying to say is that it can get arbitrarily complex. Evals are not like, oh, there's one way to do it and then you're done; it's just part of how you do development, which, you know, everybody has been struggling with, because we've all been trying to learn what it means to develop on top of this new stack. But again, in today's session we're just hoping to bring you some of these ideas: methodologies that make sense, how to think about your metrics. Maybe people think of four metrics; I would challenge you: Search had 300 metrics. So you start to think about how you can expand the scope of how you're doing this work. And part of what we're doing is trying to build technologies that make it easy to adopt these techniques: how to create benchmarks that are interesting, how to make them harder. And we can talk about this as we go.
This workshop will be mostly hands-on. We'll show a few slides and then dive into the code, because I think the best way to learn about these things is just to do them. As we go, we'd love to pass it around, because people are at different parts of the journey. Some people are like, hey, I don't have metrics, and I'll just develop my own metrics. We've worked with clients that have metrics but want to, for example, correlate them to user behavior; they have a lot of thumbs-up/thumbs-down data. So that goes all the way into a feedback loop. So you see, evals are not just one thing related to testing; it goes all the way into your online system and your feedback loops and all that. Hopefully a lot of the mental models today will help you gauge that. Just two things: there's the Slack channel, it's the workshops channel, so you can join there on Slack, and we'll be posting things so we can have discussions and continue the conversations even after the workshop. And the second one is that we have a document with all the steps of the workshop and all the places where you can get the code and the sheet: just go to withpi.ai/workshop and you'll land in a Google Doc. There are a lot of links; the Google Doc also has a link to the slide deck we're presenting, so you can have it. We'll keep it online even after the workshop so you can reference it. All right, we'll get started, to keep you on time and maximize coding time.
Yeah, there are four of us; a couple of other people will just walk around. If you get stuck on any step, if you're having trouble with anything, just raise your hand and we'll come over. All right, I'll give you a very quick, maybe five-minute blurb, show a quick demo, and then we'll get started.
So, can you guys still hear me? Yeah.
I think we went through this, but basically most people start with vibe testing. That's, you know, try it out, see how it works, change prompts. And honestly, for many applications, you can go pretty far with just that; I think AIs have become quite good. Agent systems, as you were saying, are more complex because of the multiple steps; things fail more often. But vibe testing gets you pretty far. Human evals are expensive; typically most companies don't bother setting up human evals. Some do; some have subject matter experts. Code-based evals are where I think the majority of people are spending time: they're writing some sort of code to test whatever verifiable things they can. And people are moving into natural language, LLM-as-a-judge type things, and that's where it gets tricky. This is not a task that typical decoder models, the generative AI models, are very good at; we'll get into some of it. You kind of have to fight against the fact that these models are designed to be creative, and that's not what you want from a judge, typically.
The scoring system is this idea that you're not trying hard to build a comprehensive set of metrics from the beginning. You start with some correlated signals, maybe five or ten signals that you know for a fact are correlated with goodness. They're very simple signals that you can easily derive, and then over time you build upon them as you see problems, as you debug. Then you test whether your application works well or not, and you learn from it; you build your application, you measure it again. So that makes it much more of a feedback-loop type process. That's the thing we're trying to introduce in terms of methodology.
To the point David was making: yeah, one thing I will add to that is that there's no right or wrong; vibe testing actually gets you a very long way. Sorry, we have a lot of echo. Maybe you want to turn that off and then we can switch the mic.
So, just to say, there's no right or wrong. I think you should absolutely start with vibe testing. A lot of people use tracing and just monitor traces. As your system scales, you do want to get a little bit more sophisticated, so you start to layer these things in, in order of complexity. That's a lot of how we talk about quality generally: there are complex things, but they give some amount of return for some amount of investment, so it's a little bit of an ROI question. Some things are very cheap to do, so you do them, and as your system scales they become impractical, and then you layer in more techniques. So there's no statement that any of these things are good or bad; you just need all of them. Think of these things as tools in your toolbox, but eventually what you really want is a scoring system. Now, how it manifests for you, you should be the judge of that, but this is the one thing to take away: just keep layering in tools with increasing levels of sophistication, and you'll probably get to a system that looks like the one we'll develop today.
Yeah, and as David was saying, evals are such an investment. So you might ask, well, why am I spending so much time on just evals, why am I not building? One of the things that was core to Google, and that I think this industry is going to adopt over time, is that evals are actually the place where you're going to spend most of your time, because that's where the domain knowledge is going to live; everything else will just work off of those evals. So, for example, if you have really good evals, you don't have to write prompts by hand: you can find problems and use meta-prompts or optimizers like DSPy and others to improve your prompts for you. You can filter synthetic data and then use that for fine-tuning, if you're really interested in fine-tuning or reinforcement learning. But you can also use these techniques online. One of the things you're going to try today is a very simple but very effective technique that almost all big labs use, and Google used it extensively: you crank up the temperature and generate a bunch of responses instead of just one. You can think of this as online reinforcement learning: generate four or five responses, then score those responses online and see which one is best, and you get a pretty decent lift just by doing this, without making any changes to your prompts or models. These are the kinds of things you can do when you have a really good scoring system that you can lean on, and you'll get to try some of these techniques today. But the key point is: don't think of evals as testing in the classic sense. Think of them as the primary place where domain knowledge lives.
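To make that sampling idea concrete, here is a minimal sketch of best-of-N generation, assuming a hypothetical score(transcript, summary) function that stands in for whatever scoring system you end up building; the OpenAI client and model name are only illustrative.

```python
# Minimal best-of-N sketch: generate several candidates at a higher temperature,
# score each one with your scoring system, and keep the best. The `score`
# function below is a hypothetical stand-in for the scoring system built in
# this workshop; the model name is just an example.
from openai import OpenAI

client = OpenAI()

def score(transcript: str, summary: str) -> float:
    """Hypothetical scorer returning a value in [0, 1]; replace with your own."""
    raise NotImplementedError

def best_of_n(transcript: str, n: int = 4) -> str:
    candidates = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # any chat model works here
            temperature=0.9,       # higher temperature gives more diverse candidates
            messages=[
                {"role": "system", "content": "Summarize the meeting as structured JSON."},
                {"role": "user", "content": transcript},
            ],
        )
        candidates.append(resp.choices[0].message.content)
    # Rank candidates with the scoring system and return the highest-scoring one.
    return max(candidates, key=lambda c: score(transcript, c))
```

The same pattern works with any client library; the only requirement is a scorer that is fast and cheap enough to run on every candidate.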
And then one of the things that I think one of the attendees was pointing out is that right now, at this point in the AI industry, we have figured out a handful of standard evals; think of simple helpfulness, harmfulness, hallucinations. And that doesn't really get you far enough. That's good enough to act as a guardrail and make sure you're not doing anything wrong, but we're moving into a world where you want to build something really good. For example, if you're trying to build a trip planner, you're curious about how to make that trip plan really be perfect. One of the things I really hate about trip plans is when they're not interesting enough: I'm looking at a trip plan, I want to be excited about this place, and typically when an LLM gives me a trip plan, it's very plain, right? So now you're trying to build these sorts of nuances into your applications, and that's the kind of thing you want to evaluate; that's where the industry is going to go. So how do you build these much more nuanced evals? That's one of the places where traditional evals fail. So, as I went over: start simple, iterate, see what's broken, and then improve. And that's the other thing you get to try today, in a copilot-like setting: start with something, test it out with a handful of examples, maybe generate some synthetic examples, test it out with good examples and bad examples, see what's working and what's not, and then iterate. We picked a relatively simple example for today, because for workshop purposes we wanted to keep it simple, but you can absolutely go and try this on fairly complex things, and there you'll see a lot more of the nuances.
So, there are two parts to today's workshop. The first part is setting up your scoring system. The second part is, once you have set up the scoring system and played with it and iterated on it in the copilot, trying to use it in a Colab. You'll need a Google account and some proficiency in working with Colabs and Python code. We'll also introduce a spreadsheet component, so that you can play around with this in a spreadsheet and easily make changes and test things out. One last thing I want to hit on before we jump into the workshop: what is the scoring system? What is this idea of a scoring system? This is just to give you a mental framing of it.
Ranking is scoring is eval; that's the way to think about it. When Google does a search, what it's trying to do is score every document, see whether it's good or not, and then give you the document that's best for you. That's not that different from checking and scoring LLM-generated content. And the way Google does it is by breaking the problem down into a ton of signals. You can imagine looking at SEO, document popularity, title scores, whether the content is good or not, maybe feasibility, spam, clickbaitiness, stuff like that, and then bringing all of these signals together into a single score that combines these ideas. The individual signals are very easy to understand: you know what each one is, and you can inspect it very easily. It's not a complex prompt with some random score; it's something you can easily understand, but it all comes together into a single score. That's the idea we're bringing in, and that's what we call a scoring system. At the bottom level, things are very objective; they tend to be deterministic, sometimes just Python code. But as you move up toward the top, it becomes fairly subjective, and you can combine these together in very complex ways.
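As a toy illustration of that structure (not Pi's actual combination function), here is what a few simple, inspectable signals rolled up into one overall score could look like; the signals and weights are made up for the meeting-summary example used in this workshop.

```python
# Toy scoring system: simple, easy-to-inspect signals at the bottom, combined
# into one overall score at the top. Signals and weights here are invented for
# illustration; a real system would learn the combination from feedback data.
import json

def valid_json(output: str) -> float:
    """1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def title_under_20_words(output: str) -> float:
    """1.0 if the summary title is 20 words or fewer."""
    try:
        title = json.loads(output).get("title", "")
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if len(str(title).split()) <= 20 else 0.0

SIGNALS = {               # signal function -> weight (critical signals weigh more)
    valid_json: 3.0,
    title_under_20_words: 1.0,
}

def overall_score(output: str) -> float:
    total_weight = sum(SIGNALS.values())
    return sum(fn(output) * weight for fn, weight in SIGNALS.items()) / total_weight
```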
Yeah, and I would just encourage you to think about why this could work, and we know it does work. Some of what you'll struggle with in evals is just that you're not measuring the things you need to be measuring, so you have a bit of a comprehensiveness issue, and the question is how many metrics you need to add. I gave the example that Google Search uses around 300 signals. Now, maybe you don't want to go that far, but Google does care about all those things. They care about the score of the site, they care about popularity, they care about content relevance, spam, porn-seeking; they have all these classifiers, all these ways they understand the content, to then bring it together. What this gives you is real visibility into your application, and more and more ways to marry your own judgment into it. If instead you just go and say, "Hey, is this a helpful response?", mostly what you're doing is delegating that eval to the LLM itself, right? Or to a rater who you're asking, "Hey, see if this is good." But when you break it all down, you get some really nice properties: your variance goes down a lot, just because you're measuring much more objective things, so scores aren't flipping back and forth all the time. And it's very precise, because you take all these things and add them together into a higher-fidelity score. And when you're analyzing the data, you can slice and dice by much finer-grained things. That's why the system tends to work much better. And the best part is that as you iterate, you just add more signals. It doesn't leave you in a state of either I have evals or I don't have evals; rather, you just have a set of metrics that you keep adding to over time as you discover what actually matters about your application.
Okay, how do I do this? Okay, so I'll just quickly show you a demo of where you're going to start today, give you some basic ideas of the kinds of things you'll be doing, and then we should just get started, right?
So this is a copilot that helps you put together your eval. The idea is that where I'm starting is basically just a system prompt. This application is a simple meeting summarizer: it takes a meeting transcript, a conversation between multiple people, and generates a structured JSON at the end, which is a summary with very specific action items, key insights, and then a title. So it's a relatively simple thing, and it's easy for you to inspect and see where things are working well and where they're not. Typically you would start with something like a system prompt; you can also start with examples, if you have a bunch of examples, or you can start with the criteria themselves. And the first step is that it will try to use this to build your scoring system.
Yeah. And this is doing exactly what was in that slide, which is trying to say: from that coarse-grained, subjective thing, what are all the smaller things I can suss out of it? Now, this uses a reasoning model, like if you dropped this into a ChatGPT or so; we just try to replicate that experience, where you get these artifacts on the right-hand side, which are your actual scoring system. But if you just want to do it iteratively through your favorite chat interface, you can also do that. Oops, go ahead. Can we do it in reverse, where we provide examples of input and output? Yes, exactly. Right in front there is an example button. You can start with examples; you can give it hundreds of examples if you want, or twenty examples, and then it gets into a much more involved process of figuring out these dimensions based on the examples. Or you can just copy-paste one or two examples into the prompt itself and it will generate it. This is just a starting point, by the way; that's the idea. It starts you somewhere, and now you're going to iterate over it, and that's the exercise you'll spend time on.
spend time on. I'll show you a few
things like this is your scoring system.
These are your individual
um these are individual dimensions is
what we call it or you can think of
these as signals. Uh they're all
questions um effectively. For example,
this is a does the output include any
insights from the meeting. That's a
natural language question. Um or you can
have code um just Python code that we
have generated. You can edit this code
however you want. Uh this this uh this
this is actually as simple as what this
this says. When you when you look at the
actual code, the code is effectively
just a bunch of questions and you're
sending these questions to our
specialized foundation models that are
designed for scoring and evaluation. Um,
so you'll get to play with this a bunch
in the collab. You can see what the form
of it looks like. There's another thing
that you would notice that there's this
idea of critical, major, minor. These
are just weights. These are just way
ways for you to control what's
important, what's not. The combination
of this is done through a mathematical
function that that you can learn over
time. So actually eventually you would
give it a bunch of examples and it'll
learn it. But in this particular
exercise you have a little bit more
control. And finally once you have your
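To give a rough sense of the shape such a spec might take in code, here is an illustrative sketch; the field names and the severity-to-weight mapping are assumptions made for illustration, not the copilot's actual export format.

```python
# Illustrative shape of a scoring spec: each dimension is either a natural
# language question (answered by a scoring model) or a Python check, and it
# carries a severity that acts as a weight. Field names and the severity
# mapping below are assumptions for illustration, not an actual export format.
SEVERITY_WEIGHTS = {"critical": 3.0, "major": 2.0, "minor": 1.0}

scoring_spec = [
    {"label": "Key insights",
     "question": "Does the output include key insights from the meeting?",
     "severity": "major"},
    {"label": "Title length",
     "question": "Is the title 20 words or fewer?",
     "severity": "minor"},
    {"label": "Valid JSON",
     "python": "import json\njson.loads(output)",
     "severity": "critical"},
]
```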
And finally, once you have your scoring system done, there's a way for you to integrate it into Google Sheets. That's the thing you're going to play around with today: basically taking these criteria, moving them to a Google Sheet, and testing them against real examples. Here you're going to work with synthetic examples; you'll develop your scoring system, and then, completely blind to this, we have a labeled dataset where users have given thumbs up or thumbs down on the summaries. You're going to apply your scoring system to that and see how well it aligns with the real thumbs-up/thumbs-down labels. We don't know; it depends on you. You're building your own scoring system, so whether it aligns or not is up to you.
Yeah, and this is a really interesting point, and it's why evals start to get hard. We call this workshop solving the hardest challenge, which is metrics that actually work. This idea of correlation ends up being really, really important. Metrics that work are not necessarily good metrics or bad metrics; they're either calibrated metrics or uncalibrated metrics. This is where, at Google for example, we had a lot of data scientists we worked with, because they would do all these correlation analyses and confusion matrices and such. So part of the challenge of good evals is just getting comfortable with the numerical aspect of these things. And of course, having a scoring system that dissects things into much simpler pieces makes it easier to analyze, but you still have to think about whether it actually correlates with goodness: if it gives a high score, is this actually good? That's a big part of the methodology. But the good news is, as Achin showed in the previous slide, once you have metrics you can trust, almost all of the rest of your stack gets radically simplified as a result.
Okay. So how do you use this copilot? It's created this. Maybe a place to start would be, you know, generate an example. This is synthetic generation happening behind the scenes: it's taking your system prompt and other information and trying to generate an example, and then that generated example is scored, so you can see how the individual scores work. Now you can start testing this a little bit more. You can say: can you generate a bad example? And then it will try to generate an example that's broken in some particular way. The copilot understands what you've done so far; it has the full context and is using all of this information to walk you through the process. In this particular case, it created something that is missing a bunch of information. But you can even ask for specific things, like: can you create an example that has broken JSON? So that's example generation; that's one thing you can do here. The other thing you can do is make changes to your scoring system itself through the copilot. Anyway, this is a broken-JSON example. You can go to the copilot and ask it to change the Python code itself; you can say, can you update the Python code, and things like that. You can also ask it to remove or add dimensions. For example, you can say: can you generate a dimension that checks that the title is less than 20 words? And then the copilot adds that dimension. Go ahead.
Exactly. Exactly. What do you do? You just ask the copilot and say, I've updated, here are new examples, can you add new dimensions to test these new examples, or we have changed things, and it does it automatically; it knows what layer you're at, exactly. Yeah, the other thing I would say is that the copilot is very helpful if you want this human-in-the-loop type process, but once you get comfortable with the system, what we see people doing is just using the Colab with large amounts of data and letting our system figure things out on its own. Personally, I find it very useful to play with my scoring system here every so often to see whether it's working well, maybe paste in an example from a user and see how well it's doing, and so on. You can go back and forth. But most of our clients actually just fire off these long-running processes. Anyway, it just created this new title-length dimension for you, so you can play around with this; you can change the questions, remove dimensions, and so on. Right.
The last thing I'll show before we get going: we have pre-filled a spreadsheet in the workshop directions. You will make a copy of that spreadsheet, and we have a spreadsheet integration where you can run our scorer inside the spreadsheet itself. You'll see there are other places where you can use it; I'm not going to get into those, but you can use it for reinforcement learning, for example, using the Unsloth integration, and there are other places you can integrate with Pi Core, if you're familiar with these. But basically, in the Sheets integration, what you're going to do in this workshop is copy this wholesale and put it into a new spreadsheet with the examples that are there. What we're trying to do is just copy the criteria. So you want to build the criteria: there's a copy icon next to the signals; just click on the copy icon, then go to the spreadsheet you have, replace the criteria that exist there with the criteria you've created, and then, under Extensions (these directions are all in the doc, so you don't have to remember them), you can call the scorer. The scorer is going to run across about 120 examples, and then you'll see a confusion matrix which shows you how many times there's alignment on thumbs up, how many times there's alignment on thumbs down, and how many times there's no alignment. Then you can play around in the spreadsheet itself, make changes to the dimensions, and see if you can bring the alignment closer. So this gives you a sense of it. This is what a spreadsheet would look like. This is your criteria, which is basically just the spreadsheet form of the English that you saw.
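Here is a small sketch of what that alignment check amounts to, assuming you threshold the overall score into a pass/fail prediction; the 0.7 threshold and the example data are assumptions, and the spreadsheet integration does this for you.

```python
# Sketch of the alignment check: threshold the overall score into a pass/fail
# prediction and compare it against user thumbs-up/down labels as a confusion
# matrix. The 0.7 threshold and the example pairs are illustrative assumptions.
from collections import Counter

def confusion_matrix(examples, threshold=0.7):
    """examples: iterable of (overall_score, thumbs_up) pairs."""
    counts = Counter()
    for score_value, thumbs_up in examples:
        predicted_good = score_value >= threshold
        if predicted_good and thumbs_up:
            counts["aligned_thumbs_up"] += 1      # scorer and user both liked it
        elif not predicted_good and not thumbs_up:
            counts["aligned_thumbs_down"] += 1    # scorer and user both disliked it
        elif predicted_good and not thumbs_up:
            counts["scorer_high_user_down"] += 1  # misaligned: scorer too generous
        else:
            counts["scorer_low_user_up"] += 1     # misaligned: scorer too harsh
    return counts

print(confusion_matrix([(0.9, True), (0.4, False), (0.8, False), (0.3, True)]))
```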
Oh, sorry, is that better? Okay. So you can see this is the label, this is the question, and these are the weights; in the case of Python, there's Python code. So this is what your criteria sheet looks like right now. This is the default one we put in there for you, but you would replace it with your own. This is the data: it basically has actual feedback, a thumbs up or thumbs down, which we got from users. Some of it is fairly complex, some of it is easy, so it's a mix of different kinds of things. What you're going to do is select these two rows, the input and the output, and then under Extensions you'll have "score selected ranges", which is going to create a score. Then you can look at the confusion matrix and see how it works. So that's one part of the exercise, but you can easily go and make changes here, or even test things out by making changes to the data, messing up the JSON for example, and seeing how that impacts things. So that's the first phase of our workshop, and we'll talk about the second phase when we get started with it. Go ahead.
Yeah, I'm just curious about best practices for using this in production, when you have, you know, tens of thousands, maybe hundreds of thousands of examples. Do teams run this on a subset, or how do you do this at scale?
Right, so we'll hit a lot of that in the second part of the workshop, where we go directly into the Python Colab code. The SDK goes through a bunch of the details and best practices on how you do it, and you'll get to play with that in this workshop. Our scorers are specifically designed for online workflows: these 20 dimensions that you have, they score them all in something like 50 milliseconds. So you can run it at very large scale, you can run it online fairly easily, and we have batch processes and things like that set up for you, so you'll be able to play with that and create sets. Typically you want to create eval sets that have a combination of hard, easy, and medium examples. We're not going to get into synthetic data generation that much; you'll play with it in the copilot, but for the actual data generation there's documentation you can follow up on afterwards, on how to create easy, medium, and hard sets for your testing purposes. The best thing is to sample some number of things from your logs and evaluate, or just run it online. The majority of these kinds of quality checks at Google were run online, whether it's spam detection or deciding what decisions to make. That's the ideal place you want to be. For most people it's very difficult to implement at that point; I think one of the biggest challenges with LLM-as-a-judge is that it's so expensive you can't run it online. So that's one of the things this solves.
All right, so go ahead. How is this different from traditional models, or smaller models? So, we had a bunch of discussion on this; maybe I'll go very quickly through these slides. These models are designed for high precision. For example, if you run the same scorer twice on the same thing, it's not going to give you different scores; it gives exactly the same score, and with small variations in the input the scores stay consistent. It's designed for very high precision; the architecture of these models is basically very low variance. Part of the reason is that they use bidirectional attention instead of the typical decoder-model attention, and they have a regression head on top instead of token generation. So it's not autoregressively generating tokens, which comes with a lot of weirdness, like post-hoc explanations for scores: the model will come up with a score and then try to justify it. That doesn't happen here. The other thing is that these models have been trained on a lot of data, billions and billions of tokens, which are only for scoring, but across different kinds of content, coding content and other types of content, so they generalize really well, and that also stabilizes them quite a bit. The one thing I would say is really nice is the interface. You basically just ask a question and give it the data, and it will answer the question with a score, and then you can inspect the score and understand why it gave you that score. So it's a fairly simple interface. There's no prompt tuning and so on, because in this particular case, when you're evaluating, prompt tuning isn't very natural; how do you explain a rubric to a model? These models understand internally why something should or shouldn't score high. And then these things come together using a fairly sophisticated model, which is an extension of a generalized additive model, that brings all of these different signals together by weighting them based on your thumbs-up/thumbs-down data. This is a process called calibration, where you give it a bunch of data and it learns what's important and what's not: if something fails, should it fail everything (for example, spam), but if it succeeds, it shouldn't contribute much. It makes those decisions for you based on the data. So it's a much more advanced way of doing evals; it gives you the stability you want. And the reason they're very fast is that when you use bidirectional attention, you can build much denser embeddings, so with fewer parameters you can get fairly high-quality scores. Go ahead.
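The speakers describe the combiner as an extension of a generalized additive model calibrated on thumbs-up/thumbs-down data; as a rough stand-in for that idea, here is a sketch that fits a plain logistic regression over per-dimension scores, purely to illustrate learning the combination from feedback instead of hand-tuning weights.

```python
# Rough stand-in for calibration: learn how to weight per-dimension scores from
# thumbs-up/down labels. The real combiner is described as an extension of a
# generalized additive model; plain logistic regression is used here only to
# illustrate fitting the combination from data. The numbers are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per example, one column per dimension (e.g. insights, title, valid JSON).
dimension_scores = np.array([
    [0.9, 1.0, 1.0],
    [0.2, 1.0, 0.0],
    [0.8, 0.0, 1.0],
    [0.1, 0.0, 0.0],
])
thumbs_up = np.array([1, 0, 1, 0])  # user feedback labels

calibrator = LogisticRegression().fit(dimension_scores, thumbs_up)
print(calibrator.coef_)                                   # learned importance per dimension
print(calibrator.predict_proba(dimension_scores)[:, 1])   # calibrated overall scores
```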
Does it work well with other languages? Right now we've trained it on a handful of languages. We're pretty soon going to release a model with multilingual capabilities. We don't support multimodal yet, but that is on our roadmap. So right now it's English and a few languages, and then we're going to expand beyond that.
We should kick off the workshop, and then we'll pass it around and answer all the questions as we go. Do you just want to share the doc? Again, a reminder: it's withpi.ai/workshop. I'll share the doc here as well so that you all can see it. Yeah, I'll also put it in the Slack channel for everybody. But please get started and let us know.
Sorry, some people may have already started working with the Colabs, I've seen, but I'll quickly show the others what the second phase of the exercise is about. Of course, we can continue all of this going forward. How do I project this?
So, the second part of the exercise: by the way, we may not have enough time to wrap it all up here, but feel free to do it on your own; the Colab is available for you. You can use this pre-prepared Colab. I don't know if you can see this, but basically this particular Colab will take you through multiple steps, and a lot of people ask me questions about how to use it in code, so I just wanted to quickly go over it. This is the basic installation. This is your scoring spec; this is where all of your intelligence is going to live. Again, this is a relatively simple example, so it's a relatively simple scoring spec. The way you get this spec is basically over here: you go into the code view, copy it, and put it into your Colab. So this is how you get the spec. It's all a natural language spec, with some Python code in there. This is what you'll change, this is what you'll tweak, this is where everything is going to be.
And what you're now doing in this particular Colab, which you can literally click through without doing anything more than that, or play around with a whole bunch, is this: first, we have some public datasets on Hugging Face which have the thumbs-up/thumbs-down data. This is the same data as in the sheet, if you've seen the sheet. The Colab is loading that data, running the scorer on it, returning your results, and then building a confusion matrix similar to the one you saw there, which indicates how well aligned your scores are. That's just getting you warmed up. Then you can use it to compare models; this is where the really interesting stuff starts happening. In this particular case, we're comparing 1.5 and 2.5 models, and you can see that 2.5 has a slightly higher score than 1.5 for this particular task. But because 2.5 was mostly about reasoning, there's not that much of a delta here. If you go to smaller models, like some of the mini models, say Claude Haiku, you'll see a much bigger delta in quality based on your own scoring system. So now you can use your scoring system for evaluating different models. What this is doing is taking about 10 examples, calling these five different models, generating responses, and then scoring them using our scoring system. So that's model comparison.
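A model-comparison loop of this kind can be sketched as below; the generate and score helpers are hypothetical placeholders for whatever model client and scoring system you use, and the model names are only examples.

```python
# Sketch of a model comparison: run the same examples through several models
# and compare average scores from your scoring system. `generate` and `score`
# are hypothetical placeholders, and the model names are only examples.
def compare_models(transcripts, models, generate, score):
    """transcripts: list of inputs; generate(model, text) -> summary;
    score(text, summary) -> float in [0, 1]. Returns average score per model."""
    results = {}
    for model in models:
        scores = [score(t, generate(model, t)) for t in transcripts]
        results[model] = sum(scores) / len(scores)
    return results

# Example usage (with your own generate/score implementations):
# compare_models(transcripts, ["gemini-1.5-flash", "gemini-2.5-flash"],
#                generate=my_generate, score=my_score)
```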
The other piece is trying different prompts. People change their system prompts, but they're worried about what will happen when they change them. This is the right way to make sure you're not regressing: you have your scoring spec, you try different system prompts, and you see if your scores go down on your test set. Again, we take 10 examples just to demonstrate how you compare them. We created a bad prompt and a good prompt just to accentuate this, so the bad prompt gets a much lower score than the good prompt on this particular task. And this last one is the one I'm very excited about, because it brings you into the online world and how to actually do this online: you take one particular transcript and test it with different numbers of samples. If you use just one sample, which is typically what you do, you generate one response with a temperature of 0.7 and you get a particular score. But as you increase the number of samples, what it's doing behind the scenes is creating three or four of those responses, each one a different response, then ranking them using the Pi scoring system you just built and picking the one that's best. And what you'll see is that as you increase the number of samples, the score steadily goes up and the response quality goes up. So you can literally click through this in the Colab and run through it, and you won't need anything else, but there are a bunch of options for you to play around with, now or later. I just wanted to introduce this to you before you all disappear.