Evals Are Not Unit Tests — Ido Pesok, Vercel v0

Channel: aiDotEngineer

Published at: 2025-08-06

YouTube video id: L8OoYeDI_ls

Source: https://www.youtube.com/watch?v=L8OoYeDI_ls

My name is Ido. I'm an engineer at Vercel working on v0. If you don't know, v0 is a full-stack vibe coding platform. It's the easiest and fastest way to prototype, build on the web, and express new ideas. Here are some examples of cool things people have built and shared on Twitter.
And to catch you up, we recently launched GitHub sync, so you can now push generated code to GitHub directly from v0. You can also automatically pull changes from GitHub into your chat, switch branches, and open PRs to collaborate with your team. I'm very excited to announce that we recently crossed 100 million messages sent, and we're really excited to keep growing from here.
My goal for this talk is for it to be an introduction to evals, specifically at the application layer. You may be used to evals at the model layer, which is what the research labs cite in model releases, but this is a focus on what evals mean for your users, your apps, and your data. The model is now in the wild, out of the lab, and it needs to work for your use case. And to do this, I have a story. It's a story about an app called Fruit Letter Counter. If the name didn't already give it away, all it is is an app that counts the letters in fruit.
So, the vision: we'll make a logo with ChatGPT. There might be product-market fit already, because everyone on X is dying to know the number of letters in fruit. If you didn't get it, it's a joke on the "how many Rs are in strawberry" prompt. We'll have v0 make all the UI and backend, and then we can ship.
So we had v0 write the code. It used the AI SDK to do the streamText call, and what do you know? It worked first try. GPT-4.1 said three. And not only did it say three once, I even tested it twice and it worked both times in a row. So from there we're good to ship, right? Let's launch on Twitter: "Want to know how many letters are in a fruit? Just launched fruitlettercounter.io." The .com and .ai were taken.
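For reference, here's a minimal sketch of what that call might look like. This isn't the actual v0-generated code: it assumes the Vercel AI SDK with the OpenAI provider, `countLetters` is a hypothetical helper name, and it uses generateText (the simpler non-streaming sibling of the streamText call the app actually uses).

```ts
// Hypothetical sketch of the letter-counting call (not the actual app code).
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function countLetters(question: string): Promise<string> {
  // e.g. question = "How many Rs are in strawberry?"
  const { text } = await generateText({
    model: openai("gpt-4.1"),
    prompt: question,
  });
  return text;
}
```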
And yeah, everything was going great. We launched and deployed on Vercel, we had Fluid compute on, until we suddenly get this tweet. John said, "I asked how many Rs are in strawberry and it said two." So, of course, I had just tested it twice; how is this even possible? But I think you get where I'm going with this, which is that by nature, LLMs can be very unreliable. And this principle scales from a small letter-counting app all the way to the biggest AI apps in the world.
The reason why it's so important to recognize this is that no one is going to use something that doesn't work. It's literally unusable. And this is a significant challenge when you're building AI apps. I have a funny meme here, but basically AI apps have this unique property: they're very demo-savvy. You'll demo it, it looks super good, you'll show it to your co-workers, and then you ship to prod and suddenly hallucinations come and get you. So we always have this in the back of our heads when we're building.
So, back to where we were. Let's not give up, right? We actually want to solve this for our users. We want to make a really good fruit-letter-counting app. So you might ask, how do we make reliable software that uses LLMs? Our initial prompt was a simple question, right? But maybe we can try prompt engineering. Maybe we can add some chain of thought, something else to make it more reliable. So, we spend all night working on this new prompt: "You're an exuberant, fruit-loving AI on an epic quest..." And this time we actually tested it 10 times in a row on ChatGPT. And it worked every single time. 10 times in a row. It's amazing.
So, we ship, and everything was going great until John tweeted at me again and said, "I asked how many Rs are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry, and it said five."
So, we failed John again. Although this example is pretty simple, this is actually what will happen when you start deploying to production. You'll get users that come up with queries you could have never imagined, and you actually have to start thinking about how to solve them.
An interesting thing, if you think about it, is that 95% of our app works 100% of the time. We can have unit tests for every single function, end-to-end tests for the auth, the login, the sign-out, and it will all work. But it's that most crucial 5% that can fail on us. So let's improve it.
Now, to visualize this, I have a diagram for you. Hopefully you can see the code; maybe I need to make my screen brighter. Okay, well, we'll come back to this. But basically, we're going to start building evals. And to visualize this, I have a basketball court. Today's day one of the NBA finals, I don't know if you care. You don't need to know much about basketball, just know that someone is trying to throw a ball into the basket. Here the basket is the glowing golden circle. Blue will represent a shot made and red will represent a shot missed. One property to consider is that the farther away your shot is from the basket, the harder it is. Another property is that the court has boundaries. So for this blue dot, although the shot goes in, it's outside the court, so it doesn't really count in the game. Let's start plotting our data.
So here we have a question: how many Rs in strawberry? After our new prompt, this will probably work, so we'll label it blue, and we'll put it close to the basket because it's pretty easy. However, "how many Rs are in that big array?" we'll label red, and we'll put it farther away from the basket. Hopefully you can see that; maybe we can make it a little bit brighter. But this is the data part of our eval. Basically, you're trying to collect what prompts your users are asking, and you want to store this over time, keep building it, and record where these points are on your court.
Two more prompts I want to bring up. What if someone says, "how many Rs are in strawberry, pineapple, dragon fruit, mango after we replace all the vowels with Rs?" Insane prompt, but still technically in our domain, so we'll label it red, all the way down there. But a funny one is, "how many syllables are in carrot?" This we'll call out of bounds. None of our users are actually going to ask it; it's not part of our app, so no one is going to care.
I hope you can see the code, but basically, when you're making evals, here's how you can think about it. Your data is the point on the court. Your shot, or in this case what Braintrust calls a task, is the way you shoot the ball towards the basket. And your score is basically a check of whether it went in the basket or not.
To make good evals, you must understand your court. This is the most important step.
And you have to be careful of falling into some traps. First is the out-of-bounds trap: don't spend time making evals for data your users don't care about. You have enough problems, I promise you, with queries your users do care about. So be careful not to feel productive just because you're making a lot of evals when they're not really applicable to your app. Another visualization: don't have a concentrated set of points. When you really understand your court, you'll understand where the boundaries are, and you want to make sure you test across the entire court.
A lot of people have been talking about this today, but to collect as much data as possible, here are some things you can do. First, collect thumbs-up / thumbs-down data. This can be noisy, but it can also be really good signal as to where your app is struggling. Another thing: if you have observability, which is highly recommended, you can just read through random samples in your logs. Users might not be giving you explicit signal, but if you take a hundred random samples and go through them once a week, you'll get a really good understanding of how your users are using the product. If you have community forums, these are also great; people will often report issues they're having with the LLM. X/Twitter is also great, but can be noisy. And there really is no shortcut here. You really have to do the work and understand what your court looks like.
So here is what it should look like if you're doing a good job of understanding your court and building your data set. You should know the boundaries, you should be testing within your boundaries, and you should understand where your system has blue versus where it has red. Here it's really easy to tell: okay, maybe next week we need to prioritize the team to work on that bottom-right corner. This is somewhere a lot of users are struggling, and we can really do a good job of flipping the tiles from red to blue.
Another thing you can do, and I really hope you can see this, is put constants in the data and variables in the task. Just like in math or programming, you want to factor out constants to improve clarity, reuse, and generalization. Let's say you want to test your system prompt. Keep as constant the data your users are going to ask. For example, "how many Rs in strawberry" goes in the data; that's a constant, it's never going to change throughout your app. What you're going to test goes in the task: there you'll try different system prompts, different pre-processing, different RAG. This way your app actually scales, and you never have to redo all your data when, say, you change your system prompt. This is a really nice feature of Braintrust.
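As a sketch of that factoring (hypothetical names, and assuming the Braintrust Eval API from before): the question/expected pairs stay fixed in the data, while the system prompt you're experimenting with lives only in the task.

```ts
// Constants in data, variables in the task: the questions never change, so
// they live in the data set; the system prompt is the thing under test, so
// it's a parameter of the task. Names and prompts are illustrative.
import { Eval } from "braintrust";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const SYSTEM_PROMPT_UNDER_TEST =
  "You're an exuberant, fruit-loving AI on an epic quest..."; // swap per experiment

Eval("Fruit Letter Counter - prompt experiment", {
  data: () => [
    { input: "How many Rs are in strawberry?", expected: "3" }, // constant
  ],
  task: async (input) => {
    const { text } = await generateText({
      model: openai("gpt-4.1"),
      system: SYSTEM_PROMPT_UNDER_TEST, // the variable
      prompt: input,
    });
    return text;
  },
  scores: [
    ({ output, expected }) => ({
      name: "correct_count",
      score: output.includes(expected ?? "") ? 1 : 0,
    }),
  ],
});
```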
And if you don't know, the AI SDK actually offers a thing called middleware, and it's a really good abstraction to put basically all your pre-processing logic in: RAG, the system prompt, and so on. You can then share it between the actual API route that's doing the completion and your evals. If you think about the basketball court as practice, and we're trying to practice our system across different models, you want your practice to be as similar as possible to the real game. That's what makes a good practice. So you want to share pretty much the exact same code between your evals and what you're actually running.
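Here's a rough sketch of that middleware idea. It assumes a recent AI SDK version where wrapLanguageModel and LanguageModelV1Middleware are available; the middleware name and prompt are illustrative, and in practice this is also where RAG context would be injected.

```ts
// Sketch: put shared pre-processing in AI SDK middleware, then use the same
// wrapped model in both the production route and the eval task, so "practice"
// runs the exact same code as the real game. Names are illustrative.
import { wrapLanguageModel, type LanguageModelV1Middleware } from "ai";
import { openai } from "@ai-sdk/openai";

const fruitMiddleware: LanguageModelV1Middleware = {
  // Prepend the system prompt; RAG context and other pre-processing go here too.
  transformParams: async ({ params }) => ({
    ...params,
    prompt: [
      { role: "system", content: "You're an exuberant, fruit-loving AI..." },
      ...params.prompt,
    ],
  }),
};

// Shared by the API route (streamText) and the eval task.
export const fruitModel = wrapLanguageModel({
  model: openai("gpt-4.1"),
  middleware: fruitMiddleware,
});
```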
Now, I want to talk a little bit about scores, which is the last step of the eval. The unfortunate thing is that scoring varies greatly depending on your domain. In this case it's super simple: you're just checking whether the output contains the correct number of letters. But if you're doing a task like writing, it's very, very difficult.
As a principle, you want to lean towards deterministic scoring and pass/fail. This is because when you're debugging, you're going to get a ton of input and logs, and you want to make it as easy as possible to figure out what's going wrong. If you overengineer your score, it might be very difficult to share your evals with your team and distribute them across different teams, because no one will understand how these things are getting scored. Keep your scores as simple as possible. A good question to ask yourself when you're looking at the data is: what am I looking for to see if this failed? With v0, we're looking for whether the code didn't work. But maybe for writing, you're looking for certain linguistic qualities.
Ask yourself that question and write the code that looks for it. There are some cases where it's so hard to write the code that you may need to do human review, and that's okay. At the end of the day, you want to build your court and you want to collect signal, even if you must do human review to get the correct signal. Don't worry: if you do the correct practice, it will pay off in the long run and you'll get better results for your users.
One trick you can do for scoring is don't be scared to add a little bit of extra prompt to the original prompt. For example, here we can say: output your final answer in these answer tags. What this does is make it very easy for you to do string matching. In production you don't really want this, but you can do some little tweaks to your prompts so that scoring is easier.
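For instance, a hedged sketch of that trick: an eval-only suffix asks for the answer in tags, and the scorer becomes a simple deterministic string match (tag format and names are illustrative).

```ts
// Eval-only prompt tweak plus a deterministic pass/fail scorer (illustrative names).
const EVAL_SUFFIX =
  "\n\nOutput your final answer inside <answer></answer> tags.";

// Append EVAL_SUFFIX to the prompt in the eval task (not in production),
// then score with a plain string match on the extracted answer.
function correctAnswer({ output, expected }: { output: string; expected: string }) {
  const match = output.match(/<answer>\s*([\s\S]*?)\s*<\/answer>/);
  const answer = match ? match[1] : output.trim();
  return { name: "correct_answer", score: answer === expected ? 1 : 0 };
}
```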
Another thing we really highly recommend is adding evals to your CI. Braintrust is really nice here because you can get these eval reports: it will run your task across all your data and then give you a report at the end with the improvements and regressions. Say my colleague made a PR that changes a bit of the prompt. We want to know how it did across the court, right? Visualize it: did it change more tiles from red to blue? Maybe our new prompt fixed one part but broke another part of our app. So this is a really useful report to have when you're reviewing PRs.
So yeah, going back, this is the summary of the talk. You want to make your evals a court of your data, and you can treat it like practice. Your model is basically going to practice. Maybe you want to switch players: when you switch models, you can see how a different player is going to perform in your practice. This gives you such a good understanding of how your system is doing when you change things like your RAG or your system prompt. And you can now go to your colleague and say, "Hey, this actually did help our app." Because improvement without measurement is limited and imprecise, and evals give you the clarity you need to systematically improve your app.
When you do that, you're going to get better reliability and quality, higher conversion and retention, and you also get to spend less time on support and ops, because your evals, your practice environment, will take care of that for you. And if you're wondering how I built all these court diagrams, I actually just used v0; it made me an app where I just added these shots made and missed at the basket. So, yeah, thank you very much. I hope you learned a little bit about evals. Thank you.
So, we do have some time for some questions. There are two mics, one over here, one over there. We can take two or three of those, please, if anybody's interested in asking.
We have one over there. Mic five, please. Or you can repeat the question if you don't mind.
Yeah, you can think of it really like practice. Maybe a basketball player will in general score, say, 90%, but they might miss more shots here or there. We run it at least every day, and then we get a good sense of where we're actually failing: did we have some regression? So running it daily, or at least on some schedule, will give you a good idea. I was also thinking, what if you ran the same question through it five times, right? What's the percentage? Is it making four out of five, or five out of five? So it's definitely the case that as you go further away, the harder the questions get...