Evals Are Not Unit Tests — Ido Pesok, Vercel v0
Channel: aiDotEngineer
Published at: 2025-08-06
YouTube video id: L8OoYeDI_ls
Source: https://www.youtube.com/watch?v=L8OoYeDI_ls
My name is Ido. I'm an engineer at Vercel working on v0. If you don't know, v0 is a full-stack vibe coding platform. It's the easiest and fastest way to prototype, build on the web, and express new ideas. Here are some examples of cool things people have built and shared on Twitter. To catch you up, we recently launched GitHub sync, so you can now push generated code to GitHub directly from v0, automatically pull changes from GitHub into your chat, and, furthermore, switch branches and open PRs to collaborate with your team. I'm also very excited to announce that we recently crossed 100 million messages sent, and we're excited to keep growing from here.

My goal for this talk is for it to be an introduction to evals, specifically at the application layer. You may be used to the evals that research labs cite in model releases, but this will focus on what evals mean for your users, your apps, and your data. The model is now in the wild, out of the lab, and it needs to work for your use case.

To do this, I have a story. It's a story about an app called Fruit Letter Counter. If the name didn't already give it away, all it is is an app that counts the letters in fruit. The vision: we'll make a logo with ChatGPT. There might be product-market fit already, because everyone on X is dying to know the number of letters in fruit. (If you didn't get it, it's a joke on the "how many Rs are in strawberry" prompt.) We'll have v0 make all the UI and backend, and then we can ship.

So we had v0 write the code. It used the AI SDK to make the streamText call, and what do you know, it worked on the first try: GPT-4.1 said three. And not only did it say three once, I even tested it twice, and it worked both times in a row. So from there we're good to ship, right? Let's launch on Twitter: "Want to know how many letters are in a fruit? Just launched fruitlettercounter.io." The .com and .ai were taken. And yeah, everything was going great. We launched and deployed on Vercel, we had Fluid compute on, until we suddenly got this tweet. John said, "I asked how many Rs in strawberry and it said two." But of course, I just tested it twice. How is this even possible?

I think you get where I'm going with this, which is that by nature, LLMs can be very unreliable. And this principle scales from a small letter-counting app all the way to the biggest AI apps in the world. The reason it's so important to recognize this is that no one is going to use something that doesn't work. It's literally unusable. And this is a significant challenge when you're building AI apps. I have a funny meme here, but basically AI apps have this unique property: they're very demo-savvy. You'll demo it, it looks super good, you'll show it to your co-workers, and then you ship to prod, and suddenly hallucinations come and get you. We always have this in the back of our heads when we're building.

Back to where we were. Let's not give up; we actually want to solve this for our users. We want to make a really good fruit-letter-counting app. So you might ask: how do we make reliable software that uses LLMs? Our initial prompt was a simple question, but maybe we can try prompt engineering. Maybe we can add some chain of thought, or something else, to make it more reliable. So we spend all night working on this new prompt: "You're an exuberant, fruit-loving AI on an epic quest..."
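The talk doesn't show the generated route itself, so here is only a rough sketch of what a streamText route with that new system prompt might look like. The handler shape, the question field, and the SYSTEM_PROMPT constant are my assumptions, not v0's actual output; GPT-4.1 and the AI SDK are from the talk.

```ts
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";

// Illustrative system prompt; the real one was the product of a long night.
const SYSTEM_PROMPT =
  "You're an exuberant, fruit-loving AI on an epic quest ...";

// Hypothetical App Router handler for the letter-counting endpoint.
export async function POST(req: Request) {
  const { question } = await req.json(); // e.g. "How many Rs are in strawberry?"

  const result = streamText({
    model: openai("gpt-4.1"),
    system: SYSTEM_PROMPT,
    prompt: question,
  });

  // Stream the model's answer back to the client.
  return result.toTextStreamResponse();
}
```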
This time we actually tested it ten times in a row in ChatGPT, and it worked every single time. Ten times in a row. It's amazing. So we ship, and everything was going great until John tweeted at me again: "I asked how many Rs are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry, and it said five." So we failed John again.

This example is pretty simple, but this is what actually happens when you start deploying to production. You'll get users who come up with queries you could never have imagined, and you have to start thinking about how to solve them. An interesting thing, if you think about it, is that 95% of our app works 100% of the time. We can have unit tests for every single function, end-to-end tests for the auth, the login, the sign-out, and it will all work. But it's that most crucial 5% that can fail on us. So let's improve it.

Now, to visualize this, I have a diagram for you: a basketball court. Today is day one of the NBA Finals, if you care. You don't need to know much about basketball; just know that someone is trying to throw a ball into the basket, and here the basket is the glowing golden circle. Blue will represent a made shot and red will represent a missed shot. One property to consider is that the farther your shot is from the basket, the harder it is. Another property is that the court has boundaries: this blue dot's shot goes in, but it's outside the court, so it doesn't really count in the game.

Let's start plotting our data. Here we have a question: how many Rs in strawberry? After our new prompt, this will probably work, so we'll label it blue and put it close to the basket, because it's pretty easy. However, "how many Rs are in that big array?" we'll label red and put farther away from the basket. This is the data part of our eval. Basically, you're trying to collect the prompts your users are asking, store them over time, keep building the set, and record where these points sit on your court.

Two more prompts I want to bring up. What if someone asks: how many Rs are in strawberry, pineapple, dragon fruit, mango, after we replace all the vowels with Rs? An insane prompt, but still technically in our domain, so we'll label it red, all the way down there. A funny one, though: how many syllables are in carrot? This we'll call out of bounds. None of our users are actually going to ask it; it's not part of our app, so no one is going to care.

When you're making evals, here's how you can think about it. Your data is the points on the court. Your shot (in Braintrust they call it a task) is the way you shoot the ball toward the basket. And your score is basically a check: did it go in the basket or not? To make good evals, you must understand your court. This is the most important step, and you have to be careful of falling into some traps. First is the out-of-bounds trap.
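To make the data/task/score split concrete, here is a minimal sketch using Braintrust's TypeScript SDK, since that's where the talk runs its evals. The project name, the countRs wrapper, and the expected values are my assumptions for illustration.

```ts
import { Eval } from "braintrust";
import { countRs } from "./fruit"; // hypothetical wrapper around the app's completion call

Eval("fruit-letter-counter", {
  // Data: the points on the court, i.e. real user prompts plus expected answers.
  data: () => [
    { input: "How many Rs in strawberry?", expected: "3" },
    {
      input:
        "How many Rs are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry?",
      expected: "8", // 3 + 0 + 0 + 0 + 0 + 2 + 0 + 3
    },
  ],
  // Task: the shot, i.e. how we run the system against each input.
  task: async (input) => countRs(input),
  // Score: did the ball go in the basket? Deterministic pass/fail.
  scores: [
    ({ output, expected }) => ({
      name: "exact_match",
      score: output.trim() === expected ? 1 : 0,
    }),
  ],
});
```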
Don't spend time making evals for data your users don't care about. You have enough problems, I promise you, with queries your users do care about. So be careful: you might feel productive because you're making a lot of evals, but they're not really applicable to your app. Another visualization: don't have a concentrated set of points. When you really understand your court, you're going to know where the boundaries are, and you want to make sure you test across the entire court.

A lot of people have been talking about this today, but to collect as much data as possible, here are some things you can do. First, collect thumbs-up/thumbs-down data. This can be noisy, but it can also be really, really good signal as to where your app is struggling. Another thing: if you have observability, which is highly recommended, you can just read through random samples in your logs. Users might not be explicitly giving you signal, but if you take a hundred random samples and go through them once a week, you'll get a really good understanding of what your users are doing and how they're using the product. If you have community forums, these are also great; people will often report issues they're having with the LLM. X/Twitter is also great, but can be noisy. There really is no shortcut here. You really have to do the work and understand what your court looks like.

Here's what it should look like if you're doing a good job of understanding your court and building your data set. You should know the boundaries, you should be testing within them, and you should understand where your system has blue versus where it has red. Then it's really easy to tell: okay, maybe next week we need to prioritize the team working on that bottom-right corner. That's somewhere a lot of users are struggling, and we can really do a good job of flipping the tiles from red to blue.

Another thing you can do is put constants in the data and variables in the task. Just like in math or programming, you want to factor out constants; it improves clarity, reuse, and generalization. Say you want to test your system prompt. Keep the data constant: the things your users are going to ask. For example, "how many Rs in strawberry" goes in the data; that's a constant, it's never going to change throughout your app. What you vary is the task: you might try different system prompts, different pre-processing, different RAG, and that's what you want to put in your task section. This way your app actually scales, and you never have to redo all your data when you, say, change your system prompt. This is a really nice feature of Braintrust.

And if you don't know, the AI SDK actually offers a thing called middleware, and it's a really good abstraction for basically all your pre-processing logic: RAG, system prompt, and so on. You can then share this between the API route that's actually doing the completion and your evals; see the sketch after this paragraph. If you think about the basketball court as practice, where we're practicing our system across different models, you want your practice to be as similar as possible to the real game. That's what makes good practice.
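Concretely, that sharing might look something like this. This is a sketch assuming AI SDK 4's wrapLanguageModel and LanguageModelV1Middleware (exact names vary by SDK version), and FRUIT_SYSTEM_PROMPT is illustrative.

```ts
import { openai } from "@ai-sdk/openai";
import { wrapLanguageModel, type LanguageModelV1Middleware } from "ai";

const FRUIT_SYSTEM_PROMPT =
  "You're an exuberant, fruit-loving AI on an epic quest ...";

// All pre-processing (system prompt, RAG retrieval, etc.) lives in one middleware,
// so the production route and the evals run the exact same code path.
const fruitMiddleware: LanguageModelV1Middleware = {
  transformParams: async ({ params }) => ({
    ...params,
    prompt: [
      { role: "system", content: FRUIT_SYSTEM_PROMPT },
      ...params.prompt,
    ],
  }),
};

// Import this wrapped model from both the API route and the eval task.
export const fruitModel = wrapLanguageModel({
  model: openai("gpt-4.1"),
  middleware: fruitMiddleware,
});
```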
In other words, you want to share pretty much the exact same code between your evals and what you're actually running.

Now, I want to talk a little bit about scores, which is the last step of the eval. The unfortunate thing is that scoring varies greatly depending on your domain. In this case it's super simple: you're just checking whether the output contains the correct number of letters. But a task like writing is very, very difficult to score. As a principle, you want to lean toward deterministic scoring and pass/fail. This is because when you're debugging, you're going to get a ton of inputs and logs, and you want to make it as easy as possible to figure out what's actually going wrong. If you over-engineer your score, it might be very difficult to share your evals with your team and distribute them across different teams, because no one will understand how these things are getting scored. Keep your scores as simple as possible.

A good question to ask yourself is: when I'm looking at the data, what am I looking for to see if this failed? With v0, we're looking for whether the code didn't work. For writing, maybe you're looking for certain linguistics. Ask yourself that question, and write the code that looks for it. There are some cases where it's so hard to write the code that you may need to do human review, and that's okay. At the end of the day, you want to build your court and collect signal, even if you must do human review to get the correct signal. If you do the correct practice, it will pay off in the long run, and you'll get better results for your users.

One trick you can do for scoring: don't be scared to add a little extra to the original prompt. For example, here we can say "output your final answer in these answer tags." What this does is make it very easy for you to do string matching. In production you don't really want this, but you can make little tweaks to your prompts so that scoring is easier; a sketch follows.
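A minimal sketch of that answer-tag trick; the tag name, the instruction wording, and the scorer shape are my assumptions.

```ts
// Appended to the eval prompt only, not the production prompt.
const ANSWER_TAG_INSTRUCTION =
  "Output your final answer inside <answer></answer> tags.";

// Deterministic pass/fail scorer: extract the tag contents, then string-match.
function exactAnswer({ output, expected }: { output: string; expected: string }) {
  const match = output.match(/<answer>([\s\S]*?)<\/answer>/);
  const answer = match?.[1].trim() ?? "";
  return { name: "exact_answer", score: answer === expected ? 1 : 0 };
}
```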
Another thing we really highly recommend is adding evals to your CI. Braintrust is really nice here because you can get these eval reports: it will run your task across all your data and then give you a report at the end with the improvements and regressions. Say my colleague made a PR that changes a bit of the prompt. We want to know how it did across the court. Did it change more tiles from red to blue? Maybe the new prompt fixed one part but broke another part of our app. This is a really useful report to have when you're doing PRs.

So, going back, this is the summary of the talk. You want to make your data the core of your evals, and you can treat evals like practice. Your model is basically going to practice. Maybe you want to switch players: when you switch models, you can see how a different player is going to perform in your practice. This gives you such a good understanding of how your system is doing when you change things like your RAG or your system prompt. And you can now go to your colleague and say, "Hey, this actually did help our app." Because improvement without measurement is limited and imprecise, and evals give you the clarity you need to systematically improve your app.

When you do that, you're going to get better reliability and quality, higher conversion and retention, and you also get to spend less time on support and ops, because your evals, your practice environment, will take care of that for you. And if you're wondering how I built all these court diagrams, I actually just used v0; it made me an app where I added these made and missed shots around the basket. So, yeah, thank you very much. I hope you learned a little bit about evals. Thank you.

We do have some time for some questions. There are two mics, one over here, one over there. We can take two or three of those, please, if anybody's interested in asking. We have one over there: mic five, please. Or you can repeat the question as well, if you don't mind.

Yeah, you can think of it like practice. A basketball player might in general score 90%, but they might miss more shots here or there. We run our evals at least every day, and then we get a good sense of where we're actually failing. Did we have some regression? Running it daily, or at least on some schedule, will give you a good idea.

I was thinking, what if you ran the same question through it five times? What's the percentage, like making it four out of five, or five out of five? So it's definitely the case that as you go farther away, the harder the questions get...