AI Engineer World's Fair 2025 - Evals
Channel: aiDotEngineer
Published at: 2025-06-06
YouTube video id: Vqsfn9rWXR8
Source: https://www.youtube.com/watch?v=Vqsfn9rWXR8
Testing. Hello. Okay, when you're ready. One, two, one, two, three. Look at your data. Look at your data. Great to meet you. So, big fan of DSPy. I think I've come across your Twitter. Yeah. I tried to do the same in TypeScript and failed. So now that I have a good portion of... I have something funny here. I wonder if you're all okay with it, or we don't have to do it, but I asked the AI to give us the intros. And one of the funny things I asked it to do is tell me the pronunciation for the names. So I'd be curious which AI model does better. It'll also be a way that I can interact with the audience, like, all right, pick the AI model that you want, to try to make the thing funny. Is that okay with everybody? Sounds good. All right, cool. I don't know if I'll have to connect it, but I can always just use it and say it out loud. So that's cool. For sure. All right, cool. And no guarantees it did a good job. Yeah, I think that's the point. Yeah. And it probably might have been my entry, so feel free to call that out: okay, that thing was wrong about this. Hello. Hello. For example, let's try Omar real quick. Where's your mic? Right here. It's invisible. Invisible, bro. Nice. There you go. I don't know if it used the web or what other things. I think we submitted... Yeah, I submitted it; I included whatever you provided. Yeah, that's cheating. But some of them... I want to turn your belt pack off before your presentation. Radio frequencies. Yours is fine. Mine is good. Uh, should I twist it, or... Yeah. You've done a lot of these, I bet. Have you done a lot of these? It's been a while. I didn't know you worked at Databricks. Is that recent? Been there for a year. Yeah, I must have just missed it. Cool. Nice. Check. Check. Check. Do you want me to head back or do I stay? There are some cans there. I think that's the water. Super stressful, to be honest. It's all good. Just be yourself. Ignore the crap; it usually ends up okay. Look at the last row. You know that makes it look like you're looking at everyone. Oh yeah, I predict a lot. Maybe we contradict each other. That would be interesting. That would be even more interesting. Well, he's going to contradict us for sure. Ours are going to be pretty complementary, I think, because ours are like, okay, we did a lot of unit-test-like evals and then we regretted it, and then yours comes after, like, evals are not unit tests, duh. So they complement each other. Yeah. We just do a bunch of different types of evals now, more like trajectory, full-conversation evals and stuff like that. Oh yeah, 18 people watching. 18 people, bro. I'm famous. [Laughter] Wow. Test, test, test. Hello. Hello. Test, test. Hello. How was your flight? Good. Good flight. Yeah. Very long. Oh, I actually live in Spain now, so it's a very brutal flight. Like 12 hours. Damn. Yep. Brutal. Yeah, it's rough. So I was like, I want to go to France. Yeah, that's a good move. Good move. Yeah. The shortest way from the United States to Europe. Yeah, indeed. Uh, no, I guess it's pretty much more or less the same. Maybe slightly shorter. Maybe there are better flights. [Music] Because totally, they told you the dream, but it was a lie. Makes sense. You're good. You nervous? I thought it was in the big room. Yeah. Oh yeah. No, I would be melting now if I was going to present in that room.
I would be flying back to Poland. Like, sorry guys, I'm sick. I'm changing jobs. I'm going to be an electrician. Yeah. Okay. So if we have trouble with the AV, you take over, because you're going to see it better. I thought ours might be right at the limit, I feel. Yeah. At the very beginning it was like half an hour, and then we started trimming stuff, removing slides. Really? Yeah. I don't think we will have time for the Q&A. Ain't nobody got time for Q&A. Move on to the next talks. Yes. You cannot optimize for every domain, right? Indeed. [Music] Exactly. Yeah. It's really difficult. We mostly bring the average for everyone higher, but it's still not super good at any specific vertical, so we struggle with that a lot. Yeah. It's for the normies, so it gets people hooked into it, you know. And I think for us, really, they're just for gauging; otherwise you don't know if you're improving. Oh yeah, we have something really pretty similar. Yeah, actually, we changed the model from Sonnet to a newer Sonnet and the scores were all over the board, like, what the [ __ ] is this? Yeah, even though it was obviously the best model, because we were already using it in Cursor and everything. And hey, how you doing? How are you? Nice to meet you. Yeah, he works at... Oh, nice to meet you. I used to work with these guys. Yeah, I'm excited for your talk. And Braintrust is MCing this whole track. Uh yeah, almost up there. They just got the whole track. Yeah. Cool. Yeah, makes sense. I mean, not if you're a competitor, but... Yeah. True. Y'all ready? Yeah. No. Have you given a talk before? Not like this fancy. No. I've done a lot of online talks. Yeah. When you did the course, you did. Yeah. And that's where a lot of the content came from, so much easier. I just don't look at any face. I just look at my presenter notes and I just zone out. I'm going to try to do the same, I think. Did you repurpose the content? A lot of it was repurposed and then there's some new stuff. The good stuff, right? Yeah. The good stuff repurposed, and then some new stuff. Cool. So it's a mix, half and half. You'll get longer and longer prompting guides for the latest models that are supposed to be, you know, closer and closer to AGI. And if you're less lucky, you have to figure that out on your own, right? If you're even less lucky, the prompting guides from the provider are not even that good, so you have to actually figure out what the right thing is. And every day, maybe at an even faster pace, someone is releasing an arXiv paper or a tweet or something that introduces a new learning algorithm, maybe some reinforcement learning bells and whistles, maybe some prompting tricks, maybe a prompt optimization technique, something or other that promises to make your system learn better and fit your goals better. Someone else is introducing some search or scaling or inference strategies, or agent frameworks or agent architectures, that are promising to finally unlock levels of reliability or quality better than what you had before. And I think if you're actually doing a reasonable job, you think: I've got to stay on top of at least some of this stuff so that I don't fall behind.
And in many cases, model APIs actually change the model under the hood even though you're using the same name, so you're forced to scramble. And actually, I would say maybe the question isn't whether you will scramble every week; maybe a different question is: will you even get to scramble for long? If you think about the rate of progress of these LLMs, are they going to eat your lunch? These are, I think, questions that are on a lot of people's minds, and this is what the talk is going to address. So the talk mentions the bitter lesson, which sounds like really ancient old AI lore, but it's just six years old. The current Turing Award winner Rich Sutton, who's a pioneer of reinforcement learning, wrote this short essay on his website that basically says 70 years of AI have taught him, and other people in the AI community from his perspective, that when AI researchers leverage domain knowledge to solve problems, like, I don't know, chess or something, we build complicated methods that essentially don't scale, and we get stuck, and we get beat by methods that leverage scale a lot better. What seems to work better, according to Sutton, is general methods that scale, and he identifies search, which is not like retrieval, more like exploring large spaces, and learning, getting the system to understand its environment, as what works best. Search here is what we'd call in LLM land maybe inference-time scaling or something. I don't speak for Sutton, and I'm not suggesting that I have the right understanding of what he's saying, or that I necessarily agree or disagree, but I think this is a fundamental and important concept in this space. It raises interesting questions for us as people who build and engineer AI systems, because if leveraging domain knowledge is bad, what exactly is AI engineering supposed to be about? I mean, engineering is understanding your domain and working in it with a lot of human ingenuity in repeatable ways, let's say, or with principles. So are we just doomed? Are we just wasting our time? Why are we at an AI engineering fair? And I'll tell you how to resolve this. I've not really seen a lot of people discuss what Sutton is actually talking about, and a lot of people throw the bitter lesson around, so clearly somebody has to think about this, right? Sutton is talking about maximizing intelligence, which is something like the ability to figure things out in a new environment really fast. All of us probably care about that to some degree; I'm also an AI researcher. But when we're building AI systems, I think it's important to remember that the reason we build software is not that we lack AGI. The way to understand this is that we already have general intelligence everywhere; we have eight billion of them. They're unreliable, because that's what intelligence is, and they've not solved the problems that we want to solve with software. That's why we're building software. So we program software not because we lack AGI but because we want reliable, robust, controllable, scalable systems. And we want these to be things that we can reason about and understand at scale.
And actually, if you think about engineering and reliable systems, if you think about checks and balances, in any case where you try to systematize stuff, it's about subtracting agency and subtracting intelligence in exactly the right places, carefully, and not restricting the intelligence otherwise. So this is a very different axis from the kinds of lessons that you would draw from the bitter lesson. Now, that does not mean the bitter lesson is irrelevant. Let me tell you the precise way in which it's relevant. The first takeaway here is that scaling search and learning works best for intelligence. This is the right thing to do if you're an AI researcher interested in building agents that learn really well, really fast, in new environments. Don't hard-code stuff at all, unless you really have to. But in building AI systems, it's helpful to think about: sure, search and learning, but searching for what? What is your AI system even supposed to be doing? What is the fundamental problem that you're solving? It's not intelligence; it's something else. And what are you learning for? What is the system learning in order to do well? That is what you need to be engineering, not the specifics of search and not the specifics of learning, as I'll talk about in the rest of this talk. So Sutton is saying complicated methods get in the way of scaling, especially if you do it early, before you know what you're doing, essentially. Did we hear that before? I feel like I heard that back in the 1970s, although I wasn't around. This is the notion of structured programming, with Knuth's popular phrase in a paper: premature optimization is the root of all evil. I think this is the bitter lesson for software, and thereby also for AI software. So it's not that human ingenuity and human knowledge of the domain are harmful. It's that when you apply them prematurely, in ways that constrain your system and reflect poor understanding, they're bad. But you can't get away, in an engineering field, with not engineering your system. That's just quitting, right? So here's a little piece of code. If you follow me on X, on Twitter, you might recognize it, but otherwise it looks pretty opaque to me in about three seconds. I can't really look at this and tell exactly what it's doing, and I also honestly don't really care. So, lo and behold, this is computing a square root in a certain floating-point representation on an old machine. And the thing that jumps out at me immediately is that this is not the most future-proof program possible. If you change the machine architecture, different floating-point representations, better CPUs, first of all it'll be wrong, because it's just hard-coding some values here. And second of all, it'll probably be slower than a normal square root that maybe is a single instruction, or maybe the compiler has a really smart way of doing it, or a lot of other things that could be optimized for you, right? So someone who wrote this maybe had a good reason, maybe they didn't, but certainly if you're writing this kind of thing often, you're probably messing up as an engineer. So premature optimization is maybe the square root of all evil or something.
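For context, the kind of code being described is a bit-level square-root routine tied to one particular floating-point representation. Here is a small Python re-creation of the classic fast inverse square root hack in that spirit; this is an illustrative sketch, not necessarily the exact snippet shown on the slide.

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    # Reinterpret the float's bits as a 32-bit integer (assumes IEEE-754 single precision).
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # A magic constant plus a bit shift gives a rough first guess at 1/sqrt(x).
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the estimate.
    return y * (1.5 - 0.5 * x * y * y)

print(1.0 / fast_inv_sqrt(2.0))  # roughly sqrt(2), with the representation hard-coded in
```

The magic number and the assumed bit layout are exactly the kind of low-level commitments the talk warns about: change the representation and the trick stops making sense.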
But what counts as premature? That's kind of the name of the game, right? We could just say the phrase, but it doesn't mean anything. I don't think any strategy will work in tech forever; nobody can anticipate what will happen in three years, five years, ten years. But I think you still have to have a conceptual model that you're working off of. And I happen to have built two things that are on the order of several years old that have fundamentally stayed the same over the years, from the days of BERT and text-davinci-002 up to 4o and o4-mini, and they're bigger now than they ever were. They're these stable, fundamental abstractions, or AI systems, around LLMs. So what gives? What has to happen in order to get something like ColBERT or something like DSPy in this ecosystem, something that endures a few years, which is like centuries in AI land? I'll try to reflect on this, and again, none of this is guaranteed to last forever. So here's my hypothesis. Premature optimization is what is happening if and only if you're hard-coding stuff at a lower level of abstraction than you can justify. If you want a square root, please just say: give me a square root. Don't start doing random bit shifts and bit manipulation that happens to appease your particular machine today. And actually, take a step back: do you even want a square root, or are you computing something even more general? Is there a way you could express that more general thing? Only go down to the lower level of abstraction if you've demonstrated that a higher level of abstraction is not good enough. So I think the bigger picture here is that applied machine learning, and definitely prompt engineering, has a huge issue here. Tighter coupling than necessary is known to be bad in software, but it's not really something we talk about when we're building machine learning systems. In fact, the name of the game in machine learning is usually: hey, this latest thing came out, let's rewrite everything so that we're working around that specific thing. And I tweeted about this a year ago, 13 months ago, last May in 2024, saying the bitter lesson is just an artifact of lacking good high-level ML abstractions. Scaling deep learning helps predictably, but after every paradigm shift, the best systems always include modular specializations, because we're trying to build software; we need those. And every time they basically look the same, and they should have been reusable, but they're not, because we're writing bad code. So here's a nice example just to demonstrate this. It's not special at all. Here's a 2006 paper whose title could really have been a paper from now: a modular approach for multilingual question answering. And here's a system architecture. It looks like your favorite multi-agent framework today, right? It has an execution manager, it has question analyzers and retrieval strategies over a bunch of corpora, and if you colored the figure, you would think it's a paper from maybe last year. Now here's the problem. It's a pretty figure, and the system, architecturally, is actually not that wrong. I'm not saying it's the perfect architecture, but in a normal software environment, you could actually just upgrade the machine, right?
Put it on new hardware, put it on a new operating system, and it would just work, and actually work reasonably well, because the architecture is not that bad. But we know that's not the case for these ML architectures, because they're not expressed in the right way. So I think I can express this most passionately against prompts. A prompt is a horrible abstraction for programming, and this needs to be fixed ASAP. I say for programming because it's actually not a horrible one for management. If you want to manage an employee or an agent, a prompt is reasonable; it's a Slack channel, you have a remote employee. If you want to be a pet trainer, working with tensors and objectives is a great way to iterate; that's how we build the models. But I want us to be able to also engineer AI systems, and for engineering and programming, a prompt is a horrible abstraction. Here's why. It's a stringly typed canvas, just a big blur, no structure whatsoever, even if structure actually exists in a latent way. It couples and entangles the fundamental task definition you want to state, which is the really important stuff, the thing you're engineering, with random overfitted, half-baked decisions: hey, this LLM responded to this language when I talked to it this way, or I put in this example to demonstrate my point and it kind of clicked for this model, so I'll just keep it in. And there's no way to really tell the difference: what was the fundamental thing you're solving, and what was the random trick you applied? It's like the square root thing, except you don't call it a square root, and we just have to stare at it and wonder, wait, why are we shifting to the left by five bits? You're also inducing the inference-time strategy, which is changing every few weeks, people are proposing stuff all the time, and you're baking it, literally entangling it, into your system. If it's an agent, your prompt is telling it it's an agent. Your system has no business knowing whether it's an agent or a reasoning system or whatever. What are you actually trying to solve? It's like writing a square root function and then saying, hey, here is the layout of the structs in memory. You're also talking about formatting and parsing: write in XML, produce JSON, whatever. Again, that's really none of your business most of the time. So you want to write a human-readable spec, but you're saying things like: do not ignore this, generate XML, answer in JSON, you are Professor Einstein, a wise expert in the field of..., I'll tip you $1,000. That is just not engineering, guys. So what should we do? Trusty old separation of concerns, I think, is the answer. Your job as an engineer is to invest in your actual system design, starting with the spec. The spec, unfortunately or fortunately, cannot be reduced to one thing. And this is the time I'll talk about evals; I know everyone hears about evals, and this is the one line about evals that makes this talk about evals. A lot of the time you want to invest in natural language descriptions, because that is the power of this new framework. Natural language definitions are not prompts. They are highly localized pieces of ambiguous stuff that could not have been said in any other way. I can't tell the system certain things except in English, so I'll say it in English.
But a lot of the time I'm actually iterating to appease a certain model and to make it perform well relative to some criteria I have, not telling it the criteria, just tinkering with things. Evals are the way to do this, because evals say: here's what I actually care about. Change the model; the evals are still what I care about. They're a fundamental thing. Now, evals are not for everything. If you try to use evals to define the core behavior of your system, you will not get far; induction, learning from data, is a lot harder than following instructions. So you need to have both. Code is another thing that you need. A lot of people say, just ask it to do the thing. Well, who's going to define the tools? Who's going to define the structure? How do you handle information flow? Things that are private should not flow into the wrong places, right? You need to control these things. How do you apply function composition? LLMs are horrible at composition, because neural networks essentially don't learn things that reliably, whereas function composition in software is basically always perfectly reliable, by construction. So a lot of things are often best delegated to code. But it's hard, and it's really important that you can actually juggle and combine these things, and you need a canvas that allows you to combine them. The criterion for a good canvas is that it should allow you to express those three things in a way that's highly streamlined, and in a way that is decoupled from, not entangled with, models that are changing. I should just be able to hot-swap models. Inference strategies that are changing: hey, I want to switch from chain of thought to an agent, I want to switch from an agent to Monte Carlo tree search, whatever the latest thing is, I should be able to just do that. And new learning algorithms. This is really important. We talked about learning, but learning is always happening at the level of your entire system if you're engineering it, or at least you've got to be thinking about it that way, where you're saying: I want the whole thing to work as a whole for my problem, not for some general default. That's what the evals here are going to be doing. And you want a way of expressing this that allows you to do reinforcement learning, but also prompt optimization, and any of these things, at the level of abstraction that you're actually working with. So the second takeaway is that you should invest in defining things specific to your AI system and decouple them from the lower-level swappable pieces, because those will expire faster than ever. I'll just conclude by telling you that we've been building for three years this DSPy framework, which is the only framework that actually decouples your job, which is writing the AI software, from our job, which is giving you powerful, evolving toolkits for learning and for search, which is scaling, and for swapping LLMs through adapters. There's only one concept you have to learn. It is a new, first-class concept, which we call signatures. If you learn it, you've learned DSPy.
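As a concrete illustration of the signature idea, here is a minimal sketch using the DSPy Python API as of the 2.x releases; the model string and the task are placeholders, so treat the details as assumptions and check dspy.ai for the current interface.

```python
import dspy

# Configure a language model; the model string here is just an example.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class GenerateAnswer(dspy.Signature):
    """Answer the question using only the provided context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# The signature states what you want; the module decides how to get it.
# Swapping the inference strategy does not touch the task definition.
plain = dspy.Predict(GenerateAnswer)
with_reasoning = dspy.ChainOfThought(GenerateAnswer)

result = with_reasoning(
    context="DSPy separates task specifications from prompting tricks.",
    question="What does DSPy separate?",
)
print(result.answer)
```

The point of the sketch is the decoupling: the same `GenerateAnswer` spec can be run by a plain predictor, a chain-of-thought module, or an optimizer, without the spec itself changing.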
I'll unfortunately have to skip the details because of the time for the other speakers, but let me give you a summary. I can't predict the future. I'm not telling you that if you do this, the code you write tomorrow will be there forever. But I'm telling you the least you can do; this is not the top level, it's just the baseline, I would say: avoid hand-engineering at lower levels than today allows you to. That's the big lesson from the bitter lesson and from premature optimization being the root of all evil. Among your safest bets, and they could all turn out to be wrong, I don't know: models are not, anytime soon, going to read specs off of your mind. I don't know if we'll figure that out. And they're not going to magically collect all the structure and tools specific to your application. So that's clearly stuff you should invest in. When you're building a system, invest in the signatures, which again you can learn about on the DSPy site, dspy.ai. Invest in essential control flow and tools, and invest in evals. And ride the wave of swappable models, ride the wave of the modules we build, you just swap them in and out, and ride the wave of optimizers, which can do things like reinforcement learning or prompt optimization for whatever application it is that you've built. All right, thank you everyone. All right. We won't take up time on the computer; I think I learned that lesson. So for now, I'll go ahead and pick something. Llama 3, let's see how it did. So here's the introduction, even though I asked it not to give me a preface: "It's my pleasure to introduce our next speakers, Rafal Wilinski, also known as 'Rafal Wii,' and Vitor Balocco, also known as 'Vitor Baloo,' the AI tech lead and staff AI engineer at Zapier. They'll share their insights on how to turn agent failures into opportunities for growth and improvement." And then it said: "Note, I've also used phonetic pronunciation guides to help with the speakers' names, which are often tricky to pronounce for native speakers." So that's Llama 3 right there. All right, I'll let you all take it from here. Cool. Thank you. How's everybody doing? I don't think I have a mic yet, so give me a sec. Yeah. Oh, now is the time to take a selfie? All right, cool. Hey everybody, how's it going? We're Vitor and Rafal, we're AI engineers at Zapier, we're building Zapier Agents, and today we would like to share with you some of the hard-won lessons about building agents that we learned over the past two years building Zapier Agents. And a brief introduction to Zapier Agents. I believe many of you know what Zapier is: automation software, a lot of boxes and arrows, essentially about automating your business processes. Agents is just a more agentic alternative to Zapier. You describe what you want, we propose a bunch of tools and a trigger, you enable that, and hopefully we automate your whole business process. And a key lesson that we have after those two years is that building good AI agents is hard, and building a good platform to enable non-technical people to build AI agents is even harder. That's because AI is nondeterministic, but on top of that, your users are even more nondeterministic. They are going to use your product in ways that you cannot imagine up front. So if you think that building agents is not that hard, you probably have this kind of picture in mind. You probably stumbled upon this library called blank chain.
You pulled some examples from a tutorial, you tweaked the prompt, pulled in a bunch of tools, you chatted with this solution, and you thought, well, it's actually kind of working. All right, so let's deploy it and let's collect some profit. Turns out the reality has a surprising amount of detail, and we believe that building probabilistic software is a little bit different than building traditional software. The initial prototype is only a start, and after you ship something to your users, your responsibility switches to building the data flywheel. Once your users start using your product, you need to collect the feedback and start to understand the usage patterns and the failures, so you can build more evals and an understanding of what's failing and what the use cases are. As you're building more evals and better features, your product is probably getting better, so you're getting more users, and there are more failures, and you have to build more features, and on and on. So yeah, it forms this data flywheel. But let's start with the first step. Okay. So starting from the beginning: how do you start collecting actionable feedback? Backing up for just a second, the first step is to make sure you're instrumenting your code, which you probably already are doing. Whether you're using Braintrust or something else, they all offer an easy way to get started, like just tracing your completion calls. And this is a good start, but you actually also want to make sure that you're recording much more than that in your traces. You want to record the tool calls, the errors from those tool calls, the pre- and post-processing steps. That way it will be much easier to debug what went wrong with a run. And you also want to strive to make the run repeatable for eval purposes. So, for instance, if you log data in the same shape as it appears at runtime, it makes it much easier to convert it to an eval run later, because you can just prepopulate the inputs and expected outputs directly from your trace for free. And this is especially useful for tool calls, because if your tool call produces any side effects, you probably want to mock those in your evals. So you get all that for free if you're recording them in a trace.
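Picking up the point about logging runs in the same shape they have at runtime, here is a small Python sketch of turning a recorded step into an eval case; the field names and helpers are hypothetical, not Zapier's or Braintrust's actual schema.

```python
import uuid
from datetime import datetime, timezone

def log_step(trace: dict, name: str, inputs: dict, output, error: str | None = None) -> None:
    # Record each LLM or tool step in the same shape it had at runtime.
    trace["steps"].append({
        "name": name, "inputs": inputs, "output": output, "error": error,
        "ts": datetime.now(timezone.utc).isoformat(),
    })

def trace_to_eval_case(trace: dict, step_index: int) -> dict:
    # Inputs become the eval input; the observed output becomes a draft expected output.
    step = trace["steps"][step_index]
    return {
        "id": str(uuid.uuid4()),
        "input": step["inputs"],
        "expected": step["output"],  # review by hand before trusting it
        "metadata": {"source_trace": trace["id"], "step": step["name"]},
    }

trace = {"id": "run-123", "steps": []}
log_step(trace, "search_tool", {"query": "refund policy"}, {"docs": ["..."]})
print(trace_to_eval_case(trace, 0))
```

Because the logged step already carries its inputs and outputs in runtime shape, the "one click to turn it into an eval" workflow described below reduces to a transformation like `trace_to_eval_case`.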
Okay, great. So you've instrumented your code and you started getting all this raw data from your runs. Now it's time to figure out which runs to actually pay attention to. Explicit user feedback is really high signal, so that's a good place to start. Unfortunately, not many people actually click those classic thumbs up and thumbs down buttons, so you've got to work a bit harder for that feedback. In our experience, this works best when you ask for the feedback in the right context. You can be a little more aggressive about asking for feedback when you're in the right context, because you're not bothering the user before that. For us, one example of this: once an agent finished running, even if it was just a test run, we show a feedback call to action at the bottom: did this run do what you expected? Give us the feedback now. And this small change actually gave us a really nice bump in feedback submissions, surprisingly. So, thumbs up and thumbs down are a good baseline, but try to find those critical moments in your user's journey where they'll be most likely to provide you that feedback, either because they're happy and satisfied or because they're angry and they want to tell you about it. Even if you work really hard for the feedback, explicit feedback is still really rare, and explicit feedback that's detailed and actionable is even rarer, because people are just not that interested in providing feedback generally. So you also want to mine user interactions for implicit feedback. The good news is there's actually a lot of low-hanging fruit here. Here's an example from our app. Users can test an agent before they turn it on to see if everything's going okay. So if they do turn it on, that's actually really strong positive implicit feedback, right? Copying a model's response is also good implicit feedback; even OpenAI is doing this for ChatGPT. And you can also look for implicit signals in the conversation. Here the user is clearly letting us know that they're not happy with the results. Here they're telling the agent to stop slacking around, which is clearly implicit negative feedback. Sometimes the user sends a follow-up message that is mostly rehashing what they asked the previous time, to see if the LLM interprets that phrasing better; that's also good implicit negative feedback. And there's a surprising amount of cursing. Recently we also had a lot of success using an LLM to detect and group frustrations, and we have this weekly report that we post in our Slack. But it took us a lot of tinkering to make sure the LLM understood what frustration means in the context of our product. So I encourage you to try it out, but expect a lot of tinkering. You should also not forget to look at more traditional user metrics; there's a lot of stuff in there for you to mine implicit signal from too. So find which metrics your business cares about and figure out how to track them, then you can distill some signal from that data. For example, you can look at customers that churned in the last seven days, go look at their last interactions with your product before they left, and you're likely to find some signal there. Okay, so I have raw data, but now I'll let the industry experts speak. Or: the beatings will continue until everyone looks at their data. Okay, but how are you actually going to do that? We believe the first step is to either buy or build LLM ops software. We do both. You're definitely going to need it to understand your agent runs, because one agent run is probably multiple LLM calls, multiple database interactions, tool calls, REST calls, whatever. Each one of them can be a source of failure, and it's really important to piece together this whole story and understand what caused a cascading failure. I said we are doing both, because I believe vibe coding your own internal tooling is really easy right now with Cursor and Claude Code, and it's going to pay you massive dividends in the future, for two reasons. First of all, it gives you the ability to understand your data in your own specific domain context. And second, you should also be able to create functionality to turn every single interesting case, or every failure, into an eval with the minimal amount of friction.
So whenever you see something interesting, there should be one click to turn it into an eval. It should become your instinct. Once you understand what's going on on a single-run basis, you can start understanding things at scale. Now you can do feedback aggregations and clustering, you can bucket your failure modes, you can bucket your interactions, and then you start to see which tools are failing the most and which interactions are the most problematic. That almost creates an automatic roadmap for you, so you'll know where to apply your time and effort to improve your product the most. Doing anything else is going to be a suboptimal strategy. Something we are also experimenting with is using reasoning models to explain the failures. It turns out that if you give them the trace output, input, instructions, and anything else you can find, they are pretty good at finding the root cause of a failure. And even if they don't find it, they're probably going to explain the whole run to you, or direct your attention to something that's really interesting and might help you find the root cause of the problem. Cool. So now you have a good shortlist of failure modes you want to work on first, and it's time to start building out your evals. We realized over time that there are different types of evals, and the types of evals that we want to build can be placed into a hierarchy that resembles the testing pyramid, for those of you who know it: unit-test-like evals at the base, end-to-end evals, or trajectory evals as we like to call them, in the middle, and the ultimate way of evaluating, A/B testing with staged rollouts, at the top. So let's talk a bit about those. Starting with unit-test evals: here we're just trying to predict the n+1 state from the current state. These work great when you want to do simple assertions. For instance, you could check whether the next state is a specific tool call, whether the tool call parameters are correct, whether the answer contains a specific keyword, or whether the agent determined that it was done, all that good stuff. If you're starting out, we recommend focusing on unit-test evals first, because these are the easiest to add. It helps you build that muscle of looking at your data, spotting problems, creating evals that reproduce them, and then focusing on fixing them. Beware, though, of turning every piece of positive feedback into an eval. We found that unit-test evals are best for hill-climbing specific failure modes that you spot in your data.
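Here is a minimal sketch of what a unit-test-style eval like the ones just described could look like in Python; `next_step` is a hypothetical stand-in for a single agent step, not Zapier's actual code.

```python
def next_step(messages: list[dict]) -> dict:
    """Hypothetical single agent step: in the real system this calls the LLM
    and returns either a tool call or a final answer."""
    raise NotImplementedError

def test_schedules_meeting_via_calendar_tool():
    # Unit-test eval: from a fixed current state, assert the n+1 state.
    messages = [{"role": "user", "content": "Schedule a call with Dana tomorrow at 3pm"}]
    step = next_step(messages)
    assert step["type"] == "tool_call"
    assert step["tool"] == "create_calendar_event"
    assert "Dana" in step["args"].get("attendees", "")
```

The assertion pins one expected path, which is exactly why, as the talk notes next, these evals can penalize a new model that reaches the same goal a different way.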
Now, unit-test evals are not perfect, and we realized that ourselves. We realized we had overindexed on unit-test evals when new models were coming out that were objectively stronger, but they were still performing worse on our internal benchmarks, which was weird. And because the majority of our evals were so fine-grained, it was really hard to see the forest for the trees when benchmarking new models; there was always a lot of noise when we tried comparing runs. When you're looking at a single trace, it's easy to go through it and understand what's happening, but when you need to look at an aggregation of many traces, it starts getting difficult to understand what's happening: why are so many of these passing and some of these regressing? So we realized that maybe a machine can help us. In that previous video, when I was investigating one experiment inside Braintrust, there's a lot of staring at that screen trying to figure out what went wrong, and we thought, hey, maybe we can just give all this data, once again, to a reasoning LLM and have it compare the models. It turns out that with the Braintrust MCP and a reasoning model, you can just ask it: look at this run, look at this run, and tell me what's actually different about the new model that we are going to deploy. In this case it was Gemini Pro versus Claude, and what the reasoning model found was actually really, really good. It found that Claude is a decisive executor, whereas Gemini yaps a lot: it asks follow-up questions, it needs some positive affirmations, and it's sometimes even hallucinating bad JSON structures. So yeah, it helped us a lot. It also surfaced a problem with unit-test evals, which is that different models have different ways of trying to achieve the same goal, and unit-test evals penalize different paths. They're hardcoded to only follow one path, and our unit-test evals were overfitting to our existing models, or actually to data collected using those models. So what we started experimenting with is trajectory evals. Instead of grading just one iteration of an agent, we let the agent run all the way to the end state, and we grade not just the end state but also all the tool calls that were made along the way and all the artifacts that were generated along the way. This can also be paired with LLM as a judge; Vitor is going to speak about that later. They are not free, though. I think they have a really high return on investment, but they are much harder to set up, especially if you are evaluating runs that use tools that cause side effects. When you are running an eval, you definitely don't want to send an email on behalf of the customer once again, right? So we had a fundamental question of whether we should mock the environment or not, and we decided that we are not going to mock the environment, because otherwise you're going to get data that just doesn't reflect reality. So what we started doing is mirroring the user's environment and crafting a synthetic copy of it. Trajectory evals are also much slower, so they can sometimes take up to an hour, which is not great. And we're also leaning a bit more into LLM as a judge. This is when you use an LLM to grade or compare results from your evals.
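As a minimal sketch of an LLM-as-a-judge scorer, here is one way it could look, assuming an OpenAI-style client and a simple pass/fail grading scheme; the prompt and model choice are illustrative, not Zapier's setup.

```python
from openai import OpenAI

client = OpenAI()

def judge(run_transcript: str, criteria: str) -> int:
    """Grade a recorded run against natural-language criteria; 1 for pass, 0 for fail."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model
        messages=[
            {"role": "system",
             "content": "You grade agent runs. Reply with exactly PASS or FAIL."},
            {"role": "user",
             "content": f"Criteria:\n{criteria}\n\nRun transcript:\n{run_transcript}"},
        ],
    )
    return 1 if "PASS" in resp.choices[0].message.content.upper() else 0
```

The rubric-based scoring described next amounts to passing a different, hand-written `criteria` string for each row of the data set.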
And it's tempting to lean on them for everything, but you need to make sure that the judge is judging things correctly, which can be surprisingly hard. You also have to be careful not to introduce subtle biases, because even small things that you might overlook can end up influencing it. Lately, we've also been experimenting with this concept of rubric-based scoring. We use an LLM to judge the run, but each row in our data set has a different set of rubrics, rubrics that were handcrafted by a human and that describe in natural language what specifically about this run the LLM should be paying attention to for the score. One example of such a rubric: did the agent react to an unexpected error from the calendar API and then try again? So to sum it up, here's our current mental model of the types of evals that we've built for Zapier Agents. We use LLM-as-a-judge or rubric-based evals to build a high-level overview of the system's capabilities, and these are great for benchmarking new models. We use trajectory evals to capture multi-turn criteria. And we use unit-test-like evals to debug specific failures and hill-climb them, but beware of overfitting with these. And a couple of closing thoughts. Don't obsess over metrics. Remember that when a metric becomes a target, it ceases to be a good metric. So when you're close to achieving a 100% score on your eval data set, it doesn't mean that you're doing a good job; it actually means that your data set is just not interesting, because we don't have AGI yet, so it's probably not true that your model is that good. Something that we're experimenting with lately is dividing the data set into two pools: a regressions data set, to make sure that when we make changes we are not breaking existing use cases for customers, and an aspirational data set of things that are extremely hard, for instance nailing 200 tool calls in a row. And lastly, let's take a step back. What's the point of creating evals in the first place? Your goal isn't to maximize some imaginary number in a lab-like setting; your end goal is user satisfaction. So the ultimate judges are your users. You shouldn't be optimizing for the biggest eval scores and completely disregarding the vibes. That's why we think the ultimate verification method is an A/B test. Just take a small portion of your traffic, let's say 5%, route it to the new model or the new prompt, monitor the feedback, and check your metrics like activation, user retention, and so on. Based on that you can make the most educated decision, instead of being in the lab and optimizing an imaginary number. That's all. Thank you. We have time for a couple of questions, like one or two. There's a mic here on the left and another mic right there for anybody who wants to ask a question for Vitor and for Rafal. Oh yeah, I'll stay. Anybody interested? You can also raise your hand, or just say the question and, if you don't mind, Vitor will repeat it. Mic five. Hey, I'm here. The question I had for you: you mentioned at the very end that you were splitting these things up into three different data sets. Do you combine that with the rubrics thing to create like a map of how things end up looking? How does that actually end up driving product decisions and everything in general? Yeah, great question.
We're still experimenting with the rubric-based evals, and I think they're really useful when you want to eval trajectories, or especially when you want to compare a new model that just came out. But I think ultimately the combination of all the different evals is what drives our product roadmap, especially when we look at things in aggregate: we group things in buckets, and then we know, okay, this is the most important thing that we have to focus on now, and then we can slice our data sets to find which part of a data set is most representative for us to iterate over, and of course afterwards we run the complete data set as well to catch any regressions. But that's kind of how we think about it. Cool. Do we have another question over there? Yeah, one more question here. I'm Drew from Vabs. I was really curious what your process for evaling and aligning your evals looked like. Was that a pretty time-consuming thing? It's something we've struggled with, and we've met a lot of people who have struggled with it. Yeah, sure. So first of all, I think we start building evals by looking at customer feedback, especially negative customer feedback. We used to have labeling parties where we group together, look at a trace, and try to understand what went wrong and what the expected behavior is. We also have a Slack channel, and we have something like an on-call rotation: there's always one engineer who just looks at it and does their best to find the problem. But there is no really hard process for that; it's best effort, I would say. And we kind of stumbled through different types of evals over time. We started with big, long trajectory evals with LLM as a judge, then we doubled down on unit-test evals and kind of neglected the rest, and now we're feeling a bit of the pain from that. So in the end we spread out over different types of evals, and we're still building that intuition about what the ratio of those should be. All right. Thank you, folks. [Applause] Good luck. All right. So, thank you. In case folks came in at the last second, what I'm doing is having the AI write the speaker intros, and I have multiple models here. Thank you. So, let's try Claude 4. Good. "Up next we have Ido Pesok, an AI engineer at Vercel, who's been deep in the trenches building the AI systems behind v0, Vercel's AI-powered web development tool. He's here to share hard-won wisdom about why evaluating AI systems requires throwing out everything you thought you knew about testing, because when your system is nondeterministic, your evals need to be anything but conventional." All right, take it away. Thank you. Hello everyone. How are you? Let's see, hopefully my slides go up. Awesome. Okay, thank you so much for coming. My name is Ido. I'm an engineer at Vercel working on v0. If you don't know, v0 is a full-stack vibe coding platform. It's the easiest and fastest way to prototype, build on the web, and express new ideas. Here are some examples of cool things people have built and shared on Twitter. And to catch you up, we recently launched GitHub sync, so you can now push generated code to GitHub directly from v0. You can also automatically pull changes from GitHub into your chat, and furthermore, switch branches and open PRs to collaborate with your team.
I'm very excited to announce that we recently crossed 100 million messages sent, and we're really excited to keep growing from here. So my goal for this talk is for it to be an introduction to evals specifically at the application layer. You may be used to evals at the model layer, which is what the research labs cite in model releases, but this will focus on what evals mean for your users, your app, and your data. The model is now in the wild, out of the lab, and it needs to work for your use case. And to do this, I have a story. It's a story about an app called Fruit Letter Counter. If the name didn't already give it away, all it is is an app that counts the letters in fruit. So, the vision: we'll make a logo with ChatGPT. There might be a product-market-fit problem already, because everyone on X is dying to know the number of letters in fruit. If you didn't get it, it's a joke on the "how many Rs are in strawberry" prompt. We'll have v0 make all the UI and backend, and then we can ship. So we had v0 write the code. It used the AI SDK to do the stream text call. And what do you know, it worked first try. GPT-4.1 said three. And not only did it say three once, I even tested it twice, and it worked both times in a row. So from there, we're good to ship, right? Let's launch on Twitter: want to know how many letters are in a fruit? Just launched fruitlettercounter.io. The .com and .ai were taken. And yeah, everything was going great. We launched and deployed on Vercel, we had Fluid Compute on, until we suddenly get this tweet. John said: "I asked how many Rs in strawberry and it said two." So, of course, I just tested it twice; how is this even possible? But I think you get where I'm going with this, which is that, by nature, LLMs can be very unreliable. And this principle scales from a small letter-counting app all the way to the biggest AI apps in the world. The reason it's so important to recognize this is that no one is going to use something that doesn't work; it's literally unusable. And this is a significant challenge when you're building AI apps. So, I have a funny meme here, but basically AI apps have this unique property: they're very demo-savvy. You'll demo it, it looks super good, you'll show it to your coworkers, and then you ship to prod, and suddenly hallucinations come and get you. So we always have this in the back of our heads when we're building. Back to where we were. Let's actually not give up, right? We actually want to solve this for our users; we want to make a really good fruit-letter-counting app. So you might say, how do we make reliable software that uses LLMs? Our initial prompt was a simple question, right? But maybe we can try prompt engineering. Maybe we can add some chain of thought, something else, to make it more reliable. So we spend all night working on this new prompt: "You're an exuberant fruit-loving AI on an epic quest," dot dot dot. And this time we actually tested it 10 times in a row in ChatGPT, and it worked every single time. 10 times in a row. It's amazing. So we ship, and everything was going great, until John tweeted at me again and said: "I asked how many Rs are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry, and it said five." So we failed John again. This example is pretty simple, but this is actually what will happen when you start deploying to production.
You'll get users that come up with queries you could never have imagined, and you actually have to start thinking about how to solve them. And the interesting thing, if you think about it, is that 95% of our app works 100% of the time. We can have unit tests for every single function, end-to-end tests for the auth, the login, the sign-out; it will all work. But it's that most crucial 5% that can fail on us. So let's improve it. Now, to visualize this, I have a diagram for you. Hopefully you can see the code. Maybe I need to make my screen brighter. Can you see the code? I don't know. Okay, well, we'll come back to this. But basically, we're going to start building evals, and to visualize this, I have a basketball court. Today's day one of the NBA finals, I don't know if you care. You don't need to know much about basketball; just know that someone is trying to throw a ball in the basket, and here the basket is the glowing golden circle. Blue will represent a shot made and red will represent a shot missed. One property to consider is that the farther your shot is from the basket, the harder it is. Another property is that the court has boundaries. So this blue dot: although the shot goes in, it's outside the court, so it doesn't really count in the game. Let's start plotting our data. So here we have a question: how many Rs in strawberry? This, after our new prompt, will probably work, so we'll label it blue, and we'll put it close to the basket because it's pretty easy. However, "how many Rs are in that big array" we'll label red, and we'll put it farther from the basket. Hopefully you can see that; maybe we can make it a little brighter. But this is the data part of your eval. Basically, you're trying to collect what prompts your users are asking, and you want to store this over time, keep building it, and record where these points sit on your court. Two more prompts I want to bring up. What if someone says: how many Rs are in strawberry, pineapple, dragon fruit, mango, after we replace all the vowels with Rs? Insane prompt, but still technically in our domain, so we'll label it red, all the way down there. But a funny one is: how many syllables are in carrot? This we'll call out of bounds. None of our users are actually going to ask it; it's not part of our app, so no one is going to care. I hope you can see the code, but basically, when you're making evals, here's how you can think about it. Your data is the point on the court. Your shot, or in this case, in Braintrust they call it a task, is the way you shoot the ball towards the basket. And your score is basically a check of: did it go in the basket or did it not go in the basket?
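Here is a framework-agnostic Python sketch of the three pieces just named; in Braintrust these map onto data, task, and scores, but the `ask_model` helper and the little harness itself are hypothetical.

```python
def ask_model(question: str) -> str:
    """Hypothetical helper: in the real app this would call your model provider."""
    raise NotImplementedError

DATA = [  # points on the court: realistic user queries plus expected answers
    {"input": "How many r's are in strawberry?", "expected": "3"},
    {"input": "How many r's are in raspberry?", "expected": "3"},
]

def task(question: str) -> str:
    # the shot: however your system currently produces an answer
    return ask_model(question)

def score(output: str, expected: str) -> int:
    # did it go in the basket?
    return 1 if expected in output else 0

results = [score(task(row["input"]), row["expected"]) for row in DATA]
print(f"{sum(results)}/{len(results)} passed")
```

Keeping the three pieces separate is what lets you grow the data set, swap the task implementation, and tighten the scorer independently of one another.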
To make good evals, you must understand your court. This is the most important step, and you have to be careful of falling into some traps. First is the out-of-bounds trap: don't spend time making evals for data your users don't care about. You have enough problem queries, I promise you, that your users do care about. So be careful; it can feel productive to make a lot of evals that aren't really applicable to your app. Another visualization: don't have a concentrated set of points. When you really understand your court, you'll understand where the boundaries are, and you want to make sure you test across the entire court. A lot of people have been talking about this today, but to collect as much data as possible, here are some things you can do. First, collect thumbs up and thumbs down data. This can be noisy, but it also can be really good signal as to where your app is struggling. Another thing: if you have observability, which is highly recommended, you can just read through random samples in your logs. Even though users might not be giving you signal directly, if you take a hundred random samples and go through them once a week, you'll get a really good understanding of how your users are using the product. If you have community forums, these are also great; people will often report issues they're having with the LLM. X and Twitter are also great, but can be noisy. There really is no shortcut here; you have to do the work and understand what your court looks like. So here is what it should look like if you're doing a good job of understanding your court and building your data set. You should know the boundaries, you should be testing within your boundaries, and you should understand where your system has blue versus where it has red. Here it's really easy to tell: okay, maybe next week we need to prioritize the team working on that bottom-right corner. That's somewhere a lot of users are struggling, and we can really do a good job of flipping the tiles from red to blue. Another thing you can do, and I hope you can see it, is put constants in the data and variables in the task. Just like in math or programming, you want to factor out constants; it improves clarity, reuse, and generalization. Let's say you want to test your system prompt. Keep constant the data that your users are going to ask: for example, "how many Rs in strawberry" goes in the data; that's a constant, it's never going to change throughout your app. What you're going to vary is in the task: you'll try different system prompts, you might try different pre-processing, different RAG, and that's what you want to put in your task section. This way your app actually scales, and when you change your system prompt, you never have to redo all your data. This is a really nice feature of Braintrust. And if you don't know, the AI SDK actually offers a thing called middleware, and it's a really good abstraction for putting basically all your pre-processing logic, so RAG, system prompt, and so on can go in there, and you can now share this between your actual API route that's doing the completion and your evals. So if you think about the basketball court as basketball practice, and we're trying to practice our system across different models, you want your practice to be as similar as possible to the real game; that's what makes good practice. So you want to share pretty much the exact same code between your evals and what you're actually running. Now I want to talk a little bit about scores, which is the last step of the eval. The unfortunate thing is that it varies greatly depending on your domain. In this case it's super simple: you're just checking if the output contains the correct number of letters. But for a task like writing, it's very, very difficult. As a principle, you want to lean towards deterministic scoring and pass/fail.
Lean deterministic because when you're debugging, you're going to get a ton of inputs and logs, and you want to make it as easy as possible to figure out what's going wrong. If you're over-engineering your score, it might be very difficult to share your evals with your team and distribute them across different teams, because no one will understand how these things are getting scored. Keep your scores as simple as possible. A good question to ask yourself is: when I'm looking at the data, what am I looking for to decide this failed? With v0, we're looking for whether the code didn't work. Maybe for writing, you're looking for certain linguistics. Ask yourself that question and write the code that looks for it. There are some cases where it's so hard to write the code that you may need to do human review, and that's okay. At the end of the day, you want to build your court and you want to collect signal, even if you must do human review to get the correct signal. If you do the correct practice, it will pay off in the long run and you'll get better results for your users.

One trick you can do for scoring: don't be scared to add a little extra to the original prompt. For example, here we can say, output your final answer in these answer tags. What this does is make it very easy for you to do string matching and so on. In production you don't really want this, but you can do little tweaks to your prompts so that scoring is easier.

Another thing we highly recommend is adding evals to your CI. Braintrust is really nice here because you get these eval reports: it runs your task across all your data and then gives you a report at the end with the improvements and regressions. Say my colleague made a PR that changes a bit of the prompt. We want to know how it did across the court. Did it change more tiles from red to blue? Maybe our prompt fixed one part but broke another part of the app. It's a really useful report to have when you're doing PRs.

So, going back, this is the summary of the talk. You want to make your data the core of your evals, and you can treat it like practice. Your model is basically going to practice. Maybe you want to switch players: when you switch models, you can see how a different player performs in your practice. This gives you such a good understanding of how your system is doing when you change things like your RAG or your system prompt, and you can now go to your colleague and say, "Hey, this actually did help our app." Because improvement without measurement is limited and imprecise, and evals give you the clarity you need to systematically improve your app. When you do that, you get better reliability and quality, higher conversion and retention, and you also get to spend less time on support and ops, because your evals, your practice environment, will take care of that for you. And if you're wondering how I built all these court diagrams, I actually just used v0; it made me an app where I added these shots made and missed around the basket. So, yeah, thank you very much. I hope you learned a little bit about evals. Thank you. So, we do have some time for some questions.
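One more sketch of that answer-tag trick, as a deterministic scorer (assuming the eval-time prompt appends an instruction to wrap the final answer in `<answer>` tags; illustrative, not v0's actual scorer):

```typescript
// Deterministic pass/fail scorer for outputs that wrap the result in <answer> tags.
function answerTagScorer({ output, expected }: { output: string; expected: string }) {
  const match = output.match(/<answer>([\s\S]*?)<\/answer>/i); // pull out the tagged answer
  const answer = match ? match[1].trim() : "";
  return { name: "answer_matches", score: answer === expected.trim() ? 1 : 0 };
}
```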
There are two mics, one over here, one over there. We can take two or three questions, please, if anybody's interested in asking. We have one over there, mic five, please. Or you can repeat the question if you don't mind.

Yeah, you can think of it really like practice. Maybe your basketball player will in general score 90%, but they might miss more shots here or there. We run it, like, we run it every day at least, and then we get a good sense of where we're actually failing. Did we have some regression? So yeah, running it daily, or at least on some schedule, will give you a good idea.

I was thinking, what if you ran the same question through it five times? What's the percentage, like making four out of five?

I see. So, definitely, as you go further away, the harder questions get lower pass rates. But overall you can measure that too: how is it running, like four out of five, over a week or something, and you just want to improve that. Evals are just a way to measure so you can actually do the improvement.

Hello, I have a question about the v0 evaluation, like real evaluation. As I understand, you have a lot of user trajectories, internal ones, so it's not about analyzing one trajectory but hundreds or even thousands. The question is, how do you analyze, which method do you use to analyze thousands of trajectories and get valuable insights from them at scale?

Yeah, so v0 is in a special domain because we're doing code, so we can actually run the code. When you run in v0, we can collect errors, so we do a lot of measurements based on the errors we get. However, if you have a task like writing or something else that's more difficult to measure, you might be in a worse spot where you need more human evals. But what we do is track a lot of errors in the apps, and that usually is a good signal for the final trajectory of the app that the user is building. Does that make sense?

I'm also interested in how you analyze which tools were called and how properly they were called. It's nondeterministic, and it's interesting how to understand it.

So, like the Zapier presentation earlier, the best thing is to just look at the final output. You can get a little carried away looking at every tool call, but at v0, because the question is "did the app work," a lot of the time that's after all the steps in between, and that's what we really look at: did the thing work at the end or not. Then, if it didn't work, we go in and look at all the steps and try to find the one that broke. But we start by looking at the final trajectory, after all the stuff has happened. Yeah. Okay. Yes. Thank you.

Awesome. And then we have one more on the side.

Yeah, how do you think about evals in terms of each user query versus a full conversation? I know you were just talking about thinking through the whole thing, but what if the user doesn't know where they're going at the end?

Yeah, so for v0 the conversations can get really long. I would equate that to a more difficult, actually different data point; I would equate it to a very difficult shot on the court, and we test those separately.
So maybe you can have multiple evals within one chat, depending on where they are in the conversation, and at that point you test one message at a time. But yeah, does that answer your question? Yeah, somewhat. Thanks. Awesome. Thank you.

Did you have one really quickly? Yeah, I was going to ask how you deal with an error that's three or five turns into a conversation, where you need to mock up to that state and figure out whether it's able to resolve it from there.

Yeah, that's a good question. On v0 we have the state at every point, so we're lucky: we can really easily traverse that state. But what we do, like I was saying earlier, is that in our evals you test one message ahead. And if they got into a broken state already, then it's not really worth testing anymore, because usually the LLMs really suck at coming back from it. So we're mostly focused on finding where that point was where it did break, and you want to try to reduce the number of those points. Maybe in the future, as the models get better, you can start to recover, and we're working on a better loop of error correction; it's something we're constantly working on. But usually it's really helpful to find that one point where it ended and go from there. Makes sense. Thank you all. Thank you.

Yep, appreciate that. So, we do have one remaining speaker, but they're not in right now. Feel free to stick around; hopefully he shows up. Otherwise, there's also lunch available. Appreciate it. All right.

So, we have a guest speaker. Actually, if you don't mind introducing yourself, because I don't think even the LLM... Yeah, go for it.

Hey, my name is Randall. I just launched a framework today for evals that I'm trying to figure out if it's good. It's called Bolt Foundry; that's our company name. We are trying to help people who do JavaScript. Most people who do JavaScript don't have a great way to run evals or create synthetic data. We have a different approach that is based on journalism, actually: building prompts in a structured way that's more similar to the way a newspaper is written than the way most people do it, where they scare the crap out of an LLM. Let's see if I can plug in. This is totally guerrilla, but no one else is here, so why not, right? All right, let's do this. It's live on GitHub right now; if you want, go to github.com/boltfoundry, or boltfoundry.com has our thing.

All right, cool. So let me spin up my environment. The cool thing is that to start you can just run npx bff-eval, and we have a demo you can run with --demo. It all works with OpenRouter, so you can test against any system you want. Oh, this is a Replit bug, hold on, let me fix this really quick. All right, one more time: npx bff-eval --demo. So here we've made a simple JSON validator. Rather than writing a JSON validator that actually parses the JSON, this is a grader. A grader is an LLM that you can use to figure out if your thing is good. We have a set of samples; let me just grab them. samples.jsonl. Let's do this one. Sure, that's the wrong one, but it's okay.
So, essentially, the messages come in as a user message, an assistant response, and then, optionally, a score. You can actually set your ground truth and say, this is a sample of something that's really good, this is a sample of something that's really shitty. And then you can describe it. The description is mostly for your team, so that, as they were talking about earlier, when you're pulling something from production and you have a sample you want your team to know about, you can describe why it's relevant. The eval runs; in this case we have ten runs, and each has a rubric score from positive three to negative three. And the other speaker is here now, so thank you so much for letting me show you something cool. boltfoundry.com. Oh, I have a minute left. All right, the other cool thing, I'll just show you the end: you can actually see where your grader diverged from the ground truth, so you can say, oh, this is weird, why is this one acting this way? Anyway, we're going to have more stuff. It's going to be cool. Thanks, man. Thank you for giving me a second. Yeah, no problem.

All right. Next up, Diego. I had your introduction ready, but hopefully you don't mind introducing yourself. Go ahead and set up, connect the USB there and then the HDMI there. There's no audio in your computer? I just project. Yeah, I'm happy to project, that's fine. Hello, hello, test one, two. Okay, perfect. It should be relatively short, too, so we have time; we can wait a few minutes if that's better. Okay, cool. I'm going to start.

Okay, so hello everyone. My name is Diego Rodriguez. I'm the co-founder of Krea, a startup in the AI space like many others, in particular generative media, multimedia, multimodal, and all the buzzwords. But I come here mainly to tell a story about how we think about evaluations when we have to take human perception, human opinion, and aesthetics into the mix. So I'm going to start with a very simple story. I put up an AI-generated image of a hand; obviously, it looks horrible. Then I asked o3, what do you think of this image? It thought for 17 seconds, obviously tool calling, Python analysis, OpenCV goes crazy, and then, after it charges me a few cents, it says: just a couple of things jump out, it's mostly natural. And it's like, okay, we have what many people claim is basically AGI, and it is completely unable to answer a very simple question.
And that's a surprising thing if you think about it, because we as humans, when we see that image, just react so naturally against it: what is that? That's not natural. And I feel like that's precisely because AI models are being trained, first, on human data; second, on human preference data; and third, in a way, limited by data shaped by our preconceived notions and perception and all of that. So that's what this talk is about: what can we do better, and, honestly, asking ourselves some questions that I think are not being asked enough in the field.

Cool. So, a tiny bit of history. We all know about Claude Shannon, the father of information theory. According to many, his master's thesis is one of the most important ever written; he laid the foundations for digital circuits and then, eventually, communication, and to a degree we can say even LLMs nowadays, if we fast forward. And I want to focus on the fact that we call his work foundational in information theory. Well, when he published it, it was actually called "A Mathematical Theory of Communication," and he was always focused on communication. There's a diagram that appears in that work, and it's all about: this is the source, this is the channel, this is the destination, and there can be some noise in there. And as a founder of a company that is focused on media, to me it's interesting to notice these parallels between classic information theory and communication. Let me see, did I put the image in? Well, I didn't, but if you have any context around variational autoencoders or neural networks, you can squint at that diagram and go, oh, is that a neural network?

In the context of information and communication, I want to talk about how compression relates to how we think about evaluation. Take JPEG as an example. JPEG exploits human nature in the sense that we are very sensitive to brightness but not to color. There's an illusion that shows this, where A and B are actually the same color, but we are basically unable to perceive it until we do this, and then suddenly it's like, oh, really? What's going on there? JPEG does the same thing. We have the RGB color space to represent images with computers. We notice that there's a diagonal that represents the brightness of the image, so we can change into a different color space that separates color from brightness, and then we can downsample the color channels, because we're actually not even that sensitive to them; we can remove that, or parts of it. Once we do that, we can see the brightness and color components separated, and once we downsample, we can try to recreate the image. Here's an example: the original image versus the image with the downsampled color. They look the same to us, and the second one carries something like 50% less information. There's other stuff, Huffman coding and more, but the point is the same. And then, if you exploit the same thing for audio, what can we hear, what can we not hear? Well, you do the same thing and you get MP3.
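To make the brightness-versus-color point concrete, here's a rough, illustrative sketch of the color-space separation just described (not JPEG's actual pipeline); the coefficients are the common BT.601-style luma/chroma conversion:

```typescript
// Convert RGB to a brightness (Y) plus color-difference (Cb, Cr) representation.
// JPEG-style codecs keep Y at full resolution and store Cb/Cr at lower resolution,
// because we notice lost brightness detail far more than lost color detail.
function rgbToYCbCr(r: number, g: number, b: number) {
  const y = 0.299 * r + 0.587 * g + 0.114 * b;              // brightness (luma)
  const cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b;   // blue-difference chroma
  const cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b;   // red-difference chroma
  return { y, cb, cr };
}
// "Downsampling the color" then just means keeping one Cb/Cr pair per 2x2 block
// of pixels: roughly 75% of the color samples are thrown away with little visible
// difference, which is the "looks the same, half the information" effect above.
```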
Do the same exploitation across time and, congrats, now you have MP4. It's all the same principle: let's exploit how we humans perceive the world. This made me think about myself, because I studied audiovisual systems engineering, which is engineering around how microphones work, how speakers work, all of that. And it was just interesting to me: I was coding, I start deleting information, I know for a fact that I'm deleting information, yet I re-render the image and I see the same thing. Philosophers always tell you we're limited by our senses, but this was the first time I felt like I was seeing it: I am not seeing the difference.

But then, if a lot of our data is the internet, and we're scraping data from the internet, a bunch of those images are also compressed. Are we taking into account that perhaps our AIs are limited too, because we have some sort of contagion of our flaws into the AI? And then it gets trickier. For instance, here's a screenshot I took from a paper, I think it's called Clean-FID. FID, for those without context, is one of the standard metrics used to measure how well, for instance, diffusion models reproduce images. But you start adding JPEG artifacts and the score says, oh no, this is a horrible, horrible image, while perceptually the four images are basically the same. Yet the FID score says this is really bad. So why are we using FID scores, or metrics along those lines, to decide whether a generative AI model is good or bad?

The thing is, sometimes I feel like we are focused on measuring just the things that are easy to measure: prompt adherence with CLIP, how many objects are there, is this blue, is this red, et cetera. But what about this one? Oh, really bad, really bad generator, because that's not how a clock looks and the sky makes no sense. So not only are we limiting our AIs with our human perception; on top of that, we forget about the relativity of metrics. No, actually, this is art, and it's great, and sometimes there's meaning behind the work that isn't literally in the image, but if you're human, you get it: this is what the author is trying to tell me. I feel like the metrics don't show that. And commercially, professionally, my job is: how can we build a company that allows creatives and artists of all sorts, starting with imagery, then video, to better express themselves? But how are we supposed to do that if this is the state of the art?

Then a friend of mine, Cheng Lou, who works at Midjourney and has great talks you should all check out, told me a little over a year ago a quote that I just can't stop thinking about, which goes something like: hey man, if you think about it, predicting the car back when everything was horses is not that hard. And I was like, well, yeah, it's not that hard: cars are the future, we have a thing that goes like this, we have horses that make energy, you swap the horse for an engine, that's essentially a car. Come on, how hard is that? And he said: you know what's hard to predict? Traffic.
And then I just kept thinking about that. As engineers, as researchers, as founders, what are the "traffics" that we're missing now? Because I feel like everyone's focused on, yeah, but you can transform from JSON to YAML, and, dude, who cares. Or, yes, it's important, but what big picture are we all missing?

Then there's the myth of the Tower of Babel, where, in a nutshell, humanity wants to build a tower and go meet God, and God says, no, I don't want that, so instead I'm just going to confuse all of you; you won't be able to coordinate, each of you will speak a different language, and it will basically be impossible to keep the thing going. Which reminds me of standard infrastructure meetings with backend engineers: no, we should use Kubernetes; no, we should use something else. It's all fighting and nothing gets built. I'm like, dude, God is winning, goddammit.

But this makes me think: we just entered the age where models have essentially solved translation, or solved it to a very high degree. So what happens now that we can all speak our own languages yet, at the same time, communicate with each other? I'm already doing it. For instance, I sometimes do customer support manually for Krea, and I literally speak Japanese with some of my users, and I don't speak Japanese. I've learned a little, but I don't speak it. And I'm now able to provide excellent founder-level customer support, whatever that means, to a country I otherwise couldn't. So I invite us all to think about what that really means. Because it means, for instance, that we can now understand better, or transmit our own opinions better, to others. And on the previous point about art, that's kind of like an opinion, right? Evals are not just about whether there are four cuts here. It's about: this cut is blue, and, yeah, but is it blue or is it teal? What kind of blue? I don't like this blue. All of that.

So, in a nutshell: how do we evolve our evals? If, in my opinion, this is bad, then I want metrics that take my opinion into account too. And then, okay, consider that I may be a visual learner; maybe your evals should take into account how we humans perceive images. And also the nature of the data, such as: it's all trained on JPEGs from the internet, so take the artifacts into account, take all of this into account while training on your data.

Okay, I guess the mandatory slide before the thank-you: a bunch of users, a bunch of money, and we did all of that with eight people; now we're twelve. And this is an email I set up today for high-priority applications, for anyone who wants to work on research around aesthetics, hyper-personalization, and scaling generative AI models in real time for multimedia — image, video, audio, 3D — across the globe. We have customers like those, and that's it. Thank you. Oh, Q&A. Okay, perfect. Any questions? There are many points there; can you rephrase the question? Yeah. Yeah.
So, the question, in a nutshell, is: are there perceptually aware metrics? I showed an example where the FID score changes a lot with JPEG artifacts. Are there metrics that do almost the opposite, that barely change while the image still looks good? There are some, and many of them are also used in traditional encoding techniques. But in a way, I'm here to invite us all to start thinking about those. We can actually train a classifier, or a continuous classifier, so that it understands what we mean: hey, I show you these five images, these five images are actually all good, and they can have all sorts of artifacts, not just JPEG artifacts. This is exactly where machine learning excels, when it's all about opinions: let me just show you, and you'll know it when you see it. That's precisely the type of question that AI is amazing at. All right, thank you, everyone. He'll be sticking around for questions on the side if you have some. Thank you. And we are doing a working lunch.

[Sound check between talks.]

Hey everybody, my name is Doug Guthrie. I'm a solutions engineer at Braintrust. As you can see here, we're an end-to-end developer platform for building AI products. We do evals. If you watched the keynote this morning, you saw our founding engineer jumping up and down on stage yelling "evals." I am not going to do that; I'm not as funny or cool as him, but we should all be very excited about evals here. A very brief agenda of what we'll cover in this intro: a very brief company overview; an intro to evals — why you would even start thinking about using them, what they are, and the different components you need to create one; some more Braintrust-specific things, like running evals via our SDK. You'll see in the examples that you can run evals both in the platform itself and via the SDK, and I think it's really cool that we can connect the local development we're doing with what we're doing in the platform. Then, how we move to production: human review, online scoring, how well our application is performing in production. And lastly, a little bit of human in the loop: getting feedback from your users, and how we take some of those production logs and feed them into the datasets we're using in our evals, creating this really great flywheel effect.

Cool. Quick company overview. You see up there, maybe in the top right, some of our investors. Maybe a quick callout on the leadership side: Ankur Goyal is our CEO.
Maybe the reason to call that out is that at his last two stops, Ankur essentially built Braintrust from scratch inside those companies, and that's really where he found the idea: maybe this is actually a thing that people need. That's the origination story of Braintrust. The other thing to call out is that we already have a lot of companies using Braintrust in production today; this is just a few of the companies that are utilizing us for evals and for observability of their GenAI applications. I won't bore you too much with that, but let's jump into this.

If you were at the keynote, you probably saw a similar slide. I didn't take it out, but you see the tech luminaries here, as Manu referenced, talking about evals and the importance of them. This is obviously an intro-to-evals track, and what better way to start than with some really influential people in the space talking about evals and why they are so important.

So why would you think about evals? They help you answer questions. Here are a few of them. When I change the underlying model in my application, is it getting better or is it getting worse? When I change my prompt, when I change certain things about the application, is it getting better or worse? We want to get away from not having a rigorous process around building with large language models, which, as you all know, have nondeterministic outputs. That creates somewhat of a challenge, and without evals it becomes really hard to create a good application that we can put into production. Some other reasons: obviously, being able to detect regressions in the code. The other thing Ankur mentioned to me when I first started (I didn't really mention this, but this is my third week at Braintrust as a solutions engineer) that really resonated was this: people think of evals as almost like unit tests for our applications, but he described them another way, as a really great way for us to play offense, as opposed to just playing defense, which is what unit tests are kind of used for. Evals can actually be a tool that creates a lot of rigor around building and developing these applications and ensures we actually build things we can put into production.

From a business perspective, why would you think about running evals? Here are a few reasons. If you have evals, you create this feedback loop, this flywheel effect, which I think Manu mentioned in the keynote. This flywheel effect allows us to cut dev time and enhance the quality of the application we're putting into production. If you're able to connect the things happening in real life with your users in production, filter those logs down, add those spans to datasets, and inform what you're doing in an offline way, it creates that flywheel effect that becomes really powerful from a development perspective. Again, a little bit more on the customer side.
Here are a few of our customers and some of the outcomes they've seen using Braintrust, whether it's moving a little bit faster, pushing more AI features into production, or just increasing the quality of the applications they have.

So let's start talking a little bit about the core concepts of Braintrust. Obviously, we're here to talk about evals. You see these arrows going one way and the other; again, this is that flywheel effect I described earlier. There's the prompt-engineering aspect: in Braintrust, think of the playground we have as an IDE for LLM outputs. The playgrounds allow for rapid prototyping: as we make changes, as we change the underlying model, what is the impact on that particular task or application? And there's the observability aspect: the logs we're generating in production, the ability to have a human review those logs in a really easy, intuitive interface, and then having user feedback from actual users be logged into the application as well.

So what is an eval? You've probably heard this several times today, or throughout the week if you stopped by the booth, but our definition is a structured test that checks how well your AI system performs. It helps you measure the things that are important: quality, reliability, correctness. So what are the ingredients of an eval? I've been talking a little bit about the task: this is the thing, the code or the prompt, that we want to evaluate. The really cool thing about Braintrust is that this can be as simple as a single prompt, or it can be a full agentic workflow where we're calling out to tools. There's no limit on the complexity we put into this task; the only thing it requires is an input and an output. The second ingredient is a dataset. These are our real-world examples; this is essentially what we're going to run the task against to understand how well our application is performing. And how we do that is via scores. The score is really the logic behind your evals. There are a couple of ways to think about this. There's the LLM-as-a-judge type of score: you give it the output and some criteria, and it's able to assess, based on this output, is this excellent, is this fair, is this poor, and those choices then correspond to, say, one, 0.5, or zero. You also have code-based scores; these are maybe a little more heuristic or binary. We can use both of these to aid in the development of that eval and ensure we're building a really good application.

I sort of mentioned this already, but there are two mental models here for evals: offline and online. Offline is pre-production. This is us actually doing that iteration, identifying and resolving issues before deployment; it's where we define those tasks and those scores. Online evals are the real-time tracing of the application in production: logging the model inputs and outputs, the intermediate steps, the tool calls, everything that's happening. It allows us to diagnose performance and reliability issues, latency.
Based on how you instrument your application with Braintrust, we can pull back lots of different metrics related to cost and tokens and duration, and all of these help inform how we build the application. I'll jump into a little bit more of how we can instrument our app for online evals later; first, a little more on the offline side.

Before doing that, maybe just a quick level-set on how to improve. One of the things I've seen in the last few days, in the conversations I've had, is questions like "how do I even get started?" or "what do I do if X?" Another thing I heard from Ankur very early on is: just get started. Create a baseline that you can then iterate and build from. A lot of people get caught up in creating a golden dataset of test cases they can iterate from. You don't necessarily have to do that. Start, build that baseline, establish the foundation you can then improve upon. And this is a really good matrix: if I have good output but a low score, what do I do? Improve your evals. If I have bad output and a high score? Improve your evals, your scoring. It's a really good high-level guide to where to target your efforts when you're building these apps and creating evals.

So let's jump into the actual components I just talked about. Within the Braintrust platform we have a task. Again, this could be a prompt or a full agentic workflow. Very basic: you see that GIF running; this is a prompt within the platform. You specify the underlying model you want to use, you give it a system prompt, and you can also give it access to tools. It supports Mustache templating, so you can pass in variables like the user's question, the input from the user, chat history, or metadata. When we actually want to parse through these logs later, the metadata becomes really beneficial, enabling us to do that in an easy way. Going further, maybe we have a multi-turn chat scenario where we want to add additional messages for the system, the assistant, and the user, plus tool calls; the platform allows that via the plus-messages button, and then you're able to add those different messages to the prompt. We can also add tools. Oftentimes the prompt will need access to something: maybe it's a RAG-type workflow, maybe it's doing web search. Whatever it is, we can use those tools as part of the prompt, so when you encode in the prompt "make sure you use X tool," the prompt has access to that tool while it's running. The last one is a feature that's in beta right now: creating more of an agentic workflow within the Braintrust platform itself. Right now, at least, it's a way to chain prompts together, where the output of one becomes the input of the other. And if you think back to the earlier slide, if I have a prompt that has access to tools, you can create a pretty powerful system here, where you go from a first step that has access to a certain tool and get some output from it.
We can then go to the next step, which has access to maybe some other tools; it's the sort of multi-agent workflow we can create with the underlying tools within those prompts.

The second thing I talked about is datasets. These are the test cases we want to give to the task to run, so we can iterate over them, get the output, and then, going a little further down, actually score it. This is obviously really important when we're running our evals. And when we're trying to pull from production, the actual logs that are happening, we can add those spans, those traces, to the datasets in a really easy way; I can show you what that looks like. If you look there at the bottom, the only thing that's required is the input. You also have the ability to add an expected value: what is the expected output for that input? You can then create a score that compares the output with the expected value; there's a score called Levenshtein that measures the difference between the two. So you can do different things based on what you provide in that dataset. You also have metadata; again, being able to filter on different things, or pulling the dataset into your own codebase and filtering by X, Y, or Z via the metadata, is all possible. I mentioned this a little while ago, but just start small and iterate: you don't have to create a golden dataset to get started here. Just start, and then continue to iterate and build from that baseline. The human-review portion also becomes really powerful here: when we have things being logged in production, humans can go through those logs (there are lots of ways to filter down to the things they should be looking at), and then we can decide to add those things to certain datasets that inform the offline evals we're running.

The last ingredient we need for our evals is our scores. We have code-based scores; these are more like binary conditions, and you can write them in TypeScript or Python. You can do that within the UI, as you see on the bottom left, or you can create the score in your own codebase and push it into Braintrust so we can use it in the platform, and other users who maybe aren't in the codebase can use that score as well. The other type of score we have access to is LLM-as-a-judge. This allows us to use an LLM to judge the output. We give it a set of criteria that indicates what a good, fair, or bad result looks like — you get to decide what that is — and it says if it's good, score a one; if it's bad, score a zero. This starts to create the scores we can use in both the offline and the online sense. The other thing to call out here is that we've internally built a package called autoevals. This is something you can pull into your project: out-of-the-box scorers, both LLM-as-a-judge and code-based, and it lets you get started very, very quickly.
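As a sketch of those pieces in TypeScript (the row contents and the `ends_with_period` check are made-up examples; `Levenshtein` and `LLMClassifierFromTemplate` are assumed to come from the autoevals package described above):

```typescript
import { Levenshtein, LLMClassifierFromTemplate } from "autoevals";

// A dataset row needs at least `input`; `expected` and `metadata` are optional
// but unlock comparison scorers and filtering later.
const row = {
  input: "Summarize the release notes for v1.2",
  expected: "v1.2 fixes the pagination bug and adds CSV export.",
  metadata: { source: "production-log", reviewed: true },
};

// Code-based scorer: cheap, deterministic, easy to reason about.
const endsWithPeriod = ({ output }: { output: string }) => ({
  name: "ends_with_period",
  score: output.trim().endsWith(".") ? 1 : 0,
});

// LLM-as-a-judge scorer: named choices map to numeric scores (excellent/fair/poor).
const summaryQuality = LLMClassifierFromTemplate({
  name: "SummaryQuality",
  promptTemplate:
    "Question: {{input}}\nAnswer: {{output}}\n\nRate the answer as Excellent, Fair, or Poor.",
  choiceScores: { Excellent: 1, Fair: 0.5, Poor: 0 },
  useCoT: true,
});

// Out-of-the-box string-similarity scorer from autoevals, e.g.:
// await Levenshtein({ output, expected: row.expected });
```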
Another thing I heard Ankur mention: starting with Levenshtein may not be the best score in a lot of cases, but again, it establishes a baseline with very little development work, and it creates something you can build from; now you have a direction to go in to build a more custom score.

Here are some tips we've heard from our customers that are important to think about. A lot of our customers use higher-quality models for scoring, even if the prompt uses a cheaper model. It just makes a lot of sense to run the application on the cheaper model but use the more expensive one to actually go score it. Also, break your scoring into very focused areas. The example I'll show is an application that generates a changelog from a series of commits. I could create one score that says "assess my accuracy, my formatting, and my correctness," or I could create three different scores that assess accuracy, then formatting, then correctness. Have your scores be very targeted to the thing they're supposed to be doing. Test your score prompts in the playground before use, and avoid overloading the score prompt with context; focus it on the relevant input and output.

A couple of things here. Over on the left we have our playgrounds. This is where we do that rapid iteration: we can pull in those prompts, pull in those agents, add our datasets, and add our scores, and we can click run, and it will churn through the dataset we've defined and give you a sense of how well your task is performing against it with the scores we defined. This is the place where developers and PMs — we even have a healthcare company with doctors coming into the platform — interact with the playground and even do human review as well; it depends a bit on the organization. The thing on the right is our experiments. This is our snapshot-in-time view of those evals. Imagine, as we're doing this development, we're trying to understand over the last month or so: are we getting better? The changes we're making, the model changes, whatever it is, are we improving our application? Experiments are a really great way to understand that. Also really important to call out: you can see in the bottom right that evals can run from the application, the Braintrust platform itself, as well as via the SDK.

Cool. Maybe a really quick demo, because nobody likes looking at slides all the time; I certainly don't. If you haven't seen Braintrust yet, this is maybe a good quick look. The idea is that I have this application; I'll show you over here. You give it a GitHub repository URL, it grabs the most recent commits, and then it creates a changelog from them. Once it completes, you can even provide some user feedback. This is the thing we want to evaluate. So what I can do is go into my playground; this is the place where I can start to run those experiments, or start to iterate on the prompt I have for my project. I've actually loaded in two different prompts.
Maybe before going into the playground: I've created these two prompts within my codebase and pushed them into the Braintrust platform. I've also created a dataset in that codebase, and I've also created some scores. These are the ingredients we need to run our evals. Now that we have those different components, we're able to start iterating here. So I'll create a net-new playground and load in one of these prompts. Here's my first prompt; it has a model associated with it. What becomes really cool here is the ability to iterate on the underlying model. A lot of us have access to a lot of providers, and we want to understand: if I change this, or if a provider adds a new model, what is the impact on my application? So I can duplicate this prompt and maybe change this one to GPT-4.1. Before I run it, I have to add all of my components, so I'll add my dataset and then the different scores I've configured for my changelog. Then I can click run, and now we'll see the effect of changing the model for this particular task, with the scores I've configured, against this dataset. This will churn through all of these in parallel and we'll start to get some results back.

There are lots of ways to look at this data. I always like coming over to the summary layout, because I can see: this is my base task right here, and then my comparison task. It looks like, on average, the base is performing a little better than the comparison on my completeness score; it's faring a little worse on my accuracy score; and both of them are at 0% on my formatting, so I probably have some work to do there. But you can start to see how you can use this type of interface to iterate very quickly.

The other thing, which maybe I shouldn't do but can't resist because we just released it today, is this new Loop feature. Imagine you're a user within Braintrust; before this, you would manually iterate here, creating net-new prompts, making modifications, changing the model. What if you could now use AI to go do that for you? In a sort of Cursor-like interface, we can ask it to optimize a prompt, and I think the really unique thing here is that it has access to those evaluation results. So when it goes to change that prompt, it changes the prompt, runs the evaluation, and understands whether it got better relative to the scores we defined. You can see it's going to go through here; it'll fetch some eval results. You'll probably see a diff here very soon. If we don't, I won't hang out here too long, but I do want to highlight one of the things we're releasing that really enables our users to iterate quickly. So here's my change; we can click accept, and then it will actually go out and run that eval again. Or it would — I think I have an issue with my Anthropic API keys. But the idea is that we can create a very rapid, iterative feedback loop here within the playground. The other thing is that we can run these as experiments.
Experiments are where we can start to create those snapshots in time of the eval, and again see, as I make changes to the application, how it has performed over time. I want to make sure I don't decrease the performance of my scores relative to the last time it ran, or relative to the last month, six months, whatever it is we're tracking.

Cool. So that was a very brief intro to evals via the UI. Just to summarize: we need a task, we need a dataset, and we need at least one score. We can pull those into the playground and start to iterate, and we can save the results via experiments. Now we have a way to understand how well this application is performing. It's no longer qualitative — no more "hey, I think this got better, that output looks better" — there's actual rigor behind it now.

Customers often ask, though: I don't really want to use the platform as much; I'd rather drive this from my codebase. Is that possible? It is. We have a Python SDK and a TypeScript SDK; there are some others as well — Go, Java, Kotlin — but for the most part our users are using Python or TypeScript. Here are a couple of examples of what this might look like from an SDK perspective. Actually, if you all aren't opposed to looking at some code, here's a really basic example of defining a prompt within my codebase and then pushing it into Braintrust, just leveraging the Python SDK. Another example over here: creating a score. You give it the things it's looking for, but now I've defined this score within my codebase, and it's version controlled. (The prompts you create within the UI are version controlled as well.) This is just another way to start interacting with Braintrust. So: scores, datasets, and prompts up here as well, I believe. Here's my eval dataset, here's my changelog prompt. Being able to start from the codebase and push them into the platform is possible; it just depends on where the organization wants to start.

That's the components side: define those assets in code, run a Braintrust push, and now you have access to them in your Braintrust library. The other piece is actually defining the eval code, which looks slightly different. Over here we have our Eval; this comes from the Braintrust SDK. It's looking for exactly the same things I just described: a dataset, which we can pull from Braintrust itself; the task we want to invoke; and the scores. Defining this within your codebase is certainly possible, and then I can run a command that actually runs that eval within Braintrust. From there, I go into Braintrust, see the eval running, and this becomes an experiment I can view over time. So you saw two different types of workflows here, catering to maybe two different personas, or to the way organizations want to work; it's up to them. Braintrust is very flexible in how we let our users consume the platform.
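A minimal sketch of that eval-in-code workflow in TypeScript (the project and dataset names and the `generateChangelog` task are hypothetical; `Eval` and `initDataset` are assumed from the Braintrust SDK, and `Levenshtein` from autoevals):

```typescript
// changelog.eval.ts -- define the eval next to the application code.
import { Eval, initDataset } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Changelog generator", {
  // Pull the dataset that was pushed to (or curated in) Braintrust.
  data: initDataset("Changelog generator", { dataset: "eval-dataset" }),
  // The task is the application's own code.
  task: async (repoUrl: string) => generateChangelog(repoUrl), // hypothetical app function
  // Scores: out-of-the-box or custom.
  scores: [Levenshtein],
});
```

Running this through the SDK's eval command (something like `npx braintrust eval changelog.eval.ts`) records the run as an experiment you can then compare over time in the UI.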
I probably jumped ahead a little bit, but this is a recap of what I just showed you. From your code, you create your prompts, your scores, your datasets, and you can push them in. Maybe it's important to highlight why you would do this: you want to source-control your prompts, and the big one to call out is online scoring (I have a section diving deeper into that in a bit). If you want to use the scores you've defined against production data, you should push them into Braintrust so we can create online scores and understand how the application is performing in production relative to the same scores we're using in our offline evals. The other piece is what I just showed you: the eval itself within our code, defining that dataset, that task, those scores. It becomes very easy to connect these two things; it's just up to you to decide where you want to do it. The other thing to call out here is that this can be run via CI/CD. We do have customers who want to run their evals as part of the CI process, so they understand in a more automated way whether the scores for A, B, and C, whatever they've configured, have gotten better or worse; it becomes a check as part of CI. If you look in our documentation, there's a GitHub Action example that shows how you could set this up.

Cool. Let's move to production. Moving to production entails setting up logging; it entails instrumenting our application with Braintrust code: being able to say, I want to wrap this LLM client, I want to wrap this particular function when it goes to call that tool. It's very easy to do. But why should you do it? I've probably said it numerous times here, but we want to measure quality on live traffic. We actually want to understand how well our application is performing against those scores. They're great during offline evals — our aid in ensuring we build really good applications and don't create regressions — but it's also really important to monitor live traffic. The other really important thing is the flywheel effect it creates. We have these datasets we use to inform our offline evals, and it's very easy to take the logs generated in production and add them back to those datasets. This also speaks to the human-review component, where we want to bring humans in: they can review the logs that are relevant (maybe user feedback equals zero, maybe there's a comment, whatever it is), filter down to those particular things, and, as they find interesting test cases, very easily add them back to the datasets we use in our offline evals. I think the feedback loop, the flywheel effect this creates, is one of the really fundamental value props of the platform.

So how do we do this? There are a couple of different ways. We're first going to initialize a logger. This just authenticates us into Braintrust and points us at a project. You may have seen, when I opened up the platform, that I had numerous projects in there.
You can almost think of a project as a container for a feature. You probably have multiple AI features you're building: I want a container for feature A, for its prompts, its scores, its datasets. You can certainly use those things across projects, but it's a really good way to containerize what's important for that feature. Then you can start really basic: you can wrap an LLM client. When you saw some of those metrics earlier — tokens, duration, cost — very basically, in the code here, I just wrapped the OpenAI client, and now I'm ingesting all of those metrics into my logs. That's the easiest way to get started. You'll probably want to do a little more: maybe you want to understand when the LLM invokes a tool, so I want to trace it; I can add a trace decorator on top of a function. I can even use some of Braintrust's low-level span elements to create custom logs, where I customize the input, the output, and the metadata we log to that span. So you can start very basic by wrapping a client and then go all the way down to the individual span, specifying its input and output (there's a small sketch of these pieces just below).

This leads us to online scoring. I talked a little bit about this: as our logs come in, we can configure, within the platform, the scores we want to run, and we can specify a sampling rate, so we don't necessarily run that score across every single log that comes in; maybe it's 10%, 20%, and so on. But it creates that really tight feedback loop I've been talking about. It's also worth mentioning early regression alerts: we can create automations within the Braintrust platform, so if my score drops below a certain threshold, we create an alert with our automation feature. And custom views: there's a lot of really rich information in these logs, and it becomes really important, again for the human-review component, to filter them down to the things people care about. We can create custom views within Braintrust with the appropriate filters, and then it's very easy for that human to go into what we call human-review mode and parse through the logs that are going to be most meaningful to them.

Let me connect some of those dots. Again, showing you some code — maybe good, maybe bad, but I'm guessing there are some technical people in the room who don't mind. You may have seen in one of those slides the Vercel AI SDK: I want to wrap the AI SDK model. This allows us to create all of those metrics within Braintrust with zero lift from us as developers; it becomes really easy to do. You can also see where I've specified the span itself: I actually want to define its inputs and outputs. The reason you would do that is because you have a specific dataset with a structure that you want to make sure this maps to.
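Here's a small sketch of those instrumentation layers, assuming the Braintrust SDK's `initLogger`, `wrapOpenAI`, and `traced` helpers (the project name and the `getCommitsFromGitHub` helper are hypothetical; there's a similar `wrapAISDKModel` wrapper mentioned above for the Vercel AI SDK):

```typescript
import { initLogger, wrapOpenAI, traced } from "braintrust";
import OpenAI from "openai";

// Point logs at a project -- the "container for the feature" described above.
const logger = initLogger({ projectName: "changelog-generator" }); // hypothetical name

// Easiest start: wrap the LLM client so tokens, latency, and cost land in the logs.
const client = wrapOpenAI(new OpenAI());

// One level deeper: trace an individual function (e.g. a tool call) as its own
// span, customizing the input and output that get logged.
async function fetchCommits(repoUrl: string) {
  return traced(
    async (span) => {
      const commits = await getCommitsFromGitHub(repoUrl); // hypothetical helper
      span.log({ input: repoUrl, output: commits });
      return commits;
    },
    { name: "fetchCommits" }
  );
}
```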
So when you are within those logs parsing through them, it becomes really easy to add those spans back to that data set. Ensuring that the data structure is consistent across offline and online becomes really important, again to create that feedback loop. So that's a very high-level view of how we can start to create those spans. Now that we do, we can go into the platform and start to configure our online scoring. Here, within this configuration pane, I can click online scoring and create a new rule. Here's where we can add different scores. Obviously I have a few here that I've been using for the offline evals. I don't necessarily need to select all of them, but I certainly can. And then I want to apply a sampling rate. I want to give you an example of what this looks like, so I'm going to do 100%. The other thing to call out here is that you can apply these to individual spans themselves and not just the entire root span. Where this becomes beneficial is when you are invoking tool calls or a RAG workflow and you actually want to create a score on whether the thing it gave back is relevant to the user's query. We can create a score specifically for that and specify what that span is here. So it's very flexible in how you apply these scores to the things that are happening online. Now when I come back to the application and run this again, creating that change log, you'll start to see this show up within the logs, and then you'll start to see these scores be generated. And again, this is where you can start to understand over time in production how these things are doing. How are they faring? Where can we get better? Now we can connect the things that are happening in our offline evals with the things that are happening online. The other thing to call out here is the feedback mechanism. We certainly have the ability to do human review, but oftentimes you want your users to provide feedback as well. This is just a basic example of a thumbs up / thumbs down; you can even provide a comment here, and this can be logged to Braintrust. So I should see over here my user feedback: here's my comment, and then I have my user feedback score. But now I can also do something like this: maybe I want to filter my logs down to where user feedback is zero. So I click that button and change this to zero. I don't have any rows like that yet, but now I can save this as a view, and people who are using this for human review can filter down to where user feedback equals zero and figure out what's going on: what are the things that fell down within this application that we need to go fix? The other thing I'll highlight here is the human review component. You click that button, or you can just hit R, and it opens up this different pane of your log. It's a pared-down version of what you just saw; it's a little bit easier for a human to go through and actually look at that input and that output. And you, as a user of Braintrust, can configure the human review scores that you would like to use. So I have this "add something here"; maybe this is a little bit more free text.
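For the thumbs up / thumbs down feedback shown a moment ago, here is a hedged sketch of how it might be attached to a logged request, assuming the SDK exposes a logFeedback method on the logger (the span-id plumbing and the score name are illustrative):

```typescript
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "changelog-generator" });

// Hypothetical handler for the thumbs up / thumbs down widget. `spanId` is the
// id of the logged request being rated, captured when the request was originally logged.
async function onUserFeedback(spanId: string, thumbsUp: boolean, comment?: string) {
  logger.logFeedback({
    id: spanId,
    scores: { user_feedback: thumbsUp ? 1 : 0 }, // powers the "user_feedback = 0" filter
    comment,
  });
}
```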
I have a "better" score. Again, these are the things you can add in the platform that map to the scores you want your humans to add to those logs. Just really quickly, I'll highlight some of these things here. This is what it starts to look like when you instrument your application with those different wrappers or trace functions. I'm able to understand, at a very granular level, the things it's doing. I essentially have these tool calls where it's going out and grabbing the commits from GitHub; it's understanding what the latest release is, fetching the commits from that latest release, and now I can generate that change log. But the really unique thing here, and maybe a different example of this, is that I can start to score the individual things that are happening. So this is a different application as an example. If I open this up, I have this conversational analytics application: a user can ask a question and it returns back some data. But this application goes through various steps. The first step is to rephrase the question that the user asked. Imagine there's this chat history we load in as input, and the LLM needs to rephrase that user question. If the LLM does a really bad job of rephrasing the question, everything downstream of it will fall down; we're probably not going to get a right answer. So what I can do is create a score specifically for that span to understand how well the LLM did in rephrasing that question. I can also understand the intent that the LLM was able to derive from that question; is that right? As you start to think of the more complex types of applications that you build, you need to be able to understand the individual steps that are happening, and Braintrust allows for that very easily via these scores, and then lets you apply them not only while you're in offline-eval mode but also online. We want to understand these logs and be able to apply these scores at the individual span level; this becomes pretty powerful as well. I think I actually stole a little bit from my next section, my human-in-the-loop walkthrough. Maybe just another call-out if you happened to be at one of our workshops on Tuesday: Sarah from Notion, who's a Braintrust customer, talked a little bit about how they think about human in the loop. I think it's important to consider the size of her organization and what they're doing. She mentioned that she has a special type of role that they use for human-in-the-loop interaction; it's almost like a product manager mixed with an LLM specialist. They're the people who are going through and doing those human reviews. For smaller organizations, she made the comment that it actually makes a lot of sense for some engineers to go through and do this as well. It becomes really powerful to pair the automation with the human component of this; this is not going to go away. I think it adds value to the process. Again, I think I just stole from myself the "why this matters": this is really critical for the quality and the reliability of your application, and it provides that ground truth for what you're doing.
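The rephrase-the-question step described above is the kind of span that can get its own LLM-as-a-judge score; here is a rough sketch using autoevals' LLMClassifierFromTemplate (the prompt wording and choice scores are invented for illustration, and this scorer would be pointed at the rephrase span when configuring online scoring):

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// Judges whether the standalone rephrased question preserves the meaning of the
// user's original question given the chat history.
const RephraseQuality = LLMClassifierFromTemplate({
  name: "RephraseQuality",
  promptTemplate: `Chat history:
{{input}}

Rephrased standalone question:
{{output}}

Does the rephrased question faithfully capture what the user was asking?
a) Yes, it is faithful and self-contained
b) No, it loses or changes important details`,
  choiceScores: { a: 1, b: 0 },
  useCoT: true,
});
```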
Two types of human-in-the-loop interactions here. I walked you through human review. Give me one second and I'll call on you. The two types: human review, being able to create an interface within Braintrust that allows a user to parse through the logs in a really easy manner, as well as configuring scores that allow them to add the relevant scores to a particular log; and then user feedback, which is actually coming from our users in the application. Again, being able to create views on top of that feedback then powers the human review and creates the flywheel effect that we want. That's all I have today. Appreciate you all coming out here and listening to me. But yeah, you have a question. Thanks. Yeah, the question is around how we're using human review and some of the logs to inform the offline-eval portion of this; that's largely the goal. One thing I maybe didn't highlight here: back within Braintrust, I'm going to go back to my initial project. Imagine now that we have all of these logs and we've filtered them down to a particular... oh, are we still showing on the screen? Awesome, thank you. So imagine we have this process now where we're doing that human review and we've filtered down to the records that are meaningful for whatever reason. It becomes really easy again to connect what's happening within production: maybe I select all of these rows, or I select individual rows, but I can add these back to the data set that we're using within those offline evals. I think I've said this a hundred times over this conference: this flywheel effect is what's missing oftentimes when we're building these AI applications, and it's what Braintrust allows for really seamlessly. Yeah. I have two questions. The first one is about production: is it possible to have multiple models in production and compare how they behave? Yeah, I don't see why not. My guess is that in the underlying application you're swapping them out. Like having A/B tests, you know, I can have two or three or four and easily compare? Absolutely. Yeah. Let's see if I have an example here. You're able to group some of these scores; maybe this is an example of what you're talking about. So maybe within production we have different models running, and this view here allows us to understand the models that we're using under the hood. You could do this within production as well and do that A/B testing. Cool. My second question is about humans in the loop. Let's suppose that I have multiple humans and they behave slightly differently as scorers. Do you have anything, or what is the vision, for that? Is there a way I can compare how they're scoring, or not really? So different users can have different criteria for scoring. Maybe the first thing I would say to that is there should be a rubric for your users who are interacting with human review so you're not creating that divergence. You certainly have the ability to see who is scoring different things within the platform.
I'm not sure if you're able to pull that out as a data set to assess the differences there, but maybe before it gets to that place, have a rubric, have a guideline of what scoring looks like for your humans. Okay, thank you. Yeah, of course. Hi. So the scores I'm used to working with for LLM-as-a-judge are relativistic, right? They can't tell you whether answer relevancy is good or bad for a single run, but they can tell you how it compares to a previous iteration on the same test set, for example. Do you use LLM-as-a-judge scores online, and are they relativistic like that, or do you have some way to say this is a good answer in and of itself for this sample? Because you have new data coming in, right? Yeah, I think a lot of our customers who are thinking about this are almost doing evals on their evals: trying to understand whether the LLM as a judge actually did a good job there. When that score actually runs, there's a rationale behind it, and so you can run an eval of those LLM-as-a-judge scores. I think Sarah from Notion, in our workshop, described a process like that within Notion, but that's where I would aim it. Okay, cool. Awesome. Are any of your customers doing evals before they launch? I'm working with a government; they don't want to launch until we show some accuracy levels. So we're getting our subject matter experts to enter in all the questions that they have; they have huge data sets of thousands of questions, believe me, as a government. And then we're using measures like you're talking about. Do you have a way to do that? I guess it's the same, is it? So what you're describing is what we call offline evals. This is development: we can do this testing before we get into production. This is what I was talking about with establishing that baseline, using those scores, using that data set that you've already created, and this all happens before we get into production. And then, like one of the things I heard from somebody earlier, one of the challenging parts of building an AI application is establishing trust, creating trust in this thing. That's part of what this is: it's showing those people the scores of that application. You start to iterate on this thing; maybe it starts at 20%, then it goes to 30, then 40, and so on. That, to me, is the thing you use to create that trust and create the groundswell to push it into production. Okay. Yeah, that is what we're trying to do, but I wondered if the tool does that. Thank you. Yeah, of course. Time for one more. Thanks. Quick question. I love the CI/CD components. We're trying to build ML as a platform for our team, so we get into evals and stuff like that. How much of the monitoring dashboard you have in Braintrust can actually have its data taken out and posted in a unified dashboard somewhere else? Yeah, all of this is available via the SDK: you can pull down experiments, you can pull down data sets.
So you're able to pull this down. We have a customer that is actually building their own UI on top of the SDK itself: they built their own components utilizing the SDK, pulling the things that we've logged and the experiments that we have in the application into their own UI. So certainly possible. Cool, thanks everybody. [Applause] CEO as well as... I don't... Yeah, I'm going to go. All right. Yeah. Check check check. Hey, nice job. Good to see you. Not too bad. Was tolerable. Okay, great. It's time. Yeah. Okay. So, a couple of tips on your video, and testing one, two, three. Awesome. So make sure you have no copyrighted material in your slides, because it is being streamed on YouTube, and a couple of people had a slide with a stock photo. I don't think we have any stock photos. Great. Are you both speaking? He's speaking. I'm the handler. Oh, okay. Great. The other thing is, because we are streaming, it's mostly a slide and then I have a closeup of you, so hang around the podium. I will. You know, some people like to pace. No, I don't like to pace. But what happens is it looks like this. I'm just going to put my backpack over here. Oh, over there. Perfect. Yeah, I'm not cool enough to pace. And we also start on time, out of respect for the next speaker, and that includes the Q&A. Excellent. I don't think there's any Q&A. We'll see. And if somebody does shout something from the audience, because people do, just incorporate the question in your answer, please; somebody on the stream might have the same question. You are? Yeah, I'm John. Good to meet you. Yes. Let's make sure the computer's working. Okay. And I do have a hard line for the internet. Great. I don't think I need any internet. Let me hide some windows before it resizes again. It's not sharing right now, is it? No, it's not. Okay. I just want to hold. Perfect. Okay. What's up, Alan? Thank you. Oh, hey. I saw Sydney yesterday or the day before. Yeah. How's the conference been for you guys so far? Turn this off basically right now. Right. Great. Nice. Slip it in the pocket. Oh, yeah. Like you guys on TV; it's all you see, like this. Mr. President, get down. Yeah. Thank you. Are you guys doing an event tonight? Hello. Hello. Okay. Is it related to the conference? Okay. Okay. Nice. Oh, I'm on now. Well, okay. Whatever. Okay, I'll stop the small talk. So, how many people saw Manu's keynote? Nice. I'm Manu's less cool brother. Awesome. All right, welcome y'all. Welcome again to the second part of the evals track here at the AI Engineer World's Fair conference. We're doing something interesting in this room where I've actually asked the AI to write up the introductions for our speakers. So, without further ado, I've asked Gemini 2.5 Flash Preview to introduce Ankur Goyal here. It says: please welcome the CEO of Braintrust, drawing from his extensive experience including leading the product for Google Cloud AI Platform. I don't think that's true. So Ankur is here to share invaluable lessons from the eval front lines for every AI engineer. So without further ado, thank you, Gemini. Awesome. Really excited to share some stuff with you all today. I'm going to talk about five hard-earned lessons that we've learned over the last year or so about evals.
So first, just a little bit about us and where we learned some of these lessons. At Braintrust we build a product that helps people do evals and observability. We work with some really great companies that are building products in the space. We pulled up some stats, and I find some of these numbers kind of interesting. Across all of our users, and this includes free users who just sign up to use our free product, the average org runs almost 13 evals a day. I think that's pretty cool. It's really what our product is about, but it's great to see just how much people are evaluating stuff. Some of the most advanced organizations that we work with are actually running more than 3,000 evals a day; they're really testing and experimenting with quite a bit. And some of the users who are deep into evaluation are spending around two hours a day in their data, understanding what's working and what's not working. It just speaks to how much impact evaluation can have when you really take it seriously and build it into your process. So with that said, let's talk about some of the interesting things we've learned over time. The first thing: I think it's super important for you to understand and define whether evals are actually providing value for your organization or not. I tried to come up with three signs that you should look for. The first is: if a new model comes out, you should be prepared, via your evals, to launch an update to your product within 24 hours that incorporates the new model. Sarah from Notion talked about this yesterday, specifically that for the past several model releases, every time something comes out, Notion has been able to incorporate the new model within 24 hours. I think that's a really good sign of success. If you can't do that, it means you have some work to do on your evals. Another sign of success: if a user complains about something, do you have a very clear and straightforward path to take their complaint and add it into your evals? If you do, then you have a shot at actually incorporating user feedback, pulling it into your evals, and ultimately doing better. If you don't, then you're going to lose a lot of valuable information into the ether. So again, I think this is a really important threshold or milestone to hit. And the last one, which I'm going to talk about a little bit more throughout the presentation: you should really start using evals to play offense and understand which use cases you can solve, and how well you can solve them, before you actually ship things; not like unit tests, which only allow you to test for regressions. And so if you really adopt evals, then before you launch a new product you have a really good idea of how well the product might work, given what your evals say. The second lesson is that great evals have to be engineered. They don't just come for free with synthetic data sets and random LLM-as-a-judge scorers that you read about online. There are maybe two ways of thinking about this. There's no data set that is perfectly aligned with reality. In the cases where there is, there's basically nothing to do and the use case already works; there are a few that are kind of like that, like solving competition math problems, for example.
But for most real-world use cases, any data set that you can come up with ahead of time is not going to represent what users are actually experiencing. I think the best data sets are those that you can continuously reconcile as you actually experience what happens in reality, and doing that well requires quite a bit of engineering. Of course Braintrust can help you with that, but the point is that you have to think about a data set as an engineering problem, not just something that's given to you. And the same is true with scorers. A lot of people we talk to ask, hey, what scorers does Braintrust come with, and how can we use those so that we don't need to think about scoring? We actually have a really powerful open-source library called autoevals, but it's open source and flexible for a reason, which is that every company we work with that's sufficiently advanced is writing their own scoring functions and modifying them constantly. One way to think about scorers is that they're like a spec, or a PRD, for your AI application. If you think about them that way, one, it actually justifies making an investment in scoring beyond just using something off the shelf; and two, hopefully it's fairly obvious that if you just use an open-source or generic scorer, that's a spec for someone else's project, not yours. There's been a real shift towards context in prompts that's not just the system prompt that you write. I actually think traditional prompt engineering, and people say this in different ways, is evolving quite a bit, and it's very important to think about context, not just a prompt. So this is an example of what a modern prompt looks like for an agent: usually you have a system prompt and then a for loop which runs LLM calls, issues tool calls, incorporates the tool results into the prompt, and then iterates and iterates. I actually took a few trajectories from agents that we see in the wild and summarized these numbers, and as you can see, the vast majority of the tokens in the average prompt are not from the system prompt. So yes, it's very important to write a good system prompt and continue to improve it, but if you're not very precise about how you define tools and how you define their outputs, then you're leaving a lot on the table. And one of the most important things we've learned together with some customers is that you can't just take tools as a reflection of your APIs or your product as it exists today. You have to think about tools in terms of what the LLM wants to see, and how you can use exactly what you present to the LLM to make it work really well. In most projects it's actually very disruptive when you write good tools; it's not something that's just an API layer on top of the stuff you already have. And the same is true with their outputs. There's one example we worked on recently for an internal project where shifting the output of a tool from JSON to YAML actually made a significant difference. I know that's a little bit of a meme in the AI universe, but it's just so much more token-efficient and easier for an LLM to look at YAML-shaped data while doing analysis than extremely verbose JSON.
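As a small illustration of that JSON-versus-YAML point, here is a sketch that serializes the same hypothetical tool result both ways and compares rough footprints (js-yaml is assumed for the YAML serialization, and the token estimate is a crude character-based proxy, not a real tokenizer):

```typescript
import yaml from "js-yaml";

// The same tool result, serialized two ways before being appended to the prompt.
const toolResult = {
  rows: [
    { region: "EMEA", revenue: 1204331.55, growth_pct: 12.4 },
    { region: "APAC", revenue: 983220.1, growth_pct: 8.9 },
  ],
  generated_at: "2025-06-04T12:00:00Z",
};

const asJson = JSON.stringify(toolResult, null, 2);
const asYaml = yaml.dump(toolResult);

// Crude proxy for token count (chars / 4); use a real tokenizer in practice.
console.log("JSON chars:", asJson.length, "~tokens:", Math.round(asJson.length / 4));
console.log("YAML chars:", asYaml.length, "~tokens:", Math.round(asYaml.length / 4));
// YAML drops the braces, quotes, and most commas, which is where the savings come from.
```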
Now, if you're writing code and you're plugging something into a charting library, it makes no difference, because to JavaScript, YAML and JSON are both structured data. But to an LLM, they're very different. So I think you have to be very thoughtful about how you construct the definition of a tool and how you construct its output for the LLM to maximally benefit from it. One of the most important things we've learned, and I would actually credit some of the folks at Replit for really pioneering this pattern, is that every time a new model comes out, everything might change. You need to engineer your product, engineer your team, engineer your mindset so that when a new model comes out, if it changes everything for you, you can jump on that opportunity and ship something that maybe wasn't possible before. I'm going to show you some numbers for a product feature that we're actually launching, and I'm going to show you a little bit of it today. We've had an eval for a while that tells us how well this feature might work, and we run it every few months, and you can see it wasn't that long ago that GPT-4o was the best model out there. But things have changed: progressively, GPT-4.1 did a little bit better, Claude 3.7 Sonnet is much better, and Claude 4 Sonnet is even more remarkably better. What that's meant for us is that this feature, which at 10% would really not be viable for our users, suddenly becomes viable. Claude 4 Sonnet actually came out two weeks ago, and we're shipping the first version of this feature today, just two weeks later. We were able to jump on that opportunity because we ran this eval; we were ready to do it, and we saw that, okay, great, we've actually finally crossed this threshold. So everyone that I personally work with or talk to, I encourage to create evals that are very ambitious and likely not viable with today's models, and to construct them in a way that when a new model comes out, you can just plug the new model in and try it. In Braintrust we have this tool called the Braintrust proxy. There are a lot of similar tools; you could use ours or you could use something else. But really the point is that you don't need to change any code to work across model providers. So, you know, Google just launched the newest version of Gemini. Actually, Gemini 2.5 Pro 05-20 scores 1% on this benchmark, so we didn't even put it on here. But maybe the thing they launched today actually does a lot better; we can find out with just a few keystrokes, maybe right after this talk. And the last thing: it's super important, if you think about optimizing your prompts, to optimize the entire system. That means thinking holistically about your AI system as the data that you use for your evals, the task, which is the prompt, the agentic system, tools, etc., and the scoring functions. Every time you think about making your app better, you need to think about improving this overall system. We actually ran a benchmark, the same benchmark that I showed previously.
It auto-optimizes prompts using an LLM, and we ran it once by just giving it the prompt and saying, hey, please optimize the prompt, and a second time giving it the prompt, the data set, and the scores and saying, please optimize this whole system. And you can see there's a very dramatic difference; again, something goes from unviable to viable. It's just super important to optimize the entire system, not just the prompt. And actually, this is a new product feature that we are starting to launch today. If you're a Braintrust user, you can go to the feature flag section of Braintrust and turn on a new feature flag called Loop. Loop is this amazing, cool new feature that auto-optimizes your evals directly within Braintrust. You can work in our playground and give it a prompt, a data set, and some scores, and it can actually create prompts, data sets, and scores too, and just work with it. The kinds of things that we've seen work really well are: optimize this prompt; or, what am I missing from this data set that would be really good to test for this use case; or, why is my score so low; or, why is my score so high; or, can you please help me write a score that is harsher than the one I have right now? You can also try it out with different models. As you could see from this, we've definitely seen the best performance with Claude 4 Sonnet and Claude 4 Opus; Opus performs a couple of percentage points better. But we encourage you to try it out with different models: you can use o3, you can use o4-mini, you can use Gemini, and if you're building your own LLM or fine-tuned model, you can try that as well. We're very excited for this. I'm going to talk about it a little bit later, and I'm happy to cover it in Q&A as well, but I really think that the workflow around evals is going to dramatically change now that LLMs are capable of looking at prompts and looking at data and actually making constructive improvements automatically. A lot of the manual labor that went into iterating with evals doesn't need to be there anymore. So it's really exciting; we're excited to ship this and start to get some feedback. So just to recap, five lessons that I think are really important. Effective evals speak for themselves; it's important to understand whether you've reached a point of eval competence in your organization or not. It's okay if you haven't. It's not easy, but it's important to be honest about that and work towards it. When you're working on evals, it's very important to engineer the entire system. So don't just think about the prompt; don't just think about improving the prompt. Please don't just use synthetic data or Hugging Face data sets; I know they're awesome, but please use more than just that. Please don't use off-the-shelf scorers only; write your own. Think very deliberately about how you can craft the spec of what you're working on into your scoring functions. Think very carefully about context. In particular, what helps me personally is to think about writing tools the way I would think about writing a prompt: it's my opportunity to communicate with an LLM and set it up for success, and how I define the API interface of the tool and how I define its output has a very dramatic impact on that.
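One way to read "scorers are a spec or PRD for your application" is to encode concrete product requirements directly as a scoring function; here is a minimal hand-rolled sketch, with invented requirements standing in for a real spec:

```typescript
// A hand-rolled scorer in the shape Braintrust and autoevals expect: a function
// that receives an eval row and returns a name and a score between 0 and 1.
interface ScorerArgs {
  input: string;
  output: string;
  expected?: string;
}

function ChangelogSpec({ output }: ScorerArgs) {
  // Invented product requirements acting as the "spec":
  const checks = [
    output.length <= 1200,                      // stays concise
    /^#+\s/m.test(output),                      // uses markdown headings
    !/as an ai language model/i.test(output),   // no boilerplate disclaimers
  ];
  return { name: "ChangelogSpec", score: checks.filter(Boolean).length / checks.length };
}
```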
Make sure that you're ready for new models to come out and to just change everything. If a new model comes out, you want to be prepared to know that immediately, ideally the day that it comes out, and also be prepared to rip out everything and replace it with a fundamentally new architecture that takes advantage of that new model. Part of that is obviously having the right evals; part of it is engineering your product in a way that actually allows you to do that. And then finally, when you think about optimizing or improving your eval performance, you have to think about optimizing the whole system: the data and how you get that data, the task itself, which is the prompt, tools, etc., and the scoring functions. And with that, we have some time for Q&A. Yeah, there are two microphones up here, one on the left side, one on the right side; feel free to stand up and ask your questions. Hi, this is Joti. One of your slides said take feedback and turn it into an eval. Are you concerned about overfitting evals at that point, where every piece of feedback then turns into an eval? Oh, that's a great question. Also nice to see you. So the question was: one of the slides was about taking feedback from real data, adding it to a data set, and incorporating it into an eval; are you worried about overfitting? And I think the answer is that I'm actually way more worried about overfitting to the data set without the user's feedback than I am about adjusting the fit to incorporate the user's feedback. The most important thing about a data set is not the state of the data set at any point in time; it is how well you are equipped to reconcile the data set with the reality that you want. And I actually think one of the things that we discourage in the product, and some people complain to us about this, I get it if you're one of those people, is that we don't automatically take user feedback and add it to data sets right now. We actually want a human who has some taste, and maybe can build some intuition about the problem, to find the data points from users that are interesting and add them to the data set. I think that is your opportunity as a user to apply some judgment: oh, okay, this user is trying to do something that should obviously work; it's really sad that it doesn't work in my product; let me add it to the data set so I can make sure it does. Excuse me, you had a slide, I think about the tool descriptions, with some percentages on it. Yeah, this one. What is that? Yeah, so we took a few agents, since we have a lot of traces, and we analyzed the relative number of tokens for different message types. So the system prompt is one message type. Tool definitions are the spec of what tools the model can call. User and assistant are tokens from user and assistant text interactions, and then tool responses are the tokens that the tool generates itself. Oh, this is the percentage of tokens? Correct, this is the relative percentage of those tokens. Yeah. So the point that we're trying to make here is that in modern agentic systems, tools very significantly dominate the token budget of the LLM, and it's very important to think about how you define the definition of tools and how you define their outputs so that you engineer the LLM for success,
not just take your GraphQL API and hand it over as a bunch of tool calls to the LLM. First off, that point about the thumbs down is such a good point. I'm working with the government, and people don't like the answer they got, for example about taxes, and they give it a thumbs down. Yeah. Right. So adding that human aspect is a really good idea. We actually even added a little thing that said: the answer is right, but I just don't like it. That's awesome. But my question is about your point that a new model changes everything. We've updated our models several times, and we use Claude and OpenAI, and we haven't found huge differences, other than recently someone really cheap wanted to use 4.1 mini, and it seemed to ignore everything; I swear it ignored the system prompt completely. Yeah. But what kind of things, when you say it changes everything, can you tell me a little more about what kind of changes you're seeing? For sure. I think the use case that we just shipped with Loop is a really good example of that. This is a very ambitious agent: it's looking at prompts and data sets and scores and automatically optimizing the prompts based on the data sets and scores. And this is something that we wrote a benchmark for a while ago and ran with every consecutive model launch, and the numbers looked more like what you see for GPT-4o for a very long time. This isn't true for every benchmark. As part of this exercise, we actually have a bunch of evals that Loop optimizes; that's our eval set. And there are some evals, like taking movie quotes and figuring out what movie they're coming from, that have worked really well since GPT-3.5. So there are certain use cases where it just doesn't matter. There are other use cases that are so ambitious that they just don't work today. And I think you want to create evals so that if there's something ambitious that you want to do in the future, you are very well prepared, when a new model comes out, to just push a button and find that out. Okay, thank you. So we are out of time, but I'm sure Ankur would be happy to take more questions outside as well. For sure. So, in the interest of time, we'll call it done on that one. And yeah, if you don't mind. All right, let's hear a round of applause. So, while you get going, I'll introduce you. All right. So, in case people are not aware, I fed the information about the speakers to an AI, and we're doing an eval on the output of the AI. So here I have GPT-4o mini, and it says: please welcome John Dickerson. It says it's pronounced "Dick Earth son." I don't know why, but okay. The CEO of Mozilla AI, exclamation point, with a rich background as a co-founder of Arthur AI. Please let us know if it's true or not. That is true. And a former tenured professor, John is at the forefront of building an open-source AI ecosystem and has extensive experience in responsible AI practices. Today, he'll share insights on the urgent need for robust AI evaluations in enterprises, drawing from a decade of experience in the field. That's roughly right. Thank you. All right, cool. Thanks, everyone, for being here. I'm going to give this talk mostly from the point of view of having been a co-founder and chief scientist at Arthur AI, where I was for six years prior to joining Mozilla AI as CEO.
I do want to say Mozilla AI operates in the open-source world, where we're providing open-source AI tooling and supporting the open-source AI stack. Our end goal is to enable the open-source community to be at the same table as a Sam Altman when talking about AI moving forward. So if you're interested in that, that's not what this talk is going to be about, but we can talk about it offline. This talk is going to be about 2025 finally being the year of the evals. And as was written, or as was spoken, in my introduction, I've been in the space for a very long time. Arthur AI, for example, was, and still is, in observability, evaluation, and security, in both traditional ML and AI, and then into the deep learning revolution, then the GenAI revolution, and then the agentic revolution. And I think we're finally at the point where all of these companies are going to start seeing hockey-stick growth, which is exciting. So the thesis of this talk: one thing is that I see AI/ML monitoring and evaluation as two sides of the same sword, or ruler. You can't do monitoring or observability without being able to measure, and measurement is the core functionality of evaluation. This was not top of mind with the C-suite until two things happened concurrently. One is that AI became a thing that people who aren't a CIO or CTO could understand; the CEOs, the CFOs, the CISOs began to understand it, basically when ChatGPT came out. And simultaneously there was a perfectly timed budget freeze across the enterprise, at least in the US, that happened due to a fear of an impending recession. This was right before ChatGPT launched; this is like October, November, when most enterprises set up the budget for the next year. At that time there was a freeze, except for money that could be opened up for a specific pet project, and that pet project, because CEOs and CFOs then knew about it, was GenAI. So that happened, and now we have the final vertex on this triangle, which is going to force evaluation to be top of mind this year: we have systems that are now acting for humans, acting for teams, as opposed to just providing inputs into larger systems. So these three things together are why, as you saw with Braintrust, as we're seeing at Arthur, and as you're seeing with Arize AI or Galileo and other big players in this space, we're starting to see big takeoff. Cool. So right now, this is what's happening: the year of the agent. We all hear about it; we're hearing about it here at this conference. Agents are starting to make decisions and take actions, complex steps that lead toward an action, either autonomously or semi-autonomously. As a question on the last talk brought up, bringing humans into the loop is still obviously a very good idea in many systems, but we're getting closer and closer to full automation, and agentic systems are going into deployment now, in enterprises, in SMBs, in pet projects, and so on. And what that means, per that last slide, is this is also the year that we need evals. Flash back to a year ago, and every year before, when we were asking: hey, is this the year of ML? Is this the year of evaluations? Prior to these agentic systems coming out, we would have machine learning models basically spitting out numbers that would then be ingested into a more complex system.
And that complexity would sort of erase the top-of-mind need to think about what's coming out of the model itself. Except for people in this audience; we know that's very important. But when it comes to decision makers, it would often get wrapped up into this opaque box, and that meant the ML part didn't really bubble up beyond the CIO or whatever org the system was going into, which meant that typically that year was not the year for evals, because the need for evals was not obvious to the entire C-suite. Cool. So let's take a quick step back in time. Before November 30th of 2022, the ChatGPT launch, ML monitoring was certainly a thing. Data science teams have long used statistical methods as part of larger systems to understand what's going on; this is core. Like I mentioned, though, there's a tenuous connection to downstream business KPIs, and at the end of the day, that's what gets your product bought in the enterprise: being able to make a sale about dollars saved or dollars earned. So it was hard to connect the machine learning components specifically to a downstream business KPI. There was a lot of lip service around AI/ML and its ROI from the C-suite, including CIOs and CEOs, but that was just lip service; in our experience, at least, it was still basically selling into the CIO, which made it hard to sell outside of that. Now, obviously, this is a large space. This has been happening since about 2012, I would say, which is when AI/ML monitoring really started up, with H2O and Algorithmia and Seldon, sort of the first generation of these companies, and then WhyLabs, Aporia, Arize, Arthur, Galileo, Fiddler, Protect AI, and so on and so on. I've put the cutoff here at about mid-2022, sort of before the GenAI revolution happened; there have obviously been companies founded after that, and we just saw Braintrust talking as well. And then the big players here too: Snowflake, Databricks, Datadog, SageMaker, Vertex, Microsoft's products, and so on. But evals have rarely been top of mind for the CEO, and that's the issue. When we would talk to people, it was always: yes, we understand that we need this, but security is going to be a bigger issue, or latency is going to be a bigger issue, or some of these more traditional technology problems are going to be the issue. It wasn't the machine learning model itself. So, you know, I do a lot of due diligence for venture capitalists as well now, as a multi-time founder and so on, and basically every pitch deck in this space from the mid-2010s onward had a slide that said: this is the year that a CEO is going to get fired because of an ML-related screw-up. And to my knowledge, it just still hasn't happened. There have been some forward-thinking leaders. So I have here the annual report from Jamie Dimon, head of JPMC. This came out in April of 2022, so it covered JPMC basically up through their fiscal year 2021, and he's talking about the spend they have going into AI. But if you squint and look at these numbers, they're still comically small. One of these, in the consumer world: he makes the statement that from 2017 up through the end of 2021, they had put $100 million into AI/ML. That's not a huge amount of money for JPMC. So keep sitting back in time, pre-ChatGPT and so on, but let's now flip to the macroeconomic side of things.
So the economy started getting pretty dicey right up until about when ChatGPT launched. That was the end of November for ChatGPT; a lot of enterprise budgets are set in October and November for the following year, and toward the middle to end of 2022 there were very deep fears about an impending recession. That didn't end up happening, but those fears basically made it so that most enterprises either froze or shrank their IT budgets for 2023. So what that meant is, were it not for a particular tailwind called ChatGPT, we probably wouldn't have seen a lot of new technology being developed in the IT departments of these large enterprises in 2023. So this is sort of a bittersweet mix. I guess I just talked about a lot of this; it was bittersweet in the sense that it did set us up to put the eye of Sauron on one specific small pet project where a small amount of budget could be applied, and that small pet project came from our friends at OpenAI launching ChatGPT right before the holiday break. And I'm convinced, just from talking to some C-suite folks across enterprises here in the US, that basically what happened is that CEOs and CFOs, people less technical in the computer science sense of the word, were now able to interact swiftly and easily with a single UI on the internet and get wowed by AI. I was also wowed by AI. In fact, Arthur was hosting, with our Series A lead Index Ventures, an event at NeurIPS, the major machine learning conference, the night that ChatGPT came out, and it basically took over: a bunch of nerds in a room saying, wow, this is very, very impressive. And that happened to everybody else. Flip back to November 30th: you can make Eminem rap like Taylor Swift; oh hey, isn't this funny; oh hey, my mom did this joke about, you know, I want to have poetry but written in the words of a rapper or whatever. This started to happen over and over again. And what that meant is that the discretionary budget that did exist was unlocked specifically for the CEO's pet projects, which were called GenAI. So in 2023 we still had austerity forced on us because of those frozen, or even reduced, budgets, but the thing that was happening is that the only money going around that could be allocated was going specifically to GenAI. And so everybody focused on this; it's a cool technology; everybody focused on this, and the science projects started to float around within the enterprise. In 2024 we started to see GenAI-based applications going into production: chat applications are the obvious one, internal chat applications, internal hiring tools, things like that. And that's because basically the only budget going into new projects in 2023 was going to GenAI, and now, as things go into production, primarily internally, in 2024, we have the folks who tend to dress in business suits asking questions around ROI, governance, risk, compliance, brand optics, and that kind of thing. So now we're starting to get a little bit closer to people outside of the machine learning, data science, and computer science world, outside the CIO's office, caring about evaluation: if I need to have a quantitative estimate of risk, then I need to do evaluation. In 2025, we've seen scale-ups: look at the revenue numbers for any frontier model provider.
Look at the revenue numbers for a lot of us in this room; everything is really going up right now, and that's because of usage, which is great. That's also a function of the C-suite becoming comfortable talking about, and putting large, real budget into, AI. So 2023's IT budgets, 2024's IT budgets, set for the following year: those weren't frozen; those were earmarked specifically for AI applications and things along those lines. So we had science projects in 2023 go into production in 2024, and they're now shipping and scaling in 2025. And also, frankly, the technology has just gotten really amazing. All of us in this room are technical, but even I'm amazed every time a new model is dropped. The community has also really gotten behind this; open source has gotten behind this; venture capital and big tech are writing huge checks into frontier model providers and so on. So everything is coming together in 2025. And also remember that third vertex: we have machine learning systems now moving toward autonomy. Okay, so 2025, we all hear it: it's the year of the agent. No longer is a question mark needed here; it's clearly the year of the agent. Now, a quick 30-second definition of an agent. As defined from the late '50s onward, agents need to perceive the environment, they need to learn, they need to abstract and generalize, and, unlike traditional machine learning, they're going to reason and act. We have reasoning models out there; we have systems that are acting in virtual environments or cyber-physical environments. And what that means is you have a lot of complexity introduced into the system, and a lot of risk introduced into the system, and that's great for those of us in evals. So at the end of the day, the thing that really matters, like I mentioned, when you're selling any product, not just our products in this room, into an enterprise or SMB, is being able to attach your product to some sort of downstream business KPI: risk mitigation, revenue gains, losing less money, whatever. With evaluations you're able to do this, because you're quantifying things, and they're finally a first-class discussion point, which is fantastic. So we have the CEO, who, from November 30th, 2022 onward, now at least knows what the tech is. I'm not saying they know what attention means, but I am saying that they know some of the capabilities of these generative models, they know some of the capabilities of agentic and multi-agent systems, and they're comfortable talking to experts about it, allocating budget for it, and talking to their board of directors and shareholders about it as well. We have the CFO, who, because the CEO cares about this stuff, also obviously needs to care about it, but she's going to care about the impact to the bottom line; that's what a CFO does. They're doing allocation, they're doing budget planning, and they're going to need to write some numbers into an Excel spreadsheet, and those numbers have to come, in part, from quantitative evaluation. CISOs now see this as a huge security risk and opportunity.
And for those who haven't sold into enterprises before, CISOs are typically willing to write checks, smaller checks, especially for startups, more quickly and with less overhead than a CIO would. CIOs tend to have a bigger org, for one, and also tend to have a lot more process; CISOs tend to be a little more scrappy and willing to try out tools. And this actually happened before the agentic revolution; this happened when GenAI started coming up, when CISOs were saying, hey, hallucination detection, prompt injection, things like that: that's firmly in the security space, which is why you've seen a lot of guardrail products, including Arthur's and the ones that come from our competitors, going into the CISO's office and being able to sign a lot of deals. The CIO, poor CIO, has been on board the entire time, and they're just trying to keep the ship sailing. So that's great; they're still on board; they want to keep their job. And the CTOs: they always want standards. They need to make these decisions based on numbers, and those numbers are coming in part from standards, things like OTel and standards like that. And that's great. So I've listed a lot of the C-suite here. I haven't talked about chief strategy officers or others, but the CEO, the CFO, the CTO, the CIO, and the CISO control a lot of budget, and now they are all willing to talk about, and are all aligned on, the need to understand evaluation of AI. Great. So, a quick reminder, and you should hold me truthful here: all the evaluation companies, observability companies, monitoring companies, security companies, whatever you want to call them, have shifted into agentic and multi-agent systems monitoring. The point that you should monitor the whole system, and not just the one model being used by one particular agent, is well taken and well understood in industry and in government, and I think it's great to have that as a top-line discussion point. But, keep me honest here: there was an article that came out in mid-April in The Information showing some leaked revenue numbers for a variety of startups in the evaluation space, Weights & Biases, Galileo, Braintrust, and so on, but they were lagged by about six or eight months, and just from talking to friends in the space, those numbers are no longer representative of what folks in this area are making. So let's see what The Information leaks in early 2026, and maybe we'll see something like "revenue no longer lags at AI evaluation startups," because this is the year of the agent. One last mention, which is not firmly in the evaluation space: if you're playing around with different multi-agent system frameworks, check out any-agent. We implement a lot of them for you under a unified interface, so for people in this room, that might be a fun project to play around with. So thank you. I'll have three minutes for questions. [Applause] I go first. Okay. Thanks. Thanks, great presentation. I just have a question about the enterprise value: most of the evaluations in GenAI require domain expertise. For example, if you're building a multi-agent system to do financial investment analysis, to do something like a discounted cash flow spreadsheet, is the agent doing it correctly or not? I'm just trying to understand how that problem is getting solved.
Because most of them are, you know, coming from an ML background where it was structured data, but this is a lot of unstructured data, and you have to measure the quality, like, is it acting like a human, right? Yeah, it's a great question. Actually, I have a paper in Nature Machine Intelligence talking about some of the problems that can come up when you do persona-based agents, where I say, act like a farmer in Ohio in your mid-40s, and so on. There's value in that, but you can't do it perfectly. And my gut reaction to this is: there was a leaked spreadsheet from Mercor, which is a company that can hire in experts, showing, you know, $50 an hour, $100 an hour, $200 an hour for experts to be hired by, for example, Google, or Meta, or large banks, to do kind of what you're saying, which is to have an expert sitting alongside the multi-agent system. Basically sitting next to the intern who's going to come in and take your job in a year or two, or change your job; maybe "take" is not the right word. But they're basically doing that expensive human validation in lockstep with the multi-agent system, which, if you're doing discounted cash flow analysis, the kind of thing where (a) you can either make or lose a lot of money and (b) you lose your job if you get it wrong, is worth spending that large amount of money on. It's a question for everyone, though: what does that look like in five years, once that data is incorporated into the systems themselves? That can be a moat as well. When you talk to anyone in the eval space, it's the data set creation and the environment creation that matter more than anything, which is the point you're getting at as well. So if I spend a bunch of money to have a very good, competitive DCF environment or whatever, that can help me versus my competitors, so there is capex going into that. Yeah, thanks for the presentation. Do you have a rough timeline on when you think evals will primarily be driven by GenAI, or even LLMs? Yeah, the LLM-as-a-judge paradigm, and I think this was talked about in the previous talk as well: we see it getting used in practice, but there are issues with it. We have a paper in ICLR from last month talking about some of the biases that LLMs as judges have versus humans, on things like conciseness or helpfulness and some of those Anthropic words. But the long and short of it is that it solves the data set creation problem in some sense, in that you can give a persona to an LLM and it is like a poor man's version of a human doing the judging, and so we see a lot of people using that as a crutch right now. But, toward the last question that was asked, you do need to make sure you're validating this and making sure you're not going off in some weird, biased direction. But we do see them in practice. That's time. Oh, happy to chat offline. Yep. Thank you so much, John. He is available as well; feel free to reach out to him. Up next, we have, let's see... let's use Claude 4 Sonnet. That should be good, right? All right. So, let's see. Yeah, I actually asked the AI to go ahead and create the introduction for everybody, so make sure to tell us if anything is hallucinated.
Please welcome Jeff Huber, CEO and co-founder of Chroma, the widely adopted open-source vector database that's been making waves across the tech media from TechCrunch to Forbes, exclamation point. Joining him is Jason Liu, principal at 567 Studio and a machine learning engineer who wears many hats as a consultant and educator, ready to show you how to turn your messy conversation data into actionable insights and killer evals. There you go. Pretty good. Pretty good. All right. Can you hear me? Okay. All right. Welcome everybody. I'm Jeff Huber, the co-founder and CEO of Chroma, and I'm joined by Jason. We're going to do a two-parter here. We're really going to pack in the content. It's the last session of the day, and so we thought we'd give you a lot. Everything in this presentation today is open source and the code is available, so we're also not selling you any tools, and there'll be QR codes and stuff throughout to grab the code. So let's talk about how to look at your data. All of you are AI practitioners. You're all building stuff, and these questions probably resonate quite deeply with you: what chunking strategy should I use? Is my embedding model the best embedding model for my data? And more. And our contention is that you can really only manage what you measure. I think Peter Drucker originally coined that, so I can't take too much credit, but it certainly is still true today. So we have a very simple hypothesis here, which is you should look at your data. The goal is to say "look at your data" I think at least 15 times in this presentation. So that's two. And great measurement ultimately is what makes systematic improvement easy. And it really can be easy; it doesn't have to be super complicated. So I'm going to talk about part one, how to look at your inputs, and then Jason is going to talk about part two, how to look at your outputs. So let's get into it. All right, looking at your inputs. How do you know whether or not your retrieval system is good? And how do you know how to make it better? There are a few options. There is guess and cross your fingers; that's certainly one option. Another option is to use an LLM as a judge: you're using, you know, some of these frameworks where you're checking factuality and other metrics like this, and they cost $600 and take three hours to run. If that is your preference, you certainly can do that. You can use public benchmarks, so you can look at things like MTEB to figure out, oh, which embedding model is the best on English. That's another option. But our contention is you should use fast evals, and I will tell you exactly what fast evals are. All right, so what is a fast eval? A fast eval is simply a set of query and document pairs. The first part is: if this query is put in, this document should come out. A set of those is called a golden data set, and the way that you measure your system is you put all the queries in and you see, do those documents come out? And obviously you can retrieve five or retrieve 10 or retrieve 20; it kind of depends on your application. It's very fast and very inexpensive to run, and this is very important because it enables you to run a lot of experiments quickly and cheaply. You know, I'm sure all of you know that experimentation time and your energy to do experimentation go down significantly when you have to click go and then come back six hours later.
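To make that concrete, here is a minimal sketch of what a fast eval could look like against a Chroma collection: a hypothetical golden set of query-to-document-id pairs scored with recall@k. The collection contents, ids, and golden pairs are illustrative assumptions, not the actual code from the talk.

```python
# Minimal sketch of a "fast eval": a golden set of (query, expected document id)
# pairs, scored by recall@k against a Chroma collection.
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
    ids=["doc-pergola", "doc-chunking"],
    documents=[
        "A pergola in a garden provides shade and supports climbing plants.",
        "Chunking splits long documents into smaller retrievable passages.",
    ],
)

# Golden set: if this query goes in, this document should come out.
golden_set = [
    ("what is a pergola used for in a garden", "doc-pergola"),
    ("how should I split long documents for retrieval", "doc-chunking"),
]

def recall_at_k(golden, k: int = 10) -> float:
    hits = 0
    for query, expected_id in golden:
        results = collection.query(
            query_texts=[query],
            n_results=min(k, collection.count()),
        )
        if expected_id in results["ids"][0]:
            hits += 1
    return hits / len(golden)

print(f"recall@10 = {recall_at_k(golden_set):.2f}")
```

Because a run like this costs pennies, it can be repeated for every chunking strategy or embedding model you want to compare.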
All of these metrics should run extremely quickly, for pennies. So maybe you don't have queries yet: you have your documents, you have your chunks, you have your stuff in your retrieval system, but you don't have queries yet. That's okay. We found that you can actually use an LLM to write questions, and write good questions. You know, just doing the naive thing, "Hey LLM, write me a question for this document," is not a great strategy. However, we found that you can actually teach LLMs how to write queries. These slides are getting a little bit cropped, I'm not sure why, but we'll make the most of it. To give you an example: this is actually an example from one of the MTEB golden data sets around embedding models, the benchmark data sets. This also points to the fact that many of these benchmark data sets are overly clean, right? "What is a pergola used for in a garden?" And then the beginning of the target document is "A pergola in a garden..." Real-world data is never this clean. So what we did in this report, the link is in a few slides, is a huge deep dive into how we can actually generate queries that are representative of real-world queries. It's too easy to trick yourself into thinking that your system's working really well with synthetic queries that are overly specific to your data. And so what these graphs show is that we're actually able to semantically align the specificity of synthetically generated queries with real queries that users might ask of your system. So what this enables is, you know, if a new cool sexy embedding model comes out and it's doing really well on the MTEB score and everybody on Twitter is talking about it, instead of just going into your code and changing it and guessing and checking and hoping that it's going to work, you now can empirically say whether or not it's better for your data. The example here is quite contrived and simple, but you can actually look at the actual success rate. Okay, great: these are the queries that I care about, do I get back more of the right documents than I did before? If so, maybe you should consider switching. Now, of course, you need to re-embed your data. That service could be more expensive, it could be slower, the API for that service could be flaky. There are a lot of considerations, obviously, when making good engineering decisions. But clearly the north star of success rate, how many of the right documents do I get back for my queries, is super fast and super useful and makes improving your system much more systematic and deterministic. All right. So we actually worked with Weights & Biases, looking at their chatbot, to ground a lot of this work. So what you see here, for the Weights & Biases chatbot, is four different embedding models and the recall at 10 across those four embedding models. And I'll point out that blue is ground truth: these are actual queries that were logged in Weave and then sent over. And then there's generated: these are the ones that are synthetically generated. What we want to see is a few things. We want to see that those are pretty close, and we want to see that they stay in the same order of accuracy, right? We don't want to see any big flips between ground truth and generated, and we were really happy to see that that's what we found.
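As a rough illustration of that alignment idea (not the actual pipeline from the report), one way to do it is to few-shot the LLM with real logged queries so the generated queries match their style and specificity. The example queries, prompt wording, and model name below are all assumptions.

```python
# Rough sketch: few-shot the LLM on real user queries so synthetic queries
# match their style and specificity, rather than being overly clean.
from openai import OpenAI

client = OpenAI()

REAL_QUERY_EXAMPLES = [
    "wandb sweep not logging val loss, how do I fix config?",
    "how to resume a crashed run from a checkpoint",
]

def generate_query(document: str) -> str:
    examples = "\n".join(f"- {q}" for q in REAL_QUERY_EXAMPLES)
    prompt = (
        "Here are real user queries from our system:\n"
        f"{examples}\n\n"
        "Write ONE new query, in the same style and level of specificity, "
        "that the following document should answer:\n\n"
        f"{document}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Each (generated query, source document id) pair becomes a row in the golden
# set scored by the recall@k sketch shown earlier.
```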
Now there are a few fun findings here, which of course are going to get cropped out, but that's okay. Number one, the original embedding model used for this application was actually text-embedding-3-small. This actually performed the worst out of all the embedding models that we evaluated, just for this case, and so probably wasn't the best choice. The second one was that, if you look at MTEB, Jina embeddings v3 does very well on English, like way better than anything else, but for this application it didn't actually perform that well. It was actually the Voyage 3 large model which performed the best, and that was empirically determined by actually running this fast eval and looking at your data. That's number three. All right. So, if you'd like access to the full report, you can scan this QR code. It's at research.trychroma.com. There's also an adjoining video, which is kind of screenshotted here, which goes into much more detail. There are full notebooks with all the code. It's all open source. You can run it on your own data. And hopefully this is helpful for you all, thinking about how you can systematically and deterministically improve your retrieval systems. And with that, I'll hand it over to Jason. Thank you. So, you know, if you're working with some kind of system, there are always going to be the inputs that we look at. And we talked about things like retrieval, how do the embeddings work. But ultimately we also have to look at the outputs, right? And the outputs of many systems might be the outputs of a conversation that has happened, an agent execution that has happened. And the idea is that if you can look at these outputs, maybe we can do some kind of analysis that figures out, you know, what kind of product should we build, what kind of portfolio of tools should we develop for our agents, and so forth. And so the idea is, if you have a bunch of queries that users are putting in, or even a couple of hundred conversations, it's pretty good to just look at everything manually, right? Think very carefully about each interaction, and then only use these models when they make sense. And oftentimes when I say this, people say, you know what, why don't we just put everything into o3? Generally, only use the language models if you think you're not smarter than the language model. Then, when you have a lot of users and an actually good product, you might get thousands of queries or tens of thousands of conversations, and now you run into an issue where there's too much volume to manually review. There's too much detail in the conversations, and you're not really going to be the expert that can figure out what is useful and what is good. And ultimately, with these long conversations with tool calls and chains and reasoning steps, these outputs are now really hard to scan and really hard to understand. But there's still a lot of value in these conversations, right? If you've used a chatbot, whether it's in Cursor or any kind of Claude Code system, oftentimes you do say things like, "Try again. This is not really what I meant. You know, be less lazy next time." It turns out a lot of the feedback you give is in those conversations, right? We could build things like feedback widgets or thumbs up or thumbs down, but a lot of the information already exists in those conversations.
And the frustration and the retry patterns that exist can be extracted from those conversations. The idea is that the data really already exists in these conversations. Think of a simple example from a different industry; we can imagine the analogy of marketing, right? Maybe we run our evals and the number is 0.5. I don't really know what that means. Factuality is 0.6. I don't know if that's good or bad. Is 0.5 the average? Who knows? But imagine we run a marketing campaign and our, you know, ad metric or our KPI is 0.5. There's not much we can do. But if we realize that 80% of our users are under 35 and 20% are over, and we realize that the younger audience performs well and the older audience performs poorly, what we've done is we've just drawn a line in the sand on who our users are. And now we can make a decision: do we want to double down on marketing to a younger audience, or do we want to figure out why we aren't successfully marketing to the older population, right? Do I find more podcasts to market on? Should I run a Super Bowl ad? Now, just by drawing a line in the sand and deciding which segment to target, we can make decisions on what to improve. Whereas "just make the ads better" is a very generic sentiment that people can have. And so one of the best ways of doing that is effectively just extracting some kind of data out of these conversations in some structured way and doing very traditional data analysis. And so here we have some kind of object that says I want to extract a summary of what has happened, maybe some tools that were used, maybe the errors that we've noticed in the conversations that happened, maybe some metric for satisfaction, maybe some metric for frustration. The idea is that we can build this portfolio of metadata that we can extract, and then what we can do is embed this, find clusters, identify segments, and then start testing our hypotheses. And so what we might want to do is build this extraction, put it into an LLM, get this data back out, and just start doing very traditional data analysis, no different from any kind of product engineer or any kind of data scientist. And this tends to work quite well. You know, if you look at some of the things that Anthropic's Clio did, they basically found that code use was 40x more represented among Claude users than in, you know, GDP value creation. They go, okay, maybe code is a good avenue. And obviously that's not really the whole story, but the idea is that by understanding how your users use a product, you can figure out where to invest your time. And so this is why we built a library called Kura that allows us to summarize conversations, cluster them, build hierarchies of these clusters, and ultimately compare our evals and KPIs across those clusters. So now, you know, if factuality is 0.6, that's really hard to act on. But if it turns out that factuality is really low for queries that require time filters, right, or factuality is really high when queries revolve around, you know, contract search, now we know something's happening in one area and something's happening in another, and then we can make a decision on what to do and how to invest our time. And the pipeline is pretty simple. We have models to do summarization, models to do clustering, and models that do this aggregation step. And so what you might want to do is just load in some conversations.
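Before the worked example that follows, here is a minimal sketch of that kind of extraction object, assuming a Pydantic schema and an OpenAI-style chat call. The field names and prompt are illustrative assumptions, not Kura's actual models.

```python
# Minimal sketch of extracting structured metadata from one conversation.
from openai import OpenAI
from pydantic import BaseModel

class ConversationSummary(BaseModel):
    summary: str
    tools_used: list[str]
    errors: list[str]
    satisfaction: int  # e.g. 1-5, illustrative scale
    frustration: int   # e.g. 1-5, illustrative scale

client = OpenAI()

def summarize(conversation: str) -> ConversationSummary:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any JSON-capable chat model
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize this conversation as JSON with the fields: "
                    "summary, tools_used, errors, satisfaction, frustration."
                ),
            },
            {"role": "user", "content": conversation},
        ],
        response_format={"type": "json_object"},
    )
    return ConversationSummary.model_validate_json(
        response.choices[0].message.content
    )

# Each summary can then be embedded, clustered, and compared against your KPIs.
```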
And here we've made a fake data set, fake conversations from Gemini. And the idea is that first we can extract some kind of summary model where there are topics that we discuss, frustrations, errors, etc. We can then cluster them to find cohesive groups. And here we can find that maybe, you know, some of the conversations are around data visualization, SEO content requests, and authentication errors, and now we get some idea of how people are using the software. And then as we group them together, we realize, okay, really there are some themes around technical support. Does the agent have tools that can do this as well? Do we have tools to debug these database issues? Do we have tools to debug authentication? Do we have tools to do data visualization? That's something that's going to be very useful. And at the end of this pipeline, we're presented with these printouts of clusters, right? We know what the tools are, how the chatbot is being used at a higher level, you know, SEO, content, data analysis, and at a lower level, you know, maybe it's blog posts and marketing. And just by looking at this, we might have some hypotheses as to what kind of tools we should build, how we should develop even our marketing, or how we can think about changing our prompts. We can do a ton of these kinds of things. And this is because the ultimate goal is to understand what to do next, right? You do the segmentation to figure out what kind of new hypotheses you can have, and then you can make these targeted investments within certain segments. If it turns out that, you know, 80% of the conversations that I'm having with the chatbot are around SEO optimization, maybe I should have some integrations that do that. Maybe I should reevaluate the prompts or have other workflows to make that use case more powerful. And again, the goal really is to make a portfolio of tools, of metadata filters, of data sources, that allows the agent to do its job. And oftentimes the solution isn't really making the AI better; it's really just providing the right infrastructure. Right? A lot of times, if you find that a lot of queries use time filters and you just didn't add a time filter, adding one can probably improve your eval by quite a bit. We had situations where we wanted to figure out if contracts were signed, and by extracting just one more step in the OCR process, now we can do these large-scale filters and figure out, you know, what data exists. And generally the practice of improving your applications is pretty straightforward, right? We all know to define evals, but not everyone that I work with has really been thinking about something like finding clusters and comparing KPIs across clusters. But once you do, then you can start making decisions on what to build, what to fix, and what to ignore. Maybe you have a two-by-two of quadrants, right? Maybe you have low usage and high usage, and you have high-performing evals and low-performing evals. If a large portion of your population is using tools that you are bad at, that is clearly the thing you have to fix. But if a large proportion of people are using tools that you're good at, that's totally fine. If a small proportion of people do something that you're good at, maybe there are some product changes you need to make. Maybe it's about educating the user. Maybe it's adding some, you know, pre-filled or automated questions to show them that we have these kinds of capabilities.
And if there are things that nobody does, but when we do them, they're bad, maybe that's a one-line change in the prompt that says, "Sorry, I can't help you. Go talk to your manager." Right? These are now decisions that we can make just by looking at what proportion of our conversations are of a certain category and whether or not we do well in that category. And as you understand this, then you can go out and build these classifiers to identify these specific intents. Maybe you build routers, maybe you build more tools. And then you start doing things like monitoring and having the ability to do these group-bys, right? So now you have different categories of query types over time, and you can just see what the performance looks like. Where 0.5 doesn't really mean anything on its own, whether or not a metric changes over time across a certain category can tell you a lot about how your product is being used. By doing this, we figured out that some customers, when we onboard them, use our applications very differently than our historical customers, and we can then make other investments in how to improve these systems. And ultimately, the goal is to create a data-driven way of defining the product roadmap. Oftentimes it is research that leads to better products now, rather than products justifying some research that we don't know is possible. And again, the real marker of progress is your ability to form high-quality hypotheses and your ability to test a lot of those hypotheses. If you segment, you can make clearer hypotheses. If you use faster evals, you can run more experiments. And by having this continuous feedback through monitoring, this is how you actually build a product, right? This is regardless of it being an AI product; this is just how you build a product. And so if you look at the takeaways: when you think about measuring the inputs, you really want to think about not relying on public benchmarks, building evals on your data, and focusing first on retrieval, because that is the one thing an LLM improvement won't fix, right? If the retrieval is bad, the LLM will still get better over time, but you need to earn the right to tinker with the LLM by having good retrieval. And then lastly, if you don't have any customers or any users, you can start thinking about synthetic data as a way of augmenting that. And once you have users, look at your data as well. Look at the outputs, right? Extract structure from these conversations. Understand, you know, how many conversations are happening, how often tools are being misused, what the errors are, and how people are frustrated. And by doing that, you can do this population-level data analysis, find these similar clusters, and have some kind of impact-weighted understanding of what the tools are. Right? It's one thing to say, you know, maybe we should build more tools for data visualization. It's another thing to say, hey boss, 40% of our conversations are around data visualization and the code engine, the code execution, can't really do that well; maybe we should build two more tools for plotting and then see if that's worth it. And you can justify that because we know that 40% of the population is using data visualization and we only do it well, you know, maybe 10% of the time, right? This is impact-weighted. And ultimately, as you compare these KPIs across these clusters, you can just make better decisions across your entire product development process.
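As a small, made-up illustration of that usage-versus-performance framing, the sketch below groups per-conversation eval scores by category, computes each category's usage share and average score, and buckets it into a rough decision. The categories, scores, and thresholds are invented for the example, not data from the talk.

```python
# Sketch of the usage-vs-performance quadrant: per category, compute usage
# share and average eval score, then bucket into a rough decision.
import pandas as pd

df = pd.DataFrame({
    "category": ["data_viz", "data_viz", "seo", "auth_errors", "seo", "data_viz"],
    "score":    [0.40, 0.50, 0.90, 0.30, 0.85, 0.45],
})

per_cat = df.groupby("category")["score"].agg(["mean", "count"])
per_cat["usage_share"] = per_cat["count"] / per_cat["count"].sum()

def decide(row, usage_cut=0.3, score_cut=0.7):
    high_usage = row["usage_share"] >= usage_cut
    good = row["mean"] >= score_cut
    if high_usage and not good:
        return "fix this first"
    if high_usage and good:
        return "fine, keep monitoring"
    if good:
        return "low usage but good: educate users / suggest it"
    return "consider declining gracefully in the prompt"

per_cat["decision"] = per_cat.apply(decide, axis=1)
print(per_cat[["usage_share", "mean", "decision"]])
```

The same group-by, run per week instead of overall, gives the over-time monitoring view mentioned above: a topline 0.5 means little, but a category whose score drifts week over week tells you something.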
So again, start small, look for structure, understand that structure, and start comparing your KPIs. And once you can do that, you can make decisions on what to fix, what to build, and what to ignore. If you want to find more resources, feel free to check out these QR codes. The first one is to Chroma, to understand a little bit more about their research, and the second one is actually a set of notebooks that we've built out that go through this process. So we load the Weights & Biases conversations, we do this cluster analysis, and we show you how we can use that to make better product decisions. There are three Jupyter notebooks in that repo. Check them out on your own time, and thank you for listening. We do have time for like one quick question, and of course outside as well. So thank you. If anybody wants to grab the mic, there and over there. Yep. What's the spicy take today? It's not the KPI stuff, by the way. That's not the spicy take. I think more agent businesses should try to price their services on the work done rather than the tokens used. Yeah. Price on success, price on value. Very unrelated to this talk, but all right. Well, if there are no other questions, definitely meet up with them for any other questions you may have. But yeah, thank you so much for coming to the eval track. Hopefully you all are sold and you're going to look into evals and into your data as well. So thank you so much. Nailed it. Hi. Yeah. Is there a reason the technique that I talked about, using fast evals, you know, query in and document out, could it work for images as well, right? So yeah, I think it would work sort of just the same. You just need a labeled pair, right, of like this input should yield this output, and it could be multimodal or anything. What's the relation to Chroma, for the talk? That's okay. [Music] Yeah, Chroma is a retrieval system. So Chroma does vector search, full-text search, metadata search, but this talk is in some ways separate from what Chroma does commercially. This is just an open-source set of Jupyter notebooks that show you how to set up and run benchmarks. I guess once you figure out a certain angle on the world. Yep. I'm not touching anything in case they come over in five minutes and we have a pop. I don't trust it. Right. So don't, I'm just saying to you, don't start taking batteries. Yeah. Yeah. I think that's also why they said don't do anything. They just wanted us to hang on and see what happens. Well, because we need to do this real quick. I think, can you go grab the Q&A mics? Bring them back here. So, check check. I don't even need for this to be open to send it to record. Okay. I wonder if I just cut it, mute it in the end.