[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)
Channel: aiDotEngineer
Published at: 2025-07-29
YouTube video id: jxrGodnopHo
Source: https://www.youtube.com/watch?v=jxrGodnopHo
We're a bit early, but I guess everyone's here, so we'll get started. Maybe we'll spend the first few minutes getting people oriented, and we're actually quite curious what brings people here. Why are you here? What about evals? Maybe by a show of hands: have you done evals before? Okay. And have we struggled with them? Was it hard? For the people who have not done evals before, what brings you here? Anyone want to volunteer?

Audience: Trying to understand how to approach it.

I see, so trying to understand how to approach evals. For the people who have tried before, what have you struggled with? Go ahead.

Audience: The struggle is to define the metrics. First, it's hard to define what the correct answer is, because it can look many different ways. Second, these are open-ended business questions, so how do you pragmatically improve? I design, I review, I do it again, and that still feels very manual and labor intensive to me.

Right, evals can be labor intensive. They can be hard to set up and painful to get started with. Any other thoughts?

Audience: Machine learning is dependent on training data. Agents are dependent on evals to provide feedback from the world, to let them know whether or not they're getting it right. And it's clearly a sophisticated and challenging problem. To do science you have to have experiments, and this is like that: how do you experiment? I'm just saying it's super important.

Right, we're all getting pulled into the world of data science and machine learning in some ways, even if we don't want to be. Go ahead.

Audience: I was part of a quality assurance team, and all these years people weren't happy that testing took, I don't know, 30% of feature development time. Now with AI, validation takes more like 80% of feature development time, and people are even less happy about it. We're trying to find shortcuts, but it seems that's not always possible, because each case is so unique that you can't reuse previous work; you have to create many things from scratch at different levels, and because of the probabilistic nature of evaluation, every time you're looking for an optimized way to do it.

Right. So basically it's a custom, subjective eval for your specific use case; you can't just copy-paste something that exists. It's much like tests: we used to do testing, some people did and some didn't, but in the world of AI everyone has to do evals, everyone is spending a lot of time doing evals, and the best practices don't exist yet. Go ahead, please.

Audience: I was just going to add that I'm looking forward to understanding how synthetic data plays a role here, especially for novel types of problems where you don't have access to real-world data to use as a baseline. How much can that impact the eval system?

Right.
There are two aspects of evals: there's the synthetic data and there are the metrics. The metrics do the checking; the synthetic data is what you test against. So what does synthetic data do, and how do you generate good synthetic data to test and evaluate with? Go ahead.

Audience: Mine goes along with the first comment. We've been trying to use LLMs to evaluate, but it's not accurate, so we have to do this manually, and I'm hoping to learn what the right approach is.

Right. Automatic evals are something everybody desires, because you don't want human beings to sit and do it, but the options are somewhat limited. Typically you can lean on AI to do human tasks, but for some reason evals tend to be hard for LLMs to do.

So, we're not going to solve all your eval problems here. Honestly, this is a very hard research area that we'll probably keep working on over the next few months. But David and I were at Google for over a decade, and we have been doing evals for a long time. We dealt with stochastic applications: we were both building Search, which was similarly stochastic, and most of the core effort at Google was to do a good job of evaluating where we are and then improving. Some of the techniques we learned at Google we're bringing in here. We built a product around it and a bunch of technology around it, but really the biggest takeaway I want people to have is the methodology and the things we've learned, in whatever manifestation they take. So I'm hoping you get to play with some of our stuff, but also pick up some good ideas. David, do you want to add?

David: Yeah, one thing I would add is that at Google we used to call this quality, and evals were part of it, and there was this constant cycle of benchmarking: exhausting a benchmark and then moving on to the next one. Whenever we work with clients right now, we take a similar approach. As my co-presenter mentioned, so much of eval is just good methodology: setting up benchmarks, figuring out the metrics that work, calibrating metrics with humans, calibrating metrics with user data. What I'm trying to say is that it can get arbitrarily complex. Evals are not something where there's one way to do it and then you're done; it's just part of how you do development, which everybody has been struggling with because we've all been trying to learn what it means to develop on top of this new stack. In today's session we're hoping to bring you some of these ideas, methodologies that make sense, like how to think about your metrics. Maybe people think of four metrics; I would challenge you: Search had around 300 metrics. So start to think about how you can expand the scope of how you're doing this work. Part of what we're doing is building technologies that make these practices easy to adopt: how to create benchmarks that are interesting, how to make them harder. We can talk about this as we go.

This workshop will be mostly hands-on. We'll show a few slides and then we'll dive into the code, because I think the best way to learn these things is just to do them. As we go we'd love to pass it around, because people are at different parts of the journey; for some people it's "hey, I don't have metrics and I'll just develop my own metrics."
We worked with clients that have metrics but want to, for example, correlate them with user behavior; they have a lot of thumbs up / thumbs down data. So this goes all the way into a feedback loop. You can see that evals are not just one thing related to testing; they go all the way into your online system and your feedback loops. Hopefully a lot of the mental models today will help you gauge that.

Just two logistics things. First, the Slack channel is "workshops", so you can join there and we'll be posting things; we can have discussions and continue the conversation even after the workshop. Second, we have a document with all the steps of the workshop and all the places where you can get the code and the sheet. Go to withpi.ai/workshop and you'll land in a Google Doc with a lot of links; the Google Doc also has a link to the slide deck we're presenting, and we'll keep it online after the workshop so you can reference it.

All right, we'll get started to keep you on time and maximize coding time. There are four of us; a couple of other people will just walk around, so if you get stuck on any step or are having trouble with anything, just raise your hand and we'll come over. I'll give you maybe a five-minute blurb, show a quick demo, and then we'll get started. Can you still hear me?

I think we went through this, but basically most people start with vibe testing: try it out, see how it works, change prompts. And honestly, for many applications you can go pretty far with just that; the AIs have become quite good. Agent systems, as you were saying, are more complex because of the multiple steps, and things fail more often, but vibe testing gets you pretty far. Human evals are expensive; typically most companies don't bother setting up human evals, though some do, and some have subject matter experts. Code-based evals are where I think the majority of people are spending time: they're writing some sort of code to test whatever verifiable things they can. And people are moving into natural-language, LLM-as-a-judge type approaches, and this is not a task that typical decoder models, the generative AI models, are very good at. We'll get into some of it, but you kind of have to fight against the fact that these models are designed to be creative, and that's not what you want from a judge.

The scoring system is this idea that you're not trying to build a comprehensive set of metrics from the beginning. You start with some correlated signals, maybe five or ten signals that you know for a fact are correlated with goodness. They're very simple signals that you can easily derive, and over time you build upon them as you see problems, as you debug. You test whether your application works well or not, you learn from it, you build, you measure again. That makes it much more of a feedback-loop type process. That's the methodology we're trying to introduce.
David: One thing I will add to that is that there's no right or wrong here; vibe testing actually gets you a very long way. You should absolutely start with vibe testing. A lot of people use tracing and just monitor traces. But as your system scales, you do want to get a little more sophisticated, so you start to layer these things in, in order of complexity. That's a lot of how we talk about quality generally: there are complex techniques, and for some amount of investment they give some amount of return, so it's a bit of an ROI question. Some things are very cheap to do, so you do them, and as your system scales they become impractical and you layer in more techniques. So it's not that any of these things are good or bad; you need all of them. Think of them as tools in your toolbox. Eventually, though, what you really want is a scoring system. How it manifests for you, you should be the judge of, but the one thing to take away is: keep layering in tools with increasing levels of sophistication, and you'll probably end up with a system that looks like the one we'll develop today.

And as David was saying, evals are such an investment that you might ask, "Why am I spending so much time on evals? Why am I not building?" One of the things that was core to Google, and that I think this industry is going to adopt over time, is that evals are actually where you're going to spend most of your time, because that's where the domain knowledge is going to live; everything else will just work off of those evals. For example, if you have really good evals, you don't have to hand-write prompts: you can find problems and use meta-prompts or optimizers like DSPy to improve your prompts for you. You can filter synthetic data and use it for fine-tuning, if you're interested in fine-tuning or reinforcement learning. But you can also use these techniques online. One of the things you're going to try today is a very simple but very effective technique that almost all the big labs use, and Google used it extensively: you crank up the temperature and generate a bunch of responses instead of just one. You can think of it as online reinforcement learning: generate four or five responses, score those responses online, and see which one is best, and you get a pretty decent lift just by doing this, without making any changes to your prompts or models. These are the kinds of things you can do when you have a really good scoring system you can lean on, and you'll get to try some of these techniques today. The key point is: don't think of evals as testing in the classic sense. Think of them as the primary place where domain knowledge lives.

And then, as one of the attendees pointed out, at this point in the AI industry we have figured out a handful of standard evals (think of simple helpfulness, harmfulness, hallucinations), and that doesn't really get you far enough.
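As a concrete picture of the sample-and-score technique described above, here is a minimal best-of-N sketch: sample several high-temperature candidates, score each one with your own scoring function, and keep the best. The OpenAI client and model name are just one assumed way to get candidates, and `score_response` is a toy stand-in, not the actual scoring system from the talk.

```python
# Minimal best-of-N sketch: sample several high-temperature candidates, score each
# with your own scoring function, and keep the best. The OpenAI call and the model
# name are just one way to get candidates; score_response is a toy stand-in scorer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_response(transcript: str, response: str) -> float:
    # Toy heuristic scorer; in practice this is your scoring system.
    score = 0.0
    if "action" in response.lower():
        score += 0.5            # rewards responses that surface action items
    if len(response.split()) < 300:
        score += 0.5            # rewards concise responses
    return score

def best_of_n(system_prompt: str, transcript: str, n: int = 4) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",    # hypothetical model choice
        temperature=0.9,        # crank up the temperature to get diverse candidates
        n=n,                    # ask for n candidates in one call
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript},
        ],
    )
    candidates = [choice.message.content for choice in completion.choices]
    # Rank the candidates with the scorer and return the best one.
    return max(candidates, key=lambda r: score_response(transcript, r))
```

The lift comes entirely from the scorer: nothing about the prompt or model changes, you just pay for a few extra samples and keep the one the scoring system likes best.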
That's good enough for a guardrail, to make sure you're not doing anything wrong. But we're moving into a world where you want to build something really good. For example, if you're building a trip planner, you want that trip plan to be really excellent. One of the things I really hate about trip plans is when they're not interesting enough: I'm looking at a trip plan, I want to be excited about the place, and typically when an LLM gives me a trip plan it's very plain. So now you're trying to build these sorts of nuances into your applications, and if you want to evaluate that, that's where the industry is going. How do you build these much more nuanced evals? That's one of the places where the traditional evals fail.

So, as I went over: start simple, iterate, see what's broken, and then improve. That's the other thing you get to try today, in a copilot-like setting. Start with something, test it with a handful of examples, maybe generate some synthetic examples, test it with good examples and bad examples, see what's working and what's not, and then iterate. We picked a relatively simple example for today because, for workshop purposes, we wanted to keep it simple, but you can absolutely go and try this on fairly complex things, and there you'll see a lot more of the nuances.

So there are two parts to today's workshop. The first part is setting up your scoring system. The second part, once you have set up the scoring system, played with it, and iterated on it in the copilot, is using it in a Colab. You'll need a Google account and some proficiency in working with Colabs and Python code. We'll also introduce a spreadsheet component so you can play around with this in a spreadsheet, where you can easily make changes and test things out.

One last thing I want to hit on before we jump into the workshop: what is a scoring system? Just to give you a mental framing: ranking is scoring is eval. When Google does a search, what it's trying to do is score every document, see whether it's good or not, and then give you the document that's best for you. That's not that different from checking and scoring LLM-generated content. And the way Google does it is by breaking the problem down into a ton of signals. You can imagine it's looking at SEO, document popularity, title scores, whether the content is good or not, maybe feasibility, spam, click-baitiness, and so on, and then it brings all of these signals together into a single score that combines these ideas. The individual signals are very easy to understand: you know what each one is, you can inspect it very easily. It's not a complex prompt with some opaque score; it's something you can easily understand, but it all comes together into a single score. That's the idea we're bringing in, and that's what we call a scoring system. At the bottom level, things are very objective and tend to be deterministic, sometimes just Python code; as you move up to the top, it becomes fairly subjective, and you can combine these in fairly complex ways.
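To make the bottom layer of that stack concrete, here is a minimal sketch of combining a few cheap, deterministic signals into one weighted score. The signal names, checks, and weights are illustrative assumptions, not the signals Google Search or Pi actually use.

```python
# Minimal sketch of a layered scoring system: a few cheap, deterministic signals,
# each easy to inspect on its own, combined into a single weighted score.
# Signal names, checks, and weights are illustrative, not real Google or Pi signals.
from typing import Callable, List, Tuple

Signal = Tuple[str, Callable[[str, str], float], float]  # (name, check, weight)

def not_too_long(inp: str, out: str) -> float:
    return 1.0 if len(out.split()) <= 400 else 0.0

def on_topic(inp: str, out: str) -> float:
    # Very rough relevance proxy: does the output reuse any longer word from the input?
    keywords = {w.lower() for w in inp.split() if len(w) > 4}
    return 1.0 if any(w in out.lower() for w in keywords) else 0.0

def no_boilerplate(inp: str, out: str) -> float:
    return 0.0 if "as an ai language model" in out.lower() else 1.0

SIGNALS: List[Signal] = [
    ("length_ok", not_too_long, 1.0),        # minor
    ("on_topic", on_topic, 2.0),             # major
    ("no_boilerplate", no_boilerplate, 3.0),  # critical
]

def combined_score(inp: str, out: str) -> dict:
    per_signal = {name: check(inp, out) for name, check, _ in SIGNALS}
    total_weight = sum(weight for _, _, weight in SIGNALS)
    per_signal["overall"] = sum(
        per_signal[name] * weight for name, _, weight in SIGNALS
    ) / total_weight
    return per_signal
```

Each signal stays individually inspectable and debuggable, while the weighted sum plays the role of the single top-level score described above.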
David: I would just encourage you to think about why this could work, and we know it does work. Some of what you will struggle with in evals is simply that you're not measuring the things you need to be measuring, so you have a bit of a comprehensiveness issue, and the question becomes: how many metrics do you need to add? I gave the example that Google Search uses around 300 signals. Maybe you don't need to be that extensive, but Google does care about all those things. They care about the score of the site, popularity, content relevance, spam, adult content; they have all these classifiers, all these ways they understand the content, and then they bring it together. What this gives you is real visibility into your application, and more and more ways to marry your own judgment into it. If instead you just ask, "Hey, is this a helpful response?", mostly what you're doing is delegating that eval to the LLM itself, or to a rater you're asking, "Hey, see if this is good." But when you break it all down, you get some really nice properties. Your variance goes down a lot, just because you're measuring far more objective things, so scores aren't swinging back and forth all the time. It's very precise, because you take all these signals and add them together into a higher-fidelity score. And when you're analyzing the data, you can slice and dice by much finer-grained things. That's why the system tends to work much better. And the best part is that as you iterate, you just add more signals. It no longer leaves you in a place of "I either have evals or I don't"; rather, you have a set of metrics that you keep adding to over time as you discover what actually matters about your application.

Okay, I'll quickly show you a demo of where you're going to start today, give you some basic ideas of the kinds of things you'll be doing, and then we'll get started. This is a copilot that helps you put together your evals. Where I'm starting is basically just a system prompt. This application is a simple meeting summarizer: it takes a meeting transcript, a conversation between multiple people, and generates a structured JSON at the end, which is a summary with specific action items, key insights, and a title. It's a relatively simple thing, so it's easy for you to inspect and see where things are working well and where they aren't. Typically you'd start with something like a system prompt; you can also start with examples, if you have a bunch of examples, or you can start with the criteria themselves. The first step is that it tries to use this to build your scoring system. And this is doing exactly what was on that slide: from that coarse-grained, subjective thing, figuring out all the smaller things it can suss out of it.
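The exact schema isn't shown in the transcript, so here is a hypothetical example of the kind of structured summary the meeting summarizer is described as producing: a title, action items, and key insights.

```python
# Hypothetical shape of the summarizer's structured output. Field names and values
# are made up for illustration; the workshop app's actual schema may differ.
example_summary = {
    "title": "Q3 Launch Planning Sync",
    "action_items": [
        "Dana to draft the launch announcement by Friday",
        "Engineering to confirm the rollout date with support",
    ],
    "key_insights": [
        "Beta cohort retention is ahead of target",
        "Localization is the main open risk for the launch",
    ],
}
```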
Now, this uses a reasoning model. If you dropped this into ChatGPT or something similar, you could do it there too; we just try to replicate that experience where you get these artifacts on the right-hand side, which are your actual scoring system. But if you'd rather do it iteratively through your favorite chat interface, you can do that as well. Go ahead.

Audience: We do it in reverse, where we provide examples of input and output.

Yes, exactly. Right there in front there's an Examples button. You can start with examples; you can actually give it hundreds of examples if you want, or twenty, and then it goes into a much more involved process of figuring out these dimensions based on the examples. Or you can just copy-paste one or two examples into the prompt itself and it will generate from that. This is just a starting point, by the way; that's the idea. It starts you somewhere, and then you iterate on it, and that's the exercise you'll spend time on.

I'll show you a few things. This is your scoring system. These are your individual dimensions, which is what we call them, or you can think of them as signals. They're all effectively questions. For example: "Does the output include any insights from the meeting?" That's a natural-language question. Or you can have code, just Python code that we've generated; you can edit this code however you want, and it's as simple as what it says. When you look at the actual code, it's effectively just a bunch of questions, and you're sending these questions to our specialized foundation models that are designed for scoring and evaluation. You'll get to play with this a bunch in the Colab and see what the form of it looks like.

Another thing you'll notice is this idea of critical, major, minor. These are just weights, ways for you to control what's important and what's not. The combination is done through a mathematical function that can be learned over time, so eventually you'd give it a bunch of examples and it will learn it, but in this particular exercise you have a bit more control.

Finally, once you have your scoring system done, there's a way for you to integrate it into Google Sheets. That's the thing you're going to play around with today: taking this criteria, moving it to a Google Sheet, and testing it against real examples. Here you're going to work with synthetic examples to develop your scoring system, and then, completely blind to this, we have a labeled dataset where users gave thumbs up or thumbs down on the summaries. You're going to apply your scoring system to that and see how well it aligns with the real thumbs up / thumbs down. We don't know how it will do; it depends on you, since you're building your own scoring system, whether it aligns or not.

David: And this is a really interesting point, and why evals start to get hard. We called this workshop "solving the hardest challenge", which is metrics that actually work. This idea of correlation ends up being really, really important. Metrics that work are not necessarily good metrics or bad metrics; they're either calibrated metrics or uncalibrated metrics.
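As an illustration of that structure (not Pi's actual spec format), a scoring spec can be thought of as a list of dimensions, each a natural-language question or a code check plus an importance level, where critical/major/minor map to weights used by the combining function. The labels, questions, and weight values below are assumptions for illustration.

```python
# Illustrative scoring spec, not Pi's actual format: each dimension is a
# natural-language question (or a code check) plus an importance level, and the
# critical/major/minor levels act as weights for the combining function, which in
# the real system can eventually be learned from labeled examples.
IMPORTANCE_WEIGHTS = {"critical": 3.0, "major": 2.0, "minor": 1.0}

SCORING_SPEC = [
    {"label": "Insights present",
     "question": "Does the output include key insights from the meeting?",
     "importance": "major"},
    {"label": "Action items grounded",
     "question": "Is every action item actually discussed in the transcript?",
     "importance": "critical"},
    {"label": "Concise title",
     "question": "Is the title a short, descriptive phrase?",
     "importance": "minor"},
]
```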
At Google, for example, we had a lot of data scientists we worked with, because they would do all these correlation analyses and confusion matrices and so on. Part of the challenge of good evals is just getting comfortable with the numerical aspect of these things. And of course, having a scoring system that dissects things into much simpler pieces makes it easier to analyze, but you still have to think about those questions: does this actually correlate with goodness? If it gives a high score, is the response actually good? That's a big part of the methodology. But the good news is, as shown on the previous slide, once you have metrics you can trust, almost all of the rest of your stack gets radically simplified as a result.

Okay, so how do you use this copilot? It's created this scoring system. Maybe a place to start would be: generate an example. This is synthetic generation happening behind the scenes. It takes your system prompt and other information and tries to generate an example, and then that generated example is scored, and you can see how the individual scores work. Now you can start testing it a little more. You can say, "Can you generate a bad example?", and it will try to generate an example that's broken in some particular way. The copilot understands what you've done so far; it has the full context and uses all of that information to walk you through this. In this particular case, it created something that's missing a bunch of information. But you can ask for even more specific things, like, "Can you create an example that has broken JSON?" So that's example generation; that's one thing you can do here.

The other thing you can do is make changes to your scoring system itself through the copilot. You can either ask it to change the Python code itself, say, "Can you update the Python code?", or you can ask it to remove or add dimensions. For example, you can say, "Can you add a dimension that checks that the title is less than 20 words?", and the copilot will add that dimension. Go ahead.

Audience: (inaudible)

Exactly. You just ask the copilot: "I've updated things, here are new examples, can you add new dimensions to test these?", or "We've changed things", and it does it automatically; it knows what layer you're in. The other thing I would say is that the copilot is very helpful if you want this human-in-the-loop type of process, but once you get comfortable with the system, what we see people doing is just using the Colab with large amounts of data and letting our system figure things out on its own. Personally, I find it very useful to play with my scoring system here every so often to see if it's working well, maybe paste in an example from a user and see how it does, and go back and forth like that. But most of our clients just fire off these long-running processes. Anyway, so it just created this new title-length dimension for you.
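Here is roughly what two such code-based dimensions could look like as plain Python, checking the two cases just mentioned: an output with broken JSON, and a title under 20 words. The function names and the 1.0 / 0.0 pass-fail convention are my own for illustration, not the copilot's generated code.

```python
import json

# Two code-based dimensions of the kind just described, written as plain Python.

def valid_json(output: str) -> float:
    """Critical check: does the output parse as JSON at all?"""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def title_under_20_words(output: str) -> float:
    """Minor check: is the summary title fewer than 20 words?"""
    try:
        title = json.loads(output).get("title", "")
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if len(title.split()) < 20 else 0.0

broken = '{"title": "Weekly sync", "action_items": ['   # deliberately truncated JSON
ok = '{"title": "Weekly sync", "action_items": ["Ship v2"], "key_insights": []}'
print(valid_json(broken), title_under_20_words(broken))  # 0.0 0.0
print(valid_json(ok), title_under_20_words(ok))          # 1.0 1.0
```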
So you can play around with this: you can change the questions, remove dimensions, and so on. The last thing I'll show before we get going: one of the things you'll do is work with a spreadsheet we've pre-filled in the workshop directions. You'll make a copy of that spreadsheet, and we have a spreadsheet integration where you can run our scorer inside the spreadsheet itself. You'll see there are other places where you can use it; I'm not going to get into them here, but you can use it for reinforcement learning, for example, through one of the integrations, and there are other places where you can integrate with the Pi scorer.

In the Sheets integration, what you're going to do in this workshop is copy this wholesale and put it into a new spreadsheet with the examples that are there. What we're trying to do is just copy the criteria: there's a copy icon, so click on it, go to the spreadsheet you've made a copy of, replace the criteria that exists there with the criteria you've created, and then under Extensions (these directions are all in the doc, so you don't have to remember them) you can call the scorer. The scorer will run across about 120 examples, and then you'll see a confusion matrix showing how many times there's alignment on thumbs up, how many times there's alignment on thumbs down, and how many times there's no alignment. Then you can play around in the spreadsheet itself, make changes to the dimensions, and see if you can bring the alignment closer.

So this gives you a sense of what the spreadsheet looks like. This is your criteria, which is basically just the spreadsheet form of the English you saw. You can see this is the label, this is the question, these are the weights; in the case of Python, there's Python code. This is what your criteria sheet looks like right now. This is the default one we put in there for you, but you'll replace it with your own. This is the data; it has actual feedback, thumbs up or thumbs down, which we got from users. Some of it is fairly complex, some of it is easy, so it's a mix of all kinds of things. What you're going to do is select these two ranges, the input and the output, and then under Extensions you'll have "score selected ranges", which creates a score. Then you can look at the confusion matrix and see how it does. That's one part of the exercise, but you can easily go and make changes here, or even test things by changing the data, for example messing up the JSON and seeing how that impacts things. So that's the first phase of the workshop, and we'll talk about the second phase when we get to it. Go ahead.
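For intuition, this is roughly the alignment computation the sheet's confusion matrix is described as doing, assuming the scorer returns a number in [0, 1] and each example has a user thumbs up or thumbs down. The 0.7 threshold is an arbitrary assumption you would tune against your own labeled data.

```python
from collections import Counter

def confusion_matrix(scores, thumbs, threshold=0.7):
    """Compare scorer output (floats in [0, 1]) with user feedback (True = thumbs up).

    The 0.7 threshold is an arbitrary assumption for illustration; in practice you
    would tune it (or let calibration pick it) against your labeled data."""
    cells = Counter()
    for score, thumbs_up in zip(scores, thumbs):
        predicted_up = score >= threshold
        if predicted_up and thumbs_up:
            cells["aligned_thumbs_up"] += 1
        elif not predicted_up and not thumbs_up:
            cells["aligned_thumbs_down"] += 1
        else:
            cells["not_aligned"] += 1
    return cells

print(confusion_matrix([0.9, 0.4, 0.8, 0.2], [True, False, False, True]))
# Counter({'not_aligned': 2, 'aligned_thumbs_up': 1, 'aligned_thumbs_down': 1})
```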
Audience: I'm just curious about best practices for using this in production, when you have tens of thousands or maybe hundreds of thousands of examples. Do teams run this on a subset, or how do you do this at scale?

Right, we'll hit a lot of that in the second part of the workshop, where we go directly into the Python Colab code. The SDK goes through a bunch of the details and best practices on how to do it, and you'll get to play with it in this workshop. Our scorers are specifically designed for online workflows: those 20 or so dimensions you have are all scored in something like 50 milliseconds, so you can run at very large scale, you can run it online fairly easily, and we have batch processes and things like that set up for you. So you'll be able to play with that and create sets. Typically you want to create eval sets that are a combination of hard, easy, and medium examples. We're not going to get into synthetic data generation that much; you'll play with it in the copilot, but for the actual data generation there's documentation you can follow up with afterwards on how to create easy, medium, and hard sets for your testing purposes. The best approach is to sample some number of things from your logs and evaluate them, or just run it online. The majority of these kinds of quality checks at Google were run online, whether it's spam detection or deciding what decisions to make. That's the ideal place you want to be; for most people it's very difficult to implement at that point. I think one of the biggest challenges with LLM-as-a-judge is that it's so expensive you can't run it online, and that's one of the things this solves.

All right, go ahead.

Audience: How is this different from traditional models or smaller models?

We had a bunch of discussion on this, so maybe I'll go very quickly through these slides. These models are designed for high precision. For example, if you run the same scorer twice on the same thing, it's not going to give you different scores; it gives exactly the same score, and small variations in the input keep the scores stable. The architecture of these models is just very low variance. Part of the reason is that they use bidirectional attention instead of the attention in typical decoder models, and they have a regression head on top instead of a token-generation head. So it's not autoregressively generating tokens, which comes with a lot of weirdness, like post-hoc explanations for scores, where the model comes up with a score and then tries to justify it; that doesn't happen here. The other thing is that these models have been trained on a lot of data, billions and billions of tokens, which are only for scoring but span different kinds of content: coding content, other types of content. So they generalize really well, and that also stabilizes them quite a bit. The one thing I'd say is really nice is the interface: you just ask a question and give it the data, and it answers the question with a score, and then you can inspect the score and understand why it gave you that score. It's a fairly simple interface.
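To make the architectural contrast concrete, here is a toy PyTorch sketch of a bidirectional encoder with a regression head that emits a single score instead of generating tokens. This is not Pi's actual model; the sizes, pooling, and tokenization are placeholders.

```python
import torch
import torch.nn as nn

class ToyScoringModel(nn.Module):
    """Toy illustration of the architecture described above: a bidirectional encoder
    over (question + content) tokens with a regression head that emits one score,
    instead of autoregressively generating tokens. Not Pi's actual model; the sizes,
    pooling, and tokenization are placeholders."""

    def __init__(self, vocab_size=30_000, dim=256, layers=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   batch_first=True)
        # With no causal mask, the encoder attends bidirectionally over the sequence.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.regression_head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embed(token_ids))      # (batch, seq, dim)
        pooled = hidden.mean(dim=1)                       # simple mean pooling
        return self.regression_head(pooled).squeeze(-1)   # one score in (0, 1) per example

model = ToyScoringModel().eval()  # eval mode disables dropout, so scores are repeatable
fake_batch = torch.randint(0, 30_000, (2, 128))  # two fake tokenized (question, content) pairs
with torch.no_grad():
    print(model(fake_batch))
```

Because the forward pass is a single encoder pass into a regression head, the same inputs always map to the same score, which is the low-variance behavior described above.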
There's no prompt tuning and so on, because in this particular case, when you're evaluating, prompt tuning isn't very natural; how do you explain a rubric to a model? These models understand internally why something should score high or not. And then these signals come together using a fairly sophisticated model, which is an extension of a generalized additive model, that brings all of these different signals together by weighting them based on your thumbs up / thumbs down data. This is a process called calibration: you give it a bunch of data and it learns what's important and what's not. If something fails, should everything fail (for example, spam)? And if it succeeds, should it simply not contribute? It makes those kinds of decisions for you based on the data. So it's a much more advanced way of doing evals, and it gives you the stability you want. And the reason these models are very fast is that with bidirectional attention you can build much denser embeddings, so with fewer parameters you can get fairly high-quality scores. Go ahead.

Audience: Does it work well with other languages?

Right now we've trained it on a handful of languages. We're pretty soon going to release a model with multilingual capabilities. We don't support multimodal yet, but that's on our roadmap. So right now it's English plus a few languages, and then we'll expand beyond that.

Let's kick off the workshop, and then we'll pass it around and answer questions as we go. Do you want to share the doc? Again, as a reminder, it's withpi.ai/workshop. I'll share the doc here as well so you can all see it, and I'll also put it in the Slack channel for everybody. Please get started and let us know how it goes.

Some people may have already started working with the Colabs, but I'll quickly show the others what the second phase of the exercise is about; of course, we can continue all of this going forward. The second part of the exercise, and by the way we may not have enough time to wrap it all up here, so feel free to do it on your own, uses this pre-prepared Colab, which is available for you. This particular Colab will take you through multiple steps, and a lot of people ask me how to use this in code, so I just wanted to quickly go over it. This is the basic installation. This is your scoring spec; this is where all of your intelligence is going to live. Again, this is a relatively simple example, so it's a relatively simple scoring spec. The way you get this spec is basically over here: you go into the code view, you copy it, and you put it into your Colab. That's how you get the spec here. It's all a natural-language spec with some Python code in there. This is what you'll change, this is what you'll tweak; this is where everything is going to be.
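A rough sketch of the calibration idea, heavily hedged: the talk describes an extension of a generalized additive model, but even a plain logistic regression over per-dimension scores and thumbs labels shows the mechanics of learning which dimensions matter. All of the numbers and column meanings below are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: per-dimension scores for one example; label: user thumbs up (1) or down (0).
# All numbers are made up. Logistic regression is only a stand-in: the talk describes
# an extension of a generalized additive model, which can learn richer per-signal
# shapes and veto-like behavior for critical failures such as spam.
dimension_scores = np.array([
    [1.0, 0.9, 0.8],   # columns: valid JSON, insights present, concise title
    [1.0, 0.2, 0.9],
    [0.0, 0.8, 0.7],   # broken JSON: users tend to thumb these down
    [1.0, 0.7, 0.1],
    [0.0, 0.9, 0.9],
])
thumbs = np.array([1, 0, 0, 1, 0])

calibrator = LogisticRegression().fit(dimension_scores, thumbs)
print(calibrator.coef_)                                   # learned importance of each dimension
print(calibrator.predict_proba([[1.0, 0.6, 0.5]])[:, 1])  # calibrated probability of "good"
```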
And what you're doing in this particular Colab, which you can literally click through without doing anything more, or play around with a whole bunch, is this. First, we have some public datasets on Hugging Face that have the thumbs up / thumbs down data; if you've seen the sheet, it's the same data over here. This Colab loads that data, runs the scorer on it, returns your results, and then builds a confusion matrix similar to the one you saw there, which indicates how well aligned your scores are. That's just getting you warmed up.

Then you can use it to compare models; this is where the really interesting stuff starts happening. In this particular case we're comparing the 1.5 and 2.5 models, and you can see that 2.5 has a slightly higher score than 1.5 for this task, but because 2.5 was mostly about reasoning, there's not that much of a delta here. If you go to smaller models, like some of the mini models such as Claude Haiku, you'll see a much bigger quality delta based on your own scoring system. So now you can use your scoring system to evaluate different models. What this is doing is taking about 10 examples, calling five different models, generating responses, and then scoring them using your scoring system. That's model comparison.

The other piece is trying different prompts. People change their system prompts but worry about what will happen when they do. This is the right way to make sure you're not regressing: you have your scoring spec, you try different system prompts, and you see whether your scores go down on your test set. Again, it takes 10 examples just to demonstrate the comparison. Here we created a bad prompt and a good prompt just to accentuate the difference, so the bad prompt gets a much lower score than the good prompt on this particular task.

The last one is the one I'm very excited about, because it brings you into the online world and how to actually do this online. You take one particular transcript and test it with different numbers of samples. If you use just one sample, which is typically what you do, generating one response with a temperature of 0.7, you get a particular score. But as you increase the number of samples, what it's doing behind the scenes is creating three or four of those responses, each one different, then ranking them using the Pi scoring system you just built and picking the best one. And what you'll see is that as you increase the number of samples, the score steadily goes up; the response quality goes up. So you can literally click through this in the Colab and run it without needing anything else, but there are a bunch of options for you to play around with later. I just wanted to introduce this to you before you all disappear.
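As a sketch of the prompt-regression check in that Colab step, the snippet below compares two system prompts by mean score over a small test set. `generate` and `score` are stubs you would wire to your own model call and scoring system; they are assumptions, not the workshop SDK.

```python
from statistics import mean

def generate(system_prompt: str, transcript: str) -> str:
    # Stub: replace with your real model call (the summarizer run under this prompt).
    return f"[summary produced under prompt: {system_prompt[:20]}...]"

def score(transcript: str, response: str) -> float:
    # Stub: replace with your scoring system; here a trivial length-based placeholder.
    return min(len(response) / 100, 1.0)

def compare_prompts(prompts: dict, transcripts: list) -> dict:
    """Mean score of each candidate system prompt over a small test set."""
    return {
        name: mean(score(t, generate(prompt, t)) for t in transcripts)
        for name, prompt in prompts.items()
    }

# Usage: flag a regression if the candidate prompt's mean score drops below the current one.
# results = compare_prompts({"current": CURRENT_PROMPT, "candidate": NEW_PROMPT}, test_set)
# assert results["candidate"] >= results["current"] - 0.02
```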