How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe
Channel: aiDotEngineer
Published at: 2025-07-22
YouTube video id: coKKKKh8Vns
Source: https://www.youtube.com/watch?v=coKKKKh8Vns
[Music] Hey everyone. Um hope you are having a great conference. Um so I'm going to talk about uh how to run events at scale and thinking beyond accuracy or similarity. Uh so in the last uh presentation we we learned about like how to art u architect the AI applications um and then whys are important. In this presentation I am going to talk about like the importance of ewells as well as what type of ewells we have to choose when we are crafting an application. This is a bit about me. Um so I work as a lead engineer for applied AI for developer platforms at Adobe. Um I have also co-authored um CI/CD design patterns book um and also involved in a lot of open source work uh across the communities. So let's get started. So how many of you have seen this or are active on the Twitter right now like have you seen these kind of patterns emerging? Um I think this morning there was a talk where this snapshot was again surfaced. So the one of the most important trends in AI application development is EVs because without Ewells we can't uh we can't craft any AI application then um how many of you are developing an AI application be it a rag chatbot agents anything so if you are working on that you often have come across these kind of questions like how do I test applications when outputs are nondeterministic and require subjective judgment because we all know in LLM world uh you can have the different output for the same set of input. LLMs are nondeterministic or how many times you wondering like if I am changing a prompt what is going to break or how am I going to test that and then most importantly when you are developing an application in order to uh measure the performance or accuracy you need to find out what tools to use what metrics to uh use or what models are best because models are getting capable day by day and the answer is Ewells. So, Ewells is the fundamental approach where you are writing sort of test cases to measure your AI applications. And why do we why do they matter? Uh because without measuring something uh it can have various impacts. It can uh impact your business. Uh you need to measure whatever system output is being produced. How do you align your application with system goals or one of the important aspect is how do you keep getting better because applications are you are developing applications day by day and you need to make sure it is getting better and then trust and accountability. This is one of the aspects um uh which is very important because whenever you are developing something for a customer uh you need to make sure uh they trust your application whatever output is being generated. Now when we talk about EVLs, one of the important aspect to focus on is data. So when we think about EVs, when we think about the tests, how do we start? So the very first step is starting with the data. Now how do you get the data? So there are a couple of approaches to get the data. One is you start small and you start with the synthetic data. Synthetic data means you can generate the um you can generate the ancillary data, you can generate the artificial data and start validating uh your application's output against that data. Then it's evalu uh it's a continuous improvement process. It means every time you generate some output you need to observe the system and then you need to keep on defining that data set whatever data set you are procuring and another aspect is you need to label your data accordingly. So because data is fundamental to writing Ewells. So you need to when generating the data you need to define your data set in in a way where it is labeled into different aspects. It is covering multiple flows or application prospects. So things like that and then you need to continuously refine that. Another another u approach which I have learned from my experience is you one data set is never sufficient. So when you are thinking about ewell you need to think about multiple data sets based on the flows based on the applications and whatever you are trying to achieve. Now when we think about evaluation uh what do we think about? So what do you want to evaluate? The answer is everything. But what does that mean? So you need to defi start by defining your goals and objectives or what do you want to evaluate in your system. Then you need to design in a way where you have mod modules defined for each of the components. You need to optimize your data handling. Um and I I notice I'm mentioning data again and again but the point is you need to have different data sets for different flows. Uh you need to test your flows outputs and paths. So if your application involves multiple flows, multiple path, you need to evaluate in in all all parts. Now adaptive evals. So one of the uh previous presentation talked about like there is no universal eval and that's that is again most important thing because your evals depend upon what you want what type of application you want to evaluate. For example, evaluating a rag application a typical rag application is different from code generation. uh if you are dealing with a typical Q&A type of application, you can define your email such as accuracy or similarity or usefulness versus when you are generating a code u you want to uh test the generated code against the actual code base. So that is where you need to def uh measure your functional correctness of the code generated or how robust that code is generated. Then uh when you are trying to evaluate agents. So one of the important aspect for evaluating agents is trajectory evaluation because agents can take a different path and often times you need to define which path they are taking in order to execute a flow. There is also multi-turn sim uh simulation where most of these agents are complex and you you need to check like when you are having a conversation like how do we evaluate that? Then if you are doing the tool call then you also need to check the correctness or test suite or like how they are how the data is being generated. Now another aspect is how do you scale EL. So one one strategy is you can catch the intermediate results and regression. You need to focus on orchestration and parallelism like how you are how you are running your EV vals how you are orchestrating them how you are paralyzing them. You need to aggregate the results and then you need the important aspect here is you need to run them frequently and then improve upon. So one of the uh term which is being used in industry is measure, monitor, analyze and repeat. So you need to often measure it, you need to analyze that and iterate on that. Then you need to strategize what you want to measure. So again depending upon the use case there are different type of matrices or different type of methodologies you need to adapt to. And then again use there is no fixed strategy to run your ewell. So use what what fits best. In some cases uh you want your humans in the loop to be taking precedence. In some cases you have u uh automation test automation ewells running in there is a fine balance or trade-off between human in the loop versus automation like whether you want the high speed versus high fidelity. So again depending upon what you want to achieve um you want to take give a fine balance on that and rely on process over tools. Reason is because tools again you cannot automate everything. So you need to define and establish the process. How do you want to run the EVs? So these are some of the key takeaways we just talked about. Um so one is EVs are the most important aspect for AI application. uh there is a term being coined now evalu development uh which is if if you think about typical software like testdriven development this is the eval development define ewells based on the use cases uh you need to focus on positive as well as negative cases then focus on the data that is I cannot emphasize enough on uh on that and then remember to measure monitor analyze iterate in a loop continuously and always take a balance approach in fidelity versus speed. Uh if you have any questions, uh there's a barcode. You can come later and chat with me. Happy to chat more. And that's all from now. [Music]