Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear
Channel: aiDotEngineer
Published at: 2025-08-03
YouTube video id: -T6uZYYzkWw
Source: https://www.youtube.com/watch?v=-T6uZYYzkWw
Welcome everyone. I'm going to talk about practical tactics to build reliable AI applications, and why nobody does it this way yet. A little bit about myself, or why you should trust me: I have around 15 years as a startup co-founder and CTO, and for the last five years I've held executive positions at several enterprises. Most importantly, I've spent the last couple of years developing a lot of gen AI projects, ranging from POCs to many production-level solutions, and I've helped companies get them done. Along the way I've distilled a way to make these applications reliable. There are quite a lot of tracks at this conference about evals and reliability, but to my surprise nobody was talking about the most important things, and that's what we're going to talk about right now.

The standard software development life cycle is very simple: you design your solution, you develop it, you test it, and eventually you deploy it. When people start doing a POC with AI, it sounds just as simple: you have a prompt, and the models are very capable. But then you start facing unexpected challenges. You can easily build a POC that works 50% of the time, but making it work reliably the remaining 50% is very hard, because models are nondeterministic. It starts requiring a data science approach of continuous experimentation: you need to try this prompt, try that model, try this approach, and so on. And everything that makes up your solution (your code, your logic, the prompts you use, the models you use, the data your solution is based on) is such that changing any of it impacts your solution in unexpected ways.

People very often try to solve this with the wrong approach. They start with data science metrics. It sounds reasonable, right? It requires a data science approach of experimentation, so people start measuring groundedness, factuality, bias, and other metrics that don't really help you understand whether your solution is working the right way, or whether your latest change actually improved it for your users. For example, I was talking to an ex-colleague who is building a customer support bot. I asked him how he knows that the solution is working well, and he started talking about factuality and other data science metrics. I started to dig deeper, and together we figured out that the most important metric for them is the rate of escalation from the AI support bot to a human agent. An answer could be super grounded and still not be the answer the user expects, and that is what you actually need to test.

My experience is to start with real-world scenarios. Basically, you need to reverse engineer your metrics, and your metrics should be very specific to your end goal: they should come from the product experience and from business outcomes. If your solution is a customer support bot, you need to figure out what your users want and how you can mimic it. And instead of measuring something average or generic, you need to measure very specific criteria, because universal evals don't really work. So how do we do it?
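To make the escalation-rate idea concrete, here is a minimal sketch of what such a business-outcome metric could look like in code. The conversation record and the `escalated_to_human` flag are hypothetical illustrations, not something taken from the talk or from Multinear:

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    """One support-bot conversation; the field names are illustrative."""
    id: str
    escalated_to_human: bool  # did the bot fail and hand off to a person?

def escalation_rate(conversations: list[Conversation]) -> float:
    """Share of conversations the bot could not resolve on its own."""
    if not conversations:
        return 0.0
    escalated = sum(1 for c in conversations if c.escalated_to_human)
    return escalated / len(conversations)

# Track this number per release instead of a generic "groundedness" score.
logs = [
    Conversation("c1", escalated_to_human=False),
    Conversation("c2", escalated_to_human=True),
    Conversation("c3", escalated_to_human=False),
]
print(f"Escalation rate: {escalation_rate(logs):.0%}")  # -> 33%
```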
So for example, a customer support bot, which is by the way one of the hardest things to do properly. Let's say I have a bank, and the bank has FAQ materials that include things like "how do I reset my password?" What I usually do with the companies I help to build AI solutions is start by reverse engineering the evals from those materials. In most cases I use an LLM to come up with the right evaluations. Here I can take, say, o1, or o3 now, and reverse engineer what the user question should be that we know how to answer based on these materials, and what the specific criteria are that these materials provide an answer for. Some of these criteria are quite important. For example, here the material says that as part of the process you need to pass a mobile validation, so you receive an SMS code, and that if you don't have a mobile number you can reach support, and so on. If some of that information is missing from the answer, the answer is not correct. You need to be very specific about what exact information you need to see in the answer, and that information is specific to that particular question. So from the materials you need to build lots of evals that mimic the specific user questions you need to be able to answer.

How do we do it? Usually I work with smart models like o3, and I provide enough context. I provide the personas we're trying to represent, because the same question can be asked in completely different ways depending on which persona is asking, yet you would expect exactly the same answer, so you need to account for that.

This is an example from the open source platform we have that just helps to get it done. Look up Multinear. I'm not trying to sell you anything or lock you into a vendor; it's completely open source, and if needed I could recreate it in a couple of days now with Cursor. The point is the approach, not the platform. So here we see that very same question, "how do I reset my password?" You see what the input was, what the output was, and the specific criteria I measure for that specific question, which are how I know whether the answer is correct. Now I can iterate and generate, say, 50 different variations of the same question and see if I still get the right answer, that is, whether the answer matches the whole checklist I have for that specific answer.

Here is how the process usually works. Contrary to the regular approach, you build your evals not at the end of the process but at the very beginning. You build the first version of your POC, you define the first version of your tests, your evaluations, you run them, and you see what's going on. You will see that in some cases it fails and in some cases it succeeds. What's important is to look at the details, not just the average numbers. The average numbers won't tell you anything, and they won't tell you how to improve. If you actually look at the details of each evaluation, you'll see exactly why it's failing. It could be failing because your test is not defined correctly, or it could be failing because your solution is not working as it should.
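To make the reverse-engineering step above concrete, here is a minimal sketch of generating eval cases from an FAQ excerpt with personas and a per-question checklist. It assumes the OpenAI Python client and a reasoning model such as o3, as mentioned in the talk; the prompt wording, the JSON shape, and the function name are my own illustration, not Multinear's actual implementation:

```python
import json
from openai import OpenAI

client = OpenAI()

FAQ_EXCERPT = """
To reset your password, request an SMS validation code sent to your mobile
number. If you have no mobile number on file, contact support to verify
your identity first.
"""

PERSONAS = ["confused first-time customer", "impatient power user"]

def generate_eval_cases(faq: str, personas: list[str]) -> list[dict]:
    """Ask the model to reverse engineer user questions plus a checklist of
    facts a correct answer must contain, one case per persona."""
    prompt = (
        "From the FAQ excerpt below, produce a JSON object with a 'cases' "
        "array. Each item must have 'persona', 'question' (phrased the way "
        "that persona would ask it) and 'checklist' (the specific facts a "
        "correct answer must include).\n\n"
        f"Personas: {personas}\n\nFAQ:\n{faq}"
    )
    resp = client.chat.completions.create(
        model="o3",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["cases"]

# A generated case might look like:
# {"persona": "confused first-time customer",
#  "question": "I forgot my password, what do I do?",
#  "checklist": ["mentions the SMS validation code",
#                "mentions contacting support if no mobile number is on file"]}
```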
In order to fix a failure, you may need to make a change: you may change the model, you may change something in your logic, you may change a prompt, or you may change the data you use to answer the question in our example. Basically, what you are doing now is experimentation. You start running your experiments and you change something. You need to define these tests in a way that helps you make an educated guess about what to change. In some cases it will work, in some cases it won't. But even when it works, say you change something in your prompt and it fixes this test, in my experience it very often breaks something that used to work before. You get constant regressions, and if you don't have these evaluations there is no way you'll be able to catch them in time. This is hugely important.

So what actually happens is that you build the first version of your solution, you build the first version of your evals, you run the evals, you improve something, you improve your evals or add more evaluations, and you keep iterating until you reach the point where you're satisfied with your evals for this specific solution at this specific point in time. What you've got is your baseline, your benchmark, and now you can start optimizing with the confidence that the tests are working. Now you can try another model: how do I check whether 4o-mini works as well as 4o? Can I use GraphRAG, or can I try a simpler solution? Should I use an agentic approach, which may work better but requires more time and more inference cost, or should I simplify the logic, maybe just for a specific portion of the application? Having this benchmark allows you to run all these experiments with confidence.

But again, the most important part is how you reach this benchmark. While the approach is pretty much the same, the evaluations you need to build, and how you build them, are completely different depending on the solution, because the models are super capable right now and allow you to build a huge variety of solutions, yet each solution is quite different in terms of how you evaluate it. For a support bot you typically use LLM-as-a-judge, as in my example. If you're building text-to-SQL or text-to-graph-database, then in my experience the best way is to create a mock database that represents whatever database or databases your solution needs to work with: it has the same schema and mock data, so you know exactly what to expect for specific questions. If you need to build a classifier for call center conversations, then your tests are a simple match: is this the right rubric or not? And the same approach applies to guardrails. Getting back to the customer support bot example, for guardrails you need to cover questions that should not be answered, questions that should be answered in a different way, or questions whose answers are not in the material. All of these you can put into your benchmark; it's a different type of benchmark, but it's pretty much the same approach.
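Here is a minimal sketch of the regression-catching part of that loop: re-run every eval case after a change and compare against the stored baseline. The data shapes are hypothetical, and the toy substring `judge` only stands in for the LLM-as-a-judge call described above:

```python
# Stored results from the last "good" run: case id -> passed?
baseline = {"reset-password-basic": True, "reset-password-no-mobile": True}

def judge(answer: str, checklist: list[str]) -> bool:
    """Toy stand-in for an LLM judge: a naive substring check. In practice
    this would be a model call scoring each checklist item in the answer."""
    return all(item.lower() in answer.lower() for item in checklist)

def find_regressions(results: dict[str, bool], baseline: dict[str, bool]) -> list[str]:
    """Return the ids of cases that passed on the baseline run but fail now."""
    return [cid for cid, passed in results.items()
            if baseline.get(cid, False) and not passed]

# After changing a prompt or swapping models, re-run every case:
new_results = {
    "reset-password-basic": judge("Request an SMS validation code.", ["SMS"]),
    "reset-password-no-mobile": judge("Just click reset.", ["contact support"]),
}
for cid in find_regressions(new_results, baseline):
    print(f"REGRESSION: {cid} used to pass and now fails")
```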
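And a sketch of the text-to-SQL flavour of the same idea: a mock database with the production schema but controlled data, so the expected result of every eval question is known in advance. The schema, the data, and the `generate_sql` stub are illustrative assumptions, not the talk's actual system:

```python
import sqlite3

# In-memory mock database mirroring the production schema, filled with data
# we control, so every eval question has one known correct result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, owner TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "alice", 120.0), (2, "bob", 75.5)])

def generate_sql(question: str) -> str:
    """Illustrative stub; in the real system an LLM would translate the
    question into SQL against the same schema."""
    return "SELECT balance FROM accounts WHERE owner = 'alice'"

def eval_case(question: str, expected: list[tuple]) -> bool:
    """Run the generated SQL against the mock data and compare exact rows."""
    rows = conn.execute(generate_sql(question)).fetchall()
    return rows == expected

print(eval_case("What is Alice's balance?", [(120.0,)]))  # -> True
```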
So, just to reiterate the key takeaways: you need to evaluate your apps the way your users actually use them, and avoid abstract metrics, because those abstract metrics don't really measure anything important. The approach is experimentation, so you run these evaluations frequently. That allows you to make rapid progress with fewer regressions, because testing frequently helps you catch these surprises in time. But most importantly, if you define your evaluations correctly, you get your solution pretty much as explainable AI, because you know exactly what it does and exactly how it does it when you test it the right way.

Thank you very much. Take a look at Multinear; that's a platform you can use to run these evaluations. You can totally use any other platform, because the approach is quite simple and doesn't require any specific platform. I built Multinear only because no other platform helped me run the evaluation process end to end this way. I'm working on a startup that does reliable AI automation right now. And, yeah, thank you very much.