Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear
Channel: aiDotEngineer
Published at: 2025-08-03
YouTube video id: -T6uZYYzkWw
Source: https://www.youtube.com/watch?v=-T6uZYYzkWw
Welcome everyone. I'm going to talk about practical tactics to build reliable AI applications, and why nobody does it this way yet. A little bit about myself, or why you should trust me: I have around 15 years as a startup co-founder and CTO, and for the last five years I've held executive positions at several enterprises. Most importantly, I've spent the last couple of years developing a lot of gen AI projects, ranging from POCs to many production-level solutions, and I've helped companies get them done. Along the way I've distilled a way to make these applications reliable. There are quite a lot of tracks at this conference about evals and reliability, but to my surprise nobody was talking about the most important things, and that's what we're going to talk about right now.

The standard software development life cycle is very simple: you design your solution, you develop it, you test it, and eventually you deploy it. When people start doing a POC with AI, it sounds just as simple: you have a prompt, and the models are very capable. But then you start facing unexpected challenges. You can easily build a POC that works 50% of the time, but making it work reliably the remaining 50% is very hard, because models are nondeterministic. It starts requiring a data science approach of continuous experimentation: you need to try this prompt, try that model, try this approach, and so on. And everything that makes up your solution (your code, your logic, the prompts you use, the models you use, the data your solution is based on) is such that changing any of it impacts your solution in unexpected ways.

People very often try to solve this with the wrong approach. They start with data science metrics. It sounds reasonable, right? It requires a data science approach of experimentation, so people start measuring groundedness, factuality, bias, and other metrics that don't really help you understand whether your solution is working the right way, or whether your latest change actually improved it for your users. For example, I was talking to an ex-colleague who is building a customer support bot. I asked him how he knows that the solution is working well, and he started talking about factuality and other data science metrics. I started to dig deeper, and together we figured out that the most important metric for them is the rate of escalation from the AI support bot to a human agent. An answer could be super grounded and still not be the answer the user expects, and that is what you actually need to test.

My experience is to start with real-world scenarios. Basically, you need to reverse engineer your metrics, and your metrics should be very specific to your end goal: they should come from the product experience and from business outcomes. If your solution is a customer support bot, you need to figure out what your users want and how you can mimic it. And instead of measuring something average or generic, you need to measure very specific criteria, because universal evals don't really work. So how do we do it?
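To make the escalation-rate idea concrete, here is a minimal sketch of what such a business-outcome metric could look like in code. The conversation record and the `escalated_to_human` flag are hypothetical illustrations, not something taken from the talk or from Multinear:

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    """One support-bot conversation; the field names are illustrative."""
    id: str
    escalated_to_human: bool  # did the bot fail and hand off to a person?

def escalation_rate(conversations: list[Conversation]) -> float:
    """Share of conversations the bot could not resolve on its own."""
    if not conversations:
        return 0.0
    escalated = sum(1 for c in conversations if c.escalated_to_human)
    return escalated / len(conversations)

# Track this number per release instead of a generic "groundedness" score.
logs = [
    Conversation("c1", escalated_to_human=False),
    Conversation("c2", escalated_to_human=True),
    Conversation("c3", escalated_to_human=False),
]
print(f"Escalation rate: {escalation_rate(logs):.0%}")  # -> 33%
```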
So for example, a customer support bot, which is by the way one of the hardest things to do properly. Let's say I have a bank, and the bank has FAQ materials that include things like "how do I reset my password?" What I usually do with the companies I help to build AI solutions is start by reverse engineering the evals from those materials. In most cases I use an LLM to come up with the right evaluations. Here I can take, say, o1, or o3 now, and reverse engineer what the user question should be that we know how to answer based on these materials, and what the specific criteria are that these materials provide an answer for. Some of these criteria are quite important. For example, here the material says that as part of the process you need to pass a mobile validation, so you receive an SMS code, and that if you don't have a mobile number you can reach support, and so on. If some of that information is missing from the answer, the answer is not correct. You need to be very specific about what exact information you need to see in the answer, and that information is specific to that particular question. So from the materials you need to build lots of evals that mimic the specific user questions you need to be able to answer.

How do we do it? Usually I work with smart models like o3, and I provide enough context. I provide the personas we're trying to represent, because the same question can be asked in completely different ways depending on which persona is asking, yet you would expect exactly the same answer, so you need to account for that.

This is an example from the open source platform we have that just helps to get it done. Look up Multinear. I'm not trying to sell you anything or lock you into a vendor; it's completely open source, and if needed I could recreate it in a couple of days now with Cursor. The point is the approach, not the platform. So here we see that very same question, "how do I reset my password?" You see what the input was, what the output was, and the specific criteria I measure for that specific question, which are how I know whether the answer is correct. Now I can iterate and generate, say, 50 different variations of the same question and see if I still get the right answer, that is, whether the answer matches the whole checklist I have for that specific answer.

Here is how the process usually works. Contrary to the regular approach, you build your evals not at the end of the process but at the very beginning. You build the first version of your POC, you define the first version of your tests, your evaluations, you run them, and you see what's going on. You will see that in some cases it fails and in some cases it succeeds. What's important is to look at the details, not just the average numbers. The average numbers won't tell you anything, and they won't tell you how to improve. If you actually look at the details of each evaluation, you'll see exactly why it's failing. It could be failing because your test is not defined correctly, or it could be failing because your solution is not working as it should.
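To make the reverse-engineering step above concrete, here is a minimal sketch of generating eval cases from an FAQ excerpt with personas and a per-question checklist. It assumes the OpenAI Python client and a reasoning model such as o3, as mentioned in the talk; the prompt wording, the JSON shape, and the function name are my own illustration, not Multinear's actual implementation:

```python
import json
from openai import OpenAI

client = OpenAI()

FAQ_EXCERPT = """
To reset your password, request an SMS validation code sent to your mobile
number. If you have no mobile number on file, contact support to verify
your identity first.
"""

PERSONAS = ["confused first-time customer", "impatient power user"]

def generate_eval_cases(faq: str, personas: list[str]) -> list[dict]:
    """Ask the model to reverse engineer user questions plus a checklist of
    facts a correct answer must contain, one case per persona."""
    prompt = (
        "From the FAQ excerpt below, produce a JSON object with a 'cases' "
        "array. Each item must have 'persona', 'question' (phrased the way "
        "that persona would ask it) and 'checklist' (the specific facts a "
        "correct answer must include).\n\n"
        f"Personas: {personas}\n\nFAQ:\n{faq}"
    )
    resp = client.chat.completions.create(
        model="o3",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["cases"]

# A generated case might look like:
# {"persona": "confused first-time customer",
#  "question": "I forgot my password, what do I do?",
#  "checklist": ["mentions the SMS validation code",
#                "mentions contacting support if no mobile number is on file"]}
```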
In order to fix a failure, you may need to make a change: you may change the model, you may change something in your logic, you may change a prompt, or you may change the data you use to answer the question in our example. Basically, what you are doing now is experimentation. You start running your experiments and you change something. You need to define these tests in a way that helps you make an educated guess about what to change. In some cases it will work, in some cases it won't. But even when it works, say you change something in your prompt and it fixes this test, in my experience it very often breaks something that used to work before. You get constant regressions, and if you don't have these evaluations there is no way you'll be able to catch them in time. This is hugely important.

So what actually happens is that you build the first version of your solution, you build the first version of your evals, you run the evals, you improve something, you improve your evals or add more evaluations, and you keep iterating until you reach the point where you're satisfied with your evals for this specific solution at this specific point in time. What you've got is your baseline, your benchmark, and now you can start optimizing with the confidence that the tests are working. Now you can try another model: how do I check whether 4o-mini works as well as 4o? Can I use GraphRAG, or can I try a simpler solution? Should I use an agentic approach, which may work better but requires more time and more inference cost, or should I simplify the logic, maybe just for a specific portion of the application? Having this benchmark allows you to run all these experiments with confidence.

But again, the most important part is how you reach this benchmark. While the approach is pretty much the same, the evaluations you need to build, and how you build them, are completely different depending on the solution, because the models are super capable right now and allow you to build a huge variety of solutions, yet each solution is quite different in terms of how you evaluate it. For a support bot you typically use LLM-as-a-judge, as in my example. If you're building text-to-SQL or text-to-graph-database, then in my experience the best way is to create a mock database that represents whatever database or databases your solution needs to work with: it has the same schema and mock data, so you know exactly what to expect for specific questions. If you need to build a classifier for call center conversations, then your tests are a simple match: is this the right rubric or not? And the same approach applies to guardrails. Getting back to the customer support bot example, for guardrails you need to cover questions that should not be answered, questions that should be answered in a different way, or questions whose answers are not in the material. All of these you can put into your benchmark; it's a different type of benchmark, but it's pretty much the same approach.
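Here is a minimal sketch of the regression-catching part of that loop: re-run every eval case after a change and compare against the stored baseline. The data shapes are hypothetical, and the toy substring `judge` only stands in for the LLM-as-a-judge call described above:

```python
# Stored results from the last "good" run: case id -> passed?
baseline = {"reset-password-basic": True, "reset-password-no-mobile": True}

def judge(answer: str, checklist: list[str]) -> bool:
    """Toy stand-in for an LLM judge: a naive substring check. In practice
    this would be a model call scoring each checklist item in the answer."""
    return all(item.lower() in answer.lower() for item in checklist)

def find_regressions(results: dict[str, bool], baseline: dict[str, bool]) -> list[str]:
    """Return the ids of cases that passed on the baseline run but fail now."""
    return [cid for cid, passed in results.items()
            if baseline.get(cid, False) and not passed]

# After changing a prompt or swapping models, re-run every case:
new_results = {
    "reset-password-basic": judge("Request an SMS validation code.", ["SMS"]),
    "reset-password-no-mobile": judge("Just click reset.", ["contact support"]),
}
for cid in find_regressions(new_results, baseline):
    print(f"REGRESSION: {cid} used to pass and now fails")
```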
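And a sketch of the text-to-SQL flavour of the same idea: a mock database with the production schema but controlled data, so the expected result of every eval question is known in advance. The schema, the data, and the `generate_sql` stub are illustrative assumptions, not the talk's actual system:

```python
import sqlite3

# In-memory mock database mirroring the production schema, filled with data
# we control, so every eval question has one known correct result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, owner TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "alice", 120.0), (2, "bob", 75.5)])

def generate_sql(question: str) -> str:
    """Illustrative stub; in the real system an LLM would translate the
    question into SQL against the same schema."""
    return "SELECT balance FROM accounts WHERE owner = 'alice'"

def eval_case(question: str, expected: list[tuple]) -> bool:
    """Run the generated SQL against the mock data and compare exact rows."""
    rows = conn.execute(generate_sql(question)).fetchall()
    return rows == expected

print(eval_case("What is Alice's balance?", [(120.0,)]))  # -> True
```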
So, just to reiterate the key takeaways: you need to evaluate your apps the way your users actually use them, and avoid abstract metrics, because those abstract metrics don't really measure anything important. The approach is experimentation, so you run these evaluations frequently. That allows you to make rapid progress with fewer regressions, because testing frequently helps you catch these surprises in time. But most importantly, if you define your evaluations correctly, you get your solution pretty much as explainable AI, because you know exactly what it does and exactly how it does it when you test it the right way.

Thank you very much. Take a look at Multinear; that's a platform you can use to run these evaluations. You can totally use any other platform, because the approach is quite simple and doesn't require any specific platform. I built Multinear only because no other platform helped me run the evaluation process end to end this way. I'm working on a startup that does reliable AI automation right now. And, yeah, thank you very much.