Make your LLM app a Domain Expert: How to Build an Expert System — Christopher Lovejoy, Anterior

Channel: aiDotEngineer

Published at: 2025-07-28

YouTube video id: MRM7oA3JsFs

Source: https://www.youtube.com/watch?v=MRM7oA3JsFs

Hi everybody. I'm Christopher Lovejoy. I'm a medical doctor turned AI engineer, and I'm going to share a playbook for building a domain-native LLM application. I spent about eight years training and working as a medical doctor, and I've spent the last seven years building AI systems that incorporate medical domain expertise. I did that at a few different startups: I worked at a health tech startup called Cera Care doing tech-enabled home care, which recently hit $500 million ARR, I worked at various other startups, and I currently work at Anterior.
Anterior is a New York-based, clinician-led company. We provide clinical reasoning tools to automate and accelerate health insurance and healthcare administration; we serve health insurance providers that cover about 50 million lives in the US. We spend a lot of time thinking about what it means to build a domain-native LLM application, whether in healthcare or otherwise, and that's what I'm going to talk about today.

In particular, our bet is that when it comes to vertical AI applications, the system you build for incorporating your domain insights is far more important than the sophistication of your models and your pipelines. The limitation these days is not how powerful your model is, or whether it can reason to the level you need it to. It's whether your model can understand the context of that industry, for that particular customer, and perform the reasoning it needs to. The way you enable that, and the way you iterate quickly with your customers, is by building the system around it. There are various components to that system, and that's what I'm going to talk about.
This is the high-level schematic, and we're going to go through each of these parts throughout the talk. As you'll see, right in the middle there's the PM, and in our experience it makes sense for this to be a domain-expert product manager. In our context that means a clinical PM, and I'm going to go through this in more detail shortly.

But first I think it's worth taking a quick step back and asking: why is it so hard to successfully apply large language models to specialized industries?
We think it's because of the last mile problem. What I mean by the last mile problem is the problem I touched on just now: giving the model, and your AI system more generally, context and understanding of the specific workflow for that customer and that industry.

I'm going to illustrate that with an example from a clinical case we've processed. Our AI at Anterior is called Florence. A 78-year-old female patient presented with right knee pain. The doctor recommended a knee arthroscopy, and as part of deciding whether this treatment was appropriate, whether the doctor made an appropriate decision, Florence needs to answer various questions.
One of those questions is: is there documentation of unsuccessful conservative therapy for at least six weeks? On the surface that might seem relatively simple. I appreciate there may not be a lot of doctors in the room, so you might not know what conservative therapy is, but there's actually a lot of hidden complexity in answering a question like this.

For example, what we typically mean by conservative therapy is this: when there's an option for a more aggressive treatment, maybe a surgical operation, and you decide not to operate and want to try something less invasive first, that's the conservative therapy. It might be physiotherapy, losing weight, non-invasive things that might help resolve the problem. But there's still some ambiguity, because in some cases giving medication counts as a conservative therapy, while in other cases medication is actually the more aggressive treatment and there's something else that's more conservative. So that's one layer of ambiguity.

Then take "unsuccessful". Say somebody has knee pain, they do some treatment, and their symptoms improve significantly but don't fully resolve. Is that successful? Do we need a full resolution of symptoms, or is a partial resolution enough? And if it's partial, at what point is it enough to be counted as successful? Again, there's complexity and nuance in how that's interpreted.

And finally, "documentation for at least six weeks". If the medical record says they started physical therapy eight weeks ago and it's never mentioned again, can we assume they've been doing it for eight weeks? Or do we need explicit documentation that they started treatment, did it for eight weeks, and completed it? Where do we draw the line in terms of what we can infer?
Coming back to our point: this is really our bet, that the system is more important. We believe that in every vertical industry, the company that wins is the one that builds the best system for taking those domain insights, quickly translating them into the pipeline to give it that context, and iterating to create those improvements.
To speak to the counterpoint: the models obviously matter, and the progress in models makes it easier to have a good starting point, but that only gets you up to a certain baseline. We found we hit saturation around the 95% level, even after investing a lot of time and effort improving our pipelines. Obviously 95% is still pretty reasonable, and this is on the primary task our AI system performs, which is approving care requests in a health insurance context. So we were at 95%, we then iterated based on the system I'm going to walk through, and we got to an almost silly accuracy of around 99%. We received a KLAS Points of Light award a few weeks ago for this. What we observed is that the models reason very well and get to a great baseline, but if you're in an industry where you really need to eke out that final mile of performance, you need to be able to give the model, and the pipeline, that context.
So how do we do that? We call this our adaptive domain intelligence engine. What it does is take customer-specific domain insights and convert them into performance improvements, with a system built around that. There are broadly two main parts. The first is the measurement side: how is our current pipeline doing? The rest is the improvement side. I'm going to talk first about measurement in more detail, and then about improvement.
So, measuring domain-specific performance. A lot of this is really just best practice more generally, but the first step is to define the metrics your users really care about. In a health context, I've been talking about medical necessity reviews; this is our bread and butter, and there the customers really care about false approvals. They want to minimize false approvals, because a false approval, where you've approved care, means a patient who didn't need the care might be given care they don't need, and from an insurance provider's point of view they're then paying for treatment they don't necessarily want to pay for.

Defining these metrics is often a collaboration between the domain experts in your company and the customers, to really translate what metrics they care about. Usually there are just a few metrics that matter most. In other industries it looks different: in legal, when you're analyzing contracts, you might want to minimize the number of missed critical terms when identifying clauses; in fraud detection, your top-line metric might be preventing dollar loss from fraud; in education, you might optimize for test score improvements. It's a helpful exercise to push yourself to think: if I'm optimizing for just one or two metrics, what is the metric that matters most?
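As a rough illustration, a north-star metric like this is just a simple computation over expert-reviewed cases. Here's a minimal sketch in Python; the `ReviewedCase` shape and decision labels are invented for the example, not Anterior's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ReviewedCase:
    case_id: str
    ai_decision: str      # "approve" or "deny"
    expert_decision: str  # ground truth from the domain expert review

def false_approval_rate(cases: list[ReviewedCase]) -> float:
    """Fraction of AI approvals that the expert judged should have been denied."""
    approvals = [c for c in cases if c.ai_decision == "approve"]
    if not approvals:
        return 0.0
    false_approvals = [c for c in approvals if c.expert_decision == "deny"]
    return len(false_approvals) / len(approvals)
```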
Hand in hand with that, and very helpfully, you can design a failure mode ontology. What I mean by this is taking the task you're performing and identifying all the different ways in which your AI fails. It might be at the level of higher-order categories; for example, here we've got medical record extraction, clinical reasoning, and rules interpretation. We found that for medical necessity review these are the three broad ways in which the AI can fail, and within those there are various subtypes. This is an iterative process, and there are various techniques for doing it. It's important here to bring in your domain experts. One failure mode of the process itself is having somebody look at your AI traces in isolation and come up with these categories without the context on how things actually work. This is a step where it's critical to have domain experts leading the process.
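To make the shape concrete, here's a minimal sketch of a two-level ontology as plain data. The three top-level categories are the ones from the slide; the subtype names are hypothetical placeholders to show the structure.

```python
# Two-level failure mode ontology; subtype names are illustrative only.
FAILURE_MODE_ONTOLOGY: dict[str, list[str]] = {
    "medical_record_extraction": [
        "missed_relevant_document",
        "wrong_value_extracted",
    ],
    "clinical_reasoning": [
        "misjudged_clinical_significance",
        "wrong_inference_from_findings",
    ],
    "rules_interpretation": [
        "guideline_criterion_misapplied",
        "ambiguous_term_misread",
    ],
}

def is_valid_label(category: str, subtype: str) -> bool:
    """Check a reviewer-supplied label against the ontology before saving it."""
    return subtype in FAILURE_MODE_ONTOLOGY.get(category, [])
```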
The big value-add is when you do both of these at the same time, together. This is a dashboard we've built internally; I appreciate the text might be a little small, but essentially on the right-hand side you have a patient's medical record, along with the guidelines the record is being appraised against. On the left-hand side you have the AI outputs: the decision it made and the reasoning behind that decision. What we enable our domain experts, our clinicians, to do here is come in and mark whether it's correct or incorrect, and if it's incorrect, there's a box for defining the failure mode. From the ontology we just saw, they can say: this failed in this way.

Doing both of these at the same point, with your domain expert sitting there doing both, is super valuable because it enables you to understand things like this. On the x-axis here we have the number of false approvals, the metric we really care about in our context, and on the y-axis we have the different failure modes. That tells us that if we want to minimize false approvals and optimize for our north-star metric, these are the failure modes to address first, roughly in this order. As a PM, that's a very useful piece of information to help you prioritize the work.
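That prioritization chart is essentially a group-by over the labeled reviews. A sketch, assuming each review record carries the failure-mode label from the dashboard (again an invented schema):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class LabeledReview:
    ai_decision: str      # "approve" or "deny"
    expert_decision: str  # the expert's ground-truth decision
    failure_mode: str     # ontology label; empty string if the AI was correct

def false_approvals_by_failure_mode(reviews: list[LabeledReview]) -> list[tuple[str, int]]:
    """Rank failure modes by how many false approvals they caused."""
    counts = Counter(
        r.failure_mode
        for r in reviews
        if r.ai_decision == "approve" and r.expert_decision == "deny"
    )
    return counts.most_common()  # most damaging failure modes first
```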
So that's the measurement side. I'm now going to talk about the improvements, particularly with this domain-specific context.

The failure mode labeling we talked about before also gives you ready-made data sets you can iterate against. These data sets are super valuable because they come directly from production data, which means you know they're representative of the input data distribution you're actually going to see, more so than synthetic data would be. Given the priorities from the previous slide, where we saw which failure modes were causing the most false approvals, you can pick the data set of, say, 100 cases that came through production with that particular failure mode. You can give that to an engineer, the engineer can iterate against it, and you can keep testing: how is my performance against that particular failure mode right now?
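The eval harness for this can be very small. A sketch, where `pipeline` stands in for whatever callable wraps your LLM pipeline; the interface is an assumption for illustration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    medical_record: str
    guideline: str
    expert_decision: str  # what the pipeline should have answered

def run_failure_mode_eval(pipeline: Callable[[str, str], str],
                          eval_set: list[EvalCase]) -> float:
    """Score a pipeline version against one failure-mode data set
    built from labeled production cases."""
    correct = sum(
        1 for case in eval_set
        if pipeline(case.medical_record, case.guideline) == case.expert_decision
    )
    return correct / len(eval_set)
```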
That lets you do something like this, where on the x-axis we have the pipeline version and on the y-axis the performance score. By construction you start very low on each of these failure-mode data sets, since they're built from cases the pipeline got wrong, but every time you increment your pipeline version, maybe having spent some time focusing on one particular failure mode, you can get a big jump in performance. You can see the other ones also jumping up on subsequent releases. You can also use this to check that you're not regressing on any particular failure mode. It's a useful visualization to be able to make.
You can then go one step further and bring your domain experts into the improvement iteration itself. What that looks like is tooling that enables a domain expert who isn't necessarily technical to come in. They can suggest changes to the application pipeline, and they can suggest new domain knowledge to be made available to the pipeline; they're the best positioned to have opinions on what domain knowledge might be relevant. Your pipeline sits in the middle, ready to use those suggestions if it wants to. And on the right-hand side you have your domain evals, which might be these failure-set evals plus more generic eval sets. Those can then tell you, in a data-driven way: given this domain knowledge suggestion from a domain expert, should it go live in the platform? If so, it's in production and should be improving performance for live customers. This whole loop can happen very quickly. I'll show an example on the next slide.
This is the dashboard we saw before, but with an extra button: a domain knowledge addition button. Keeping the same context, a domain expert clinician comes in, reviews the case, marks it correct or incorrect, labels the failure mode, and can now say: I think this piece of domain knowledge would help the application's performance. In this case, and I appreciate it might not be easy to read, the model is making a mistake related to understanding suspicion of a condition: the patient has the condition, and the model says there's no suspicion of the condition, when actually they have it. You could give the model some information about how, in the medical context, we interpret "suspicious" or "suspicion" as a word, which would then influence the answer. Or it could be that the reasoning uses some kind of scoring system and you realize the model doesn't have access to that scoring system; you could add that as domain knowledge too, to continually build out what the model can handle.

In terms of iteration speed, maybe you let your evals automatically approve the addition, or maybe you want some kind of human in the loop, but it means you can have a very quick process: a production case comes through, you analyze it through a clinical lens, and the same day you've essentially fixed it, because you've added the domain knowledge that should solve it, proven it with the evals, and it's live.
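The eval-gated promotion step might look something like this sketch, reusing the `run_failure_mode_eval` helper from earlier. The suggestion object and the no-regression policy are illustrative assumptions, not Anterior's actual gate:

```python
from dataclasses import dataclass

@dataclass
class KnowledgeSuggestion:
    text: str                 # the domain knowledge itself
    target_failure_mode: str  # name of the eval set it's meant to fix

def should_promote(suggestion: KnowledgeSuggestion,
                   pipeline_with, pipeline_without,
                   eval_sets: dict) -> bool:
    """Accept a suggestion only if it improves the failure mode it targets
    and doesn't regress any other eval set; a human can still review."""
    for name, eval_set in eval_sets.items():
        before = run_failure_mode_eval(pipeline_without, eval_set)
        after = run_failure_mode_eval(pipeline_with, eval_set)
        if name == suggestion.target_failure_mode and after <= before:
            return False  # didn't improve the thing it targets
        if name != suggestion.target_failure_mode and after < before:
            return False  # regressed elsewhere
    return True
```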
What this means is that the domain expert reviews powering a lot of these insights are giving you three main things: performance metrics, failure modes, and suggested improvements, all in one.
Yep. Yeah, good question. So the question is: how do you define a domain expert? What level of expertise do you need here? I think it really depends on the specific workflow and what you're optimizing for. In our context, if you're optimizing for the quality of clinical reasoning, you want somebody with as much clinical experience as possible, ideally a doctor, and ideally with relevant expertise in the specialty you're dealing with. But it really depends on your use case. There may be simpler things to do as well, in which case that level of expertise isn't necessary and you could have a more junior clinical person; the idea is that it's a nurse, a doctor, or somebody with experience doing this workflow in the real world. Does that make sense? Yeah. Another question.
Yeah, this is bespoke tooling. In general, my philosophy is that if you're placing a lot of weight on what you're generating, and it feeds into your system in various different ways like the ones I'm describing, it probably makes most sense to build bespoke tooling yourself, because you want to integrate it into the rest of your platform, and that's generally easier if you're doing everything yourself. Yeah.
Are these domain experts your users, or do you hire them to come in?

Yeah, great question. It can be both. In our experience, we typically start by hiring some people in-house who come and do this for us, to give us the initial data so we can do that iteration. But there's definitely a world in which the customer themselves wants to do validation of your AI, and they might do this kind of process themselves, in which case it becomes a customer-facing product for them to use as well.
Okay, so, love the questions, but we're going to reserve time for Chris to keep going. Sounds good, and these are the last couple of slides anyway. So, putting everything together: this is the overall flow.

You have your production application generating decisions, the AI outputs. Your domain experts review those, giving you performance insights: the metrics, the failure modes. You then have your PM, your domain-expert PM, sitting in the middle. They have rich information on what to prioritize, based on the failure modes and the metrics. They can turn to an engineer and say: I want you to fix this failure mode, because I really care about it, and I want you to fix it up to this performance threshold. They can say: right now in production we're getting 0% or 10% on this particular data set; go away and work on this until you get to 50%. The engineer can then run different experiments and try different ideas for improving it: changing prompts, changing models, fine-tuning, all that kind of thing. They have a very tight iteration loop because they have these ready-made failure mode data sets; they can run the evals and see the impact. Once they've done that loop and they're hitting the percentage they need, they can go back to the PM and say: here are the changes I made, and this is the impact. The PM can then take that information and make a decision about going live: they can take those eval metrics, look at the wider context of what the change might impact elsewhere in the product, and decide whether to ship it to production.
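As pseudocode, the PM-to-engineer contract boils down to a target score on a failure-mode data set. A hypothetical sketch of that loop, again reusing `run_failure_mode_eval`:

```python
from typing import Callable

def iterate_until_threshold(build_candidate: Callable[[int], Callable],
                            eval_set, target: float,
                            max_attempts: int = 10):
    """Try candidate pipeline versions (new prompts, models, fine-tunes)
    until one clears the PM's target, then hand it back for go-live review."""
    best_score = 0.0
    for attempt in range(max_attempts):
        candidate = build_candidate(attempt)  # e.g. a new prompt variant
        score = run_failure_mode_eval(candidate, eval_set)
        best_score = max(best_score, score)
        if score >= target:
            return candidate, score           # ready for the PM's decision
        # otherwise: inspect the failures, adjust, try again
    return None, best_score                   # target not reached yet
```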
So, the final takeaway to wrap up: to build a domain-native LLM application, you need to solve the last mile problem. It isn't solved by just using more powerful models or more sophisticated pipelines; you need what we call an adaptive domain intelligence engine. Domain experts can power this system by reviewing the AI outputs to generate metrics, failure modes, and suggested improvements. This is really powerful because it takes production data, live from inside your customers' context, and uses it to give your LLM product a nuanced understanding of customer workflows, continually iterating to eke out that final level of performance. The end result is a self-improving, data-driven process that can be managed by a domain-expert PM sitting in the middle.
So, thank you for your attention. If you're interested in vertical AI applications, or in evals and AI product management more generally, I've written about that on my website, chrislovejoy.me. I'm always interested to talk about this, so feel free to drop me an email at chris@anterior.com. We're also hiring at the moment, so check out anterior.com for open roles. Thank you.