Make your LLM app a Domain Expert: How to Build an Expert System — Christopher Lovejoy, Anterior
Channel: aiDotEngineer
Published at: 2025-07-28
YouTube video id: MRM7oA3JsFs
Source: https://www.youtube.com/watch?v=MRM7oA3JsFs
Hi everybody. I'm Christopher Lovejoy. I'm a medical doctor turned AI engineer, and I'm going to share a playbook for building a domain-native LLM application. I spent about eight years training and working as a medical doctor, and I've spent the last seven years building AI systems that incorporate medical domain expertise. I did that at a few different startups: I worked at a health-tech startup called Cera doing tech-enabled home care, which recently hit 500 million ARR, I worked at various other startups, and I currently work at Anterior. Anterior is a New York-based, clinician-led company where we provide clinical reasoning tools to automate and accelerate health insurance and healthcare administration. We serve health insurance providers that cover about 50 million lives in the US, and we spend a lot of time thinking about what it means to build a domain-native LLM application, whether in healthcare or otherwise. That's what I'm going to talk about today.

In particular, our bet is that when it comes to vertical AI applications, the system you build for incorporating your domain insights is far more important than the sophistication of your models and your pipelines. The limitation these days is not how powerful your model is or whether it can reason at the level you need it to. It's whether your model can understand the context of that industry and that particular customer and perform the reasoning it needs to. The way you enable that, and the way you iterate quickly with your customers, is by building the system around it, and there are various components to that, which is what I'm going to talk about.

This is the high-level schematic, and we'll go through each of its parts throughout the talk. As you'll see, right in the middle there's the PM, and in our experience it makes sense for this to be a domain-expert product manager; in our context, a clinical one. I'll go through this in more detail shortly. But first it's worth taking a quick step back and asking why it is so hard to successfully apply large language models to specialized industries. We think it's because of the last mile problem. By the last mile problem I mean the problem I just touched on: giving the model, and your AI system more generally, context and understanding of the specific workflow for that customer and that industry. I'll illustrate that with an example from a clinical case we've processed. Our AI at Anterior is called Florence. A 78-year-old female patient presented with right knee pain. The doctor recommended a knee arthroscopy, and as part of deciding whether this treatment was appropriate, whether the doctor made an appropriate decision, Florence needs to answer various questions. One of those questions is: is there documentation of unsuccessful conservative therapy for at least six weeks? On the surface that might seem relatively simple. I appreciate there may not be many doctors in the room, so you might not know what conservative therapy is, but there's actually a lot of hidden complexity in answering a question like this.
For example, what we typically mean by conservative therapy is this: when there's an option for a more aggressive treatment, maybe a surgical operation, and you decide not to operate and want to try something conservative first, that's the conservative therapy. It might be doing physiotherapy, losing weight, non-invasive things that might help resolve the problem. But there's still some ambiguity, because in some cases giving medication is a conservative therapy, and in other cases medication is the more aggressive treatment and there's something else that's more conservative. So that's one layer of ambiguity. Then there's "unsuccessful": what counts as unsuccessful? Let's say somebody has knee pain, they do some treatment, and their symptoms improve significantly but don't fully resolve. Is that successful? Do we need a full resolution of symptoms, or is a partial resolution enough? And if partial, at what point is it enough to be counted as successful? Again, there's complexity and nuance in how that's interpreted. Finally, "documentation for at least six weeks": if the medical record says they started physical therapy eight weeks ago and it's never mentioned again, can we assume they've been doing it for eight weeks? Or do we need explicit documentation that they started treatment, did it for eight weeks, and completed it? Where do we draw the line in terms of what we can infer?

Coming back to our point, this is really our bet: the system is more important. We believe that in every vertical industry, the team or company that wins is the one that builds the best system for taking those domain insights, quickly translating them into the pipeline, giving it that context, and iterating to create those improvements. To speak to the counterpoint: models obviously matter, and progress in models makes it easier to get a good starting point, but that only gets you to a certain baseline. We found we hit saturation around the 95% level despite investing a lot of time and effort improving our pipelines. Obviously 95% is still pretty reasonable, and this is on the primary task our AI system performs, which is approving these care requests in a health insurance context. So we were at 95%, we then iterated based on the system I'm going to walk through, and we got to an almost silly accuracy of around 99%; we received an award for this a few weeks ago. What we observed is that the models reason very well and get to a great baseline, but if you're in an industry where you really need to eke out that final mile of performance, you need to be able to give the model and the pipeline that context. So how do we do that? Well, we call this our adaptive domain intelligence engine.
What this engine does is take customer-specific domain insights and convert them into performance improvements, and build a system around that. There are broadly two main parts to it. The first part is the measurement side: how is our current pipeline doing? The rest is the improvement side. I'll talk first about measurement in more detail, and then about improvement.

So, measuring domain-specific performance. A lot of this is really just best practice more generally, but the first step is to define what your users really care about as metrics. In a health context, I've been talking about medical necessity reviews; this is our bread and butter, and there the customers really care about false approvals. They want to minimize false approvals, because a false approval, where you've approved care, means a patient who didn't need the care might be given care they don't need, and from the insurance provider's point of view they're then paying for treatment they don't necessarily want to pay for. Often defining these metrics is a collaboration between the domain experts in your company and the customers, to really translate what the metrics are that they care about; usually there are just one or two metrics that matter most. In other industries: in legal, when you're analyzing contracts, it might be that you really want to minimize the number of missed critical terms when identifying clauses in the contract. In fraud detection, your top-line metric might be something like preventing dollar loss from fraud. In education, you might want to optimize for test score improvements. It's a helpful exercise to push yourself to think: if I'm optimizing for just one or two metrics, which metric is the most important?

What you can do hand-in-hand with that, which is very helpful, is designing a failure mode ontology. By this I mean taking the task you're performing and identifying all the different ways in which your AI fails. It might be at the level of higher-order categories; for example, here we've got medical record extraction, clinical reasoning, and rules interpretation. We found that for medical necessity review these are the three broad ways in which the AI can fail, and within those there are various subtypes. This is an iterative process, and there are various techniques for doing it. It's important to bring in your domain experts here: one failure mode is having somebody look at your AI traces in isolation and come up with these categories without the context on how things actually work, so this is a step where it's critical to have domain experts leading the process.
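To make the ontology idea concrete, here is a minimal sketch in Python of what the failure-mode categories and a single expert review label could look like. The three top-level categories are the ones named in the talk; the field names, subtype strings, and everything else are illustrative assumptions, not Anterior's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(str, Enum):
    """Top-level failure categories for medical necessity review (from the talk)."""
    MEDICAL_RECORD_EXTRACTION = "medical_record_extraction"
    CLINICAL_REASONING = "clinical_reasoning"
    RULES_INTERPRETATION = "rules_interpretation"


@dataclass
class ReviewLabel:
    """One domain-expert review of a single AI decision (illustrative fields)."""
    case_id: str
    is_correct: bool
    is_false_approval: bool = False                    # the north-star metric in this context
    failure_category: FailureCategory | None = None    # set when is_correct is False
    failure_subtype: str | None = None                 # e.g. "missed_therapy_duration" (hypothetical)
    suggested_domain_knowledge: str | None = None      # optional free-text suggestion from the expert
```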
But really, the big value-add is when you do both of these, the metrics and the failure-mode labeling, at the same time, because of what that gives you. This is a dashboard we've built internally; I appreciate the text might be a little small, but essentially on the right-hand side you have the patient's medical record, along with the guidelines the record is being appraised against. On the left-hand side you have the AI outputs: the decision it made and the reasoning behind that decision. What we enable our domain experts, our clinicians, to do here is come in and mark whether the output is correct or incorrect, and if it's incorrect, this box is for defining the failure mode: from the ontology we just saw on the previous slide, they can say this case failed in this particular way. Doing those at the same point, having your domain expert do both in one sitting, is super valuable because it enables you to understand things like this. On the x-axis here we have the number of false approvals, the metric we really care about in our context, and on the y-axis we have the different failure modes. That tells us that if we want to minimize false approvals and optimize for that north-star metric, these are the failure modes to address first, in this order. As a PM, that's a useful piece of information to help you prioritize the work you want to do.

That's the measurement side. I'm now going to talk about the improvements, particularly with this domain-specific context. What the failure-mode labeling also gives you is ready-made data sets that you can iterate against. These data sets are super valuable because they come directly from production data, which means you know they're representative of the input data distribution you're going to see, more so than synthetic data would be. Given the priorities from the previous slide, where we saw which failure modes were causing the most false approvals, you can pick the data set of, say, 100 cases that came through production with that failure mode and give it to an engineer, and the engineer can iterate against it and keep testing: how is my performance against that particular failure mode right now? That lets you do something like this, where on the x-axis we have the pipeline version and on the y-axis we have the performance score. By definition, on these plots we start very low for each failure-mode data set, but every time you increment your pipeline version, maybe you spent some time focusing on one particular failure mode and were able to get a big jump in performance, and then you can see the others also jumping up on subsequent releases. You can also use this to check that you're not regressing on any particular failure mode, so it's a useful visualization to be able to make.
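Continuing the sketch from above, this is roughly what turning those expert labels into the prioritization view and the ready-made failure-mode eval sets might look like. It reuses the `ReviewLabel` and `FailureCategory` types from the earlier snippet; the function names and return shapes are my own, not from the talk.

```python
from collections import Counter


def false_approvals_by_failure_mode(labels: list[ReviewLabel]) -> list[tuple[str, int]]:
    """Count false approvals per failure mode, worst first.

    This is the prioritization chart described in the talk: the failure modes at the
    top of the list are the ones to fix first if false approvals are the metric you care about.
    """
    counts = Counter(
        label.failure_category.value
        for label in labels
        if label.is_false_approval and label.failure_category is not None
    )
    return counts.most_common()


def build_failure_mode_dataset(
    labels: list[ReviewLabel],
    category: FailureCategory,
    limit: int = 100,
) -> list[str]:
    """Collect production case ids that failed in one specific way.

    These become a ready-made eval set an engineer can iterate against, and a
    regression check for later pipeline versions.
    """
    return [label.case_id for label in labels if label.failure_category == category][:limit]
```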
You can then go one step further and bring your domain experts into the improvement and iteration itself. What that looks like is building tooling that enables a domain expert who isn't necessarily technical to come in, suggest changes to the application pipeline, and suggest new domain knowledge to make available to the pipeline; they're the best positioned to form opinions on what domain knowledge might be relevant. You then have your pipeline in the middle, ready to use those suggestions if it wants to, and on the right-hand side you have the domain evals, which might be these failure-set evals plus more generic eval sets. Those evals can tell you, in a data-driven way: given this domain-knowledge suggestion from a domain expert, should it go live in the platform? And once it's in production, it should be improving performance for live customers. This whole loop can happen very quickly. For example, and I'll show this on the next slide, this is the dashboard we saw before, but with an extra button for adding domain knowledge. Keeping the same context, a domain-expert clinician comes in, reviews the case, marks whether it's correct or incorrect, labels the failure mode, and can now also say: I think this domain knowledge would be helpful for the application's performance. I appreciate this might not be easy to read, but in this case the model is making a mistake related to understanding suspicion of a condition: the patient has the condition, and the model says there's no suspicion of the condition, when in fact they have it. You could give the model some information about how the word "suspicion" is interpreted in a medical context, and that would influence the answer. Or it could be that the reasoning relies on some scoring system and you realize the model doesn't have access to that scoring system; you could add that as domain knowledge too, continually building out what the model can handle. In terms of iteration speed, you can either let your evals automatically allow that knowledge in, or keep some kind of human in the loop, but either way you get a very quick process: a production case comes through, you analyze it through a clinical lens, and the same day you've essentially fixed it, because you've added the domain knowledge that should solve it, proven it with the evals, and gone live. What this means is that these domain-expert reviews, which are powering a lot of the insights here, are giving you three main things: performance metrics, failure modes, and suggested improvements, all in one.
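Here is a rough sketch of that data-driven gate: run the eval set with and without a suggested piece of domain knowledge and only promote it if the score improves. The `run_pipeline` callable, the case format, and the threshold are assumptions for illustration; in practice you might still keep a human in the loop before anything goes live.

```python
def should_promote_knowledge(
    suggestion: str,
    eval_cases: list[dict],   # each case assumed to carry an "expected_decision" key
    run_pipeline,             # hypothetical: callable(case, extra_knowledge) -> decision
    min_uplift: float = 0.0,
) -> bool:
    """Decide whether a domain-knowledge suggestion should go live, based on evals."""

    def accuracy(extra_knowledge: str | None) -> float:
        correct = sum(
            run_pipeline(case, extra_knowledge) == case["expected_decision"]
            for case in eval_cases
        )
        return correct / len(eval_cases)

    baseline = accuracy(None)
    with_suggestion = accuracy(suggestion)
    return (with_suggestion - baseline) > min_uplift
```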
Yeah, good question. So the question is: how do you define a domain expert, and what level of expertise do you need here? I think it really depends on the specific workflow you're doing and what you're optimizing for. In our context, if you're optimizing for the quality of clinical reasoning, you want somebody with as much clinical experience as possible, ideally a doctor, and ideally with relevant expertise in the specialty you're dealing with. But it really depends on your use case. There may be simpler things you're doing where that level of expertise isn't necessary, and you could have a more junior clinical person, but the idea is that it's a nurse or a doctor or somebody who has experience doing this workflow in the real world. Does that make sense? Yeah.

Another question. Yeah, this is bespoke tooling. My general philosophy here is that if you're placing a lot of weight on what you're generating, and it feeds into your system in various different ways like the ones I'm describing, it probably makes most sense to do this with bespoke tooling that you build yourself, because you want to integrate it into the rest of your platform, and that's generally easier when you're doing everything yourself.

Are your domain experts users, or do you bring them in? Yeah, great question. I think it can be both. In our experience, we typically start by hiring some people in-house who come and do this for us, to give us the initial data so that we can do that iteration. But there's definitely a world in which the customers themselves also want to validate your AI and might do this kind of process themselves, in which case this becomes a customer-facing product for them to use as well.

Okay, so, love the questions, but we're going to reserve time for Chris to keep going. Sounds good, and these are the last couple of slides anyway. So, putting everything together: this is the overall flow. You have your production application generating these decisions, these AI outputs. Your domain experts review them, giving you the performance insights: the metrics, the failure modes. You then have your PM, your domain-expert PM, who sits in the middle. They have rich information on what to prioritize based on the failure modes and the metrics, and they can turn to an engineer and say: I want you to fix this failure mode, because I really care about it, and I want you to fix it up to this performance threshold. They can say: right now in production we're getting 0% or 10% on this particular data set; go away and work on this until you get to 50%. The engineer can then run different experiments and try different ideas for how to improve it: changing prompting, changing models, doing fine-tuning, all that kind of thing. They have a very tight iteration loop, because they have these ready-made failure-mode data sets: they can run the evals and see the impact of each change. Once they've done that loop and are hitting the percentage they need, they go back to the PM and say: here are the changes I made, and this is the impact. The PM can then take those eval metrics, look at the wider context of what the change might affect elsewhere in the product, and decide whether to go live with it in production.
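As a sketch of that inner loop, the engineer's iteration could look something like this: score a candidate pipeline version against every failure-mode eval set, compare against the targets the PM has set, and check nothing has regressed. Again, `run_pipeline`, the case format, and the target values are illustrative assumptions rather than anything from the talk.

```python
def score_pipeline_version(
    version: str,
    run_pipeline,                              # hypothetical: callable(case) -> decision
    failure_mode_sets: dict[str, list[dict]],  # failure mode -> eval cases drawn from production
    targets: dict[str, float],                 # PM-set targets, e.g. {"rules_interpretation": 0.5}
) -> dict[str, float]:
    """Score one pipeline version on every failure-mode eval set and flag misses."""
    scores: dict[str, float] = {}
    for mode, cases in failure_mode_sets.items():
        correct = sum(run_pipeline(case) == case["expected_decision"] for case in cases)
        scores[mode] = correct / len(cases)
        status = "ok" if scores[mode] >= targets.get(mode, 0.0) else "below target"
        print(f"{version} | {mode}: {scores[mode]:.0%} ({status})")
    return scores
```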
So, final takeaway, just to wrap up: to build a domain-native LLM application, you need to solve the last mile problem. This isn't solved by just using more powerful models or more sophisticated pipelines; you need what we call an adaptive domain intelligence engine. Domain experts can power this system by reviewing the AI's outputs to generate metrics, failure modes, and suggested improvements. This is really powerful because it takes production data, live from inside your customers' context, and uses it to give your LLM product a nuanced understanding of the customer workflows, so you can continually iterate towards that and eke out the final levels of performance. The end result is a self-improving, data-driven process that can be managed by a domain-expert PM sitting in the middle. So, thank you for your attention. If you're interested in vertical AI applications, or in evals and AI product management more generally, I've written about that on my website, chrislovejoy.me. I'm always interested to talk about this, so feel free to drop me an email at chris@anterior.com. We're also hiring at the moment, so check out anterior.com for open roles. Thank you.