How Intuit uses LLMs to explain taxes to millions of taxpayers - Jaspreet Singh, Intuit
Channel: aiDotEngineer
Published at: 2025-07-23
YouTube video id: _zl_zimMRak
Source: https://www.youtube.com/watch?v=_zl_zimMRak
Hi, I'm Jaspreet. I'm a senior staff engineer at Intuit, where I work on GenAI for TurboTax, and today we'll be talking about how we use LLMs at Intuit to help you understand your taxes better. Just to understand the scale: TurboTax successfully processed 44 million tax returns for tax year 2023, and that's really the scale we're going for. We want everybody to have high confidence in how their taxes are filed, to understand them, and to know they are getting the best deductions they can. So this is the experience we work on. You go into TurboTax, you enter your information, then you go through what credits you are eligible for, and so on. We help you expand on how you are getting the tax breaks you're getting and understand them better. And this is another example: the overall tax outcome, your overall refund for the year. Now, Intuit's GenAI experiences are built on top of our proprietary GenOS, the generative OS we have built as a platform capability, and it has a lot of different pieces that you see here. The key motivation is that we found a lot of the GenAI tooling that comes out of the box doesn't support all our use cases. Most prominently, working in tax means we are in a regulated business, so safety and security are very, very important, and we want to build something a company at Intuit's scale can use end to end, at really large scale. That's where GenOS comes in. There are different pieces: on the UI side there's GenUX, and then there's the orchestrator. That's the piece where, with different teams working on different components and different LLM solutions, you find the right solution to answer the right question. Intuit calls the entire experience we power through this Intuit Assist.
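The orchestrator's job of finding "the right solution to answer the right question" can be sketched as a simple dispatcher. Everything here is a hypothetical stand-in, not Intuit's actual GenOS orchestrator: a real router would likely use an intent classifier rather than these toy rules, and all category names are invented for illustration.

```python
# Toy routing sketch: decide which kind of solution should answer a query.
# Categories and rules are illustrative assumptions only.

def route(query: dict) -> str:
    """Dispatch a query to a solution type based on its shape."""
    if query.get("screen") == "refund_summary" and not query.get("text"):
        # Known UI context, no free-form text: a prepared "static" prompt.
        return "static_prompt"
    if query.get("text"):
        # Free-form user question: retrieval-grounded Q&A.
        return "rag_pipeline"
    return "clarify_with_user"

decision = route({"screen": "refund_summary"})
```

In a platform like the one described, each returned label would map to a component owned by a different team.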
So I'm going to deep dive into the specific pieces our team used to build out the experience for TurboTax. As I said earlier, we have millions and millions of customers coming in, so we're trying to build a scalable solution that works end to end. On this slide I'm going to talk about the different pieces powering the experience. The first iteration was prompt tooling: a prompt-based solution to walk through what's going on in your tax situation. Take the example I was showing earlier, your tax refund. Your tax refund has many constituents: your deductions, your credits, the standard deduction, W-2 withholding, and so on. We want to make sure you understand all of that, so we built a prompt-based solution around it and worked from there. The production model we went with for this use case is Claude. Intuit is one of the biggest users of Claude; we had a multi-million-dollar contract for this year as well. You'll also see OpenAI over there; that's what we used for other question answering. On the slide we distinguish static and dynamic queries. A static query would be what I was showing earlier: we know you are looking at your summary and want to see what happened overall, so that's a static prompt. Think of it like a prepared statement; the additional information we gather is the tax info when the user comes in.
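The "prepared statement" analogy can be made concrete: a static prompt is a fixed, expert-reviewed template whose only variable parts are figures filled in from the user's tax data. The template text, field names, and numbers below are all hypothetical, not Intuit's actual prompt.

```python
# Hypothetical static prompt: a fixed template (like a prepared statement)
# filled with figures supplied by the tax engine, never computed by the LLM.

REFUND_SUMMARY_TEMPLATE = """You are a tax explainer. Using ONLY the figures
provided below, explain to the filer how their federal refund was computed.
Do not invent or recompute any numbers.

Refund: ${refund}
Total W-2 withholding: ${withholding}
Standard deduction: ${standard_deduction}
Credits applied: ${credits}
"""

def build_refund_prompt(tax_profile: dict) -> str:
    """Fill the static template with figures from the tax engine."""
    return REFUND_SUMMARY_TEMPLATE.format(
        refund=tax_profile["refund"],
        withholding=tax_profile["withholding"],
        standard_deduction=tax_profile["standard_deduction"],
        credits=tax_profile["credits"],
    )

profile = {"refund": 1200, "withholding": 5400,
           "standard_deduction": 14600, "credits": 2000}
prompt = build_refund_prompt(profile)
```

Because the wording is fixed, tax experts can review and test it once, and only the injected numbers vary per user.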
Now, a dynamic query is when the user has questions about their tax situation. You know, "Can I deduct my dog?" Well, you can't, but you can try. Things like that are what we try to answer more dynamically. OpenAI's GPT-4o mini had been the model of choice there until a few months ago; we're now iterating on the newer versions. Of course, models change every year, every month I should say, so we're trying to keep up with that, and the same goes for the dynamic piece. Another important aspect is tax information. The IRS changes forms every year, and Intuit has proprietary tax information and tax engines that we want to use. So we have RAG-based and, of course, graph-RAG-based solutions around those as well, which help us answer users' questions much better. One thing we also piloted recently was a fine-tuned LLM. We went with Claude, because that's the primary model we use there, stuck to static queries, and tested it out. It does well; the quality is definitely there, though it takes effort to fine-tune the model, and we found it was a little too specialized to the specific use case. One thing I want to highlight, and will dive into further, is evals. You want to evaluate everything you do. You want to know what's happening in production, and you want to make sure that in the development life cycle you're doing everything you need to do to have the best prompts out there. And with that, moving on to the next slide. To summarize a little, these are the key pillars that we have. I've already spoken about some of them; what I want to highlight on this slide is the bottom part, the human domain experts. Intuit has a lot of tax analysts who work with us, decoding IRS changes year over year, making changes, and so on.
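The dynamic, retrieval-grounded path described above can be sketched with a toy retriever. The word-overlap scoring below is a stand-in for real vector or graph-RAG retrieval over Intuit's tax content; all snippets and prompt wording are illustrative.

```python
import re

# Toy RAG sketch: retrieve the most relevant tax snippet for a free-form
# question, then ground the LLM prompt in it. Corpus and scoring are
# simplified placeholders for a production retriever.

TAX_SNIPPETS = [
    "Pet expenses are generally not deductible unless the animal is a certified service animal.",
    "Tuition paid for a dependent may qualify for education credits.",
    "Mortgage interest on a primary residence is deductible if you itemize.",
]

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank snippets by word overlap with the question (stand-in for
    embedding similarity or graph traversal)."""
    scored = sorted(corpus,
                    key=lambda s: len(_words(question) & _words(s)),
                    reverse=True)
    return scored[:k]

def build_grounded_prompt(question: str) -> str:
    context = "\n".join(retrieve(question, TAX_SNIPPETS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_grounded_prompt("Are pet expenses deductible?")
```

Grounding the answer in retrieved authoritative content, rather than the model's parametric memory, is what lets the system track yearly IRS changes by updating the corpus instead of the model.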
So they are the experts who provide us the information and make sure the evaluations are correctly done. We have a phased evaluation system: manual evaluations early in the development life cycle. Another thing we have done is actually use the tax analysts as the prompt engineers. That lets the folks in the data science and ML world focus on quality: defining the metrics and making sure we have a good dataset we can iterate and test on. As we go along, as I said, models change and we want to try out different models; the laws change too, so from IRS tax year 2023 to 2024, what happened? We focus on those changes. The human experts bring their expertise, help with prompt engineering, and get the initial evaluations done, which then become the basis for automated evaluations. LLM-as-a-judge is what we use; I'm going to talk a little more about that. Going back to what I was saying earlier about Claude 3 Haiku and fine-tuning: as part of GenOS we built out a lot of tool sets, and one more thing we want to support is fine-tuning. For our use case we stuck to fine-tuning Claude 3 Haiku, powered by AWS Bedrock. The goal was to see if we could actually improve the quality of responses. The biggest driver, of course, is that fewer instructions are needed once you have fine-tuned the model. Latencies are a big concern, so we want to see if we can squeeze down the prompt size while keeping the quality we need. This is roughly what it looks like: we have different test AWS accounts and different environments provided by the platform teams we work with, we look at the data, and we adhere to regulations, the Section 7216 regulations.
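Fine-tuning via Bedrock model customization, as described above, starts with preparing JSONL training records. A minimal sketch, assuming a simple prompt/completion record shape; the field names and example content are illustrative, not Intuit's dataset or exact job configuration.

```python
import json

# Sketch of preparing fine-tuning data for a Bedrock model-customization
# job (the talk fine-tunes Claude 3 Haiku via AWS Bedrock). All content
# below is invented for illustration.

def to_jsonl(examples: list[dict]) -> str:
    """Serialize expert-reviewed Q/A pairs as one JSON record per line."""
    lines = []
    for ex in examples:
        record = {"prompt": ex["question"], "completion": ex["expert_answer"]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

examples = [
    {"question": "Explain how my refund was computed.",
     "expert_answer": "Your withholding exceeded your total tax, so the difference is refunded."},
]
training_file = to_jsonl(examples)
```

In a setup like this, the resulting file would be uploaded to S3 and referenced from a customization job, with only consented user data ever feeding the pipeline, per the regulatory point in the talk.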
So we only use consented data from users and make sure we're on the right side of that. And just to double down on the evaluation part: you want to evaluate everything. The key pillars are accuracy, relevancy, and coherence. We have both manual and automated systems, and we also have production monitoring; the automated systems sample what the LLM is actually giving real users in real time. For the tooling we've built out, LLM-as-a-judge comes in on the automated side. We've also developed some in-house tooling to do automated prompt tuning, which really helps us update our LLM-as-a-judge. An LLM-as-a-judge operates on top of a prompt; it needs different information, including some manual samples that form the golden dataset. We use AWS Ground Truth for that and build on it. One more thing I want to highlight here is models. We made the move from Anthropic Claude Instant to Anthropic Claude Haiku for the next year, tax year 2024, and that takes some effort. The only way it's possible is because we have clear evals in place, so we can test out whatever we are changing; model changes are not as smooth as you would think. These are some more details on the automated evals. As you can see, the key output is tax accuracy; that's the main thing we aim for and focus on. I'm going to move on here. Let's talk about some major learnings. The contracts are really expensive, and the only way they get slightly cheaper is with long-term contracts, so you are tied in to the vendor.
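The phased setup described above, expert-labeled golden data feeding an automated LLM-as-a-judge, can be sketched as follows. `call_judge` is a stub standing in for a real judge-model call (the talk mentions the GPT-4 series); the scoring is faked with word overlap purely so the sketch runs, and the rubric fields mirror the accuracy/relevancy/coherence pillars.

```python
# Sketch of automated evaluation against an expert-built golden dataset.
# All data and the judge logic are illustrative stand-ins.

GOLDEN_SET = [
    {"question": "Why did my refund go up?",
     "reference": "Your refund rose because your W-2 withholding increased."},
]

def call_judge(question: str, reference: str, candidate: str) -> dict:
    """Stub judge: a real system would prompt an LLM with a rubric and the
    reference answer. Here, word overlap fakes a 1-5 score."""
    overlap = len(set(reference.lower().split()) & set(candidate.lower().split()))
    score = min(5, 1 + overlap)
    return {"accuracy": score, "relevancy": score, "coherence": 5}

def average_accuracy(candidates: list[str]) -> float:
    """Average judge accuracy over the golden set; launches could be
    gated on a threshold of this number."""
    scores = [call_judge(g["question"], g["reference"], c)["accuracy"]
              for g, c in zip(GOLDEN_SET, candidates)]
    return sum(scores) / len(scores)

avg = average_accuracy(["Your refund rose because your W-2 withholding increased."])
```

The point of the phasing is that the expensive expert labels are collected once per baseline, while the cheap automated judge runs on every prompt tweak and model swap.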
So it helps to have strong partners on the vendor side who work with you to iterate and improve. I think I was at this conference last year, and this was called out then as well: vendors are a form of lock-in, and the prompts are a form of lock-in. It's not easy, and we found it's not even easy to upgrade to the next model from the same vendor going into the next year. So we want to focus on that. Another thing I really want to highlight is latency. LLMs, of course, don't have the SLAs of backend services. We're not looking at 100 or 200 milliseconds; we're talking about 3 seconds, 5 seconds, 10 seconds. As the user's tax information comes in, maybe they have a complicated situation like mine: they own a home, they have something in stocks, they're filing with a spouse who has a job as well, a lot going on. So the prompts really balloon up if you're trying to explain the outcome. And everybody's trying to file on Tax Day, right, April 15, so latency really shoots through the roof. We design the product around that: we make sure we have the right fallback mechanisms and the right user and product design so that the experience is seamless and useful. We want the explanations to be helpful more than anything else. I think I've covered the other points, but once again, I cannot say this enough: evals are a must to launch. Focus on evals. Make sure you have clear guidelines on what you're building and a clear golden dataset. I've heard that from other talks as well; it's really a key point. That's all; I'm going to pause here for questions. If you're going to ask questions, please come to one of the microphones so we can capture the audio. Yeah. Hi. You said evaluate everything, right?
But with GenAI systems there can be very small changes, right? You make a small change to a prompt, and evaluations can get very expensive or slow down your whole development process. So could you dive a little deeper into when you bring in different types of evaluations? Is there anything where you just say, "We ran some regression tests and it looks fine," and you launch, or do you always go to the experts? Sure, thank you for the question. To reiterate, the evaluations are of different types. In the initial phase of development we lean on manual evaluations with tax experts so we can get a baseline in place. Then, as we are tweaking different things in the prompts, auto-evaluation comes in: we take the input from the tax experts and use it to build a judge prompt for the LLM. That judge LLM is, once again, expensive; we went with the GPT-4 series until recently for that. Minor iterations we can do with auto-eval, and we have a clear understanding with product that the quality bar has to be there. For major changes, for example going from tax year 2023 to tax year 2024, we definitely re-evaluate: if the prompt changes a lot, we go back to manual evaluations. Thank you for the technical deep dive. I was more interested in the product side of it. We also do taxes, so I was curious what kinds of LLM interactions the users are having. What kinds of questions are they asking? Is it more the critical parts of the workflow, or more...? So, we have question answering for all types of questions. That includes product questions, you know, "How do I do this in TurboTax?", and also their tax situation, for example, "I paid the tuition for my grandchild; can I claim that on my taxes?" Things like that.
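Returning to the latency point from the talk: one way to implement the fallback mechanism mentioned there is a hard deadline on the LLM call, with a pre-written, tax-engine-backed summary served when the deadline is missed. The timeout values, messages, and simulated slow call below are all hypothetical.

```python
import concurrent.futures
import time

# Latency-fallback sketch: race the LLM call against a deadline.
# All values are illustrative, not production settings.

FALLBACK = ("Your numbers are ready below; a detailed explanation "
            "is temporarily unavailable.")

def slow_llm_call() -> str:
    """Stand-in for a multi-second LLM call under Tax Day load."""
    time.sleep(0.5)
    return "Detailed, personalized explanation of your refund."

def explain_with_fallback(timeout_s: float) -> str:
    """Return the LLM answer if it beats the deadline, else the fallback."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_llm_call)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return FALLBACK

answer = explain_with_fallback(timeout_s=0.05)
```

Because the numeric answer comes from the tax engine anyway, the fallback can still show correct figures; only the generated explanation degrades.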
So our goal, with different teams going after different pieces, is to answer all of these questions, and different types of questions need different solutions. That's where, let me go back to this slide, the planner comes in. When the query comes in, we want to understand what the user is trying to ask, and then we have different kinds of solutions for different kinds of questions. Thank you. Yeah. Hi. You mentioned evaluation, so one quick question: TurboTax, I'm sure, involves a lot of numbers in the answers. How do you verify those numbers in the evaluation? Say the actual tax number is 11,235 and the answer says something like 11,100; it's quite difficult to catch that with a manual evaluation. Thank you for the question; that's a key thing we work on. TurboTax, of course, has a tax knowledge engine that we built in house, proprietary, managed and developed over the years, and that's really what provides these numbers. The tax profile information all comes from it; we are not having LLMs do the calculations at all. We're using the ground truth that already exists in our systems for the numbers you see. And we have safety guardrails; maybe this piece here I would call out. We have a lot of safety guardrails on the raw LLM response, to make sure we are not hallucinating numbers before we send it to the user. Got it. So the data is coming from the tax engine itself, but when you formulate the final explanation, the answer itself, how do you make sure the numbers actually in the final answer are, you know, coming from that data?
So basically, we have ML models working under the hood, as part of the security aspect you see here, that make sure we did not hallucinate any numbers along the way. Got it. Yeah. Thank you. Could you give an overview of how you use both traditional RAG and graph RAG as a hybrid in your workflow? Sure. And sorry, one more question: now with the new Claude 4 models coming out, do you think fine-tuning might get easier, or even still be needed? I'll take the first one. With graph RAG we've definitely seen better response quality. Even more than that, though, for end-user helpfulness, getting a personalized answer is the key piece. I would say graph RAG definitely outperforms regular RAG, and what outperforms even more is personalizing the answers. To your second question, we are constantly evaluating models. This is really the time, you know, April is just behind us, when we look at what new things we can do. We also have some in-house models that Intuit trains and develops, so we are constantly evaluating. I don't have an answer now on what we'll do for the next tax year, but yes, we keep working on that. You mentioned you have different tax situations and you come up with an answer. If I describe my situation and it's complicated and it comes up with an answer, is that answer being generated by the LLM, or is it going back to the tax engine? How do you explain how you came up with that answer? And I assume there are going to be a lot of legal challenges to a wrong answer. Absolutely. Intuit focuses heavily on legal and privacy controls. The solution we worked on here is specifically the static variety of questions.
So, once again, as I was saying earlier, the underlying numbers come from the tax knowledge engine, and we have tax experts who actually crafted these prompts. They are specifically tested for each piece you see here, so when we do the evals, we make sure what you're suggesting doesn't happen. Okay, great. Thanks. Thank you so much. What a great talk.
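As a closing sketch of the numeric guardrail discussed in the Q&A: every dollar figure in a draft LLM answer must already appear in the tax engine's ground truth, or the answer is rejected. The talk says ML models do this under the hood; the regex-based check below is an illustrative assumption, not the production mechanism.

```python
import re

# Toy numeric guardrail: reject any draft answer containing a dollar
# amount the tax engine never produced. Purely illustrative.

AMOUNT_RE = re.compile(r"\$[0-9][0-9,]*(?:\.[0-9]{2})?")

def extract_amounts(text: str) -> set[str]:
    """Pull dollar figures like $1,200 or $11,235.00 out of text."""
    return set(AMOUNT_RE.findall(text))

def passes_number_guardrail(answer: str, engine_facts: str) -> bool:
    """True only if every amount in the answer exists in the ground truth."""
    return extract_amounts(answer) <= extract_amounts(engine_facts)

facts = "Refund: $1,200. Withholding: $5,400."
ok = passes_number_guardrail("Your refund is $1,200.", facts)
caught = passes_number_guardrail("Your refund is $1,250.", facts)
```

This is exactly the kind of check that catches the questioner's 11,235-vs-11,100 case mechanically, without relying on a human reviewer to spot the discrepancy.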