How Intuit uses LLMs to explain taxes to millions of taxpayers - Jaspreet Singh, Intuit
Channel: aiDotEngineer
Published at: 2025-07-23
YouTube video id: _zl_zimMRak
Source: https://www.youtube.com/watch?v=_zl_zimMRak
Hi, I'm Jaspreet. I'm a senior staff engineer at Intuit, where I work on GenAI for TurboTax, and today we'll be talking about how we use LLMs at Intuit to help you understand your taxes better. Just to understand the scale: TurboTax successfully processed 44 million tax returns for tax year 2023, and that's really the scale we're going for. We want everybody to have high confidence in how their taxes are filed, to understand them, and to know they are getting the best deductions they can. So this is the experience we work on. You go into TurboTax, you enter your information, then you go through what credits you are eligible for, and so on. We help you expand on how you are getting the tax breaks you're getting and understand them better. And this is another example: the overall tax outcome, your overall refund for the year. Now, Intuit's GenAI experiences are built on top of our proprietary GenOS, the generative OS we have built as a platform capability, and it has a lot of different pieces that you see here. The key motivation is that we found a lot of the GenAI tooling that comes out of the box doesn't support all our use cases. Most prominently, working in tax means we are in a regulated business, so safety and security are very, very important, and we want to build something a company at Intuit's scale can use end to end, at really large scale. That's where GenOS comes in. There are different pieces: on the UI side there's GenUX, and then there's the orchestrator. That's the piece where, with different teams working on different components and different LLM solutions, you find the right solution to answer the right question. Intuit calls the entire experience we power through this Intuit Assist.
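The orchestrator's job of finding "the right solution to answer the right question" can be sketched as a simple dispatcher. Everything here is a hypothetical stand-in, not Intuit's actual GenOS orchestrator: a real router would likely use an intent classifier rather than these toy rules, and all category names are invented for illustration.

```python
# Toy routing sketch: decide which kind of solution should answer a query.
# Categories and rules are illustrative assumptions only.

def route(query: dict) -> str:
    """Dispatch a query to a solution type based on its shape."""
    if query.get("screen") == "refund_summary" and not query.get("text"):
        # Known UI context, no free-form text: a prepared "static" prompt.
        return "static_prompt"
    if query.get("text"):
        # Free-form user question: retrieval-grounded Q&A.
        return "rag_pipeline"
    return "clarify_with_user"

decision = route({"screen": "refund_summary"})
```

In a platform like the one described, each returned label would map to a component owned by a different team.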
So I'm going to deep dive into the specific pieces our team used to build out the experience for TurboTax. As I said earlier, we have millions and millions of customers coming in, so we're trying to build a scalable solution that works end to end. On this slide I'm going to talk about the different pieces powering the experience. The first iteration was prompt tooling: a prompt-based solution to walk through what's going on in your tax situation. Take the example I was showing earlier, your tax refund. Your tax refund has many constituents: your deductions, your credits, the standard deduction, W-2 withholding, and so on. We want to make sure you understand all of that, so we built a prompt-based solution around it and worked from there. The production model we went with for this use case is Claude. Intuit is one of the biggest users of Claude; we had a multi-million-dollar contract for this year as well. You'll also see OpenAI over there; that's what we used for other question answering. On the slide we distinguish static and dynamic queries. A static query would be what I was showing earlier: we know you are looking at your summary and want to see what happened overall, so that's a static prompt. Think of it like a prepared statement; the additional information we gather is the tax info when the user comes in.
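The "prepared statement" analogy can be made concrete: a static prompt is a fixed, expert-reviewed template whose only variable parts are figures filled in from the user's tax data. The template text, field names, and numbers below are all hypothetical, not Intuit's actual prompt.

```python
# Hypothetical static prompt: a fixed template (like a prepared statement)
# filled with figures supplied by the tax engine, never computed by the LLM.

REFUND_SUMMARY_TEMPLATE = """You are a tax explainer. Using ONLY the figures
provided below, explain to the filer how their federal refund was computed.
Do not invent or recompute any numbers.

Refund: ${refund}
Total W-2 withholding: ${withholding}
Standard deduction: ${standard_deduction}
Credits applied: ${credits}
"""

def build_refund_prompt(tax_profile: dict) -> str:
    """Fill the static template with figures from the tax engine."""
    return REFUND_SUMMARY_TEMPLATE.format(
        refund=tax_profile["refund"],
        withholding=tax_profile["withholding"],
        standard_deduction=tax_profile["standard_deduction"],
        credits=tax_profile["credits"],
    )

profile = {"refund": 1200, "withholding": 5400,
           "standard_deduction": 14600, "credits": 2000}
prompt = build_refund_prompt(profile)
```

Because the wording is fixed, tax experts can review and test it once, and only the injected numbers vary per user.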
Now, a dynamic query is when the user has questions about their tax situation. You know, "Can I deduct my dog?" Well, you can't, but you can try. Things like that are what we try to answer more dynamically. OpenAI's GPT-4o mini had been the model of choice there until a few months ago; we're now iterating on the newer versions. Of course, models change every year, every month I should say, so we're trying to keep up with that, and the same goes for the dynamic piece. Another important aspect is tax information. The IRS changes forms every year, and Intuit has proprietary tax information and tax engines that we want to use. So we have RAG-based and, of course, graph-RAG-based solutions around those as well, which help us answer users' questions much better. One thing we also piloted recently was a fine-tuned LLM. We went with Claude, because that's the primary model we use there, stuck to static queries, and tested it out. It does well; the quality is definitely there, though it takes effort to fine-tune the model, and we found it was a little too specialized to the specific use case. One thing I want to highlight, and will dive into further, is evals. You want to evaluate everything you do. You want to know what's happening in production, and you want to make sure that in the development life cycle you're doing everything you need to do to have the best prompts out there. And with that, moving on to the next slide. To summarize a little, these are the key pillars that we have. I've already spoken about some of them; what I want to highlight on this slide is the bottom part, the human domain experts. Intuit has a lot of tax analysts who work with us, decoding IRS changes year over year, making changes, and so on.
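The dynamic, retrieval-grounded path described above can be sketched with a toy retriever. The word-overlap scoring below is a stand-in for real vector or graph-RAG retrieval over Intuit's tax content; all snippets and prompt wording are illustrative.

```python
import re

# Toy RAG sketch: retrieve the most relevant tax snippet for a free-form
# question, then ground the LLM prompt in it. Corpus and scoring are
# simplified placeholders for a production retriever.

TAX_SNIPPETS = [
    "Pet expenses are generally not deductible unless the animal is a certified service animal.",
    "Tuition paid for a dependent may qualify for education credits.",
    "Mortgage interest on a primary residence is deductible if you itemize.",
]

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank snippets by word overlap with the question (stand-in for
    embedding similarity or graph traversal)."""
    scored = sorted(corpus,
                    key=lambda s: len(_words(question) & _words(s)),
                    reverse=True)
    return scored[:k]

def build_grounded_prompt(question: str) -> str:
    context = "\n".join(retrieve(question, TAX_SNIPPETS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_grounded_prompt("Are pet expenses deductible?")
```

Grounding the answer in retrieved authoritative content, rather than the model's parametric memory, is what lets the system track yearly IRS changes by updating the corpus instead of the model.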
So they are the experts who provide us the information and make sure the evaluations are correctly done. We have a phased evaluation system: manual evaluations early in the development life cycle. Another thing we have done is actually use the tax analysts as the prompt engineers. That lets the folks in the data science and ML world focus on quality: defining the metrics and making sure we have a good dataset we can iterate and test on. As we go along, as I said, models change and we want to try out different models; the laws change too, so from IRS tax year 2023 to 2024, what happened? We focus on those changes. The human experts bring their expertise, help with prompt engineering, and get the initial evaluations done, which then become the basis for automated evaluations. LLM-as-a-judge is what we use; I'm going to talk a little more about that. Going back to what I was saying earlier about Claude 3 Haiku and fine-tuning: as part of GenOS we built out a lot of tool sets, and one more thing we want to support is fine-tuning. For our use case we stuck to fine-tuning Claude 3 Haiku, powered by AWS Bedrock. The goal was to see if we could actually improve the quality of responses. The biggest driver, of course, is that fewer instructions are needed once you have fine-tuned the model. Latencies are a big concern, so we want to see if we can squeeze down the prompt size while keeping the quality we need. This is roughly what it looks like: we have different test AWS accounts and different environments provided by the platform teams we work with, we look at the data, and we adhere to regulations, the Section 7216 regulations.
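Fine-tuning via Bedrock model customization, as described above, starts with preparing JSONL training records. A minimal sketch, assuming a simple prompt/completion record shape; the field names and example content are illustrative, not Intuit's dataset or exact job configuration.

```python
import json

# Sketch of preparing fine-tuning data for a Bedrock model-customization
# job (the talk fine-tunes Claude 3 Haiku via AWS Bedrock). All content
# below is invented for illustration.

def to_jsonl(examples: list[dict]) -> str:
    """Serialize expert-reviewed Q/A pairs as one JSON record per line."""
    lines = []
    for ex in examples:
        record = {"prompt": ex["question"], "completion": ex["expert_answer"]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

examples = [
    {"question": "Explain how my refund was computed.",
     "expert_answer": "Your withholding exceeded your total tax, so the difference is refunded."},
]
training_file = to_jsonl(examples)
```

In a setup like this, the resulting file would be uploaded to S3 and referenced from a customization job, with only consented user data ever feeding the pipeline, per the regulatory point in the talk.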
So we only use consented data from users and make sure we're on the right side of that. And just to double down on the evaluation part: you want to evaluate everything. The key pillars are accuracy, relevancy, and coherence. We have both manual and automated systems, and we also have production monitoring; the automated systems sample what the LLM is actually giving real users in real time. For the tooling we've built out, LLM-as-a-judge comes in on the automated side. We've also developed some in-house tooling to do automated prompt tuning, which really helps us update our LLM-as-a-judge. An LLM-as-a-judge operates on top of a prompt; it needs different information, including some manual samples that form the golden dataset. We use AWS Ground Truth for that and build on it. One more thing I want to highlight here is models. We made the move from Anthropic Claude Instant to Anthropic Claude Haiku for the next year, tax year 2024, and that takes some effort. The only way it's possible is because we have clear evals in place, so we can test out whatever we are changing; model changes are not as smooth as you would think. These are some more details on the automated evals. As you can see, the key output is tax accuracy; that's the main thing we aim for and focus on. I'm going to move on here. Let's talk about some major learnings. The contracts are really expensive, and the only way they get slightly cheaper is with long-term contracts, so you are tied in to the vendor.
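The phased setup described above, expert-labeled golden data feeding an automated LLM-as-a-judge, can be sketched as follows. `call_judge` is a stub standing in for a real judge-model call (the talk mentions the GPT-4 series); the scoring is faked with word overlap purely so the sketch runs, and the rubric fields mirror the accuracy/relevancy/coherence pillars.

```python
# Sketch of automated evaluation against an expert-built golden dataset.
# All data and the judge logic are illustrative stand-ins.

GOLDEN_SET = [
    {"question": "Why did my refund go up?",
     "reference": "Your refund rose because your W-2 withholding increased."},
]

def call_judge(question: str, reference: str, candidate: str) -> dict:
    """Stub judge: a real system would prompt an LLM with a rubric and the
    reference answer. Here, word overlap fakes a 1-5 score."""
    overlap = len(set(reference.lower().split()) & set(candidate.lower().split()))
    score = min(5, 1 + overlap)
    return {"accuracy": score, "relevancy": score, "coherence": 5}

def average_accuracy(candidates: list[str]) -> float:
    """Average judge accuracy over the golden set; launches could be
    gated on a threshold of this number."""
    scores = [call_judge(g["question"], g["reference"], c)["accuracy"]
              for g, c in zip(GOLDEN_SET, candidates)]
    return sum(scores) / len(scores)

avg = average_accuracy(["Your refund rose because your W-2 withholding increased."])
```

The point of the phasing is that the expensive expert labels are collected once per baseline, while the cheap automated judge runs on every prompt tweak and model swap.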
So it helps to have strong partners on the vendor side who work with you to iterate and improve. I think I was at this conference last year, and this was called out then as well: vendors are a form of lock-in, and the prompts are a form of lock-in. It's not easy, and we found it's not even easy to upgrade to the next model from the same vendor going into the next year. So we want to focus on that. Another thing I really want to highlight is latency. LLMs, of course, don't have the SLAs of backend services. We're not looking at 100 or 200 milliseconds; we're talking about 3 seconds, 5 seconds, 10 seconds. As the user's tax information comes in, maybe they have a complicated situation like mine: they own a home, they have something in stocks, they're filing with a spouse who has a job as well, a lot going on. So the prompts really balloon up if you're trying to explain the outcome. And everybody's trying to file on Tax Day, right, April 15, so latency really shoots through the roof. We design the product around that: we make sure we have the right fallback mechanisms and the right user and product design so that the experience is seamless and useful. We want the explanations to be helpful more than anything else. I think I've covered the other points, but once again, I cannot say this enough: evals are a must to launch. Focus on evals. Make sure you have clear guidelines on what you're building and a clear golden dataset. I've heard that from other talks as well; it's really a key point. That's all; I'm going to pause here for questions. If you're going to ask questions, please come to one of the microphones so we can capture the audio. Yeah. Hi. You said evaluate everything, right?
But with GenAI systems there can be very small changes, right? You make a small change to a prompt, and evaluations can get very expensive or slow down your whole development process. So could you dive a little deeper into when you bring in different types of evaluations? Is there anything where you just say, "We ran some regression tests and it looks fine," and you launch, or do you always go to the experts? Sure, thank you for the question. To reiterate, the evaluations are of different types. In the initial phase of development we lean on manual evaluations with tax experts so we can get a baseline in place. Then, as we are tweaking different things in the prompts, auto-evaluation comes in: we take the input from the tax experts and use it to build a judge prompt for the LLM. That judge LLM is, once again, expensive; we went with the GPT-4 series until recently for that. Minor iterations we can do with auto-eval, and we have a clear understanding with product that the quality bar has to be there. For major changes, for example going from tax year 2023 to tax year 2024, we definitely re-evaluate: if the prompt changes a lot, we go back to manual evaluations. Thank you for the technical deep dive. I was more interested in the product side of it. We also do taxes, so I was curious what kinds of LLM interactions the users are having. What kinds of questions are they asking? Is it more the critical parts of the workflow, or more...? So, we have question answering for all types of questions. That includes product questions, you know, "How do I do this in TurboTax?", and also their tax situation, for example, "I paid the tuition for my grandchild; can I claim that on my taxes?" Things like that.
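Returning to the latency point from the talk: one way to implement the fallback mechanism mentioned there is a hard deadline on the LLM call, with a pre-written, tax-engine-backed summary served when the deadline is missed. The timeout values, messages, and simulated slow call below are all hypothetical.

```python
import concurrent.futures
import time

# Latency-fallback sketch: race the LLM call against a deadline.
# All values are illustrative, not production settings.

FALLBACK = ("Your numbers are ready below; a detailed explanation "
            "is temporarily unavailable.")

def slow_llm_call() -> str:
    """Stand-in for a multi-second LLM call under Tax Day load."""
    time.sleep(0.5)
    return "Detailed, personalized explanation of your refund."

def explain_with_fallback(timeout_s: float) -> str:
    """Return the LLM answer if it beats the deadline, else the fallback."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_llm_call)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return FALLBACK

answer = explain_with_fallback(timeout_s=0.05)
```

Because the numeric answer comes from the tax engine anyway, the fallback can still show correct figures; only the generated explanation degrades.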
So our goal, with different teams going after different pieces, is to answer all of these questions, and different types of questions need different solutions. That's where, let me go back to this slide, the planner comes in. When the query comes in, we want to understand what the user is trying to ask, and then we have different kinds of solutions for different kinds of questions. Thank you. Yeah. Hi. You mentioned evaluation, so one quick question: TurboTax, I'm sure, involves a lot of numbers in the answers. How do you verify those numbers in the evaluation? Say the actual tax number is 11,235 and the answer says something like 11,100; it's quite difficult to catch that with a manual evaluation. Thank you for the question; that's a key thing we work on. TurboTax, of course, has a tax knowledge engine that we built in house, proprietary, managed and developed over the years, and that's really what provides these numbers. The tax profile information all comes from it; we are not having LLMs do the calculations at all. We're using the ground truth that already exists in our systems for the numbers you see. And we have safety guardrails; maybe this piece here I would call out. We have a lot of safety guardrails on the raw LLM response, to make sure we are not hallucinating numbers before we send it to the user. Got it. So the data is coming from the tax engine itself, but when you formulate the final explanation, the answer itself, how do you make sure the numbers actually in the final answer are, you know, coming from that data?
So basically, we have ML models working under the hood, as part of the security aspect you see here, that make sure we did not hallucinate any numbers along the way. Got it. Yeah. Thank you. Could you give an overview of how you use both traditional RAG and graph RAG as a hybrid in your workflow? Sure. And sorry, one more question: now with the new Claude 4 models coming out, do you think fine-tuning might get easier, or even still be needed? I'll take the first one. With graph RAG we've definitely seen better response quality. Even more than that, though, for end-user helpfulness, getting a personalized answer is the key piece. I would say graph RAG definitely outperforms regular RAG, and what outperforms even more is personalizing the answers. To your second question, we are constantly evaluating models. This is really the time, you know, April is just behind us, when we look at what new things we can do. We also have some in-house models that Intuit trains and develops, so we are constantly evaluating. I don't have an answer now on what we'll do for the next tax year, but yes, we keep working on that. You mentioned you have different tax situations and you come up with an answer. If I describe my situation and it's complicated and it comes up with an answer, is that answer being generated by the LLM, or is it going back to the tax engine? How do you explain how you came up with that answer? And I assume there are going to be a lot of legal challenges to a wrong answer. Absolutely. Intuit focuses heavily on legal and privacy controls. The solution we worked on here is specifically the static variety of questions.
So, once again, as I was saying earlier, the underlying numbers come from the tax knowledge engine, and we have tax experts who actually crafted these prompts. They are specifically tested for each piece you see here, so when we do the evals, we make sure what you're suggesting doesn't happen. Okay, great. Thanks. Thank you so much. What a great talk.
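As a closing sketch of the numeric guardrail discussed in the Q&A: every dollar figure in a draft LLM answer must already appear in the tax engine's ground truth, or the answer is rejected. The talk says ML models do this under the hood; the regex-based check below is an illustrative assumption, not the production mechanism.

```python
import re

# Toy numeric guardrail: reject any draft answer containing a dollar
# amount the tax engine never produced. Purely illustrative.

AMOUNT_RE = re.compile(r"\$[0-9][0-9,]*(?:\.[0-9]{2})?")

def extract_amounts(text: str) -> set[str]:
    """Pull dollar figures like $1,200 or $11,235.00 out of text."""
    return set(AMOUNT_RE.findall(text))

def passes_number_guardrail(answer: str, engine_facts: str) -> bool:
    """True only if every amount in the answer exists in the ground truth."""
    return extract_amounts(answer) <= extract_amounts(engine_facts)

facts = "Refund: $1,200. Withholding: $5,400."
ok = passes_number_guardrail("Your refund is $1,200.", facts)
caught = passes_number_guardrail("Your refund is $1,250.", facts)
```

This is exactly the kind of check that catches the questioner's 11,235-vs-11,100 case mechanically, without relying on a human reviewer to spot the discrepancy.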