Transforming search and discovery using LLMs — Tejaswi & Vinesh, Instacart

Channel: aiDotEngineer

Published at: 2025-07-16

YouTube video id: PjaVHm_3Ljg

Source: https://www.youtube.com/watch?v=PjaVHm_3Ljg

Hi, good afternoon everyone. My name is Vinesh, and together with Tejaswi we're part of the search and machine learning team at Instacart. Today we'd like to talk to you about how we're using LLMs to transform our search and discovery.
First, a little bit about ourselves. As I mentioned, we're part of the search and discovery ML team at Instacart. For those of you who may not be familiar with Instacart, it's the leader in online grocery in North America, and our mission is to create a world where everyone has access to the food they love and more time to enjoy it together.
As for what we'll cover today: first we'll talk about the importance of search in grocery e-commerce, then look at some of the challenges facing conventional search engines, and then get to the meat of the talk, which is how we're using LLMs to solve some of these problems. Finally, we'll finish with some key takeaways.
On the importance of search in grocery e-commerce: I think we've all gone grocery shopping. Customers come with long shopping lists, and it's the same on the platform. People are looking for tens of items. A majority of these are restocking purchases, that is, things the customer has bought in the past, and the remainder are items the user is trying for the first time. A majority of these purchases come through search. So search has a dual role: it needs to help the customer quickly and efficiently find the product they're looking for, and it also needs to enable new product discovery. New product discovery isn't just important for the customer. It's also great for our advertisers, because it helps them showcase new products, and it's good for the platform, because overall it encourages larger basket sizes.
So let's look at what makes this hard in our existing setup. To begin with, we have two classes of queries that are generally more challenging, especially from an e-commerce perspective. The first is overly broad queries, like the "snacks" query on the left, where tons of products map to the query. Because our models are trained on engagement data, if we aren't exposing these products to the user, it's hard to collect the engagement data needed to rank them highly. It's the traditional cold-start problem, in a way.
Then, as you can see with the query on the right, we have very specific queries like "unsweetened plant-based yogurt", where the user is looking for something very specific. These queries don't happen very frequently, which means we just don't have enough engagement data to train the models on. And while we've done quite a bit of work to improve this, the challenge we kept facing is that while recall improves, precision is still a challenge, especially in a pre-LLM world.
The next class of problems is how we actually support new item discovery, as we spoke about. When a customer walks into a grocery store, say into the pasta aisle, they might see new brands of pasta they'd want to try. Along with that, they'd also see pasta sauce and everything else needed to make a bowl of pasta. Customers want a similar experience on our site. We've heard multiple rounds of feedback from customers along the lines of: "I can find the product I want via search, but when I'm trying to find any related products, it's a bit of a dead end. I'd need to make multiple searches to get where I want." So this was a problem we wanted to solve as well, and, as I mentioned, pre-LLM this was a hard problem because of the lack of engagement data. So let's see how we actually used LLMs to solve these problems. I'll talk specifically about how we used LLMs to uplevel our query understanding module.
Query understanding, as I'm sure most of you know, is the most upstream part of the search stack, and very accurate outputs from it are needed to enable better retrieval and recall and ultimately improve our ranking results.
Our query understanding module contains multiple models, like query normalization, query tagging, query classification, category classification, and so on. In the interest of time, I'll pick a couple of models and talk about how we really improved them.

The first is our query-to-product-category classifier. Essentially, we take a query and map it to a category in our taxonomy. As an example, a query like "watermelon" maps to categories like fruits, organic fruits, and so on. Our taxonomy has about 10,000 labels, of which about 6,000 are commonly used, and because a query can map to multiple labels, this is essentially a multilabel classification problem.

In the past we had a couple of different traditional models. One was a FastText-based neural network, which essentially modeled the semantic relationship between the query and the category. As a fallback, we had an NPMI model, a statistical co-occurrence model between the query and the category. While these techniques worked great for head and torso queries, we had really low coverage on tail queries, because, again, we just didn't have enough engagement data to train the models on. To be honest, we also tried more sophisticated BERT-based models, and while we did see some improvement, the lack of engagement data meant that, for the increased latency, we didn't see the wins we'd hoped for.
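For context, NPMI here is normalized pointwise mutual information. A minimal sketch of such a query-category co-occurrence score, with made-up counts (the actual Instacart model and data are not public), might look like this:

```python
import math
from collections import Counter

def npmi(query: str, category: str,
         pair_counts: Counter, query_counts: Counter,
         category_counts: Counter, total: int) -> float:
    """Normalized pointwise mutual information between a query and a
    category, estimated from engagement co-occurrence counts.
    Ranges over [-1, 1]; higher means a stronger association."""
    p_qc = pair_counts[(query, category)] / total
    if p_qc == 0:
        return -1.0  # never co-occurred in the logs
    p_q = query_counts[query] / total
    p_c = category_counts[category] / total
    pmi = math.log(p_qc / (p_q * p_c))
    return pmi / -math.log(p_qc)

# Toy illustration with hypothetical counts:
pairs = Counter({("watermelon", "fruits"): 80, ("watermelon", "snacks"): 5})
queries = Counter({"watermelon": 100})
categories = Counter({"fruits": 400, "snacks": 900})
print(npmi("watermelon", "fruits", pairs, queries, categories, total=10_000))
```

A statistical score like this works well where counts are plentiful (head and torso queries) and degrades exactly where the talk says it did: on the tail, where counts are near zero.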
So this is where we tried an LLM. First, we took all of our queries, fed them into an LLM along with the taxonomy, and asked it to predict the most relevant categories for each query. The output that came back was decent; when we looked at it, it made a lot of sense. But when we ran an online A/B test, the results weren't as great. One example that illustrates the point well is a query like "protein". Users who come to Instacart and type "protein" are looking for protein shakes, protein bars, or other protein supplements. The LLM, on the other hand, thinks a user typing "protein" is looking for chicken, tofu, or other protein-rich foods. This mismatch, where the LLM doesn't truly understand Instacart user behavior, was the root cause of the problem.
So to improve our results, we flipped the problem around: we took the top-K converting categories for each query and fed those as additional context to the LLM. I'm simplifying a bit; there's a bunch of ranking and downstream validation that happens. But essentially, we generated a set of ranked candidates, and this greatly simplified the problem for the LLM.

To illustrate with an example, take a query like "Wner soda". Our previous model identified this as a brand of fruit-flavored soda, which is not incorrect, but it's not very precise either. The LLM did a much better job: it identified it as a brand of ginger ale. With this, our downstream retrieval and ranking improved greatly as well. As you can see from the results below, especially for tail queries, we saw a big improvement: precision improved by about 18 percentage points and recall by about 70 percentage points, which is pretty significant for tail queries.

And to very briefly look at our prompt: as you can see, it's very simple. We're essentially passing in the top converting categories as context, plus a bunch of guidelines about what the LLM should output, and that's it. That's all that was needed to enable this. Again, I'm simplifying the overall flow, but the general concepts are pretty straightforward.
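A minimal sketch of that hybrid flow, assuming an OpenAI-style chat API; the model name, helper names, and prompt wording here are illustrative, not Instacart's production prompt:

```python
from openai import OpenAI

client = OpenAI()

def classify_query(query: str, top_converting: list[str]) -> list[str]:
    """Ask the LLM to pick the most relevant taxonomy categories,
    grounded in the top-K historically converting categories, instead
    of guessing against the full ~10,000-label taxonomy."""
    prompt = (
        "You map grocery search queries to product categories.\n"
        f"Query: {query!r}\n"
        "Categories users most often convert to for this query:\n"
        + "\n".join(f"- {c}" for c in top_converting)
        + "\nGuidelines: prefer the converting categories above; only add a "
        "category outside this list if it is clearly relevant. Return one "
        "category per line, most relevant first."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the talk doesn't name the model
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.lstrip("- ").strip()
            for line in resp.choices[0].message.content.splitlines()
            if line.strip()]
```

The key design choice, per the talk, is that the behavioral signal (converting categories) constrains the LLM's world knowledge, so "protein" stays anchored to protein bars and shakes rather than chicken and tofu.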
Coming to another model: the query rewrites model is also pretty important from an e-commerce perspective, especially at Instacart, because not all retailers are created equal. Some have large catalogs, some have very small catalogs, so the same query may not always return results, and that's where a rewrite is really helpful. For example, going from a query like "1% milk" to just "milk" would at least return results the customer can decide to buy or not.

Again, our previous approach, trained on engagement data, did decently well on head and torso queries but suffered from a lack of engagement data on tail queries. By using an LLM, similar to what we did for the product category classifier, we were able to generate very precise rewrites. In the example here, you can see a substitute, a broad, and a synonymous rewrite. For "avocado oil", a substitute is "olive oil", a broader rewrite is "healthy cooking oil", and a synonymous rewrite is "avocado extract".

Looking at the results, we saw a bunch of offline improvements, and with third-party LLMs, simply going from simpler models to better models improved the results quite a bit; this is based on our human evaluation data. So improving the model itself improved overall task performance. In terms of online improvements, we saw a large drop in the number of queries without any results. This is pretty significant, because we could now show results to users where they previously saw empty results, which was great for the business.
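As a sketch, the three rewrite types the talk describes could be modeled explicitly; the structure and fallback policy below are illustrative, not Instacart's actual schema:

```python
from dataclasses import dataclass

@dataclass
class QueryRewrites:
    """The three rewrite types described in the talk, for one query."""
    substitute: str  # a different product serving the same need
    broad: str       # a relaxed query, more likely to return results
    synonymous: str  # same intent, different surface form

# The example from the talk:
AVOCADO_OIL = QueryRewrites(
    substitute="olive oil",
    broad="healthy cooking oil",
    synonymous="avocado extract",
)

def fallback_chain(rewrites: QueryRewrites) -> list[str]:
    """One plausible order for retrying a zero-result query against a
    small retailer catalog: closest intent first, broadest last."""
    return [rewrites.synonymous, rewrites.substitute, rewrites.broad]
```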
Coming to an important part of this, which is how we actually scored and served the data: the thing is that Instacart has a pretty idiosyncratic query pattern. There's a very fat head and torso set of queries, and then a long tail. By precomputing the outputs for all of the head and torso queries offline in batch mode, we were able to cache all of this data; then, online, when a query comes in, we can just serve it from the cache with very low impact on latency, and fall back to our existing models for the long tail of queries.
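A minimal sketch of that serving pattern, with hypothetical cache and model interfaces:

```python
from typing import Callable

def build_cache(head_torso_queries: list[str],
                llm_predict: Callable[[str], list[str]]) -> dict[str, list[str]]:
    """Offline batch job: run the LLM once per frequent query and
    store the result."""
    return {q: llm_predict(q) for q in head_torso_queries}

def serve(query: str,
          cache: dict[str, list[str]],
          fallback_model: Callable[[str], list[str]]) -> list[str]:
    """Online path: O(1) cache lookup for head/torso queries; the
    existing non-LLM model handles the long tail, so serving latency
    is barely affected."""
    cached = cache.get(query)
    return cached if cached is not None else fallback_model(query)
```

Because the head and torso account for most traffic, this gets LLM-quality output on the bulk of queries without paying LLM latency at request time.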
This worked really well because it didn't impact our latency while greatly improving our coverage of the long tail of queries. For the really long tail, where I said we'd fall back to our existing models, we're actually trying to replace those with a distilled Llama model so we can do a much better job than the existing models.

To summarize: from a query understanding perspective, we have a bunch of models, and just using our hybrid approach greatly improved their performance. But what's more interesting is that today query understanding consists of a bunch of models, and, as was discussed in the Netflix talk, managing all of these models is complex from a systems perspective. Consolidating them into an SLM, or maybe a large language model, can make the results a lot more consistent.

I'll finish with an example. There's a query we saw some interesting issues with: "humm", spelled h-u-m-m.
Our query brand tagger correctly identified it as a brand of kombucha, but then our spell corrector unfortunately corrected it to "hummus", so the results were really confusing to users and pretty bad. With a more unified model, the results were much better.

The second benefit is that by using an LLM for query understanding, we can pass in extra context. So instead of generating results for a query in isolation, we can really try to understand what the customer's mission is, for example detecting whether they're here to buy ingredients for a recipe, and then generate content for that. To talk more about that, I'll hand it over to Tejaswi.
Thank you, Vinesh. Now I'll quickly talk about how we used LLMs to show more discovery-oriented content on the search results page. Just to restate the problem: our users found that while our search engine was very good at showing exactly the results they wanted, once they added an item to the cart, they couldn't do anything else useful with the search results page. They either had to do another search or go to another page to fulfill their next intent. Solving this with traditional methods would require a lot of feature engineering and manual work; LLMs solved this problem for us, and I'll talk about how.

This is how it looked in the end. For queries like "swordfish", where, say, there are no exact results, we used LLMs to generate substitute results, like other seafood alternatives and meaty fish like tilapia. Similarly, for queries like "sushi", where there are a lot of exact results, at the bottom of the search results page we'd show things like Asian cooking ingredients or Japanese drinks, to get users to engage. I'll get to the techniques, but both of these discovery-oriented result types led to improvements in engagement as well as in revenue per search.
Like I said, I'll get into the techniques, but let's first talk about the requirements for generating such content. First, we obviously wanted to generate content that is incremental to the current solutions; we don't want duplicates of what we were already showing. The second requirement, and the most important one, is that we wanted all of the LLM's generations to be aligned with Instacart's domain knowledge. What does this mean? If a user searches for a query like "dishes", the LLM should understand that it refers to cookware and not food, and vice versa for a query like "Thanksgiving dishes".

With these requirements in mind, we started with a very basic generation approach. We took the query and told the LLM: hey, you're an AI assistant, and your job is to generate two shopping lists, one of complementary items and one of substitute items, for the given query.
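A sketch of that first, context-free prompt, paraphrased from the talk rather than the verbatim production prompt:

```python
def basic_discovery_prompt(query: str) -> str:
    """First iteration: no Instacart context, just the raw query."""
    return (
        "You are an AI assistant for a grocery shopping service.\n"
        f"For the search query {query!r}, generate two shopping lists:\n"
        "1. Complementary items that go well with this query.\n"
        "2. Substitute items a shopper could buy instead.\n"
        "Return each list as short, comma-separated item names."
    )
```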
The results looked pretty good. Our PMs vetted everything; we looked at everything. But, like Vinesh said for query understanding, when we launched this to our users, the results were good but users weren't engaging with them as much as we'd have liked. So we went back to the drawing board and tried to analyze what was going on. What we realized quickly was that while the LLM's answers were common-sense answers, they weren't really what users were looking for. Taking the protein example again: when users search for "protein", they're looking for protein bars and protein shakes, rather than what the LLM would answer, which is chicken, turkey, and tofu.
So what we did was augment the prompt with Instacart domain knowledge. In one case, we took the query and augmented it with the top converting categories for that particular query, along with any annotations from the query understanding model: here's a brand present in the query, here's a dietary attribute present in the query, and so on. In another case, we added the subsequent queries users issued after this particular query. Once we augmented the prompt with this additional metadata about how Instacart users behave, the results were far better. I don't have time to show the before and after, but like I said, we saw a big improvement in both engagement and revenue.
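Building on the basic prompt sketched above, the augmented version might look like this; the field names and example values are hypothetical:

```python
def augmented_discovery_prompt(query: str,
                               top_converting_categories: list[str],
                               qu_annotations: dict[str, str],
                               subsequent_queries: list[str]) -> str:
    """Second iteration: same task, but grounded in how Instacart
    users actually behave for this query."""
    return (
        basic_discovery_prompt(query)  # from the earlier sketch
        + "\n\nInstacart context for this query:\n"
        + "Top converting categories: "
        + ", ".join(top_converting_categories) + "\n"
        + "Query annotations: "
        + ", ".join(f"{k}={v}" for k, v in qu_annotations.items()) + "\n"
        + "Queries users typically issue next: "
        + ", ".join(subsequent_queries)
    )

# e.g. augmented_discovery_prompt(
#     "protein",
#     ["protein bars", "protein shakes", "protein supplements"],
#     {"dietary_attribute": "high protein"},
#     ["protein powder", "protein shake ready to drink"],
# )
```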
I'll quickly talk about how we served all of this content. Very similar to query understanding: it's impractical to call the LLM in real time because of latency, and sometimes cost, concerns. So we took all of our historical search logs, called the LLM in batch mode, and stored everything: the query, the content metadata, and even the products that could potentially show up in the carousel. Online, it's just a very quick lookup from a feature store, and that's how we were able to serve all of these recommendations blazingly fast.

That said, things weren't as simple as we're making them out to be. The overall concept is simple, and the prompt itself is very simple, but there were three key challenges we solved along the way.
One is aligning generation with business metrics like revenue. This was very important for seeing topline wins, so we iterated over the prompts and the kind of metadata we'd feed to the LLM to achieve this. Second, we spent a lot of time on improving the ranking of the content itself. Our traditional pCTR and pCVR models did not work here, so we had to employ strategies like diversity-based ranking to get users to engage with the content.
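The talk doesn't detail the diversity strategy; one common choice for this kind of problem is maximal marginal relevance (MMR), sketched here purely as an assumption:

```python
from typing import Callable

def mmr_rank(candidates: list[str],
             relevance: dict[str, float],
             similarity: Callable[[str, str], float],
             lambda_: float = 0.7,
             k: int = 5) -> list[str]:
    """Greedy maximal-marginal-relevance ranking: trade off relevance
    against similarity to already-selected items, so a content shelf
    doesn't fill up with near-duplicate suggestions."""
    selected: list[str] = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c: str) -> float:
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lambda_ * relevance[c] - (1 - lambda_) * max_sim
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```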
The third challenge is evaluating the content itself. That means making sure that whatever the LLM generates is, first, correct and not hallucinated, and second, that it adheres to what Instacart needs as a product.
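The takeaways below mention using LLM-as-a-judge for this. A minimal sketch of such a check against those two criteria, with an illustrative rubric and an OpenAI-style API as before:

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "You are evaluating generated shopping content for the query {query!r}.\n"
    "Content: {content}\n"
    "Answer in JSON with two boolean fields:\n"
    '  "grounded": every item is a real grocery product, nothing invented\n'
    '  "appropriate": the content fits a grocery shopping product\n'
)

def judge(query: str, content: str) -> dict:
    """LLM-as-a-judge: a second model grades the generation against a
    fixed rubric; content that fails either check is filtered out
    before serving."""
    prompt = JUDGE_RUBRIC.format(query=query, content=content)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```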
So, summarizing the key takeaways from our talk: the LLM's world knowledge was super important for improving query understanding predictions, especially on tail queries. While LLMs were super helpful, we really found success by combining Instacart's domain knowledge with LLMs, and that's what delivered the topline wins we saw. And the third and last takeaway: evaluating the content, as well as the query understanding predictions, was far more important and far more difficult than we anticipated. We used LLM-as-a-judge to make this happen; it's a very important step, and we realized that kind of late. So yeah, that's all from us. We'll take questions now.
Thank you, Tejaswi and Vinesh. We'll take questions at the mic while the next speaker gets set up.
Hi, thanks for the talk. Have you also experimented with queries that are very long, in natural language, like "I want these three items and these five items", the way we'd do it in ChatGPT? Or is a single item still the focus?
Yeah, we've actually launched something in the past called Ask Instacart, if you've heard of it, which essentially takes natural language queries and tries to map them to search intent. So, for example, you might say "healthy foods for a three-year-old baby" or something like that, and that would map to things like fruit slices. I don't know if three-year-old toddlers can eat popcorn, but something along those lines. Then our usual recall and ranking stack retrieves those results.
So, any learnings from that experiment for you?
Yeah, we actually have a lot of learnings from that. Essentially, as we already mentioned, we need to inject a lot of Instacart context into the model to get decent results. The evaluation part is really key, so having a robust automated evaluation pipeline was important. And lastly, passing context downstream: for example, say it's a Mother's Day query and we come up with "perfumes" as one of the individual search intents. You really want women's perfumes to show up there, whereas when we just passed "perfumes" we'd see all kinds of items. So passing that context from the LLM to the downstream systems is really important.
Thanks.
Yeah, we have a lot of examples where we failed. We can talk about...