Transforming search and discovery using LLMs — Tejaswi & Vinesh, Instacart
Channel: aiDotEngineer
Published at: 2025-07-16
YouTube video id: PjaVHm_3Ljg
Source: https://www.youtube.com/watch?v=PjaVHm_3Ljg
Hi, good afternoon everyone. My name is Vinesh, and this is Tejaswi; we're part of the search and machine learning team at Instacart. Today we'd like to talk to you about how we're using LLMs to transform our search and discovery.

First, a little about ourselves. As I mentioned, we're part of the search and discovery ML team at Instacart. For those of you who may not be familiar with Instacart, it's the leader in online grocery in North America, and our mission is to create a world where everyone has access to the food they love and more time to enjoy it together.

Here's what we'll cover today. First, the importance of search in grocery e-commerce. Then, some of the challenges facing conventional search engines. Then the meat of the talk: how we're using LLMs to solve some of these problems. Finally, we'll finish with some key takeaways.

On the importance of search in grocery commerce: we've all gone grocery shopping. Customers come with long shopping lists, and it's the same on the platform; people are looking for tens of items in a session. A majority of these are restocking purchases, that is, things the customer has bought in the past; the rest are items the user is trying for the first time. And a majority of these purchases come through search. So search has a dual role: it needs to help the customer quickly and efficiently find the product they're looking for, and it needs to enable new product discovery. New product discovery isn't just important for the customer. It's also great for our advertisers, because it helps them showcase new products, and it's good for the platform, because it encourages larger basket sizes.

So let's look at the problems with our existing setup that make this hard. To begin with, we have two classes of queries that are generally more challenging, especially from an e-commerce perspective. The first is overly broad queries, like the "snacks" query on the left, where tons of products map to the query. Because our models are trained on engagement data, if we aren't exposing these products to the user, it's hard to collect the engagement data needed to rank them highly: the traditional cold-start problem, in a way. Second, as in the query on the right, we have very specific queries like "unsweetened plant-based yogurt," where the user wants something very specific. These queries don't occur very often, which means we just don't have enough engagement data to train the models on. And while we've done quite a bit of work to improve this, the challenge we kept running into is that while recall improves, precision remains a challenge, especially in a pre-LLM world.

The next class of problems is how to actually support new item discovery. When a customer walks into a grocery store, say into the pasta aisle, they might see new brands of pasta they'd want to try. Alongside, they'd also see pasta sauce and everything else needed to make a bowl of pasta. Customers want a similar experience on our site.
We've heard multiple rounds of feedback from customers along the lines of: "I can find the product I want via search, but when I try to find related products, it's a bit of a dead end; I need to make multiple searches to get where I want." So this was a problem we wanted to solve as well. And as I mentioned, pre-LLM this was a hard problem because of the lack of engagement data.

So let's see how we used LLMs to solve these problems. I'll talk specifically about how we used LLMs to uplevel our query understanding module. Query understanding, as most of you know, is the most upstream part of the search stack, and very accurate outputs there are needed to enable better retrieval and recall and, finally, better ranking results. Our query understanding module contains multiple models: query normalization, query tagging, query classification, category classification, and so on. In the interest of time, I'll pick a couple of these and talk about how we really improved them.

The first is our query-to-product-category classifier. Essentially, we take a query and map it to a category in our taxonomy. As an example, a query like "watermelon" maps to categories like fruits, organic fruits, and so on. Our taxonomy has about 10,000 labels, of which about 6,000 are commonly used. And because a query can map to multiple labels, this is essentially a multilabel classification problem.

In the past we actually had a couple of different traditional models. One was a FastText-based neural network, which essentially modeled the semantic relationship between the query and the category. As a fallback, we had an NPMI model, a statistical co-occurrence model between the query and the category. While these techniques worked great for head and torso queries, we had really low coverage for tail queries, because, again, we just didn't have enough engagement data to train on. And to be honest, we also tried more sophisticated BERT-based models; while we did see some improvement, the lack of engagement data meant that, for the increased latency, we didn't see the wins we'd hoped for.

So this is where we tried an LLM. First, we took all of our queries and fed them into the LLM along with the taxonomy, asking it to predict the most relevant categories for each query. The output that came back was decent; when we looked at it, it made a lot of sense. But when we ran an online A/B test, the results weren't as great. One example that illustrates this point very well is a query like "protein." Users who type "protein" into Instacart are looking for protein shakes, protein bars, or other protein supplements. The LLM, on the other hand, thinks they're looking for chicken, tofu, or other protein-rich foods. This mismatch, where the LLM doesn't truly understand Instacart user behavior, was really the cause of the problem. So to improve our results, we flipped the problem around: we took the top-K converting categories for each query and fed those as additional context to the LLM.
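To make that concrete, here is a minimal Python sketch of candidate-constrained classification. This is our illustration, not Instacart's production code: the prompt wording is assumed, and `call_llm` stands in for whatever chat-completion client is used.

```python
import json

def classify_query(query: str, top_converting: list[str], call_llm) -> list[str]:
    """Map a search query to relevant categories, constrained to
    candidates mined from engagement data (illustrative sketch)."""
    prompt = (
        "You map grocery search queries to product categories.\n"
        f"Query: {query!r}\n"
        "Top-converting categories for this query (from engagement data):\n"
        + "\n".join(f"- {c}" for c in top_converting)
        + "\nReturn a JSON list containing only the categories above that are "
        "truly relevant to the query, ordered by relevance."
    )
    raw = call_llm(prompt)        # one chat-completion call
    predicted = json.loads(raw)   # downstream validation/ranking would follow
    allowed = set(top_converting)
    # Guard against hallucination: keep only candidates we actually supplied.
    return [c for c in predicted if c in allowed]
```

The key design point is that the LLM never sees the full 10,000-label taxonomy; it only reranks and filters candidates that real user behavior already surfaced.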
Now, I'm simplifying this a bit; there's a bunch of ranking and downstream validation that happens. But essentially, that was what we did: we generated a set of candidates and had the LLM rank them, and this greatly simplified the problem for the LLM. To illustrate with an example, take a query like "Vernors soda." Our previous model identified this as a brand of fruit-flavored soda, which isn't incorrect, but isn't very precise either. The LLM did a much better job: it identified it as a brand of ginger ale. With this, our downstream retrieval and ranking improved greatly as well. And as you can see from the results, we saw a big improvement, especially for tail queries: precision improved by around 18 percentage points and recall by around 70 percentage points, which is pretty significant for our tail queries.

To look very briefly at our prompt: as you can see, it's very simple. We essentially pass in the top-converting categories as context, along with a set of guidelines about what the LLM should output, and that's it. That was all that was needed. Again, I'm simplifying the overall flow, but the general concepts are pretty straightforward.

Coming to another model: the query rewrites model is pretty important as well from an e-commerce perspective, especially at Instacart, because not all retailers are created equal. Some have large catalogs, some have very small catalogs, so the same query may not always return results, and that's where a rewrite is really helpful. For example, going from a query like "1% milk" to just "milk" would at least return results that the customer can decide to buy or not. Again, our previous approach, which was trained on engagement data, did decently well on head and torso queries but suffered from a lack of engagement data on tail queries. By using an LLM, similar to what we did for the product category classifier, we were able to generate very precise rewrites. In the example here, there's a substitute, a broad, and a synonymous rewrite. For "avocado oil," a substitute is olive oil, a broader rewrite is "healthy cooking oil," and a synonymous rewrite is "avocado extract."

Looking at the results: we saw a bunch of offline improvements. With third-party LLMs, just going from simpler models to better models improved the results quite a bit; this is based on our human evaluation data. So improving the model itself improved the overall performance of the task. And in terms of online improvements, we saw a large drop in the number of queries without any results. This is pretty significant, because we could now show results to users who previously saw empty results, which was great for the business.
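As a rough illustration of the rewrite generation just described: the prompt text, output schema, and `call_llm` helper below are assumptions for this sketch, not the production prompt shown on the slide.

```python
import json
from dataclasses import dataclass

@dataclass
class QueryRewrites:
    substitute: str   # e.g. "avocado oil" -> "olive oil"
    broad: str        # e.g. "avocado oil" -> "healthy cooking oil"
    synonymous: str   # e.g. "avocado oil" -> "avocado extract"

REWRITE_PROMPT = (
    "Generate three rewrites for the grocery search query {query!r}:\n"
    "1. substitute: a different product serving the same purpose\n"
    "2. broad: a more general query likely to match more of the catalog\n"
    "3. synonymous: an alternate phrasing of the same product\n"
    'Respond as JSON with keys "substitute", "broad", "synonymous".'
)

def generate_rewrites(query: str, call_llm) -> QueryRewrites:
    raw = call_llm(REWRITE_PROMPT.format(query=query))
    return QueryRewrites(**json.loads(raw))
```

The broad rewrite is what rescues zero-result queries on small-catalog retailers: if "1% milk" returns nothing, "milk" at least gives the customer something to decide on.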
Coming to an important part: how we actually scored and served this data. Instacart has a pretty idiosyncratic query pattern: there's a very fat head and torso set of queries, and then a long tail. So by precomputing the outputs for all of the head and torso queries offline, in batch mode, we were able to cache all of this data; online, when a query comes in, we just serve it off the cache with very low impact on latency, and fall back to our existing models for the long tail of queries. This worked really well because it didn't impact our latency while greatly improving our coverage. Now, for the really long tail, where I said we fall back to our existing models, we're actually trying to replace those with a distilled Llama model, so we can do a much better job than the existing models.

So, to summarize: from a query understanding perspective, we have a bunch of models, and just using this hybrid approach greatly improved their performance. But what's actually more interesting is that query understanding today consists of a bunch of models, and as Yazu was saying in the Netflix talk, managing all of these models is complex from a systems perspective. Consolidating all of them into an SLM, or maybe a large language model, can make the results a lot more consistent. I'll finish with an example: there's a query spelled "humm" that we saw some interesting issues with. Our query brand tagger correctly identified it as a brand of kombucha, but then our spell corrector unfortunately corrected it to "hummus," so the results were really confusing to users, and pretty bad. With a more unified model, the results were much better. The second benefit is that by using an LLM for query understanding, we can pass in extra context. Instead of generating results for a query in isolation, we can really try to understand what the customer's mission is: for example, detect whether they're here to buy ingredients for a recipe, and then generate content for that. To talk more about that, I'll hand it over to Tejaswi.
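A minimal sketch of that precompute-and-cache serving pattern, assuming a dict-backed cache and a fallback model with a `predict` method; both names are illustrative, not Instacart's actual infrastructure.

```python
def precompute_head_torso(queries, llm_classify):
    # Run offline in batch mode; results get pushed to a low-latency store.
    return {q: llm_classify(q) for q in queries}

def get_categories(query, cache, fallback_model):
    normalized = query.strip().lower()
    cached = cache.get(normalized)
    if cached is not None:
        return cached                           # cache hit: ~no added latency
    return fallback_model.predict(normalized)   # long-tail fallback
```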
Thank you, Vinesh. Now I'll quickly talk about how we used LLMs to show more discovery-oriented content on the search results page. Just to restate the problem: our users found that while our search engine was very good at showing exactly the results they wanted, once they added an item to the cart, they couldn't do anything else useful with the search results page. They either had to do another search or go to another page to fulfill their next intent. Solving this with traditional methods would require a lot of feature engineering or manual work. LLMs solved this problem for us, and I'll talk about how.

This is how it looked in the end. For queries like "swordfish," where, say, there are no exact results, we used LLMs to generate substitute results: other seafood alternatives, meaty fish like tilapia, and so on. Similarly, for queries like "sushi," where there are a lot of exact results, we'd show, at the bottom of the search results page, things like Asian cooking ingredients or Japanese drinks, to get users to engage. I'll get into the techniques, but both of these discovery-oriented result types led to improvements in engagement as well as in revenue per search.

First, let's talk about the requirements for generating such content. One: we obviously wanted to generate content that is incremental to the current results; we don't want duplicates of what we're already showing. The second requirement, and the most important one, is that we wanted all of the LLM's generations to be aligned with Instacart's domain knowledge. What does this mean? If a user searches for "dishes," the LLM should understand that it refers to cookware and not food, and vice versa for a query like "Thanksgiving dishes."

With these requirements in mind, we started with a very basic generation approach. We took the query and told the LLM: you are an AI assistant, and your job is to generate two shopping lists, one a list of complementary items and the other a list of substitute items for a given query. The results looked pretty good; our PMs vetted everything, we looked at everything. But, as Vinesh said for query understanding, when we launched this to users, the results were good but users weren't engaging with them as much as we would have liked. So we went back to the drawing board and analyzed what was going on, and what we quickly realized was that while the LLM's answers were common-sense answers, they weren't really what users were looking for. Taking the "protein" example again: when users search for protein, they're looking for protein bars and protein shakes, rather than what the LLM would give us, which is chicken, turkey, tofu, and so on.

So we augmented the prompt with Instacart domain knowledge. In one case, we took the query and augmented it with the top-converting categories for that particular query, along with any annotations from the query understanding models: here's a brand present in the query, here's a dietary attribute present in the query, and so on. In another case, we added the subsequent queries users issued after this particular query. Once we augmented the prompt with this additional metadata about how Instacart users behave, the results were far better. I don't have time to show the before and after, but as I said, we definitely saw a huge improvement in both engagement and revenue.

I'll quickly talk about how we served all of this content. Very similar to query understanding, it's impractical to call the LLM in real time, because of latency and, sometimes, cost concerns. So we took all of our historical search logs, called the LLM in batch mode, and stored everything: query, content, metadata, even the products that could potentially show up in the carousel. Online, it's just a very quick lookup from a feature store, and that's how we were able to serve all of these recommendations blazingly fast.
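Here is a rough sketch of the prompt augmentation just described; the field names and wording are assumptions for illustration, not the production schema.

```python
def build_discovery_prompt(query, top_categories, qu_annotations, next_queries):
    # Combine the base request (two shopping lists) with behavioral context
    # so generations match what users actually buy, not general knowledge.
    return (
        "You are an AI assistant generating two shopping lists for a grocery "
        f"search: complementary items and substitute items for {query!r}.\n"
        f"Top-converting categories for this query: {', '.join(top_categories)}\n"
        f"Query-understanding annotations (brand, dietary, ...): {qu_annotations}\n"
        f"Queries users often issue next: {', '.join(next_queries)}\n"
        "Ground every suggestion in this context, not just general knowledge."
    )
```

Called as, say, `build_discovery_prompt("protein", ["protein bars", "protein shakes"], {}, ["protein powder"])`, this steers the model toward supplements rather than chicken and tofu, which is exactly the behavior gap described above.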
Again, things weren't as simple as we're making them out to be. The overall concept is simple, and the prompt itself is very simple, but there were three key challenges we solved along the way. One was aligning generation with business metrics like revenue; this was very important for seeing topline wins, so we iterated on the prompts and the kind of metadata we would feed the LLM to achieve this. Second, we spent a lot of time on improving the ranking of the content itself. Our traditional click- and conversion-prediction (pCTR/pCVR) models did not work here, so we had to employ strategies like diversity-based ranking to get users to engage with the content. And the third was evaluating the content itself: one, making sure that whatever the LLM gives is right, that it isn't hallucinating something; and two, that it adheres to what we need as a product.

So, summarizing the key takeaways from our talk. First, the LLM's world knowledge was super important for improving query understanding predictions, especially for tail queries. Second, while LLMs were super helpful, we really found success by combining Instacart's domain knowledge with the LLMs; that's how we got the topline wins we saw. Third and last, evaluating the content, as well as the query understanding predictions, was far more important and far more difficult than we anticipated. We used LLM-as-a-judge to make this happen; it's a very, very important step, and one we realized the importance of kind of late. That's all from us. We'll take questions now.

Thank you, Tejaswi and Vinesh. We'll take questions at the mic while the next speaker gets set up.

Hi, thanks for the talk. Have you also been experimenting with queries that are very long, in natural language, like "I want these three items and these five items," the way we would do it on ChatGPT? Or is the focus still single items?

Yeah, we've actually launched something in the past, Ask Instacart, if you've heard of it, which essentially takes natural language queries and tries to map them to search intents. So, for example, you might say "healthy foods for a three-year-old," and that would map to things like fruit slices. I don't know if three-year-old toddlers can eat popcorn, but something along those lines. And then our usual recall and ranking stack retrieves those results.

Any learnings from that experiment?

Yeah, we actually have a lot of learnings from that. Essentially, as we already mentioned, we need to inject a lot of Instacart context into the model to get decent results. The evaluation part is really key, so having a robust automated evaluation pipeline was important. And lastly, passing context to the downstream systems: for example, if it's a Mother's Day query and we come up with an individual search intent of "perfumes," you really want women's perfumes in there, whereas when we just passed "perfumes" we'd see all kinds of items. So passing that context from the LLM to the downstream systems is really important.

Thanks.

Yeah, we have a lot of examples where we failed; we can talk about…
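To close, a minimal sketch of the LLM-as-judge evaluation mentioned in the takeaways; the rubric, JSON schema, and `call_llm` client are assumptions for illustration, not the speakers' actual pipeline.

```python
import json

JUDGE_PROMPT = (
    "Query: {query}\n"
    "Generated carousel title: {title}\n"
    "Items: {items}\n"
    'Respond as JSON: {{"relevant": true|false, "grounded": true|false, '
    '"reason": "..."}}. "grounded" means every item plausibly exists in a '
    "grocery catalog and matches the title."
)

def judge_content(query, title, items, call_llm):
    # A second LLM call vets generated content before it is served,
    # checking both relevance to the query and absence of hallucination.
    raw = call_llm(JUDGE_PROMPT.format(query=query, title=title, items=items))
    verdict = json.loads(raw)
    return verdict["relevant"] and verdict["grounded"]
```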