Recsys Keynote: Improving Recommendation Systems & Search in the Age of LLMs - Eugene Yan, Amazon
Channel: aiDotEngineer
Published at: 2025-07-16
YouTube video id: 2vlCqD6igVA
Source: https://www.youtube.com/watch?v=2vlCqD6igVA
Hi everyone. Thank you for joining us for the inaugural RecSys track at the AI Engineer World's Fair. Today I want to share what the future might look like when we merge recommendation systems and language models. My wife looked at my slides and said they're so plain, so I'll be giving this talk together with Latte and Mochi. You might have seen Mochi wandering the halls somewhere; there'll be a lot of doggos throughout the slides. I hope you enjoy.

First: language modeling techniques are not new in recommendation systems. It started around 2013, when we began learning item embeddings from co-occurrences in user interaction sequences. After that we started using GRUs. Who here remembers recurrent neural networks and gated recurrent units? Those were very short-term: we predicted the next item from a short sequence. Then transformers and attention came about, and we got much better at long-range dependencies. That's when we started asking: can we just process everything in a user's sequence, hundreds or even 2,000 item IDs long, and learn from that? So today, in this track, I want to share three ideas I think are worth thinking about: semantic IDs, data augmentation, and unified models.

The first challenge we have is hash-based item IDs. Who here works on recommendation systems? Then you probably know that hash-based item IDs don't encode the content of the item itself. The problem is that every time you have a new item, you suffer from the cold-start problem: you have to relearn everything about that item all over again. There's also sparsity, whereby you have a long tail of items with maybe one or two interactions, or even up to ten, and that's just not enough to learn from.
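To make the hash-based ID problem concrete, here is a toy sketch (hypothetical item titles, a small made-up bucket count): two near-identical items almost certainly land in unrelated buckets, so their embeddings are learned independently and nothing transfers to a new item.

```python
import hashlib

import numpy as np

def item_id(title: str, buckets: int = 10_000) -> int:
    """Hash an item into a fixed ID space, as classic recsys embedding tables do."""
    h = int(hashlib.md5(title.encode()).hexdigest(), 16)
    return h % buckets

# Two nearly identical items...
a = item_id("wireless noise-cancelling headphones")
b = item_id("wireless noise-canceling headphones")

# ...almost certainly get unrelated IDs, so their embedding rows below share
# nothing: the model must relearn the new item from scratch (cold start).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(10_000, 16))
emb_a, emb_b = embedding_table[a], embedding_table[b]
```

The hash carries no content signal, which is exactly what semantic IDs (next section) are meant to fix.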
So recommendation systems have this issue of popularity bias, and they struggle with cold start and sparsity. The solution is semantic IDs, which may even incorporate multimodal content. Here's an example of trainable multimodal semantic IDs from Kuaishou. Kuaishou is a short-video platform in China, kind of like TikTok; I think it's the number two short-video platform there. You might have used their text-to-video model, Kling, which they released sometime last year. The problem they had, being a short-video platform, is that users upload hundreds of millions of short videos every day, and it's really hard to learn from these new videos. So how can we combine static content embeddings with dynamic user behavior? Here's how they did it with trainable multimodal semantic IDs.

I'll go through each step. This is the Kuaishou model: a standard two-tower network. On the left is the embedding layer for the user, which takes a standard sequence of item IDs plus the user ID, and on the right is the embedding layer for the item IDs. These are fairly standard. What's new is that the model now takes in content input. (All of these slides will be available online right after this, so don't worry about copying them down.) To encode visuals they use ResNet, to encode video descriptions they use BERT, and to encode audio they use VGGish. Now, the trick is this: with these encoder models, it's very hard to backpropagate through them and update the encoder embeddings. So what did they do? First, they took all these content embeddings and simply concatenated them together. I know it sounds crazy, but they just concatenated them. Then they learned cluster IDs: I think they shared in the paper that they had about 100 million short videos, and via k-means clustering they learned a thousand cluster IDs.
That's what you see in the model encoder, in the boxes at the bottom: the cluster IDs. Above the cluster IDs you have the non-trainable content embeddings; below them, the trainable cluster IDs, each mapped to its own embedding table. The trick is this: as you train the model, the encoder learns to map the content space, via the cluster IDs and their embedding table, to the behavioral space.

The output: these semantic IDs not only outperform regular hash-based IDs on clicks and likes (that's pretty standard), they also increased cold-start coverage, i.e., of 100 videos shown, how many are new, by 3.6%, and increased cold-start velocity, i.e., how many new videos hit some threshold of views (they didn't share what the threshold was). Being able to improve cold-start coverage and cold-start velocity by these numbers is pretty outstanding. Long story short, the benefit of semantic IDs is that you address cold start with the semantic ID itself, and now your recommendations understand content. Later in this track, we'll see some amazing talks from Pinterest and YouTube. In the YouTube one, you'll see how they blend language models with semantic IDs: because the model understands the semantic ID, it can explain why you might like an item and give human-readable explanations, and vice versa.

Now the next question, and I'm sure everyone here has this challenge: the lifeblood of machine learning is data. Good-quality data at scale is essential for search and, of course, recommendation systems, but for search it's even more important. You need a lot of metadata.
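The trainable-semantic-ID recipe just described can be sketched as: concatenate frozen multimodal embeddings, cluster them, and have each item's cluster ID index a trainable embedding table. All shapes here are made up for illustration, and a tiny hand-rolled k-means stands in for clustering ~100M videos into ~1,000 clusters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen content embeddings per item, standing in for ResNet (visual),
# BERT (text), and VGGish (audio) outputs. Dimensions are made up.
n_items = 200
visual = rng.normal(size=(n_items, 8))
text = rng.normal(size=(n_items, 6))
audio = rng.normal(size=(n_items, 4))

# Step 1: simply concatenate the modality embeddings.
content = np.concatenate([visual, text, audio], axis=1)  # shape (200, 18)

# Step 2: cluster the concatenated embeddings; each item's cluster ID
# is its semantic ID. A toy k-means for illustration only.
def kmeans(x, k, iters=10):
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels

semantic_ids = kmeans(content, k=16)

# Step 3: the semantic IDs index a *trainable* embedding table, so gradients
# flow into the table (behavioral space) while the content encoders stay frozen.
emb_table = rng.normal(size=(16, 32)) * 0.01  # trainable in a real model
item_vectors = emb_table[semantic_ids]        # (200, 32), fed into the item tower
```

The key point the sketch shows: backpropagation never touches the frozen encoders, only the small cluster-ID embedding table.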
You need query expansion, synonyms, spell-checking, all sorts of metadata to attach to a search index. But this is very costly and high-effort to get. In the past we did it with human annotation, or tried to do it automatically. LLMs have turned out to be outstanding at this, and I'm sure everyone here is doing this to some extent: using LLMs for synthetic data and labels. I want to share two examples, from Spotify and Indeed.

The Indeed paper I really like a lot. The problem they faced: they were sending job recommendations to users via email, but some of those recommendations were bad; they just weren't a good fit for the user. So users had a poor experience and lost trust in the job recommendations. And how would users indicate that lost trust? "These job recommendations are not a good fit for me; I'm just going to unsubscribe." The moment a user unsubscribes from your feed or your newsletter, it's very, very hard to get them back. Almost impossible. Now, while they had explicit negative feedback (thumbs up and thumbs down), it was very sparse. How often do you actually give thumbs-down feedback? And implicit feedback is often imprecise. What do I mean? If you get a recommendation but don't act on it, is it because you didn't like it, or because it's not the right time, or maybe your wife works there and you don't want to work at the same company as your wife? So their solution was a lightweight classifier to filter bad recommendations. And I'll tell you why I really like this paper from Indeed: they didn't just share their successes, they shared the entire process of how they got there, and it was fraught with challenges.
Of course, the first thing that made me really like it was that they started with evals. They had their experts label job recommendation and user pairs; from the user you have resume data and activity data, and they judge whether the recommendation is a good fit. Then they prompted open LLMs, Mistral and Llama 2. Unfortunately, performance was very poor: these models couldn't really pay attention to what was in the resume versus the job description, even though they had sufficient context length, and the output was just very generic. To get it to work, they prompted GPT-4, and GPT-4 worked really well, with around 90% precision and recall. However, it was very costly (they didn't share the actual cost) and too slow: 32 seconds. Okay, if GPT-4 is too slow, what can we do? Let's try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision. What does this mean? Of the recommendations it said were bad, only 63% were actually bad. In other words, they would be throwing out 37% of flagged recommendations that were actually good, more than a third. For a company that runs on recommendations, where people recruit through your recommendations, throwing out a third of good recommendations is quite a cost; precision was their key metric here. So what they did next was fine-tune GPT-3.5. You can see the entire journey: open models, GPT-4, GPT-3.5, and now fine-tuning GPT-3.5. The fine-tuned GPT-3.5 got the precision they wanted at a quarter of GPT-4's cost and latency. But unfortunately it was still too slow, about 6.7 seconds, which would not work in an online filtering system. So they distilled a lightweight classifier on the fine-tuned GPT-3.5's labels, and this lightweight classifier achieved very high performance, specifically 0.86 AUC-ROC.
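The final distillation step can be sketched as: use the expensive fine-tuned LLM's labels as training targets for a classifier cheap enough for online filtering. Here synthetic features and labels stand in for Indeed's real (job, user) data and LLM judgments, and a hand-rolled logistic regression stands in for whatever lightweight model they actually used.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical features for (job, user) pairs, e.g. match scores derived
# from the resume and the job description.
X = rng.normal(size=(500, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])

# Stand-in for labels produced by the fine-tuned LLM ("is this rec bad?").
llm_labels = (X @ true_w + rng.normal(scale=0.3, size=500) > 0).astype(float)

# Distill: fit a lightweight logistic-regression classifier to the LLM's
# labels; this is the model that runs in the low-latency online filter.
w, b, lr = np.zeros(4), 0.0, 0.5
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ w + b)))      # predicted probabilities
    w -= lr * (X.T @ (p - llm_labels)) / len(X)  # gradient step on weights
    b -= lr * (p - llm_labels).mean()            # gradient step on bias

pred = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(float)
agreement = (pred == llm_labels).mean()  # how well the student mimics the LLM
```

At serving time only the dot product and sigmoid run, which is why distilled students can meet real-time latency budgets the teacher LLM cannot.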
The numbers may not mean much to you, but suffice to say that in an industrial setting this is pretty good. They didn't mention the latency, but it was good enough for real-time filtering, I think less than 200 milliseconds or so. The outcome: they were able to cut bad recommendations by about 20%. Initially they had hypothesized that cutting recommendations, even bad ones, would lower the application rate. It's like sending out links: even clickbait links get clicks. But this was not the case. Because the recommendations were now better, the application rate actually went up by 4%, and the unsubscribe rate went down by 5%. That's quite a lot. Essentially, what this means is that in recommendations, quantity is not everything; quality makes a big difference, and here it moved the needle by 5%.

The next example I want to share is Spotify. Who here knows that Spotify has podcasts and audiobooks? Oh, okay, I guess you're not the target audience for this use case. Spotify is really known for songs and artists; a lot of their users just search for songs and artists, and Spotify is very good at that. But when they started introducing podcasts and audiobooks, how would they help users learn that these new items are available? And of course there's a huge cold-start problem, and now it's not just cold start on an item, it's cold start on an entire category. How do you grow a new category within your service? Exploratory search was essential to the business for expanding beyond music: Spotify doesn't want to do just songs; now they're doing audiobooks and podcasts too.
The solution is a query recommendation system. First, how did they generate new queries? They had a bunch of sources: extract them from catalog titles and playlist titles, mine them from the search logs, or just take an artist name and append "cover" to it. That's what they did with existing data. Now you might be wondering: where's the LLM in this? The LLM is used to generate natural-language queries. This might not be sexy, but it works really well: take whatever conventional techniques already work well, and use the LLM to augment them where needed. Don't use the LLM for everything from the start. So now they have these exploratory queries. When you search for something, you still get the immediate result hits; they take those, add the immediate results, and rank the new queries alongside them. This is the UX you'll probably get right now when you search (I got this from the paper; it may have changed recently): you still see the item results at the bottom, but at the top are the query recommendations. This is how Spotify informs users, without a banner saying "now we have audiobooks, now we have podcasts": you search for something, and it shows you these new categories exist. The benefit: +9% exploratory queries. Essentially, a tenth of their users were now exploring the new products. Imagine that: one-tenth of your users exploring your new products every day. How quickly would you grow your new product category? It compounds, like 1.1^n; you'll grow pretty fast. Long story short, I don't have to tell you the benefits of LLM-augmented synthetic data: richer, high-quality data at scale, even on the tail queries.
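The conventional-techniques-first point can be illustrated with a sketch of the candidate-query generation step. All the data and the "covers" template below are hypothetical; an LLM would only come in afterwards, to produce natural-language queries beyond what these simple sources yield.

```python
# Hypothetical existing data; in practice mined from the catalog and search logs.
playlist_titles = ["Morning Acoustic", "True Crime Podcasts"]
artists = ["Miles Davis", "Norah Jones"]
search_log_queries = ["jazz piano", "sleep sounds"]

def generate_candidate_queries(playlists, artists, logs):
    """Generate exploratory query candidates from existing data,
    before any LLM is involved."""
    candidates = set()
    candidates.update(t.lower() for t in playlists)  # catalog/playlist titles
    candidates.update(q.lower() for q in logs)       # mined search logs
    for a in artists:                                # simple template: artist + "covers"
        candidates.add(f"{a.lower()} covers")
    return sorted(candidates)

candidates = generate_candidate_queries(
    playlist_titles, artists, search_log_queries
)
```

These candidates would then be ranked alongside the immediate search results, with the LLM reserved for generating the natural-language queries the templates can't produce.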
Even on the tail queries and tail items, at far lower cost and effort than is possible with human annotation. Later we'll also have a talk from Instacart on how they use LLMs to improve their search system.

The last thing I want to share is this challenge: right now, in a typical company, the systems for ads, for recommendations, and for search are all separate. And even within recommendations, the model for homepage recommendations, the model for item recommendations, the model for cart recommendations, the model for thank-you-page recommendations may all be different models. You can imagine you'll end up with many, many models, while leadership expects you to keep the same headcount. So how do you get around this? You have duplicative engineering pipelines, a lot of maintenance cost, and improving one model doesn't naturally transfer to improvements in another. The solution is unified models. It works for vision, it works for language, so why not recommendation systems? And we've been doing this for a while; this is not new. As an aside (maybe the text is too small), this is a tweet from Stripe, where they built a transformer-based payments fraud model: even for payments, over a sequence of payments, you can build a transformer-based foundation model.

I want to share an example: the unified ranker for search and recsys at Netflix. The problem, as I mentioned: they had teams building bespoke models for search, similar-video recommendations, and pre-query recommendations (what you see on the search page before you even enter a query). High operational cost, and missed opportunities for transfer learning across tasks. Their solution is a unified ranker, which they call a unified contextual ranker: UniCoRn.
You can see here, at the bottom, the user foundation model, into which they put the user's watch history, and the context and relevance model, which takes the context of the videos and what the user has watched. The thing about this unified model is that it takes a unified input. If you can find a data schema that all your use cases and all your features can share, you can adopt an approach like this, similar to multi-task learning. The input is the user ID, the item ID (the video, drama, or series), the search query if one exists, the country, and the task. They have multiple tasks; in the paper there are three: search, pre-query, and more-like-this. What they did next was a very smart imputation of missing fields. For example, if you're doing an item-to-item recommendation, say you just finished watching a video and want to recommend the next one, you have no search query. How do you impute it? You simply use the title of the current item and try to find similar items. The outcome: this unified model was able to match or exceed the metrics of their specialized models on multiple tasks. Think about it. It may not seem very impressive: match or exceed? It might seem we did all this work just to match. But imagine unifying all of it, removing the tech debt and building a better foundation for your future iterations. It's going to make you iterate faster.

The last example I want to share is unified embeddings at Etsy. You might think embeddings aren't very sexy, but this paper from Etsy is really outstanding in what they share, both the model architecture and the system.
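A unified input schema with task-aware imputation might look like the sketch below. The field names and the `impute` helper are assumptions for illustration, loosely following the description above, not UniCoRn's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedInput:
    """One schema for all ranking tasks (hypothetical field names)."""
    user_id: str
    item_id: str                   # candidate / context video being scored
    search_query: Optional[str]    # None for non-search tasks
    country: str
    task: str                      # e.g. "search", "prequery", "more_like_this"

def impute(example: UnifiedInput, item_titles: dict) -> UnifiedInput:
    """For item-to-item tasks there is no search query; impute it with the
    title of the current item, as described in the talk."""
    if example.search_query is None:
        example.search_query = item_titles.get(example.item_id, "")
    return example

# Hypothetical catalog lookup and a more-like-this request with no query.
titles = {"v42": "Slow Horses"}
x = UnifiedInput("u1", "v42", None, "US", "more_like_this")
x = impute(x, titles)
```

With every task expressed in one schema, a single model can be trained across search, pre-query, and more-like-this examples, multi-task style.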
The problem they had: how can we help users get better results for very specific queries as well as very broad queries? And Etsy's inventory is constantly changing; they don't carry the same products throughout, since it's all homegrown. You might query for something like "mother's day gift", which would match very few items: very few items have "mother's day gift" in their description or title. The other problem is that lexical embedding retrieval doesn't account for user preferences. How do you address this? With unified embedding and retrieval. If you remember, at the start of my presentation I talked about the Kuaishou two-tower model: a user tower and an item tower. We see the same pattern again here: there's the product tower, the product encoder. To encode the product they use T5 models for text embeddings (item descriptions) and a query-product log for query embeddings: what query was made, and what product was eventually clicked or purchased. On the left is the query encoder, which encodes the search query. Both towers share encoders for the tokens (the text tokens), the product category (a token of its own), and the user location. What this means is that your embedding can now match the user to the location of the product itself. And to personalize this, they encode user preferences via the query-user scalar features at the bottom: essentially, what queries did the user search for, what did they buy previously, all their preferences.
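The two-tower pattern that keeps recurring here can be sketched in a few lines: both towers map their inputs into the same vector space via shared encoders, and retrieval scores are dot products. The vocabulary, dimensions, and mean-pooling encoder below are all made up for illustration, standing in for the shared text/category/location encoders.

```python
import numpy as np

rng = np.random.default_rng(3)

# Tiny made-up vocabulary and a shared token embedding table.
vocab = {"mother": 0, "day": 1, "gift": 2, "ceramic": 3, "mug": 4}
d = 16
token_emb = rng.normal(size=(len(vocab), d))

def encode(tokens):
    """Mean-pool shared token embeddings; both the query tower and the
    product tower use this same encoder in the sketch."""
    ids = [vocab[t] for t in tokens]
    return token_emb[ids].mean(axis=0)

# Query tower output and two product tower outputs.
query_vec = encode(["mother", "day", "gift"])
product_vecs = np.stack([
    encode(["ceramic", "mug"]),
    encode(["mother", "day", "ceramic", "mug"]),  # mentions the occasion
])

# Retrieval scores are just dot products between the two towers' outputs.
scores = product_vecs @ query_vec
best = int(np.argmax(scores))
```

Because scoring is a dot product, candidate products can be pre-embedded and served with approximate nearest-neighbor search, which is what makes the two-tower design practical at retrieval scale.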
They also shared their system architecture: here you see the product encoder and the query encoder from the previous slide. What's very interesting is that they added a quality vector, because they wanted to ensure that whatever was retrieved was actually of good quality in terms of ratings, freshness, and conversion rate. And what they did is simply concatenate this quality vector onto the product embedding vector. But when you do that, you have to expand the query vector by the same number of dimensions so that you can still take a dot product or cosine similarity. So essentially they just slapped a constant vector onto the query embedding, and it just works. The result: a 2.6% increase in conversion across the entire site, which is quite crazy, and a more than 5% increase in search purchases: if you search for something, the purchase rate goes up by 5%. These are very, very good results for e-commerce.

So, the benefits of unified models: you simplify the system, and whatever you build to improve one tower, or to improve the unified model, also improves the other use cases that share it. That said, there may also be an alignment tax. When you try to compress all twelve use cases into a single unified model, you may find you need to split it into two or three separate unified models, because getting better at one task actually makes another task worse. We have a talk from LinkedIn this afternoon, the last talk of the block, and then a talk from Netflix sharing their unified model at the start of the next block.

All right, the three takeaways I have for you, to think about and consider: semantic IDs, data augmentation, and unified models.
And of course, do stay tuned for the rest of the talks in this track. Okay, that's it. Thank you.