AI Engineer World’s Fair 2025 - LLM Recommendation Systems (RecSys)
Channel: aiDotEngineer
Published at: 2025-06-04
YouTube video id: 3k4a0PemMu4
Source: https://www.youtube.com/watch?v=3k4a0PemMu4
Hi everyone. Thank you for joining us in today's RecSys track, the inaugural RecSys track at the AI Engineer World's Fair. Today I want to share what the future might look like when we try to merge recommendation systems and language models. My wife looked at my slides and said they were so plain, so I'll be giving this talk together with Latte and Mochi. You might have seen Mochi wandering the halls somewhere; there will be a lot of doggos throughout the slides. I hope you enjoy.

First, language modeling techniques are not new in recommendation systems. It started back in 2013, when we began learning item embeddings from co-occurrences in user interaction sequences. After that we started using GRU4Rec. Who here remembers recurrent neural networks and gated recurrent units? Those were very short-term: we predicted the next item from a short sequence. Then transformers and attention came along, and we got better at attending over long-range dependencies. That's where we started asking: can we just process everything in the user sequence, hundreds or even a couple of thousand item IDs long, and learn from that?

Today I want to share three ideas that I think are worth thinking about: semantic IDs, data augmentation, and unified models.

The first challenge is hash-based item IDs. Who here works on recommendation systems? You probably know that hash-based item IDs don't encode the content of the item itself. The problem is that every time you have a new item, you suffer from the cold-start problem: you have to relearn everything about that item all over again. There's also sparsity: you have a long tail of items with maybe one or two interactions, or even up to ten, and that's just not enough to learn from. So recommendation systems tend to be popularity-biased and struggle with cold-start and sparsity.

The solution is semantic IDs, which may even involve multimodal content. Here's an example of trainable multimodal semantic IDs from Kuaishou. Kuaishou is kind of like TikTok; it's a short video platform in China, I think the number two short video platform there. You might have used their text-to-video model Kling, which they released sometime last year. The problem they had, being a short video platform, is that users upload hundreds of millions of short videos every day, and it's really hard to learn from a brand-new short video. So how can we combine static content embeddings with dynamic user behavior?

Here's how they did it with trainable multimodal semantic IDs. The Kuaishou model is a standard two-tower network. On the left is the embedding layer for the user, which takes a standard sequence of IDs plus the user ID, and on the right is the embedding layer for the item IDs. These are fairly standard. What's new is that they now take in content input. (All of these slides will be available online right after this, so don't worry about copying them down.) To encode the visual content they use ResNet, to encode video descriptions they use BERT, and to encode audio they use VGGish.
Now here's the trick: when you have these encoder models, it's very hard to backpropagate through them and update the encoder embeddings. So what did they do? First, they took all these content embeddings and simply concatenated them together. I know it sounds crazy, but they just concatenated them. Then they learned cluster IDs. I think they shared in the paper that they had around 100 million short videos, and via k-means clustering they learned a thousand cluster IDs. That's what you see in the model encoder, in the boxes at the bottom: above are the non-trainable content embeddings, and below are the trainable cluster IDs, each mapped to its own embedding table. The trick is this: as you train the model, the encoder learns to map the content space, via the cluster IDs and their embedding table, to the behavioral space.

The outcome: these semantic IDs not only outperform regular hash-based IDs on clicks and likes, which is fairly standard, but they also increased cold-start coverage (of 100 videos shown, how many are new) by 3.6%, and increased cold-start velocity (how many new videos hit some threshold of views; they didn't share the threshold). Being able to move cold-start coverage and velocity by these numbers is pretty outstanding.

Long story short, the benefits of semantic IDs: you address cold-start with the semantic ID itself, and your recommendations now understand content. Later in the track we'll see some amazing sharing from Pinterest and YouTube. In the YouTube one, you'll see how they blend language models with semantic IDs so the model can actually explain why you might like an item: because it understands the semantic ID, it can give human-readable explanations, and vice versa.
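Here is a minimal sketch of the idea as described above, with placeholder encoders, dimensions, and cluster counts; it is an illustration of the pattern, not Kuaishou's actual implementation:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Frozen content embeddings from pretrained encoders (e.g., ResNet for
# frames, BERT for descriptions, VGGish for audio) are just concatenated.
n_items, d_vis, d_txt, d_aud = 10_000, 512, 768, 128
content = np.concatenate(
    [np.random.randn(n_items, d) for d in (d_vis, d_txt, d_aud)], axis=1
)

# Learn K cluster IDs over the concatenated content space (non-trainable).
K = 1000
cluster_ids = KMeans(n_clusters=K, n_init="auto").fit_predict(content)
cluster_ids = torch.as_tensor(cluster_ids, dtype=torch.long)

# Each cluster ID maps to a trainable embedding, so gradients from the
# behavioral objective (clicks, likes) flow into the cluster table even
# though the content encoders themselves stay frozen.
class SemanticId(nn.Module):
    def __init__(self, n_clusters: int, dim: int = 64):
        super().__init__()
        self.table = nn.Embedding(n_clusters, dim)

    def forward(self, item_idx: torch.Tensor) -> torch.Tensor:
        return self.table(cluster_ids[item_idx])

item_tower_input = SemanticId(K)(torch.tensor([0, 42, 999]))  # shape (3, 64)
```

Because the cluster assignment is learned from content, a brand-new item lands in an existing cluster and immediately inherits that cluster's behavioral embedding, which is how this addresses cold-start.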
Now, the next challenge, and I'm sure everyone here faces it: the lifeblood of machine learning is data. Good-quality data at scale is essential for search and, of course, recommendation systems, but for search it's even more critical: you need a lot of metadata, query expansion, synonyms, spell checking, all sorts of metadata to attach to your search index, and this is very costly and high-effort to get. In the past we did it with human annotation, or tried to do it automatically, but LLMs have turned out to be outstanding at this, and I'm sure everyone here is doing it to some extent, using LLMs for synthetic data and labels. I want to share two examples, from Spotify and Indeed.

The Indeed paper I really like a lot. The problem they faced: they were sending job recommendations to users via email, but some of those recommendations were bad, just not a good fit for the user. Users had a poor experience and lost trust in the job recommendations, and the way they indicated that lost trust was: these recommendations are not a good fit for me, I'm just going to unsubscribe. And the moment a user unsubscribes from your feed or newsletter, it's very, very hard to get them back. Almost impossible. While Indeed had explicit negative feedback, thumbs up and thumbs down, it was very sparse; how often do you actually give thumbs-down feedback? And implicit feedback is often imprecise. What do I mean? If you get a recommendation but don't act on it, is it because you didn't like it, or because it's not the right time, or maybe your wife works there and you don't want to work at the same company as your wife?

So their solution was a lightweight classifier to filter bad recommendations. I'll tell you why I really like this paper: they didn't just share their successes, they shared the entire process of how they got there, and it was fraught with challenges. The first thing that made me like it was that they started with evals: they had experts label job-recommendation and user pairs, where for each user you have resume data and activity data, and judge whether the recommendation was a good fit. Then they prompted open LLMs, Mistral and Llama 2. Unfortunately, performance was poor: these models couldn't really pay attention to what was in the resume versus the job description, even with sufficient context length, and the output was very generic. So they prompted GPT-4, which worked really well, roughly 90% precision and recall, but it was very costly (they didn't share the actual cost) and too slow: 22 seconds. Okay, if GPT-4 is too slow, what can we do? Try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision: of the recommendations it flagged as bad, only 63% were actually bad. That means they would be throwing out 37% of the filtered recommendations that were actually good, about one-third. For a company that thrives on recommendations, and on people recruiting through those recommendations, throwing out a third of good ones breaks their guardrail; precision was their key metric here.

So next they fine-tuned GPT-3.5. You can see the entire journey: open models, then GPT-4, then GPT-3.5, now fine-tuned GPT-3.5. The fine-tuned model got the precision they wanted at about a quarter of GPT-4's cost and latency. But it was still too slow, about 6.7 seconds, which would not work in an online filtering system. So finally they distilled a lightweight classifier on the fine-tuned GPT-3.5's labels, and this lightweight classifier achieved very strong performance, specifically 0.86 AUROC. The numbers may not mean much to you, but suffice to say that in an industrial setting this is pretty good. They didn't mention latency, but it was good enough for real-time filtering, I think under 200 milliseconds or so.
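A hypothetical sketch of that final distillation step: train a small, fast classifier on pairs labeled by the fine-tuned LLM. The features and labels here are synthetic stand-ins; Indeed's actual features and model family aren't specified in the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 64))           # stand-in: cheap user-x-job features
y = (X[:, :8].sum(axis=1) > 0).astype(int)  # stand-in for LLM-produced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
# A linear model like this scores in microseconds, which is what makes
# online filtering feasible where a 6.7s LLM call is not.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```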
The outcome: they cut bad recommendations by about 20%. Now, initially they had hypothesized that cutting recommendations, even bad ones, would lower the application rate. It's just like sending out links: even clickbait links get clicks. But that was not the case. Because the recommendations were now better, the application rate actually went up by 4%, and the unsubscribe rate went down by 5%. That's quite a lot. Essentially, in recommendations, quantity is not everything; quality makes a big difference, and here it moved the needle by 5%.

Another example I want to share is Spotify. Who here knows that Spotify has podcasts and audiobooks? Oh, okay, I guess you're not the target audience for this use case. Spotify is really known for songs and artists; a lot of their users just search for songs and artists, and they're very good at that. But when they started introducing podcasts and audiobooks, how would you help users learn that these new items are available? There's a huge cold-start problem, and it's not only cold-start on items, it's cold-start on an entire category. How do you grow a new category within your service? Exploratory search was essential to the business for expanding beyond music: Spotify doesn't want to do just music; now they do audio.

The solution was a query recommendation system. First, how did they generate new queries? They had a bunch of ideas using existing data: extract queries from catalog titles and playlist titles, mine them from search logs, or take an artist name and simply append "cover" to it. Now you might be wondering: where's the LLM in this? The LLM is used to generate natural-language queries. This might not be sexy, but it works really well. Take whatever you have with conventional techniques that already work, and use the LLM to augment where needed; don't use the LLM for everything from the start.

So now they have these exploratory queries. When you search for something, you still get the immediate results; they take the immediate results plus these new queries and rank them together. This is the UX you'll probably see when you search today (I got this from the paper; it may have changed recently): you still see the item results at the bottom, but at the top you get query recommendations. This is how Spotify informs users, without a banner, that they now have audiobooks and podcasts: you search for something, and the system tells you these new categories exist.

The benefit: +9% exploratory queries. Essentially one-tenth of their users were now exploring the new products. Imagine one-tenth of users exploring your new products every day; how quickly would your new category grow? It compounds like 1.1^n; you grow pretty fast. Long story short, I don't have to tell you the benefits of LLM-augmented synthetic data: richer, higher-quality data at scale, even on the tail queries and tail items, at far lower cost and effort than is possible with human annotation.
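A sketch of the blend-don't-replace pattern just described: conventional candidate generation from existing data, with an LLM augmenting it. `llm_generate` is a stand-in for any completion API, and the templates are illustrative, not Spotify's actual pipeline.

```python
def conventional_candidates(artist: str, catalog_titles: list[str]) -> list[str]:
    # Mine existing data: catalog/playlist titles, plus simple templates
    # like artist + "cover", exactly the kind of cheap tricks described.
    cands = [t for t in catalog_titles if artist.lower() in t.lower()]
    cands.append(f"{artist} cover")
    return cands

def llm_generate(artist: str) -> list[str]:
    # Stand-in for a real LLM call that writes natural-language queries
    # surfacing the new categories (podcasts, audiobooks).
    return [f"{artist} podcast", f"{artist} audiobook"]

def exploratory_queries(artist: str, catalog_titles: list[str]) -> list[str]:
    cands = conventional_candidates(artist, catalog_titles)
    cands += llm_generate(artist)            # LLM augments, not replaces
    return list(dict.fromkeys(cands))        # dedupe, preserve order for ranking

print(exploratory_queries("Nirvana", ["Nirvana Unplugged", "Best of Nirvana"]))
```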
Later we also have a talk from Instacart, who will tell us how they use LLMs to improve their search system.

The last challenge I want to share is this: right now, in a typical company, the systems for ads, recommendations, and search are all separate. Even within recommendations, the model for homepage recommendations, the model for item recommendations, the model for add-to-cart recommendations, and the model for the thank-you page may all be different models. You can imagine you end up with many, many models, while leadership expects you to keep the same headcount. How do you get around this? You have duplicative engineering pipelines, a lot of maintenance cost, and improving one model doesn't naturally transfer to improvements in another.

The solution is unified models. It works for vision, it works for language, so why not recommendation systems? And we've been doing this for a while; it's not new. As an aside (the text may be too small), this is a tweet from Stripe, where they built a transformer-based payments fraud model: even for payments, over a sequence of payments, you can build a transformer-based foundation model.

I want to share an example: the unified ranker for search and recsys at Netflix. The problem, as I mentioned: teams were building bespoke models for search, similar-video recommendations, and pre-query recommendations (what you see on the search page before you enter a query). High operational cost, and missed opportunities to learn across tasks. Their solution is a unified ranker they call UniCoRN, the Unified Contextual Ranker. At the bottom there's the user foundation model, where you put in the user's watch history, and the context and relevance model, where you put in the context of the videos and what they've watched.

The key thing about this unified model is that it takes unified input. If you can find a data schema that all your use cases and features can share, you can adopt an approach like this, which is similar to multi-task learning. The input is the user ID, the item ID (the video, drama, or series), the search query if one exists, the country, and the task. They have several tasks; in the paper there are three: search, pre-query, and more-like-this. What they did next was a very smart imputation of missing fields. For example, for item-to-item recommendation, you've just finished watching a video and want to recommend the next one, so there's no search query. How do you impute it? You simply use the title of the current item and find similar items.

The outcome: this unified model was able to match or exceed the metrics of the specialized models on multiple tasks. Think about it: "match or exceed" may not seem very impressive; it might seem like we did all this work just to match. But imagine unifying all of it: removing the tech debt and building a better foundation for your future iterations. It's going to make you iterate faster.
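A sketch of a unified input schema in the spirit of UniCoRN; the field names, task names, and imputation rule are illustrative assumptions, not Netflix's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedExample:
    user_id: str
    item_id: str
    search_query: Optional[str]   # None for non-search tasks
    country: str
    task: str                     # e.g. "search", "pre_query", "more_like_this"

def impute(ex: UnifiedExample, item_titles: dict[str, str]) -> UnifiedExample:
    # For item-to-item tasks there is no query; impute it from the current
    # item's title so every task shares one input format and one model.
    if ex.search_query is None and ex.task == "more_like_this":
        ex.search_query = item_titles[ex.item_id]
    return ex

ex = impute(UnifiedExample("u1", "v42", None, "US", "more_like_this"),
            {"v42": "Stranger Things"})
print(ex.search_query)  # -> "Stranger Things"
```

The design choice is that once every task is expressible in one schema, one model can serve all of them, which is what makes "match or exceed the specialized models" such a strong result.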
The last example I want to share is unified embeddings at Etsy. You might think embeddings are not very sexy, but this paper from Etsy is outstanding in what it shares about both model architecture and the serving system. The problem they had: how can we help users get good results from both very specific and very broad queries, given that Etsy's inventory is constantly changing? They don't carry the same products throughout; it's very homegrown. You might query something like "mother's day gift," which would match very few items; very few listings have "mother's day gift" in their title or description. The other problem is that lexical embedding retrieval doesn't account for user preferences.

How do you address this? With unified embedding and retrieval. If you remember the Kuaishou two-tower model from the start of my presentation, with a user tower and an item tower, we see the same pattern here. On one side is the product tower, the product encoder. To encode a product they use T5 models for text embeddings (item descriptions), plus a query-product log for query embeddings: what query was made, and what product was eventually clicked or purchased. On the other side is the query encoder for the search query. The two towers share encoders for the text tokens, the product category (a token in itself), and the user location, so the embedding can match a user to the location of the product itself. And to personalize this, they encode user preferences via the query-user features at the bottom: what queries the user searched for, what they bought previously, all their preferences.

They also shared their system architecture. Here is the product encoder and the query encoder from the previous slides. What's very interesting is that they added a quality vector, because they wanted to ensure that whatever was retrieved was actually good quality in terms of ratings, freshness, and conversion rate. What they did is simply concatenate this quality vector onto the product embedding vector. But when you do that, you have to expand the query vector by the same number of dimensions so you can still take a dot product or cosine similarity. So they just slapped a constant vector onto the query embedding, and it works.

The results: a 2.6% increase in conversion across the entire site, which is quite crazy, and more than a 5% increase in search purchases. These are very, very good results for e-commerce.
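A sketch of the quality-vector trick as described: append a quality vector to the product embedding, and pad the query embedding with a constant vector of the same size so the dot product still lines up. Dimensions and quality features are placeholders.

```python
import torch

d_emb, d_quality = 256, 4
product_emb = torch.randn(1000, d_emb)
quality = torch.rand(1000, d_quality)      # e.g. ratings, freshness, conversion
product_vec = torch.cat([product_emb, quality], dim=1)   # (1000, 260)

query_emb = torch.randn(1, d_emb)
const_pad = torch.ones(1, d_quality)       # constant vector on the query side
query_vec = torch.cat([query_emb, const_pad], dim=1)     # (1, 260)

# One dot product now blends semantic similarity with item quality: the
# constant pad means every query weights the quality terms equally.
scores = query_vec @ product_vec.T         # (1, 1000)
```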
So, the benefits of unified models: you simplify the system, and whatever you build to improve one tower of the unified model also improves the other use cases that share it. That said, there may also be an alignment tax. You may find that when you try to compress all 12 use cases into a single unified model, you need to split it into two or three separate unified models, because of that alignment tax: getting better at one task actually makes another task worse. We have a talk from LinkedIn in this afternoon's block, the last talk of the block, and we also have a talk from Netflix, who will share about their unified model at the start of the next block.

All right, the three takeaways I have for you; think about them, consider them: semantic IDs, data augmentation, and unified models. And of course, do stay tuned for the rest of the talks in this track. Okay, that's it. Thank you. Maybe I have time for one question while our speakers from Pinterest come up and join us. Oh, do you mind speaking into the mic, please?

Q: I read your very long paper on recommendation systems and what's available today, but you didn't mention the generative recommenders (HSTU) work from Meta's paper, and I was curious why you left that out.

A: I didn't deliberately leave it out. There were so many papers that I just didn't have time; I time-boxed myself. I said, "Okay, Eugene, you'll be done with this in two weeks," and when two weeks were up: ship it. That's all. Yes, another question.

Q: I feel like I've read anecdotal blog posts saying that part of the perceived decline in Google search quality is a move away from explicit ranking factors and an easily auditable ranking algorithm toward something more black-box using these techniques. I'm curious whether you have an opinion on whether that seems likely, or whether it's just noise and not actually influencing the quality of the search results.

A: That's a good question. Unfortunately, I don't have any insider information on why that might happen. We do have some Google folks here, maybe you can ask them, but honestly I haven't noticed or experienced this degradation myself.

Thank you everyone for your patience. Next we have machine learning engineers from Pinterest. They'll be sharing how they integrated LLMs to enhance relevance scoring at Pinterest: how they combine search queries with multimodal context, including visual captions, link-based text, and user-curated boards. Over to you.

Thanks for the introduction. Hi everyone, thanks for joining the talk today. We're super excited to be here and share some of the learnings we have from integrating LLMs into Pinterest search. My name is Han, and today I'll be presenting with my colleague; we're both machine learning engineers on the search relevance team at Pinterest. Let me start with a brief introduction to Pinterest: it's a visual discovery platform where Pinners can come to find inspiration to create a life they love. There are three main discovery surfaces on Pinterest: the home feed, related pins, and search. In today's talk we'll focus on search, where users type in their queries and find useful, inspiring content based on their information need, and we'll share how we leverage LLMs to improve search relevance.
Here are some key statistics for Pinterest search. Every month we handle over six billion searches, with billions of Pins to search over, covering topics from recipes and home decor to travel, fashion, and beyond. Pinterest search is truly global and multilingual: we support over 45 languages and reach more than 100 countries. These numbers highlight the importance of search at Pinterest and why we invest in search relevance to improve the user experience.

This is an overview of how Pinterest search works on the backend. Similar to many recommendation systems in industry, it has query understanding, retrieval, reranking, and blending stages, and finally produces a relevant and engaging search feed. In today's talk we'll focus on the semantic relevance modeling that happens at the reranking stage, and share how we use LLMs to improve search relevance.

Here's our search relevance model, which is essentially a classification model: given a search query and a Pin, the model predicts how relevant the Pin is to that query. To measure this, we use a five-point scale ranging from most relevant to most irrelevant. Now we'll share some key learnings from using LLMs to improve Pinterest search relevance; there are four main takeaways we'd like to go into.

Lesson one: LLMs are good at relevance prediction. Before I present the results, a quick overview of the model architecture: we concatenate the query and the Pin text together and pass them into an LLM to get an embedding. This is called a cross-encoder structure, which better captures the interaction between the query and the Pin. We then feed the embedding from the LLM into an MLP layer to produce a five-dimensional vector corresponding to the five relevance levels. During training, we fine-tune open-source LLMs on Pinterest internal data to better adapt the model to Pinterest content. Here are some results demonstrating the usefulness of LLMs. As a baseline we use SearchSage, a Pinterest in-house content and query embedding. Looking at the table, you can see the LLM substantially improves relevance prediction, and as we use more advanced LLMs and increase model size, performance keeps improving. For example, the 8B Llama 3 model gives a 12% improvement over the multilingual BERT-based model and a 20% improvement over the SearchSage embedding model. So the lesson here is that LLMs are quite good at relevance prediction.

Lesson two: captions generated by vision-language models, along with user actions, can be quite useful content annotations. To use an LLM for relevance prediction, we need a text representation of each Pin. Here I've listed several features, for example the user-curated boards the Pin has been saved to, or the queries that led to engagement with the Pin.
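Going back to lesson one, a minimal sketch of a cross-encoder relevance head of the kind described: query and Pin text concatenated into one input, pooled, and mapped to five relevance levels. The model name is a generic stand-in (Pinterest fine-tunes their own open-source LLMs), and the pooling choice is an assumption.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

name = "distilbert-base-multilingual-cased"   # stand-in, not Pinterest's model
tok = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

head = nn.Sequential(nn.Linear(encoder.config.hidden_size, 256),
                     nn.ReLU(),
                     nn.Linear(256, 5))       # five-point relevance scale

def relevance_logits(query: str, pin_text: str) -> torch.Tensor:
    # Cross-encoder: both texts in one sequence so attention runs across
    # query and Pin tokens jointly, unlike a two-tower bi-encoder.
    batch = tok(query, pin_text, return_tensors="pt", truncation=True)
    hidden = encoder(**batch).last_hidden_state[:, 0]   # [CLS]-style pooling
    return head(hidden)                                  # shape (1, 5)

print(relevance_logits("summer outfit ideas", "Linen dress, beach style"))
```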
[The recording cuts here to a later talk on serving large ranking models under strict latency budgets, well under a second, around 400-500 milliseconds at best.] There are three levers we can pull to make the model more efficient, improve throughput, and reduce latency: sparsification, moving to smaller models, and quantization.

As explained before, smaller models definitely have better throughput, but our recipe is: go big, then go small. If you start with a small model, it doesn't have enough capacity or reasoning power to solve the complicated tasks we have. So we start with a larger model, this 150-billion-parameter model, and then distill it into smaller models. One part of the recipe is to do the distillation step by step: go from, say, an 8B model to a 3B model to a 1B model, slowly decreasing the model size and distilling each model from the previous one, over and over. That recipe proved much more effective than distilling directly from the 150-billion-parameter model to a 1B-parameter model.

Same thing for pruning. Pruning is a mathematical optimization problem: you want to reduce the number of attention heads in the transformer, or reduce the number of MLPs. Overall, transformer models have proven to be very redundant in how they keep information, so we can prune and remove some of these layers, or reduce the precision of the activations and parameters. However, if you prune very aggressively at the beginning, performance suffers significantly. The recipe here is also gradual: we apply a small amount of pruning, distill into the smaller model, and repeat over and over: more pruning, more distillation. As you can see from this plot, gradual pruning can be effectively lossless, whereas aggressive pruning up front can cost up to 1% in model quality.
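A toy sketch of the stepwise go-big-then-go-small recipe: each student distills from the previous, slightly larger model rather than jumping straight from the biggest to the smallest. The model sizes and the KL-based soft-label loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(width: int) -> nn.Module:
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 5))

def distill(teacher: nn.Module, student: nn.Module, steps: int = 200) -> nn.Module:
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.randn(64, 32)              # stand-in for real training batches
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = student(x)
        # KL divergence between teacher and student output distributions.
        loss = F.kl_div(F.log_softmax(s_logits, -1),
                        F.softmax(t_logits, -1), reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return student

# 150B -> 8B -> 3B -> 1B becomes, in toy form, a chain of shrinking widths
# where each stage learns from the one before it.
teacher = make_model(1024)
for width in (512, 256, 128):
    teacher = distill(teacher, make_model(width))
```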
Another lever is quantization, going to lower precision. We leverage FP8 for activations and model parameters. However, doing FP8 in all layers hurts the quality of the model significantly, so the tool is mixed precision. One important aspect for ranking, recommendation, and prediction tasks in general is that you want the model's prediction, the probability output, to have very good precision. So the LM head at the end of the language model has to be in FP32. If you run it in FP16, BF16, or FP8, the numbers collapse, you lose calibration, and you can no longer distinguish between different recommended items.

The last lever is sparsification. The most expensive part of the transformer is the attention computation, and we can exploit sparsity: not every token needs to attend to every other token. When you know your task, when you know it's recommendation and these are the items in the user's history, you can sparsify rather than have every item attend to every other. The same goes for the items you're scoring: instead of scoring one item at a time, you can score 50 or 500 items in the same pass, but you want to make sure those candidate items don't attend to each other. So you sparsify the attention scores for the output and for the query.

Putting everything together, we see a significant reduction in latency. Over four or five releases, one after another, we were able to reduce latency by 7x while increasing throughput, the number of queries one GPU can handle, by 30x. So we're increasing the amount of work each GPU does while reducing the latency each query sees. These are some of the technical reports and papers we published during this journey to share our lessons learned with the community. And that's the end of our talk, so we have some time to answer questions. Thank you.

Q: Great talk. One question: how did you measure that it doesn't lose generalization power? Obviously you've done a lot of fine-tuning, and you mentioned it works for four or five tasks instead of task-specific models. How do you know it's going to work for the next five tasks?

A: That's a good question. The answer, overall, is having a very comprehensive benchmark set. We have around 50 to 60 benchmarks, some internal, some external. For example, we leverage IFEval to make sure the model still follows instructions well. And as mentioned, some tasks were never part of our training data, and that's how we measure generalization to new domains and use cases.

Q: Thanks for the talk. I'm wondering what a small listing website can use out of the box. Have you heard of NLWeb, which was launched recently by Microsoft? If yes, what are your views on it as a recommendation system?

A: No, I haven't heard of it, sorry about that.

Q: Anything for smaller players, say a real-estate listing website with thousands of listings? What out-of-the-box recommendation models can people start using?

A: I wish such a model existed. That's why we started this work: we're trying to see if we can make it a foundation model that can solve those kinds of problems. I think there's a lot of potential for this to serve use cases beyond the bigger companies, but I don't know of any today. I'll check out NLWeb.

Q: Thank you for the great talk. On the slide where you mentioned multi-item scoring, what does that effectively mean? Does it require multi-step decoding, or is it one step, just processing the logits for multiple items?

A: We didn't want to go into the complications of speculative decoding or the decoding side in general; we wanted everything in the prefill. So all the candidate items are put into the sequence together, but we also wanted to prevent them from attending to each other. We leverage what we call a 4D attention mask, and we actually developed a special kernel in SGLang and vLLM to be able to do that.
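A toy sketch of the masking idea in that answer: candidate items attend to the user history and to themselves, but not to each other, so all candidates are scored independently in one prefill pass. This hand-builds an additive mask for illustration; the custom SGLang/vLLM kernels, and the exact mask layout they use, are not shown here.

```python
import torch

H, C = 6, 4                    # history tokens, candidate items
L = H + C
mask = torch.full((L, L), float("-inf"))
mask[:, :H] = 0.0              # every position may attend to the history
idx = torch.arange(H, L)
mask[idx, idx] = 0.0           # each candidate also attends to itself only

# Broadcast to (batch, heads, L, L) — the "4D" mask — and add to attention
# scores before softmax, zeroing out candidate-to-candidate attention.
scores = torch.randn(1, 2, L, L) + mask
attn = torch.softmax(scores, dim=-1)
```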
So now, when you have up to 500 items in your query segment, those items don't attend to each other; they only attend to the user history and user profile information.

Q: Great talk. User history means many things: all the jobs someone has applied to, the job postings, so many entities. The model's context can get quite large. How did you manage that? Did you compress it, or were there parts you focused on?

A: We experimented with a lot of things. We experimented with a RAG system, so that given a query we'd retrieve the closest items from your history. We also experimented with chronological order and some form of weight decay over the chronological order. It turns out that for the majority of our applications, plain chronological order is good enough, and that kind of makes sense, because recommendation systems are very biased toward freshness: the most recent user activity helps most. One of the biggest challenges has actually become a more traditional LLM problem: how do you balance the distribution of positives and negatives within the context? That becomes an ML engineering effort: how many positives versus negatives, how much information to put in the context.

[Second speaker] I can add one more thing: there's another complication when you serve these models. You don't want to break the KV caching you rely on in serving, so anything smarter than plain chronological order gets more complicated and cumbersome. It needs to be designed carefully; it's not obvious.

Q: One more question. You did so many experiments and tried so many things. How is your whole system set up? You mention quantization, but you must have tried different forms of quantization and so on. How do you set up the system so you can run many experiments and see what works best?

A: One thing we held a very high bar for was automation. Our system is automated to the extent that when you run an experiment, the results are pushed automatically into a spreadsheet. With such an automated system, developers become very efficient: if you want to try a different quantization, you just change the quantization parameters and everything else happens end to end at the push of a button. Automation is key if you really want to optimize these models.

Q: Did you build all of that automation in-house?

A: Mostly, yes, but we also leveraged open-source tools, for example Lightning and vLLM, making sure they integrate well with each other and optimizing the entire flow.

Thank you again. We'll come back after lunch at two, so I'll see you back here.

[After lunch] All right, I hope everyone has had a good lunch and good discussions in the hallways.
The hallway discussions, and the after-talk discussions, are the best; that's where you get great alpha from our speakers. Continuing from the previous sharing on Netflix's foundation model, next we have Yu, staff research scientist at Netflix, who will share their big bet on building a foundation model for personalization. Yes, please.

All right, good afternoon. Thank you, Eugene, for the introduction. Today I'm going to share our big bet at Netflix on personalization, namely using one foundation model to cover all the recommendation use cases.

At Netflix we have diverse recommendation needs. This is an example homepage for one profile on Netflix: a 2D layout of rows and items. The diversity comes at at least three levels. The first level is rows: we have diverse rows, genre rows (for example comedies or action movies), rows about new, trending, just-released titles, and rows about titles only available on Netflix. The second dimension is the items or entities: in addition to the traditional movies and TV shows, we now have games and live streaming, and we're going to add more, so our content space is expanding to very heterogeneous content types. The third level is the page: we have the homepage, the search page, the kids homepage, which is tailored very differently toward kids' interests, and the mobile feed, which is a linear layout rather than a 2D one, and so on. The pages are also very diverse.

What happened traditionally is that these needs naturally led to many specialized models developed over the years. Some models rank videos, some rank rows, some focus on shows a user has not watched yet, some focus on shows a user is already engaging with. Many of those models were built independently over the years; they may have different objectives, but they have a lot of overlap as well. Naturally this led to duplication, in label engineering as well as feature engineering. Take feature engineering as an example: we have this very commonly used fact data about user interaction history. The fact data is the same, but over the years many features were derived from it: counts of different actions, counts of actions within various time windows or along other slice-and-dice dimensions, similarity between the titles in the user's history and the target title, or simply the sequence of unique show IDs used as a sequence feature into the model. The list goes on and on; because these features were developed independently for each model, they have slight variations while being largely similar, and became very hard to maintain.

So the challenge back then was: is this scalable? Obviously not. If we keep expanding our landscape of content types and business use cases, it isn't manageable to spin up new models for each individual use case. There's not much leverage.
There are some shared components for building features and labels, but by and large each model was spun up independently, and that also hurt our innovation velocity: you don't reuse as much as you could; instead you spin up new models pretty much from scratch. This was the situation about four years ago, at the beginning or middle of the pandemic. The question we asked then was: can we centralize the learning of user representations in one place? The answer is yes, and we had this key hypothesis about a foundation model based on the transformer architecture. Concretely, two hypotheses. One: through scaled-up semi-supervised learning, personalization can be improved; the scaling law applies to recommendation systems as it applies to LLMs. Two: by integrating the foundation model into all systems, we create high leverage; we can simultaneously improve all the downstream member-facing models. We'll see in the following slides how we validated those hypotheses.

I'll break the overview into two parts: first data and training, then applications and serving.

Starting with data: a very interesting aspect of building such an autoregressive transformer foundation model is that there are many analogies with LLMs, but also differences, so we can transfer a lot of learnings and inspiration from LLM development. Start from the very bottom layer, data cleaning and tokenization. People who work with LLMs understand that tokenization decisions have a profound impact on model quality. Although it's the bottom layer, the decisions made there percolate through all the downstream layers and manifest as model-quality problems, or gains. This applies to the recommendation foundation model as well, with some important differences. Instead of a language token, which is just one ID, each token here is an interaction event from the user, and that event has many facets or fields. It's not just one ID; there's a lot of rich information about the event, and all of those fields can play a role in the tokenization decision: what granularity to tokenize at, and how to trade that off against the context window, for example. Through many iterations we reached, I think, the right abstractions and interfaces that let us adjust tokenization for different use cases. For example, one version of tokenization is used for pre-training, and when fine-tuning against a specific application we apply a slightly different tokenization.
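A sketch of what event-level tokenization might look like. The fields and the granularity rule here (merging consecutive interactions with the same title into one token) are illustrative assumptions about the kind of decision described, not Netflix's actual scheme.

```python
from dataclasses import dataclass

@dataclass
class Event:
    ts: int           # when: timestamp
    country: str      # where: locale
    device: str       # where: device type
    entity_id: str    # what: the title interacted with
    action: str       # what: play, pause, add_to_list, ...
    duration_s: int   # what: how long the interaction lasted

def tokenize(events: list[Event]) -> list[Event]:
    # One possible granularity choice: collapse consecutive events on the
    # same title into a single token, trading per-event detail for a longer
    # effective context window.
    out: list[Event] = []
    for e in events:
        if out and out[-1].entity_id == e.entity_id:
            out[-1].duration_s += e.duration_s
        else:
            out.append(e)
    return out
```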
Moving up from the tokenization layer, we come to the model layers. At a high level, from bottom to top, we go through the event representation, the embedding layer, the transformer layers, and the objective layer.

Event representation, as we just touched upon: there's a lot of information in an event. At a high level you can break it down into when, where, and what. When the event happened is about time encoding. Where it happened covers physical location, your locale or country, but also the device, and the canvas: which row, which page the action happened on. What is about the target entity, the title you interacted with, what the interaction was, how long it lasted, and any other information associated with the action. That's where we need to decide what information to keep and what to drop.

Moving one layer up, the embedding and feature transformation layer. One thing to point out is that for recommendation, we need to combine ID embedding learning with other semantic content information. If you only have ID embeddings learned from scratch, you have a cold-start problem: for titles the model hasn't seen during training, it doesn't know what to do at inference time. So we need semantic content information to complement the ID embeddings. This isn't a problem for LLMs, but cold-start is very commonly encountered in recommendation systems.

For the transformer layer, there's no need to go deep into architecture choices and optimizations. The only thing I want to point out is that we use the hidden-state output from this layer as our user representation; one of the primary goals of the foundation model is to learn a good long-term user representation. That raises things to consider: how stable is the user representation, given that the user's interaction history keeps changing? How do we maintain the stability of that representation, and what kind of aggregation should we use? Broadly, you can aggregate across the time or sequence dimension, or across the layers (you have multiple self-attention layers; how do you aggregate them?). And lastly, do we need explicit adaptation of the representation, fine-tuning it for a downstream objective?

Then we move to the very top layer, the objective or loss function. This is also very interesting, in that it's much richer than for an LLM. First, instead of one output sequence we use multiple sequences: you have a sequence of entity IDs, which gives you next-token prediction with a softmax or sampled softmax, but then every other facet or field of each event can also be used as a target. It could be the action type; it could be aspects of the entity's metadata, like entity type, genre, or language; and it could be aspects of the action itself, like predicting the duration, the device, or the time of the next play. Those are all legitimate targets or labels; depending on the use case, you can use them for fine-tuning. You can cast this as a multi-task learning problem with multiple heads or hierarchical prediction, but you can also use those fields as weights, rewards, or masks on the loss function, to adapt the model by zooming into one subcategory of user behavior you want it to learn. Okay, that's what I wanted to cover on model architecture.
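A toy sketch of that multi-target objective: one backbone hidden state feeds a next-title head plus auxiliary heads for other facets of the event. The vocabulary sizes and loss weights are illustrative, not Netflix's.

```python
import torch
import torch.nn as nn

d, n_titles, n_actions = 256, 50_000, 8
hidden = torch.randn(32, d)            # transformer hidden state = user repr

next_title = nn.Linear(d, n_titles)    # next-entity prediction (sampled softmax
action_type = nn.Linear(d, n_actions)  # in practice); auxiliary: action type
duration = nn.Linear(d, 1)             # auxiliary: play duration (regression)

ce = nn.CrossEntropyLoss()
title_y = torch.randint(0, n_titles, (32,))
action_y = torch.randint(0, n_actions, (32,))
dur_y = torch.rand(32, 1)

# Multi-task loss: facets of the next event act as extra supervision, and
# the weights (or masks) let you zoom into one subcategory of behavior.
loss = (ce(next_title(hidden), title_y)
        + 0.3 * ce(action_type(hidden), action_y)
        + 0.1 * nn.functional.mse_loss(duration(hidden), dur_y))
loss.backward()
```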
So, does it scale? The first question, part of the first hypothesis, is whether the scaling law applies, and I think the answer is yes. Over roughly two to two-and-a-half years of scaling up, we constantly kept seeing gains, going from a few million or ten million profiles to, now, on the order of one billion model parameters, scaling up the data accordingly. We stopped there; we could keep going, but as you may realize, recommendation systems have much more stringent latency and cost requirements, so scaling up further also requires distilling back down. Still, I think this is certainly not the end of the scaling law.

Before wrapping up the data and training discussion, I'd like to highlight some learnings we borrowed from LLMs. This is not an exhaustive list, but here are the top three most interesting to me. One is multi-token prediction, which you may have seen in the DeepSeek paper and elsewhere. Implementation-wise you can use multiple heads or multiple labels, different flavors, but the goal is to force the model to be less myopic and more robust to the shift between training and serving time (there's a time gap between the two), and to target long-term user satisfaction and behavior instead of just the next action. We observed a very notable metric improvement from doing this. The second is multi-layer representation, which I touched on for the profile representation: techniques translated from the LLM side like layer-wise supervision, self-distillation, and multi-layer output aggregation, with the goal of a better and more stable user representation. Lastly, and this should be no surprise, long-context-window handling: from truncated sliding windows to sparse attention to progressively training on longer and longer sequences, and eventually all the parallelism strategies. This is about more efficient training that maximizes learning.
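A toy sketch of multi-token prediction in this setting, assuming the simple multi-head flavor: separate heads predict the items one, two, and three steps ahead from the same hidden state, pushing the model beyond the very next action. Horizon and sizes are illustrative.

```python
import torch
import torch.nn as nn

d, vocab, horizon = 256, 50_000, 3
hidden = torch.randn(32, d)                    # per-position hidden state
heads = nn.ModuleList(nn.Linear(d, vocab) for _ in range(horizon))

# targets[k] holds the item ID k+1 steps ahead of the current position.
targets = [torch.randint(0, vocab, (32,)) for _ in range(horizon)]

ce = nn.CrossEntropyLoss()
# Averaging losses over the horizon supervises the model on several future
# interactions, not only the next one, making it less myopic.
loss = sum(ce(h(hidden), t) for h, t in zip(heads, targets)) / horizon
loss.backward()
```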
Okay, shifting gears to serving and applications. Before the foundation model (FM), this was roughly the algorithm stack we had for personalization: many data sources, many features, many models independently developed, each serving one or more canvases or applications. Now, with the foundation model, we've largely consolidated the data and representation layer, especially the user representation and the content representation in the personalization domain, and the modeling layer as well, because each application model is now built on top of the FM and becomes a thinner layer, instead of a standalone, full-fledged model trained from scratch.

How do the various models utilize the foundation model? There are three main approaches, or consumption patterns. First, the foundation model can be integrated as a subgraph within the downstream model; additionally, the content embeddings learned by the foundation model can be integrated as embedding lookup layers. The downstream model is a neural network that may already have its own sequence transformer tower or graph, and you can directly replace that with the pre-trained foundation-model subgraph. Second, we can push out embeddings, both content and entity embeddings as well as member embeddings; no surprise there. The main concern, of course, is how frequently to refresh the member embeddings and how to keep them stable as we push them to the centralized embedding store. This allows far wider use cases than personalization alone, because analysts and data scientists can also fetch those embeddings directly for whatever they need. Finally, users can extract the model and fine-tune it for specific applications, either fine-tuning directly or distilling to meet online serving requirements, especially for applications with very strict latency requirements.

To wrap up, I want to show at a high level the wins we've accumulated over the last year and a half by incorporating the FM in various places. The blue bars represent how many applications have the FM incorporated; the green bars represent AB test wins, because any application may have multiple AB tests running. So we do indeed see high leverage from the FM, bringing both AB test wins and infrastructure consolidation. I think the big bets are validated: it's a scalable solution, both in scaling up the model with improved quality and in consolidating the whole infrastructure so that scaling to new applications is much easier. It's high leverage because learning is centralized, and innovation velocity is faster, because newly launched applications can directly fine-tune the foundation model to launch their first experience.

Current directions: one, universal representation for heterogeneous entities. As you can guess, this is semantic IDs and ideas along those lines, because we want to cover Netflix expanding into very heterogeneous content types. Two, generative retrieval for collection recommendation: instead of recommending a single video, be generative at inference and serving time; with multi-step decoding, a lot of considerations like business rules or diversity can be handled naturally in the decoding process. Lastly, faster adaptation through prompt tuning, also borrowed from LLMs: can we train some soft tokens so that at inference time we can swap those soft tokens in and out to prompt the FM to behave differently? That's a very promising direction we're getting into.

All right, that concludes my talk. Thank you for your attention. If you have any questions, may I invite you to come to the mic while we get our next speakers ready?

Q: Hi, thank you for the talk. Since you have billions of users, beyond the recommendation system, maybe the model can do much more, right? What are your thoughts on that, since I could just ask it to predict who's the next president of the United States?

A: Could you explain a little what you mean by beyond recommendation? Do you mean other kinds of personalization, or other things?

Q: Well, you've captured users' preferences, and those preferences also drive what they buy, or who they'll vote for as the next president.
So do you think your foundation model has the capability to expand beyond recommendation: not only what videos people want to watch, but what else they like, or their opinions on anything else?

A: Yes. I think we are expanding to different entity types, and also capturing users' tastes both on and off our platform. I think that's the general trend we're heading toward.

Q: Great, thank you, this was really helpful. A question you may not be able to answer fully for IP reasons, but share whatever you can: thoughts on graph models? I didn't hear much about graphs in your talk, or reinforcement learning. Any utilization there, any benefits or boosts in performance and accuracy?

A: That's a good question. We actually have a dedicated sub-team doing graph models, especially around our knowledge graph, to cover the content space both on and off our platform, across the entire entertainment ecosystem. We use a lot of embeddings from the graph model for cold-start; the semantic embeddings I showed come from there. As for reinforcement learning: yes as well, especially where we consider sparse rewards. The rewards from user actions are pretty sparse, but we want to use them to guide, for example, how we generate a whole collection; that's where we need to consider how to use those rewards to guide the process. I'll be around, so we can also follow up.

Q: Do you also use these unified representations as features to models?

A: Yes. The embeddings learned within our model are exposed for downstream models to consume directly. And to train our unified embeddings, we also consume some upstream embeddings, for example the GNN embeddings.

Q: For these embeddings, are you just using metadata about the video to understand what users like, or are you actually using frame-by-frame or clip-level signals?

A: Not yet. We do have that from other content groups in our organization, and I think the trend will go there, but we're not yet at the granular clip or frame level. We have those embeddings but haven't incorporated them quite yet.

Thank you. Please, another round of applause. So next we have Jastri and Mesh, director of machine learning and staff machine learning engineer, respectively, from Instacart. They're going to share how they use LLMs to improve search and discovery at Instacart.

Hi, good afternoon everyone. My name is Mesh, and we are part of the search and machine learning team at Instacart. Today we'd like to talk to you about how we are using LLMs to transform search and discovery. First, a little about ourselves: as I mentioned, we're part of the search and discovery ML team at Instacart. For those of you who may not be familiar with Instacart, it's the leader in online grocery in North America, and our mission is to create a world where everyone has access to the food they love and more time to enjoy it together. As for what we'll cover today: first, the importance of search in grocery e-commerce; then some of the challenges facing conventional search engines;
and then we'll get to the meat of today's talk, which is how we are using LLMs to solve some of these problems. Finally, we'll finish with some key takeaways.

On the importance of search in grocery e-commerce: I think we've all gone grocery shopping. Customers come with long shopping lists, and it's the same on the platform: people are looking for tens of items. Of these, a majority are restocking purchases, that is, things the customer has bought in the past, and the remainder are items the user is trying for the first time. And a majority of these purchases come from search. So search has a dual role: it needs to help the customer quickly and efficiently find the product they're looking for, and it also needs to enable new product discovery. New product discovery isn't just important for the customer; it's also great for our advertisers, because it helps them showcase new products, and it's good for the platform, because overall it encourages larger basket sizes.

So let's look at some problems with our existing setup that make this hard. To begin with, we have two classes of queries that are generally more challenging, especially from an e-commerce perspective. The first is overly broad queries, like the "snacks" query on the left, where there are tons of products that map to the query. Because our models are trained on engagement data, if we aren't exposing these products to the user, it's hard to collect the engagement data needed to rank them highly: the traditional cold-start problem, in a way. Then, as you can see from the query on the right, we have very specific queries like "unsweetened plant-based yogurt", where the user is looking for something very specific. These queries don't occur very frequently, which means we just don't have enough engagement data to train the models on. And while we have done quite a bit of work to improve this, the challenge we keep facing is that while recall improves, precision is still a challenge, especially in a pre-LLM world.

The next class of problems is how to actually support new item discovery, as we discussed. When a customer walks into a grocery store, say into the pasta aisle, they might see new brands of pasta they'd want to try. Along with that, they'd also see pasta sauce and everything else needed to make a bowl of pasta. Customers want a similar experience on our site. We have heard multiple rounds of feedback from customers along the lines of: "I can find the product I want via search, but when I try to find related products, it's a bit of a dead end; I need to make multiple searches to get where I want." So this was a problem we wanted to solve as well. And as I mentioned, pre-LLMs this was a hard problem because of the lack of engagement data.

So let's see how we actually used LLMs to solve these problems. I'll talk specifically about how we used LLMs to uplevel our query understanding module. Query understanding, as I'm sure most of you know, is the most upstream part of the search stack,
and very accurate outputs from it are needed to enable better retrieval and recall, and ultimately to improve our ranking results. Our query understanding module contains multiple models: query normalization, query tagging, query classification, category classification, and so on. In the interest of time, I'll pick a couple of these models and talk about how we really improved them.

The first is our query-to-product-category classifier. Essentially, we take a query and map it to a category in our taxonomy. As an example, a query like "watermelon" maps to categories like fruits, organic fruits, and foods. Our taxonomy has about 10,000 labels, of which about 6,000 are more commonly used. Because a query can map to multiple labels, this is essentially a multi-label classification problem. In the past we actually had a couple of different traditional models. One was a FastText-based neural network, which essentially modeled the semantic relationship between the query and the category. As a fallback, we had an NPMI model, a statistical co-occurrence model between the query and the category. While these techniques worked well for head and torso queries, we had really low coverage for tail queries, because, again, we just didn't have enough engagement data to train on. To be honest, we also tried more sophisticated BERT-based models, and while we did see some improvement, given the lack of engagement data, the gains didn't justify the increased latency.

So this is where we tried an LLM. First, we took all of our queries, fed them into an LLM along with the taxonomy, and asked it to predict the most relevant categories for each query. The output that came back was actually decent: when we reviewed it, it made a lot of sense. But when we ran an online A/B test, the results weren't as great. One example that illustrates the point very well is a query like "protein". Users who come to Instacart and type "protein" are looking for protein shakes, protein bars, or other protein supplements. The LLM, on the other hand, thinks that a user typing "protein" is looking for chicken, tofu, or other protein-rich foods. This mismatch, where the LLM doesn't truly understand Instacart user behavior, was really the cause of the problem.

So to improve the results, we flipped the problem around: we took the most commonly converting categories, the top-k converting categories for each query, and fed those as additional context to the LLM. I'm simplifying a bit; there's a bunch of ranking and downstream validation that happens. But essentially, we generated a set of candidates and ranked them, and this greatly simplified the problem for the LLM. To illustrate with an example, take a query like "Vernors soda". Our previous model identified this as a brand of fruit-flavored soda, which is not incorrect, but it's not very precise either. The LLM did a much better job: it identified it as a brand of ginger ale. And with this, our downstream retrieval and ranking improved greatly as well.
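As an illustration of this flipped setup, here is a minimal sketch, assuming a generic chat-completion client; the model choice, prompt wording, and function names are invented for illustration and are not Instacart's production prompt.

```python
import json
from openai import OpenAI

client = OpenAI()  # any chat-capable LLM endpoint would do

def categorize_query(query: str, top_converting: list[str]) -> list[str]:
    """Map a search query to taxonomy categories, grounding the LLM
    in the categories users actually convert on for this query."""
    prompt = (
        f"Search query: {query}\n"
        f"Top converting categories for this query: {', '.join(top_converting)}\n"
        "Using the context above, return a JSON array of the most "
        "relevant product categories for this query."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns valid JSON; a production system would
    # validate and rerank this output downstream.
    return json.loads(resp.choices[0].message.content)

# e.g. categorize_query("protein",
#                       ["protein bars", "protein shakes", "supplements"])
```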
As you can see from the results below, especially for tail queries we saw a big improvement: our precision improved by 18 percentage points and our recall improved by 70 percentage points, which is actually pretty significant for tail queries. And to look very briefly at our prompt: as you can see, it's very simple. We essentially pass in the top converted categories as context, plus a bunch of guidelines about what the LLM should output, and that's it. That's all that was needed to enable this. Again, I'm simplifying the overall flow, but the general concepts are pretty straightforward.

Coming to another model: the query rewrites model is actually pretty important as well from an e-commerce perspective, especially at Instacart, because not all retailers are created equal. Some have large catalogs; some have very small ones. The same query may not always return results, and that is where a rewrite is really helpful. For example, going from a query like "1% milk" to just "milk" will at least return results the customer can decide to buy or not. Again, our previous approach, which was trained on engagement data, did decently on head and torso queries but suffered from the lack of engagement data on tail queries. By using an LLM, similar to what we did for the product category classifier, we were able to generate very precise rewrites. In the example here, you can see a substitute, a broad, and a synonymous rewrite: for "avocado oil", a substitute is olive oil, a broader rewrite is healthy cooking oil, and a synonymous rewrite is avocado extract. Looking at the results, we saw a bunch of offline improvements, and here, using third-party LLMs, just going from simpler models to better models improved the results quite a bit; this is based on our human evaluation data. So improving the model itself improved the overall performance of the task. In terms of online improvements, we saw a large drop in the number of queries without any results. This is pretty significant, because we could now show results to users who previously saw empty results, which was great for the business.

Coming to the important part, which is how we actually scored and served the data: Instacart has a pretty idiosyncratic query pattern. There's a very fat head and torso set of queries, and then a long tail. By precomputing the outputs for all of the head and torso queries offline in batch mode, we were able to cache all of this data, and online, when a query comes in, we just serve it from the cache with very low impact on latency, falling back to our existing models for the long tail of queries. This worked really well because it didn't impact latency while greatly improving coverage. Now, for the really long tail, where I said we fall back to our existing models, we're actually trying to replace those with a distilled LLM so that we can do a much better job compared to the existing models.
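That serving pattern is easy to sketch. A minimal version, with a plain dict standing in for the offline cache and a stub for the legacy model:

```python
from typing import Dict, List

# Offline, the LLM's rewrites for head/torso queries are computed in
# batch and loaded into a cache; a dict stands in for the real store.
llm_rewrite_cache: Dict[str, List[str]] = {
    "1% milk": ["milk", "2% milk", "skim milk"],  # illustrative entry
}

def legacy_rewrite_model(query: str) -> List[str]:
    """Stand-in for the existing engagement-trained rewrite model."""
    return [query.split()[-1]]  # e.g. broaden to the last token

def get_rewrites(query: str) -> List[str]:
    normalized = query.strip().lower()
    if normalized in llm_rewrite_cache:      # head/torso: cached LLM output
        return llm_rewrite_cache[normalized]
    return legacy_rewrite_model(normalized)  # long tail: model fallback
```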
So to summarize: from a query understanding perspective, we have a bunch of models, and just using our hybrid approach greatly improved their performance. But what's actually more interesting is that today query understanding consists of a bunch of models, and as we heard in the Netflix talk, managing all of these models is complex from a systems perspective. Consolidating all of them into an SLM, or maybe a large language model, can make the results a lot more consistent. I'll finish with an example. There's a query, "hum" (spelled h-u-m), where we saw some interesting issues: our query brand tagger correctly identified it as a brand of kombucha, but our spell corrector unfortunately corrected it to "hummus", so the results were really confusing to users, which was pretty bad. With a more unified model, the results were much better. The second benefit is that by using an LLM for query understanding, we can pass in extra context. Instead of just generating results for the query in isolation, we can really try to understand what the customer's mission is: for example, detect whether they're actually here to buy ingredients for a recipe, and then generate content for that. To talk more about that, I'll hand over to Tejaswi. Thank you.

Now I'll quickly talk about how we used LLMs to show more discovery-oriented content on the search results page. Just to restate the problem: our users found that while our search engine was very good at showing exactly the results they wanted, once they added an item to the cart they couldn't do anything else useful with the search results page. They either had to do another search or go to another page to fulfill their next intent. Solving this with traditional methods would require a lot of feature engineering and manual work; LLMs solved this problem for us, and I'll talk about how.

This is how it looked in the end. For queries like "swordfish" where, say, there are no exact results, we used LLMs to generate substitute results: other seafood alternatives, meaty fish like tilapia, and so on. Similarly, for queries like "sushi" with a lot of exact results, at the bottom of the search results page we would show things like Asian cooking ingredients or Japanese drinks, in order to get users to engage. I'll get to the techniques, but both of these discovery-oriented result types led to improvements in engagement as well as revenue per search.

Before the techniques, let's talk about the requirements for generating such content. First, we obviously wanted to generate content that is incremental to the current results; we don't want duplicates of what we are already showing (see the sketch after this paragraph). The second requirement, and the most important one, is that we wanted all of the LLM generations to be aligned with Instacart's domain knowledge. What does this mean? If a user searches for a query like "dishes", the LLM should understand that it refers to cookware and not food, and vice versa for a query like "Thanksgiving dishes", right?
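A minimal sketch of that first requirement, the incrementality filter; the matching logic here is a naive stand-in (a real system would more plausibly use embedding similarity):

```python
def incremental_only(generated: list[str], already_shown: list[str]) -> list[str]:
    """Drop LLM suggestions that duplicate what the results page
    already shows, keeping only incremental content."""
    shown = [s.lower() for s in already_shown]
    return [g for g in generated
            if not any(g.lower() in s or s in g.lower() for s in shown)]

# e.g. incremental_only(["soy sauce", "sushi rice", "salmon sushi"],
#                       ["salmon sushi 12 ct"])
# -> ["soy sauce", "sushi rice"]
```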
With these requirements in mind, we started with a very basic generation approach. We took the query and told the LLM: "You are an AI assistant, and your job is to generate two shopping lists: a list of complementary items and a list of substitute items for a given query." The results looked pretty good. Our PMs vetted everything; we looked at everything. But as Vinesh said, when we launched this to our users, the results were good but users weren't engaging with them as much as we would have liked. So we went back to the drawing board and tried to analyze what was going on, and what we quickly realized was that while the LLM's answers were common-sense answers, they weren't really what users were looking for. Taking the protein example again: when users search for "protein", they look for protein bars and protein shakes, rather than what the LLM would give us, which is chicken, turkey, tofu, and so on.

So what we did was augment the prompt with Instacart domain knowledge; a sketch of what this looks like follows below. In one case, we took the query and augmented it with the top converting categories for that particular query, along with any annotations from the query understanding models: here is a brand present in the query, here is a dietary attribute present in the query, and so on. In another case, we said: here is the query, and here are the subsequent queries users issued after this one. Once we augmented the prompt with this additional metadata about how Instacart users behave, the results were far better. I don't have time to show the before and after, but as I said, we definitely saw a huge improvement in both engagement and revenue.

I'll quickly talk about how we served all of this content. Very similar to query understanding, it's impractical to call the LLM in real time, because of latency and sometimes cost concerns. So we took all of our historical search logs, called the LLM in batch mode, and stored everything: the query, the content metadata, and even the products that could potentially show up in the carousel. Online, it's just a very quick lookup from a feature store, which is how we were able to serve all of these recommendations quickly.

Again, things weren't as simple as we're making them out to be. As Vinesh said, the overall concept is simple, and the prompt itself is very simple, but there were three key challenges we solved along the way. One was aligning generation with business metrics like revenue; this was very important for seeing topline wins, so we iterated over the prompts and the kind of metadata we fed to the LLM to achieve it. Second, we spent a lot of time improving the ranking of the content itself: our traditional pCTR and pCVR models did not work, so we had to employ strategies like diversity-based re-ranking to get users to engage with the content.
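Here is a minimal sketch of what such an augmented prompt might look like; all field names and the prompt wording are illustrative, not Instacart's actual schema.

```python
def build_discovery_prompt(query: str,
                           top_converting: list[str],
                           annotations: dict[str, str],
                           follow_up_queries: list[str]) -> str:
    """Compose a generation prompt augmented with behavioral context
    about how users actually interact with this query."""
    parts = [
        "You are an AI assistant generating two shopping lists for a "
        "grocery search query: complementary items and substitute items.",
        f"Query: {query}",
        f"Top converting categories: {', '.join(top_converting)}",
    ]
    if annotations:  # e.g. brand or dietary attributes from query understanding
        parts.append("Query annotations: " +
                     ", ".join(f"{k}={v}" for k, v in annotations.items()))
    if follow_up_queries:  # what users searched for next
        parts.append("Subsequent queries: " + ", ".join(follow_up_queries))
    return "\n".join(parts)

# e.g. build_discovery_prompt("protein",
#          ["protein bars", "protein shakes"],
#          {"dietary_attribute": "high protein"},
#          ["protein powder", "pre workout"])
```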
And the third challenge was evaluating the content itself: making sure that whatever the LLM generates is, first, correct and not hallucinated, and second, adheres to what Instacart needs as a product.

So, summarizing the key takeaways from our talk: the LLM's world knowledge was super important for improving query understanding predictions, especially for tail queries. While LLMs were super helpful, we really found success by combining Instacart's domain knowledge with the LLMs in order to see the topline wins we saw. And the third and last takeaway: evaluating the content and the query understanding predictions was far more important and far more difficult than we anticipated. We used LLM-as-a-judge to make this happen; it is a very important step, and we realized that kind of late. That's all from us. We'll take questions now. [Applause]

Thank you, Tejaswi and Vinesh. We'll take questions at the mic while the next speaker gets set up.

Hi, thanks for the talk. Have you also been experimenting with queries that are very long natural language, like "I want these three items and these five items", the way we would ask ChatGPT? Or is a single item still the focus?

Yeah, we have actually launched something like that in the past, called Ask Instacart, if you've heard of it, which essentially takes natural language queries and tries to map them to search intents. For example, you might say "healthy foods for a three-year-old", and that would map to things like fruit slices, and whether three-year-old toddlers can eat popcorn, something along those lines. And then our usual recall and ranking stack retrieves those results.

Any learnings from that experiment?

Yeah, we actually have a lot of learnings from that. Essentially, as we already mentioned, we need to inject a lot of Instacart context into the model to get decent results. The evaluation part is really key, so having a robust automated evaluation pipeline was important. And lastly, passing context downstream matters: for example, if it's a Mother's Day query and we come up with an individual search intent like "perfumes", you really want women's perfumes in there, whereas when we just passed "perfumes", you could see all kinds of items. So passing that context from the LLM to the downstream systems is really important. We have a lot of examples like that; happy to talk after. Thanks.

I'm sorry, I don't think we have time for any more questions, but we have our last speaker, and right after that the speakers will be hanging around. Thank you again, Tejaswi and Vinesh. [Applause]

Finally, we have Devansh, principal product manager at Google. He'll be sharing how they adapted a Gemini checkpoint for YouTube recommendations. He'll also share how they built the semantic IDs, which we've heard so much about in this track, that distill multimodal features into tokens. Take it away.

Wonderful. I guess while we get the slides up, I can introduce myself. I'm Devansh, a product manager at YouTube. I've been working on recommendation systems at Google for a while, and built a new recommendation system across DeepMind and YouTube that uses Gemini to recommend YouTube videos.
I want to take you through the process of how we built that, with a lot of examples, and share a recipe for how you might build this kind of LLM-based recommendation system. I'll just get started without some of the slides.

Here's why I think this is important. There's a lot of attention on how LLMs are going to transform search: Google Search is having a revolution, ChatGPT has a big chat interface, Perplexity is a product a lot of people use. But I think recommendation is probably a bigger problem that is underhyped, because it's kind of transparent to the user. And I think the application of LLMs to recommendations is going to be a bigger consumer application than search.

In terms of my talk, I want to introduce the problem of YouTube recommendations, then talk about how we've built large recommender models: how we're adapting Gemini for YouTube, how we build semantic IDs and how we're using them, and then end with a recipe for how you might use an LLM to make a recommendation system.

To start with why this is important: who here watches YouTube every day? It's one of the biggest consumer apps in the world, and a large majority of the watch time on YouTube is driven by the recommendation system. We serve recommendations across Home and Watch Next, we have a big Shorts product, and even a lot of our search results are personalized in some way. So if you think about consumer applications of LLMs, in terms of consumer engagement and impact, recommendation is going to be a much bigger application than search. And this is true of any consumer app with a billion daily actives.

The way I think about the recommendation problem is that you're trying to learn a function that takes a user and their context as input and gives them a set of recommendations. At YouTube we have a bunch of user information, like demographics: their age, their gender, where they're located. We have a lot of context about them: what are the last hundred videos they watched, how deeply did they engage with them, what did they comment on, who are they subscribed to. And we use all of that to make video recommendations. We've tried a lot of different modeling techniques here: multi-headed rankers, embedding models, sequence-to-sequence transformers; there's a long history. About two years ago, we started thinking about how we could rethink this recommendation system on top of Gemini, which has been making incredible progress in modeling. How can we adapt that for YouTube? And so we built this system, which we call LRM, the large recommender model, where we adapt Gemini for recommendations.

The way we do this is: we start with a base Gemini checkpoint, and then we adapt it for YouTube recommendations, teaching it a lot of information about YouTube, to get a unified YouTube-specific checkpoint of Gemini, which we call LRM. Then we can align it for different recommendation-related tasks, like retrieval and ranking, and basically make a small custom version of this model for each of the major recommendation surfaces. This is a model we have had launched in production at YouTube for a while on the retrieval side, and we're experimenting a lot on the ranking side.
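As a rough illustration of this adapt-then-specialize pattern, here is a sketch using parameter-efficient fine-tuning on an open-weights model. Gemini checkpoints aren't publicly tunable, and the talk does not say YouTube uses LoRA; this only shows the general shape of making small task-specific variants of one shared checkpoint.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Open-weights stand-in for the adapted base checkpoint.
base = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# One small adapter per recommendation task or surface (retrieval,
# ranking, ...), leaving the shared base untouched.
retrieval_adapter = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
retrieval_model = get_peft_model(base, retrieval_adapter)
retrieval_model.print_trainable_parameters()  # a tiny fraction of the base
```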
I want to start by explaining how we built this YouTube-Gemini model, and then we'll talk about how we use it for retrieval. The first step for this kind of model is to develop a way to tokenize videos. With an LLM, when you give it an input, it tokenizes the text and then predicts the next text token. The ideal product we wanted was to give the model a number of video tokens as input and get video tokens out that would be good recommendations. We had to build this because, even with a million tokens of context, when you want to reason over many videos you have to compress the video representation in some way. Before settling on this approach, we tried a bunch of other things, like predicting search queries and retrieving videos through them, or trying to recommend videos directly, and those solutions just weren't good enough.

So we built semantic ID, which we actually wrote a paper about last year; it was presented at RecSys. The way semantic ID works is that you take a video and extract a number of features out of it: the title, description, transcript, even the audio and video frame-level data. You put all of that into a multi-dimensional embedding, and then you quantize it using RQ-VAE to give every video a sequence of tokens (a sketch of the quantization step follows below). We've written a pretty detailed paper about this if people are interested, but at a high level, the way I think about it is that we're making the atomic units for a new language of YouTube videos. Once we have these tokens, you can imagine the whole corpus of billions of videos on YouTube getting organized around these semantically meaningful tokens. The first token might represent topics like music, gaming, or sports; within sports you would have the different sports, and then you can get down to volleyball. So two volleyball videos would share some tokens in the prefix but still have unique identifiers. This in itself is, I think, an interesting milestone: moving away from hash-based tokenization to a semantically meaningful one. And we use this in production at YouTube.

What we then did is a process we call continued pre-training, where we take the model and have it understand both English and this new YouTube language. We do this in two big steps. The first is linking text and SID; the second is having it understand sequences of watches and be able to reason across the video space. Some example training tasks we teach the model: take a tennis-highlights video with some semantic ID; you can prompt the model with "this video has title..." and the model learns to output the title. You can imagine very similar tasks for "has creator" or "has topics" and so on; you're basically connecting text and the video tokens. Then we take the corpus of all the YouTube engagement data, all the paths users took through YouTube when they watched videos together. You can prompt the model with "a user has watched the following videos: A, B, C, D", mask some of those videos, and the model learns to predict the masks. Now it starts to understand which videos are watched together and to form relationships between videos on the basis of user engagement.
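A minimal sketch of the residual-quantization mechanic at the heart of semantic IDs. In the real RQ-VAE the codebooks are learned jointly with an encoder; here they are random, purely to show the coarse-to-fine token assignment.

```python
import numpy as np

def residual_quantize(embedding: np.ndarray,
                      codebooks: list[np.ndarray]) -> list[int]:
    """Residual quantization: each codebook level encodes what the
    previous levels missed, yielding a coarse-to-fine token sequence
    (the semantic ID) for one video embedding."""
    tokens, residual = [], embedding.copy()
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)                   # nearest code at this level
        residual = residual - codebook[idx]  # pass the remainder down
    return tokens

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]  # 4 levels x 256 codes
video_embedding = rng.normal(size=64)  # stand-in for the content embedding
semantic_id = residual_quantize(video_embedding, codebooks)
print(semantic_id)  # e.g. a 4-token prefix-structured ID
```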
After a bunch of pre-training tasks like this, we get a really interesting model that can reason across English and YouTube videos. Here's an example from a user's watch history. We find the model can now reason across these videos: you can prompt it with "video one is interesting to tennis fans because it's about Wimbledon; video two is interesting to F1 fans because it's about the Spanish Grand Prix; video three is interesting to math fans because it's about pi; video four is going to be interesting to..." and the model works out that it's interesting to technology fans because it's about AI. And this is based just on the semantic ID definition of a video; it doesn't really have much other information to go on. So this in itself is a very interesting checkpoint that is starting to reason across English and YouTube.

Once we have this model, we think about how to use it for different video recommendation tasks at YouTube, and the first one we focused on is generative retrieval. Here you construct a prompt for every user and see what the model recommends. In this example, the user is a 24-year-old woman in the US on Android. She's watching a highlight video from the Olympics, and she has a watch history of, say, 50 videos and how she engaged with them. You can construct a prompt, like the one on the right, with the user's demographic information and the context video, and have the model decode video recommendations as SIDs. We find this gives really interesting, unique recommendations, especially for our hardest recommendation tasks. In this example, when you're watching this highlight from the Olympics, the production system before LRM would give you other men's track races. With the new model, it's able to find the unique connection between the user's demographics and her past watch history, and surface related women's races that we weren't able to recommend in the past. Especially for users we don't know as much about, we get very interesting, unique recommendations out of this strategy. And so we've experimented with this and launched it in a few places at YouTube.
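A sketch of what constructing such a prompt might look like; the field names and SID formatting are invented for illustration, not YouTube's actual format.

```python
def build_lrm_prompt(demographics: dict, context_video_sid: str,
                     watch_history: list[dict]) -> str:
    """Assemble a generative-retrieval prompt in the spirit of the
    example above: demographics, context video, and watch history."""
    history = " ".join(f"{w['sid']}({w['watch_pct']}%)"
                       for w in watch_history[-50:])  # last 50 watches
    return (
        f"User: {demographics['age']}yo {demographics['gender']}, "
        f"{demographics['country']}, {demographics['device']}.\n"
        f"Watching now: {context_video_sid}\n"
        f"Watch history: {history}\n"
        "Recommend the next videos as semantic IDs:"
    )

prompt = build_lrm_prompt(
    {"age": 24, "gender": "female", "country": "US", "device": "Android"},
    "SID<sports.track.olympics.4821>",
    [{"sid": "SID<sports.track.sprint.0077>", "watch_pct": 95}],
)
# The model would then decode SIDs, which are mapped back to videos.
```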
The big findings are that LRM is a very powerful model, but it's really expensive to serve. It learns very quickly and is very training-data efficient, and it handles our toughest recommendation tasks, but the biggest limitation was that the serving costs were too high, especially at the scale YouTube operates, with billions of users. So after we got our first experiments working, we spent a lot of time just reducing the TPU serving cost, and we got 95%-plus cost savings to be able to actually launch this in production.

One other strategy we used, which I think is kind of interesting, is that we tried to turn this into an offline problem. It's the same prompt and the same model; we just remove the personalized aspects of the prompt, and we build an offline recommendations table: if you're watching video A, what are the candidate videos that would be good to watch next? Normally these unpersonalized recommendation models just don't hold a candle to a personalized recommender, but because LRM is trained from a really big checkpoint, it actually gives us some differentiated recommendations. In the YouTube context, we can take our corpus of billions of videos, look at the head, which represents a lot of the watch time, do offline inference, make this offline recommendations table, and then just do a simple lookup to serve recommendations. This was a complete way around our serving problems; a sketch of the pattern follows below.

I want to talk a bit about the challenges for YouTube, because in some ways making an LLM-based recommendation system is harder than training an LLM. One of the big differences is the vocabulary and the size of the corpus. For Gemini, if you're training an English LLM, your vocabulary is about 100,000 words in the Oxford dictionary, and they add about a thousand words every year. At YouTube, if you imagine the library of YouTube, it has billions of videos: we have 20 billion videos on YouTube, with millions added every day. And the freshness of videos is really important, much more so than for LLMs. Think about a new word added to the English dictionary: the word of 2023 was "rizz". If your Gemini model doesn't know about "rizz", it can still answer 99% of the questions people have; maybe it misses some jokes and some pop-culture references. But in the world of YouTube, if Taylor Swift drops a new music video, you have to be able to recommend it within the next minutes or hours, or a lot of users are going to be upset. So even within this large corpus, you have to very quickly understand which videos are important and start recommending them to the right users. What we do with the LRM recommender is continuously pre-train it on the order of days and hours, which is very different from classical LLM pre-training like Gemini's, which happens maybe once every three to six months. So in that way it's a much harder problem. And the last part is scale. We have great models in Gemini; Gemini Pro is incredible, but there's no way you can serve it to billions of daily active users. So for YouTube, we had to focus on the smaller, more efficient models, like Flash, and even smaller checkpoints than that, just to hit the latency and scale requirements we have.
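The offline-table pattern mentioned above is straightforward to sketch; the decode function here is a stub standing in for batch LLM inference:

```python
from typing import Callable, Dict, List

def build_offline_table(head_videos: List[str],
                        decode_fn: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Run the unpersonalized prompt once per head video (offline batch
    inference) and store the decoded candidate lists."""
    return {video: decode_fn(video) for video in head_videos}

def serve(context_video: str, table: Dict[str, List[str]],
          fallback: List[str]) -> List[str]:
    """Request-time path: an O(1) lookup instead of an LLM inference."""
    return table.get(context_video, fallback)

# Illustrative stub for the LLM decode step.
table = build_offline_table(["video_A"], lambda v: [f"{v}_next1", f"{v}_next2"])
print(serve("video_A", table, fallback=["popular_video"]))
```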
So I want to summarize the journey we've been on at YouTube in what I think of as an LLM-and-RecSys recipe that you can maybe adapt to your own application. There are three major steps. First, find a way to tokenize your content: just like LLMs tokenize text, you want to turn some essence of your content into an atomic token. One way to do that, which we've done, is to build a rich representation from a bunch of features into an embedding, and then find a way to tokenize or quantize it. The outcome is that you're making your own domain-specific language. The second step is to adapt the LLM: make links between English and your domain language, and find training tasks that help it reason across English and these new tokens you've built. The outcome of this step, in my mind, is a bilingual LLM that can speak natural language but can also speak your domain-specific language. And then, once you have that, you can do the third step of prompting it with user information: construct personalized prompts with user demographics, user activity, and different actions, then train task-specific or surface-specific models, and you have a generative recommendation system on top of an LLM. That's a tweet-sized summary of maybe two years of work.

Maybe the last thing I want to talk about is where I see this going, some possible future directions for LLMs and RecSys. The stage we're at right now is that LLMs are just augmenting recommendations. They bring these magical recommendation experiences and enhance quality, but they're largely invisible to users: your YouTube feed just got better, but you don't really know whether a Gemini inference happened or not. This is why I think the LLM application to RecSys is very underhyped: users don't directly see what's happening. I think we're close to a world, and we're experimenting with this, where, if you have the bilingual LLM across English and recommendations we talked about, users can then talk to it in natural language. You're going to start to see experiences where users can steer recommendations toward their own goals, the recommender can explain why a candidate was recommended, and users can align it toward goals expressed in natural language. And I think the lines between search and recommendation start to blur in this world. Then, maybe as a hint of the future: I think you're going to see recommendation and generative content come together, where we'll be recommending a personalized version of a piece of content, and in the future, instead of recommending content, we may even start creating it, getting to really interesting N-of-1 content generated for the user. I think we're a bit away from this, but it's going to come sooner than you expect with all the advances happening in AI. So yeah, thank you. I'll take any questions.

Thank you. We have time for a few questions.

Hi, great talk. One question: how do you balance learning the semantic ID embeddings within the model against keeping the general language capability from being damaged by training on, for example, a tokenized user history, which is a second language very different from English? Any high-level takeaway you can share?

That's a super interesting question, and we've struggled with it a lot. In some of our early applications, we mostly cared about recommendation quality, in which case we overindexed on speaking the semantic ID language. And as you overtrain on more and more of those examples, the model actually forgets how to speak English; maybe it's reasoning in some intermediate layers that finally end up in the semantic ID language. We are trying a bunch of things: for example, with mixture-of-experts, maybe a few experts can retain the text capability while other experts focus on the semantic ID capability. So it's a balance, and I think we're going to shift more toward text as we try to build these interactive experiences, where text input from the user becomes more important.

Thank you. During this process, did you learn any good suggestions for cold-starting embeddings on these domain-specific tokens?
Yeah. One thing is that the semantic ID training process is entirely unsupervised: it's making its own quantization of the video corpus. When you sample to see what the model is doing, we find it's learning concepts like sports versus movies and entertainment, but we didn't actually try to teach that explicitly, which I think is very interesting. The second aspect is that, because of semantic ID, we can warm-start into a semantically meaningful space, and what we find is that performance for videos uploaded in the last day or the last week gets much better, because we're better at understanding this fresh and tail content.

Hey, quick question. When you said you extract frames as part of making the semantic ID, are you just sampling the video at, say, 3 to 30 fps, making a grid of the frames, running something like SigLIP over it, and inserting that?

We're just sampling video frames. We've tried a few different approaches; for example, we try to sample from key moments in the video. We actually have the engagement data (you may have seen in the YouTube player that it can highlight the places where people engaged the most), so we try to sample from there. Given the scale, we can't sample a lot of video frames, so we try to select them intelligently, but we do have video frames, and over time I think we'll get better at selecting them.

Are you able to capture important things based on small objects in a video, say a person in the distance who is the focus of attention?

Hard to say, because in the end all of this video information gets compressed into eight tokens. It's probably learning something, but it's hard to know exactly what it picked up from that video frame. So it's unclear.

Thank you. It was a pretty good talk. I have a question regarding pre-training. Did you also feed in user queries and what users watched as pre-training data? And if yes, did you also use semantic IDs for users in pre-training, or are semantic IDs only for the videos?

In this case we have only tokenized videos, and we focused more on sequences of watches rather than on a search query and the watch that originated from that query. You could imagine some parallel work where you try to tokenize users and build a user token that represents, say, the last 500 watches they've had. We've experimented with some things there; I think it's less far along, but it's a very interesting research direction.

So the pre-training was done on top of an existing Gemini pre-trained model, right?

Yeah, we basically take a Gemini checkpoint and then adapt it for this YouTube purpose, getting this YouTube-Gemini LRM checkpoint.

Okay. It would be cool to see the semantic IDs of videos fed to Veo 3.

Yeah. Hey, I'm kind of curious how much improvement you see compared to a non-LLM, more traditional recommendation system, and when we should use a traditional one versus an LLM-based one.

Yeah, I can't really share metrics; I can share everything except code and metrics, you know. And so we've given you as many of the conceptual steps as we can. Maybe what I'll say is that I think it's been the biggest improvement to recommendation quality we've seen in the last few years, so I do think it's quite significant. Thank you.
Well, please join me in giving a big round of applause to all our speakers. For all of these speakers, we actually reached out to them personally; they didn't submit any talks. We reached out because we knew their work was high quality, and I was really interested to share it with you. Some of them are still hanging around, so you can find them and ask as many questions as you have. Thank you everyone for joining, and I hope you have an awesome rest of the conference. Thank you.