360Brew: LLM-based Personalized Ranking and Recommendation - Hamed and Maziar, LinkedIn AI
Channel: aiDotEngineer
Published at: 2025-07-16
YouTube video id: U0S6CfzAY5c
Source: https://www.youtube.com/watch?v=U0S6CfzAY5c
Hi everyone, very excited to be here. I'm Hamed, and this is Maziar. Today we're going to talk about our journey in leveraging large language models for personalization and ranking, and our path to productionizing such a large model for LinkedIn use cases.

Recommendation, ranking, and personalization are deeply integrated into our daily lives. When you go to a feed to read an article, when you're looking for a job, when you're searching for something, when you're buying something online, the backend powered by a recommendation system tries to find the best content or entity based on your interests and its relevance to you. However, these systems usually suffer from some challenges: each one is trained on a specific task, so they are optimized in a disjoint way; they usually don't leverage the most advanced architectures; and they are rolled out one by one, which is very time consuming and unproductive. So the question we are asking is: what if we had only one model that solves all of these tasks at the same time?

The mission we started with was to build a large foundation model, based on large language models, that has a holistic understanding of the member's journey on the LinkedIn platform and can solve all the personalization tasks LinkedIn has with just one model. In addition, we wanted this model to have three other main characteristics. First, we want it to have zero-shot capability, so that when you have a new problem or a new surface, instead of collecting data, building a new recommendation and ranking model, and putting it into production, which is a very time-consuming journey, you can leverage this model out of the box. You just prompt the model and tell it: this is the task I want to solve, this is the kind of recommendation, this is the entity, this is the member, and what do you think about the relevance between the two? The second characteristic is to leverage in-context learning as much as possible, so that for cold-start users, for example, we can give the model just a few examples, or simply explain what the user might be interested in, and the model can solve the problem even for the coldest-start users. The last one is instruction following. We want to give our members the ability to tell the model what they're interested in. Imagine that next time you go to the LinkedIn feed, you can tell the model: these are my niche interests and these are the topics I want to explore, and the recommendation system starts finding the relevant content and recommending it to you.

Now Maziar will talk about how we built this model, and then I'll talk about how we serve it.

Okay, so let me talk a little bit about the brewing part, the building of the model. In order to make use of LLMs, which is what I think most of you are here for, we need to convert all the information we have about the users and their interactions into a prompt, and this is what we call the magic of promptification.
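To make this promptification idea concrete, here is a minimal sketch of how a member's profile, recent interactions, and a candidate item might be assembled into such a prompt. The template, field names, and wording below are illustrative assumptions, not LinkedIn's actual format.

```python
# Illustrative sketch of "promptification": turning structured member data into a prompt.
# Template and field names are hypothetical, not the actual 360Brew format.

def promptify(member_profile: dict, interactions: list[dict], candidate: str) -> str:
    history = "\n".join(
        f"- {it['item']} -> {'interacted' if it['clicked'] else 'skipped'}"
        for it in interactions
    )
    return (
        "Instruction: You are a recommendation assistant. Given the member's profile and "
        "their past interactions, predict whether they will interact with the new item. "
        "Answer with Yes or No.\n\n"
        f"Member profile: {member_profile['headline']}, located in {member_profile['location']}.\n\n"
        f"Past interactions:\n{history}\n\n"
        f"New item: {candidate}\n"
        "Question: Will the member interact with this item? Answer:"
    )

# Example with made-up data:
print(promptify(
    {"headline": "Machine Learning Engineer", "location": "Sunnyvale"},
    [
        {"item": "Post about LLM inference optimization", "clicked": True},
        {"item": "Job: Senior Data Analyst", "clicked": False},
    ],
    "Job: Staff ML Engineer, Ranking Infrastructure",
))
```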
We take all the information we have about the member's history, their profile, and the many interactions they have had, and we turn it into a prompt like the one shown on the slide. As you can see, there's an instruction for the model to follow, for example describing what we want the model to do in this case, so that we can generalize over different instructions. We give some information about the member's profile, and we include some past interactions they have had with items we have already shown them. Then the question comes in: what do you think the member is going to do with this new piece of information, this new item we are showing them? That's basically how we formalize the problem in order to feed it into an LLM.

Obviously, if you take one of the LLMs out of the box and try to solve this problem with it, it's going to work a little bit, but it's not going to be great. So we have to train the model. This is the pipeline we have for developing the model and productionizing it. On the left-hand side we start with an open-source model, then we do some upcycling magic so that we can control the size of the model and trade throughput against quality. Then we have a few blocks of training: continued pre-training, fine-tuning, instruction fine-tuning, and alignment. At this point we have a large model that we call Brew XL; think of it as a 150-billion-parameter model that does really well, where we have maximized quality. But obviously this model cannot be served online, because as you know, recommendation systems have to handle very high query volumes. So from here we go all the way down, distilling the model to maximize efficiency (we'll talk a little bit about that), until we reach, let's say, a 3B model, which is something that can actually be productionized.

As you can see, there are many different boxes here, and to make sure the development cycle stays smooth we had to do a lot of automation. One of the key lessons is to build a lot of automation into these pipelines, so that building these models, which is genuinely complicated, becomes much easier and more manageable.

One big question that might come up is: why do you need the XL model at all? In fact, we did a lot of experimentation to see whether we could get away without it. Unfortunately, that's not the case. You have to go big first and then go small. If you try to train a small model from scratch, it doesn't work as well. So we did this comparison and showed that distillation is very important for the smaller models.
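Since "go big, then go small" via distillation is central to the recipe, here is a minimal sketch of what a generic knowledge-distillation training step can look like, with standard soft-target matching against a larger teacher. This is a textbook illustration, not the specific procedure, losses, or hyperparameters used for 360Brew.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Generic soft-target distillation: the student matches the teacher's softened
    distribution (KL term) in addition to the usual cross-entropy on the labels."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_preds = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_preds, soft_targets, reduction="batchmean") * temperature**2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy example: a batch of 4 predictions over a 32-token vocabulary.
teacher_logits = torch.randn(4, 32)                       # frozen, larger teacher
student_logits = torch.randn(4, 32, requires_grad=True)   # smaller student being trained
labels = torch.randint(0, 32, (4,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```

In the gradual recipe described later in the talk, the same idea is applied repeatedly: distill to an intermediate size, then use that model as the teacher for the next, smaller one.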
But now let me tell you a little bit about the levers you can use to improve these models over time. This is very important: if you look at the literature, there's a lot about scaling laws and how these models scale with data, with compute, and so on. In our case there are three different levers, and the first one is data scaling. What if we have more and more data? This comes up a lot in recommendation systems: depending on how much you log about user behavior, you might have data that goes back six months, a year, or more. In this graph, as we increase the amount of data, the performance of the model improves, and we expect we can improve it even further by feeding in more data.

Another lever you can pull to improve the quality of the model, especially the XL model, is to increase its size. We ran this experiment on the Mixtral architecture: if you go from 7B to 8x22B, the performance of the model improves.

Finally, and I think this is one of the take-home messages, context length matters a lot for these kinds of recommendation applications. The context length defines how much of the user's history you can give to the model. In this experiment we show that increasing the context length, by feeding more of the user's history to the model, improves performance. As you can see, toward the end of the graph the performance drops. We don't believe this is because the extra context is less informative; the problem is that the model we were using in this experiment doesn't generalize that well to longer contexts. Most of the literature reports the same thing: model performance drops once you go beyond a certain context length. And with that, I'll hand it back.

Okay, let's talk a little bit about the results and see whether we delivered on some of the promises we made. One thing we promised was better behavior on cold-start users. Here we show the gap between our model and the production models for users with few interactions, for example fewer than five interactions, fewer than 100 interactions, and so on. As you can see, the gap between the 360Brew model and the production model grows as the number of interactions decreases. This shows that the world knowledge the model brings into these systems improves the quality of its predictions.

Finally, we promised generalization to new domains, meaning problems the model has never seen in its training data. In this graph there are four different tasks, and they are completely out of domain: the model saw no information about those surfaces during training. As you can see, it can be on par with, or even beat, some of the production models. Keep in mind those production models are specific to each task; they have been trained on exactly that task. So this is not a small feat; it's actually quite significant.
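As a rough illustration of how the zero-shot and in-context-learning promises combine, here is a hedged sketch of prompting the same foundation model on a hypothetical brand-new surface, with a couple of in-context examples standing in for a cold-start member's history. The surface, examples, and wording are invented for illustration and are not tasks from the talk.

```python
# Hypothetical new surface ("events you might want to attend") scored zero-shot by the same
# foundation model, with two in-context examples for a cold-start member.
new_surface_prompt = """Instruction: You rank items for a professional networking platform.
Task: decide whether the member is likely to RSVP to the event below. Answer Yes or No.

Member profile: Recent graduate in data engineering, few connections, little activity yet.
Member-stated interest: "I want to learn about streaming data pipelines."

Examples:
Event: "Intro to Apache Flink meetup" -> Yes
Event: "Regional sales leadership dinner" -> No

Event: "Hands-on workshop: building real-time feature pipelines"
Question: Will the member RSVP to this event? Answer:"""

# The same model that scores feed posts or job recommendations would be asked to score this
# prompt directly, with no task-specific training for the new surface.
print(new_surface_prompt)
```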
This lets the people who are developing these platforms roll out features and surfaces much more quickly, because they can use these models to do the recommendation for them. And now I'll hand it back to Hamed to talk about serving.

Let me walk you through how we productionize such a large model in an environment that requires very high QPS and low latency. Many recommendation systems handle tens of thousands of QPS and also require latencies well under a second, something like 400 to 500 milliseconds at best. There are three levers we can pull to make the model more efficient, improve throughput, and reduce latency: sparsification, distillation, and quantization.

As Maziar explained before, a smaller model definitely has better throughput, but our recipe is to go big and then go small. If you start with a small model, it doesn't have enough capacity or reasoning power to solve the complicated tasks we have. So we start with the large 150-billion-parameter model and then distill it down. One part of the recipe is to do the distillation step by step: we go to, for example, an 8B model, then a 3B model, then a 1B model. We slowly decrease the size, distilling over and over from the previous model. That turns out to be much more effective than going directly from the 150-billion-parameter model to a 1B-parameter model.

The same goes for pruning. Pruning is a mathematical optimization problem: you can reduce the number of attention heads in the transformer, or reduce the MLPs. Overall, these transformer models have proven to be very redundant in how they keep information, so we can prune away some of these layers, or reduce the precision of the activations and parameters. However, if you prune very aggressively at the beginning, performance suffers significantly. So the recipe here is also gradual pruning: we prune the model a little, distill into the smaller model, and repeat, more pruning, more distillation. As you can see from this plot, gradual pruning can be essentially lossless, whereas aggressive pruning up front can cost up to a 1% reduction in model quality.

Another lever is quantization, going to lower precision. We are leveraging FP8 for activations and model parameters. However, applying FP8 to all layers hurts the quality of the model significantly, so the tool here is mixed precision. One important aspect when it comes to ranking, recommendation, and prediction tasks in general is that you want the model's output probability to have very good precision. So the LM head at the end of the language model has to stay in FP32.
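To illustrate why the precision of that final head matters, here is a minimal sketch of turning the model's output into a relevance probability by doing the last projection and the softmax in FP32. The shapes, token ids, and the idea of scoring a Yes/No token pair are assumptions for illustration, not the exact 360Brew scoring code.

```python
import torch

# Toy stand-ins: a last-token hidden state and an LM-head weight stored in lower precision.
hidden = torch.randn(1, 4096, dtype=torch.bfloat16)
lm_head_weight = torch.randn(32000, 4096, dtype=torch.bfloat16)  # vocab x hidden

YES_TOKEN_ID, NO_TOKEN_ID = 9642, 2822  # hypothetical token ids for "Yes" / "No"

def relevance_probability(hidden, lm_head_weight):
    # Upcast the final projection and softmax to FP32 so tiny differences between
    # candidate scores survive rounding; this keeps the probabilities calibrated
    # enough to rank hundreds of items against each other.
    logits = hidden.float() @ lm_head_weight.float().t()
    pair = logits[0, [YES_TOKEN_ID, NO_TOKEN_ID]]
    return torch.softmax(pair, dim=-1)[0].item()  # P("Yes")

print(relevance_probability(hidden, lm_head_weight))
```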
If you do it in FP16, BF16, or FP8, the numbers collapse: you lose calibration on top of the scores and you can no longer distinguish between the different recommended items.

The last lever is sparsification. The most expensive part of the transformer is computing the attention scores, and we can sparsify them: not every item needs to attend to every other item. When you know your task, when you know these are the items in the user's history, you can sparsify so that not every item attends to every other one. The same goes for the items you are scoring: instead of scoring one item at a time, you can score 50 or 500 items in the same request, but you want to make sure those candidate items do not attend to each other. So we sparsify the attention pattern for the candidates as well as for the query.

If you put everything together, we get a significant reduction in latency. Over four or five releases, one after the other, we were able to reduce latency by 7x while increasing throughput, the number of queries one GPU can handle, by 30x. So we are increasing the amount of work each GPU does while reducing the latency each query sees.

These are some of the technical reports and papers we published along the way to share the lessons learned with the community. And that's the end of our talk; we have some time to answer questions. Thank you.

Please come to the microphones if you want to ask a question.

Thank you, great talk. One question: how did you measure that the model doesn't lose generalization power? Obviously you've done a lot of fine-tuning, and you mentioned it works for four or five tasks instead of task-specific models. How do you know it's going to work for the next five tasks?

That's a good question. The overall answer is having a very comprehensive benchmark set. We have something like 50 to 60 benchmarks, some internal and some external. For example, we use IFEval to make sure the model still follows instructions well. And as Maziar mentioned, some of the tasks were never part of our training data, and that's how we measure generalization to new domains within LinkedIn use cases, for example.

Hi, thanks for the talk. I'm wondering what a small listing website can use out of the box. Have you heard of NLWeb, which was launched recently by Microsoft? If yes, what are your views on it as a recommendation system?

NLWeb? No, I haven't actually heard of it, sorry about that.

Anything for smaller players then: say a real estate listing website with thousands of listings. What are the out-of-the-box recommendation models that people can start using?

I wish such a model existed. I don't really know of one, and that's why we started this work: we were trying to see if we could build a foundation model so that you can actually solve those kinds of problems. I think there's a lot of potential for this to serve many use cases beyond the bigger companies, but I don't know of any off the shelf today. I think you should check out the NLWeb one.

Okay, I'll look at that. Yeah, thanks.
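The sparsified, multi-item scoring setup described above (and probed further in the next question) roughly corresponds to an attention mask like the one sketched below: every candidate attends to the shared history prefix and causally to itself, but never to the other candidates. This is a plain mask construction for illustration; as the speakers note, in practice it is implemented with custom kernels in the serving stack rather than as a dense mask.

```python
import torch

def build_multi_item_mask(history_len: int, item_lens: list[int]) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for scoring many candidates in one prefill:
    history tokens attend causally among themselves; each candidate attends to the full
    history and causally within itself, but never to the other candidates."""
    total = history_len + sum(item_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Causal attention within the shared member history / profile prefix.
    mask[:history_len, :history_len] = torch.tril(
        torch.ones(history_len, history_len, dtype=torch.bool)
    )
    start = history_len
    for n in item_lens:
        end = start + n
        mask[start:end, :history_len] = True  # candidate tokens see the whole history
        mask[start:end, start:end] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        start = end
    return mask

# Example: a 6-token history and three candidate items of 2, 3, and 2 tokens.
print(build_multi_item_mask(6, [2, 3, 2]).int())
```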
Thank you for the great talk. On the slide where you mentioned multi-item scoring, I'm curious what that effectively means. Does it mean you need to do multi-step decoding, is it a single step, or are you just processing the logits for multiple items?

We didn't want to get into the complications of speculative decoding, or the decoding side at all; we wanted everything to happen at prefill. So what we did is that all the recommended items, all the potential candidates, are sequenced together, but we also wanted to avoid them attending to each other. So we rely on the attention mask, and we actually developed a special kernel in SGLang and vLLM to be able to do that. Now, when you have up to 500 items in your query segment, those items don't attend to each other; they only attend to the user history and user profile information.

Okay, thank you.

Hey, great talk. So a user history means many things, right? There are all the jobs they've applied to, the job postings, there are so many entities, and so on. The context for the model can get quite large. How did you manage that? Did you compress it, or were there parts you focused on?

Yeah, we actually experimented with a lot of things. We experimented with a RAG system, so that when we have a query we try to figure out which items in your history are closest to it and bring those up. We also experimented with chronological order and some sort of weight decay over the chronological order. It turns out that for the majority of our applications, plain chronological order is good enough, and that kind of makes sense, because recommendation systems are very biased toward freshness, so the more recent user activity helps the most. One of the biggest challenges is actually something that looks more like a traditional LLM problem: how do you balance the distribution of positives and negatives within the context? That becomes more of an ML engineering effort, figuring out how many positives, how many negatives, and how much information to put in the context.

I can add one more thing to this. There's another complication when you go to serving these models: you don't want to break the KV caching you're using in serving. So doing something smarter than just putting the history in chronological order becomes more complicated and more cumbersome; it's something that needs to be designed, and it's not obvious.

Absolutely.
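As a small illustration of the "chronological order is usually good enough" point, here is a hedged sketch of keeping only the most recent interactions and rendering them in plain chronological order. The truncation rule and layout are assumptions for illustration, not the production logic; the comment reflects the KV-caching consideration mentioned above.

```python
def build_history_section(interactions: list[dict], max_items: int = 50) -> str:
    """Keep only the most recent interactions, rendered in plain chronological order.
    Recommendations are biased toward freshness, so recent activity matters most; and
    anything smarter (e.g., retrieving or reordering history per candidate) would change
    the prompt per request and make KV/prefix-cache reuse harder, as noted in the talk."""
    recent = sorted(interactions, key=lambda it: it["timestamp"])[-max_items:]
    return "\n".join(
        f"- {it['item']} -> {'interacted' if it['clicked'] else 'skipped'}" for it in recent
    )

print(build_history_section([
    {"item": "Post about vector databases", "clicked": True, "timestamp": 1},
    {"item": "Job: Analytics Engineer", "clicked": False, "timestamp": 2},
    {"item": "Post about GPU inference", "clicked": True, "timestamp": 3},
], max_items=2))
```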
One more question: you did so many experiments and tried out so many things. How is your entire system set up? I'm assuming that when you say quantization, you must have tried different forms of quantization and so on. How do you set up the system in such a way that you can try multiple experiments and see what works best? Can you talk a bit about that?

Yes, I touched a bit on that one. The one thing we held a very high bar for was automation. Our system is automated to the extent that when you run an experiment, the results are automatically pushed into an Excel sheet. When you have such an automated system, the developers become very efficient: if you just want to try different quantization settings, you change the quantization parameters and everything else happens end to end with the click of a button. So I think automation is the key if you really want to optimize these models.

So did you build all of that automation in house, or did you...

Yes, most of it. I mean, we leverage a lot of open-source tools, for example Lightning, vLLM, and SGLang, but we make sure they are integrated well with each other, and we optimize the entire flow.

Cool, thank you.

Thank you. Thank you again, Maziar. Thank you.