360Brew: LLM-based Personalized Ranking and Recommendation - Hamed and Maziar, LinkedIn AI
Channel: aiDotEngineer
Published at: 2025-07-16
YouTube video id: U0S6CfzAY5c
Source: https://www.youtube.com/watch?v=U0S6CfzAY5c
Hi everyone, very excited to be here. I'm Hamed, and this is Maziar. Today we're going to talk about our journey in leveraging large language models for personalization and ranking, and our path to productionizing such a large model for LinkedIn use cases.

Recommendation, ranking, and personalization are deeply integrated into our daily lives. When you go to a feed to read an article, when you're looking for a job, when you're searching for something, when you're buying something online, the backend powered by a recommendation system tries to find the best content or entity based on your interests and its relevance to you. However, these systems usually suffer from some challenges: each one is trained on a specific task, so they are optimized in a disjoint way; they usually don't leverage the most advanced architectures; and they are rolled out one by one, which is very time consuming and unproductive. So the question we are asking is: what if we had only one model that solves all of these tasks at the same time?

The mission we started with was to build a large foundation model, based on large language models, that has a holistic understanding of the member's journey on the LinkedIn platform and can solve all the personalization tasks LinkedIn has with just one model. In addition, we wanted this model to have three other main characteristics. First, we want it to have zero-shot capability, so that when you have a new problem or a new surface, instead of collecting data, building a new recommendation and ranking model, and putting it into production, which is a very time-consuming journey, you can leverage this model out of the box. You just prompt the model and tell it: this is the task I want to solve, this is the kind of recommendation, this is the entity, this is the member, and what do you think about the relevance between the two? The second characteristic is to leverage in-context learning as much as possible, so that for cold-start users, for example, we can give the model just a few examples, or simply explain what the user might be interested in, and the model can solve the problem even for the coldest-start users. The last one is instruction following. We want to give our members the ability to tell the model what they're interested in. Imagine that next time you go to the LinkedIn feed, you can tell the model: these are my niche interests and these are the topics I want to explore, and the recommendation system starts finding the relevant content and recommending it to you.

Now Maziar will talk about how we built this model, and then I'll talk about how we serve it.

Okay, so let me talk a little bit about the brewing part, the building of the model. In order to make use of LLMs, which is what I think most of you are here for, we need to convert all the information we have about the users and their interactions into a prompt, and this is what we call the magic of promptification.
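To make this promptification idea concrete, here is a minimal sketch of how a member's profile, recent interactions, and a candidate item might be assembled into such a prompt. The template, field names, and wording below are illustrative assumptions, not LinkedIn's actual format.

```python
# Illustrative sketch of "promptification": turning structured member data into a prompt.
# Template and field names are hypothetical, not the actual 360Brew format.

def promptify(member_profile: dict, interactions: list[dict], candidate: str) -> str:
    history = "\n".join(
        f"- {it['item']} -> {'interacted' if it['clicked'] else 'skipped'}"
        for it in interactions
    )
    return (
        "Instruction: You are a recommendation assistant. Given the member's profile and "
        "their past interactions, predict whether they will interact with the new item. "
        "Answer with Yes or No.\n\n"
        f"Member profile: {member_profile['headline']}, located in {member_profile['location']}.\n\n"
        f"Past interactions:\n{history}\n\n"
        f"New item: {candidate}\n"
        "Question: Will the member interact with this item? Answer:"
    )

# Example with made-up data:
print(promptify(
    {"headline": "Machine Learning Engineer", "location": "Sunnyvale"},
    [
        {"item": "Post about LLM inference optimization", "clicked": True},
        {"item": "Job: Senior Data Analyst", "clicked": False},
    ],
    "Job: Staff ML Engineer, Ranking Infrastructure",
))
```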
We take all the information we have about the member's history, their profile, and the many interactions they have had, and we turn it into a prompt like the one shown on the slide. As you can see, there's an instruction for the model to follow, for example describing what we want the model to do in this case, so that we can generalize over different instructions. We give some information about the member's profile, and we include some past interactions they have had with items we have already shown them. Then the question comes in: what do you think the member is going to do with this new piece of information, this new item we are showing them? That's basically how we formalize the problem in order to feed it into an LLM.

Obviously, if you take one of the LLMs out of the box and try to solve this problem with it, it's going to work a little bit, but it's not going to be great. So we have to train the model. This is the pipeline we have for developing the model and productionizing it. On the left-hand side we start with an open-source model, then we do some upcycling magic so that we can control the size of the model and trade throughput against quality. Then we have a few blocks of training: continued pre-training, fine-tuning, instruction fine-tuning, and alignment. At this point we have a large model that we call Brew XL; think of it as a 150-billion-parameter model that does really well, where we have maximized quality. But obviously this model cannot be served online, because as you know, recommendation systems have to handle very high query volumes. So from here we go all the way down, distilling the model to maximize efficiency (we'll talk a little bit about that), until we reach, let's say, a 3B model, which is something that can actually be productionized.

As you can see, there are many different boxes here, and to make sure the development cycle stays smooth we had to do a lot of automation. One of the key lessons is to build a lot of automation into these pipelines, so that building these models, which is genuinely complicated, becomes much easier and more manageable.

One big question that might come up is: why do you need the XL model at all? In fact, we did a lot of experimentation to see whether we could get away without it. Unfortunately, that's not the case. You have to go big first and then go small. If you try to train a small model from scratch, it doesn't work as well. So we did this comparison and showed that distillation is very important for the smaller models.
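Since "go big, then go small" via distillation is central to the recipe, here is a minimal sketch of what a generic knowledge-distillation training step can look like, with standard soft-target matching against a larger teacher. This is a textbook illustration, not the specific procedure, losses, or hyperparameters used for 360Brew.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Generic soft-target distillation: the student matches the teacher's softened
    distribution (KL term) in addition to the usual cross-entropy on the labels."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_preds = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_preds, soft_targets, reduction="batchmean") * temperature**2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy example: a batch of 4 predictions over a 32-token vocabulary.
teacher_logits = torch.randn(4, 32)                       # frozen, larger teacher
student_logits = torch.randn(4, 32, requires_grad=True)   # smaller student being trained
labels = torch.randint(0, 32, (4,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```

In the gradual recipe described later in the talk, the same idea is applied repeatedly: distill to an intermediate size, then use that model as the teacher for the next, smaller one.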
But now let me tell you a little bit about the levers you can use to improve these models over time. This is very important: if you look at the literature, there's a lot about scaling laws and how these models scale with data, with compute, and so on. In our case there are three different levers, and the first one is data scaling. What if we have more and more data? This comes up a lot in recommendation systems: depending on how much you log about user behavior, you might have data that goes back six months, a year, or more. In this graph, as we increase the amount of data, the performance of the model improves, and we expect we can improve it even further by feeding in more data.

Another lever you can pull to improve the quality of the model, especially the XL model, is to increase its size. We ran this experiment on the Mixtral architecture: if you go from 7B to 8x22B, the performance of the model improves.

Finally, and I think this is one of the take-home messages, context length matters a lot for these kinds of recommendation applications. The context length defines how much of the user's history you can give to the model. In this experiment we show that increasing the context length, by feeding more of the user's history to the model, improves performance. As you can see, toward the end of the graph the performance drops. We don't believe this is because the extra context is less informative; the problem is that the model we were using in this experiment doesn't generalize that well to longer contexts. Most of the literature reports the same thing: model performance drops once you go beyond a certain context length. And with that, I'll hand it back.

Okay, let's talk a little bit about the results and see whether we delivered on some of the promises we made. One thing we promised was better behavior on cold-start users. Here we show the gap between our model and the production models for users with few interactions, for example fewer than five interactions, fewer than 100 interactions, and so on. As you can see, the gap between the 360Brew model and the production model grows as the number of interactions decreases. This shows that the world knowledge the model brings into these systems improves the quality of its predictions.

Finally, we promised generalization to new domains, meaning problems the model has never seen in its training data. In this graph there are four different tasks, and they are completely out of domain: the model saw no information about those surfaces during training. As you can see, it can be on par with, or even beat, some of the production models. Keep in mind those production models are specific to each task; they have been trained on exactly that task. So this is not a small feat; it's actually quite significant.
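As a rough illustration of how the zero-shot and in-context-learning promises combine, here is a hedged sketch of prompting the same foundation model on a hypothetical brand-new surface, with a couple of in-context examples standing in for a cold-start member's history. The surface, examples, and wording are invented for illustration and are not tasks from the talk.

```python
# Hypothetical new surface ("events you might want to attend") scored zero-shot by the same
# foundation model, with two in-context examples for a cold-start member.
new_surface_prompt = """Instruction: You rank items for a professional networking platform.
Task: decide whether the member is likely to RSVP to the event below. Answer Yes or No.

Member profile: Recent graduate in data engineering, few connections, little activity yet.
Member-stated interest: "I want to learn about streaming data pipelines."

Examples:
Event: "Intro to Apache Flink meetup" -> Yes
Event: "Regional sales leadership dinner" -> No

Event: "Hands-on workshop: building real-time feature pipelines"
Question: Will the member RSVP to this event? Answer:"""

# The same model that scores feed posts or job recommendations would be asked to score this
# prompt directly, with no task-specific training for the new surface.
print(new_surface_prompt)
```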
This lets the people who are developing these platforms roll out features and surfaces much more quickly, because they can use these models to do the recommendation for them. And now I'll hand it back to Hamed to talk about serving.

Let me walk you through how we productionize such a large model in an environment that requires very high QPS and low latency. Many recommendation systems handle tens of thousands of QPS and also require latencies well under a second, something like 400 to 500 milliseconds at best. There are three levers we can pull to make the model more efficient, improve throughput, and reduce latency: sparsification, distillation, and quantization.

As Maziar explained before, a smaller model definitely has better throughput, but our recipe is to go big and then go small. If you start with a small model, it doesn't have enough capacity or reasoning power to solve the complicated tasks we have. So we start with the large 150-billion-parameter model and then distill it down. One part of the recipe is to do the distillation step by step: we go to, for example, an 8B model, then a 3B model, then a 1B model. We slowly decrease the size, distilling over and over from the previous model. That turns out to be much more effective than going directly from the 150-billion-parameter model to a 1B-parameter model.

The same goes for pruning. Pruning is a mathematical optimization problem: you can reduce the number of attention heads in the transformer, or reduce the MLPs. Overall, these transformer models have proven to be very redundant in how they keep information, so we can prune away some of these layers, or reduce the precision of the activations and parameters. However, if you prune very aggressively at the beginning, performance suffers significantly. So the recipe here is also gradual pruning: we prune the model a little, distill into the smaller model, and repeat, more pruning, more distillation. As you can see from this plot, gradual pruning can be essentially lossless, whereas aggressive pruning up front can cost up to a 1% reduction in model quality.

Another lever is quantization, going to lower precision. We are leveraging FP8 for activations and model parameters. However, applying FP8 to all layers hurts the quality of the model significantly, so the tool here is mixed precision. One important aspect when it comes to ranking, recommendation, and prediction tasks in general is that you want the model's output probability to have very good precision. So the LM head at the end of the language model has to stay in FP32.
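To illustrate why the precision of that final head matters, here is a minimal sketch of turning the model's output into a relevance probability by doing the last projection and the softmax in FP32. The shapes, token ids, and the idea of scoring a Yes/No token pair are assumptions for illustration, not the exact 360Brew scoring code.

```python
import torch

# Toy stand-ins: a last-token hidden state and an LM-head weight stored in lower precision.
hidden = torch.randn(1, 4096, dtype=torch.bfloat16)
lm_head_weight = torch.randn(32000, 4096, dtype=torch.bfloat16)  # vocab x hidden

YES_TOKEN_ID, NO_TOKEN_ID = 9642, 2822  # hypothetical token ids for "Yes" / "No"

def relevance_probability(hidden, lm_head_weight):
    # Upcast the final projection and softmax to FP32 so tiny differences between
    # candidate scores survive rounding; this keeps the probabilities calibrated
    # enough to rank hundreds of items against each other.
    logits = hidden.float() @ lm_head_weight.float().t()
    pair = logits[0, [YES_TOKEN_ID, NO_TOKEN_ID]]
    return torch.softmax(pair, dim=-1)[0].item()  # P("Yes")

print(relevance_probability(hidden, lm_head_weight))
```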
If you do it in FP16, BF16, or FP8, the numbers collapse: you lose calibration on top of the scores and you can no longer distinguish between the different recommended items.

The last lever is sparsification. The most expensive part of the transformer is computing the attention scores, and we can sparsify them: not every item needs to attend to every other item. When you know your task, when you know these are the items in the user's history, you can sparsify so that not every item attends to every other one. The same goes for the items you are scoring: instead of scoring one item at a time, you can score 50 or 500 items in the same request, but you want to make sure those candidate items do not attend to each other. So we sparsify the attention pattern for the candidates as well as for the query.

If you put everything together, we get a significant reduction in latency. Over four or five releases, one after the other, we were able to reduce latency by 7x while increasing throughput, the number of queries one GPU can handle, by 30x. So we are increasing the amount of work each GPU does while reducing the latency each query sees.

These are some of the technical reports and papers we published along the way to share the lessons learned with the community. And that's the end of our talk; we have some time to answer questions. Thank you.

Please come to the microphones if you want to ask a question.

Thank you, great talk. One question: how did you measure that the model doesn't lose generalization power? Obviously you've done a lot of fine-tuning, and you mentioned it works for four or five tasks instead of task-specific models. How do you know it's going to work for the next five tasks?

That's a good question. The overall answer is having a very comprehensive benchmark set. We have something like 50 to 60 benchmarks, some internal and some external. For example, we use IFEval to make sure the model still follows instructions well. And as Maziar mentioned, some of the tasks were never part of our training data, and that's how we measure generalization to new domains within LinkedIn use cases, for example.

Hi, thanks for the talk. I'm wondering what a small listing website can use out of the box. Have you heard of NLWeb, which was launched recently by Microsoft? If yes, what are your views on it as a recommendation system?

NLWeb? No, I haven't actually heard of it, sorry about that.

Anything for smaller players then: say a real estate listing website with thousands of listings. What are the out-of-the-box recommendation models that people can start using?

I wish such a model existed. I don't really know of one, and that's why we started this work: we were trying to see if we could build a foundation model so that you can actually solve those kinds of problems. I think there's a lot of potential for this to serve many use cases beyond the bigger companies, but I don't know of any off the shelf today. I think you should check out the NLWeb one.

Okay, I'll look at that. Yeah, thanks.
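The sparsified, multi-item scoring setup described above (and probed further in the next question) roughly corresponds to an attention mask like the one sketched below: every candidate attends to the shared history prefix and causally to itself, but never to the other candidates. This is a plain mask construction for illustration; as the speakers note, in practice it is implemented with custom kernels in the serving stack rather than as a dense mask.

```python
import torch

def build_multi_item_mask(history_len: int, item_lens: list[int]) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for scoring many candidates in one prefill:
    history tokens attend causally among themselves; each candidate attends to the full
    history and causally within itself, but never to the other candidates."""
    total = history_len + sum(item_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Causal attention within the shared member history / profile prefix.
    mask[:history_len, :history_len] = torch.tril(
        torch.ones(history_len, history_len, dtype=torch.bool)
    )
    start = history_len
    for n in item_lens:
        end = start + n
        mask[start:end, :history_len] = True  # candidate tokens see the whole history
        mask[start:end, start:end] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        start = end
    return mask

# Example: a 6-token history and three candidate items of 2, 3, and 2 tokens.
print(build_multi_item_mask(6, [2, 3, 2]).int())
```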
Thank you for the great talk. On the slide where you mentioned multi-item scoring, I'm curious what that effectively means. Does it mean you need to do multi-step decoding, is it a single step, or are you just processing the logits for multiple items?

We didn't want to get into the complications of speculative decoding, or the decoding side at all; we wanted everything to happen at prefill. So what we did is that all the recommended items, all the potential candidates, are sequenced together, but we also wanted to avoid them attending to each other. So we rely on the attention mask, and we actually developed a special kernel in SGLang and vLLM to be able to do that. Now, when you have up to 500 items in your query segment, those items don't attend to each other; they only attend to the user history and user profile information.

Okay, thank you.

Hey, great talk. So a user history means many things, right? There are all the jobs they've applied to, the job postings, there are so many entities, and so on. The context for the model can get quite large. How did you manage that? Did you compress it, or were there parts you focused on?

Yeah, we actually experimented with a lot of things. We experimented with a RAG system, so that when we have a query we try to figure out which items in your history are closest to it and bring those up. We also experimented with chronological order and some sort of weight decay over the chronological order. It turns out that for the majority of our applications, plain chronological order is good enough, and that kind of makes sense, because recommendation systems are very biased toward freshness, so the more recent user activity helps the most. One of the biggest challenges is actually something that looks more like a traditional LLM problem: how do you balance the distribution of positives and negatives within the context? That becomes more of an ML engineering effort, figuring out how many positives, how many negatives, and how much information to put in the context.

I can add one more thing to this. There's another complication when you go to serving these models: you don't want to break the KV caching you're using in serving. So doing something smarter than just putting the history in chronological order becomes more complicated and more cumbersome; it's something that needs to be designed, and it's not obvious.

Absolutely.
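As a small illustration of the "chronological order is usually good enough" point, here is a hedged sketch of keeping only the most recent interactions and rendering them in plain chronological order. The truncation rule and layout are assumptions for illustration, not the production logic; the comment reflects the KV-caching consideration mentioned above.

```python
def build_history_section(interactions: list[dict], max_items: int = 50) -> str:
    """Keep only the most recent interactions, rendered in plain chronological order.
    Recommendations are biased toward freshness, so recent activity matters most; and
    anything smarter (e.g., retrieving or reordering history per candidate) would change
    the prompt per request and make KV/prefix-cache reuse harder, as noted in the talk."""
    recent = sorted(interactions, key=lambda it: it["timestamp"])[-max_items:]
    return "\n".join(
        f"- {it['item']} -> {'interacted' if it['clicked'] else 'skipped'}" for it in recent
    )

print(build_history_section([
    {"item": "Post about vector databases", "clicked": True, "timestamp": 1},
    {"item": "Job: Analytics Engineer", "clicked": False, "timestamp": 2},
    {"item": "Post about GPU inference", "clicked": True, "timestamp": 3},
], max_items=2))
```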
One more question: you did so many experiments and tried out so many things. How is your entire system set up? I'm assuming that when you say quantization, you must have tried different forms of quantization and so on. How do you set up the system in such a way that you can try multiple experiments and see what works best? Can you talk a bit about that?

Yes, I touched a bit on that one. The one thing we held a very high bar for was automation. Our system is automated to the extent that when you run an experiment, the results are automatically pushed into an Excel sheet. When you have such an automated system, the developers become very efficient: if you just want to try different quantization settings, you change the quantization parameters and everything else happens end to end with the click of a button. So I think automation is the key if you really want to optimize these models.

So did you build all of that automation in house, or did you...

Yes, most of it. I mean, we leverage a lot of open-source tools, for example Lightning, vLLM, and SGLang, but we make sure they are integrated well with each other, and we optimize the entire flow.

Cool, thank you.

Thank you. Thank you again, Maziar. Thank you.