Netflix's Big Bet: One model to rule recommendations: Yesu Feng, Netflix

Channel: aiDotEngineer
Published at: 2025-07-16
YouTube video id: AbZ4IYGbfpQ
Source: https://www.youtube.com/watch?v=AbZ4IYGbfpQ
Good afternoon, and thank you, Eugene, for the introduction. Today I'm going to share our big bet at Netflix on personalization: using one foundation model to cover all the recommendation use cases.
At Netflix we have diverse recommendation needs. This is an example homepage for one profile on Netflix. It's a 2D layout of rows and items. Diversity shows up on at least three levels. The first level is the row. We have diverse rows: genre rows, for example rows of comedies or rows of action movies; rows of new, trending, just-released titles; and rows of, for example, titles only available on Netflix. So that's the first dimension. The second dimension is of course the items, or entities. In addition to the traditional movies and TV shows, we now have games and live streaming, and we are going to add more, so our content space is expanding to very heterogeneous content types.
The third level is the page. We have the homepage, the search page, and a kids' homepage, which is tailored very differently toward kids' interests. The mobile feed is a linear page, not a 2D layout, and so on and so forth. So different pages are also very diverse. What happened traditionally was that this naturally led to many specialized models being developed over the years. Some models rank videos, some rank rows, some focus on, for example, shows users have not watched yet, and some focus on shows users are already engaging with.
Many of those models were built independently over the years. They may have different objectives, but they have a lot of overlap as well.
So naturally this led to duplication in our label engineering as well as our feature engineering. Take feature engineering as an example. We have this very commonly used factual data about user interaction history. The fact data is the same, but over the years many features were derived from it: counts of different kinds of actions within various time windows or along other slice-and-dice dimensions; similarity between the titles in the user's history and the target title; and lastly just a sequence of unique show IDs to be used as a sequence feature in the model. This list can go on and on. Because a lot of those features were developed independently for each model, they have slight variations but are largely very similar, so they become very hard to maintain. So the challenge back then was: is this scalable? Obviously not. If we keep expanding our landscape of content types or business use cases, it's not manageable to spin up a new model for each individual use case. There's not much leverage. There are some shared components for building features and labels, but by and large each model was spun up independently. That also impacts our innovation velocity, in the sense that you don't reuse as much as you can; instead you just spin up new models pretty much from scratch. This was the situation about four years ago, at the beginning or middle of the pandemic. So the question we asked at that time was: can we centralize the learning of user representations in one place?
The answer is yes, and we had this key hypothesis about a foundation model based on the transformer architecture. Concretely, there are two hypotheses here. One is that through scaled-up semi-supervised learning, personalization can be improved: the scaling law applies to recommendation systems as it applies to LLMs. The second is that by integrating the foundation model into all systems we can create high leverage: we can improve all the downstream canvas-facing models at the same time. We'll see in the following slides how we validated those hypotheses.
I'll break the overview into two subsections: first data and training, and second application and serving.
So, about data and training. Starting from data, a very interesting aspect of building such a foundation model, an autoregressive transformer, is that there are a lot of analogies, but sometimes also differences, between this and an LLM. So we can transfer a lot of learnings and inspiration from LLM development. Let's start from the very bottom layer, which is basically data cleaning and tokenization. People who work with LLMs understand that tokenization decisions have a profound impact on model quality. Although it's the bottom layer, the decisions you make there percolate through all the downstream layers and manifest as either a model quality problem or a model quality plus. This applies to a recommendation foundation model as well. There are some differences, though. Very importantly, instead of a language token, which is just one ID, when we translate the user interaction history into a sequence, each token is an interaction event from the user. That event has many facets, or many fields, so it's not something you can represent with just one ID; there is a lot of rich information about the event. All of those fields can play a role in the tokenization decision, and I think that's what we need to consider very carefully: what is the granularity of tokenization, and how do we trade that off against the context window, for example. Through many iterations I think we reached the right abstraction and interfaces that we can use to adjust our tokenization to different use cases. For example, you can imagine we have one version of tokenization used for pre-training, and for fine-tuning against a specific application we apply a slightly different tokenization.
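One way to picture the trade-off between tokenization granularity and context window is the sketch below. It is a hypothetical illustration, not Netflix's tokenizer: the `Event` fields, the `merge_adjacent` rule (collapsing consecutive events on the same title and action into one token), and the `max_tokens` budget are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """One user interaction; field names are illustrative assumptions."""
    title_id: int
    action: str          # e.g. "play", "thumbs_up"
    timestamp: int       # seconds since epoch
    duration_s: int = 0  # seconds of engagement


def merge_adjacent(events: List[Event]) -> List[Event]:
    """Coarser tokenization: collapse consecutive events that share
    title and action into one token, summing their durations."""
    tokens: List[Event] = []
    for ev in events:
        last = tokens[-1] if tokens else None
        if last is not None and last.title_id == ev.title_id and last.action == ev.action:
            last.duration_s += ev.duration_s
            last.timestamp = ev.timestamp  # keep the most recent time
        else:
            # Append a copy so the caller's events are never mutated.
            tokens.append(Event(ev.title_id, ev.action, ev.timestamp, ev.duration_s))
    return tokens


def tokenize(events: List[Event], max_tokens: int, coarse: bool) -> List[Event]:
    """Pick a granularity, then keep the most recent max_tokens tokens.
    max_tokens is assumed to be a positive integer."""
    tokens = merge_adjacent(events) if coarse else list(events)
    return tokens[-max_tokens:]


history = [
    Event(10, "play", 100, 600),
    Event(10, "play", 800, 1200),
    Event(42, "play", 2500, 300),
    Event(42, "thumbs_up", 2900),
]
fine = tokenize(history, max_tokens=3, coarse=False)
coarse = tokenize(history, max_tokens=3, coarse=True)
print(fine[0].duration_s, coarse[0].duration_s)
```

With the same three-token budget, the fine-grained version drops the earliest play session, while the coarse version keeps the whole history by merging the two sessions on title 10. That is the kind of decision that could differ between pre-training and a specific fine-tuning use case.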
Moving up from the tokenization layer, we come to the model layers. At a high level, from bottom to top, we go through the event representation, the embedding layer, the transformer layer, and the objective layer. On event representation, as we just briefly touched upon, there is a lot of information in the event, but at a high level you can break it down by when, where, and what. When the event happened is about time encoding. Where it happened is about the physical location, your locale, country, and so forth, but also about the device and the canvas: which row and which page the action happened on. And what is basically about the target entity, or title: which title you interacted with, what the interaction was, how long it lasted, and any other information of that kind associated with the action.
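The when / where / what breakdown can be sketched as a small feature encoder. This is only an illustration under assumptions: the vocabularies, the dictionary keys, and the cyclical hour-of-day encoding are made up for the example, and the title ID is left out because it would normally go through an embedding table rather than a hand-built feature.

```python
import math
from typing import Dict, List

# Hypothetical vocabularies; a real system would learn embeddings for these.
DEVICES = ["tv", "mobile", "web"]
ACTIONS = ["play", "thumbs_up", "add_to_list"]


def one_hot(value: str, vocab: List[str]) -> List[float]:
    """1.0 at the position of `value` in `vocab`, 0.0 elsewhere."""
    return [1.0 if value == v else 0.0 for v in vocab]


def encode_when(timestamp: int) -> List[float]:
    """Cyclical encoding of the (UTC) hour of day, so 23:00 and 00:00 are close."""
    hour = (timestamp // 3600) % 24
    angle = 2 * math.pi * hour / 24
    return [math.sin(angle), math.cos(angle)]


def encode_event(event: Dict) -> List[float]:
    """Concatenate the when, where, and what parts of one event."""
    when = encode_when(event["timestamp"])
    where = one_hot(event["device"], DEVICES) + [float(event["row_index"])]
    what = one_hot(event["action"], ACTIONS) + [event["duration_s"] / 3600.0]
    return when + where + what


vec = encode_event({
    "timestamp": 6 * 3600,   # 06:00 on day zero
    "device": "tv",
    "row_index": 2,
    "action": "play",
    "duration_s": 1800,
})
print(len(vec), vec)
```

Which of these fields to keep or drop is exactly the design decision the talk describes; the sketch only shows where each kind of field would enter.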
So that's where we need to decide what information to keep, what to drop, and so forth. Moving one layer up, to the embedding and feature transformation layer, one thing that needs to be pointed out is that for recommendation we need to combine ID embedding learning with other semantic content information. If you only have ID embeddings learned from scratch in the model, then you have a cold-start problem, meaning that for titles the model hasn't seen during training, it doesn't know how to deal with them at inference time. So we need semantic content information to be complementary to those ID embeddings. This is not a problem for LLMs, but the cold-start problem is very commonly encountered in recommendation systems.
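A minimal sketch of the cold-start idea follows. It assumes, purely for illustration, that the semantic content vector has already been projected to the same dimension as the ID embedding and that the two are blended with a fixed `mix` weight. The talk does not say how the combination is actually done, so the blending rule and all names here are hypothetical.

```python
from typing import Dict, List

# Learned ID embeddings exist only for titles seen in training (toy values).
id_embeddings: Dict[int, List[float]] = {
    101: [0.2, -0.1, 0.4, 0.0],
}


def title_embedding(title_id: int, semantic: List[float], mix: float = 0.5) -> List[float]:
    """Blend ID and semantic embeddings; fall back to the semantic
    vector alone when the title was never seen during training."""
    id_vec = id_embeddings.get(title_id)
    if id_vec is None:
        return list(semantic)  # cold start: no learned ID embedding
    return [mix * i + (1.0 - mix) * s for i, s in zip(id_vec, semantic)]


semantic_vec = [1.0, 0.0, 0.0, 1.0]
seen = title_embedding(101, semantic_vec)    # blend of both sources
unseen = title_embedding(999, semantic_vec)  # semantic-only fallback
print(seen, unseen)
```

The point of the fallback branch is that a brand-new title still gets a usable representation at inference time, instead of an untrained or missing ID embedding.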
On the transformer layer, I think there's no need to go too much into architecture choices, optimization, and so on. The only thing I want to point out is that we use the hidden state output from this layer as our user representation; one of the primary goals of the foundation model is to learn a good long-term user representation. Then we need to put this into context. Things to consider are, for example, how stable our user representation is, given that the user profile and interaction history keep changing. How do we guarantee or maintain the stability of that representation? And what kind of aggregation should we use? You can think broadly of aggregating across the time dimension, that is the sequence dimension, or aggregating across the layers: you have multiple self-attention layers, so how do you aggregate them? And lastly, do we need to explicitly adapt the representation to our downstream objective by fine-tuning it?
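The two aggregation axes mentioned here, across sequence positions and across layers, can be illustrated with plain averaging. Mean pooling is an assumption chosen for simplicity; the talk does not state which aggregation is used, and `last_n` and `top_k` are hypothetical knobs.

```python
from typing import List

Vector = List[float]


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pool_over_time(layer_states: List[Vector], last_n: int) -> Vector:
    """Average the hidden states of the last_n sequence positions."""
    return mean_vectors(layer_states[-last_n:])


def pool_over_layers(hidden: List[List[Vector]], top_k: int, last_n: int) -> Vector:
    """Average the time-pooled states across the top_k layers.
    hidden is indexed as hidden[layer][position][dim]."""
    return mean_vectors([pool_over_time(layer, last_n) for layer in hidden[-top_k:]])


# Toy hidden states: 2 layers x 3 positions x 2 dims.
hidden = [
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
]
user_repr = pool_over_layers(hidden, top_k=2, last_n=2)
print(user_repr)
```

Averaging over several recent positions rather than taking only the last one is one plausible way to make the representation less sensitive to a single new event, which relates to the stability concern raised above.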
Then we move to the very top layer, the objective or loss function. This is also very interesting in the sense that it's much richer than for an LLM. First, instead of one sequence we use multiple sequences to represent the output. You can have a sequence of entity IDs; that's your next-token prediction, with softmax or sampled softmax. But then there are many other facets or fields of each event that can also be used as targets. It could be things like the action type. It could be some aspect of the entity's metadata, like entity type, genre, language, and so on. It could also be about the action itself, like predicting the duration, the device where the action happens, or the time when the next user play will happen. Those are all legitimate targets or labels, and depending on your use case you can use them for fine-tuning. You can cast the problem as a multitask learning problem, with multi-head or hierarchical prediction, but you can also use them just as weights, rewards, or masks on the loss function, in order to adapt the model to zoom into one subcategory of user behavior that you want the model to learn. Okay, so that's what I wanted to say about the model architecture.
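Both options described above, multiple prediction heads and per-event weights or masks, fit into one small loss function. The sketch below is a toy illustration under assumptions: the three heads (`entity`, `action`, `duration`), their weights, and the normalization by the sum of event weights are choices made for the example, not details from the talk.

```python
import math
from typing import Dict, List


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])


def event_loss(pred: Dict, label: Dict, head_weights: Dict[str, float]) -> float:
    """Weighted sum of per-head losses for one event (multi-head view)."""
    loss = head_weights["entity"] * cross_entropy(pred["entity"], label["entity"])
    loss += head_weights["action"] * cross_entropy(pred["action"], label["action"])
    loss += head_weights["duration"] * (pred["duration"] - label["duration"]) ** 2
    return loss


def sequence_loss(preds: List[Dict], labels: List[Dict],
                  head_weights: Dict[str, float], event_weights: List[float]) -> float:
    """event_weights acts as a reward or mask: 0.0 removes an event from the loss."""
    total = sum(w * event_loss(p, l, head_weights)
                for p, l, w in zip(preds, labels, event_weights))
    norm = sum(event_weights)
    return total / norm if norm > 0 else 0.0


head_weights = {"entity": 1.0, "action": 0.5, "duration": 0.1}
preds = [
    {"entity": [0.5, 0.5], "action": [0.25, 0.75], "duration": 0.5},
    {"entity": [0.9, 0.1], "action": [0.6, 0.4], "duration": 2.0},
]
labels = [
    {"entity": 0, "action": 1, "duration": 1.0},
    {"entity": 1, "action": 0, "duration": 1.5},
]
# Mask out the second event, e.g. to focus on one subcategory of behavior.
masked = sequence_loss(preds, labels, head_weights, event_weights=[1.0, 0.0])
print(masked)
```

Setting an event weight to a value other than 0 or 1 turns the same mechanism into a reward-style weighting.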
So, does it scale? The first question, part of the first hypothesis we wanted to answer, is whether a scaling law applies, and I think the answer is yes. This is over roughly two to two and a half years. We were scaling up and we constantly still saw gains, from only on the order of 10 million profiles, or a few million profiles, to now on the order of 1 billion model parameters. We scaled up the data accordingly. We stop here for now; we could still keep going, but as you may realize, recommendation systems usually have much more stringent latency and cost requirements, so scaling up further requires us to also distill back. But certainly I think this is not the end of the scaling law.
Before wrapping up the data and training discussion, I'd like to highlight some learnings we borrowed from LLMs that I think are quite interesting. This is not an exhaustive list, but these are the top three for me. The first is multi-token prediction. You may have seen this in the DeepSeek paper and elsewhere. Implementation-wise you can use multi-head, multi-label, and other implementation flavors, but the goal is really to force the model to be less myopic and more robust to serving-time shift, because you have a time gap between training and serving, and also to force the model to target long-term user satisfaction and long-term user behavior instead of just focusing on the next action. We have observed very notable metric improvements by doing that.
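To make multi-token prediction concrete, the sketch below builds training examples in which each position is paired with the next several item IDs instead of only the next one. It corresponds to the multi-label flavor mentioned above; the function name and the `horizon` parameter are assumptions for the illustration.

```python
from typing import List, Tuple


def multi_token_targets(sequence: List[int], horizon: int) -> List[Tuple[List[int], List[int]]]:
    """For each position t, pair the prefix sequence[:t+1] with the
    next `horizon` item IDs (fewer near the end of the sequence)."""
    examples = []
    for t in range(len(sequence) - 1):
        context = sequence[: t + 1]
        targets = sequence[t + 1 : t + 1 + horizon]
        examples.append((context, targets))
    return examples


examples = multi_token_targets([1, 2, 3, 4], horizon=2)
for context, targets in examples:
    print(context, "->", targets)
```

With `horizon=1` this reduces to ordinary next-item prediction. A larger horizon asks the model to account for what the user does over a longer stretch, which is the "less myopic" behavior described in the talk.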
The second is multi-layer representation, which I touched upon when discussing the profile representation. This is also translated from LLM-side techniques: layer-wise supervision, self-distillation, and multi-layer output aggregation. The goal here is really to get a better and more stable user representation.
Lastly, and this should also be no surprise, long context window handling: from truncated sliding windows, to sparse attention, to progressively training on longer and longer sequences, and eventually all of the parallelism strategies. This is about more efficient training and maximizing the learning.
Okay, so let's shift gears and talk about serving and applications.
Before the foundation model, FM, this was roughly the algo stack we had for personalization: many data sources, many features, and many independently developed models, each serving one or multiple canvases, or applications as we call them.
Now, with the foundation model, we largely consolidate the data and representation layer, especially the user representation as well as the content representation in the personalization domain. The model layer is consolidated as well, because each application model is now built on top of the FM, so it becomes a thinner layer instead of a standalone, full-fledged model trained from scratch.
So how do the various models utilize the foundation model? There are three main approaches, or consumption patterns. The first is that the foundation model can be integrated as a subgraph within the downstream model. Additionally, the content embeddings learned by the foundation model can be integrated as embedding lookup layers. The downstream model is a neural network that may already have some sequence transformer tower or graph, and a pre-trained foundation model subgraph can be used to directly replace that.
The second is that we can push out embeddings. This is no surprise: both content-side entity embeddings and member embeddings. The main concern here of course is how frequently we want to refresh the member embeddings and how we make sure they are stable. We push them to the centralized embedding store. This of course allows far wider use cases than just personalization, because analytics people and data scientists can also just fetch those embeddings directly to do the things they want.
Finally, users can extract the model and fine-tune it for specific applications, or distill it to meet online serving requirements, especially for applications with very strict latency requirements.
To wrap up, I want to show at a high level the wins we accumulated over the last year and a half by incorporating the FM into various places. The blue bars represent how many applications have the FM incorporated. The green bars represent the A/B test wins, because in any application we may have multiple A/B tests going on that produce wins. So we indeed see high leverage from the FM, bringing about both A/B test wins and infrastructure consolidation.
So I think the big bets are validated. It is a scalable solution, both in terms of scaling up the model with improved quality and in terms of consolidating the whole infrastructure and making it much easier to scale to new applications. It gives high leverage because it's centralized learning. Innovation velocity is also faster, because we allow a lot of newly launched applications to directly fine-tune the foundation model to launch their first experience.
As for current directions, one is that we want a universal representation for heterogeneous entities. This is, as you can guess, semantic IDs and work along those lines, because we want to cover the very different, very heterogeneous content types Netflix is expanding into. The second is generative retrieval for collection recommendation. Instead of just recommending a single video, we can be generative at inference and serving time. Because you have multi-step decoding, a lot of considerations, such as business rules or diversity, can be handled naturally in the decoding process. Lastly, faster adaptation through prompt tuning. This is also borrowed from LLMs: can we just train some soft tokens, so that at inference time we can directly swap the soft tokens in and out to prompt the FM to behave differently? That is also a very promising direction that we are getting into.
All right, that concludes my talk. Thank you for your attention and questions.
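As an illustration of the generative retrieval direction mentioned in the talk, the sketch below shows how a business or diversity rule can be enforced inside multi-step decoding. It is a toy under assumptions: greedy decoding, a per-genre cap as the rule, and a `score_fn` standing in for the model's score of a candidate given the titles chosen so far. None of these details come from the talk.

```python
from typing import Callable, Dict, List


def decode_collection(
    score_fn: Callable[[List[int], int], float],
    candidates: List[int],
    genre_of: Dict[int, str],
    size: int,
    max_per_genre: int,
) -> List[int]:
    """Greedy multi-step decoding: at each step pick the best-scoring
    candidate that does not violate the per-genre cap."""
    chosen: List[int] = []
    genre_counts: Dict[str, int] = {}
    for _ in range(size):
        allowed = [
            c for c in candidates
            if c not in chosen and genre_counts.get(genre_of[c], 0) < max_per_genre
        ]
        if not allowed:
            break  # the rule leaves nothing to pick
        best = max(allowed, key=lambda c: score_fn(chosen, c))
        chosen.append(best)
        genre_counts[genre_of[best]] = genre_counts.get(genre_of[best], 0) + 1
    return chosen


# Toy scores; a real score_fn would condition on the chosen prefix.
base_scores = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6}
genres = {1: "comedy", 2: "comedy", 3: "comedy", 4: "action"}
collection = decode_collection(
    score_fn=lambda prefix, c: base_scores[c],
    candidates=[1, 2, 3, 4],
    genre_of=genres,
    size=3,
    max_per_genre=2,
)
print(collection)
```

Ranking by score alone would return three comedies here. Filtering the candidates at each decoding step produces a collection that already respects the cap, with no separate post-processing pass.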
Thank you. If you have any questions, may I invite you to come to the mics in front while we get our next speakers from Mr. K.
Hi, thank you for the talk. Since you have billions of users, maybe it can do much more than the recommendation system, right? What's your thought on that? Since I could just ask it to predict who's the next president of the United States. Thank you.
So, could you explain a little bit what you mean by beyond recommendation? Do you mean other personalization, or other things?
Yeah. Since you're kind of predicting users' preferences, those preferences are also linked to what things they buy or who they will vote for as the next president. So do you think your foundation model has the capability to expand beyond recommending what videos they want to watch, to what else they like or what their opinions are on anything else? Thank you.
Yes. I think we are expanding to different entity types and also capturing users' tastes both on and off our platform. I think that's the general trend we're going toward.
Yes.
Okay.
Great, thank you. This was really helpful. A question, and you might not be able to share for IP reasons, but whatever you can: thoughts on graph models? I didn't hear a lot about that in your talk. Graphs and reinforcement learning: any utilization there, any benefits you saw, any boost in performance and accuracy?
Yeah, that's a very good question. We actually have a dedicated sub-team doing graph models, especially around our knowledge graph, to cover the content space both on and off our platform, across the entire entertainment ecosystem. We actually use a lot of embeddings from the graph model, for example, for cold start. Where I showed those semantic embeddings, that's where they come from. In terms of reinforcement learning, yes as well, especially because the rewards we get from users' actions are pretty sparse, but we want to use them to guide how, for example, we generate the whole collection. That's where we need to consider how to use those rewards to guide that process.
Yeah.
One more question. I'm sorry.
Can I ask you a two-part question?
Sure. I will be here, so we can also follow up. Yeah.
Do you also use these unified representations as embedding features for downstream models? You had a slide on how you use the unified model.
Yeah. The embeddings learned within our model are also exposed downstream to be consumed directly. And to train our unified embeddings we also have some upstream inputs, for example the GNN embeddings; those are also consumed to do that.
Last one is it fast?
Hello. In these embeddings, are you just using metadata about the video to understand what they like, or are you actually using something like frame-by-frame content of the video, or second-long clips?
Not yet. We do have that from some other content group in our organization, and I think the trend will go there. But we are not yet at a very granular clip level or view level. We have those embeddings but have not quite incorporated them yet. Yep.
Thank you. Thank you. Please, another round of applause for