Netflix's Big Bet: One model to rule recommendations: Yesu Feng, Netflix
Channel: aiDotEngineer
Published at: 2025-07-16
YouTube video id: AbZ4IYGbfpQ
Source: https://www.youtube.com/watch?v=AbZ4IYGbfpQ
Good afternoon. Thank you, Eugene, for the introduction. Today I'm going to share our big bet at Netflix on personalization, namely to use one foundation model to cover all the recommendation use cases.
At Netflix we have diverse recommendation needs. This is an example homepage for one profile on Netflix. It's a 2D layout of rows and items. The diversity comes at at least three levels. The first level is the row. We have diverse rows: genre rows, for example rows of comedies or rows of action movies; rows of new, trending, or just-released titles; and rows of, for example, titles only available on Netflix. That's the first dimension. The second dimension is, of course, the items or entities. In addition to the traditional movies and TV shows, we now have games and live streaming, and we are going to add more, so our content space is expanding to very heterogeneous content types. The third level is the page. We have the homepage, the search page, and a kids homepage, which is tailored very differently, toward kids' interests. The mobile feed is a linear page, not a 2D layout. And so on and so forth. So the different pages are also very diverse.
What happened traditionally was that this naturally led to many specialized models being developed over the years. Some models rank videos, some rank rows, some focus on, for example, shows users have not watched yet, and some focus on shows users are already engaging with. Many of those models were built independently over the years. They may have different objectives but have a lot of overlap as well. So naturally this led to duplication in our label engineering as well as our feature engineering. Take feature engineering as an example. We have this very commonly used fact data about user interaction history.
The fact data is the same, but over the years many features were derived from it: counts of different kinds of actions within various time windows or along other slice-and-dice dimensions; similarity between the titles in the user's history and the target title; and, lastly, just a sequence of unique show IDs to be used as a sequence feature into the model. This list can go on and on. Because those features were developed independently in each model, they have slight variations but are largely very similar, so they become very hard to maintain.
So the challenge back then was: is this scalable? Obviously not. If we keep expanding our landscape of content types or business use cases, it's not manageable to spin up new models for each individual use case. There's not much leverage. There are some shared components for building the features and labels, but by and large each model is spun up independently. That also impacts our innovation velocity, in the sense that you don't reuse as much as you can; instead you just spin up new models pretty much from scratch.
This was the situation about four years ago, at the beginning or middle of the pandemic. The question we asked at that time was: can we centralize the learning of user representations in one place? The answer is yes, and we had a key hypothesis about a foundation model based on the transformer architecture. Concretely, there are two hypotheses. The first is that personalization can be improved by scaling up semi-supervised learning: the scaling law applies to recommendation systems as it applies to LLMs. The second is that by integrating the foundation model into all systems we can create high leverage: we can improve all the downstream canvas-facing models at the same time.
We'll see in the following slides how we validated those hypotheses. I'll break the overview into two subsections: first data and training, and second application and serving.
Starting from data: a very interesting aspect of building such a foundation model, an autoregressive transformer, is that there are a lot of analogies, but sometimes also differences, between this and LLMs. So we can transfer a lot of learnings and inspiration from LLM development. Start from the very bottom layer, which is basically data cleaning and tokenization. People who work with LLMs understand that tokenization decisions have a profound impact on model quality. Although it's the bottom layer, the decisions you make there can percolate through all the downstream layers and manifest as either a model quality problem or a model quality plus. This applies to a recommendation foundation model as well.
There are some differences, though. Very importantly, a language token is just one ID, whereas if we want to translate the user interaction history into a sequence, each token is an interaction event from the user, and that event has many facets or fields. So it's not just one ID you can represent; there is a lot of rich information about the event. All of those fields can play a role in the tokenization decision, and I think that's what we need to consider very carefully: what is the granularity of tokenization, and how do we trade that off against the context window, for example? Through many iterations we reached what I think is the right abstraction and interfaces, which we can use to adjust our tokenization to different use cases. For example, you can imagine we have one version of tokenization used for pre-training, and for fine-tuning against a specific application we apply a slightly different tokenization.
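As a rough sketch of the tokenization trade-off described above, the following Python shows one way an interaction event with several fields could be turned into a composite token, with the set of fields (the granularity) chosen per use case. Everything here is an illustrative assumption rather than Netflix's implementation: the `InteractionEvent` fields, the duration buckets, and the `PRETRAIN_FIELDS` and `FINETUNE_FIELDS` configurations are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class InteractionEvent:
    """One user interaction. Field names are hypothetical, not Netflix's schema."""
    title_id: int
    action: str        # e.g. "play", "thumbs_up"
    duration_s: float  # seconds of engagement
    device: str
    timestamp: int     # Unix seconds


def bucket_duration(seconds: float) -> str:
    """Coarsen a continuous duration into a small vocabulary (assumed buckets)."""
    if seconds < 60:
        return "short"
    if seconds < 1800:
        return "medium"
    return "long"


def tokenize(event: InteractionEvent, fields: Sequence[str]) -> tuple:
    """Build one composite token from the chosen fields of an event."""
    values = {
        "title": event.title_id,
        "action": event.action,
        "duration": bucket_duration(event.duration_s),
        "device": event.device,
    }
    return tuple(values[name] for name in fields)


def tokenize_history(events: Sequence[InteractionEvent],
                     fields: Sequence[str],
                     context_window: int) -> List[tuple]:
    """Order events by time, tokenize, and keep the most recent tokens."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    tokens = [tokenize(e, fields) for e in ordered]
    return tokens[-context_window:]


# Two hypothetical granularities: coarse for pre-training, finer for fine-tuning.
PRETRAIN_FIELDS = ("title", "action")
FINETUNE_FIELDS = ("title", "action", "duration", "device")

history = [
    InteractionEvent(101, "play", 2400.0, "tv", 1_700_000_300),
    InteractionEvent(7, "play", 45.0, "phone", 1_700_000_100),
    InteractionEvent(42, "thumbs_up", 0.0, "tv", 1_700_000_200),
]
pretrain_tokens = tokenize_history(history, PRETRAIN_FIELDS, context_window=2)
finetune_tokens = tokenize_history(history, FINETUNE_FIELDS, context_window=2)
```

With a richer field set each token carries more information about the event, but the number of distinct tokens grows, which is one face of the granularity versus context-window trade-off mentioned in the talk.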
Moving up from the tokenization layer, we come to the model layers. At a high level, from bottom to top, we go through the event representation, the embedding layer, the transformer layer, and the objective layer.
Event representation, as we just briefly touched upon: there is a lot of information in an event, but at a high level you can break it down by when, where, and what. When the event happened is about time encoding. Where it happened is about the physical location, your locale, country, and so forth, but also about the device and about the canvas, meaning which row and which page the action happened on. What is about the target entity or title: which title you interacted with, what the interaction was, how long it lasted, and any other information of that kind associated with the action. That's where we need to decide what information to keep, what to drop, and so forth.
Moving one layer up, to the embedding and feature transformation layer: one thing that needs to be pointed out is that for recommendation we need to combine ID embedding learning with other semantic content information. If you only have ID embeddings learned from scratch in the model, you have a problem with cold start, meaning that the model doesn't know how to deal at inference time with titles it hasn't seen during training. So we need semantic content information to be complementary to those ID embeddings. This is not a problem for LLMs, but cold start is very commonly encountered in recommendation systems.
On the transformer layer, I think there's no need to say much about architecture choices, optimization, and so on. The only thing I want to point out is that we use the hidden-state output from this layer as our user representation; learning a good long-term user representation is one of the primary goals of the foundation model. We then need to put this into context.
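The cold-start point made above, combining learned ID embeddings with semantic content information, can be illustrated with a small NumPy sketch. The talk does not say how the two are combined; summing a learned ID embedding with a linear projection of content features, and falling back to the projection alone for unseen titles, is an assumption made here for illustration. The table sizes and the `title_embedding` helper are hypothetical, and random values stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, CONTENT_DIM, NUM_KNOWN_TITLES = 8, 16, 1000

# Stand-ins for learned parameters (random here, trained in a real model).
id_table = rng.normal(size=(NUM_KNOWN_TITLES, EMBED_DIM))
content_proj = rng.normal(size=(CONTENT_DIM, EMBED_DIM)) / np.sqrt(CONTENT_DIM)


def title_embedding(title_id, content_features):
    """Combine an ID embedding with a projected semantic content embedding.

    Titles without a trained ID embedding (title_id is None or out of range)
    fall back to the semantic part alone, so they still get a usable vector.
    """
    semantic = content_features @ content_proj
    if title_id is None or not (0 <= title_id < NUM_KNOWN_TITLES):
        return semantic
    return id_table[title_id] + semantic


content_features = rng.normal(size=CONTENT_DIM)  # e.g. from a content encoder
known_title_vec = title_embedding(3, content_features)
new_title_vec = title_embedding(None, content_features)  # cold-start title
```

In this sketch a brand-new title lands in the same embedding space as established ones, positioned by its content alone until interaction data allows an ID embedding to be learned.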
For that user representation, things to consider are, for example: how stable is it, given that our user profiles and user interaction histories keep changing? How do we guarantee or maintain the stability of that representation? And what kind of aggregation should we use? Broadly, you can aggregate across the time dimension, that is, the sequence dimension, or aggregate across the layers: you have multiple self-attention layers, so how do you aggregate them? And lastly, do we need to explicitly adapt the representation to our downstream objective by fine-tuning it?
Then we move to the very top layer, the objective or loss function. This is also very interesting, in the sense that it's much richer than for an LLM. First, we use not one sequence but multiple sequences to represent the output. You can have a sequence of entity IDs; that's your next-token prediction, with softmax or sampled softmax. But then we have many other facets or fields of each event that can also be used as targets. It could be the action type. It could be some aspect of the entity's metadata, like entity type, genre, language, and so forth. It could also be about your action, like predicting the duration, the device where the action happens, or the time when the next user play will happen. Those are all legitimate targets or labels, and depending on your use case you can use them for fine-tuning. You can cast the problem as a multi-task learning problem, with multi-head or hierarchical prediction, but you can also use them just as weights, rewards, or masks on the loss function, to adapt the model by zooming in on one subcategory of user behavior you want the model to learn.
That's what I wanted to cover about the model architecture. So, does it scale?
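Before the scaling results, here is a minimal NumPy sketch of the objective layer as described above: a next-entity cross-entropy combined with one auxiliary head (action type), plus per-position weights that act as a reward or mask on the loss. The choice of a single auxiliary head, the `action_task_weight` value, and the weighted-mean normalization are assumptions for illustration, not details given in the talk.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(logits, targets):
    """Per-position cross-entropy. logits: (T, V), targets: (T,) integer IDs."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets]


def multitask_loss(entity_logits, entity_targets,
                   action_logits, action_targets,
                   position_weights, action_task_weight=0.5):
    """Weighted mean of next-entity loss plus an auxiliary action-type loss.

    position_weights (T,) plays the role of a reward or mask: 0 removes a
    position from the loss, larger values emphasize it.
    """
    per_position = (cross_entropy(entity_logits, entity_targets)
                    + action_task_weight
                    * cross_entropy(action_logits, action_targets))
    total_weight = position_weights.sum()
    if total_weight == 0:
        return 0.0
    return float((per_position * position_weights).sum() / total_weight)


# Toy example: 3 positions, 4 candidate entities, 2 action types.
entity_logits = np.zeros((3, 4))
action_logits = np.zeros((3, 2))
entity_targets = np.array([0, 1, 2])
action_targets = np.array([0, 1, 0])
weights = np.array([1.0, 0.0, 2.0])  # middle position is masked out
loss = multitask_loss(entity_logits, entity_targets,
                      action_logits, action_targets, weights)
```

Setting a weight to zero is the mask case; setting weights from, say, engagement duration would be one way to use an event field as a reward rather than as a separate prediction head.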
The first question, part of the first hypothesis we want to answer, is: does a scaling law apply? I think the answer is yes. This is over the roughly two to two and a half years we were scaling up, and we consistently still see gains, from only on the order of 10 million, or a few million, profiles to now on the order of 1 billion model parameters. We scale up the data accordingly. We stop here, although we could still keep going, because, as you may realize, recommendation systems usually have much more stringent latency and cost requirements. Scaling up more requires us to also distill back. But certainly I think this is not the end of the scaling law.
Before wrapping up the data and training discussion, I'd like to highlight some learnings we borrowed from LLMs that I think are quite interesting. This is not an exhaustive list, but these are the top three that are very interesting to me.
The first is multi-token prediction. You may have seen this in the DeepSeek paper and elsewhere. Implementation-wise you can use multi-head, multi-label, or other flavors, but the goal is really to force the model to be less myopic and more robust to serving-time shift, because you have a time gap between training and serving, and also to force the model to target long-term user satisfaction and long-term user behavior instead of focusing only on the next action. We have observed a very notable metrics improvement by doing that.
The second is multi-layer representation, which I touched upon when discussing the profile representation. This is also translated from LLM-side techniques: layer-wise supervision, self-distillation, and multi-layer output aggregation. The goal here is really to make a better and more stable user representation.
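The multi-layer output aggregation just mentioned, together with the earlier question of aggregating across the sequence dimension, can be sketched in NumPy as follows. The talk does not specify which aggregation is used; a weighted average over layers followed by a mean over the most recent positions is only one plausible choice, and the `user_representation` helper, its `layer_weights`, and `last_k` are hypothetical.

```python
import numpy as np


def user_representation(hidden_states, layer_weights, last_k):
    """Pool transformer hidden states into one user vector.

    hidden_states: (num_layers, seq_len, dim) outputs of each layer.
    layer_weights: one non-negative weight per layer (normalized here).
    last_k: number of most recent sequence positions to average over.
    """
    weights = np.asarray(layer_weights, dtype=float)
    weights = weights / weights.sum()
    # Weighted average across layers gives shape (seq_len, dim).
    across_layers = np.tensordot(weights, hidden_states, axes=(0, 0))
    # Mean over the most recent positions gives shape (dim,).
    return across_layers[-last_k:].mean(axis=0)


# Toy example: 2 layers, 4 positions, 2 dimensions.
toy_hidden = np.stack([np.zeros((4, 2)), np.ones((4, 2))])
toy_rep = user_representation(toy_hidden, layer_weights=[1.0, 3.0], last_k=2)
```

Averaging over several layers and positions, rather than reading only the final position of the final layer, is one simple way to make the vector less sensitive to any single recent event.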
Lastly, and this should also be no surprise, long-context-window handling: from truncated sliding windows to sparse attention, to progressively training on longer and longer sequences, and eventually to all of the parallelism strategies. This is about more efficient training and maximizing the learning.
Okay, shifting gears to talk about serving and applications. Before the foundation model (FM), this was roughly the algo stack we had for personalization: many data sources, many features, and many models, independently developed, each serving one or multiple canvases, or applications, as we call them. Now, with the foundation model, we largely consolidate the data and representation layer, especially the user representation as well as the content representation in the personalization domain. The model layer is consolidated as well, because each application model is now built on top of the FM, so it becomes a thinner layer instead of a standalone, full-fledged model trained from scratch.
So how do the various models utilize the foundation model? There are three main approaches, or consumption patterns. The first is that the foundation model can be integrated as a subgraph within the downstream model. Additionally, the content embeddings learned by the foundation model can be integrated as embedding lookup layers. The downstream model is a neural network that may already have some sequence transformer tower or graph, and the pre-trained foundation model subgraph directly replaces it. The second is that we can push out embeddings. This is no surprise: both content or entity embeddings and member embeddings. The main concern here, of course, is how frequently we want to refresh the member embeddings and how we make sure they are stable. We push them to the centralized embedding store.
This of course allows far wider use cases than just personalization, because people in analytics and data science can also fetch those embeddings directly to do the things they want. Finally, users can extract the model and fine-tune it for specific applications; they either fine-tune it or distill it to meet the online serving requirements, especially for applications with very strict latency requirements.
To wrap up, I want to show at a high level the wins we accumulated over the last year and a half by incorporating the FM into various places. The blue bars represent how many applications have the FM incorporated. The green bars represent the A/B test wins, because in any application we may have multiple A/B tests producing wins. We do indeed see high leverage from the FM, bringing about both A/B test wins and infrastructure consolidation. So I think the big bets are validated. It is a scalable solution, both in the sense that scaling up the model improves quality and in the sense that the whole infrastructure is consolidated and scaling to new applications is much easier. It is high leverage, because it's centralized learning. Innovation velocity is also faster, because we allow many newly launched applications to fine-tune the foundation model directly to launch their first experience.
As for current directions: one is that we want a universal representation for heterogeneous entities. As you can guess, this is semantic IDs and work along those lines, because we want to cover this as Netflix expands to very different, very heterogeneous content types.
The second is generative retrieval for collection recommendation. Instead of just recommending a single video, we can be generative at inference and serving time; because you have multi-step decoding, a lot of considerations, such as business rules or diversity, can be handled naturally in the decoding process. Lastly, faster adaptation through prompt tuning. This is also borrowed from LLMs: can we just train some soft tokens, so that at inference time we can directly swap the soft tokens in and out to prompt the FM to behave differently? That is also a very promising direction we are getting into. All right, that concludes my talk. Thank you for your attention. Questions?
Thank you. If you have any questions, may I invite you to come to the mics in front while we get our next speakers from Mr. K.
Hi, thank you for the talk. Since you have billions of users, beyond the recommendation system it can maybe do much more, right? What are your thoughts on that? Since I could just ask it to predict who's the next president of the United States. Thank you.
Could you explain a little what you mean by beyond recommendation? Do you mean other personalization, or other things?
Yeah. Since you have billions of users' preferences, those preferences also lean toward what things they buy or who they will vote for as the next president. So do you think your foundation model has the capability to expand beyond recommending what videos they want to watch, to what else they like or what their opinions are on anything else? Thank you.
Yes. I think we are expanding to different entity types and also capturing users' tastes both on and off our platform. I think that's the general trend we're going toward.
Okay, great. Thank you.
This was really helpful. A question, and you might not be able to share it for IP reasons, but whatever you can.
Thoughts on graph models? I didn't hear a lot about that in your talk. Graphs and reinforcement learning: any utilization there, any benefits you saw, any boost in performance and accuracy?
Yeah, that's a very good question. We actually have a dedicated sub-team doing graph models, especially around our knowledge graph, to cover the content space both on and off our platform, across the entire entertainment ecosystem. We actually use a lot of embeddings from the graph model, for example, for cold start; the semantic embeddings I showed, that's where they come from. In terms of reinforcement learning, yes as well, especially where we consider sparse rewards. The rewards we get from users' actions are pretty sparse, but we want to use them to guide, for example, how we generate the whole collection. That's where we need to consider how to use those rewards to guide that process.
One more question. I'm sorry, can I ask a two-part question?
Sure. I'll be here, so we can also follow up.
Do you also use these unified representations as embedding features for downstream models? You had a slide on how you use the unified model.
Yeah. The embeddings learned within our model we also expose downstream, for models to consume directly. To train our unified embeddings we also have some upstream inputs, for example the GNN embeddings; those are also consumed to do that.
Last one. Is it fast?
Hello. In these embeddings, are you just using metadata about the video to understand what they like, or are you actually using the video frame by frame, or in second-long clips?
Not yet. We do have that from another content group in our organization, and I think the trend will go there. But we are not yet at a very granular level, like clip level or view level.
We have those embeddings but have not quite incorporated them yet.
Yep. Thank you.
Thank you. Please, another round of applause for [Music]