Teaching Gemini to Speak YouTube: Adapting LLMs for Video Recommendations to 2B+DAU - Devansh Tandon

Channel: aiDotEngineer

Published at: 2025-07-16

YouTube video id: LxQsQ3vZDqo

Source: https://www.youtube.com/watch?v=LxQsQ3vZDqo

There's a lot of attention on how LLMs are going to transform search. Google Search is having a revolution, ChatGPT has a big chat interface, and Perplexity is a product a lot of people use. But I think recommendations is probably a bigger problem that is underhyped, because it's transparent to the user. I think the application of LLMs to recommendations is going to be a bigger consumer application than search. In this talk, I want to introduce the problem of YouTube recommendations, talk about how we've built large recommender models by adapting Gemini for YouTube, explain how we build semantic ID and how we're using it, and then end with a recipe for how you might use an LLM to build a recommendation system. To start with why this is important: who here watches YouTube every day?
It's one of the biggest consumer apps in the world, and a large majority of the watch time on YouTube is driven by the recommendation system. We serve recommendations across Home and Watch Next, we have a big Shorts product, and even a lot of our search results are personalized in some way. So if you think about consumer applications of LLMs, in terms of consumer engagement and impact, recommendations is going to be a much bigger application than search. And this is true of any consumer app with a billion DAU.
The way I think about the recommendation problem is that you're trying to learn a function: you get a user and their context as input, and you're trying to give them a set of recommendations. At YouTube we have a lot of user information, like their demographics, their age, their gender, where they're located. We have a lot of context about them: what are the last 100 videos they watched, how deeply did they engage with them, what did they comment on, who are they subscribed to? And we use all of that to make video recommendations. We've tried a lot of different modeling techniques here: multi-headed rankers, embedding models, sequence-to-sequence transformers; there's a long history. About two years ago, we started thinking about how we could rethink this recommendation system on top of Gemini, which has been making incredible progress in modeling. How can we adapt that for YouTube?
So we've built a system we call LRM, the large recommender model, where we adapt Gemini for recommendations. We start with a base Gemini checkpoint and adapt it for YouTube recommendations, teaching it a lot of information about YouTube to get a unified, YouTube-specific checkpoint of Gemini, which we call LRM. Then we can align it for different recommendation-related tasks like retrieval and ranking, and basically make a small custom version of this model for each of the major recommendation surfaces. This model has been launched in production at YouTube for a while on the retrieval side, and we're experimenting a lot on the ranking side.
So I want to start by explaining how we built this YouTube-adapted Gemini model, and then we'll talk about how we use it for retrieval.
The first step for this kind of model is developing a way to tokenize videos. When you give an LLM an input, it tokenizes the text and then predicts the next text token. The ideal product we wanted was a model that takes a number of video tokens as input and outputs video tokens that would be good recommendations. We had to build this because, even with a million tokens of context, when you want to reason over many videos you have to compress the video representation in some way. Before we settled on this approach, we tried a bunch of other things, like predicting search queries and retrieving videos through them, or trying to recommend videos directly, and those solutions were just not good enough. So we built semantic ID, which we wrote a paper about last year; it was presented at RecSys. The way semantic ID works is you take a video and extract a number of features from it: the title, description, transcript, even audio and video-frame-level data. You put all of that into a multi-dimensional embedding, and then you quantize it using RQ-VAE to give every video a sequence of tokens. We've written a pretty detailed paper about this if people are interested, but at a high level, the way I think about it is that we're making the atomic units of a new language of YouTube videos. Once we have these tokens, the whole corpus of billions of videos on YouTube gets organized around semantically meaningful tokens. You could imagine the first token representing topics like music, gaming, or sports; within sports you would have the different sports, and then you can get down to volleyball. So two volleyball videos would share some tokens in their prefix but also have a unique identifier. This in itself is an interesting milestone: moving away from hash-based tokenization to a semantically meaningful one. And we use this in production at YouTube.
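The quantization step can be sketched in a few lines. This is a minimal illustration, not the production system: a real RQ-VAE learns the codebooks jointly with an encoder, while here the codebooks and embeddings are random toy vectors.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Residual quantization: at each level, snap the current residual to
    the nearest codebook entry, emit that entry's index as a token, and
    pass the remaining residual on to the next (finer) level."""
    tokens = []
    residual = np.asarray(embedding, dtype=float).copy()
    for codebook in codebooks:               # codebook shape: (num_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))          # nearest code at this level
        tokens.append(idx)
        residual = residual - codebook[idx]  # carry the quantization error forward
    return tokens

# Toy setup: 3 levels, 4 codes per level, 2-d embeddings.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(4, 2)) for _ in range(3)]
video = rng.normal(size=2)
print(residual_quantize(video, codebooks))   # one coarse-to-fine index per level
```

Because nearby embeddings snap to the same coarse codes first, semantically similar videos tend to share a token prefix, which is what produces the music/gaming/sports hierarchy described above.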
What we then do is a process we call continued pre-training, where we take this model and have it understand both English and this new YouTube language. We do this in two big steps. The first is linking text and SID. The second is having it understand sequences of watches and reason across this video space. As an example of the training tasks we teach this model: you have a video, say a tennis highlights video, which has some semantic ID, and you can prompt the model with "this video has title" followed by the title, and the model learns to output the title. You can imagine very similar tasks for "has creator" or "has topics" and so on. You're basically trying to connect text and these video tokens.
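As a sketch, the linking tasks can be generated as simple (prompt, target) pairs. The field names, placeholder SID tokens, and prompt wording here are made up for illustration; they are not YouTube's actual training format.

```python
def make_linking_examples(video):
    """Build (prompt, target) pairs that tie a video's semantic ID tokens
    to its text metadata, so the model learns SID -> text associations.
    The SID is rendered as placeholder tokens like <sid_12>."""
    sid = " ".join(f"<sid_{t}>" for t in video["sid_tokens"])
    return [
        (f"Video {sid} has title:", video["title"]),
        (f"Video {sid} has creator:", video["creator"]),
        (f"Video {sid} has topics:", ", ".join(video["topics"])),
    ]

video = {
    "sid_tokens": [3, 1, 4, 1],          # hypothetical semantic ID
    "title": "Wimbledon 2024 Final Highlights",
    "creator": "Tennis Channel",
    "topics": ["tennis", "sports"],
}
for prompt, target in make_linking_examples(video):
    print(prompt, "->", target)
```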
Then we use the corpus of YouTube engagement data: all the paths users took through YouTube as they watched videos together. You can prompt the model with something like "a user has watched the following videos: A, B, C, D", mask some of those videos, and the model learns to predict the masked ones. Now it's starting to understand which videos are watched together, and to form relationships between videos on the basis of user engagement. After a bunch of pre-training tasks like this, we get a really interesting model that can reason across English and YouTube videos. Here's an example from a user's watch history. We find that the model can now reason across these videos. You could prompt it with something like: video one is interesting to tennis fans because it's about Wimbledon; video two is interesting to F1 fans because it's about the Spanish Grand Prix; video three is interesting to math fans because it's about pi. Then you prompt it that video four is going to be interesting too, and the model is able to work out that it's interesting to technology fans because it's about AI. And this is based purely on the semantic ID of the video; it doesn't have much other information to go on. So I think this checkpoint, which is starting to reason across English and YouTube, is very interesting in itself.
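The masked watch-sequence task above can be sketched like this. The masking rate, the `<mask>` token, and the prompt wording are illustrative assumptions, not the production format.

```python
import random

def make_masked_example(watch_history, mask_rate=0.25, seed=None):
    """From a sequence of watched video SIDs, mask a random subset and
    return (prompt, targets): the model must predict the hidden videos
    from the surrounding watches."""
    rng = random.Random(seed)
    masked_positions = [i for i in range(len(watch_history))
                        if rng.random() < mask_rate]
    if not masked_positions:                       # always mask at least one video
        masked_positions = [rng.randrange(len(watch_history))]
    visible = ["<mask>" if i in masked_positions else v
               for i, v in enumerate(watch_history)]
    prompt = "A user has watched: " + " ".join(visible)
    targets = [watch_history[i] for i in masked_positions]
    return prompt, targets

history = ["<sid_A>", "<sid_B>", "<sid_C>", "<sid_D>"]  # hypothetical SIDs
prompt, targets = make_masked_example(history, seed=7)
print(prompt)
print("targets:", targets)
```

Trained at scale, pairs like these are what let the model learn which videos are watched together, purely from engagement.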
Once we have this model, we think about how we can use it for different video recommendation tasks at YouTube. The first one we focused on is generative retrieval. Here you construct a prompt for every user and see what the model recommends. In this example, the user is a 24-year-old woman in the US on Android. She's watching a highlights video from the Olympics, and she has a watch history of, say, 50 videos she's watched in the past, along with how she engaged with them. You can construct a prompt like the one on the right, with the user's demographic information and the context video, and have the model decode some video recommendations as SIDs.
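A sketch of how such a prompt might be assembled. The template, field names, and placeholder SIDs are assumptions for illustration; the real system decodes SIDs with the Gemini-derived LRM, which isn't available here.

```python
def build_retrieval_prompt(user, context_video, watch_history):
    """Assemble a personalized generative-retrieval prompt that asks the
    model to decode recommended videos as semantic ID tokens."""
    history_lines = [
        f"  {w['sid']} (watched {w['watch_fraction']:.0%})"
        for w in watch_history
    ]
    return "\n".join([
        f"User: {user['age']}-year-old {user['gender']} in {user['country']} on {user['device']}",
        f"Currently watching: {context_video['sid']}",
        "Watch history:",
        *history_lines,
        "Recommend videos as SIDs:",
    ])

user = {"age": 24, "gender": "woman", "country": "US", "device": "Android"}
context = {"sid": "<sid_olympics_highlights>"}          # hypothetical SID
history = [{"sid": "<sid_w100m_final>", "watch_fraction": 0.9},
           {"sid": "<sid_gymnastics>", "watch_fraction": 0.6}]
print(build_retrieval_prompt(user, context, history))
# The model would then decode SID tokens, typically with decoding
# constrained to valid video token sequences.
```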
We find that this gives really interesting, unique recommendations, especially for our hardest recommendation tasks. In this example, when you're watching this highlight from the Olympics, the production system before LRM would give you other men's track races. With the new model, it's able to find the unique connection between the user's demographics and their past watch history and surface related women's races that we weren't able to recommend before. Especially for users we don't know as much about, we get very interesting and unique recommendations out of this strategy. So we've experimented with this and launched it in a few places at YouTube.
The big finding is that LRM is a very powerful model, but it's really expensive to serve. It learns very quickly and is very training-data efficient, and it handles our toughest recommendation tasks, but the biggest limitation was that the serving costs were too high, especially at the scale YouTube operates at, with billions of users. So after we got our first experiments working, we spent a lot of time just reducing the TPU serving cost, and we achieved over 95% cost savings to be able to actually launch this in production.
One other strategy we used, which I think is interesting, is that we tried to turn this into an offline problem: same prompt, same model, but we remove the personalized aspects of the prompt and build an offline recommendations table: if you're watching video A, what are the candidate videos that would be good to watch next? Normally these unpersonalized recommendation models don't hold a candle to a personalized recommender, but because LRM is trained from a really big checkpoint, it actually gives us some differentiated recommendations. In the YouTube context, we can take our corpus of billions of videos, look at the head that represents a lot of the watch time, do offline inference to build this offline recs table, and then serve some recommendations with a simple lookup. This was a complete way around our serving problems.
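The offline workaround can be sketched as a precomputed table keyed by the context video. The inference function and video IDs here are stand-ins for the real unpersonalized LRM call and corpus.

```python
def fake_lrm_inference(video_id):
    """Stand-in for an offline, unpersonalized LRM call that returns
    good watch-next candidates for one context video."""
    related = {
        "olympics_100m": ["olympics_200m", "olympics_relay"],
        "tennis_final": ["tennis_semifinal", "tennis_top10_points"],
    }
    return related.get(video_id, [])

def build_offline_recs_table(head_videos):
    """Run inference once per head video (the small slice of the corpus
    that drives most watch time) and store results for cheap lookup."""
    return {v: fake_lrm_inference(v) for v in head_videos}

# Offline: a batch job over the head of the corpus, paying the model cost once.
table = build_offline_recs_table(["olympics_100m", "tennis_final"])

# Online: serving is just a key-value lookup, with no model call at all.
print(table["olympics_100m"])  # ['olympics_200m', 'olympics_relay']
```

The trade-off is losing personalization at serving time in exchange for removing the LLM from the serving path entirely.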
I want to talk a bit about the challenges for YouTube. In some ways, making an LLM-based recommendation system is harder than training an LLM. One of the big differences is the vocabulary and the size of the corpus. For Gemini, if you're training an English LLM, your vocabulary is about 100,000 words in the Oxford dictionary, and they add about a thousand words every year. At YouTube, if you imagine the library of YouTube, it has billions of videos: we have 20 billion videos on YouTube, with millions added every day. And the freshness of videos is really important, much more so than for LLMs. Think about a new word added to the English dictionary: the word of the year for 2023 was "rizz". If Gemini doesn't know about "rizz", it can still answer 99% of the questions people have. Maybe it misses some jokes, maybe it misses some pop-culture references.
But in the world of YouTube, if Taylor Swift drops a new music video, you have to be able to recommend it within minutes or hours, otherwise a lot of users are going to be upset. So even within this large corpus, you have to very quickly understand which videos are important and start recommending them to the right users. That's why we have to continuously pre-train this LRM recommender on the order of days and hours, which is very different from classical LLM pre-training like Gemini's, which happens maybe once every 3 to 6 months. In that way it's a much harder problem. The last part is scale. We have great models in Gemini; Gemini Pro is incredible, but there's no way you can serve it to billions of daily active users. So for YouTube, we had to focus on the smaller, more efficient models like Flash, and on even smaller checkpoints than that, just to hit our latency and scale requirements.
So I want to summarize the journey we've been on at YouTube as an LLM-and-RecSys recipe that you can maybe adapt to your own application. There are three major steps. The first is to find a way to tokenize your content. Just like LLMs tokenize text, you want to distill some essence of your content into an atomic token. One way to do that, which we've done, is to build a rich representation from a bunch of features, put it into an embedding, and then find a way to tokenize or quantize it. The outcome is that you're making your own domain-specific language. The second step is to adapt the LLM: make links between English and your domain language, and find training tasks that help it reason across English and the new tokens you've built. The outcome of this step, in my mind, is a bilingual LLM that can speak English and natural language but can also speak your domain-specific language. Once you have that, you can do the third step: prompt it with user information. You construct personalized prompts from user demographics, user activity, and different actions, then train task-specific or surface-specific models, and you have a generative recommendation system on top of an LLM. That's a tweet-sized summary of maybe two years of work.
The last thing I want to talk about is where I see this going, and some possible future directions for LLMs and RecSys. The stage we're at right now is that LLMs are just augmenting recommendations. They bring these magical recommendation experiences and enhance quality, but they're largely invisible to users: your YouTube feed just got better, but you don't really know whether a Gemini inference happened or not. That's why I think applying LLMs to RecSys is very underhyped: users don't directly know what's happening. I think we're close to a world, and we're experimenting with this, where, if you have a bilingual LLM across English and recommendations like we talked about, users can talk to it in natural language. You're going to start to see experiences where users can steer recommendations toward their own goals, the recommender can explain why a candidate was recommended, and users can align it toward goals expressed in natural language. I also think the lines between search and recommendation start to blur in this world. And maybe a hint of the future: I think recommendation and generative content are going to come together. We're going to recommend personalized versions of a piece of content, and eventually, instead of recommending content, we may even start creating it, getting to really interesting N-of-1 content generated for the user. I think we're a bit away from that, but it's going to come sooner than you expect with all the advances happening in AI. So yeah, thank you. I'll take any questions.
Thank you, Devansh. We have time for a few questions.
Hi, great talk. One question on how you generally balance learning the semantic ID embeddings within the model against keeping the general language capability from being damaged, since the tokenized user history is effectively a second language, very different from English. Any high-level takeaway you can share?
That's a super interesting question. We've struggled with this a lot. In some of our early applications, we mostly cared about recommendation quality, in which case we over-indexed on speaking the semantic ID language. As you over-train on more and more of those examples, the model actually forgets how to speak English; maybe it's reasoning in some intermediate layers that finally end up in the semantic ID language. We're trying a bunch of things: for example, with mixture-of-experts, maybe we can have a few experts that retain the text capability while other experts focus on the semantic ID capability. It's a balance, and I think we're going to shift more toward text as we try to build these interactive experiences where text input from the user becomes more important.
Thank you.
During this process, did you learn any good lessons about cold-starting embeddings on these domain-specific tokens?
Yeah. One thing is that the semantic ID training process is entirely unsupervised; it's making its own quantization of the video corpus. When you sample to see what the model is doing, we find it's learning concepts like sports versus movies and entertainment, but we didn't actually try to teach that explicitly, which I think is very interesting. The second aspect is that, because of semantic ID, we can warm-start into a semantically meaningful space. What we find is that performance for videos uploaded in the last day or the last week gets much better, because we're better at understanding this fresh and tail content.
Got it. Thank you.
Hey, quick question. When you said you extract frames as part of making the semantic ID, are you just running the video at, say, 3 to 30 fps, making a grid of the frames, running SigLIP, and inserting that?
We're just trying to sample video frames. We've tried a few different approaches, like sampling from key moments in the video. We actually have the engagement data; as you may have seen in the YouTube player, it can highlight the places where people engaged the most, so we try to sample from there. Given our scale, we can't sample a lot of video frames, so we try to select them intelligently, but we do use video frames, and over time I think we'll use more.
With this way of selecting frames, are you able to pick up important things based on small objects in a video, say a person in the distance who is the focus of the video?
Hard to say, because in the end all of this video information gets compressed into eight tokens. It's probably learning something, but it's hard to know exactly what it picked up from a given video frame. So it's unclear.
Thank you.
Yeah.
Yeah, it was a pretty good talk. I have a question regarding pre-training. Did you also feed in user queries and what users watched as pre-training data? If yes, did you also use semantic IDs for users in pre-training, or are semantic IDs only for the videos?
Yeah, in this case we have only tokenized videos, and we focused more on sequences of watches rather than on pairs of a search query and the watch that originated from it. You could imagine parallel work where you try to tokenize users and build some kind of user token that represents, say, the last 500 watches they've had. We've experimented with some things there; I think it's less far along, but it's a very interesting research direction.
So the pre-training was done on top of an existing Gemini pre-trained model, right?
Yeah, we basically take a Gemini checkpoint and then adapt it for this YouTube purpose to get this YouTube-and-Gemini LRM checkpoint.
Okay. One last thing: it would be cool to see the semantic IDs of videos fed to Veo 3, you know.
Yeah.
Hey, I'm kind of curious how much improvement you see compared to non-LLM, more traditional recommendation systems, and when we should use a traditional one versus an LLM-based recommendation system.
Yeah, I can't really share metrics; I can share everything except code and metrics. So we've given you as much as we can of the conceptual steps of what we did. Maybe what I'll say is that I think it's been the biggest improvement to recommendation quality we've seen in the last few years, so I do think it's quite significant.