AI Engineer World’s Fair 2025 - LLM Recommendation Systems (RecSys)

Channel: aiDotEngineer

Published at: 2025-06-04

YouTube video id: 3k4a0PemMu4

Source: https://www.youtube.com/watch?v=3k4a0PemMu4

Hi everyone. Thank you for joining us today for the inaugural RecSys track at the AI Engineer World's Fair. What I want to share today is what the future might look like when we try to merge recommendation systems and language models. My wife looked at my slides
and she's like they're so plain. So
therefore I'll be giving a talk together
with Latte and Mochi. You might have
seen Mochi wandering the halls around
somewhere but there'll be a lot of
doggos throughout the slides. I hope you
enjoy.

First, language modeling techniques are not new in recommendation systems. It started back in 2013, when we learned item embeddings from co-occurrences in user interaction sequences. After that we started using GRU4Rec; who here remembers recurrent neural networks and gated recurrent units? Those were very short-term: we predicted the next item from a short sequence. Then, of course, transformers and attention came along, and we got better at attending over long-range dependencies. So that's where we started asking: can we just process everything in the user sequence, hundreds to 2,000 item IDs long, and learn from that? And today, in
this track I wanted to share with you
about three ideas that I think are worth
thinking about: semantic IDs, data
augmentation and unified models. So the
first challenge we have is hash-based item IDs. Who here works on recommendation systems? You probably know that hash-based item IDs don't encode the content of the item itself. The problem is that every time you have a new item, you suffer from the cold-start problem: you have to relearn everything about this item all over again. And then there's sparsity, whereby you have a long tail of items that have maybe one or two interactions, or even up to 10, which is just not enough to learn from. So recommendation systems have this issue of being very popularity-biased, and they struggle with cold-start and sparsity. So, the solution is semantic
ids that may even involve multimodal
content. So, here's an example of
trainable multimodal semantic IDs from
Kuaishou. Kuaishou is kind of like TikTok or Xiaohongshu, a short-video platform in China; I think it's the number two short-video platform there. You might have used their text-to-video model, Kling, which they released sometime last year. The problem they had, being a short-video platform, is that users upload hundreds of millions of short videos every day, and it's really hard to learn from these fresh short videos. So how can we combine static content embeddings with dynamic user behavior? Here's how they did it, with trainable multimodal semantic IDs. So
I'm going to go through each step here.
So this is the Kuaishou model. It's a standard two-tower network. On the left is the embedding layer for the user, which is a standard sequence of IDs plus the user ID, and on the right is the embedding layer for the item IDs. These are fairly standard. What's new here is that they now take in content input. (All of these slides will be available online, so don't worry; I'll make them available right after this.) To encode visuals they use ResNet, to encode video descriptions they use BERT, and to encode audio they use VGGish.

Now, here's the trick: when you have these pre-trained encoder models, it's very hard to backpropagate through them and update the encoder embeddings. So what did they do? First, they took all these content embeddings and just concatenated them together. I know it sounds crazy, but they just concatenated them. Then they learned cluster IDs. I think they shared in the paper that they had around 100 million short videos, and they learned about a thousand cluster IDs via k-means clustering. That's what you see in the model encoder, in the boxes at the bottom: the cluster IDs. Above the cluster IDs you have the non-trainable content embeddings; below them you have the trainable cluster IDs, which are each mapped to their own embedding table.
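As a rough sketch of this construction (not Kuaishou's actual code; the encoder choices, sizes, and cluster count here are just illustrative stand-ins for what the paper describes), the pipeline looks roughly like this:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Pretend we have precomputed, frozen content embeddings per video from
# off-the-shelf encoders (ResNet for frames, BERT for descriptions, VGGish
# for audio). Sizes are made up; the paper works with ~100M videos.
n_videos = 5_000
visual = np.random.randn(n_videos, 512).astype("float32")
text = np.random.randn(n_videos, 768).astype("float32")
audio = np.random.randn(n_videos, 128).astype("float32")

# Step 1: simply concatenate the frozen content embeddings.
content = np.concatenate([visual, text, audio], axis=1)

# Step 2: learn cluster IDs over the concatenated content space
# (the talk mentions on the order of 1,000 k-means clusters).
n_clusters = 100
cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(content)

# Step 3: the cluster IDs index a trainable embedding table inside the
# two-tower model, so gradients flow into these "semantic ID" embeddings
# (the behavioral space) even though the content encoders stay frozen.
semantic_id_table = nn.Embedding(num_embeddings=n_clusters, embedding_dim=64)

video_batch = torch.as_tensor(cluster_ids[:32], dtype=torch.long)
semantic_id_emb = semantic_id_table(video_batch)  # shape (32, 64), trainable
print(semantic_id_emb.shape)
```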
So the key here is this: as you train the model, the model encoder learns to map the content space, via the cluster IDs and their embedding table, into the behavioral space. And the outcome is this: these semantic IDs not only outperform regular hash-based IDs on clicks and likes, which is pretty standard, but they were also able to increase cold-start coverage (of every 100 videos shown, how many are new) by 3.6%, and increase cold-start velocity (how many new videos hit some threshold of views; they didn't share what the threshold was). Being able to increase cold-start coverage and cold-start velocity by these numbers is pretty outstanding.
So, long story short, the benefits of semantic IDs: you can address cold-start with the semantic ID itself, and now your recommendations understand content. Later in the track we're going to see some amazing sharing from Pinterest and YouTube. In the YouTube one, you'll see how they blend language models with semantic IDs, whereby the model can actually explain why you might like an item: because it understands the semantic ID, it can give human-readable explanations, and vice versa.

Now, the next question, and I'm sure everyone here has this challenge. The lifeblood of machine
learning is data: good-quality data at scale. This is essential for recommendation systems and even more so for search. You need a lot of metadata: query expansion, synonyms, spell checking, all sorts of metadata to attach to your search index. But this is very costly and effortful to get. In the past we did it with human annotation, or maybe tried to do it automatically, but LLMs have turned out to be outstanding at this, and I'm sure everyone here is doing this to some extent: using LLMs for synthetic data and labels. I want to share two examples, from Spotify and Indeed.
Now, the Indeed paper: I really like it a lot. The problem they were facing is that they were sending job recommendations to users via email, but some of these recommendations were bad; they just weren't a good fit for the user. So users had a poor experience and lost trust in the job recommendations. And the way they would indicate they'd lost trust was: these job recommendations aren't a good fit for me, I'm just going to unsubscribe. The moment a user unsubscribes from your feed or your newsletter, it's very, very hard to get them back. Almost impossible. Now, while they had explicit negative feedback (thumbs up and thumbs down), it was very sparse; how often do you actually give thumbs-down feedback? And implicit feedback is often imprecise. What do I mean? If you get some recommendations but you don't act on them, is it because you didn't like them, or because it's not the right time, or maybe because your wife works there and you don't want to work at the same company as your wife? So the solution they went with was a lightweight classifier to filter out bad recommendations. And I'll tell you why I really
like this paper from Indeed: they didn't just share their successes, they shared the entire process of how they got there, and it was fraught with challenges. The first thing that made me really like it was that they started with evals. They had their experts label job recommendation and user pairs; from the user you have their resume data and their activity data, and the experts assessed whether the recommendation was a good fit.

Then they prompted open LLMs, Mistral and Llama 2. Unfortunately, the performance was very poor: these models couldn't really pay attention to what was in the resume and what was in the job description, even though they had sufficient context length, and the output was just very generic. So to get it to work they prompted GPT-4, and GPT-4 worked really well; specifically, GPT-4 had about 90% precision and recall. However, it was very costly (they didn't share the actual cost) and too slow: 22 seconds. Okay, if GPT-4 is too slow, what can we do?
Let's try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision. What does that mean? Of the recommendations it said were bad, only 63% were actually bad, which means they would be throwing out 37% of recommendations, roughly one-third, that were actually good. For a company that runs on recommendations, where people find jobs through your recommendations, throwing out a third of the good ones is unacceptable; precision was the key guardrail metric here. So what they did next was fine-tune GPT-3.5. You can see the entire journey, right? Open models, GPT-4, GPT-3.5, and now fine-tuning GPT-3.5. The fine-tuned GPT-3.5 got the precision they wanted, 0.83, at about a quarter of GPT-4's cost and latency. But unfortunately it was still too slow: about 6.7 seconds, which would not work in an online filtering system.
So what they did was distill a lightweight classifier on the fine-tuned GPT-3.5's labels, and this lightweight classifier was able to achieve very high performance, specifically 0.86 AUC-ROC. The numbers may not mean much to you, but suffice to say that in an industrial setting this is pretty good. They didn't mention the latency, but it was good enough for real-time filtering, I think under 200 milliseconds or so. The outcome of this was
that they were able to cut out bad recommendations by about 20%. Initially they had hypothesized that by cutting down the number of recommendations, even bad ones, they would get a lower application rate. It's like sending out links: even clickbait links that are bad, people still click on them. So they thought that cutting recommendations, even bad ones, would lower the application rate.
But this was not the case. In fact, because the recommendations were now better, the application rate actually went up by 4%, and the unsubscribe rate went down by 5%. That's quite a lot.
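As a minimal sketch of the overall pattern (label pairs offline with the fine-tuned LLM, then distill those labels into a classifier cheap enough for online filtering), with fake features and labels standing in for Indeed's actual data and models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Offline: the fine-tuned LLM labels (job, user) pairs as good or bad fits.
# Here the teacher labels are simulated; in practice they come from LLM calls.
rng = np.random.default_rng(0)
features = rng.normal(size=(50_000, 32))  # e.g. resume/job match features
teacher_labels = (features[:, 0] + 0.5 * features[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    features, teacher_labels, test_size=0.2, random_state=0
)

# Distill: train a lightweight student on the LLM's labels. A model this
# small scores a pair in well under a millisecond, which is what makes
# online filtering within a ~200 ms budget feasible.
student = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_scores = student.predict_proba(X_val)[:, 1]
print("AUC-ROC:", roc_auc_score(y_val, val_scores))

def filter_bad_recs(candidate_features, threshold=0.5):
    """Online: drop recommendations the student flags as bad before emailing."""
    keep = student.predict_proba(candidate_features)[:, 1] >= threshold
    return candidate_features[keep]
```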
So essentially, what this means is that
in recommendations, quantity is not
everything. Quality makes a big
difference. And quality here moved the
needle quite a bit by
5%. Another example I want to share with
you is Spotify. So who here knows that
Spotify has podcasts and audiobooks? Oh, okay, I guess you guys are not the target audience for this use case. Spotify is really known for songs and artists; a lot of their users just search for songs and artists, and they're very good at that. But when they started introducing podcasts and audiobooks, how would you help your users discover that these new items are available? And of course there's a huge cold-start problem: not just cold-start on items, but cold-start on an entire category. How do you grow a new category within your service? Exploratory search was essential to the business, for expanding beyond music: Spotify doesn't want to do just songs; they're now doing audio more broadly. So the solution to that is
a query recommendation
system. So first, how did they generate new queries? Well, they had a bunch of ideas: extract them from catalog titles and playlist titles, mine them from the search logs, or just take an artist name and append "cover" to it. That's what they got from existing data. Now you might be wondering, where's the LLM in this? Well, the LLM is used to generate natural-language queries. This might not be sexy, but it works really well: take whatever you have from conventional techniques that already work, and use the LLM to augment it where you need it. Don't use the LLM for everything at the start.
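A toy sketch of that division of labor, where conventional sources produce most candidate queries and an LLM is only used for the natural-language ones; the generate_with_llm function is a hypothetical placeholder, not Spotify's API:

```python
def candidate_queries(catalog_titles, playlist_titles, search_logs, artists):
    """Generate exploratory query candidates from conventional sources."""
    queries = set()
    queries.update(t.lower() for t in catalog_titles)      # catalog titles
    queries.update(t.lower() for t in playlist_titles)     # playlist titles
    queries.update(q.lower() for q in search_logs)         # mined from logs
    queries.update(f"{a.lower()} cover" for a in artists)  # artist + "cover"
    return queries

def generate_with_llm(seed_topics):
    """Hypothetical placeholder: ask an LLM for natural-language queries
    (e.g. "relaxing podcasts for a rainy evening"). In practice this would
    call whatever LLM endpoint you use."""
    return {f"podcasts about {topic.lower()}" for topic in seed_topics}

conventional = candidate_queries(
    catalog_titles=["Sleep Sounds", "True Crime Daily"],
    playlist_titles=["Morning Motivation"],
    search_logs=["jazz for studying"],
    artists=["Radiohead"],
)
# The LLM augments the pool; it does not replace the cheap sources.
all_candidates = conventional | generate_with_llm(["history", "cooking"])
print(sorted(all_candidates))
```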
So now they have these exploratory queries. When you search for something, you still get the immediate results, and then on top of those they rank these new queries. This is the UX you'll probably get when you do a search now (I got this from the paper; it may have changed recently): you still see the item results at the bottom, but at the top you get the query recommendations. This is how Spotify informs users, without a banner, that they now have audiobooks and podcasts: you search for something, and it shows you that these new categories exist. The benefit: a 9% increase in
exploratory queries. Essentially, one-tenth of their users were now exploring their new products. Imagine that: one-tenth, every day, exploring your new product category. How quickly would you be able to grow it? It compounds like 1.1^n; you'll grow pretty fast. Long story short, I don't have to tell you about the benefits of LLM-augmented synthetic data: richer, higher-quality data at scale, even on tail queries and tail items, at far lower cost and effort than is possible with human annotation. Later we also have a talk from Instacart, who will tell us how they use LLMs to improve their search system.
Now, the last thing I want to share is this challenge: in a typical company right now, the systems for ads, for recommendations, and for search are all separate. And even within recommendations, the model for homepage recommendations, the model for item recommendations, the model for add-to-cart recommendations, and the model for the thank-you-page recommendations may all be different models. You can imagine you're going to have many, many models, while leadership expects you to keep the same amount of headcount. So how do you get around this? You have duplicative engineering pipelines, there's a lot of maintenance cost, and improving one model doesn't naturally transfer to improvements in another model. The solution for this is unified models. It works for vision, it works for language, so why not recommendation systems? And we've been doing this for a while; this is not new. As an aside (maybe the text is too small), this is a tweet from Stripe, where they built a transformer-based payments fraud model. Even for payments, with a sequence of payments you can build a foundation model that is transformer-based.
So I want to share the example of the unified ranker for search and recsys at Netflix. The problem, as I mentioned: they had teams building bespoke models for search, for similar-item (similar-video) recommendations, and for pre-query recommendations, the ones on the search page before you've entered a search query. High operational cost, and missed opportunities from not learning across tasks. Their solution is a unified ranker, which they call the unified contextual ranker, or UniCoRn. You can see over here: at the bottom there's the user foundation model, into which you put the user's watch history, and there's the context and relevance model, into which you put the context of the videos and what the user has watched.

Now, the thing about this unified model is that it takes a unified input. So if you're able to find a data schema that all your use cases and all your features can share, you can adopt an approach like this, which is similar to multi-task learning. The input is the user ID, the item ID (the video, drama, or series), the search query if one exists, the country, and the task. Of course they have many different tasks; in the paper they have three: search, pre-query, and more-like-this. Now,
what they did next was a very smart imputation of missing fields. For example, if you're doing item-to-item recommendation (you've just finished watching a video and want to recommend the next one), you have no search query. How do you impute it? You simply use the title of the current item and look for similar items. The outcome is that this unified model was able to match or exceed the metrics of their specialized models on multiple tasks.
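A minimal sketch of what such a unified, task-aware input could look like, with the missing-query imputation described above; the field names are illustrative, not Netflix's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedExample:
    user_id: str
    item_id: str                  # the video / series being scored
    search_query: Optional[str]   # may be absent for non-search tasks
    country: str
    task: str                     # e.g. "search", "pre_query", "more_like_this"

def build_example(user_id, item_id, country, task,
                  search_query=None, current_item_title=None):
    # For item-to-item tasks there is no real search query, so impute it
    # from the title of the item the user just watched.
    if search_query is None and current_item_title is not None:
        search_query = current_item_title
    return UnifiedExample(user_id, item_id, search_query, country, task)

# Search task: the query comes from the user.
ex1 = build_example("u1", "show_42", "US", "search", search_query="korean drama")
# More-like-this task: no query, so the current title stands in for it.
ex2 = build_example("u1", "show_99", "US", "more_like_this",
                    current_item_title="Stranger Things")
print(ex1)
print(ex2)
```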
Think about it. It may not seem very impressive, "match or exceed"; it might seem we did all this work just to match. But imagine unifying all of it: removing the tech debt and building a better foundation for your future iterations. It's going to make you iterate faster. The last example I want to share
with you is unified embeddings at Etsy.
You might think embeddings are not very sexy, but this paper from Etsy is really outstanding in what they share, in terms of both model architecture and their system. The problem they had was: how can we help users get better results from very specific queries as well as very broad ones? Etsy's inventory is constantly changing; they don't have the same products all the time, it's very homegrown. So you might be querying for something like "mother's day gift", which would match very few items directly: very few items have "mother's day gift" in their description or title. The other problem is that lexical, keyword-based retrieval doesn't account for user preferences. So how do you address this? With unified embedding and
retrieval. If you remember, at the start of my presentation I talked about the Kuaishou two-tower model: there's a user tower and an item tower. We see the same pattern here. You see the product tower; this is the product encoder. The way they encode a product is with T5 models for text embeddings (from the item descriptions) and a query-product log for query embeddings: what query was made, and what product was eventually clicked or purchased. Then on the left you see the query encoder, the search query encoder. The two towers share encoders for the tokens (the actual text tokens), for the product category (which is a token of its own), and for the user location, which means your embedding can now match a user to the location of the product itself. And then, to personalize this, they encode the user's preferences via the query and user-scale features at the bottom: what queries the user searched for, what they bought previously, all their preferences.

They also shared their system architecture. Over here are the product encoder and the query encoder from the previous slide. What's very interesting is that they added a quality vector, because they wanted to ensure that whatever was retrieved was actually of good quality in terms of ratings, freshness, and conversion rate. And what they did is simply concatenate this quality vector onto the product embedding vector.
But when you do that, you have to expand the query vector by the same number of dimensions so that you can still take a dot product or cosine similarity. So essentially, they just appended a constant vector to the query embedding, and it just works.

The result: a 2.6% increase in conversion across the entire site, which is quite remarkable, and a more than 5% increase in search purchases; if you search for something, the purchase rate increases by 5%. These are very, very good results for e-commerce.
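A minimal sketch of that trick, with made-up dimensions: concatenate the quality vector onto the product embedding, pad the query embedding with a constant vector of the same width, and the similarity computation still goes through:

```python
import torch
import torch.nn.functional as F

d_embed, d_quality = 256, 4

product_emb = torch.randn(1000, d_embed)   # from the product tower
quality = torch.rand(1000, d_quality)      # e.g. ratings, freshness, conversion
query_emb = torch.randn(1, d_embed)        # from the query tower

# Product side: append the quality vector to the product embedding.
product_full = torch.cat([product_emb, quality], dim=1)        # (1000, 260)

# Query side: pad with a constant vector of the same width so the dot
# product / cosine similarity is still defined; the constant acts as a
# fixed weight on the quality dimensions for every query.
const_pad = torch.ones(1, d_quality)
query_full = torch.cat([query_emb, const_pad], dim=1)          # (1, 260)

scores = F.cosine_similarity(query_full, product_full, dim=1)  # (1000,)
print(scores.topk(10).indices)
```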
So, the benefits of unified models: you simplify the system, and whatever you build to improve one part of the unified model also improves the other use cases that share it. That said, there is also the alignment tax: when you try to compress all 12 use cases into a single unified model, you may find you need to split it into two or three separate unified models, because of the alignment tax, where getting better at one task actually makes another task worse. We have a talk from LinkedIn this afternoon, the last talk of the block, and we also have a talk from Netflix, who will share about their unified model at the start of the next block. All right, the three
takeaways I have for you, think about
it, consider it: semantic IDs, data augmentation, and unified models. And of course, do stay tuned for the rest of the talks in this track. Okay, that's it.
Thank you.
I maybe have time for one question while we have our speakers from Pinterest come up and join us. Oh, do you mind speaking into the mic, please?
I read the very long paper that you wrote on recommendation systems and what's available today, but you didn't mention the generative recommenders or HSTU work from Meta's paper, and I was just curious why you left that out. I didn't deliberately leave it out. There were so many papers that I just didn't have time, and I time-boxed myself. I was like, "Okay, Eugene, you'll be done with this in two weeks," and then two weeks was up. That's all I had, so ship it. That's all. Yes, another question.
I feel like I've read anecdotal blog posts saying that part of the decline in Google search quality is a move away from explicit ranking factors and an easily auditable ranking algorithm toward something more black-box, using more of these techniques. I was curious whether you have an opinion on whether that seems likely to be the case, or whether that's just noise and not actually influencing the quality of the search
results. Yeah, that's a good question. Unfortunately, I don't have any insider information on why that might happen. I think we do have some Google folks here; maybe you can ask them. But honestly, I haven't noticed or experienced that degradation myself.
Thank you everyone for your patience.
Next we
have machine learning engineers from Pinterest. They'll be sharing with us how they integrated LLMs to enhance relevance scoring at Pinterest: how they combine search queries with multimodal context, and this multimodal context includes visual captions, link-based text, and user-curated boards. Over to you.
Thanks for the introduction. Yeah. Hi
everyone. Um thanks for joining the talk
today. Um we're super excited to be here
and share some of the learnings we have from integrating LLMs into Pinterest search. My name is Han, and today I'll be presenting together with my colleague; we are both machine learning engineers on the search relevance team at Pinterest. To start, a brief
introduction to Pinterest. Pinterest is a visual discovery platform where Pinners can come to find inspiration to create a life they love. There are three main discovery surfaces on Pinterest: the home feed, related Pins, and search. In today's talk we'll be focusing on search, where the user can type in a query and find useful, inspiring content based on their information need, and we'll share how we leverage LLMs to improve search relevance.

Here are some key statistics for Pinterest search. Every month we handle over six billion searches, with billions of Pins to search from, covering topics from recipes to home decor, travel, fashion, and beyond. Pinterest search is truly global and multilingual: we support over 45 languages and reach users in more than 100 countries. These numbers highlight the importance of search at Pinterest and why we are investing in search relevance to improve the user experience.
This is an overview of how Pinterest search works on the back end. It's similar to many recommendation systems in industry: it has query understanding, retrieval, re-ranking, and blending stages, and finally produces a relevant and engaging search feed. In today's talk we'll focus on the semantic relevance modeling that happens at the re-ranking stage, and share how we use LLMs to improve relevance on the search feed.

Okay, so here's our search relevance model, which is essentially a classification model: given a search query and a Pin, the model predicts how relevant the Pin is to the search query. To measure this, we use a five-point scale ranging from most relevant to most irrelevant. All right, now we are
going to share some key learnings from using LLMs to improve Pinterest search relevance, and here are four main takeaways we'd like to go into in more detail.

Lesson one: LLMs are good at relevance prediction. Before I present the results, let me first give a quick overview of the model architecture we're using. We concatenate the query and the Pin text together and pass them into an LLM to get an embedding. This is called a cross-encoder structure, where we can better capture the interaction between the query and the Pin. We then feed the embedding from the LLM into an MLP layer to produce a five-dimensional vector, which corresponds to the five relevance levels. During training, we fine-tune open-source LLMs on Pinterest internal data to better adapt the model to Pinterest content.
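A minimal sketch of a cross-encoder set up this way: query and Pin text concatenated into one input, a language-model backbone producing an embedding, and an MLP head over it that outputs five relevance levels. The base model and dimensions here are placeholders, not Pinterest's production setup:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CrossEncoderRelevance(nn.Module):
    def __init__(self, base_model="bert-base-multilingual-cased", num_levels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size
        # MLP head mapping the pooled embedding to the 5 relevance levels.
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, num_levels))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # first-token embedding
        return self.head(pooled)               # logits over the 5 levels

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = CrossEncoderRelevance()

# Cross-encoder input: query and Pin text go through the model together,
# so attention can run across both of them.
batch = tokenizer(["mid century modern coffee table"],
                  ["Walnut coffee table, 1960s style, saved to board: living room"],
                  padding=True, truncation=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.softmax(dim=-1))   # distribution over the five-point scale
```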
Here I'd like to share some results to demonstrate the usefulness of LLMs. As a baseline we use SearchSage, Pinterest's in-house content and query embedding. If you look at the table, you can see that the LLM substantially improves relevance prediction performance, and as we use more advanced LLMs and increase the model size, the performance keeps improving. For example, the 8-billion-parameter Llama 3 model gives a 12% improvement over the multilingual BERT-based model and a 20% improvement over the SearchSage embedding model. So the lesson here is that LLMs are quite good at relevance prediction.
All right. Lesson two: vision-language-model-generated captions and user actions can be quite useful content annotations. To use an LLM for relevance prediction, we need a text representation of each Pin, and here I listed several features we use, for example the user-curated boards that the Pin has been saved to, or the queries that led to engagement with the Pin.

...less than a second, something like 400 to 500 millisecond latency at best.
There are three levers we can pull to make the model more efficient, improve throughput, and reduce latency for these models: sparsification, going to smaller models, and quantization. As Moz explained before, smaller models definitely have better throughput, but our recipe is that we need to go big and then go small. If you go with the smaller model from the start, it doesn't have enough capacity or reasoning power to solve the complicated task we have. So we go with a larger model, starting from a 150-billion-parameter model, and then we start distilling it into smaller models. One of the recipes here is to do the distillation step by step: for example, go to an 8B model, then a 3B model, then a 1B model. We slowly decrease the size of the model and distill over and over from the previous model. That recipe turns out to be much more effective than directly going from the 150-billion-parameter model to a 1B-parameter model.
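A schematic sketch of that go-big-then-go-small recipe, where each stage's student becomes the next stage's teacher; the toy model sizes and the KL-based distillation loss are generic illustrations, not the team's actual training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(hidden, layers, vocab=1000):
    """Stand-in for a transformer of a given size."""
    blocks = []
    for _ in range(layers):
        blocks += [nn.Linear(hidden, hidden), nn.ReLU()]
    return nn.Sequential(nn.Embedding(vocab, hidden), *blocks,
                         nn.Linear(hidden, vocab))

def distill(teacher, student, batches, temperature=2.0, lr=1e-3):
    """One distillation stage: the student matches the teacher's soft targets."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    teacher.eval()
    for tokens in batches:
        with torch.no_grad():
            soft = F.softmax(teacher(tokens) / temperature, dim=-1)
        loss = F.kl_div(F.log_softmax(student(tokens) / temperature, dim=-1),
                        soft, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# Gradually shrink, stage by stage, instead of jumping straight from the
# largest model to the smallest one (the talk's 150B -> 8B -> 3B -> 1B idea).
sizes = [(1024, 8), (512, 6), (256, 4), (128, 2)]   # toy stand-ins
batches = [torch.randint(0, 1000, (32, 16)) for _ in range(10)]

teacher = make_model(*sizes[0])
for hidden, layers in sizes[1:]:
    teacher = distill(teacher, make_model(hidden, layers), batches)
```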
The same goes for pruning. Pruning is a mathematical optimization problem: you want to reduce the number of heads in the transformer, or reduce the number of MLPs. Overall, these transformer models have proven to be very redundant in terms of how they keep information, so we can start pruning and removing some of these layers, or reduce the precision of the activations and parameters. However, again, if you prune very aggressively at the beginning, your performance will suffer significantly. So the recipe here is also gradual pruning: we apply a small amount of pruning to the model, distill into the smaller model, and do it over and over again. More pruning, more distillation, more pruning, more distillation. As you can see from this plot, gradual pruning can be as effective as having essentially no information loss, whereas if you just do aggressive pruning at the beginning, you can see up to a 1% reduction in model quality. Another lever is
quantization, going to lower precision. We are leveraging FP8 for activations and model parameters. However, doing FP8 in all the layers hurts the quality of the model significantly, so the tool here is mixed precision. One important aspect when it comes to ranking, recommendations, and prediction tasks in general is that you want the model's prediction, the probability output, to have very good numerical precision. So the LM head at the end of the language model has to be in FP32. If you do it in FP16, BF16, or FP8, the numbers collapse, you don't have good calibration on top of them, and you cannot distinguish between the different recommended items. The last lever is
sparsification. The most expensive part of the transformer is the attention scores, and we can leverage sparsification there: not every item needs to attend to every other item. When you know your task, when you know it's recommendation and you know which items are in the user's history, you can sparsify so that not every item attends to every other item. The same goes for the items being recommended: instead of scoring one item at a time, you can score 50 or 500 items at the same time, but you want to make sure those items are not attending to each other. So you sparsify the attention scores for the output and for the
the
query. If you put everything together uh
we can we can see that basically this we
can have a significant reduction in the
latency. What we have done is that in
the in four or five of our release uh uh
one release after the other we were able
to reduce the latency by 7x and at the
same time increasing the throughput
which is basically the number of queries
that we can handle by one GPU by 30x. So
we are improving basically the amount of
the work work that the GPU is doing at
the same time we are reducing the
latency that each query is sync.
These are some of the technical reports and papers we published during this journey, to share our lessons learned with the community. And that's the end of our talk; we have some time to answer questions. Thank you.
Thank you. Great talk. One question, how
did you measure that it doesn't lose
generalization power? Obviously, you've
done a lot of fine-tuning. Uh, and you
mentioned it works for four or five
tasks instead of task specific models.
How do you know it's going to work for
the next five tasks? That's a good
question. The answer, overall, is having a very comprehensive benchmarking set. We have something like 50 to 60 benchmarks; some are internal, some are external. For example, we leverage IFEval to make sure the model still follows instructions well. And as Maz mentioned, some of the tasks were never part of our training data, and that's how we measure generalization to new domains and use cases.
Thanks for the talk. I'm wondering what a small listing website can use out of the box. Have you heard of NLWeb, which was launched recently by Microsoft? If yes, what are your views on it as a recommendation system? No, I haven't actually heard of it; sorry about that. Anything for smaller players? Say a real estate listing website with thousands of listings: what are the out-of-the-box recommendation models people can start using? I mean, I wish such a model existed. I don't really know of one; that's why we started this work. We're trying to see if we can actually make it a foundation model so that it can solve those kinds of problems. I think there's a lot of potential for this to serve a lot of use cases beyond the bigger companies, but I don't know of any off the shelf. I think you should check out NLWeb. Okay, thanks.
Uh thank you for the great talk. Um uh
on the slide where you mentioned you
have multi-item scoring. I'm curious what that effectively means. Does it mean you need to do multi-step decoding, or is it a single step, just processing the logits for multiple items? It's a single step. We didn't want to go to the complication of speculative decoding or the decoding side at all; we wanted everything to happen at prefill. So what we did is that all the items, the recommended items or potential candidates, are sequenced together, but we also wanted to prevent them from attending to each other. So we leverage what we call a 4D attention mask, and we developed a special kernel in SGLang and vLLM to be able to do that. Now, when you have up to 500 items in your query segment, those items don't attend to each other; they only attend to the historical user activity and user profile information. Okay.
Thank you.
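A rough sketch of the kind of mask being described, where history tokens attend causally among themselves, every candidate item attends to the history, and candidates never attend to each other; this is a generic illustration, not the custom SGLang/vLLM kernel itself:

```python
import torch

def history_plus_candidates_mask(n_history, n_candidates):
    """Boolean attention mask (True = may attend) over the sequence
    [history tokens ... | candidate items ...]."""
    total = n_history + n_candidates
    mask = torch.zeros(total, total, dtype=torch.bool)
    # History attends causally to itself.
    mask[:n_history, :n_history] = torch.tril(torch.ones(n_history, n_history)).bool()
    # Each candidate attends to the full history and to itself only,
    # never to the other candidates.
    mask[n_history:, :n_history] = True
    mask[n_history:, n_history:] = torch.eye(n_candidates).bool()
    return mask

mask = history_plus_candidates_mask(n_history=6, n_candidates=4)
print(mask.int())
# With a mask like this, hundreds of candidate items can be scored in a
# single prefill pass, since no decoding steps are needed.
```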
Hey, great talk. Uh, so a user history
means many things, right? So like there
is all of the jobs that they've applied
to or and the job postings there are so
many entities and so on. The the context
of the model can get quite large. Uh how
did you manage that? Did you compress it
or were there parts that you focused on?
We actually experimented with a lot of things. We experimented with a RAG-style system, so that when you have a query we try to figure out the closest items in your history and bring those up. We also experimented with chronological ordering, and some sort of weight decay over the chronological order. It turns out that for the majority of our applications, plain chronological order is good enough, and that kind of makes sense, because recommendation systems are very biased toward freshness: the more recent user activity helps the most. One of the biggest challenges is actually something that now looks more like a traditional LLM problem: how do you balance the distribution of positives and negatives within the context? That becomes more of an ML engineering effort, figuring out how many positives versus negatives and how much information to put in the context.
I can add one more thing to this: there's another complication when you go to serving these models, which is that you don't want to break the KV caching you're using in serving. So it's going to be a bit more complicated, more cumbersome, to do something smarter than just putting the history in chronological order. That's something that needs to be designed; it's not obvious. Absolutely. One more
question. Uh you guys did so many
experiments, tried out so many things.
How's your entire system set up? Because
I'm assuming that you say quantization,
but you must have tried different forms
of quantization, what not. How do you
set up the system in such a way that you
can try out multiple experiments and see
what works best? Can you talk a bit
about that? Yes, Maz touched a bit on that one. The one thing we hold a very high bar for is automation. Our system is very automated, to the extent that when you're running an experiment, the results are pushed automatically into a spreadsheet. And when you have such an automated system, the developers become very efficient: if I just want to try a different quantization, I change the quantization parameters and everything else happens end to end with the click of a button. So I think automation is the key if you really want to optimize these models. Did you build all of that automation in house? Yes, most of it, but we leveraged, for example, Lightning and vLLM; we leverage a lot of open-source tools, but we make sure they are integrated very well with each other and we optimize the entire flow. Thank you.
Thank you again. So we'll come back after lunch at 2; I'll see you guys back here. Thank you.
All
right. I hope everyone has had a good lunch and some good discussions in the hallways; the hallway and after-talk discussions are the best, where you get great alpha from our speakers. Continuing on from the previous sharing, next we have Yu, staff research scientist at Netflix, and he'll be sharing about their big bet of building a foundation model for personalization. Yes, please.
All right, good afternoon. Thank you, Eugene, for the introduction. Today I'm going to share our big bet at Netflix on personalization, namely using one foundation model to cover all recommendation use cases. At Netflix we have diverse
recommendation needs. This is an example homepage for one profile on Netflix: a 2D layout of rows and items. Diversity comes in at least three levels. The first level is the row: we have diverse rows. We have genre rows, for example rows of comedies or rows of action movies; we have rows about new and trending, just-released titles; we also have rows about, for example, titles only available on Netflix. That's the first dimension. The second dimension is the items, or entities: in addition to the traditional movies and TV shows, we now have games, we have live streaming, and we're going to add more, so our content space is expanding to very heterogeneous content types. The third level is the page: we have the homepage, the search page, a kids' homepage that is tailored very differently toward kids' interests, and the mobile feed, which is a linear page rather than a 2D layout, and so on and so forth. So different pages are also very diverse. What happened traditionally
was that this naturally led to many specialized models developed over the years. Some models rank videos, some rank rows, some focus on shows a user hasn't watched yet, some focus on shows a user is already engaging with. Many of those models were built independently over the years; they may have different objectives, but they have a lot of overlap as well.

Naturally, this led to duplication, in our label engineering as well as our feature engineering. Take feature engineering as an example. We have this very commonly used fact data about user interaction history. The fact data is the same, but over the years many features were derived from that same fact data: counts of different actions, counts of actions within various time windows or sliced along other dimensions, similarity between the titles in the user's history and the target title, or simply the sequence of unique show IDs used as a sequence feature in the model. The list goes on and on, and because these features were developed independently for each model, they have slight variations but are largely very similar, and they become very hard to maintain.

So the challenge back then was: is this scalable? Obviously not. If we keep expanding our landscape of content types and business use cases, it's not manageable to spin up new models for each individual use case. There's not much leverage: there are some shared components for building features and labels, but by and large each model was spun up independently. That also hurt our innovation velocity, in the sense that you don't reuse as much as you could; instead you spin up new models pretty much from scratch. So this was the situation about four years ago, at the beginning or middle of the pandemic, and the question we asked at that time was: can we centralize the learning of user representation in one place?
The answer is yes, and we had this key hypothesis about a foundation model based on the transformer architecture. Concretely, there were two hypotheses. One: through scaled-up semi-supervised learning, personalization can be improved; the scaling law applies to recommendation systems as it applies to LLMs. Two: by integrating the foundation model into all systems, we can create high leverage and simultaneously improve all the downstream, canvas-facing models at the same time. We'll see in the following slides how we validated those hypotheses. I'll break the overview into two sub-sessions: first data and training, and then application and serving. So, about data and training.
Starting from data: a very interesting aspect of building such a foundation model, an autoregressive transformer, is that there are a lot of analogies with LLMs but also differences, so we can transfer a lot of learnings and inspiration from LLM development. If we start from the very bottom layer, that's data cleaning and tokenization. People who work with LLMs understand that tokenization decisions have a profound impact on model quality: although it's the bottom layer, the decisions you make there percolate through all the downstream layers and manifest as either model quality problems or model quality gains. This applies to the recommendation foundation model as well. There are some differences, though. Very importantly, instead of a language token, which is just one ID, each token here is an interaction event from the user, and that event has many facets or fields. It's not just one ID; there is a lot of rich information about the event, and all of those fields can play a role in the tokenization decision. That's something we need to consider very carefully: what is the granularity of tokenization, and how do you trade that off against the context window, for example. Through many iterations we reached, I think, the right abstractions and interfaces that let us adjust our tokenization to different use cases. For example, one version of tokenization is used for pre-training, and for fine-tuning against a specific application we apply a slightly different tokenization. Moving up from the
tokenization layer, we get to the model layers. At a high level, from bottom to top, we go through the event representation, the embedding layer, the transformer layer, and the objective layer. Event representation, as we just briefly touched upon: there is a lot of information in the event. At a high level you can break it down by when, where, and what. When the event happened is about time encoding. Where it happened is about physical location, your locale or country, and so forth, but also about the device, and about the canvas: which row, which page the action happened on. And what is about the target entity, the title you interacted with, what the interaction was, how long it lasted, and any other information associated with the action. That's where we need to decide what information to keep and what to drop.
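Very loosely, one event token might carry fields along these when/where/what lines; the names are illustrative, not Netflix's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InteractionEvent:
    # "When": time of the event.
    timestamp: int
    # "Where": surface and locale context the action happened in.
    country: str
    device: str
    page: str                  # e.g. "home", "search", "kids_home"
    row: Optional[str]         # which row / canvas, if applicable
    # "What": the target entity and the interaction itself.
    entity_id: str             # title, game, live event, ...
    action: str                # e.g. "play", "add_to_list", "thumbs_up"
    duration_s: Optional[int]  # how long, for play events

def tokenize(event, keep_fields=("entity_id", "action", "page")):
    """Tokenization decision: which facets of the event to keep. Coarser
    tokens stretch the effective context window; richer tokens carry more
    signal per event. Pre-training and fine-tuning can choose differently."""
    return tuple(getattr(event, f) for f in keep_fields)

ev = InteractionEvent(1717500000, "US", "tv", "home", "comedies",
                      "title_123", "play", 2700)
print(tokenize(ev))
```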
Moving one layer up is the embedding and feature transformation layer. One thing that needs to be pointed out is that for recommendation, we need to combine ID embedding learning with other semantic content information. If you only have ID embeddings learned from scratch in the model, then you have a cold-start problem, meaning that for titles the model hasn't seen during training, it doesn't know what to do at inference time. So we need semantic content information to be complementary to those ID embeddings. This is not a problem for LLMs, but the cold-start problem is very commonly encountered in recommendation systems. For the transformer layer, I think
there's no need to go too deep into architecture choices or optimization here. The only thing I want to point out is that we use the hidden-state output from this layer as our user representation, because one of the primary goals of the foundation model is to learn a good long-term user representation. We then need to put this in context. Things to consider are, for example: how stable is the user representation, given that the user profile and interaction history keep changing? How do we guarantee or maintain the stability of that representation, and what kind of aggregation should we use? Broadly, you can aggregate across the time dimension, in terms of the sequence, or aggregate across layers: you have multiple self-attention layers, so how do you aggregate them? And lastly, do we need to do explicit adaptation of the representation based on the downstream objective, to fine-tune it?

Then we move to the very top layer: the objective, or loss function.
very top layer objective loss function.
This is also very interesting in the
sense that it's much richer than LLM
because you can see first we use uh
instead of one sequence but multiple
sequence to represent the output because
you can have a sequence of entity ids
that's your like uh next token
prediction softmax or sample softmax but
then we have many other facets of field
of each event that can be also used as a
target right so it could be for things
like uh action type. It could be some
aspect of the entity's metadata like
entity type, yarn, language, so on and
so forth and also about your action like
prediction of the duration or uh the
device where the action happened or the
time when the next uh user play will
happen. So those are all legitimate uh
targets or labels depends on your use
case you can use them to do the
finetuning. Now instead of so you can
cast the problem as a multitask learning
problem multi head or hierarchical
prediction but you can also use them
just as your weights your rewards or
your mask on the loss function. So in
terms of to adapt the model to zooming
into one subcategory of uh user behavior
you want to you want the model to learn.
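A small sketch of that multi-head idea: one shared hidden state from the transformer, separate prediction heads for the next entity, the action type, and other facets, combined with weights (or masks) into a single loss. The dimensions and head choices are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadObjective(nn.Module):
    def __init__(self, d_model=256, n_entities=50_000, n_action_types=8, n_devices=5):
        super().__init__()
        self.next_entity = nn.Linear(d_model, n_entities)   # next-item softmax
        self.action_type = nn.Linear(d_model, n_action_types)
        self.device = nn.Linear(d_model, n_devices)

    def forward(self, hidden, targets, weights=(1.0, 0.3, 0.1)):
        # hidden: (batch, d_model) transformer output at each prediction point.
        losses = (
            F.cross_entropy(self.next_entity(hidden), targets["entity"]),
            F.cross_entropy(self.action_type(hidden), targets["action"]),
            F.cross_entropy(self.device(hidden), targets["device"]),
        )
        # The weights (or per-example masks) let a fine-tune zoom in on the
        # slice of user behavior it cares about.
        return sum(w * l for w, l in zip(weights, losses))

head = MultiHeadObjective()
hidden = torch.randn(32, 256)
targets = {"entity": torch.randint(0, 50_000, (32,)),
           "action": torch.randint(0, 8, (32,)),
           "device": torch.randint(0, 5, (32,))}
print(head(hidden, targets))
```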
Okay. So that's about the model
architecture that I want to talk about.
So, does it scale? The first part of the first hypothesis we want to answer is whether the scaling law applies, and I think the answer is yes. Over roughly two to two and a half years of scaling up, we constantly kept seeing gains, from models trained on the order of a few million or ten million profiles to models now on the order of one billion parameters, scaling up the data accordingly. We stopped here not because the gains stopped, but because, as you may realize, recommendation systems usually have much more stringent latency and cost requirements, so scaling up further also requires distilling back down. But I certainly don't think this is the end of the scaling law. Before wrapping up the data
and training discussion, I'd like to highlight some of the learnings we borrowed from LLMs that I find quite interesting. This is not an exhaustive list, but here are my top three. The first is multi-token prediction; you may have seen this in the DeepSeek paper, among others. Implementation-wise you can use multiple heads or multiple levels, different implementation flavors, but the goal is really to force the model to be less myopic, more robust to the time shift between training and serving, and to target long-term user satisfaction and long-term user behavior instead of just the next action. We observed a very notable metrics improvement by doing that. The second is multi-layer representation, which I touched on for the profile representation. This is also translated from LLM-side techniques: layer-wise supervision, self-distillation, or multi-layer output aggregation. The goal here is really a better and more stable user representation.
Lastly, and this should be no surprise: long context window handling, from truncated sliding windows to sparse attention to progressively training on longer and longer sequences, and eventually all the parallelism strategies. This is about more efficient training and maximizing learning. Okay, so let's shift gears and talk
about the serving and applications.
Before the foundation model (FM), this was roughly the algorithm stack we had for personalization: many datasets, many features, many models independently developed, each serving one or more canvases or applications. Now, with the foundation model, we have largely consolidated the data and representation layer, especially the user representation as well as the content representation in the personalization domain, and the modeling layer as well, because each application model is now built on top of the FM and becomes a thinner layer instead of a standalone, full-fledged model trained from scratch. So how do the various models
utilize the foundation model? There are three main approaches, or consumption patterns. First, the foundation model can be integrated as a subgraph within the downstream model; additionally, the content embeddings learned by the foundation model can be integrated as embedding lookup layers. The downstream model is a neural network, and it may already have its own sequence transformer tower or graph, so we use the pre-trained foundation model subgraph to directly replace that. Second, we can push out embeddings, no surprise there: content and entity embeddings as well as member embeddings. The main concerns here are how frequently we want to refresh the member embeddings and how we make sure they stay stable as we push them to the centralized embedding store. This, of course, allows far wider use cases than just personalization, because analytics folks and data scientists can also fetch those embeddings directly to do the things they want. Finally, users of the FM can extract the model and fine-tune it for specific applications, either fine-tuning directly or distilling it to meet online serving requirements, especially for applications with a very strict latency
requirement. To wrap up, I want to show at a high level the wins we've accumulated over the last year and a half by incorporating the FM in various places. The blue bars represent how many applications have incorporated the FM; the green bars represent AB test wins, because within any application we may have multiple AB tests going on. So we do indeed see high leverage from the FM, bringing both AB test wins and infrastructure consolidation. I think the big bets are validated: it is a scalable solution, both in terms of scaling up the model with improved quality and in terms of consolidating the infrastructure so that scaling to new applications is much easier. It is high leverage because the learning is centralized, and innovation velocity is also faster, because many newly launched applications can directly fine-tune the foundation model to launch their first experience.
So, current directions. One is that we want a universal representation for heterogeneous entities; as you can guess, this is semantic IDs and work along those lines, because we want to cover Netflix's expansion into very heterogeneous content types. The second is generative retrieval for collection recommendation: instead of recommending a single video, be generative at inference and serving time, because with multi-step decoding, a lot of the considerations about business rules or diversity, for example, can be handled naturally in the decoding process. Lastly, faster adaptation through prompt tuning. This is also borrowed from LLMs: can we train some soft tokens so that at inference time we can swap those soft tokens in and out to prompt the FM to behave differently? That is also a very promising direction we're getting into. All right, that concludes my talk. Thank you for your attention, and I'm happy to take questions. Thank you. If you have any
questions may I invite you to come to
the mic while we get our next speakers.
Hi, thank you for the talk. Since you have billions of users, beyond the recommendation system maybe it can do much more, right? What are your thoughts on that, since I could just ask it to predict who the next president of the United States will be? Thank you. I actually don't quite follow; could you explain a little what you mean by beyond recommendation? Do you mean other personalization, or other things? Yeah, since you're capturing users' preferences, those preferences also relate to what things they buy or who they'll vote for in the next election. So do you think your foundation model has the capability to expand beyond recommending what videos they want to watch, to what else they like or their opinions on anything else? Thank you. Yes. So I think we are expanding to different entity types, and also capturing users' tastes both on and off our platform. I think that's the general direction we're going in.
Yes.
Great, thank you. This was really helpful. A question, and you might not be able to share for IP reasons, but whatever you can: thoughts on graph models? I didn't hear a lot about that in your talk, graphs and reinforcement learning. Any utilization there, any benefits you saw, any boost in performance or accuracy? Yeah, that's a good question. We actually have a dedicated sub-team doing graph models, especially around our knowledge graph, to cover the content space both on and off our platform across the entire entertainment ecosystem. We actually use a lot of embeddings from the graph model for cold-start; where I showed those semantic embeddings, that's where they come from. In terms of reinforcement learning, yes, as well, especially where we consider sparse rewards: the signals we have from user actions are pretty sparse, but we want to use them to guide, for example, how we generate a whole collection, and that's where we need to consider how to use those rewards to guide the process. Yeah, sure, I'll be around, so we can also follow up. Yeah.
Do you also use these unified representations as features for other models? You had a slide on how the model is used. Yeah. So, for the embeddings learned within our model, we also expose them for downstream models to consume directly. And to train our unified embedding, we also have some upstream embeddings, for example the GNN embeddings, and those are consumed as well.
Hello. Uh, for these embeddings, are you just using metadata about the video to understand what users like, or are you actually using frame-by-frame data from the video, or second-long clips? Uh, not yet. We do have that from another content group in our organization, but I think the trend will go there. So we are not yet at a very granular level, like clip level or frame level; we have those embeddings, but we haven't quite incorporated them yet. Thank you. Thank you.
Please, another round of applause.
So next we have Jastri and Vinesh, director of machine learning and staff machine learning engineer respectively, from Instacart. They're going to share how they use LLMs to improve search and discovery at Instacart.
Hi, good afternoon everyone. My name is Vinesh, and we are part of the search and machine learning team at Instacart. Today we'd like to talk to you about how we are using LLMs to transform our search and discovery.
So first, a little bit about ourselves. As I mentioned, we are part of the search and discovery ML team at Instacart. For those of you who may not be familiar with Instacart, it's the leader in online grocery in North America, and our mission is to create a world where everyone has access to the food they love and more time to enjoy it together.
As for what we'll actually talk about today: first we'll talk about the importance of search in grocery e-commerce. Then we'll look into some of the challenges facing conventional search engines, and then get to the meat of the talk, which is how we are using LLMs to solve some of these problems. Finally, we'll finish with some key takeaways from today's talk.
So, coming to the importance of search in grocery e-commerce: I think we've all gone grocery shopping. Customers come with long shopping lists, and it's the same on the platform; people are looking for tens of items. Of these, a majority are restocking purchases, that is, things the customer has bought in the past, and the remaining are items the user is trying out for the first time. And a majority of these purchases come from search. So search has a dual role: it needs to help the customer quickly and efficiently find the product they're looking for, and it also needs to enable new product discovery. And new product discovery isn't just important for the customer; it's also great for our advertisers, because it helps them showcase new products, and it's good for the platform, because it encourages larger basket sizes.
So let's see what problems with our existing setup make this hard. To begin with, we have two classes of queries that are generally more challenging, especially from an e-commerce perspective. The first are overly broad queries, like the "snacks" query on the left, where there are tons of products that map to the query. Because our models are trained on engagement data, if we aren't exposing these products to the user, it's hard to collect the engagement data needed to rank them highly: the traditional cold-start problem, in a way. Then, as you can see with the query on the right, we have very specific queries like "unsweetened plant-based yogurt", where the user is looking for something very specific and the query doesn't happen very frequently, which means we just don't have enough engagement data to train the models on. And while we have done quite a bit of work to improve this, the challenge we keep facing is that while recall improves, precision is still a challenge, especially in a pre-LLM world.
The next class of problems is how we actually support new item discovery, as we spoke about. When a customer walks into a grocery store, say into the pasta aisle, they might see new brands of pasta that they would want to try out. Along with that, they would also see pasta sauce and everything else needed to make a bowl of pasta. Customers want a similar experience on our site. We have heard multiple rounds of feedback from our customers saying, "I can find the product I want via search, but when I'm trying to find any related products it's a bit of a dead end; I would need to make multiple searches to get where I want." So this was a problem that we wanted to solve as well.
And yeah, as I mentioned, pre-LLMs this was a hard problem because of the lack of engagement data. So let's see how we actually used LLMs to solve these problems. I'll talk specifically about how we used LLMs to uplevel our query understanding module. Query understanding, as I'm sure most of you know, is the most upstream part of the search stack, and very accurate outputs are needed to enable better retrieval and recall and, finally, to improve our ranking results. Our query understanding module contains multiple models, such as query normalization, query tagging, query classification, and category classification. In the interest of time, I'll just pick a couple of models and talk about how we really improved them.
The first is our query-to-product-category classifier. Essentially, we take a query and map it to a category in our taxonomy. As an example, a query like "watermelon" maps to categories like fruits and organic fruits, among others. Our taxonomy has about 10,000 labels, of which about 6,000 are more commonly used. Because a query can map to multiple labels, this is essentially a multi-label classification problem. In the past we had a couple of different traditional models. One was a FastText-based neural network, which essentially modeled the semantic relationship between the query and the category. As a fallback, we had an NPMI model, a statistical co-occurrence model between the query and the category. While these techniques worked well for head and torso queries, we had really low coverage for tail queries because, again, we just didn't have enough engagement data to train the models on. To be honest, we also tried more sophisticated BERT-based models, and while we did see some improvement, the lack of engagement data meant that, for the increased latency, we didn't see the wins we had hoped for.
So this is where we tried an LLM. First, we took all of our queries and fed them, along with the taxonomy, into an LLM and asked it to predict the most relevant categories for each query. The output that came back was decent; when we looked at it, it made a lot of sense. But when we actually ran an online A/B test, the results weren't as great. One particular example that illustrates the point very well is a query like "protein". Users who come to Instacart and type "protein" are looking for protein shakes, protein bars, or other protein supplements. The LLM, on the other hand, thinks that when a user types "protein" they're looking for chicken, tofu, or other protein-rich foods. This mismatch, where the LLM doesn't truly understand Instacart user behavior, was really the cause of the problem.
So to improve our results, we switched the problem around: we took the most commonly converting categories, the top-k converting categories for each query, and fed them as additional context to the LLM. I'm simplifying a bit; there's a bunch of ranking and downstream validation that happens. But essentially we generated a set of ranked candidates, and this greatly simplified the problem for the LLM. To illustrate with an example, take a query like "Vernors soda". Our previous model identified this as a brand of fruit-flavored soda, which is not incorrect, but it's not very precise either. The LLM did a much better job: it identified it as a brand of ginger ale, and with that, our downstream retrieval and ranking improved greatly as well. As you can see from the results below, we saw a big improvement, especially for tail queries: precision improved by 18 percentage points and recall improved by 70 percentage points, which is pretty significant for our tail queries.
And to very briefly look at our prompt: as you can see, it's very simple. We essentially pass in the top converting categories as context, there are a bunch of guidelines about what the LLM should output, and that's it. That was all that was needed to enable this. Again, I'm simplifying the overall flow, but the general concepts are pretty straightforward.
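As a rough illustration of this candidate-reranking setup, not Instacart's actual prompt or code; the function name, wording, and example categories below are made up, the prompt construction might look something like this:

```python
# Hypothetical sketch: ask an LLM to pick the most relevant taxonomy
# categories for a query, given the top-converting categories as candidates.
# The prompt wording, candidate source, and output format are assumptions,
# not Instacart's actual implementation.

def build_category_prompt(query: str, top_converting_categories: list[str]) -> str:
    candidates = "\n".join(f"- {c}" for c in top_converting_categories)
    return (
        "You are labeling grocery search queries with product categories.\n"
        f"Query: {query}\n"
        "Candidate categories (ranked by historical conversions):\n"
        f"{candidates}\n"
        "Guidelines: choose only categories a shopper issuing this query is\n"
        "likely to buy from; return a JSON list, most relevant first."
    )

# Example usage with made-up data for the "protein" query:
prompt = build_category_prompt(
    "protein",
    ["Protein Shakes", "Protein Bars", "Protein Supplements", "Chicken Breasts"],
)
print(prompt)  # this string would be sent to the LLM in an offline batch job
```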
Coming to another model: the query rewrites model is also pretty important from an e-commerce perspective, especially at Instacart, because not all retailers are created equal. Some have large catalogs, some have very small catalogs, and the same query may not always return results. That is where a rewrite is really helpful. For example, going from a query like "1% milk" to just "milk" would at least return results that the customer can decide whether or not to buy. Again, our previous approach, which was trained on engagement data, did decently well on head and torso queries but suffered from a lack of engagement data on tail queries. By using an LLM, similar to what we did for the product category classifier, we were able to generate very precise rewrites. In the example here, you can see there's a substitute, a broad, and a synonymous rewrite. For the case of "avocado oil", a substitute is "olive oil", a broader rewrite is "healthy cooking oil", and a synonymous rewrite is "avocado extract". Looking at the results, we saw a bunch of offline improvements, and even within third-party LLMs, simply going from simpler models to better models improved the results quite a bit; this is based on our human evaluation data. So just improving the model itself improved the overall performance of the task. In terms of online improvements, we saw a large drop in the number of queries without any results. This is pretty significant, because we could now show results to users who previously saw empty results, which was great for the business.
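For illustration only, here is a sketch of the rewrite-generation idea under the same hedges: the prompt wording and JSON output shape are assumptions, while the avocado-oil rewrites themselves come from the talk.

```python
# Hypothetical sketch of the rewrite idea: ask an LLM for substitute, broad,
# and synonymous rewrites of a query so that retailers with small catalogs
# can still return results. Prompt wording and output structure are
# assumptions for illustration only.

import json

def build_rewrite_prompt(query: str) -> str:
    return (
        "For the grocery search query below, produce three rewrites as JSON "
        'with keys "substitute", "broad", and "synonymous". Each rewrite '
        "should still match the shopper's intent when the exact item is "
        "unavailable.\n"
        f"Query: {query}"
    )

# The kind of output the talk describes for "avocado oil"; the values are
# from the talk, the JSON shape is an assumption.
example_output = json.loads(
    '{"substitute": "olive oil", '
    '"broad": "healthy cooking oil", '
    '"synonymous": "avocado extract"}'
)
print(build_rewrite_prompt("avocado oil"))
print(example_output["substitute"])
```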
So, coming to an important part of this: how we actually scored and served the data. The thing is, Instacart has a pretty idiosyncratic query pattern: there's a very fat head and torso set of queries, and then a long tail. By precomputing the outputs for all of the head and torso queries offline, in batch mode, we were able to cache all of this data; then online, when a query comes in, we can serve it straight from the cache with very low impact on latency, and fall back to our existing models for the long tail of queries. This worked really well because it didn't impact our latency while it greatly improved our coverage for the long tail of queries. Now, for the really long tail, where I said we would fall back to our existing models, we're actually trying to replace those with a distilled LLM so that we can do a much better job than the existing models.
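A minimal sketch of this fat-head caching pattern, assuming a simple in-memory dictionary stands in for the real precomputed store and fallback models:

```python
# Hypothetical sketch of the serving pattern described above: head/torso
# queries are precomputed offline with the LLM and cached; everything else
# falls back to the existing real-time models. Names and data are made up.

precomputed_cache = {
    # populated by an offline batch job that calls the LLM per query
    "watermelon": ["Fruits", "Organic Fruits"],
}

def fallback_model_predict(query: str) -> list[str]:
    # stand-in for the existing FastText / NPMI style models
    return ["Uncategorized"]

def categories_for_query(query: str) -> list[str]:
    normalized = query.strip().lower()
    cached = precomputed_cache.get(normalized)
    if cached is not None:
        return cached                               # O(1) lookup, negligible latency
    return fallback_model_predict(normalized)       # long-tail fallback

print(categories_for_query("Watermelon"))
print(categories_for_query("unsweetened plant-based yogurt"))
```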
So, to summarize: from a query understanding perspective we have a bunch of models, and just using our hybrid approach greatly improved their performance. But what's actually more interesting is that today query understanding consists of a bunch of models and, as Yazu was saying in the Netflix talk, managing all of these models is complex from a systems perspective. So consolidating all of these into an SLM, or maybe a large language model, can make the results a lot more consistent. I'll finish with an example. There's a query "hum", spelled h-u-m, where we saw some interesting issues. Our query brand tagger correctly identified it as a brand of kombucha, but then our spell corrector unfortunately corrected it to "hummus", so the results were really confusing to users and pretty bad. By using a more unified model, the results were much better. The second point is that by using an LLM for query understanding, we can pass in extra context: instead of just generating results for the query in isolation, we can really try to understand what the customer's mission is, for example detect whether they're actually here to buy ingredients for a recipe, and then generate content for that. To talk more about that, I have TSV
here.
Thank you. Now I'll quickly talk about how we used LLMs to show more discovery-oriented content on the search results page. Just to restate the problem: our users found that while our search engine was very good at showing exactly the results they wanted, once they added an item to the cart they couldn't do anything else useful with the search results page. They either had to do another search or go to another page to fulfill their next intent. Solving this with traditional methods would require a lot of feature engineering or manual work; LLMs solved this problem for us, and I'll talk about how.
This is how it looked in the end. For queries like "swordfish", where let's say there are no exact results, we used LLMs to generate substitute results, like other seafood alternatives and meaty fish such as tilapia. Similarly, for queries like "sushi", where there were a lot of exact results, we would show things like Asian cooking ingredients or Japanese drinks at the bottom of the search results page, in order to get users to engage. I'll get to the techniques, but both of these discovery-oriented result types led to improvements in engagement as well as in revenue per search.
Cool. Like I said, I'll get into the techniques, but let's first talk about the requirements for generating such content. First, we obviously wanted to generate content that is incremental to the current solutions; we don't want duplicates of what we're already showing. The second requirement, and the most important one, is that we wanted all of the LLM generations to be aligned with Instacart's domain knowledge. What does this mean? If a user searches for a query like "dishes", the LLM should understand that it refers to cookware and not food, and vice versa for a query like "Thanksgiving dishes".
So with these requirements in mind, we started with a very basic generation approach. What did we do? We took the query and told the LLM: you are an AI assistant, and your job is to generate two shopping lists, one a list of complementary items and another a list of substitute items for a given query. It looked good; we saw the results, they looked pretty good, and our PMs vetted everything. But, as Vinesh said, when we launched this to our users, we saw that the results were good but users weren't engaging with them as much as we would have liked. So we went back to the drawing board and tried to analyze what was going on, and what we quickly realized was that while the LLM's answers were common-sense answers, they weren't really what users were looking for. Taking the protein example again: when users search for "protein", they look for protein bars and protein shakes rather than what the LLM would give us, which is chicken, turkey, tofu, and so on.
So what we did was augment the prompt with Instacart domain knowledge. In one case, we took the query and augmented it with the top converting categories for that particular query, along with any annotations from the query understanding models, like "here is a brand present in the query" or "here is a dietary attribute present in the query". In another case, we passed in the query along with the subsequent queries users issued after this particular query. Once we augmented the prompt with this additional metadata about how Instacart users behave, the results were far better.
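A hedged sketch of what such a domain-augmented prompt could look like; the field names, wording, and example values are illustrative, not Instacart's production prompt:

```python
# Hypothetical sketch of the domain-knowledge-augmented generation prompt:
# the query is sent to the LLM together with top-converting categories,
# query-understanding annotations, and follow-up queries. All names and
# values here are assumptions made for the example.

def build_discovery_prompt(
    query: str,
    top_converting_categories: list[str],
    annotations: dict[str, str],
    subsequent_queries: list[str],
) -> str:
    notes = ", ".join(f"{k}: {v}" for k, v in annotations.items())
    return (
        "You are an AI shopping assistant. Generate two shopping lists for the "
        "query below: one of complementary items and one of substitute items.\n"
        f"Query: {query}\n"
        f"Top converting categories: {', '.join(top_converting_categories)}\n"
        f"Query annotations: {notes}\n"
        f"Queries users issue next: {', '.join(subsequent_queries)}\n"
        "Only suggest items consistent with this shopping behavior."
    )

# Made-up example for the "protein" query from the talk.
print(build_discovery_prompt(
    "protein",
    ["Protein Shakes", "Protein Bars"],
    {"dietary_attribute": "high protein"},
    ["whey protein powder", "protein shake"],
))
```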
I don't have time to show the before and after, but like I said, we definitely saw a huge improvement in both engagement and revenue.
I'll quickly talk about how we served all of this content. Very similar to the query understanding work, it's impractical to call the LLM in real time because of latency and sometimes cost concerns. So what we did was take all of our historical search logs, call the LLM in batch mode, and store everything: the query, the content metadata, and even the products that could potentially show up in the carousel. Online, it's just a very quick lookup from a feature store, and that's how we were able to serve all of these recommendations quickly.
Again, things weren't as simple as we're making them out to be. As Vinesh said, the overall concept is simple and the prompt itself is very simple, but there were three key challenges we solved along the way. One is aligning generation with business metrics like revenue; this was very important for seeing topline wins, so we iterated over the prompts and the kind of metadata we feed to the LLM in order to achieve this. Second, we spent a lot of time on improving the ranking of the content itself; our traditional pCTR and pCVR models did not work, so we had to employ strategies like diversity-based reranking to get users to engage with the content. And the third thing is evaluating the content itself: making sure that whatever the LLM produces is right and not hallucinated, and that it adheres to what Instacart needs as a product.
Cool. So, summarizing the key takeaways from our talk: the LLM's world knowledge was super important for improving query understanding predictions, especially for tail queries. While LLMs were super helpful, we really found success by combining Instacart's domain knowledge with LLMs in order to see the topline wins that we saw. And the third and last one: evaluating the content, as well as the query predictions, was far more important and far more difficult than we anticipated. We used LLM-as-a-judge to make this happen; it is a very important step, and we realized that kind of late. So yeah, that's all from us. We'll take questions now.
[Applause]
Thank you, Jastri and Vinesh. We'll take questions at the mic while the next speaker gets set up.
Hi, thanks for the talk. Have you also been experimenting with queries that are very long natural language, like "I want these three items and these five items", the way we would do it in ChatGPT, or is a single item still the focus?
Yeah, I think we have actually launched something in the past called Ask Instacart, if you've heard of it, which essentially takes natural language queries and tries to map them to search intents. So, for example, you might say "healthy foods for a three-year-old baby" or something like that, and that would map to things like fruit slices, and whether three-year-old toddlers can eat popcorn, something along those lines. And then we had our usual recall and ranking stack retrieve those results.
Any learnings from that experiment?
Yeah, I think we actually have a lot of learnings from that. Essentially, as we already mentioned, we need to inject a lot of Instacart context into the model to be able to get decent results. The evaluation part is really key, so having a robust automated evaluation pipeline was important. And lastly, passing context downstream: for example, if it's a Mother's Day query and we come up with an individual search intent of "perfumes", you really want women's perfumes to be in there, whereas when we just passed "perfumes", you could see all kinds of items. So passing that context from the LLM to the downstream systems is really important. Thanks.
Yeah, we have a lot of examples; we can talk after. Thank you.
I'm sorry, I don't think we have time for any more questions, but we have our last speaker, and right after that the speakers will be hanging around. So thank you, Jastri and Vinesh, again.
[Applause]
Finally, we have Devansh, principal product manager at Google. He'll be sharing with us how they adapted a Gemini checkpoint for YouTube recommendations. He'll also share how they built semantic IDs, which we've heard so much about in this track, that distill multimodal features into tokens. Take it away.
Wonderful. I guess while we get the slides up, I can introduce myself. I'm Devansh, a product manager at YouTube. I've been working on recommendation systems at Google for a while, and we built a new recommendation system across DeepMind and YouTube that uses Gemini to recommend YouTube videos. I just want to take you through the process of how we built that, with a lot of examples, and share a recipe for how you might build this kind of LLM-based recommendation system.
I'll just get started without some of the slides. Here's why I think this is important: there's a lot of attention on how LLMs are going to transform search. Google Search is having a revolution, ChatGPT has a big chat interface, Perplexity is a product a lot of people use. But I think recommendations are probably a bigger problem that is underhyped, because they're transparent to the user, and I think the application of LLMs to recommendations is going to be a bigger consumer application than search. In terms of my talk, I want to introduce the problem of YouTube recommendations, then talk about how we've built large recommender models by adapting Gemini for YouTube, how we built semantic ID and how we're using it, and then end with a recipe for how you might use an LLM to make a recommendation system.
To start with why this is important: who here watches YouTube every day? It's one of the biggest consumer apps in the world, and a large majority of the watch time on YouTube is driven by the recommendation system. We serve recommendations across home and watch next, we have a big Shorts product, and even a lot of our search results are personalized in some way. So if you think about consumer applications of LLMs, in terms of consumer engagement and impact, recommendations are going to be a much bigger application than search, and this is true of any consumer app with a billion daily active users.
The way I think about the recommendation problem is that you're trying to learn a function that takes a user and their context as input and gives them a bunch of recommendations. At YouTube we have a bunch of user information, like their demographics, their age, their gender, where they're located. We have a lot of context about them: what are the last hundred videos they watched, how deeply did they engage with them, what did they comment on, who are they subscribed to. And we use all of that to make video recommendations. We've tried a lot of different modeling techniques here: multi-headed rankers, embedding models, sequence-to-sequence transformers; there's a long history. About two years ago, we started thinking about how we could rethink this recommendation system on top of Gemini, which has been making incredible progress in modeling, and how we could adapt that for YouTube. And so we've built this system, which we call LRM, the large recommender model, where we adapt Gemini for recommendations.
Okay, I'll just pause for a second. That good? Yeah. Okay, cool. Back to how we're adapting Gemini for recommendation tasks. We start with a base Gemini checkpoint, and then we adapt it for YouTube recommendations, teaching it a lot of information about YouTube to get a unified, YouTube-specific checkpoint of Gemini, which we call LRM. Then we can align it for different recommendation-related tasks like retrieval and ranking, and basically make small custom versions of this model for all of the major recommendation surfaces. This is a model that we have had launched in production at YouTube for a while on the retrieval side, and we're experimenting a lot on the ranking side. I want to start by explaining how we built this YouTube-plus-Gemini model, and then we'll talk about how we use it for retrieval.
The first step for this kind of model is that you have to develop a way to tokenize videos. In an LLM, when you give it an input, it tokenizes that text and then predicts the next text token. The ideal product we wanted was to give this model an input of a number of video tokens and get video tokens out that would be good recommendations.
We had to build this because, even with a million tokens of context, when you want to reason over many videos you have to compress the video representation in some way. Before we settled on this approach, we tried a bunch of other things, like predicting search queries and retrieving videos through those, or trying to recommend videos directly, and those solutions were just not good enough. So we built semantic ID, which we actually wrote a paper about last year; it was presented at RecSys. The way semantic ID works is you take a video and extract a number of features from it, like the title, description, transcript, and even audio and video frame-level data. You put all of that into a multi-dimensional embedding, and then you quantize it using RQ-VAE to give every video a sequence of tokens. We've written a pretty detailed paper about this if people are interested, but at a high level, the way I think about it is that we're making the atomic units for a new language of YouTube videos. Once we have these tokens, you can imagine the whole corpus of billions of videos on YouTube getting organized around these semantically meaningful tokens. You could imagine the first token representing topics like music, gaming, or sports; within sports you would have the different sports, and then you can get down to volleyball. So two volleyball videos would share some tokens in the prefix but also have a unique identifier at the end. I think this in itself is an interesting milestone: moving away from hash-based tokenization to a semantically meaningful one. And we use this in production at YouTube.
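To make the quantization step concrete, here is a minimal residual-quantization sketch; the codebooks are random and purely illustrative, since a real RQ-VAE learns its codebooks and encoder, and this is not YouTube's implementation:

```python
# Minimal sketch (not YouTube's code) of the residual-quantization idea
# behind semantic IDs: a video embedding is mapped to a short sequence of
# discrete tokens, one per codebook level, so similar videos share a prefix.

import numpy as np

rng = np.random.default_rng(0)
NUM_LEVELS, CODEBOOK_SIZE, DIM = 4, 256, 64
codebooks = rng.normal(size=(NUM_LEVELS, CODEBOOK_SIZE, DIM))  # random stand-ins

def semantic_id(embedding: np.ndarray) -> list[int]:
    residual = embedding.copy()
    tokens = []
    for level in range(NUM_LEVELS):
        # pick the nearest codeword at this level, then quantize the residual
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        residual = residual - codebooks[level][idx]
    return tokens  # e.g. [212, 17, 93, 4]; coarse topics first, detail later

video_embedding = rng.normal(size=DIM)  # stand-in for the multimodal embedding
print(semantic_id(video_embedding))
```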
What we then tried to do is a process we call continued pre-training, where we take this model and have it understand both English and this new YouTube language. We do this in two big steps. One is around linking text and SID, and the second step is around having it understand sequences of watches and being able to reason across this video space. As an example of the training tasks we teach this model: you have a video, say a tennis highlights video, which has some semantic ID, and you can prompt the model with "this video has title" and the model learns to output the title. You can imagine very similar tasks with "has creator" or "has topics" and so on. So you're basically trying to connect text and these video tokens.
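A small sketch of how such text-to-SID training pairs might be constructed; the token strings and task wording here are made up for illustration:

```python
# Hypothetical sketch of the "link text and SID" pre-training idea: turn
# catalog metadata into (prompt, target) pairs so the model learns to
# translate between semantic-ID tokens and English. Not YouTube's format.

def text_linking_examples(video: dict) -> list[tuple[str, str]]:
    sid = " ".join(video["semantic_id_tokens"])
    return [
        (f"Video {sid} has title:", video["title"]),
        (f"Video {sid} has topics:", ", ".join(video["topics"])),
    ]

video = {
    "semantic_id_tokens": ["<sid_212>", "<sid_17>", "<sid_93>", "<sid_4>"],
    "title": "Wimbledon 2024 semifinal highlights",
    "topics": ["tennis", "sports"],
}
for prompt, target in text_linking_examples(video):
    print(prompt, "->", target)
```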
Then, for the second step, we have a corpus of all the YouTube engagement data: all the paths users took through YouTube as they watched videos together. You can prompt the model with something like "a user has watched the following videos: A, B, C, D", mask some of those videos, and the model learns to predict the masked ones. Now it's starting to understand which videos are watched together and to make relationships between videos on the basis of user engagement. After a bunch of pre-training tasks like this, we get a really interesting model that can reason across English and YouTube videos. Here is an example from a user's watch history, and we find the model can now reason across these videos. You could prompt it with things like: "video one is interesting to tennis fans because it's about Wimbledon; video two is interesting to F1 fans because it's about the Spanish Grand Prix; video three is interesting to math fans because it's about pi; video four is going to be interesting to..." and the model is able to work out that it's interesting to technology fans because it's about AI. And this is just based on the semantic ID of each video; it doesn't really have a lot of other information to go off of. So I think this in itself is a very interesting checkpoint that is starting to reason across English and YouTube.
Once we have this model, we think about how to use it for different video recommendation tasks at YouTube, and the first one we focused on is generative retrieval. Here you can just construct a prompt for every user and see what the model recommends. In this example, the user is a 24-year-old woman in the US on Android. She's watching a highlight video from the Olympics, and she has a watch history of, say, 50 videos she's watched in the past and how she engaged with them. You can just construct a prompt like the one on the right, with the user demographic information and the context video, and have the model decode video recommendations as semantic IDs.
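A hedged sketch of that prompt construction; the format, field names, and token strings are assumptions, not the production prompt:

```python
# Hypothetical sketch of the generative-retrieval prompt described above:
# user context plus a watch history expressed as semantic-ID tokens, with
# the model asked to decode the semantic IDs of videos to recommend.

def build_retrieval_prompt(user: dict, context_video_sid: str,
                           watch_history: list[dict]) -> str:
    history_lines = "\n".join(
        f"- {w['sid']} (watched {w['watch_fraction']:.0%})" for w in watch_history
    )
    return (
        f"User: {user['age']}-year-old {user['gender']}, {user['country']}, "
        f"{user['device']}\n"
        f"Now watching: {context_video_sid}\n"
        f"Watch history:\n{history_lines}\n"
        "Recommend next videos as semantic IDs:"
    )

prompt = build_retrieval_prompt(
    {"age": 24, "gender": "woman", "country": "US", "device": "Android"},
    "<sid_88> <sid_5> <sid_140> <sid_2>",
    [{"sid": "<sid_88> <sid_7> <sid_21> <sid_9>", "watch_fraction": 0.9}],
)
print(prompt)  # the LRM would decode semantic-ID tokens as its completion
```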
We find that this gives really interesting, unique recommendations, especially for our hardest recommendation tasks. In this example, when you're watching this highlight from the Olympics, the production system before LRM would give you other men's track races. With the new model, it's able to find the unique connection between the user demographic and their past watch history and surface related women's races that we weren't able to recommend in the past. We find that, especially for users we don't know as much about, we get very interesting and unique recommendations out of this strategy. So we've experimented with this and launched it in a few places at YouTube.
The big finding is that LRM is a very powerful model, but it's really expensive to serve. It learns very quickly and is very training-data efficient, and it handles our toughest recommendation tasks, but the biggest limitation was that the serving costs were too high, especially at the scale YouTube operates at, with billions of users. So after we got our first experiments working, we spent a lot of time just reducing the TPU serving cost, and we got 95%-plus cost savings to be able to actually launch this in production.
One other strategy we used, which I think is kind of interesting, is that we tried to turn this into an offline problem: it's the same prompt and the same model, we just remove the personalized aspects of the prompt. We wanted to build an offline recommendations table: if you're watching video A, what are the candidate videos that would be good to watch next? Normally these unpersonalized recommendation models just don't hold a candle to a personalized recommender, but because LRM is trained from a really big checkpoint, it actually gives us some differentiated recommendations. So in the YouTube context, we can take our corpus of billions of videos, look at the head, which represents a lot of the watch time, do offline inference, make this offline recs table, and then just do a simple lookup to serve some recommendations. This was a complete way around our serving problems.
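A minimal sketch of this offline-table pattern, with stand-in functions and an in-memory dictionary in place of the real batch pipeline and serving store:

```python
# Hypothetical sketch of the offline-table workaround described above: strip
# personalization from the prompt, run the expensive model once per head
# video in batch, store the results, and serve with a cheap lookup. All
# names and values here are illustrative, not YouTube's systems.

def lrm_generate_candidates(context_video_sid: str) -> list[str]:
    # stand-in for an offline batch call to the large recommender model
    return ["<sid_a>", "<sid_b>", "<sid_c>"]

head_videos = ["<sid_olympics_highlight>", "<sid_new_music_video>"]

# Offline job: precompute "watching A -> watch next" for head-of-corpus videos.
offline_recs_table = {v: lrm_generate_candidates(v) for v in head_videos}

# Online serving: a simple lookup, with an empty fallback for non-head videos.
def recommend_watch_next(context_video_sid: str) -> list[str]:
    return offline_recs_table.get(context_video_sid, [])

print(recommend_watch_next("<sid_olympics_highlight>"))
```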
I want to talk a bit about the challenges for YouTube, because in some ways making an LLM-based recommendation system is harder than training an LLM. One of the big differences is the vocabulary and the size of the corpus. For Gemini, if you're training an English LLM, your vocabulary is about 100,000 words in the Oxford dictionary, and they add about a thousand words every year. At YouTube, if you imagine the library of YouTube, it has billions of videos; we have 20 billion videos on YouTube, with millions added every day. And the freshness of videos is really important, much more so than for LLMs. If you think about a new word that's added to the English dictionary, the word of 2023 was "rizz". If your Gemini model doesn't know about "rizz", it can still answer 99% of the questions people have; maybe it misses some jokes, maybe it misses some pop culture references. But in the world of YouTube, if Taylor Swift drops a new music video, you have to be able to recommend it within the next minutes or hours, otherwise a lot of users are going to be upset. So even within this large corpus, you have to very quickly understand which videos are important and start recommending them to the right users. What we do with this LRM recommender is continuously pre-train it on the order of days and hours, which is very different from classical LLM pre-training like Gemini's, which happens maybe once every three to six months. So in that way it's a much harder problem. And then the last part is scale. We have great models in Gemini; Gemini Pro is incredible, but there's no way you can serve that to billions of daily active users. So for YouTube, we had to focus on the smaller, more efficient models like Flash, and even smaller checkpoints than that, just to be able to hit the latency and scale requirements that we have.
So I want to summarize the journey we've been on at YouTube as what I think of as an LLM-plus-RecSys recipe that you can maybe adapt to your own application. There are three major steps. The first is that you want to find a way to tokenize your content: just like LLMs tokenize text, you want to distill some essence of your content into an atomic token. One way to do that, which we've done, is to build a rich representation from a bunch of features, turn it into an embedding, and then find a way to quantize it into tokens. The outcome of this step is that you're making your own domain-specific language. The second step is to adapt the LLM: make links between English and your domain language, and find training tasks that help it reason across English and these new tokens you've built. The outcome after this step, in my mind, is a bilingual LLM that can speak English and natural language but can also speak your domain-specific language. And once you have this, you can do the third step of prompting it with user information: construct personalized prompts with user demographics, user activity, and different actions, then train task-specific or surface-specific models, and you have a generative recommendation system on top of an LLM. That's a tweet-sized summary of maybe two years of work.
Maybe the last thing I want to talk about is where I see this going, and some possible future directions for LLMs and RecSys. The stage we're at right now is that LLMs are just augmenting recommendations. They bring these magical recommendation experiences and enhance the quality, but they're largely invisible to users: your YouTube feed just got better, but you don't really know whether a Gemini inference happened or not. This is why I think the LLM application to RecSys is very underhyped; users don't directly know what's happening. I think we're close to a world, and we're experimenting with this, where, if you have the bilingual LLM across English and recommendations that we talked about, users can talk to it in natural language. You're going to start to see experiences where users can steer recommendations toward their own goals, the recommender can explain why a candidate was recommended, and users can align it with goals expressed in natural language. I also think the lines between search and recommendation start to blur in this world. And then, maybe as a hint of the future, I think you're going to see recommendation and generative content come together: we're going to be recommending personalized versions of a piece of content, and in the future, instead of recommending content, we may even start creating it, and you can get to really interesting n-of-1 content that's generated for the user. I think we're a bit away from this, but it's going to come sooner than you expect with all the advances happening in AI. So yeah, thank you. I'll take any questions.
Thank you. We have time for a few questions.
Hi, great talk. One question on how you generally balance learning the semantic ID embeddings within the model versus keeping the general language capability from being damaged by training on, for example, a tokenized user history, which is a second language very different from English. Any high-level takeaway you can share?
That's a super interesting question. We've struggled with this a lot. In some of our early applications we mostly cared about recommendation quality, in which case we over-indexed on speaking the semantic ID language, and as you over-train on more and more of those examples, the model actually forgets how to speak English; maybe it's reasoning in some intermediate layers that finally end up in the semantic ID language. We're trying a bunch of things: for example, with a mixture of experts, maybe we can have a few experts that retain the text capability while other experts focus on the semantic ID capability. So it's a balance, and I think we're going to shift more toward text as we try to build these interactive experiences, where text input from the user becomes more important. Thank you.
So during this process, did you learn any good suggestions for cold-starting embeddings on these domain-specific tokens?
Yeah. One thing is that the semantic ID training process is entirely unsupervised; we're not telling it anything, it's making its own quantization of the video corpus. When you sample to see what the model is doing, we find it's learning concepts like sports versus movies and entertainment, but we didn't actually try to teach that explicitly, which I think is very interesting. The second aspect is that, because of semantic ID, we can warm-start into a semantically meaningful space, and what we find is that performance for videos uploaded in the last day or the last week gets much better, because we're better at understanding this fresh and tail content. Thank you.
Hey, quick question. When you said you extract frames as part of making the semantic ID, are you just running the video at, let's say, 3 to 30 fps, making a grid of the frames, running them through an image encoder, and inserting that?
We're just trying to sample video frames. We've tried a few different approaches, like sampling from key moments in the video. We actually have the engagement data; if you've seen in the YouTube player, it can highlight the places where people had the most engagement, so we try to sample from there. Given the scale, we can't sample a lot of video frames, so we try to select them intelligently, but we do have video frames, and over time I think we'll get better at selecting them.
Are you able to highlight important things based on small objects in a video pretty well? Let's say it's a person in the distance that is the focus of attention in the video.
Hard to say, because at the end all of this video information gets compressed into eight tokens. So it's probably learning something, but it's hard to know exactly what it picked up from that video frame. So it's unclear. Thank you.
Yeah, it was a pretty good talk.
I have a question regarding pre-training. Did you also feed in user queries and what users watched as pre-training data? If yes, did you also use semantic IDs for users in pre-training, or are semantic IDs only for the videos?
Yeah. In this case we have only tokenized videos, and we focused more on sequences of watches rather than on which watch originated from which search query. You could imagine some parallel work where you try to tokenize users and build some kind of user token that represents, say, the last 500 watches they've had. We've experimented with some things there; I think it's less far along, but it's a very interesting research direction.
So the pre-training was done on top of an existing pre-trained Gemini model, right?
Yeah, we basically take a Gemini checkpoint and then adapt it for this YouTube purpose to get this YouTube-plus-Gemini LRM checkpoint. Okay. Yeah. It would be cool to see semantic IDs of videos go into V3.
Yeah. Hey, I'm kind of curious how much improvement you see compared to non-LLM or more traditional recommendation systems, and when should we use a more traditional one versus an LLM-based recommendation system?
Yeah, I can't really share metrics; I can share everything except code and metrics. We've given you as many of the conceptual steps of what we did as we can. Maybe what I'll say is that I think it's been the biggest improvement to recommendation quality we've seen in the last few years, so I do think it's quite significant. Thank you.
Well, please join me in giving a big round of applause to all our speakers. For all of these speakers, we actually reached out to them personally; they didn't submit any talks. We reached out because we know their work is high quality, and I was really interested in sharing it with you. Some of them are still hanging around; you can try to find them and ask them as many questions as you have. Thank you everyone for joining, and I hope you have an awesome rest of the conference. Thank you.