What We Learned from Using LLMs in Pinterest — Mukuntha Narayanan, Han Wang, Pinterest

Channel: aiDotEngineer

Published at: 2025-07-16

YouTube video id: XdAWgO11zuk

Source: https://www.youtube.com/watch?v=XdAWgO11zuk

Hi everyone. Thanks for joining the talk today. We're super excited to be here and share some of the learnings we have from integrating LLMs into Pinterest search. My name is Han, and today I'll be presenting with Mukuntha. We are both machine learning engineers on the search relevance team at Pinterest.
Let's start with a brief introduction to Pinterest. Pinterest is a visual discovery platform where Pinners can come to find inspiration to create a life they love. There are three main discovery surfaces on Pinterest: the home feed, related pins, and search. In today's talk we'll be focusing on search, where users can type in their queries and find useful, inspiring content based on their information need. We'll share how we leverage LLMs to improve search relevance.
Here are some key statistics for Pinterest search. Every month we handle over six billion searches, with billions of pins to search from, covering topics from recipes, home decor, and travel to fashion and beyond. At Pinterest, search is remarkably global and multilingual: we support over 45 languages and reach Pinners in more than 100 countries. These numbers highlight the importance of search at Pinterest, and why we are investing in search relevance to improve the search experience.
This is an overview of how Pinterest search works on the back end. Similar to many recommendation systems in industry, it has query understanding, retrieval, ranking, and blending stages, and finally produces a relevant and engaging search feed. In today's talk we'll be focusing on the semantic relevance modeling that happens at the re-ranking stage, and we'll share how we use LLMs to improve relevance on search.
Okay, so here's our search relevance model, which is essentially a classification model. Given a search query and a pin, the model predicts how relevant the pin is to the search query. To measure this, we use a five-point scale ranging from most relevant to most irrelevant.
All right, now we're going to share some key learnings we have from using LLMs to improve Pinterest search relevance. Here are the four main takeaways that we'd like to go into in more detail.

Lesson one: LLMs are good at relevance prediction.
Before I present the results, let me first give a quick overview of the model architecture we are using. We concatenate the query and the pin text together and pass them into an LLM to get an embedding. This is called a cross-encoder structure, where we can better capture the interaction between the query and the pin. We then feed the embedding from the LLM into an MLP layer to produce a five-dimensional vector, which corresponds to the five relevance levels. During training, we fine-tune some open-source LLMs on our internal data to better adapt the model to Pinterest content.
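A minimal sketch of a cross-encoder like the one described here, assuming a Hugging Face-style backbone; the backbone name, pooling choice, and head sizes are illustrative stand-ins, not Pinterest's actual configuration:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CrossEncoderRelevance(nn.Module):
    """Concatenated (query, pin text) -> LLM embedding -> MLP -> 5 relevance levels."""

    def __init__(self, backbone: str = "bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 5),  # one logit per level of the 5-point scale
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # first-token pooling (illustrative)
        return self.head(pooled)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch = tokenizer(
    ["summer picnic outfit"],           # query
    ["Linen sundress with straw hat"],  # pin text representation
    padding=True, truncation=True, return_tensors="pt",
)
model = CrossEncoderRelevance()
logits = model(batch["input_ids"], batch["attention_mask"])  # shape (1, 5)
```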
Here I'd like to share some results to demonstrate the usefulness of LLMs. As a baseline, we use SearchSage, which is Pinterest's in-house content and query embedding. If you look at the table, you can see that the LLM substantially improves relevance prediction performance.

And as we use more advanced LLMs and increase the model size, the performance keeps improving. For example, the 8-billion-parameter Llama model gives a 12% improvement over the multilingual BERT-based model, and a 20% improvement over the SearchSage embedding model. So the lesson here is that LLMs are quite good at relevance prediction.
Lesson two: vision-language model generated captions and user actions can be quite useful content annotations.
To use LLMs for relevance prediction, we need to build a text representation of each pin. Here I've listed several features that we used in our model. Besides the title and description of the pin, we also include a VLM-generated synthetic image caption, to directly extract information from the image itself. Besides that, we add some user-engagement-based features, like the board titles of the user-curated boards the pin has been saved to, or the queries that led to the highest engagement with this pin on the search surface. These two user-action-based features serve as additional annotations for the content. Together, these five sources of features help build a more robust and comprehensive text representation for each pin.
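A minimal sketch of assembling those five text sources into one pin representation; the field names and separator format are hypothetical, not Pinterest's actual schema:

```python
def build_pin_text(pin: dict) -> str:
    """Join the five text sources from the talk into one document
    for the relevance model. All field names are illustrative."""
    parts = [
        ("title", pin.get("title", "")),
        ("description", pin.get("description", "")),
        # Synthetic caption produced by a vision-language model from the image
        ("caption", pin.get("vlm_caption", "")),
        # User-action annotations: boards the pin was saved to, and the
        # search queries with the highest engagement for this pin
        ("boards", " ; ".join(pin.get("board_titles", []))),
        ("queries", " ; ".join(pin.get("top_engaged_queries", []))),
    ]
    return " [SEP] ".join(f"{name}: {value}" for name, value in parts if value)

print(build_pin_text({
    "title": "Cozy reading nook",
    "vlm_caption": "A window seat with cushions and a bookshelf",
    "board_titles": ["Home inspo", "Small spaces"],
    "top_engaged_queries": ["reading corner ideas"],
}))
```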
To understand the importance of each text feature, we also did some ablation studies. We used the VLM-generated image caption as the baseline, and as you can see, by itself it already provides a very solid baseline. As we sequentially add more text features, we keep seeing performance improvements, and this indicates that enriching the text features is quite useful for relevance prediction. Notably, the last two rows of the table show the performance gain we get by adding the user-action-based features. So these features turned out to be quite useful content annotations that help the model better understand the content.
All right. Next I will hand over to Mukuntha to talk about how we use knowledge distillation to productionize this model.
>> Great. So now we have a good relevance model, which is good at predicting search relevance. But how do we actually scale this up without bankrupting Pinterest? Usually the answer is knowledge distillation into smaller models. This is the production-served relevance student model that we distilled from the teacher model using semi-supervised learning. The student model is trained to predict five-point-scale relevance scores too: it trains on the soft scores over the five levels produced by the teacher model. We produce data for this using a semi-supervised learning setup that I'll show on the next slide.
The LLM teacher model is trained on a small set of human-labeled data that we get from human annotators, who are trained on very specific segments. We fine-tune a multilingual language model that uses pretty generic features, which scale across a lot of different domains. The way we get training data for the student is by sampling from daily search logs, which cover all the searches people make on Pinterest. Since we sample daily, this includes any trending queries and the latest, freshest pins on Pinterest. It's also remarkably global, like we mentioned: only a small subset comes from the US, where most of our human-labeled data comes from. We sample from these logs, label them using the teacher, and scale the data up by roughly 100x across the different domains, languages, and countries where the LLM teacher model produces pretty good labels. We then train the student model, and this is the model that actually gets served online.
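A minimal sketch of the distillation step, assuming a soft cross-entropy between the teacher's five-way relevance distribution and the student's logits; the exact loss is an assumption, since the talk only says the student trains on the teacher's soft scores:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_probs: torch.Tensor) -> torch.Tensor:
    """Soft cross-entropy against the teacher's 5-way relevance distribution."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * log_probs).sum(dim=-1).mean()

# Toy example: the teacher labels a (query, pin) pair sampled from search
# logs, and the student learns to match its soft score distribution.
teacher_probs = torch.tensor([[0.70, 0.20, 0.05, 0.03, 0.02]])
student_logits = torch.randn(1, 5, requires_grad=True)
loss = distillation_loss(student_logits, teacher_probs)
loss.backward()
```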
Zooming into the student model: this also has language models in it, but unlike the teacher, it's not a cross-encoder. It's a bi-encoder, which essentially means we don't have cross-interactions between the pin and the query representations: the pin gets embedded separately, and the query gets embedded separately. It also uses a lot of other features, like the SearchSage embeddings that we previously mentioned for both the query and the pin; GraphSage embeddings, which Pinterest has published papers on; OmniSage; and a lot of other embedding features for the query and the pin. We also use a lot of pin-query text match statistics, like BM25, which we've seen historically perform really well for predicting search relevance.

The reason this scales well is that bi-encoder language models can scale really well when we use offline inference and caching. The pin embedding here is entirely offline: it's inferred on billions of pins, it uses predominantly the same text features that we mentioned for the teacher, which helps distill efficiently, and we only re-infer these embeddings when their inputs meaningfully change. That means each time we need new embeddings, inference only runs on a small set of new or changed pins; none of this happens online when a user issues a search query. The query embedding, on the other hand, is inferred online in real time. Search queries are pretty short and don't occupy too many tokens, which means we can keep query embedding latency to a few milliseconds. We also cache it, because search queries get repeated a lot, and we get around an 85% cache hit rate. And yeah, this scales really well to actually serve Pinterest traffic.
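A minimal sketch of that serving split: pin embeddings are looked up from an offline store, query embeddings are computed online behind a cache, and scoring here is reduced to a dot product. The cache API, the stub encoder, and the dot-product scoring are all simplifying assumptions; the real student also combines many other features (BM25, GraphSage, and so on):

```python
import functools
import numpy as np

# Offline: pin embeddings are batch-inferred and stored; online we only look them up.
PIN_EMBEDDING_STORE: dict[str, np.ndarray] = {}

def run_query_encoder(query: str) -> np.ndarray:
    """Stand-in for the real-time query tower; returns a fake embedding."""
    rng = np.random.default_rng(abs(hash(query)) % 2**32)
    return rng.standard_normal(64).astype(np.float32)

@functools.lru_cache(maxsize=1_000_000)
def embed_query(query: str) -> np.ndarray:
    # Queries repeat often, so a cache (standing in for the ~85%-hit-rate
    # production cache) absorbs most of the online encoder calls.
    return run_query_encoder(query)

def relevance_score(query: str, pin_id: str) -> float:
    q = embed_query(query)
    p = PIN_EMBEDDING_STORE[pin_id]  # offline-inferred, refreshed only on change
    return float(q @ p)  # bi-encoder: interaction happens only at scoring time

PIN_EMBEDDING_STORE["pin123"] = np.ones(64, dtype=np.float32)
print(relevance_score("summer picnic outfit", "pin123"))
```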
The online results here: the first four numbers are relevance measurements, nDCG and precision at 8, measured on specific segments we zoomed into in the US, Germany, and France. You can see that we get relevance gains internationally, even though we started with a very limited set of US data for this particular experiment. We also see that search fulfillment, which measures fulfilling engagement actions on search, goes up, including outside the US, even though our starting data was predominantly from the US. Large language models are very good at expanding across many different domains and countries, even though they weren't explicitly trained for this.
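For reference, a minimal sketch of the nDCG@8 metric mentioned here, computed over graded relevance labels like the five-point scale; the 2^rel - 1 gain is the common convention and an assumption about Pinterest's exact definition:

```python
import math

def ndcg_at_k(relevances: list[int], k: int = 8) -> float:
    """nDCG@k over graded relevance labels for one ranked result list."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Labels of the top results in ranked order, on a 0-4 relevance scale.
print(ndcg_at_k([4, 3, 4, 2, 0, 1, 3, 2]))
```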
And this is a bonus: we also found that relevance-tuned large language models produce really rich semantic representations that make very good general-purpose embeddings. This is the same production relevance student model that I shared on the previous slide. The pin embedding and the query embedding are basically free representations that we get from these models, which can be used across Pinterest for representing pins and search queries. We also use them to represent boards, using the board titles and so on. And we found that these embeddings, especially since they've been distilled from a large language model teacher and also have language models in them, are very good semantic content representations. They perform pretty well across related pins, home feed, and a lot of other surfaces, where we've seen representations improve by adding these embeddings.

So let me go over the key takeaways again.
Lesson one: we found that LLMs are really good at relevance prediction. Lesson two: we found that vision-language model captions are a good way to imbue the models with image information, and user actions are very good content annotations. Lesson three: we found that knowledge distillation is a very good way to scale and efficiently serve models online. And lesson four: relevance tuning produces pretty rich representations that embed content semantics really well.
>> Thank you. I wonder if there are any questions from the audience. Please come up to the mics.
>> How did you decide which open-source LLMs to fine-tune?
>> Yeah, that's a very good question. We did a lot of experiments trying different language models, and on an earlier slide we also shared performance for different language models. So we did a lot of experiments and picked the one that gave the best performance.
>> Yes.
>> Could you just walk us through somebody typing a search prompt? The confusion I have is that you have LLMs building some sort of matching. Is the LLM just being used to produce labels for distillation, or how did you shim it into the bi-encoder? It wasn't really clear how the two-tower offline setup works and how the LLM influenced search.
>> We use LLMs to distill into a student model, which predicts search relevance specifically and produces five-point-scale relevance scores, and it's served at the end of the search pipeline, at the re-ranking stage. Like every recommendation system, we have a lot of CGs, which are candidate generators, and we have early-stage ranking; this is one of the things that sits further down the pipeline. It predicts search relevance scores and is used right before blending to actually produce the feed. So I think it's very similar to most recommender systems.
>> Excuse me, I have a question around how you evolved into this architecture. I'm sure Pinterest had pre-LLM-era search as well. What limitations did you see in those systems that this new architecture solved?
>> So, if I'm understanding correctly, your question is about the difference between the new system and the...
>> What was the driver for adopting LLMs in your search pipeline? Did it support new features, or does it improve on existing features where you had limitations?
>> I think they definitely improve things, especially with the vision-language model captions. We were able to very effectively expand beyond limited markets for measuring and collecting relevance data. And yeah, these multilingual models are very good at producing synthetic data for different markets.
>> Hey, great talk. I was wondering why, or whether, the embedding model is inherently multimodal, because you have text, which is the query, and then you're matching against either text links or images. So how do you think about multimodality?
>> It's definitely something we're exploring, but for a lot of applications we found that visual captions are very good at capturing what's in the image, and we have some very good captioning models in house, which help us there.
Great. Thanks.
>> Great talk. Yeah, just a quick question. You mentioned that you saw improvements in other languages as well. Did you start with a common baseline model for all languages and just change the features for each language, or did you actually start with separate models for individual languages? I was curious how you actually saw the improvements manifest everywhere.
>> Yeah, we use the same model for all languages, and because we are using a multilingual LLM, we believe it can help transfer to other languages.