What We Learned from Using LLMs in Pinterest — Mukuntha Narayanan, Han Wang, Pinterest
Channel: aiDotEngineer
Published at: 2025-07-16
YouTube video id: XdAWgO11zuk
Source: https://www.youtube.com/watch?v=XdAWgO11zuk
Hi everyone, thanks for joining the talk today. We're super excited to be here and share some of the learnings we have from integrating LLMs into Pinterest search. My name is Han, and today I'll be presenting with Mukuntha. We are both machine learning engineers on the search relevance team at Pinterest.

Let me start with a brief introduction to Pinterest. Pinterest is a visual discovery platform where Pinners can come to find inspiration to create a life they love. There are three main discovery surfaces on Pinterest: the home feed, related pins, and search. In today's talk we'll be focusing on search, where users can type in their queries and find useful, inspiring content based on their information need, and we'll share how we leverage LLMs to improve search relevance.

Here are some key statistics for Pinterest search. Every month we handle over six billion searches, with billions of pins to search from, covering topics from recipes and home decor to travel, fashion, and beyond. Pinterest search is also remarkably global and multilingual: we support over 45 languages and reach Pinners in more than 100 countries. These numbers highlight the importance of search at Pinterest, and why we are investing in search relevance to improve the search experience.

This is an overview of how Pinterest search works on the back end. Similar to many recommendation systems in industry, it has query understanding, retrieval, ranking, and blending stages, and it finally produces a relevant and engaging search feed. In today's talk we'll focus on the semantic relevance modeling that happens at the re-ranking stage, and share how we use LLMs to improve search relevance.

Our search relevance model is essentially a classification model: given a search query and a pin, the model predicts how relevant the pin is to the query. To measure this, we use a five-point scale ranging from most relevant to most irrelevant.

Now we're going to share some key learnings from using LLMs to improve Pinterest search relevance. There are four main takeaways that we'd like to go into in more detail.

Lesson one: LLMs are good at relevance prediction. Before I present the results, let me give a quick overview of the model architecture we are using. We concatenate the query and the pin text together and pass them into an LLM to get an embedding. This is called a cross-encoder structure, where we can better capture the interaction between the query and the pin. We then feed the embedding from the LLM into an MLP layer to produce a five-dimensional vector, corresponding to the five relevance levels. During training, we fine-tune open-source LLMs on our internal data to better adapt the model to Pinterest content.
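As a rough illustration of this cross-encoder setup, here is a minimal PyTorch sketch. The base checkpoint, the pooling strategy, and the input template are illustrative assumptions, not Pinterest's actual implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CrossEncoderRelevance(nn.Module):
    """Cross-encoder: query and pin text are scored jointly by one LLM."""

    def __init__(self, base_model: str = "meta-llama/Meta-Llama-3-8B", num_levels: int = 5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size
        # MLP head mapping the pooled LLM embedding to the five relevance levels.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_levels),
        )

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Pool the hidden state of the last non-padding token
        # (a common choice for decoder-style LLMs; assumes right padding).
        last = attention_mask.sum(dim=1) - 1
        pooled = out.last_hidden_state[torch.arange(input_ids.size(0)), last]
        return self.head(pooled)  # logits over the five relevance levels

# The query and pin text are concatenated into a single input sequence.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
batch = tokenizer(
    ["query: mid century modern desk\npin: Walnut desk with hairpin legs ..."],
    return_tensors="pt",
)
logits = CrossEncoderRelevance()(batch["input_ids"], batch["attention_mask"])
```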
Here I'd like to share some results that demonstrate the usefulness of LLMs. As a baseline, we use SearchSage, Pinterest's in-house content and query embedding model. Looking at the table, you can see that the LLM substantially improves relevance prediction performance, and as we use more advanced LLMs and increase the model size, the performance keeps improving. For example, the 8-billion-parameter Llama model gives a 12% improvement over the multilingual BERT-based model and a 20% improvement over the SearchSage embedding model. So the lesson here is that LLMs are quite good at relevance prediction.

Lesson two: vision-language-model-generated captions and user actions can be quite useful content annotations. To use an LLM for relevance prediction, we need to build a text representation of each pin, and here are the features we use in our model. Besides the title and description of the pin, we include a VLM-generated synthetic image caption to extract information directly from the image itself. On top of that, we add user-engagement-based features: the titles of the user-curated boards the pin has been saved to, and the queries that led to the highest engagement with this pin on the search surface. These two user-action-based features serve as additional annotations for the content, and together the five feature sources build a more robust and comprehensive text representation for each pin; a sketch of how these pieces might be assembled follows below.

To understand the importance of each text feature, we also ran ablation studies. Using the VLM-generated image caption alone already provides a very solid baseline, and as we sequentially add more text features, we keep seeing performance improvements. This indicates that enriching the text features is quite useful for relevance prediction. Notably, the last two rows of the table show the gains from adding the user-action-based features: these turn out to be quite useful content annotations that help the model better understand the content.
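To make that text representation concrete, here is a minimal sketch of assembling the five feature sources into a single string. The field names and the template are hypothetical, not Pinterest's actual format.

```python
def build_pin_text(pin: dict) -> str:
    """Assemble a pin's text representation from the five feature sources."""
    parts = []
    if pin.get("title"):
        parts.append(f"title: {pin['title']}")
    if pin.get("description"):
        parts.append(f"description: {pin['description']}")
    # Synthetic caption produced by a vision-language model from the image.
    if pin.get("vlm_caption"):
        parts.append(f"image caption: {pin['vlm_caption']}")
    # Titles of the user-curated boards this pin has been saved to.
    if pin.get("board_titles"):
        parts.append("boards: " + ", ".join(pin["board_titles"]))
    # Search queries that led to the highest engagement with this pin.
    if pin.get("engaged_queries"):
        parts.append("top queries: " + ", ".join(pin["engaged_queries"]))
    return "\n".join(parts)

text = build_pin_text({
    "title": "DIY floating shelves",
    "vlm_caption": "Wooden shelves mounted on a white wall holding plants",
    "board_titles": ["Home office ideas"],
    "engaged_queries": ["floating shelf styling", "small office storage"],
})
```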
All right, next I will hand over to Mukuntha to talk about how we use knowledge distillation to productionize this model.

>> Great. So now we have a good relevance model, one that's good at predicting search relevance. But how do we actually scale this up without bankrupting Pinterest? Usually the answer is to distill this knowledge into smaller models. This is the production-served relevance student model that we distilled from the teacher model using semi-supervised learning. The student model is also trained to predict five-scale relevance scores: it trains on the five-scale soft scores produced by the teacher model, and we produce data for this using a semi-supervised learning setup.

The LLM teacher model is trained on a small set of human-labeled data that we get from human annotators who are trained on very specific segments. It's a fine-tuned multilingual language model that uses fairly generic features, which scale across a lot of different domains. The way we get training data for the student is by sampling from daily search logs, which cover all the searches people make on Pinterest. Since we sample daily, this includes any trending queries and the latest, freshest pins on Pinterest, and it's also remarkably global, like we mentioned; only a small subset of it comes from the US, where most of our human-labeled data comes from. We sample from these logs, label the samples using the teacher, and scale the data up pretty much 100x across the different domains, languages, and countries where the LLM teacher model produces good labels. We then train the student model on these teacher labels, and this is the model that actually gets served online.
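One common way to train a student on a teacher's soft scores is a soft-label cross-entropy objective. Here is a minimal sketch, assuming the teacher outputs a probability distribution over the five relevance levels; the actual training objective isn't specified in the talk.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_probs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the teacher's soft relevance distribution
    and the student's predicted distribution over the five levels."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * log_p_student).sum(dim=-1).mean()

# Example: the teacher says "mostly level 4, some level 5" for one (query, pin) pair.
teacher_probs = torch.tensor([[0.02, 0.03, 0.15, 0.60, 0.20]])
student_logits = torch.randn(1, 5, requires_grad=True)
loss = distillation_loss(student_logits, teacher_probs)
loss.backward()
```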
Zooming into the student model: it also has language models in it, but unlike the teacher it's not a cross-encoder, it's a bi-encoder, which essentially means there are no cross interactions between the pin and the query representations; the pin gets embedded separately and the query gets embedded separately. It also uses a lot of other features: the SearchSage embeddings we previously mentioned for both the query and the pin, GraphSage embeddings, which Pinterest has published papers on, OmniSearchSage, and a lot of other embedding features for the query and the pin. We also use a lot of pin-query text-match statistics like BM25, which we've seen historically perform really well for predicting search relevance.

The reason this scales well is that bi-encoder language models scale really well when we use offline inference and caching. The pin embedding here is entirely offline-inferred over billions of pins. It uses predominantly the same text features that we mentioned for the teacher, which helps distill efficiently, and we only re-infer these embeddings when their inputs meaningfully change, meaning that each time we need new embeddings, inference only runs on a small set of new or changed pins. None of this happens online when a user issues a search query. The query embedding, on the other hand, is inferred online in real time. Search queries are pretty short and don't occupy too many tokens, which means we can keep the query-embedding latency to a few milliseconds. We also cache it, because search queries get repeated a lot, and we get around an 85% cache hit rate. This scales really well to actually serve Pinterest traffic.
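Here is a minimal sketch of that serving pattern: pin embeddings looked up from a precomputed offline store, query embeddings computed online behind a cache, and a simple dot product standing in for the student's scoring head (the real student combines many more features). All interfaces here are illustrative assumptions.

```python
from functools import lru_cache
import numpy as np

class BiEncoderServer:
    """Offline pin embeddings + cached online query embeddings."""

    def __init__(self, query_encoder, pin_embedding_store):
        self.query_encoder = query_encoder    # small online model, ~ms latency
        self.pin_store = pin_embedding_store  # precomputed offline over billions of pins
        # Cache query embeddings: repeated queries (~85% hit rate in the talk)
        # skip the online model entirely.
        self._embed_query = lru_cache(maxsize=1_000_000)(self._embed_query_uncached)

    def _embed_query_uncached(self, query: str) -> tuple:
        # Return a tuple so the cached value is hashable and immutable.
        return tuple(self.query_encoder(query))

    def score(self, query: str, pin_ids: list) -> np.ndarray:
        q = np.array(self._embed_query(query))
        pins = np.stack([self.pin_store[pid] for pid in pin_ids])  # offline lookups
        return pins @ q  # relevance scores feeding the re-ranking stage
```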
For the online results here, the first four numbers are relevance measurements, nDCG and precision@8, measured on the US, Germany, and France segments that we zoomed into. We can see that we get relevance gains internationally, even though we started with a very limited set of US data for this particular experiment. We also see that search fulfillment, which measures fulfilling engagement actions on search, goes up as well, including outside the US, even though our starting data was predominantly from the US. Large language models are very good at generalizing across many different domains and countries, even though they weren't explicitly trained for this.

And this is a bonus: we also found that relevance-tuned large language models produce really rich semantic representations that make very good general-purpose embeddings. This is the same production relevance student model I shared on the previous slide. The pin embedding and the query embedding are basically free representations we get from these models, which can be used across Pinterest for representing pins and search queries. We also use them to represent boards using their titles. We found that these embeddings, especially since they've been distilled from a large language model teacher and also have language models in them, are very good semantic content representations, and they perform pretty well across related pins, home feed, and a lot of other surfaces where representations have improved by adding them.

So let me go over the key takeaways again. Lesson one: we found that LLMs are really good at relevance prediction. Lesson two: we found that vision-language-model captions are a good way to imbue models with image information, and that user actions are very good content annotations. Lesson three: we found that knowledge distillation is a very good way to scale and efficiently serve these models online. And lesson four: relevance tuning produces rich representations that capture content semantics really well.

>> Thank you. I wonder if there are any questions from the audience. Please come up to the mics.

>> How did you decide which open-source LLMs to fine-tune?

>> Yeah, that's a very good question. We ran a lot of experiments trying different language models, and in the previous slides we also shared performance numbers for different language models. So we experimented extensively and picked the one that gave the best performance.

>> Could you walk us through somebody typing a search query? The confusion I have is that you have LLMs building some sort of matching. Is the LLM just being used to produce labels for distillation, or how did you fit it into the bi-encoder? It wasn't really clear how the two-tower offline setup and the LLM influence each other.

>> We use LLMs to distill into a student model, which predicts search relevance specifically and produces five-scale relevance scores, and it's served near the end of the search pipeline, at the re-ranking stage. Like every recommendation system, we have a lot of candidate generators and early-stage ranking; this model sits further down the pipeline, predicts search relevance scores, and is used right before blending to actually produce a feed. So I think it's very similar to most recommender systems.

>> I have a question about how you evolved into this architecture. I'm sure Pinterest had pre-LLM-era search as well; what limitations did you see in those systems that this new architecture solved?

>> So, if I understand correctly, your question is about the difference between the new system and the...

>> What was the driver to adopt LLMs in your search pipeline? Did they support new features, or do they improve on existing features where you had limitations?

>> I think they definitely improve things, especially with the visual language model captions. We were able to expand very effectively beyond limited markets for measuring and collecting relevance data, and these multilingual models are very good at producing synthetic data for different markets.

>> Hey, great talk. I was wondering whether the embedding model is inherently multimodal, because you have the query text and you're matching against either text or images. How do you think about multimodality?

>> It's definitely something we're exploring, but for a lot of applications we found that visual captions are very good at capturing what's in the image, and we have some very good captioning models in house that help us with that.

>> Great, thanks.

>> Great talk. Just a quick question: you mentioned that you saw improvements in other languages as well. Did you start with a common baseline model for all languages, did you just change the features for each language, or did you start with separate models for individual languages? I was curious how the improvements manifested everywhere.

>> We use the same model for all languages, and because we're using a multilingual LLM, we believe it can transfer to other languages.