How to look at your data — Jeff Huber (Chroma) + Jason Liu (567)
Channel: aiDotEngineer
Published at: 2025-08-06
YouTube video id: jryZvCuA0Uc
Source: https://www.youtube.com/watch?v=jryZvCuA0Uc
All right, welcome everybody. I'm Jeff Huber, the co-founder and CEO of Chroma, and I'm joined by Jason. We're going to do a two-parter here and really pack in the content. It's the last session of the day, so we thought we'd give you a lot. Everything in this presentation is open source and the code is available, so we're not selling you any tools, and there will be QR codes throughout to grab the code.

So let's talk about how to look at your data. All of you are AI practitioners. You're all building stuff, and these questions probably resonate quite deeply with you: What chunking strategy should I use? Is my embedding model the best embedding model for my data? And more. Our contention is that you can really only manage what you measure. I think Peter Drucker originally coined that, so I can't take too much credit, but it certainly is still true today.

We have a very simple hypothesis here, which is: you should look at your data. The goal is to say "look at your data" at least 15 times this presentation. So that's two. Great measurement is ultimately what makes systematic improvement easy, and it really can be easy; it doesn't have to be super complicated. I'm going to talk about part one, how to look at your inputs, and then Jason's going to talk about part two, how to look at your outputs. So let's get into it.

All right, looking at your inputs. How do you know whether or not your retrieval system is good, and how do you know how to make it better? There are a few options. One is to guess and cross your fingers; that's certainly an option. Another is to use an LLM as a judge: you use one of these frameworks that checks factuality and other metrics like that, and it costs $600 and takes three hours to run. If that is your preference, you certainly can do that.
You can also use public benchmarks; you can look at things like MTEB to figure out which embedding model is best on English. That's another option. But our contention is that you should use fast evals, and I'll tell you exactly what those are.

So, what is a fast eval? A fast eval is simply a set of query and document pairs: if this query is put in, this document should come out. A set of those is called a golden dataset, and the way you measure your system is you put all the queries in and see whether those documents come out. You can retrieve 5, or 10, or 20; it depends on your application. It's very fast and very inexpensive to run, and that's important because it enables you to run a lot of experiments quickly and cheaply. I'm sure you all know that your energy for experimentation goes down significantly when you have to click go and then come back six hours later. All of these metrics should run extremely quickly, for pennies.

Maybe you have your documents, your chunks, your stuff in your retrieval system, but you don't have queries yet. That's okay. We found that you can actually use an LLM to write questions, and write good questions. Doing the naive thing, "Hey LLM, write me a question for this document," is not a great strategy. However, we found that you can teach LLMs how to write queries. These slides are getting a little bit cropped, I'm not sure why, but we'll make the most of it. To give you an example: this is from one of the MTEB golden datasets, the embedding-model benchmark datasets. It also points to the fact that many of these benchmark datasets are overly clean, right? "What is a pergola used for in a garden?"
And the document it's paired with begins, "A pergola in a garden…" Real-world data is never this clean. So in this report, and the link is a few slides ahead, we did a huge deep dive into how to generate queries that are representative of real-world queries. It's too easy to trick yourself into thinking your system is working really well with synthetic queries that are overly specific to your data. What these graphs show is that we're actually able to semantically align the specificity of synthetically generated queries with the real queries that users might ask of your system.

What this enables is: if a cool new embedding model comes out, and it's doing really well on the MTEB score and everybody on Twitter is talking about it, instead of going into your code, changing it, guessing and checking, and hoping it's going to work, you can now empirically say whether it's better or not for your data. The example here is quite contrived and simple, but you can look at the actual success rate: okay, great, for the queries that I care about, do I get back more documents than I did before? If so, maybe you should consider changing. Of course, you'd need to re-embed your data; the new service could be more expensive, it could be slower, its API could be flaky. There are a lot of considerations in making good engineering decisions. But the north star is clearly the success rate: how many documents do I get back for my queries? It's super fast and super useful, and it makes improving your system much more systematic and deterministic.

All right. So we actually worked with Weights & Biases, looking at their chatbot, to ground a lot of this work.
What you see here, for the Weights & Biases chatbot, is four different embedding models and the recall@10 across those four models. Blue is ground truth: actual queries that were logged in Weave and sent over. And then there's generated: the queries that were synthetically generated. We want to see a few things: that those are pretty close, and that they stay in the same order of accuracy; we don't want to see any big flips between ground truth and generated. And we were really happy to see that that's what we found.

Now, there are a few fun findings here, and of course they're going to get cropped out, but that's okay. Number one, the original embedding model used for this application was actually text-embedding-3-small, and it performed the worst of all the embedding models we evaluated, just in this case, so it probably wasn't the best choice. Number two, if you look at MTEB, Jina embeddings v3 does very well in English, way better than anything else, but for this application it didn't actually perform that well. It was actually the Voyage 3 large model that performed best, and that was determined empirically by actually running this fast eval and looking at your data. That's number three.

All right. If you'd like access to the full report, you can scan this QR code; it's at research.trychroma.com. There's also an adjoining video, screenshotted here, which goes into much more detail. There are full notebooks with all the code; it's all open source, and you can run it on your own data. Hopefully this is helpful for thinking about how you can systematically and deterministically improve your retrieval systems. And with that, I'll hand it over to Jason.
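The fast-eval loop described in this section, a golden set of query–document pairs scored by whether the expected document comes back in the top k, can be sketched in a few lines. The keyword-overlap retriever and tiny corpus below are toy stand-ins for a real embedding-based vector store; only `recall_at_k` is the actual metric being described:

```python
# Minimal sketch of a "fast eval": score any retrieval function against a
# golden set of (query, expected_doc_id) pairs using recall@k.

def recall_at_k(golden_set, retrieve, k=10):
    """Fraction of queries whose expected document appears in the top-k results."""
    hits = 0
    for query, expected_doc_id in golden_set:
        results = retrieve(query, k)  # list of doc ids, best first
        if expected_doc_id in results[:k]:
            hits += 1
    return hits / len(golden_set)

# Toy stand-in retriever over a tiny corpus (keyword overlap, not embeddings).
CORPUS = {
    "doc1": "pergola garden shade structure",
    "doc2": "embedding models benchmark recall",
    "doc3": "chunking strategies for retrieval",
}

def retrieve(query, k):
    words = set(query.lower().split())
    scored = sorted(CORPUS, key=lambda d: -len(words & set(CORPUS[d].split())))
    return scored[:k]

golden = [
    ("what is a pergola used for in a garden", "doc1"),
    ("which chunking strategy should I use", "doc3"),
]
print(recall_at_k(golden, retrieve, k=2))  # → 1.0
```

Because the whole loop is just dictionary lookups per query, swapping in a different embedding model means re-embedding the corpus and re-running this same function; the comparison stays cheap and fast.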
Thank you. So if you're working with some kind of system, there are always the inputs to look at, and we just talked about things like retrieval and how the embeddings work. But ultimately we also have to look at the outputs, right? The outputs of many systems might be the outputs of a conversation that has happened, or an agent execution that has happened. The idea is that if we can look at these outputs, maybe we can do some kind of analysis that figures out what kind of product we should build, what portfolio of tools we should develop for our agents, and so forth.

If you have a bunch of queries that users are putting in, or even a couple hundred conversations, it's pretty good to just look at everything manually, right? Think very carefully about each interaction, and only use these models when they make sense. Often when I say this, people respond, "What if we just put everything into o3?" Generally, only use the language models if you think you're not smarter than the language model. Then, when you have a lot of users and an actually good product, you might get thousands of queries or tens of thousands of conversations, and now you run into an issue: there's too much volume to manually review, there's too much detail in the conversations, and you're not really going to be the expert that can figure out what is useful and what is good. And with these long conversations, with tool calls and chains and reasoning steps, the outputs are now really hard to scan and really hard to understand.

But there's still a lot of value in these conversations, right? If you've used a chatbot, whether it's in Cursor or a Claude Code-style system, oftentimes you do say things like, "Try again, this is not really what I meant," or "Be less lazy next time."
It turns out a lot of the feedback you give is in those conversations, right? We could build things like feedback widgets or thumbs up or thumbs down, but a lot of the information already exists in those conversations, and the frustration and retry patterns can be extracted from them. The data really already exists in the conversation.

Think of a simple example from a different industry, the analogy of marketing. Maybe we run our evals and the number is 0.5; I don't really know what that means. Factuality is 0.6; I don't know if that's good or bad. The average is 0.5; who knows? But imagine we run a marketing campaign and our ad metric, our KPI, is 0.5. There's not much we can do. But if we realize that 80% of our users are under 35 and 20% are over, and that the younger audience performs well while the older audience performs poorly, what we've done is draw a line in the sand on who our users are. Now we can make a decision: do we want to double down on marketing to the younger audience, or do we want to figure out why we aren't successfully marketing to the older population? Do I find more podcasts to market on? Should I run a Super Bowl ad? Just by drawing a line in the sand and deciding which segment to target, we can now make decisions on what to improve, whereas "just make the ads better" is a very generic sentiment.

One of the best ways of doing that is effectively just extracting some kind of data out of these conversations in a structured way and doing very traditional data analysis. So here we have some kind of object that says: I want to extract a summary of what has happened, maybe the tools that were used, maybe the errors we've noticed in the conversations, maybe some metric for satisfaction, maybe some metric for frustration.
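The extraction object described here might look something like the following sketch. The field names are illustrative, not any library's actual schema, and the `extract()` stub stands in for a real LLM structured-extraction call (which is what you would use in practice) so the example runs offline:

```python
# Hedged sketch of a per-conversation extraction schema. In production you'd
# have an LLM fill this out (e.g. via structured outputs); extract() below is
# a trivial keyword stub standing in for that call.
from dataclasses import dataclass, field

@dataclass
class ConversationSummary:
    summary: str                          # one-line description of what happened
    tools_used: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    satisfaction: float = 0.0             # 0..1, inferred from the transcript
    frustration: float = 0.0              # retries, "try again", complaints

def extract(conversation: str) -> ConversationSummary:
    """Stand-in for an LLM structured-extraction call."""
    text = conversation.lower()
    frustrated = "try again" in text
    return ConversationSummary(
        summary=conversation[:60],
        tools_used=["sql"] if "query" in text else [],
        errors=["auth"] if "login failed" in text else [],
        satisfaction=0.2 if frustrated else 0.8,
        frustration=1.0 if frustrated else 0.0,
    )

s = extract("Login failed twice, user said try again and gave up.")
print(s.errors, s.frustration)  # → ['auth'] 1.0
```

The point is the shape of the record, not the stub logic: once every conversation is reduced to one of these, ordinary data analysis applies.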
The idea is that we can build this portfolio of metadata that we can extract, and then we can embed it, find clusters, identify segments, and start testing our hypotheses. So what we might want to do is build this extraction, put conversations into an LLM, get this data back out, and just start doing very traditional data analysis, no different from any product engineer or data scientist. This tends to work quite well. If you look at what Anthropic's Clio did, they basically found that code use was something like 40x more represented among Claude users than its share of GDP value creation, and they went, okay, maybe code is a good avenue. Obviously it's not quite that simple, but the idea is that by understanding how your users use your product, you can figure out where to invest your time.

This is why we built a library called Kura, which lets us summarize conversations, cluster them, build hierarchies of those clusters, and ultimately compare our evals across different KPIs. Again: if factuality is 0.6, that's really hard to act on. But if it turns out that factuality is really low for queries that require time filters, and really high for queries that revolve around contract search, now we know something's happening in one area and something else in another, and we can make a decision on what to do and how to invest our time.

The pipeline is pretty simple. We have models to do summarization, models to do clustering, and models that do this aggregation step. So you might just load in some conversations; here we've made a fake dataset, fake conversations from Gemini. The idea is that first we extract some kind of summary model with the topics discussed, frustrations, errors, etc. We can then cluster them to find cohesive groups.
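The "compare KPIs across clusters" step can be sketched as a plain group-by. Assume each conversation has already been summarized into a cluster label plus a per-conversation eval score (the cluster names and numbers below are hypothetical); the point is that the flat average hides exactly what the per-cluster breakdown reveals:

```python
# Sketch: a single overall eval number vs. the same number broken out per
# cluster. Records would come from the extraction/clustering pipeline;
# these are made-up values for illustration.
from collections import defaultdict

records = [
    {"cluster": "time_filters",    "factuality": 0.31},
    {"cluster": "time_filters",    "factuality": 0.28},
    {"cluster": "contract_search", "factuality": 0.92},
    {"cluster": "contract_search", "factuality": 0.88},
]

overall = sum(r["factuality"] for r in records) / len(records)
print(f"overall factuality: {overall:.2f}")  # ~0.60, hard to act on

by_cluster = defaultdict(list)
for r in records:
    by_cluster[r["cluster"]].append(r["factuality"])

for cluster, scores in by_cluster.items():
    print(cluster, round(sum(scores) / len(scores), 2))
# time_filters ~0.30 (fix this), contract_search ~0.90 (leave it alone)
```

The overall 0.60 suggests mediocrity everywhere; the breakdown says one segment is nearly perfect and one is badly broken, which is a decision you can act on.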
And here we find that some of the conversations are around data visualization, SEO content requests, and authentication errors, and now we get some idea of how people are using the software. As we group them together, we realize, okay, there are really some themes around technical support. Does the agent have tools that can do this? Do we have tools to debug these database issues? Tools to debug authentication? Tools to do data visualization? That's going to be very useful. At the end of this pipeline, we're presented with these printouts of clusters: we know what the tools are and how the chatbot is being used, at a higher level, say SEO content and data analysis, and at a lower level, maybe blog posts and marketing. Just by looking at this, we might form some hypotheses about what kinds of tools we should build, how we should develop even our marketing, or how we might change our prompts. We can do a ton of these kinds of things.

This is because the ultimate goal is to understand what to do next, right? You do the segmentation to figure out what new hypotheses you can have, and then you make targeted investments within those segments. If it turns out that 80% of the conversations I'm having with the chatbot are around SEO optimization, maybe I should have some integrations that do that, maybe I should reevaluate the prompts, or build other workflows to make that use case more powerful. Again, the goal really is to build a portfolio of tools, of metadata filters, of data sources that allows the agent to do its job. And oftentimes the solution isn't really making the AI better; it's really just providing the right infrastructure.
A lot of the time, if you find that a lot of queries use time filters and you just didn't add a time filter, adding one can probably improve your eval by quite a bit. We had situations where we wanted to figure out whether contracts were signed, and just by extracting one more field in the OCR step, we could do these large-scale filters and figure out what data exists. Generally, the practice of improving your applications is pretty straightforward: we all know to define evals, but not everyone I work with has really been thinking about finding clusters and comparing KPIs across clusters. Once you do, you can start making decisions on what to build, what to fix, and what to ignore.

Think of a two-by-two quadrant: low usage versus high usage, and high-performing evals versus low-performing evals. If a large portion of your population is using tools that you are bad at, that is clearly the thing you have to fix. If a large proportion of people are using tools that you're good at, that's totally fine. If a small proportion of people are doing something that you're good at, maybe there are some product changes you need to make; maybe it's about educating the user, maybe it's adding some pre-filled or suggested questions to show them these capabilities exist. And if there are things that nobody does, but when we do them they're bad, maybe that's a one-line change in the prompt that says, "Sorry, I can't help you, go talk to your manager." These are decisions we can now make just by looking at what proportion of our conversations fall into a certain category, and whether or not we do well in that category. And as you understand this, you can go build classifiers to identify these specific intents; maybe you build routers, maybe you build more tools.
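The two-by-two decision rule just described can be written down directly. The thresholds and cluster names below are illustrative assumptions, not prescriptions; the structure (usage share crossed with eval score) is what the talk describes:

```python
# Sketch of the usage x quality quadrant: each cluster gets a usage share
# (fraction of all conversations) and an eval score, and falls into one of
# four actions. Cutoffs of 20% usage / 0.7 score are made-up defaults.
def decide(usage_share, eval_score, usage_cut=0.2, score_cut=0.7):
    if usage_share >= usage_cut and eval_score < score_cut:
        return "fix"      # many users, poor quality: top priority
    if usage_share >= usage_cut:
        return "fine"     # many users, good quality: leave it alone
    if eval_score >= score_cut:
        return "educate"  # few users, good quality: surface the capability
    return "decline"      # few users, poor quality: say so in the prompt

# Hypothetical clusters: (usage share, eval score)
clusters = {
    "time_filters":    (0.40, 0.50),
    "contract_search": (0.30, 0.90),
    "data_viz":        (0.10, 0.80),
    "tax_advice":      (0.02, 0.30),
}
for name, (share, score) in clusters.items():
    print(name, decide(share, score))
```

Each printed action corresponds to one quadrant of the two-by-two: fix what is popular and bad, keep what is popular and good, promote what is niche and good, and gracefully decline what is niche and bad.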
And then you can start doing things like monitoring, and have the ability to do these group-bys, right? Now you have different categories of query types over time, and you can see what the performance looks like. A 0.5 on its own doesn't really mean anything, but whether a metric changes over time within a certain category can tell you a lot about how your product is being used. By doing this, we figured out that some customers, when we onboard them, use our application very differently from our historical customers, and we can then make other investments in how to improve these systems. Ultimately, the goal is to create a data-driven way of defining the product roadmap. Oftentimes it is research that leads to better products, rather than products justifying some research that we don't know is possible. And again, the real marker of progress is your ability to form high-quality hypotheses and your ability to test a lot of them. If you segment, you can make clearer hypotheses. If you use faster evals, you can run more experiments. And by having this continuous feedback through monitoring, that is how you actually build a product, regardless of it being an AI product; this is just how you build a product.

So, the takeaways. When measuring the inputs, don't rely on public benchmarks; build evals on your own data, and focus first on retrieval, because that is the one thing LLM improvement won't fix. If the retrieval is bad, the LLM will still get better over time, but you need to earn the right to tinker with the LLM by having good retrieval. And if you don't have any customers or users yet, you can start thinking about synthetic data as a way of augmenting that. Once you have users, look at your data on the output side as well: extract structure from these conversations.
Understand how many conversations are happening, how often tools are being misused, what the errors are, and how people are frustrated. By doing that, you can do this population-level data analysis, find similar clusters, and build an impact-weighted understanding of what the tools are. It's one thing to say, "Maybe we should build more tools for data visualization." It's another thing to say, "Hey boss, 40% of our conversations are around data visualization, and the code-execution engine can't really do that well. Maybe we should build two more tools for plotting and see if that's worth it." You can justify that because you know 40% of the population is doing data visualization and we only succeed maybe 10% of the time. That's impact-weighted. And ultimately, as you compare these KPIs across clusters, you can just make better decisions across your entire product development process.

So again: start small, look for structure, understand that structure, and start comparing your KPIs. Once you can do that, you can make decisions on what to fix, what to build, and what to ignore. If you want more resources, check out these QR codes. The first one is Chroma's, to learn a little more about their research. The second is a set of notebooks we've built that go through this process: we load the Weights & Biases conversations, do the cluster analysis, and show how to use that to make better product decisions. There are three Jupyter notebooks in that repo; check them out on your own time. And thank you for listening.

We do have time for one quick question, and of course we'll be around outside as well. If anybody wants to grab the mic, there and over there.

Audience: What's the spicy take today?

It's not "KPI," by the way. That's not the spicy take.
I think more agent businesses should try to price their services on the work done rather than the tokens used. Yeah. Price on success, price on value. Very unrelated to this talk, but there it is.