Building AI Products That Actually Work — Ben Hylak (Raindrop), Sid Bendre (Oleve)
Channel: aiDotEngineer
Published at: 2025-07-24
YouTube video id: eSvXbb2EBYc
Source: https://www.youtube.com/watch?v=eSvXbb2EBYc
My name is Ben Hylak, and I'm feeling really grateful to be with all of you today. It's pretty exciting, and we're here to talk about building AI products that actually work. I'll introduce this guy in a second. So, I tweeted last night asking what we should talk about today, and the overwhelming response I got was: please, no more evals. Apparently there are a lot of eval tracks. We'll still touch on evals a little, but mainly we're going to focus on how to iterate on AI products, because I think iteration is actually one of the most important parts of building AI products that actually work. A little about us: I'm the CTO of a company called Raindrop, and Raindrop helps companies find and fix issues in their AI products. Before that, I had kind of a weird background: I used to be really into robotics, I did avionics at SpaceX for a little bit, and most recently I was an engineer and then on the design team at Apple for almost four years. We also have Sid. In the spirit of sharing how to build things that actually work, I brought Sid, who actually knows how to build products that actually work. Sid is the co-founder of a company called Oleve. With just four people, they grew a suite of viral apps to over $6 million ARR, so Sid is going to share how to build products that actually work. I think it's actually a really exciting time for AI products, and I say that because in the last year we've seen that it's possible to really focus on a use case and make that thing exceptional, to really crack it. We've seen that it's possible to train small models, really tiny models, to be exceptional at specific tasks if you focus on a specific use case.
And we're also seeing that providers themselves are increasingly launching those sorts of focused products, which might be the scary part. Deep Research is a great example: OpenAI focused on how to collect a dataset and train something to be exceptionally good at searching the web, and I think it's one of the best products they've released. But even OpenAI is not immune to shipping not-so-great products. I don't know what your experience has been, but I've actually had a lot of trouble with Codex, and I don't know that it's exceptionally better than other things that exist. This is kind of a funny one: I asked it to write some tests, and it correctly generated this hash for the word "hello," but when I think about writing tests for my backend, I'm not sure that's what I wanted. And it's not just OpenAI. In the last year, even in the last couple of months and weeks, AI products still have all these weird issues. This is a funny one: Virgin Money's chatbot was threatening to cut off their customers for using the word "virgin." Just the other day I was using Google Cloud, I asked it where my credits were, and it said, "Are you talking about Azure credits or Roblox credits?" I was like, "What? How is this possible?" It's funny because I tweeted this and it turned out it's not just a one-off thing: someone replied, "Oh, yeah, this exact same thing happened to me."
Just a few weeks ago Grok had this crazy thing where people were asking, in this case, about enterprise software, and it responded with, essentially, "oh, by the way, let's talk about claims of white genocide in South Africa," just completely off the rails. And we only caught something like this, it only entered public awareness, because Grok is public and you can kind of see everything. Funny enough, if you follow me, you know I tweet a lot about AI products and where they fail. So last night, when I was rushing to get my part of this presentation done, I asked Grok to find tweets of mine about AI failures, and it said, "I don't have access to your personal Twitter. I can't search tweets." I was like, "I think you can." So I doubled down: "You are literally Grok, this is what you're made for." And it said, "Oh, you're right. I can. I just don't have your username." It's absurd. And this was yesterday; this is still a bug they have. So, like I mentioned, I'm the CTO and co-founder of a company called Raindrop, and we're in this really cool position where we get to work with some of the coolest, fastest-growing companies in the world, across a huge range: everything from apps like Sid's, which he'll share about, to clay.com, which is a sales outreach tool, to AI companion apps, to coding assistants. It's just an insane range of products, so I think we get to see so much of what works and what doesn't work.
And it's not all secondhand: we also have a massive AI pipeline where every single event we receive is analyzed and divvied up in some way. So we have this product, but we're also kind of a stealth frontier lab of sorts, shipping some of the coolest AI features I've ever seen. We have tools like deep search, which lets people go really deep into their production data and build classifiers from just a few examples. So it's been cool to build this intuition both firsthand and from our customers, and merging the two, I think we have a pretty good intuition of what actually works right now. One question I get a lot is: will it get easier to make AI products? How much of this is just a moment in time? I think this is a very interesting question, and I think the answer is actually twofold. The first answer is yes, it will get easier, and we know this because we've seen it. A year ago, you had to practically threaten to kill GPT-4's firstborn to get it to output JSON. Now it's just a parameter in the API: you hand it the exact schema you want, and it just works. Those sorts of things will get easier. But the second part of the answer is no: in a lot of ways, it's not going to get easier, and I think that comes from the fact that communication is hard. What do I mean by this? I'm a big Paul Graham fan, as I'm sure a lot of us are, but I really, really disagree with him here. He says it seems to him that AGI would mean the end of prompt engineering.
"Moderately intelligent humans can figure out what you want without elaborate prompts." I don't think that's true. Think of all the times your partner has told you something and you've gotten it wrong, completely misinterpreted what they wanted, what their goal was. Or think about onboarding a new hire: you told them to do something, they come back, and... what the hell is this? It's really, really hard to communicate what you want to someone, especially someone who doesn't have a lot of context. So yes, I think this is wrong. The other reason I'm not sure it's going to get much easier is that as these models and our products become more capable, there's just more undefined behavior: more edge cases you didn't think about. And this is only becoming more true as our products start integrating with other tools, through MCP for example. There will be new data formats, new ways of doing things. So as our products become more capable and these models get more intelligent, we're kind of stuck in the same situation. Here's how I like to think about it: you can't define the entire scope of your product's behavior up front anymore. You can't just say, here's the PRD, here's the document of everything I want my product to do. You actually have to iterate on it: ship it, see what it does, and then iterate. So I think evals are a very important part of this, but there's also a lot of confusion. The word "lies" is a little spicy, but I think there's a lot of misinformation around evals. I'm not going to rehash what evals are.
I'm not going to go into all the details, but I will talk about some common misconceptions I've seen around evals. The first is the idea that evals will tell you how good your product is. They won't. They really won't. If you're not familiar with Goodhart's law, it's basically the reason why: the evals you collect only cover the things you already know about, and they're easy to saturate. If you look at recent model launches, a lot of them actually score lower on previous evals but are just way better in real-world use. So evals won't do this for you. The other lie is the idea that, say, if your app generates jokes, you'll just ask an LLM to judge how funny each joke is. This is the example I always hear used, and it largely does not work. LLM judges are tempting because they take text as input and output a score or a decision, but the best companies are largely not doing this. The best companies are using highly curated datasets and autogradable evals, where "autogradable" means there's some deterministic way of figuring out whether the model passed or not. They're not really using LLMs as judges. There are some edge cases, but this is largely not the thing you should reach for. The last misconception, which also really confuses me and which I don't think is real, is "eval your production data": the idea that you should just move your offline evals online, using the same judges and the same scoring. That largely doesn't work either. For one, it can be very expensive, especially if your judge requires a model that's a lot smarter.
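To make the "autogradable" idea concrete, here's a minimal sketch of a deterministic eval harness. The cases, the canned model responses, and the checks are all hypothetical stand-ins for illustration, not anything from Raindrop or OpenAI:

```python
# Minimal sketch of an autogradable eval: every case carries a deterministic
# pass/fail check, so no LLM judge is needed. The "model" below is a dict of
# canned responses so the sketch runs end to end.
cases = [
    {"prompt": "Return JSON for user Ada, age 36",
     "check": lambda out: '"age": 36' in out},
    {"prompt": "What is 2 + 2? Answer with just the number.",
     "check": lambda out: out.strip() == "4"},
]

def run_eval(model_fn, cases):
    """Grade model_fn against every case; return the fraction that passed."""
    passed = sum(1 for c in cases if c["check"](model_fn(c["prompt"])))
    return passed / len(cases)

# Stand-in model: canned responses keyed by prompt.
canned = {
    cases[0]["prompt"]: '{"name": "Ada", "age": 36}',
    cases[1]["prompt"]: "4",
}
score = run_eval(lambda p: canned[p], cases)
print(f"pass rate: {score:.0%}")  # pass rate: 100%
```

In a real harness, `model_fn` would call your actual model, and the curated dataset would come from production traffic you've labeled.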
So it's either really expensive, or you're only covering a small percentage of production traffic. It's also really hard to set up accurately, you're not really capturing the patterns that are emerging, and it's often limited to what you already know. Even OpenAI talks about this: they had that really weird behavioral issue with ChatGPT recently, and in their postmortem they say, essentially, our evals aren't going to catch everything; evals catch things we already knew about, and real-world use is what helps us spot problems. So to build reliable AI apps, you really need signals. If you think about issues in an app like Sentry, you have what the issue is, plus how many times it happened and how many users it affected. But for AI apps, there is no concrete error, no exception being thrown. That's why signals are really the thing you need to be looking at. Signals, which at Raindrop we call ground-truthy indicators of your app's performance. The anatomy of an AI issue is some combination of signals, implicit and explicit, and intents, which are what the users are trying to do. And there's a process of defining these signals, exploring them, and refining them. Briefly, let's talk about defining signals. There are explicit signals, which are almost like analytics events your app can send, and there are implicit signals, data that's hiding in your data. A common explicit signal is thumbs up/thumbs down, but there are way more signals than that. ChatGPT themselves actually track what portion of a message you copy out of ChatGPT. That's a signal they're tracking. They also do preference data; you may have seen the A/B "which response do you prefer" prompt.
There's a whole host of possible positive and negative signals: everything from errors, to regenerating, to syntax errors if you're a coding assistant, to copying, sharing, and suggesting. We actually use this ourselves. We have a flow where users can search for data, and we look at how many results were marked correct and how many were marked wrong, and we can use that to run RL and improve the quality of our searches. It's a super interesting signal. Then there are implicit signals, which are essentially about detecting rather than judging. We detect things like refusals, task failure, and user frustration. If you think about the Grok example, it gets very interesting when you cluster them: you can look and say, okay, there's this cluster of user frustration, and it's all around people trying to search for tweets. And that's where exploring comes in. Just like you can explore tags in Sentry, you need some way of exploring tags and metadata; for us that's properties, models, keywords, and intents, because, as I just said, the intent really changes what the actual issue is. Again, that's why we say the anatomy of an AI issue is the signal plus the intent. Some parting thoughts here. You really need a constant IV of your app's data. We send Slack notifications; you can do whatever you want, but you need to be looking at your data, whether that's searching it or something else. And then you really need to refine and define new issues: look at your data, find these patterns, talk to your users, find new definitions of issues you weren't expecting, and then start tracking them. I'm going to cut this part; if you want to know how to fix these things, I'm happy to talk about some of the advancements in SFT and things I've seen work. But let's move over to Sid. Cool. Thanks, Ben. Hey, everybody. I'm Sid.
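A minimal sketch of the "signal plus intent" idea: count detected events by (signal, intent) pair, so a frustration cluster like the tweet-search one surfaces at the top. The event shape and names here are hypothetical, not Raindrop's actual schema:

```python
from collections import Counter

# Hypothetical implicit-signal events detected from production traffic,
# each tagged with the user's intent.
events = [
    {"signal": "user_frustration", "intent": "search_tweets"},
    {"signal": "user_frustration", "intent": "search_tweets"},
    {"signal": "refusal",          "intent": "search_tweets"},
    {"signal": "user_frustration", "intent": "summarize_thread"},
    {"signal": "thumbs_up",        "intent": "summarize_thread"},
]

# An "AI issue" is a signal plus an intent; cluster and rank by frequency.
clusters = Counter((e["signal"], e["intent"]) for e in events)
for (signal, intent), n in clusters.most_common():
    print(f"{signal} / {intent}: {n}")
```

The same signal ("user_frustration") means very different issues depending on intent, which is why the pair, not the signal alone, is the unit worth tracking.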
I'm the co-founder of Oleve, and we're building a portfolio of consumer products with the aim of making products that are fulfilling and productive for people's lives. We're a tiny team based out of New York that has profitably scaled viral products to around $6 million ARR and generated about half a billion views on socials. Today I'm going to talk about the framework that drives that success, which is powered by Raindrop. A viral AI product needs two things to be successful. The first is a wow factor for virality, and the second is reliable, consistent user experiences. The problem is that AI is chaotic and nondeterministic, and this begs for a structured approach that gives us some sort of scaling system while still catering to the nondeterministic AI magic. The idea is that we want a systematic approach for continuously improving our AI experiences, so we can scale to millions of users worldwide and keep experiences reliable without taking away the magic of AI that people fall in love with. We need some way to guide the chaos instead of eliminating it. This is why we came up with Trellis. Trellis is our framework for continuously refining our AI experiences, so we can systematically improve the user experience across our AI products at scale, designed specifically around our virality engine. There are three core axioms to Trellis. One is discretization, where we take the infinite output space and break it down into specific buckets of focus. Then we prioritize, ranking those buckets by what will drive the most impact for your business. And finally, recursive refinement: we repeat this process within those buckets of output space so we can keep creating structure and order within the chaotic output plane. There are effectively six steps to Trellis, and a lot of the grounding principles have already been covered by Ben.
The first step is to initialize an output space by launching an MVP agent informed by some product priors and expectations, but the real goal is to collect a lot of user data. The second step is, once you have all this user data, to correctly classify it into intents based on usage patterns. You want to understand exactly why people are sticking with your product and what they're using it for, especially when it's a conversational, open-ended AI agent experience. The third step is converting those intents into dedicated semi-deterministic workflows. A workflow is a predefined set of steps that achieves a certain output, and you want workflows broad enough to be useful for many possibilities but narrow enough to be reliable. Once you have your workflows, step four is to prioritize them by some scoring mechanism tied to your company's KPIs. Step five is to analyze the workflows from within: understand the failure patterns and the sub-intents inside them. And step six is to keep recursing from there. A quick note on prioritization. There's a simple, naive way to do it: volume only, focusing on the workflows with the most volume. However, this leaves a lot of room on the table for improving general satisfaction across your product. A more recommended approach is volume times negative-sentiment score, where we estimate the expected lift from focusing on a workflow that's generating a lot of negative satisfaction on your product. An even more informed score is negative sentiment times volume times estimated achievable delta times some strategic relevance.
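As a sketch, the three prioritization schemes above could be compared like this. The workflow names and every number are made up for illustration:

```python
# Hypothetical per-workflow metrics: session volume, share of
# negative-sentiment sessions, estimated achievable delta (0-1), and
# strategic relevance (0-1).
workflows = {
    "homework_help":  {"volume": 9000, "neg": 0.05, "delta": 0.6, "strategic": 0.9},
    "essay_rewrite":  {"volume": 4000, "neg": 0.30, "delta": 0.8, "strategic": 0.7},
    "image_captions": {"volume": 6000, "neg": 0.25, "delta": 0.1, "strategic": 0.3},
}

def naive(w):    return w["volume"]                                    # volume only
def better(w):   return w["volume"] * w["neg"]                         # volume x neg sentiment
def informed(w): return w["volume"] * w["neg"] * w["delta"] * w["strategic"]

for score in (naive, better, informed):
    ranked = max(workflows, key=lambda k: score(workflows[k]))
    print(f"{score.__name__}: work on {ranked}")
```

Note that each scheme picks a different workflow here: the informed score deprioritizes the high-volume, high-frustration one whose achievable delta is near zero.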
The idea of estimated achievable delta comes down to scoring the actual lift you can realistically gain from working on a workflow and improving the product. If improving something would require training a foundation model, its achievable delta is probably near zero, depending on the kind of company you are. All in all, the goal is that once you have these intents identified, you can build structured workflows where each workflow is self-attributable, deterministic, and self-bound. That lets your teams move much more quickly, because when you improve a specific workflow, all the changes are contained and accountable to that one workflow instead of spilling over into other workflows. This lets your team move more reliably. And while we have a few more seconds: you can continue to refine this process, going deeper and deeper into all your workflows. At the end of the day, you create magic that is engineered, repeatable, testable, and attributable, but not accidental. If you'd like to read more about this, feel free to scan this QR code to read our blog post on the Trellis framework. Thank you for having me.
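One way to read "self-bound workflows" in code: route each classified intent to its own handler, so a change to one workflow is contained and attributable to it alone. The intents and handlers here are hypothetical, not Oleve's actual implementation:

```python
# Hypothetical intent -> workflow routing. Each workflow is its own function,
# so improving one is contained ("self-bound") and can't spill into another.
def solve_math(query: str) -> str:
    return f"[math workflow] solving: {query}"

def rewrite_essay(query: str) -> str:
    return f"[essay workflow] rewriting: {query}"

def open_ended_chat(query: str) -> str:
    # Fallback preserves the open-ended "AI magic" for unclassified intents.
    return f"[general agent] responding to: {query}"

WORKFLOWS = {
    "solve_math": solve_math,
    "rewrite_essay": rewrite_essay,
}

def route(intent: str, query: str) -> str:
    return WORKFLOWS.get(intent, open_ended_chat)(query)

print(route("solve_math", "integrate x^2"))
```

Recursive refinement then happens inside a single handler (splitting it into sub-intents) without touching the routing table or the other workflows.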