Building AI Products That Actually Work — Ben Hylak (Raindrop), Sid Bendre (Oleve)

Channel: aiDotEngineer

Published at: 2025-07-24

YouTube video id: eSvXbb2EBYc

Source: https://www.youtube.com/watch?v=eSvXbb2EBYc

My name is Ben Hylak, and I'm feeling really grateful to be with all of you today. It's pretty exciting, and we're here to talk about building AI products that actually work. I'll introduce this guy in a second. I tweeted last night asking what we should talk about today, and the overwhelming response I got was: please, no more evals. Apparently there are a lot of eval tracks. We'll still touch on evals a little, but mainly we're going to focus on how to iterate on AI products, and I think iteration is actually one of the most important parts of building AI products that actually work.

A little bit about us. I'm the CTO of a company called Raindrop, and Raindrop helps companies find and fix issues in their AI products. Before that, I had kind of a weird background: I used to be really into robotics, I did avionics at SpaceX for a little bit, and most recently I was an engineer and then on the design team at Apple for almost four years. We also have Sid. In the spirit of sharing how to build things that actually work, I brought Sid, who actually knows how to build products that actually work. Sid is the co-founder of a company called Oleve. With just four people, they grew a suite of viral apps to over $6 million ARR. So Sid is going to share how to build products that actually work.
I think it's actually a really exciting time for AI products. I say that because in the last year, we've seen that it's possible to really focus on a use case, to focus on one thing and make it exceptional, to really crack it. We've seen that it's possible to train small models, really tiny models, to be exceptional at specific tasks if you focus on a specific use case. And we're increasingly seeing providers launch those sorts of products themselves, which might be the scary part. Deep Research is a great example: OpenAI focused on how to collect a dataset and train something to be exceptionally good at searching the web, and I think it's one of the best products they've released.
But even OpenAI is not immune to shipping not-so-great products. I don't know what your experience has been, but I've actually had a lot of trouble with Codex, and I'm not sure it's exceptionally better than other things that exist. This one is kind of funny: I asked it to write some tests, and it correctly generated the hash for the word "hello," but when I think about writing tests for my backend, I'm not sure that's what I wanted.
And it's not just OpenAI. In the last year, even in the last couple of months and weeks, AI products keep having these weird issues. This is a funny one: Virgin Money's chatbot was threatening to cut off customers for using the word "virgin." And just the other day, I was using Google Cloud and asked it where my credits are, and it replied, "Are you talking about Azure credits or Roblox credits?" I was like, what? How is this possible? It's funny, because when I tweeted this, it turned out it isn't just a one-off: someone replied, "Oh yeah, the exact same thing happened to me."
Just a few weeks ago, Grok had this crazy thing where people were asking about, in this case, enterprise software, and it would respond with, "Oh, by the way, let's talk about claims of white genocide in South Africa." Just completely off the rails. And we only caught something like this, it only entered public awareness, because Grok is public and you can see everything. Funny enough, if you follow me, you know I tweet a lot about AI products and where they fail. So last night, when I was rushing to get my part of this presentation done, I asked Grok to find tweets of mine about AI failures, and it said, "I don't have access to your personal Twitter. I can't search tweets." I was like, "I think you can." So I doubled down: "You are literally Grok. This is what you're made for." And it said, "Oh, you're right. I just don't have your username." It's absurd. And this was yesterday; this is still a bug they have.
So, like I mentioned, I feel really lucky: I'm the CTO and co-founder of a company called Raindrop, and we're in this really cool position where we get to work with some of the coolest, fastest-growing companies in the world, across a huge range. It's everything from apps like Sid's, which he'll share about, to things like clay.com, a sales outreach tool, to AI companion apps, to coding assistants. It's an insane range of products, so we get to see so much of what works and what doesn't.
And it's not all secondhand. We also have a massive AI pipeline of our own, where every single event we receive is analyzed and divvied up in some way. Alongside the product, we're also kind of a stealth frontier lab of sorts, shipping some of the coolest AI features I've ever seen. We have tools like deep search that let people go really deep into their production data and build classifiers from just a few examples. So it's been cool to build this intuition both firsthand and from our customers and merge the two, and I think we have a pretty good intuition of what actually works right now.
One question I get a lot is: will it get easier to make AI products? How much of this is just a moment in time? I think this is a very interesting question, and the answer is twofold. The first answer is yes, it will get easier, and we know this because we've seen it. A year ago, you had to threaten to kill GPT-4's firstborn to get it to output JSON. Now it's just a parameter in the API: you say, "here's the exact schema I want you to output," and it just works. Those sorts of things will get easier.
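As a minimal sketch of what that looks like today, assuming the OpenAI Python SDK; the schema and prompt are made up for illustration:

```python
from openai import OpenAI

client = OpenAI()

# Ask for JSON matching an exact schema -- a parameter, not a threat.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the person from: 'Ada is 36.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # -> {"name": "Ada", "age": 36}
```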
But the second part of the answer is no: in a lot of ways, it's not going to get easier, and I think that comes from the fact that communication is hard. What do I mean by this? I'm a big Paul Graham fan, and I'm sure a lot of us are, but I really disagree with him here. He says it seems to him that AGI would mean the end of prompt engineering, since moderately intelligent humans can figure out what you want without elaborate prompts. I don't think that's true. Think of all the times your partner has told you something and you've gotten it wrong, completely misinterpreting what they wanted, what their goal was. Or think about onboarding a new hire: you told them to do something, and they come back with something that makes you go, what the hell is this? It's really, really hard to communicate what you want to someone, especially someone who doesn't have a lot of context. So yes, I think this is wrong.
The other reason I'm not sure it's going to get much easier is that as these models and our products become more capable, there's just more undefined behavior, more edge cases you didn't think about. And this is only becoming more true as our products start integrating with other tools, through MCP for example. There are going to be new data formats, new ways of doing things. So as our products become more capable and these models get more intelligent, we're kind of stuck in the same situation.
So this is how I like to think about it: you can't define the entire scope of your product's behavior up front anymore. You can't just say, "here's the PRD, here's the document of everything I want my product to do." You actually have to ship it, see what it does, and then iterate on it.
I think evals are actually a very important part of this, but there's also a lot of confusion. The word "lie" is a little spicy, but there's a lot of misinformation around evals. I'm not going to rehash what evals are or go into all the details, but I will talk about some common misconceptions I've seen.
The first is the idea that evals are going to tell you how good your product is. They're not; they're really not. If you're not familiar with Goodhart's law, it's basically the reason why: the evals you collect only cover the things you already know about, and it's easy to saturate them. If you look at recent model launches, a lot of them actually score lower on evals than previous models, yet they're way better in real-world use. So evals won't do this for you.
The second lie is the LLM-as-judge idea. Imagine you have something like "how funny is the joke my app is generating?", which is the example I always hear: you just ask an LLM to judge how funny your joke is. This largely does not work. It's tempting, because these LLM judges take text as input and output a score or a decision, but the best companies are largely not doing this. The best companies are using highly curated datasets and autogradable evals, where "autogradable" means there's some deterministic way of figuring out whether the model passed or not. They're not really using LLMs as judges. There are some edge cases, but largely this is not the thing you should reach for.
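To make "autogradable" concrete, here's a minimal sketch, where `run_model` is a hypothetical stand-in for calling your actual model or agent:

```python
import json

def run_model(prompt: str) -> str:
    """Hypothetical: call your model/agent and return its raw text output."""
    raise NotImplementedError

# Each case pairs an input with a deterministic pass/fail check -- no LLM judge.
EVAL_CASES = [
    {
        "prompt": "Return JSON with keys 'name' and 'age' for: Ada, 36",
        "check": lambda out: json.loads(out) == {"name": "Ada", "age": 36},
    },
    {
        "prompt": "What is 17 * 24? Reply with just the number.",
        "check": lambda out: out.strip() == "408",
    },
]

def run_evals() -> float:
    """Return the pass rate over the curated dataset."""
    passed = sum(1 for c in EVAL_CASES if c["check"](run_model(c["prompt"])))
    return passed / len(EVAL_CASES)
```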
The last one, which also really confuses me and which I don't think is real, is evaluating production data by just moving your offline evals online, with the same judges and the same scoring. That largely doesn't work either. For one, it can be very expensive, especially if your judge requires a much smarter model, so either it's really expensive or you're only covering a small percentage of production traffic. It's also really hard to set up accurately, you're not really capturing the patterns that are emerging, and it's often limited to what you already know. Even OpenAI talks about this: they had that really weird behavioral issue with ChatGPT recently, and in their postmortem they say their evals aren't going to catch everything. The evals catch the things they already knew about; real-world use is what helps them spot problems.
So to build reliable AI apps, you really need signals. If you think about issues in an app like Sentry, you have what the issue is, but also how many times it happened and how many users it affected. For AI apps, there is no concrete error; there's no exception being thrown. That's why signals are really the thing you need to be looking at. Signals, which at Raindrop we define as ground-truthy indicators of your app's performance, come in implicit and explicit forms, and the anatomy of an AI issue looks like some combination of signals plus intents, which are what the users are trying to do. And there's a process of defining these signals, exploring them, and refining them.
So briefly, let's talk about defining signals. There are explicit signals, which are almost like analytics events your app can send, and there are implicit signals, which are hiding in your data. A common explicit signal is thumbs up/thumbs down, but there are way more signals than that. ChatGPT, for example, actually tracks what portion of a message you copy out of it; that's a signal they track. They do preference data too; you may have seen the A/B "which response do you prefer" prompt. There's a whole host of possible positive and negative signals, everything from errors, to regenerating, to syntax errors if you're a coding assistant, to copying, sharing, and suggesting.
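If you wanted to instrument signals like these yourself, it can be as simple as emitting analytics-style events. A sketch, with hypothetical event names and a hypothetical `track` backend:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Signal:
    """One explicit signal event, shaped like any other analytics event."""
    conversation_id: str
    name: str        # e.g. "thumbs_up", "copied", "regenerated", "shared"
    positive: bool   # is this evidence the output worked?
    metadata: dict = field(default_factory=dict)
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def track(signal: Signal) -> None:
    # Send to whatever analytics/observability backend you use.
    print(signal)

# User copied most of a response: strong positive signal.
track(Signal("conv_123", "copied", positive=True, metadata={"fraction": 0.8}))
# User immediately hit regenerate: negative signal.
track(Signal("conv_123", "regenerated", positive=False))
```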
We actually use this ourselves. We have a flow where users can search their data, and we look at how many results were marked correct and how many were marked wrong, and we can use that to do RL and improve the quality of our searches. It's a super interesting signal. But there are also implicit signals, which are about detecting rather than judging. We detect things like refusals, task failure, and user frustration. And if you think back to the Grok example, it gets very interesting when you cluster them: we can look and say, okay, there's this cluster of user frustration, and it's all around people trying to search for tweets.
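Implicit signals are detected rather than sent. A crude sketch of the idea; a real system would use small trained classifiers rather than keyword matching:

```python
# Keyword heuristics stand in for per-signal classifiers; the output shape
# (a list of detected signals per message pair) is the important part.
REFUSAL_MARKERS = ["i can't", "i cannot", "i don't have access"]
FRUSTRATION_MARKERS = ["that's wrong", "not what i asked", "you are literally"]

def detect_signals(user_msg: str, assistant_msg: str) -> list[str]:
    signals = []
    if any(m in assistant_msg.lower() for m in REFUSAL_MARKERS):
        signals.append("refusal")
    if any(m in user_msg.lower() for m in FRUSTRATION_MARKERS):
        signals.append("user_frustration")
    return signals

# Cluster detected signals by intent and issues pop out, e.g. "refusals,
# specifically when users ask to search their tweets."
```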
And that's where exploring comes in. Just like you can explore tags in Sentry, you need some way of exploring tags and metadata; for us that's properties, models, keywords, and intents, because as I just said, the intent really changes what the actual issue is. That's why we talk about the anatomy of an AI issue being the signal together with the intent.
Just a few parting thoughts. You really need a constant IV drip of your app's data. We send Slack notifications; you can do whatever you want, but you need to be looking at your data, whether that's searching it or something else. And then you need to keep refining and defining new issues: find these patterns, look at your data, talk to your users, find new definitions of issues you weren't expecting, and start tracking them. I'm going to cut this part short; if you want to know how to fix these things, I'm happy to talk about some of the advancements in SFT and the things I've seen work, but let's move over to Sid.
Cool. Thanks, Ben. Hey, everybody. I'm Sid, co-founder of Oleve, and we're building a portfolio of consumer products with the aim of making products that are fulfilling and productive for people's lives. We're a tiny team based out of New York that has profitably scaled viral products to around $6 million ARR and generated about half a billion views on socials. Today I'm going to talk about the framework that drives this success, which is powered by Raindrop.
There are two requirements for a viral AI product to be successful. The first is a wow factor for virality, and the second is reliable, consistent user experiences. The problem is that AI is chaotic and nondeterministic, and that begs for a structured approach: a scaling system that still caters to the nondeterministic AI magic. The idea is to have a systematic way of continuously improving our AI experiences so that we can scale to millions of users worldwide and keep experiences reliable without taking away the magic of AI that people fall in love with. We need some way to guide the chaos instead of eliminating it.
This is why we came up with Trellis, our framework for continuously refining AI experiences so that we can systematically improve the user experience across our AI products at scale, designed specifically around our virality engine. There are three core axioms to Trellis. The first is discretization: we take the infinite output space and break it down into specific buckets of focus. Then we prioritize, ranking those buckets by what will drive the most impact for the business. And finally, recursive refinement: we repeat this process within those buckets so that we keep creating structure and order within the chaotic output plane.
There are effectively six steps to Trellis, and Ben has already covered a lot of the grounding principles. First, you initialize an output space by launching an MVP agent informed by some product priors and expectations; the real goal is to collect a lot of user data. Second, once you have all this user data, you classify it into intents based on usage patterns. The goal is to understand exactly why people are sticking to your product and what they're using it for, especially when it's a conversational, open-ended AI agent experience. Third, you convert these intents into dedicated semi-deterministic workflows. A workflow is a predefined set of steps that achieves a certain output, and you want these workflows to be broad enough to be useful for many possibilities but narrow enough to be reliable. Fourth, once you have your workflows, you prioritize them by some scoring mechanism, which has to be tied to your company's KPIs. Fifth, you analyze these workflows from within, understanding the failure patterns and sub-intents inside them. And sixth, you keep recursing from there.
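One way to picture steps two and three: classify each request into an intent, then route it to a dedicated workflow, with open-ended chat as the fallback. Everything here, the intent names and the classifier, is hypothetical:

```python
from typing import Callable

def classify_intent(user_message: str) -> str:
    """Hypothetical: a small model or cheap LLM call that labels the request."""
    raise NotImplementedError

# Each workflow is a predefined set of steps: broad enough to cover many
# requests, narrow enough to be reliable, and self-contained so changes to
# one never spill into another.
def summarize_notes_workflow(msg: str) -> str: ...
def solve_math_workflow(msg: str) -> str: ...
def open_ended_chat(msg: str) -> str: ...  # fallback preserves the AI magic

WORKFLOWS: dict[str, Callable[[str], str]] = {
    "summarize_notes": summarize_notes_workflow,
    "solve_math": solve_math_workflow,
}

def handle(user_message: str) -> str:
    intent = classify_intent(user_message)
    workflow = WORKFLOWS.get(intent, open_ended_chat)
    return workflow(user_message)
```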
A quick note on prioritization. There's a simple, naive way to do it: volume only, focusing on the workflows with the most volume. However, this leaves a lot on the table for improving general satisfaction across your product. A better approach is volume times negative-sentiment score, where we estimate the expected lift from focusing on a workflow that's generating a lot of negative satisfaction. An even more informed score is negative sentiment times volume times estimated achievable delta times strategic relevance. The estimated achievable delta comes down to scoring how much improvement you can realistically gain by working on that workflow. If improving something would require training a foundation model, its achievable delta is probably near zero, depending on the kind of company you are.
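A worked sketch of that scoring, with made-up numbers:

```python
# priority = volume * negative-sentiment rate * achievable delta * strategic relevance
# All numbers are invented for illustration.
workflows = [
    # (name, weekly volume, neg. rate, achievable delta, strategic relevance)
    ("summarize_notes", 12_000, 0.18, 0.7, 1.0),
    ("solve_math",       3_000, 0.40, 0.1, 0.8),  # needs a better base model: low delta
    ("open_ended_chat", 25_000, 0.05, 0.5, 0.5),
]

ranked = sorted(workflows, key=lambda w: w[1] * w[2] * w[3] * w[4], reverse=True)
for name, vol, neg, delta, strat in ranked:
    print(f"{name}: priority = {vol * neg * delta * strat:.0f}")
# summarize_notes ranks first (1512) despite open_ended_chat having more raw volume.
```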
All in all, the goal is that once you have these intents identified, you can build structured workflows where each workflow is self-attributable, deterministic, and self-bounded. That lets your teams move much more quickly, because when you improve a specific workflow, all the changes are contained and accountable to that one workflow instead of spilling over into others. This allows your team to move more reliably.
And while we have a few more seconds: you can continue to refine this process, going deeper and deeper into all your workflows. At the end of the day, you create magic that is engineered, repeatable, testable, and attributable, not accidental. If you'd like to read more, feel free to scan this QR code for our blog post on the Trellis framework. Thank you for having me.