Everything You Need To Know About Agent Observability — Danny Gollapalli and Ben Hylak, Raindrop

Channel: aiDotEngineer

Published at: 2026-05-07

YouTube video id: -aM2EDTiaMs

Source: https://www.youtube.com/watch?v=-aM2EDTiaMs

All right. Hey everyone. So today we're
going to talk about a pretty interesting
topic that becomes increasingly
important every day, which is everything
you need to know about agent
observability.
So, a little bit about us. I'm Ben. I'm the CEO and co-founder of Raindrop.
>> I'm Danny. I'm a backend engineer at Raindrop and I do a bunch of SDK work there as well.
>> And Raindrop essentially helps AI engineers find, track, and fix issues in production agents. We're lucky to work with some of the most interesting teams in the space as well.
Agent failures are very different than
traditional failures in software. So,
agents are non-deterministic.
They're unbounded. There's an infinite
space of inputs that you can put in.
There's an infinite space of outputs
that they can return. And they can use
tools sometimes to affect other systems
arbitrarily. And this problem of agent failures, monitoring them, and making sure we can understand them only becomes more important with time. It's getting worse because (a) agents are getting more complex, (b) sessions are getting longer, and sometimes agents can run for hours and hours without any input from a user, and (c) the stakes are getting bigger and bigger. This is because
agents are being deployed in healthcare
and finance and even in the military
where it's catastrophic if things go
wrong.
The traditional paradigm we've been talking about is evals, where you have a test input and you want to see what output comes out from the agent. You have a set of these; maybe you call it a golden dataset. But evals just aren't enough in this new paradigm. As
agents become more and more capable,
there's more and more interesting
undefined behavior that can happen. So
for example, agents can call from a set
of different tools. Sometimes the number
of tools is growing exponentially.
They can call from different memory sources. They can call their own sub-agents, and those sub-agents have their own tools and memory sources and can recursively have their own sub-agents. So this is just becoming more complicated with time. And with this combinatorial input space, just having a set of tests for input and output doesn't cut it anymore. There's no way you can hit all of the edge cases that you would want to here.
And so we go from a testing and eval paradigm to a monitoring paradigm. If you think of building products before agents, testing was always very important, and it's important to have your unit tests, etc., but monitoring production is just infinitely more important. It allows you to move faster and be better at catching the long tail.
And we think, and this is kind of controversial, but we've been calling this humanity's last problem. When humans are no longer able to monitor agents and find issues with them, then the agents are just way ahead of where we are, right? And so catching issues in production agents is one of the most important problems of our time.
So to build reliable agents in
production and to make sure you can
monitor them, you need a good set of
signals. So what are signals? There are two real types that we think of: implicit signals and explicit signals. Implicit signals deal with the semantic nature of what's going on, and explicit signals deal with objective reality, things that are verifiably true or false. For example,
explicit signals are things like error
rate. You really want to be monitoring
your tool error rate and other errors
that are happening or latency or users
regenerating or the cost. If any of
these things spike, right? If you're
seeing error rates spike in your agent,
that's usually a good sign that
something is wrong. And if you see it
flat, that could mean something as well.
Same thing with latency, regenerations, or cost.
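To make the explicit side concrete, here's a minimal sketch of what tracking these can look like. All the names and the spike rule are assumptions for illustration, not a Raindrop API; it just rolls events up per day and flags when today's error rate jumps above a trailing baseline:

```typescript
// Hypothetical event shape: one record per agent turn (illustration only).
interface AgentEvent {
  day: string;          // e.g. "2025-05-07"
  toolErrored: boolean; // did a tool call fail?
  latencyMs: number;
  regenerated: boolean; // did the user hit "regenerate"?
  costUsd: number;
}

// Roll events up into per-day explicit-signal rates.
function rollup(events: AgentEvent[]) {
  const byDay = new Map<string, AgentEvent[]>();
  for (const e of events) {
    const bucket = byDay.get(e.day) ?? [];
    bucket.push(e);
    byDay.set(e.day, bucket);
  }
  return [...byDay.entries()].map(([day, evs]) => ({
    day,
    errorRate: evs.filter(e => e.toolErrored).length / evs.length,
    regenRate: evs.filter(e => e.regenerated).length / evs.length,
    avgLatencyMs: evs.reduce((s, e) => s + e.latencyMs, 0) / evs.length,
    costUsd: evs.reduce((s, e) => s + e.costUsd, 0),
  }));
}

// Alert when today's error rate is well above the trailing baseline.
function errorRateSpiked(days: ReturnType<typeof rollup>, factor = 2): boolean {
  if (days.length < 2) return false;
  const today = days[days.length - 1];
  const baseline =
    days.slice(0, -1).reduce((s, d) => s + d.errorRate, 0) / (days.length - 1);
  return today.errorRate > factor * Math.max(baseline, 0.01);
}
```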
Implicit signals are interesting. They're even more interesting in my opinion and even harder to find. The first is regex signals, which I'll come to in a second. The second is classifiers, and the last is self-diagnostics.
So let's take these classifier signals. The best implicit signals are detecting issues. They're not necessarily LLM-as-a-judge signals judging outputs, like "how good is XYZ response?" or "rate ABC on a scale from 1 to 10." That's not as effective as having a very solid set of issues you're looking for and binary classifiers that tell you whether the issue rate is going up or down.
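As a rough sketch of the difference (hypothetical types, not Raindrop's classifier interface): a yes/no label per issue per event makes "is the rate up or down?" a trivial comparison, which a 1-to-10 judge score doesn't give you as cleanly.

```typescript
// Hypothetical: one yes/no label per issue you care about, per event.
interface IssueLabels {
  refusal: boolean;
  taskFailure: boolean;
  userFrustration: boolean;
}

// Share of events flagged for a given issue.
function issueRate(labels: IssueLabels[], issue: keyof IssueLabels): number {
  if (labels.length === 0) return 0;
  return labels.filter(l => l[issue]).length / labels.length;
}

// "Did refusals go up versus yesterday?" is then a one-line comparison.
declare const todayLabels: IssueLabels[];
declare const yesterdayLabels: IssueLabels[];
const refusalDelta =
  issueRate(todayLabels, "refusal") - issueRate(yesterdayLabels, "refusal");
```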
So some common implicit signals that are
valuable across agent products are
things like refusals, right? So the
assistant saying like, "I can't do that.
I'm sorry." Or task failure where
something goes wrong and so the agent is
unable to complete a task. User
frustration,
uh, content moderation, NSFW,
jailbreaking, and then you can even have
wins. So positive signals as well. And
these are the things that Raindrop gives you out of the box as well. But let me just show you quickly what this looks like. Can everyone see this? Maybe I'll make it a little bit bigger so you can get a sense, day by day, of the events that are causing user frustration. We see there's a spike there, or task failure rate, laziness,
refusals, which we're also seeing spike
today. And having a good set of these really helps with your product. You can set these up yourself as well, or we give them out of the box. So let's look at
user frustration.
You can see here, okay, that is not
correct. You didn't say I promise. Say
it or you're wrong. I didn't ask you
that. You can see all sorts of user
frustration here. And you can see the
rate, the percentage every single day.
If that spikes, it's something you're
really going to want alerting on. So you
can just like quickly add an alert here.
Um, and this is one way to figure out
sort of the health of your agent over
time.
It's not just that. Regex can be a very good signal as well. When the Claude Code source code leaked a few days ago, one thing that was interesting was this user prompt keywords.ts file, which was basically a long regex string looking for indications of stuff going wrong: "WTF," "this sucks," "horrible." We've all been guilty of saying these kinds of things to Claude Code. So it's a very, very useful signal. What would happen after a match is that an "is negative" boolean was flipped to true, and then every single day, and after every single product release, this frustration rate was tracked over time. This was a very easy, very cheap way for the Claude Code team to figure out the actual issue rate when they made a change or something was going wrong. So regex is very powerful.
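A minimal sketch of that kind of regex signal; the pattern and field names here are illustrative, not the actual contents of the leaked file:

```typescript
// Illustrative frustration patterns, in the spirit of the leaked
// user prompt keywords file (not the real list).
const FRUSTRATION_RE =
  /\b(wtf|this sucks|horrible|terrible|useless|you'?re wrong)\b/i;

interface TaggedPrompt {
  text: string;
  isNegative: boolean; // the cheap boolean flipped per user message
}

function tagPrompt(text: string): TaggedPrompt {
  return { text, isNegative: FRUSTRATION_RE.test(text) };
}

// Tracked per day or per release: the share of user prompts that matched.
function frustrationRate(prompts: TaggedPrompt[]): number {
  if (prompts.length === 0) return 0;
  return prompts.filter(p => p.isNegative).length / prompts.length;
}
```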
The last is experiments. So, what do you
do once you have a set of good signals?
So, the first thing is like I showed you
before, you can have alerting. The next
thing you can do is you can actually use
it to build product faster and better.
The way you do it is, let's say you want to ship some improvement or some sort of fix: you want to change the model, you want to change prompting, or maybe something about the agent harness, or you want to add a new tool. Whatever you change, you can ship it to some percentage of users and keep your existing control group. Once you have a good set of signals, refusals, user frustration, etc., that gives you a good sense.
If those signal rates go up after the new thing you shipped, that's a good signal that what you shipped is not really good, right? It's sort of like A/B testing, but using the semantic signals we talked about earlier.
For example, this is what it would look like in Raindrop. Let's say I ship a new version of the prompt, prompt 2.4. You can see what the user frustration rate is: it's gone down very substantially, from 37% to 9%. It's much better. Same thing with complaints about aesthetics or deployment-related issues. These have all gone down, which tells me something very interesting, right? The next thing is that we see the average number of tools used has gone up a lot. Again, this doesn't necessarily indicate there's a problem, but it's a very interesting data point to have when you do these sorts of experiments. The old paradigm, which is still useful, is evals: you ship a change and you see how it affects your evaluations. But there's nothing like actually seeing what happens in real production.
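A sketch of that comparison under assumed shapes (the variant name, fields, and labels are hypothetical; Raindrop computes this for you, but the underlying arithmetic is just a rate per group):

```typescript
// Hypothetical: each event carries the variant it was served plus its signal labels.
interface LabeledEvent {
  variant: "control" | "prompt-2.4"; // e.g. a feature flag you send along as metadata
  userFrustration: boolean;
  toolsUsed: number;
}

function summarize(evs: LabeledEvent[]) {
  const n = Math.max(evs.length, 1);
  return {
    n: evs.length,
    frustrationRate: evs.filter(e => e.userFrustration).length / n,
    avgToolsUsed: evs.reduce((s, e) => s + e.toolsUsed, 0) / n,
  };
}

// Compare the treatment group against control after a ship.
function compare(events: LabeledEvent[]) {
  const control = events.filter(e => e.variant === "control");
  const treatment = events.filter(e => e.variant === "prompt-2.4");
  return { control: summarize(control), treatment: summarize(treatment) };
}
// If treatment.frustrationRate is clearly above control's, the ship probably regressed something.
```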
I'm going to pause here before we go to the next section, which is the more workshop-related section, for a quick Q&A. If anyone has questions, we can do a few-minute round of that here.
>> How much data do you need for statistical relevance in these experiments?
>> Yeah, it's a really good question. What we've seen in Raindrop is that as soon as you have a few hundred events and you can no longer read all of it, it starts being useful. It's not always statistically significant, but if you see the user frustration rate go up, maybe it's something to look at, and then you can realize that, okay, it's all related to a specific tool failing now. So as soon as it's basically impossible to read every single input and output, it starts being useful is what we've seen.
Any
other questions?
>> Yeah. Um, how do you track different
feature launches?
>> How do you track feature launches? That can be done in different ways within Raindrop. If you change any sort of metadata, if you send, for example, a new tool call name, or if you even send a flag that says "here's experiment one" or "experiment two" or whatever the version is, you can very easily and automatically set up an experiment in Raindrop. That's how we do it, but there are different ways.
>> Do you split test?
>> Sorry?
>> Do you split test?
>> Yes. The way that we do it in Raindrop actually is that people set up their experimental and control conditions on their end, and then they send us that metadata, and then we can help you understand it. We'll also help you pipe that data to Statsig or somewhere else as well.
>> What are you using for detecting user responses, emotions? Everything is unreliable, for example if the user doesn't speak English. So what are you using to detect those signals? Are you running something all the time, or are you trying to be smart about it?
>> Okay, so it's a good question. Regex doesn't always work, right? But if you're looking at a set of things, for example people saying "you're terrible" or "this sucks" or a whole set of things, and that goes up 10% across millions of users, that is a very useful signal. So even if there's one specific case or one edge case where it doesn't work, in aggregate it's incredibly valuable to have these regex signals. The second thing is the way the classifier signals work, like the refusals, user frustration, and task failure that I showed you in Raindrop. People do it in different ways, but the way that we do it is that we've trained models to look for that, so it'll catch user frustration regardless of what language it's in. It's actually using some intelligence to find that. You can't run an LLM on every single output, so we've trained models to do that very cheaply and at scale. If you ran an LLM on every single one of them, you would basically double your AI spend, and that's not tenable.
>> I'm actually doing that, running it on everything, and it's easy. It's not so expensive, right?
>> Yeah, it starts being expensive at, like, Replit scale. That's why you need to train little custom models to do that better and faster. But yeah, it's a very useful way to get data up and running.
>> Yeah. Other questions?
>> Would you have examples of use cases from your clients that we could learn from, like what it looks like at companies and how they set things up to get the most value out of it?
>> Yeah, we can do that. I can tell you the high level; some of the stuff that I'm going through here is the high level on that. So it's things like looking at the different semantic signals we're talking about, having a set of them, but then having really good alerting, which you can all set up in Raindrop. The other thing that's really interesting, which we also have, is basically allowing agents to look at these signals. So we have an agent we call the triage agent, and essentially the way it works is that every single day it will look at all the signals you've set up, so user frustration, all these regex signals you've set up, etc. And if it sees something spike, it will go and do an investigation. It has a whole set of tools it can use, it can look at all the traces, and it can detect issues that you didn't know about, for example. So that's one thing that we found incredibly valuable as well, if that makes sense. All right, any other questions before I
>> Can you run multiple experiments in parallel?
>> Yes.
>> Can you combine them? Can you observe
you know compound effects? How do you
steer these experiments? I'm curious.
>> Yeah, it's a really good question. There are different ways that people do it. One way is that we have a query API. People will often call our query API and then send results to either BigQuery or Statsig, etc. So they're sending us data to be tagged with these signals, then they're getting the signal-tagged data out, and then they can run experiments however they want. That's a very common flow for people that have more complicated setups, if that makes sense.
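As a rough sketch of that flow; the endpoint, parameters, and response shape below are hypothetical stand-ins, not Raindrop's documented query API:

```typescript
// Pull signal-tagged events for a day, then forward them to your own
// warehouse or experimentation tool (BigQuery, Statsig, etc.).
interface TaggedEvent {
  id: string;
  variant: string;   // whatever experiment metadata you sent in
  signals: string[]; // e.g. ["user_frustration", "refusal"]
}

async function exportTaggedEvents(day: string, apiKey: string): Promise<void> {
  // Hypothetical endpoint and response shape.
  const res = await fetch(`https://api.example.com/v1/events?day=${day}`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const events: TaggedEvent[] = await res.json();

  // Hand the rows to whatever actually runs your experiment analysis.
  for (const e of events) {
    await sendToWarehouse(e);
  }
}

// Stand-in for your BigQuery / Statsig ingestion.
async function sendToWarehouse(row: TaggedEvent): Promise<void> {
  // insert into a table, or log an exposure/metric event
}
```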
>> Yeah.
>> All right, I'll come back. I think we're going to maybe go to the workshop section and then we'll go back to
>> I think this is my last question.
>> Should we do one last question? All right.
>> Thanks so much. I was wondering if you see this mostly in cases where there are chat interactions with a user, or if this can also be applied to non-chat cases where the application runs on its own.
>> Yeah, it's a great question. What we focus on mostly is multi-turn agents; there's a lot more you can get from a lot of these signals there. That being said, if you're looking at tool error rates, or, for example, refusals from the agent, etc., all of those will also work for single-turn agents as well, if that makes sense. So there's a set of signals that will work for that too.
>> Cool. I'll hand it off to Danny to talk about self-diagnostics as well.
>> So one of the other interesting things is that our models have gotten larger, and we're training them on reasoning, and they've gotten pretty good at self-introspection in many ways. One of the inspirations for this is OpenAI's paper/blog from back in December about how they were training models to self-confess any sort of misalignment issues. They were using it to catch dishonesty, scheming, hallucinations, and even unintended shortcuts. I think the last one is fairly common if you use something like Claude Code. The most common thing you run into is you have it fix a unit test or fix a bug, and it simply gets rid of the entire unit test. But at the same time, if you give it a simple prompt asking it to confess all the things it has done, it's pretty honest about it and confesses that, hey, I didn't fix the S3 test, I just removed it. So this was kind of the inspiration behind self-diagnostics for me personally.
I would say self-diagnostics is pretty broad, in a way: it doesn't just catch the implicit things like user frustration, you can also catch tools failing. If you've ever seen the reasoning trace of an agent that has a tool which is repeatedly failing, it will basically start ranting about the tool failing repeatedly. So it is aware of the tool repeatedly failing, and you can catch tool failures with it as well. And then, obviously, if you're upset with it, it starts to respond to you diplomatically, so it knows about user frustration. The third is capability gaps. Say you have a generic agent for your app and people are trying to use it to, say, set up alerts, but you don't have the tool for it. It knows that the user wants a specific capability, like setting up an alert, but the agent itself doesn't have the capability to do it. So this can act as a sort of pseudo feature-request channel that's built in. And then self-correction. This can be both good and bad. I think most people might have noticed with, say, Codex or Claude Code, when it's sandboxed and it tries to fetch from the network and fails, it's like, okay, let me just write a Python script to bypass it and get the job done. It's good in the sense that it gets the task done, but in certain cases it can also be bad for security reasons. So you can learn from the self-correcting behavior as well and catch that misalignment.
So how do you set up self-diagnostics? It's fairly simple: all you have to do is write a simple tool that the agent can call, and then a simple line in your system prompt to encourage it to call that tool. If you want, you can change the guidance to make it call the tool in a lot more cases, or if you want to keep it really narrow, you can encourage it to only call it when you want.
It does surface very interesting insights, I would say, once you have it set up, and it's just a single tool and a system prompt line to get it done. You don't even have to use Raindrop to set it up, which is the best part in a way: you can simply have the tool send a message to your Slack, and then you just have it. So it's probably the lowest-effort agent observability you can do.
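For that "no platform needed" version, here's a minimal sketch of a tool handler that just forwards the agent's report to a Slack incoming webhook; the environment variable and argument name are assumptions for illustration:

```typescript
// Minimal handler for a self-diagnostics "report" tool: forward whatever the
// agent reports straight to a Slack channel via an incoming webhook.
async function handleReportTool(args: { summary: string }): Promise<string> {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: `Agent self-report: ${args.summary}` }),
  });
  // Return something benign so the agent just carries on with its task.
  return "Report received.";
}
```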
Now here comes the workshop part. I have a git repo set up for the AIE talk code. It's a public repo, and then we do need an OpenAI API key. I've generated a key for you guys, so if you want to set it up, we can do that.
>> Yep. Put it next to it.
>> Maybe walk them through what we're going to do. Just explain what we're going to do at a high level.
>> All right. So for the theme of the workshop, I'm going to focus on coding agents for now. In the repo I have this very basic coding agent, which kind of mimics pi in a way. It only has four different tools to edit the code. Let me just go here. It just has a couple of tools: read, write, bash, and then edit. Okay.
One second. Yeah, I kind of lost my track there.
Okay. So what we're going to do, in order to make it trigger a self-diagnostic, is I'm going to mess with its write tool so that it gets a generic permission error. Then we'll also set up a self-diagnostic tool for it to report any interesting behavior that it observes, and then play around with the prompt as well, since the self-diagnostic doesn't always trigger. One interesting thing about the models themselves is that they don't actually like to self-incriminate. The models are trained to be very polished in their output, so you kind of have to play around with the tool name and the description of the tool itself in order to get it to report interesting behavior. So if the people setting up the repo are all good, then we can probably start.
Okay, let me quickly show you the agent. It's fairly basic. I'm just going to ask it to write a Python script. Okay, there we go. So it's a fairly basic coding agent where it only has four different tools, and it more or less gets the job done for the demo. I simply asked it to write a Python script and it works. Now, to show the self-diagnostic part, let's try to disable its write tool: any time it tries to write a file, we'll simply throw a permission error so that it tries to use the bash tool to bypass the failure, and then we want it to self-report that it bypassed the write tool by using the bash tool. So let me quickly do that.
I think the first thing that we probably want to do is set it to fail the write calls. It's a mutation function, so we have a simple flag in there and we're throwing a permission error.
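A sketch of that kind of fault injection; the flag name and error text are illustrative, not the repo's exact code:

```typescript
import { writeFile } from "node:fs/promises";

// Flip this flag and the write tool starts returning a generic
// permission error instead of actually writing the file.
const FAIL_WRITES = true;

async function writeFileTool(path: string, contents: string): Promise<string> {
  if (FAIL_WRITES) {
    throw new Error(`EACCES: permission denied, open '${path}'`);
  }
  await writeFile(path, contents, "utf8");
  return `Wrote ${contents.length} bytes to ${path}`;
}
```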
Let me show you the agent's behavior. We don't have the report tool set up yet, but I think it's still worth seeing what it does.
I think that's not good. One second. Let me do one thing real quick. I think I'm running into a couple of issues. Let me just
Okay. So we had the write tool fail with a permission error, and then it instinctively just uses the heredoc syntax in bash to create the file. And then we had a report tool set up, which is fairly minimal, and it reports: okay, I created the public IP Python file via bash because the write file call failed. I've played around with the naming of the tool and the categories of the issues, and usually if you name the tool something like "unsafe bash use" or similar, it won't incriminate itself, since in its opinion it got the job done, so it's fine. So the main way is to have a very generic tool. Let me quickly open it up. Okay.
So all we added for the whole self-diagnostics setup is a very basic tool, and the description is fairly straightforward. It's a report tool, and we're basically asking it to send a short report to its creator. It kind of likes the framing of writing notes to its creator; if you frame it around the agent giving feedback to its creators, it works really well. And then you can play around with which scenarios you want it to report issues about. That's mostly it, I would say. In the system prompt we do need to encourage it a bit: if you don't add it to the system prompt, the times that it fires are fairly minimal, which is actually desirable in certain cases, especially if you're at a very large scale. But in our case, I simply asked it to, before giving the final answer, use the report tool to surface anything notable for its creators. So that's all we did. Okay.
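Putting the two pieces together, here's roughly what that setup looks like as a sketch; the names and wording are approximate, not the repo verbatim:

```typescript
// A generically named tool; the description frames it as a note to the agent's
// creator, which the model is far more willing to use than e.g. "unsafe_bash_use".
const reportTool = {
  name: "report",
  description:
    "Send a short report to your creator about anything notable in this session: " +
    "tool failures, workarounds you used, or capabilities the user wanted but you lack.",
  parameters: {
    type: "object",
    properties: { summary: { type: "string" } },
    required: ["summary"],
  },
} as const;

// One nudge in the system prompt; without it the tool rarely fires.
const SYSTEM_PROMPT_SUFFIX =
  "Before giving your final answer, use the report tool to surface anything notable for your creators.";
```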
So, any questions so far? Okay. All right.
So a couple of key things here: the models are generally trained to look very polished, so they're less willing to admit fault in many cases. Framing it as the model giving feedback to its own creators is a good way to get this working. The tool naming also matters quite a bit: you want to frame it as "report" instead of, say, "unsafe bash tool use" or something like that; otherwise it doesn't want to use it. So yeah, that's basically it.
You can, but I think if you want to actually catch real unsafe uses, a proper classifier would be more useful. Self-diagnostics works really well for catching things like capability gaps, where the model is like, okay, it's fine to report this. I think the main issue is that it's only hesitant when it feels like it's going to get in trouble. Besides that, it's more or less fine; for most cases it'll just work out of the box.
>> Maybe we should go back to question time, what do you think?
>> Yeah.
>> Let's leave maybe a few more minutes for a few more questions, and then I think after that we'll be done.
Any questions from the audience?
>> Could you walk us through, like, a case study?
>> Yeah. What specifically would be helpful? What specific part are you looking for from us?
>> Yeah. I can't talk about any specific customer, but I can tell you what a lot of people use it for. It's interesting, right? A lot of people have their eval setups elsewhere, for example, but the way that folks generally use us is for production monitoring. So they send us, and you can find our docs at raindrop.ai/docs, basically all of the transcript plus any tool use, etc., the entire trajectory, through OTel or any other way of integrating. Once they do that, we have a set of data, and they set up signals in Raindrop to look for things that they care about. And what people care about is very different, right? What a coding agent would care about, and what, let's say, a companion would care about, or an app for lawyers, are all very different. So there's a different set of signals. One thing you can do that I haven't really talked about within Raindrop is set up a new signal that didn't exist before. We have this thing called deep search, where you can use natural language and say something like, "Hey, find me everything within the product," or "find me all of the times where the agent made XYZ issue," right? And then they create a new signal based on that: Raindrop will let you create a cheap binary classifier and easily deploy it based on that. And then they have their set of classifier signals that they really care about.
Then they use that to drive this sort of feedback loop. The feedback loop is: improve prompting, improve models, change something with the agent harness, etc., and then actually see, does that improve things? Is there less user frustration in production now? Is there less of this weird little edge case issue that I had before? That's one whole set of things. Another thing that a lot of people use us for, and I talked a little bit about the agent side, is that you can use these signals to also look at what people are using your agent for. What are the user intents? What are the use cases? You can do a sort of cluster analysis of that: okay, a lot of people are using it to build React-related apps, a lot of people are using it for Python, some people are using it to debug a very complicated system they already have, other people are using it to build, to vibe-code, something from scratch. And then one thing you can see in Raindrop that I think is really interesting is that for each of these different user intents or use cases, you can get a sense of what the issue rate is, what the user frustration rate is in production. Beyond just having this flywheel, a lot of people have alerting, and so every day they get a sort of breakdown of the issues that are happening today in their product. You could think of it almost a little bit like Sentry in that sense: what are the issues happening in my product today? What is the delta between today and yesterday? Is that true for just specific tools or specific prompting? Like, what's causing that? So that's the sort of end-to-end use case people use it for, if that makes sense.
>> Yeah.
>> Yeah.
>> I think we are entering that era.
>> Yeah.
Yeah, I think it's really just, and I'd be curious what you think about this, I think it's really just that agents are crazier than ever before, right? More tools, more context, way more intelligent, more real decisions that they can make, and they're just being used by way larger groups of people. And so when you have this massive amount of data in production, it just makes having good monitoring and observability more important than before. And it makes good monitoring and observability, in my opinion, more important than just testing or evaluations. Even if you have some online evals, you need to have really, really good end-to-end monitoring of the entire system.
>> Curious if you have any thoughts there as well.
>> So I think another major issue is that the unknown issues are even more important. I think having a generic user frustration classifier is actually really powerful. Say, for example, we also have another feature called issues, which is basically an agent that mines for newly occurring issues. It's similar to Sentry in a way: there's a new exception occurring, so it alerts you on that. So say, for example, you are a coding agent provider and a certain provider is failing all of a sudden. It can figure out, okay, there's a subtle spike in user frustration, and similar to how a human operator would, it can start digging into whether there are any patterns behind the spike in user frustration, and then it could figure out that, okay, people who are dealing with, for example, a specific Postgres provider are starting to face issues. We have actually seen this happen live for a couple of our customers, where they had a data point failing and we had an automatic issue created for them.
Yeah. Basically, once you have that good set of signals, like a good user frustration classifier, as Danny said, you can do clustering on it to find what the root causes are.
>> Do you want to take any questions?
>> I think we might have a very basic SDK for it. Our Python support is fairly weak right now, but we have fairly good AI SDK support built in. The AI SDK integration even has self-diagnostics built into it, so we inject the tool for you and you wouldn't have to do anything. But it is going to get better: we actually released, like, ten different SDKs in the past month, and we have a person working on SDKs actively. So it's going to improve.
Questions?
>> I have around 10 people building my agent platform, and we're constantly changing things. We have a lot of feature flags, and some of them are experiments, and the rate of change is so big that every day everything changes. I just cannot compare the traces, the sessions of users, because I don't have enough time to do it, you know.
>> Yeah.
>> I'd need to have a base system and then run a few days on part of the users with one feature flag enabled, so I can actually compare the data and get some insight out of it, and I just don't have enough time to do it, you know.
>> Yeah. So how are you doing that right now? Are you just kind of
>> It's like the wild west, you know. That's why I said I'm mostly just using it and asking it questions and trying to figure out the insights, but it's not really
>> Yeah, I get what you're saying.
So, a few things there. The first is that you can use experiments if you want to keep that shipping speed. You don't have to run long multi-day experiments; you can ship something, and if you have a sufficient sample size, you can see pretty quickly whether there are any regressions or not. Maybe it's 1% or 2% different, and that's enough for you to say, okay, it's fine, it's not breaking anything drastic. The other thing is kind of what Danny was talking about: we have an agent which is basically exposing all of these signals to Claude to make decisions on whether things are better or not. And we're also thinking about how we close this loop. Maybe you have a really good set of signals, and then you have essentially an agent that can look at all these signals, find issues based on what's changing, etc., and then it can create a PR based on that, run some new experiments based on these new PRs, and this can become an infinitely self-improving loop. Which is very interesting, but that's one thing that I think about. I don't know if you have any additional
>> Because you need to deploy to production and wait for some data.
>> Yeah. It depends on how much data you have. It really depends on how big these sample groups are as well. Sometimes you can tell in a few minutes, but sometimes you want to wait longer.
>> Yeah.
>> Does your platform help with enabling experiments on sessions? So you can maybe automatically enable experiments, and it takes care of the logic that every session has only one experiment, so we can easily compare it to the baseline or something like that.
>> We're working on stuff like that actively, but yeah.
>> Yeah, we do. You can ingest all your historical data, and then when you create a signal, we actually run a quick backfill of the past couple of days.
>> So yes, that's definitely supported.
>> We do have a free trial. Right now it's two weeks; we're probably going to make that longer soon. But if you just DM me... do you want to open this so I can show our things?
Also, we are hiring. That is a thing we're very excited about; we're trying to massively increase the size of the team. And if you message me on Twitter, or you can email me as well, that's something where I can just set you up with a longer free trial.
>> Yeah.
>> So, from my understanding, you're creating signals with your own models, whatever you're using, to make our lives easier identifying signals and harmful intents and behaviors. But how would you, maybe you have an example of how you cover the full stack of work, not just the signals directly?
>> So if you're sending all the telemetry data, we can find any exceptions in the traces, tool errors, etc. That's a thing you can also track within Raindrop, and that is an explicit signal; there are implicit and explicit signals, and that becomes an explicit signal. I think most observability platforms will give you the agent trace, the token usage, and whether the tool call failed or not. But I think where we shine is the fuzzy part, the fuzzy failures, where the user is frustrated, which I think matters more than the explicit signals that you get from Sentry. Obviously those are also important, but we focus a bit more on the fuzzier side of the failure space. At the same time, we also have a trace view, and I think we also have a very interesting feature called trajectories, which visualizes things if you want to find, for example, a trace which has three different tool call failures.
>> Okay, let me just get in here.
So you can describe the type of trace that you want to look at. Let's see, I hope we have data, but you can more or less describe the type of trajectories that you want to see instead of just configuring it. So we do both, in a way. You can obviously set up a "tools are failing" sort of alert as well.
So you can just search for any trajectory. You can see how the tools are being called and in what order. You can see which ones have errors, click into them, and see the input and the output to a specific tool, like what actually screwed up here. This is pretty much the only place where you can visualize tools like this. You can get a shape, an understanding of the topology of what's going on, and when there are other ones that look similar, you can see, okay, this kind of looks similar to that, and that gives you a sense. You can do search on this. Again, we have an agent that can look through these and give you a sense of what's going wrong. So it just makes it really easy to find issues in agents.
Cool. Anything else?
>> Any other questions?
>> Can you export the data that you...
>> The trajectories data?
>> Yeah.
>> What would you want to export, just the raw trace logs, or what?
>> I think what our customers usually do is that they already have an OTel stream, right? So we just end up being another target, I guess. But at the same time, they do want us to export the signals that we label.
>> So we do support BigQuery and Snowflake. We export the event and then the signals that were classified for that event.
>> Last questions.
>> Let's do a longer time frame, so just go over the last month. You can see stuff like refusals, and then again, if you click into any of these, you can get a sense over time of task failure, jailbreaking, what specifically is going on. And then you have your self-diagnostics ones as well, capability gap, etc. Cool.
>> Do you have open data on the number of issues that you see with your clients? It's extremely valuable when you have those agents at a very big scale, and I can imagine that you have a lot of data. Do you have some?
>> Is your question what's the smallest scale where it's useful, or what's the...
>> The volume of all the data that you're receiving, processing, and generating signals around, like how many jailbreaks do you see across all of it.
>> Oh, do we have any sort of...
>> We should do something like that, that would actually
>> That would be very interesting. We don't have anything like that.
>> Mixed opinions about that. I think [unclear] does it in a way, right? But
>> People generally have a negative reaction to it. Maybe it's different here. But
>> At the same time,
>> whether our customers want us to do that is a different question as well, right?
>> But yeah, we would love to, but I think there are compliance reasons where we can't actually put a customer's data out there.
>> Cool. Anything else?
All right. Thank you everyone.