AI Engineer World's Fair 2025 - Evals
Channel: aiDotEngineer
Published at: 2025-06-06
YouTube video id: Vqsfn9rWXR8
Source: https://www.youtube.com/watch?v=Vqsfn9rWXR8
Testing. Hello. Okay, when you're ready. One, two, one, two, three. Look at your data. Look at your data. Great to meet you. So, big fan of DSPy. I think I've come across your Twitter. Yeah. I tried to do the same in TypeScript and failed. So now that I have a good portion of... I have something funny here. I wonder if you're all okay with it, or we don't have to do it, but I asked the AI to give us the intros. And one of the funny things I asked it to do is tell me the pronunciation for the names. So I'd be curious which AI model does better. It'll also be a way that I can interact with the audience, like, all right, pick the AI model that you want, to try to make the thing funny. Is that okay with everybody? Sounds good. All right, cool. I don't know if I'll have to connect it, but I can always just use it and say it out loud. So that's cool. For sure. All right, cool. And no guarantees it did a good job. Yeah, I think that's the point. Yeah. And it probably might have been my entry, so feel free to call that out: okay, that thing was wrong about this. Hello. Hello. For example, let's try Omar real quick. Where's your mic? Right here. It's invisible. Invisible, bro. Nice. There you go. I don't know if it used the web or what other things. I think we submitted... Yeah, I submitted it; I included whatever you provided. Yeah, that's cheating. But some of them... I want to turn your belt pack off before your presentation. Radio frequencies. Yours is fine. Mine is good. Uh, should I twist it, or... Yeah. You've done a lot of these, I bet. Have you done a lot of these? It's been a while. I didn't know you worked at Databricks. Is that recent? Been there for a year. Yeah, I must have just missed it. Cool. Nice. Check. Check. Check. Do you want me to head back or do I stay? There are some cans there. I think that's the water. Super stressful, to be honest. It's all good. Just be yourself. Ignore the crap; it usually ends up okay. Look at the last row. You know that makes it look like you're looking at everyone. Oh yeah, I predict a lot. Maybe we contradict each other. That would be interesting. That would be even more interesting. Well, he's going to contradict us for sure. Ours are going to be pretty complementary, I think, because ours are like, okay, we did a lot of unit-test-like evals and then we regretted it, and then yours comes after, like, evals are not unit tests, duh. So they complement each other. Yeah. We just do a bunch of different types of evals now, more like trajectory, full-conversation evals and stuff like that. Oh yeah, 18 people watching. 18 people, bro. I'm famous. [Laughter] Wow. Test, test, test. Hello. Hello. Test, test. Hello. How was your flight? Good. Good flight. Yeah. Very long. Oh, I actually live in Spain now, so it's a very brutal flight. Like 12 hours. Damn. Yep. Brutal. Yeah, it's rough. So I was like, I want to go to France. Yeah, that's a good move. Good move. Yeah. The shortest way from the United States to Europe. Yeah, indeed. Uh, no, I guess it's pretty much more or less the same. Maybe slightly shorter. Maybe there are better flights. [Music] Because totally, they told you the dream, but it was a lie. Makes sense. You're good. You nervous? I thought it was in the big room. Yeah. Oh yeah. No, I would be melting now if I was going to present in that room.
I would be flying back to Poland. Like, sorry guys, I'm sick. I'm changing jobs. I'm going to be an electrician. Yeah. Okay. So if we have trouble with the AV, you take over, because you're going to see it better. I thought ours might be right at the limit, I feel. Yeah. At the very beginning it was like half an hour, and then we started trimming stuff, removing slides. Really? Yeah. I don't think we will have time for the Q&A. Ain't nobody got time for Q&A. Move on to the next talks. Yes. You cannot optimize for every domain, right? Indeed. [Music] Exactly. Yeah. It's really difficult. We mostly bring the average for everyone higher, but it's still not super good at any specific vertical, so we struggle with that a lot. Yeah. It's for the normies, so it gets people hooked into it, you know. And I think for us, really, they're just for gauging; otherwise you don't know if you're improving. Oh yeah, we have something really pretty similar. Yeah, actually, we changed the model from Sonnet to a newer Sonnet and the scores were all over the board, like, what the [ __ ] is this? Yeah, even though it was obviously the best model, because we were already using it in Cursor and everything. And hey, how you doing? How are you? Nice to meet you. Yeah, he works at... Oh, nice to meet you. I used to work with these guys. Yeah, I'm excited for your talk. And Braintrust is MCing this whole track. Uh yeah, almost up there. They just got the whole track. Yeah. Cool. Yeah, makes sense. I mean, not if you're a competitor, but... Yeah. True. Y'all ready? Yeah. No. Have you given a talk before? Not like this fancy. No. I've done a lot of online talks. Yeah. When you did the course, you did. Yeah. And that's where a lot of the content came from, so much easier. I just don't look at any face. I just look at my presenter notes and I just zone out. I'm going to try to do the same, I think. Did you repurpose the content? A lot of it was repurposed and then there's some new stuff. The good stuff, right? Yeah. The good stuff repurposed, and then some new stuff. Cool. So it's a mix, half and half. You'll get longer and longer prompting guides for the latest models that are supposed to be, you know, closer and closer to AGI. And if you're less lucky, you have to figure that out on your own, right? If you're even less lucky, the prompting guides from the provider are not even that good, so you have to actually figure out what the right thing is. And every day, maybe at an even faster pace, someone is releasing an arXiv paper or a tweet or something that introduces a new learning algorithm, maybe some reinforcement learning bells and whistles, maybe some prompting tricks, maybe a prompt optimization technique, something or other that promises to make your system learn better and fit your goals better. Someone else is introducing some search or scaling or inference strategies, or agent frameworks or agent architectures, that are promising to finally unlock levels of reliability or quality better than what you had before. And I think if you're actually doing a reasonable job, you think: I've got to stay on top of at least some of this stuff so that I don't fall behind.
And in many cases, model APIs actually change the model under the hood even though you're using the same name, so you're forced to scramble. And actually, I would say maybe the question isn't whether you will scramble every week; maybe a different question is: will you even get to scramble for long? If you think about the rate of progress of these LLMs, are they going to eat your lunch? These are, I think, questions that are on a lot of people's minds, and this is what the talk is going to address. So the talk mentions the bitter lesson, which sounds like really ancient old AI lore, but it's just six years old. The current Turing Award winner Rich Sutton, who's a pioneer of reinforcement learning, wrote this short essay on his website that basically says 70 years of AI have taught him, and other people in the AI community from his perspective, that when AI researchers leverage domain knowledge to solve problems, like, I don't know, chess or something, we build complicated methods that essentially don't scale, and we get stuck, and we get beat by methods that leverage scale a lot better. What seems to work better, according to Sutton, is general methods that scale, and he identifies search, which is not like retrieval, more like exploring large spaces, and learning, getting the system to understand its environment, as what works best. Search here is what we'd call in LLM land maybe inference-time scaling or something. I don't speak for Sutton, and I'm not suggesting that I have the right understanding of what he's saying, or that I necessarily agree or disagree, but I think this is a fundamental and important concept in this space. It raises interesting questions for us as people who build and engineer AI systems, because if leveraging domain knowledge is bad, what exactly is AI engineering supposed to be about? I mean, engineering is understanding your domain and working in it with a lot of human ingenuity in repeatable ways, let's say, or with principles. So are we just doomed? Are we just wasting our time? Why are we at an AI engineering fair? And I'll tell you how to resolve this. I've not really seen a lot of people discuss what Sutton is actually talking about, and a lot of people throw the bitter lesson around, so clearly somebody has to think about this, right? Sutton is talking about maximizing intelligence, which is something like the ability to figure things out in a new environment really fast. All of us probably care about that to some degree; I'm also an AI researcher. But when we're building AI systems, I think it's important to remember that the reason we build software is not that we lack AGI. The way to understand this is that we already have general intelligence everywhere; we have eight billion of them. They're unreliable, because that's what intelligence is, and they've not solved the problems that we want to solve with software. That's why we're building software. So we program software not because we lack AGI but because we want reliable, robust, controllable, scalable systems. And we want these to be things that we can reason about and understand at scale.
And actually, if you think about engineering and reliable systems, if you think about checks and balances, in any case where you try to systematize stuff, it's about subtracting agency and subtracting intelligence in exactly the right places, carefully, and not restricting the intelligence otherwise. So this is a very different axis from the kinds of lessons that you would draw from the bitter lesson. Now, that does not mean the bitter lesson is irrelevant. Let me tell you the precise way in which it's relevant. The first takeaway here is that scaling search and learning works best for intelligence. This is the right thing to do if you're an AI researcher interested in building agents that learn really well, really fast, in new environments. Don't hard-code stuff at all, unless you really have to. But in building AI systems, it's helpful to think about: sure, search and learning, but searching for what? What is your AI system even supposed to be doing? What is the fundamental problem that you're solving? It's not intelligence; it's something else. And what are you learning for? What is the system learning in order to do well? That is what you need to be engineering, not the specifics of search and not the specifics of learning, as I'll talk about in the rest of this talk. So Sutton is saying complicated methods get in the way of scaling, especially if you do it early, before you know what you're doing, essentially. Did we hear that before? I feel like I heard that back in the 1970s, although I wasn't around. This is the notion of structured programming, with Knuth's popular phrase in a paper: premature optimization is the root of all evil. I think this is the bitter lesson for software, and thereby also for AI software. So it's not that human ingenuity and human knowledge of the domain are harmful. It's that when you apply them prematurely, in ways that constrain your system and reflect poor understanding, they're bad. But you can't get away, in an engineering field, with not engineering your system. That's just quitting, right? So here's a little piece of code. If you follow me on X, on Twitter, you might recognize it, but otherwise it looks pretty opaque to me in about three seconds. I can't really look at this and tell exactly what it's doing, and I also honestly don't really care. So, lo and behold, this is computing a square root in a certain floating-point representation on an old machine. And the thing that jumps out at me immediately is that this is not the most future-proof program possible. If you change the machine architecture, different floating-point representations, better CPUs, first of all it'll be wrong, because it's just hard-coding some values here. And second of all, it'll probably be slower than a normal square root that maybe is a single instruction, or maybe the compiler has a really smart way of doing it, or a lot of other things that could be optimized for you, right? So someone who wrote this maybe had a good reason, maybe they didn't, but certainly if you're writing this kind of thing often, you're probably messing up as an engineer. So premature optimization is maybe the square root of all evil or something.
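For context, the kind of code being described is a bit-level square-root routine tied to one particular floating-point representation. Here is a small Python re-creation of the classic fast inverse square root hack in that spirit; this is an illustrative sketch, not necessarily the exact snippet shown on the slide.

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    # Reinterpret the float's bits as a 32-bit integer (assumes IEEE-754 single precision).
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # A magic constant plus a bit shift gives a rough first guess at 1/sqrt(x).
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the estimate.
    return y * (1.5 - 0.5 * x * y * y)

print(1.0 / fast_inv_sqrt(2.0))  # roughly sqrt(2), with the representation hard-coded in
```

The magic number and the assumed bit layout are exactly the kind of low-level commitments the talk warns about: change the representation and the trick stops making sense.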
But what counts as premature? That's kind of the name of the game, right? We could just say the phrase, but it doesn't mean anything. I don't think any strategy will work in tech forever; nobody can anticipate what will happen in three years, five years, ten years. But I think you still have to have a conceptual model that you're working off of. And I happen to have built two things that are on the order of several years old that have fundamentally stayed the same over the years, from the days of BERT and text-davinci-002 up to 4o and o4-mini, and they're bigger now than they ever were. They're these stable, fundamental abstractions, or AI systems, around LLMs. So what gives? What has to happen in order to get something like ColBERT or something like DSPy in this ecosystem, something that endures a few years, which is like centuries in AI land? I'll try to reflect on this, and again, none of this is guaranteed to last forever. So here's my hypothesis. Premature optimization is what is happening if and only if you're hard-coding stuff at a lower level of abstraction than you can justify. If you want a square root, please just say: give me a square root. Don't start doing random bit shifts and bit manipulation that happens to appease your particular machine today. And actually, take a step back: do you even want a square root, or are you computing something even more general? Is there a way you could express that more general thing? Only go down to the lower level of abstraction if you've demonstrated that a higher level of abstraction is not good enough. So I think the bigger picture here is that applied machine learning, and definitely prompt engineering, has a huge issue here. Tighter coupling than necessary is known to be bad in software, but it's not really something we talk about when we're building machine learning systems. In fact, the name of the game in machine learning is usually: hey, this latest thing came out, let's rewrite everything so that we're working around that specific thing. And I tweeted about this a year ago, 13 months ago, last May in 2024, saying the bitter lesson is just an artifact of lacking good high-level ML abstractions. Scaling deep learning helps predictably, but after every paradigm shift, the best systems always include modular specializations, because we're trying to build software; we need those. And every time they basically look the same, and they should have been reusable, but they're not, because we're writing bad code. So here's a nice example just to demonstrate this. It's not special at all. Here's a 2006 paper whose title could really have been a paper from now: a modular approach for multilingual question answering. And here's a system architecture. It looks like your favorite multi-agent framework today, right? It has an execution manager, it has question analyzers and retrieval strategies over a bunch of corpora, and if you colored the figure, you would think it's a paper from maybe last year. Now here's the problem. It's a pretty figure, and the system, architecturally, is actually not that wrong. I'm not saying it's the perfect architecture, but in a normal software environment, you could actually just upgrade the machine, right?
Put it on new hardware, put it on a new operating system, and it would just work, and actually work reasonably well, because the architecture is not that bad. But we know that's not the case for these ML architectures, because they're not expressed in the right way. So I think I can express this most passionately against prompts. A prompt is a horrible abstraction for programming, and this needs to be fixed ASAP. I say for programming because it's actually not a horrible one for management. If you want to manage an employee or an agent, a prompt is reasonable; it's a Slack channel, you have a remote employee. If you want to be a pet trainer, working with tensors and objectives is a great way to iterate; that's how we build the models. But I want us to be able to also engineer AI systems, and for engineering and programming, a prompt is a horrible abstraction. Here's why. It's a stringly typed canvas, just a big blur, no structure whatsoever, even if structure actually exists in a latent way. It couples and entangles the fundamental task definition you want to state, which is the really important stuff, the thing you're engineering, with random overfitted, half-baked decisions: hey, this LLM responded to this language when I talked to it this way, or I put in this example to demonstrate my point and it kind of clicked for this model, so I'll just keep it in. And there's no way to really tell the difference: what was the fundamental thing you're solving, and what was the random trick you applied? It's like the square root thing, except you don't call it a square root, and we just have to stare at it and wonder, wait, why are we shifting to the left by five bits? You're also inducing the inference-time strategy, which is changing every few weeks, people are proposing stuff all the time, and you're baking it, literally entangling it, into your system. If it's an agent, your prompt is telling it it's an agent. Your system has no business knowing whether it's an agent or a reasoning system or whatever. What are you actually trying to solve? It's like writing a square root function and then saying, hey, here is the layout of the structs in memory. You're also talking about formatting and parsing: write in XML, produce JSON, whatever. Again, that's really none of your business most of the time. So you want to write a human-readable spec, but you're saying things like: do not ignore this, generate XML, answer in JSON, you are Professor Einstein, a wise expert in the field of..., I'll tip you $1,000. That is just not engineering, guys. So what should we do? Trusty old separation of concerns, I think, is the answer. Your job as an engineer is to invest in your actual system design, starting with the spec. The spec, unfortunately or fortunately, cannot be reduced to one thing. And this is the time I'll talk about evals; I know everyone hears about evals, and this is the one line about evals that makes this talk about evals. A lot of the time you want to invest in natural language descriptions, because that is the power of this new framework. Natural language definitions are not prompts. They are highly localized pieces of ambiguous stuff that could not have been said in any other way. I can't tell the system certain things except in English, so I'll say it in English.
But a lot of the time I'm actually iterating to appease a certain model and to make it perform well relative to some criteria I have, not telling it the criteria, just tinkering with things. Evals are the way to do this, because evals say: here's what I actually care about. Change the model; the evals are still what I care about. They're a fundamental thing. Now, evals are not for everything. If you try to use evals to define the core behavior of your system, you will not get far; induction, learning from data, is a lot harder than following instructions. So you need to have both. Code is another thing that you need. A lot of people say, just ask it to do the thing. Well, who's going to define the tools? Who's going to define the structure? How do you handle information flow? Things that are private should not flow into the wrong places, right? You need to control these things. How do you apply function composition? LLMs are horrible at composition, because neural networks essentially don't learn things that reliably, whereas function composition in software is basically always perfectly reliable, by construction. So a lot of things are often best delegated to code. But it's hard, and it's really important that you can actually juggle and combine these things, and you need a canvas that allows you to combine them. The criterion for a good canvas is that it should allow you to express those three things in a way that's highly streamlined, and in a way that is decoupled from, not entangled with, models that are changing. I should just be able to hot-swap models. Inference strategies that are changing: hey, I want to switch from chain of thought to an agent, I want to switch from an agent to Monte Carlo tree search, whatever the latest thing is, I should be able to just do that. And new learning algorithms. This is really important. We talked about learning, but learning is always happening at the level of your entire system if you're engineering it, or at least you've got to be thinking about it that way, where you're saying: I want the whole thing to work as a whole for my problem, not for some general default. That's what the evals here are going to be doing. And you want a way of expressing this that allows you to do reinforcement learning, but also prompt optimization, and any of these things, at the level of abstraction that you're actually working with. So the second takeaway is that you should invest in defining things specific to your AI system and decouple them from the lower-level swappable pieces, because those will expire faster than ever. I'll just conclude by telling you that we've been building for three years this DSPy framework, which is the only framework that actually decouples your job, which is writing the AI software, from our job, which is giving you powerful, evolving toolkits for learning and for search, which is scaling, and for swapping LLMs through adapters. There's only one concept you have to learn. It is a new, first-class concept, which we call signatures. If you learn it, you've learned DSPy.
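As a concrete illustration of the signature idea, here is a minimal sketch using the DSPy Python API as of the 2.x releases; the model string and the task are placeholders, so treat the details as assumptions and check dspy.ai for the current interface.

```python
import dspy

# Configure a language model; the model string here is just an example.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class GenerateAnswer(dspy.Signature):
    """Answer the question using only the provided context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# The signature states what you want; the module decides how to get it.
# Swapping the inference strategy does not touch the task definition.
plain = dspy.Predict(GenerateAnswer)
with_reasoning = dspy.ChainOfThought(GenerateAnswer)

result = with_reasoning(
    context="DSPy separates task specifications from prompting tricks.",
    question="What does DSPy separate?",
)
print(result.answer)
```

The point of the sketch is the decoupling: the same `GenerateAnswer` spec can be run by a plain predictor, a chain-of-thought module, or an optimizer, without the spec itself changing.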
I'll unfortunately have to skip the details because of the time for the other speakers, but let me give you a summary. I can't predict the future. I'm not telling you that if you do this, the code you write tomorrow will be there forever. But I'm telling you the least you can do; this is not the top level, it's just the baseline, I would say: avoid hand-engineering at lower levels than today allows you to. That's the big lesson from the bitter lesson and from premature optimization being the root of all evil. Among your safest bets, and they could all turn out to be wrong, I don't know: models are not, anytime soon, going to read specs off of your mind. I don't know if we'll figure that out. And they're not going to magically collect all the structure and tools specific to your application. So that's clearly stuff you should invest in. When you're building a system, invest in the signatures, which again you can learn about on the DSPy site, dspy.ai. Invest in essential control flow and tools, and invest in evals. And ride the wave of swappable models, ride the wave of the modules we build, you just swap them in and out, and ride the wave of optimizers, which can do things like reinforcement learning or prompt optimization for whatever application it is that you've built. All right, thank you everyone. All right. We won't take up time on the computer; I think I learned that lesson. So for now, I'll go ahead and pick something. Llama 3, let's see how it did. So here's the introduction, even though I asked it not to give me a preface: "It's my pleasure to introduce our next speakers, Rafal Wilinski, also known as 'Rafal Wii,' and Vitor Balocco, also known as 'Vitor Baloo,' the AI tech lead and staff AI engineer at Zapier. They'll share their insights on how to turn agent failures into opportunities for growth and improvement." And then it said: "Note, I've also used phonetic pronunciation guides to help with the speakers' names, which are often tricky to pronounce for native speakers." So that's Llama 3 right there. All right, I'll let you all take it from here. Cool. Thank you. How's everybody doing? I don't think I have a mic yet, so give me a sec. Yeah. Oh, now is the time to take a selfie? All right, cool. Hey everybody, how's it going? We're Vitor and Rafal, we're AI engineers at Zapier, we're building Zapier Agents, and today we would like to share with you some of the hard-won lessons about building agents that we learned over the past two years building Zapier Agents. And a brief introduction to Zapier Agents. I believe many of you know what Zapier is: automation software, a lot of boxes and arrows, essentially about automating your business processes. Agents is just a more agentic alternative to Zapier. You describe what you want, we propose a bunch of tools and a trigger, you enable that, and hopefully we automate your whole business process. And a key lesson that we have after those two years is that building good AI agents is hard, and building a good platform to enable non-technical people to build AI agents is even harder. That's because AI is nondeterministic, but on top of that, your users are even more nondeterministic. They are going to use your product in ways that you cannot imagine up front. So if you think that building agents is not that hard, you probably have this kind of picture in mind. You probably stumbled upon this library called blank chain.
You pulled some examples from a tutorial, you tweaked the prompt, pulled in a bunch of tools, you chatted with this solution, and you thought, well, it's actually kind of working. All right, so let's deploy it and let's collect some profit. Turns out the reality has a surprising amount of detail, and we believe that building probabilistic software is a little bit different than building traditional software. The initial prototype is only a start, and after you ship something to your users, your responsibility switches to building the data flywheel. Once your users start using your product, you need to collect the feedback and start to understand the usage patterns and the failures, so you can build more evals and an understanding of what's failing and what the use cases are. As you're building more evals and better features, your product is probably getting better, so you're getting more users, and there are more failures, and you have to build more features, and on and on. So yeah, it forms this data flywheel. But let's start with the first step. Okay. So starting from the beginning: how do you start collecting actionable feedback? Backing up for just a second, the first step is to make sure you're instrumenting your code, which you probably already are doing. Whether you're using Braintrust or something else, they all offer an easy way to get started, like just tracing your completion calls. And this is a good start, but you actually also want to make sure that you're recording much more than that in your traces. You want to record the tool calls, the errors from those tool calls, the pre- and post-processing steps. That way it will be much easier to debug what went wrong with a run. And you also want to strive to make the run repeatable for eval purposes. So, for instance, if you log data in the same shape as it appears at runtime, it makes it much easier to convert it to an eval run later, because you can just prepopulate the inputs and expected outputs directly from your trace for free. And this is especially useful for tool calls, because if your tool call produces any side effects, you probably want to mock those in your evals. So you get all that for free if you're recording them in a trace.
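Picking up the point about logging runs in the same shape they have at runtime, here is a small Python sketch of turning a recorded step into an eval case; the field names and helpers are hypothetical, not Zapier's or Braintrust's actual schema.

```python
import uuid
from datetime import datetime, timezone

def log_step(trace: dict, name: str, inputs: dict, output, error: str | None = None) -> None:
    # Record each LLM or tool step in the same shape it had at runtime.
    trace["steps"].append({
        "name": name, "inputs": inputs, "output": output, "error": error,
        "ts": datetime.now(timezone.utc).isoformat(),
    })

def trace_to_eval_case(trace: dict, step_index: int) -> dict:
    # Inputs become the eval input; the observed output becomes a draft expected output.
    step = trace["steps"][step_index]
    return {
        "id": str(uuid.uuid4()),
        "input": step["inputs"],
        "expected": step["output"],  # review by hand before trusting it
        "metadata": {"source_trace": trace["id"], "step": step["name"]},
    }

trace = {"id": "run-123", "steps": []}
log_step(trace, "search_tool", {"query": "refund policy"}, {"docs": ["..."]})
print(trace_to_eval_case(trace, 0))
```

Because the logged step already carries its inputs and outputs in runtime shape, the "one click to turn it into an eval" workflow described below reduces to a transformation like `trace_to_eval_case`.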
Okay, great. So you've instrumented your code and you started getting all this raw data from your runs. Now it's time to figure out which runs to actually pay attention to. Explicit user feedback is really high signal, so that's a good place to start. Unfortunately, not many people actually click those classic thumbs up and thumbs down buttons, so you've got to work a bit harder for that feedback. In our experience, this works best when you ask for the feedback in the right context. You can be a little more aggressive about asking for feedback when you're in the right context, because you're not bothering the user before that. For us, one example of this: once an agent finished running, even if it was just a test run, we show a feedback call to action at the bottom: did this run do what you expected? Give us the feedback now. And this small change actually gave us a really nice bump in feedback submissions, surprisingly. So, thumbs up and thumbs down are a good baseline, but try to find those critical moments in your user's journey where they'll be most likely to provide you that feedback, either because they're happy and satisfied or because they're angry and they want to tell you about it. Even if you work really hard for the feedback, explicit feedback is still really rare, and explicit feedback that's detailed and actionable is even rarer, because people are just not that interested in providing feedback generally. So you also want to mine user interactions for implicit feedback. The good news is there's actually a lot of low-hanging fruit here. Here's an example from our app. Users can test an agent before they turn it on to see if everything's going okay. So if they do turn it on, that's actually really strong positive implicit feedback, right? Copying a model's response is also good implicit feedback; even OpenAI is doing this for ChatGPT. And you can also look for implicit signals in the conversation. Here the user is clearly letting us know that they're not happy with the results. Here they're telling the agent to stop slacking around, which is clearly implicit negative feedback. Sometimes the user sends a follow-up message that is mostly rehashing what they asked the previous time, to see if the LLM interprets that phrasing better; that's also good implicit negative feedback. And there's a surprising amount of cursing. Recently we also had a lot of success using an LLM to detect and group frustrations, and we have this weekly report that we post in our Slack. But it took us a lot of tinkering to make sure the LLM understood what frustration means in the context of our product. So I encourage you to try it out, but expect a lot of tinkering. You should also not forget to look at more traditional user metrics; there's a lot of stuff in there for you to mine implicit signal from too. So find which metrics your business cares about and figure out how to track them, then you can distill some signal from that data. For example, you can look at customers that churned in the last seven days, go look at their last interactions with your product before they left, and you're likely to find some signal there. Okay, so I have raw data, but now I'll let the industry experts speak. Or: the beatings will continue until everyone looks at their data. Okay, but how are you actually going to do that? We believe the first step is to either buy or build LLM ops software. We do both. You're definitely going to need it to understand your agent runs, because one agent run is probably multiple LLM calls, multiple database interactions, tool calls, REST calls, whatever. Each one of them can be a source of failure, and it's really important to piece together this whole story and understand what caused a cascading failure. I said we are doing both, because I believe vibe coding your own internal tooling is really easy right now with Cursor and Claude Code, and it's going to pay you massive dividends in the future, for two reasons. First of all, it gives you the ability to understand your data in your own specific domain context. And second, you should also be able to create functionality to turn every single interesting case, or every failure, into an eval with the minimal amount of friction.
So whenever you see something interesting, there should be one click to turn it into an eval. It should become your instinct. Once you understand what's going on on a single-run basis, you can start understanding things at scale. Now you can do feedback aggregations and clustering, you can bucket your failure modes, you can bucket your interactions, and then you start to see which tools are failing the most and which interactions are the most problematic. That almost creates an automatic roadmap for you, so you'll know where to apply your time and effort to improve your product the most. Doing anything else is going to be a suboptimal strategy. Something we are also experimenting with is using reasoning models to explain the failures. It turns out that if you give them the trace output, input, instructions, and anything else you can find, they are pretty good at finding the root cause of a failure. And even if they don't find it, they're probably going to explain the whole run to you, or direct your attention to something that's really interesting and might help you find the root cause of the problem. Cool. So now you have a good shortlist of failure modes you want to work on first, and it's time to start building out your evals. We realized over time that there are different types of evals, and the types of evals that we want to build can be placed into a hierarchy that resembles the testing pyramid, for those of you who know it: unit-test-like evals at the base, end-to-end evals, or trajectory evals as we like to call them, in the middle, and the ultimate way of evaluating, A/B testing with staged rollouts, at the top. So let's talk a bit about those. Starting with unit-test evals: here we're just trying to predict the n+1 state from the current state. These work great when you want to do simple assertions. For instance, you could check whether the next state is a specific tool call, whether the tool call parameters are correct, whether the answer contains a specific keyword, or whether the agent determined that it was done, all that good stuff. If you're starting out, we recommend focusing on unit-test evals first, because these are the easiest to add. It helps you build that muscle of looking at your data, spotting problems, creating evals that reproduce them, and then focusing on fixing them. Beware, though, of turning every piece of positive feedback into an eval. We found that unit-test evals are best for hill-climbing specific failure modes that you spot in your data.
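Here is a minimal sketch of what a unit-test-style eval like the ones just described could look like in Python; `next_step` is a hypothetical stand-in for a single agent step, not Zapier's actual code.

```python
def next_step(messages: list[dict]) -> dict:
    """Hypothetical single agent step: in the real system this calls the LLM
    and returns either a tool call or a final answer."""
    raise NotImplementedError

def test_schedules_meeting_via_calendar_tool():
    # Unit-test eval: from a fixed current state, assert the n+1 state.
    messages = [{"role": "user", "content": "Schedule a call with Dana tomorrow at 3pm"}]
    step = next_step(messages)
    assert step["type"] == "tool_call"
    assert step["tool"] == "create_calendar_event"
    assert "Dana" in step["args"].get("attendees", "")
```

The assertion pins one expected path, which is exactly why, as the talk notes next, these evals can penalize a new model that reaches the same goal a different way.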
Now, unit-test evals are not perfect, and we realized that ourselves. We realized we had overindexed on unit-test evals when new models were coming out that were objectively stronger, but they were still performing worse on our internal benchmarks, which was weird. And because the majority of our evals were so fine-grained, it was really hard to see the forest for the trees when benchmarking new models; there was always a lot of noise when we tried comparing runs. When you're looking at a single trace, it's easy to go through it and understand what's happening, but when you need to look at an aggregation of many traces, it starts getting difficult to understand what's happening: why are so many of these passing and some of these regressing? So we realized that maybe a machine can help us. In that previous video, when I was investigating one experiment inside Braintrust, there's a lot of staring at that screen trying to figure out what went wrong, and we thought, hey, maybe we can just give all this data, once again, to a reasoning LLM and have it compare the models. It turns out that with the Braintrust MCP and a reasoning model, you can just ask it: look at this run, look at this run, and tell me what's actually different about the new model that we are going to deploy. In this case it was Gemini Pro versus Claude, and what the reasoning model found was actually really, really good. It found that Claude is a decisive executor, whereas Gemini yaps a lot: it asks follow-up questions, it needs some positive affirmations, and it's sometimes even hallucinating bad JSON structures. So yeah, it helped us a lot. It also surfaced a problem with unit-test evals, which is that different models have different ways of trying to achieve the same goal, and unit-test evals penalize different paths. They're hardcoded to only follow one path, and our unit-test evals were overfitting to our existing models, or actually to data collected using those models. So what we started experimenting with is trajectory evals. Instead of grading just one iteration of an agent, we let the agent run all the way to the end state, and we grade not just the end state but also all the tool calls that were made along the way and all the artifacts that were generated along the way. This can also be paired with LLM as a judge; Vitor is going to speak about that later. They are not free, though. I think they have a really high return on investment, but they are much harder to set up, especially if you are evaluating runs that use tools that cause side effects. When you are running an eval, you definitely don't want to send an email on behalf of the customer once again, right? So we had a fundamental question of whether we should mock the environment or not, and we decided that we are not going to mock the environment, because otherwise you're going to get data that just doesn't reflect reality. So what we started doing is mirroring the user's environment and crafting a synthetic copy of it. Trajectory evals are also much slower, so they can sometimes take up to an hour, which is not great. And we're also leaning a bit more into LLM as a judge. This is when you use an LLM to grade or compare results from your evals.
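As a minimal sketch of an LLM-as-a-judge scorer, here is one way it could look, assuming an OpenAI-style client and a simple pass/fail grading scheme; the prompt and model choice are illustrative, not Zapier's setup.

```python
from openai import OpenAI

client = OpenAI()

def judge(run_transcript: str, criteria: str) -> int:
    """Grade a recorded run against natural-language criteria; 1 for pass, 0 for fail."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model
        messages=[
            {"role": "system",
             "content": "You grade agent runs. Reply with exactly PASS or FAIL."},
            {"role": "user",
             "content": f"Criteria:\n{criteria}\n\nRun transcript:\n{run_transcript}"},
        ],
    )
    return 1 if "PASS" in resp.choices[0].message.content.upper() else 0
```

The rubric-based scoring described next amounts to passing a different, hand-written `criteria` string for each row of the data set.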
And it's tempting to lean on them for everything, but you need to make sure that the judge is judging things correctly, which can be surprisingly hard. You also have to be careful not to introduce subtle biases, because even small things that you might overlook can end up influencing it. Lately, we've also been experimenting with this concept of rubric-based scoring. We use an LLM to judge the run, but each row in our data set has a different set of rubrics, rubrics that were handcrafted by a human and that describe in natural language what specifically about this run the LLM should be paying attention to for the score. One example of such a rubric: did the agent react to an unexpected error from the calendar API and then try again? So to sum it up, here's our current mental model of the types of evals that we've built for Zapier Agents. We use LLM-as-a-judge or rubric-based evals to build a high-level overview of the system's capabilities, and these are great for benchmarking new models. We use trajectory evals to capture multi-turn criteria. And we use unit-test-like evals to debug specific failures and hill-climb them, but beware of overfitting with these. And a couple of closing thoughts. Don't obsess over metrics. Remember that when a metric becomes a target, it ceases to be a good metric. So when you're close to achieving a 100% score on your eval data set, it doesn't mean that you're doing a good job; it actually means that your data set is just not interesting, because we don't have AGI yet, so it's probably not true that your model is that good. Something that we're experimenting with lately is dividing the data set into two pools: a regressions data set, to make sure that when we make changes we are not breaking existing use cases for customers, and an aspirational data set of things that are extremely hard, for instance nailing 200 tool calls in a row. And lastly, let's take a step back. What's the point of creating evals in the first place? Your goal isn't to maximize some imaginary number in a lab-like setting; your end goal is user satisfaction. So the ultimate judges are your users. You shouldn't be optimizing for the biggest eval scores and completely disregarding the vibes. That's why we think the ultimate verification method is an A/B test. Just take a small portion of your traffic, let's say 5%, route it to the new model or the new prompt, monitor the feedback, and check your metrics like activation, user retention, and so on. Based on that you can make the most educated decision, instead of being in the lab and optimizing an imaginary number. That's all. Thank you. We have time for a couple of questions, like one or two. There's a mic here on the left and another mic right there for anybody who wants to ask a question for Vitor and for Rafal. Oh yeah, I'll stay. Anybody interested? You can also raise your hand, or just say the question and, if you don't mind, Vitor will repeat it. Mic five. Hey, I'm here. The question I had for you: you mentioned at the very end that you were splitting these things up into three different data sets. Do you combine that with the rubrics thing to create like a map of how things end up looking? How does that actually end up driving product decisions and everything in general? Yeah, great question.
We're still experimenting with the rubric-based evals, and I think they're really useful when you want to eval trajectories, or especially when you want to compare a new model that just came out. But I think ultimately the combination of all the different evals is what drives our product roadmap, especially when we look at things in aggregate: we group things in buckets, and then we know, okay, this is the most important thing that we have to focus on now, and then we can slice our data sets to find which part of a data set is most representative for us to iterate over, and of course afterwards we run the complete data set as well to catch any regressions. But that's kind of how we think about it. Cool. Do we have another question over there? Yeah, one more question here. I'm Drew from Vabs. I was really curious what your process for evaling and aligning your evals looked like. Was that a pretty time-consuming thing? It's something we've struggled with, and we've met a lot of people who have struggled with it. Yeah, sure. So first of all, I think we start building evals by looking at customer feedback, especially negative customer feedback. We used to have labeling parties where we group together, look at a trace, and try to understand what went wrong and what the expected behavior is. We also have a Slack channel, and we have something like an on-call rotation: there's always one engineer who just looks at it and does their best to find the problem. But there is no really hard process for that; it's best effort, I would say. And we kind of stumbled through different types of evals over time. We started with big, long trajectory evals with LLM as a judge, then we doubled down on unit-test evals and kind of neglected the rest, and now we're feeling a bit of the pain from that. So in the end we spread out over different types of evals, and we're still building that intuition about what the ratio of those should be. All right. Thank you, folks. [Applause] Good luck. All right. So, thank you. In case folks came in at the last second, what I'm doing is having the AI write the speaker intros, and I have multiple models here. Thank you. So, let's try Claude 4. Good. "Up next we have Ido Pesok, an AI engineer at Vercel, who's been deep in the trenches building the AI systems behind v0, Vercel's AI-powered web development tool. He's here to share hard-won wisdom about why evaluating AI systems requires throwing out everything you thought you knew about testing, because when your system is nondeterministic, your evals need to be anything but conventional." All right, take it away. Thank you. Hello everyone. How are you? Let's see, hopefully my slides go up. Awesome. Okay, thank you so much for coming. My name is Ido. I'm an engineer at Vercel working on v0. If you don't know, v0 is a full-stack vibe coding platform. It's the easiest and fastest way to prototype, build on the web, and express new ideas. Here are some examples of cool things people have built and shared on Twitter. And to catch you up, we recently launched GitHub sync, so you can now push generated code to GitHub directly from v0. You can also automatically pull changes from GitHub into your chat, and furthermore, switch branches and open PRs to collaborate with your team.
I'm very excited to announce that we recently crossed 100 million messages sent, and we're really excited to keep growing from here. So my goal for this talk is for it to be an introduction to evals specifically at the application layer. You may be used to evals at the model layer, which is what the research labs cite in model releases, but this will focus on what evals mean for your users, your app, and your data. The model is now in the wild, out of the lab, and it needs to work for your use case. And to do this, I have a story. It's a story about an app called Fruit Letter Counter. If the name didn't already give it away, all it is is an app that counts the letters in fruit. So, the vision: we'll make a logo with ChatGPT. There might be a product-market-fit problem already, because everyone on X is dying to know the number of letters in fruit. If you didn't get it, it's a joke on the "how many Rs are in strawberry" prompt. We'll have v0 make all the UI and backend, and then we can ship. So we had v0 write the code. It used the AI SDK to do the stream text call. And what do you know, it worked first try. GPT-4.1 said three. And not only did it say three once, I even tested it twice, and it worked both times in a row. So from there, we're good to ship, right? Let's launch on Twitter: want to know how many letters are in a fruit? Just launched fruitlettercounter.io. The .com and .ai were taken. And yeah, everything was going great. We launched and deployed on Vercel, we had Fluid Compute on, until we suddenly get this tweet. John said: "I asked how many Rs in strawberry and it said two." So, of course, I just tested it twice; how is this even possible? But I think you get where I'm going with this, which is that, by nature, LLMs can be very unreliable. And this principle scales from a small letter-counting app all the way to the biggest AI apps in the world. The reason it's so important to recognize this is that no one is going to use something that doesn't work; it's literally unusable. And this is a significant challenge when you're building AI apps. So, I have a funny meme here, but basically AI apps have this unique property: they're very demo-savvy. You'll demo it, it looks super good, you'll show it to your coworkers, and then you ship to prod, and suddenly hallucinations come and get you. So we always have this in the back of our heads when we're building. Back to where we were. Let's actually not give up, right? We actually want to solve this for our users; we want to make a really good fruit-letter-counting app. So you might say, how do we make reliable software that uses LLMs? Our initial prompt was a simple question, right? But maybe we can try prompt engineering. Maybe we can add some chain of thought, something else, to make it more reliable. So we spend all night working on this new prompt: "You're an exuberant fruit-loving AI on an epic quest," dot dot dot. And this time we actually tested it 10 times in a row in ChatGPT, and it worked every single time. 10 times in a row. It's amazing. So we ship, and everything was going great, until John tweeted at me again and said: "I asked how many Rs are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry, and it said five." So we failed John again. This example is pretty simple, but this is actually what will happen when you start deploying to production.
You'll get users that come up with queries you could never have imagined, and you actually have to start thinking about how to solve them. And the interesting thing, if you think about it, is that 95% of our app works 100% of the time. We can have unit tests for every single function, end-to-end tests for the auth, the login, the sign-out; it will all work. But it's that most crucial 5% that can fail on us. So let's improve it. Now, to visualize this, I have a diagram for you. Hopefully you can see the code. Maybe I need to make my screen brighter. Can you see the code? I don't know. Okay, well, we'll come back to this. But basically, we're going to start building evals, and to visualize this, I have a basketball court. Today's day one of the NBA finals, I don't know if you care. You don't need to know much about basketball; just know that someone is trying to throw a ball in the basket, and here the basket is the glowing golden circle. Blue will represent a shot made and red will represent a shot missed. One property to consider is that the farther your shot is from the basket, the harder it is. Another property is that the court has boundaries. So this blue dot: although the shot goes in, it's outside the court, so it doesn't really count in the game. Let's start plotting our data. So here we have a question: how many Rs in strawberry? This, after our new prompt, will probably work, so we'll label it blue, and we'll put it close to the basket because it's pretty easy. However, "how many Rs are in that big array" we'll label red, and we'll put it farther from the basket. Hopefully you can see that; maybe we can make it a little brighter. But this is the data part of your eval. Basically, you're trying to collect what prompts your users are asking, and you want to store this over time, keep building it, and record where these points sit on your court. Two more prompts I want to bring up. What if someone says: how many Rs are in strawberry, pineapple, dragon fruit, mango, after we replace all the vowels with Rs? Insane prompt, but still technically in our domain, so we'll label it red, all the way down there. But a funny one is: how many syllables are in carrot? This we'll call out of bounds. None of our users are actually going to ask it; it's not part of our app, so no one is going to care. I hope you can see the code, but basically, when you're making evals, here's how you can think about it. Your data is the point on the court. Your shot, or in this case, in Braintrust they call it a task, is the way you shoot the ball towards the basket. And your score is basically a check of: did it go in the basket or did it not go in the basket?
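Here is a framework-agnostic Python sketch of the three pieces just named; in Braintrust these map onto data, task, and scores, but the `ask_model` helper and the little harness itself are hypothetical.

```python
def ask_model(question: str) -> str:
    """Hypothetical helper: in the real app this would call your model provider."""
    raise NotImplementedError

DATA = [  # points on the court: realistic user queries plus expected answers
    {"input": "How many r's are in strawberry?", "expected": "3"},
    {"input": "How many r's are in raspberry?", "expected": "3"},
]

def task(question: str) -> str:
    # the shot: however your system currently produces an answer
    return ask_model(question)

def score(output: str, expected: str) -> int:
    # did it go in the basket?
    return 1 if expected in output else 0

results = [score(task(row["input"]), row["expected"]) for row in DATA]
print(f"{sum(results)}/{len(results)} passed")
```

Keeping the three pieces separate is what lets you grow the data set, swap the task implementation, and tighten the scorer independently of one another.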
To make good evals, you must understand your court. This is the most important step, and you have to be careful of falling into some traps. First is the out-of-bounds trap: don't spend time making evals for data your users don't care about. You have enough problem queries, I promise you, that your users do care about. So be careful; it can feel productive to make a lot of evals that aren't really applicable to your app. Another visualization: don't have a concentrated set of points. When you really understand your court, you'll understand where the boundaries are, and you want to make sure you test across the entire court. A lot of people have been talking about this today, but to collect as much data as possible, here are some things you can do. First, collect thumbs up and thumbs down data. This can be noisy, but it also can be really good signal as to where your app is struggling. Another thing: if you have observability, which is highly recommended, you can just read through random samples in your logs. Even though users might not be giving you signal directly, if you take a hundred random samples and go through them once a week, you'll get a really good understanding of how your users are using the product. If you have community forums, these are also great; people will often report issues they're having with the LLM. X and Twitter are also great, but can be noisy. There really is no shortcut here; you have to do the work and understand what your court looks like. So here is what it should look like if you're doing a good job of understanding your court and building your data set. You should know the boundaries, you should be testing within your boundaries, and you should understand where your system has blue versus where it has red. Here it's really easy to tell: okay, maybe next week we need to prioritize the team working on that bottom-right corner. That's somewhere a lot of users are struggling, and we can really do a good job of flipping the tiles from red to blue. Another thing you can do, and I hope you can see it, is put constants in the data and variables in the task. Just like in math or programming, you want to factor out constants; it improves clarity, reuse, and generalization. Let's say you want to test your system prompt. Keep constant the data that your users are going to ask: for example, "how many Rs in strawberry" goes in the data; that's a constant, it's never going to change throughout your app. What you're going to vary is in the task: you'll try different system prompts, you might try different pre-processing, different RAG, and that's what you want to put in your task section. This way your app actually scales, and when you change your system prompt, you never have to redo all your data. This is a really nice feature of Braintrust. And if you don't know, the AI SDK actually offers a thing called middleware, and it's a really good abstraction for putting basically all your pre-processing logic, so RAG, system prompt, and so on can go in there, and you can now share this between your actual API route that's doing the completion and your evals. So if you think about the basketball court as basketball practice, and we're trying to practice our system across different models, you want your practice to be as similar as possible to the real game; that's what makes good practice. So you want to share pretty much the exact same code between your evals and what you're actually running. Now I want to talk a little bit about scores, which is the last step of the eval. The unfortunate thing is that it varies greatly depending on your domain. In this case it's super simple: you're just checking if the output contains the correct number of letters. But for a task like writing, it's very, very difficult. As a principle, you want to lean towards deterministic scoring and pass/fail.
Lean deterministic because when you're debugging, you're going to get a ton of inputs and logs, and you want to make it as easy as possible to figure out what's going wrong. If you're over-engineering your score, it might be very difficult to share your evals with your team and distribute them across different teams, because no one will understand how these things are getting scored. Keep your scores as simple as possible. A good question to ask yourself is: when I'm looking at the data, what am I looking for to decide this failed? With v0, we're looking for whether the code didn't work. Maybe for writing, you're looking for certain linguistics. Ask yourself that question and write the code that looks for it. There are some cases where it's so hard to write the code that you may need to do human review, and that's okay. At the end of the day, you want to build your court and you want to collect signal, even if you must do human review to get the correct signal. If you do the correct practice, it will pay off in the long run and you'll get better results for your users.

One trick you can do for scoring: don't be scared to add a little extra to the original prompt. For example, here we can say, output your final answer in these answer tags. What this does is make it very easy for you to do string matching and so on. In production you don't really want this, but you can do little tweaks to your prompts so that scoring is easier.

Another thing we highly recommend is adding evals to your CI. Braintrust is really nice here because you get these eval reports: it runs your task across all your data and then gives you a report at the end with the improvements and regressions. Say my colleague made a PR that changes a bit of the prompt. We want to know how it did across the court. Did it change more tiles from red to blue? Maybe our prompt fixed one part but broke another part of the app. It's a really useful report to have when you're doing PRs.

So, going back, this is the summary of the talk. You want to make your data the core of your evals, and you can treat it like practice. Your model is basically going to practice. Maybe you want to switch players: when you switch models, you can see how a different player performs in your practice. This gives you such a good understanding of how your system is doing when you change things like your RAG or your system prompt, and you can now go to your colleague and say, "Hey, this actually did help our app." Because improvement without measurement is limited and imprecise, and evals give you the clarity you need to systematically improve your app. When you do that, you get better reliability and quality, higher conversion and retention, and you also get to spend less time on support and ops, because your evals, your practice environment, will take care of that for you. And if you're wondering how I built all these court diagrams, I actually just used v0; it made me an app where I added these shots made and missed around the basket. So, yeah, thank you very much. I hope you learned a little bit about evals. Thank you. So, we do have some time for some questions.
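One more sketch of that answer-tag trick, as a deterministic scorer (assuming the eval-time prompt appends an instruction to wrap the final answer in `<answer>` tags; illustrative, not v0's actual scorer):

```typescript
// Deterministic pass/fail scorer for outputs that wrap the result in <answer> tags.
function answerTagScorer({ output, expected }: { output: string; expected: string }) {
  const match = output.match(/<answer>([\s\S]*?)<\/answer>/i); // pull out the tagged answer
  const answer = match ? match[1].trim() : "";
  return { name: "answer_matches", score: answer === expected.trim() ? 1 : 0 };
}
```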
There are two mics, one over here, one over there. We can take two or three questions, please, if anybody's interested in asking. We have one over there, mic five, please. Or you can repeat the question if you don't mind.

Yeah, you can think of it really like practice. Maybe your basketball player will in general score 90%, but they might miss more shots here or there. We run it, like, we run it every day at least, and then we get a good sense of where we're actually failing. Did we have some regression? So yeah, running it daily, or at least on some schedule, will give you a good idea.

I was thinking, what if you ran the same question through it five times? What's the percentage, like making four out of five?

I see. So, definitely, as you go further away, the harder questions get lower pass rates. But overall you can measure that too: how is it running, like four out of five, over a week or something, and you just want to improve that. Evals are just a way to measure so you can actually do the improvement.

Hello, I have a question about the v0 evaluation, like real evaluation. As I understand, you have a lot of user trajectories, internal ones, so it's not about analyzing one trajectory but hundreds or even thousands. The question is, how do you analyze, which method do you use to analyze thousands of trajectories and get valuable insights from them at scale?

Yeah, so v0 is in a special domain because we're doing code, so we can actually run the code. When you run in v0, we can collect errors, so we do a lot of measurements based on the errors we get. However, if you have a task like writing or something else that's more difficult to measure, you might be in a worse spot where you need more human evals. But what we do is track a lot of errors in the apps, and that usually is a good signal for the final trajectory of the app that the user is building. Does that make sense?

I'm also interested in how you analyze which tools were called and how properly they were called. It's nondeterministic, and it's interesting how to understand it.

So, like the Zapier presentation earlier, the best thing is to just look at the final output. You can get a little carried away looking at every tool call, but at v0, because the question is "did the app work," a lot of the time that's after all the steps in between, and that's what we really look at: did the thing work at the end or not. Then, if it didn't work, we go in and look at all the steps and try to find the one that broke. But we start by looking at the final trajectory, after all the stuff has happened. Yeah. Okay. Yes. Thank you.

Awesome. And then we have one more on the side.

Yeah, how do you think about evals in terms of each user query versus a full conversation? I know you were just talking about thinking through the whole thing, but what if the user doesn't know where they're going at the end?

Yeah, so for v0 the conversations can get really long. I would equate that to a more difficult, actually different data point; I would equate it to a very difficult shot on the court, and we test those separately.
So maybe you can have multiple evals within one chat, depending on where they are in the conversation, and at that point you test one message at a time. But yeah, does that answer your question? Yeah, somewhat. Thanks. Awesome. Thank you.

Did you have one really quickly? Yeah, I was going to ask how you deal with an error that's three or five turns into a conversation, where you need to mock up to that state and figure out whether it's able to resolve it from there.

Yeah, that's a good question. On v0 we have the state at every point, so we're lucky: we can really easily traverse that state. But what we do, like I was saying earlier, is that in our evals you test one message ahead. And if they got into a broken state already, then it's not really worth testing anymore, because usually the LLMs really suck at coming back from it. So we're mostly focused on finding where that point was where it did break, and you want to try to reduce the number of those points. Maybe in the future, as the models get better, you can start to recover, and we're working on a better loop of error correction; it's something we're constantly working on. But usually it's really helpful to find that one point where it ended and go from there. Makes sense. Thank you all. Thank you.

Yep, appreciate that. So, we do have one remaining speaker, but they're not in right now. Feel free to stick around; hopefully he shows up. Otherwise, there's also lunch available. Appreciate it. All right.

So, we have a guest speaker. Actually, if you don't mind introducing yourself, because I don't think even the LLM... Yeah, go for it.

Hey, my name is Randall. I just launched a framework today for evals that I'm trying to figure out if it's good. It's called Bolt Foundry; that's our company name. We are trying to help people who do JavaScript. Most people who do JavaScript don't have a great way to run evals or create synthetic data. We have a different approach that is based on journalism, actually: building prompts in a structured way that's more similar to the way a newspaper is written than the way most people do it, where they scare the crap out of an LLM. Let's see if I can plug in. This is totally guerrilla, but no one else is here, so why not, right? All right, let's do this. It's live on GitHub right now; if you want, go to github.com/boltfoundry, or boltfoundry.com has our thing.

All right, cool. So let me spin up my environment. The cool thing is that to start you can just run npx bff-eval, and we have a demo you can run with --demo. It all works with OpenRouter, so you can test against any system you want. Oh, this is a Replit bug, hold on, let me fix this really quick. All right, one more time: npx bff-eval --demo. So here we've made a simple JSON validator. Rather than writing a JSON validator that actually parses the JSON, this is a grader. A grader is an LLM that you can use to figure out if your thing is good. We have a set of samples; let me just grab them. samples.jsonl. Let's do this one. Sure, that's the wrong one, but it's okay.
So, essentially, the messages come in as a user message, an assistant response, and then, optionally, a score. You can actually set your ground truth and say, this is a sample of something that's really good, this is a sample of something that's really shitty. And then you can describe it. The description is mostly for your team, so that, as they were talking about earlier, when you're pulling something from production and you have a sample you want your team to know about, you can describe why it's relevant. The eval runs; in this case we have ten runs, and each has a rubric score from positive three to negative three. And the other speaker is here now, so thank you so much for letting me show you something cool. boltfoundry.com. Oh, I have a minute left. All right, the other cool thing, I'll just show you the end: you can actually see where your grader diverged from the ground truth, so you can say, oh, this is weird, why is this one acting this way? Anyway, we're going to have more stuff. It's going to be cool. Thanks, man. Thank you for giving me a second. Yeah, no problem.

All right. Next up, Diego. I had your introduction ready, but hopefully you don't mind introducing yourself. Go ahead and set up, connect the USB there and then the HDMI there. There's no audio in your computer? I just project. Yeah, I'm happy to project, that's fine. Hello, hello, test one, two. Okay, perfect. It should be relatively short, too, so we have time; we can wait a few minutes if that's better. Okay, cool. I'm going to start.

Okay, so hello everyone. My name is Diego Rodriguez. I'm the co-founder of Krea, a startup in the AI space like many others, in particular generative media, multimedia, multimodal, and all the buzzwords. But I come here mainly to tell a story about how we think about evaluations when we have to take human perception, human opinion, and aesthetics into the mix. So I'm going to start with a very simple story. I put up an AI-generated image of a hand; obviously, it looks horrible. Then I asked o3, what do you think of this image? It thought for 17 seconds, obviously tool calling, Python analysis, OpenCV goes crazy, and then, after it charges me a few cents, it says: just a couple of things jump out, it's mostly natural. And it's like, okay, we have what many people claim is basically AGI, and it is completely unable to answer a very simple question.
And that's a surprising thing if you think about it, because we as humans, when we see that image, just react so naturally against it: what is that? That's not natural. And I feel like that's precisely because AI models are being trained, first, on human data; second, on human preference data; and third, in a way, limited by data shaped by our preconceived notions and perception and all of that. So that's what this talk is about: what can we do better, and, honestly, asking ourselves some questions that I think are not being asked enough in the field.

Cool. So, a tiny bit of history. We all know about Claude Shannon, the father of information theory. According to many, his master's thesis is one of the most important ever written; he laid the foundations for digital circuits and then, eventually, communication, and to a degree we can say even LLMs nowadays, if we fast forward. And I want to focus on the fact that we call his work foundational in information theory. Well, when he published it, it was actually called "A Mathematical Theory of Communication," and he was always focused on communication. There's a diagram that appears in that work, and it's all about: this is the source, this is the channel, this is the destination, and there can be some noise in there. And as a founder of a company that is focused on media, to me it's interesting to notice these parallels between classic information theory and communication. Let me see, did I put the image in? Well, I didn't, but if you have any context around variational autoencoders or neural networks, you can squint at that diagram and go, oh, is that a neural network?

In the context of information and communication, I want to talk about how compression relates to how we think about evaluation. Take JPEG as an example. JPEG exploits human nature in the sense that we are very sensitive to brightness but not to color. There's an illusion that shows this, where A and B are actually the same color, but we are basically unable to perceive it until we do this, and then suddenly it's like, oh, really? What's going on there? JPEG does the same thing. We have the RGB color space to represent images with computers. We notice that there's a diagonal that represents the brightness of the image, so we can change into a different color space that separates color from brightness, and then we can downsample the color channels, because we're actually not even that sensitive to them; we can remove that, or parts of it. Once we do that, we can see the brightness and color components separated, and once we downsample, we can try to recreate the image. Here's an example: the original image versus the image with the downsampled color. They look the same to us, and the second one carries something like 50% less information. There's other stuff, Huffman coding and more, but the point is the same. And then, if you exploit the same thing for audio, what can we hear, what can we not hear? Well, you do the same thing and you get MP3.
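To make the brightness-versus-color point concrete, here's a rough, illustrative sketch of the color-space separation just described (not JPEG's actual pipeline); the coefficients are the common BT.601-style luma/chroma conversion:

```typescript
// Convert RGB to a brightness (Y) plus color-difference (Cb, Cr) representation.
// JPEG-style codecs keep Y at full resolution and store Cb/Cr at lower resolution,
// because we notice lost brightness detail far more than lost color detail.
function rgbToYCbCr(r: number, g: number, b: number) {
  const y = 0.299 * r + 0.587 * g + 0.114 * b;              // brightness (luma)
  const cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b;   // blue-difference chroma
  const cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b;   // red-difference chroma
  return { y, cb, cr };
}
// "Downsampling the color" then just means keeping one Cb/Cr pair per 2x2 block
// of pixels: roughly 75% of the color samples are thrown away with little visible
// difference, which is the "looks the same, half the information" effect above.
```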
Do the same exploitation across time and, congrats, now you have MP4. It's all the same principle: let's exploit how we humans perceive the world. This made me think about myself, because I studied audiovisual systems engineering, which is engineering around how microphones work, how speakers work, all of that. And it was just interesting to me: I was coding, I start deleting information, I know for a fact that I'm deleting information, yet I re-render the image and I see the same thing. Philosophers always tell you we're limited by our senses, but this was the first time I felt like I was seeing it: I am not seeing the difference.

But then, if a lot of our data is the internet, and we're scraping data from the internet, a bunch of those images are also compressed. Are we taking into account that perhaps our AIs are limited too, because we have some sort of contagion of our flaws into the AI? And then it gets trickier. For instance, here's a screenshot I took from a paper, I think it's called Clean-FID. FID, for those without context, is one of the standard metrics used to measure how well, for instance, diffusion models reproduce images. But you start adding JPEG artifacts and the score says, oh no, this is a horrible, horrible image, while perceptually the four images are basically the same. Yet the FID score says this is really bad. So why are we using FID scores, or metrics along those lines, to decide whether a generative AI model is good or bad?

The thing is, sometimes I feel like we are focused on measuring just the things that are easy to measure: prompt adherence with CLIP, how many objects are there, is this blue, is this red, et cetera. But what about this one? Oh, really bad, really bad generator, because that's not how a clock looks and the sky makes no sense. So not only are we limiting our AIs with our human perception; on top of that, we forget about the relativity of metrics. No, actually, this is art, and it's great, and sometimes there's meaning behind the work that isn't literally in the image, but if you're human, you get it: this is what the author is trying to tell me. I feel like the metrics don't show that. And commercially, professionally, my job is: how can we build a company that allows creatives and artists of all sorts, starting with imagery, then video, to better express themselves? But how are we supposed to do that if this is the state of the art?

Then a friend of mine, Cheng Lou, who works at Midjourney and has great talks you should all check out, told me a little over a year ago a quote that I just can't stop thinking about, which goes something like: hey man, if you think about it, predicting the car back when everything was horses is not that hard. And I was like, well, yeah, it's not that hard: cars are the future, we have a thing that goes like this, we have horses that make energy, you swap the horse for an engine, that's essentially a car. Come on, how hard is that? And he said: you know what's hard to predict? Traffic.
And then I just kept thinking about that. As engineers, as researchers, as founders, what are the "traffics" that we're missing now? Because I feel like everyone's focused on, yeah, but you can transform from JSON to YAML, and, dude, who cares. Or, yes, it's important, but what big picture are we all missing?

Then there's the myth of the Tower of Babel, where, in a nutshell, humanity wants to build a tower and go meet God, and God says, no, I don't want that, so instead I'm just going to confuse all of you; you won't be able to coordinate, each of you will speak a different language, and it will basically be impossible to keep the thing going. Which reminds me of standard infrastructure meetings with backend engineers: no, we should use Kubernetes; no, we should use something else. It's all fighting and nothing gets built. I'm like, dude, God is winning, goddammit.

But this makes me think: we just entered the age where models have essentially solved translation, or solved it to a very high degree. So what happens now that we can all speak our own languages yet, at the same time, communicate with each other? I'm already doing it. For instance, I sometimes do customer support manually for Krea, and I literally speak Japanese with some of my users, and I don't speak Japanese. I've learned a little, but I don't speak it. And I'm now able to provide excellent founder-level customer support, whatever that means, to a country I otherwise couldn't. So I invite us all to think about what that really means. Because it means, for instance, that we can now understand better, or transmit our own opinions better, to others. And on the previous point about art, that's kind of like an opinion, right? Evals are not just about whether there are four cuts here. It's about: this cut is blue, and, yeah, but is it blue or is it teal? What kind of blue? I don't like this blue. All of that.

So, in a nutshell: how do we evolve our evals? If, in my opinion, this is bad, then I want metrics that take my opinion into account too. And then, okay, consider that I may be a visual learner; maybe your evals should take into account how we humans perceive images. And also the nature of the data, such as: it's all trained on JPEGs from the internet, so take the artifacts into account, take all of this into account while training on your data.

Okay, I guess the mandatory slide before the thank-you: a bunch of users, a bunch of money, and we did all of that with eight people; now we're twelve. And this is an email I set up today for high-priority applications, for anyone who wants to work on research around aesthetics, hyper-personalization, and scaling generative AI models in real time for multimedia — image, video, audio, 3D — across the globe. We have customers like those, and that's it. Thank you. Oh, Q&A. Okay, perfect. Any questions? There are many points there; can you rephrase the question? Yeah. Yeah.
So, the question, in a nutshell, is: are there perceptually aware metrics? I showed an example where the FID score changes a lot with JPEG artifacts. Are there metrics that do almost the opposite, that barely change while the image still looks good? There are some, and many of them are also used in traditional encoding techniques. But in a way, I'm here to invite us all to start thinking about those. We can actually train a classifier, or a continuous classifier, so that it understands what we mean: hey, I show you these five images, these five images are actually all good, and they can have all sorts of artifacts, not just JPEG artifacts. This is exactly where machine learning excels, when it's all about opinions: let me just show you, and you'll know it when you see it. That's precisely the type of question that AI is amazing at. All right, thank you, everyone. He'll be sticking around for questions on the side if you have some. Thank you. And we are doing a working lunch.

[Sound check between talks.]

Hey everybody, my name is Doug Guthrie. I'm a solutions engineer at Braintrust. As you can see here, we're an end-to-end developer platform for building AI products. We do evals. If you watched the keynote this morning, you saw our founding engineer jumping up and down on stage yelling "evals." I am not going to do that; I'm not as funny or cool as him, but we should all be very excited about evals here. A very brief agenda of what we'll cover in this intro: a very brief company overview; an intro to evals — why you would even start thinking about using them, what they are, and the different components you need to create one; some more Braintrust-specific things, like running evals via our SDK. You'll see in the examples that you can run evals both in the platform itself and via the SDK, and I think it's really cool that we can connect the local development we're doing with what we're doing in the platform. Then, how we move to production: human review, online scoring, how well our application is performing in production. And lastly, a little bit of human in the loop: getting feedback from your users, and how we take some of those production logs and feed them into the datasets we're using in our evals, creating this really great flywheel effect.

Cool. Quick company overview. You see up there, maybe in the top right, some of our investors. Maybe a quick callout on the leadership side: Ankur Goyal is our CEO.
Maybe the reason to call that out is that at his last two stops, Ankur essentially built Braintrust from scratch inside those companies, and that's really where he found the idea: maybe this is actually a thing that people need. That's the origination story of Braintrust. The other thing to call out is that we already have a lot of companies using Braintrust in production today; this is just a few of the companies that are utilizing us for evals and for observability of their GenAI applications. I won't bore you too much with that, but let's jump into this.

If you were at the keynote, you probably saw a similar slide. I didn't take it out, but you see the tech luminaries here, as Manu referenced, talking about evals and the importance of them. This is obviously an intro-to-evals track, and what better way to start than with some really influential people in the space talking about evals and why they are so important.

So why would you think about evals? They help you answer questions. Here are a few of them. When I change the underlying model in my application, is it getting better or is it getting worse? When I change my prompt, when I change certain things about the application, is it getting better or worse? We want to get away from not having a rigorous process around building with large language models, which, as you all know, have nondeterministic outputs. That creates somewhat of a challenge, and without evals it becomes really hard to create a good application that we can put into production. Some other reasons: obviously, being able to detect regressions in the code. The other thing Ankur mentioned to me when I first started (I didn't really mention this, but this is my third week at Braintrust as a solutions engineer) that really resonated was this: people think of evals as almost like unit tests for our applications, but he described them another way, as a really great way for us to play offense, as opposed to just playing defense, which is what unit tests are kind of used for. Evals can actually be a tool that creates a lot of rigor around building and developing these applications and ensures we actually build things we can put into production.

From a business perspective, why would you think about running evals? Here are a few reasons. If you have evals, you create this feedback loop, this flywheel effect, which I think Manu mentioned in the keynote. This flywheel effect allows us to cut dev time and enhance the quality of the application we're putting into production. If you're able to connect the things happening in real life with your users in production, filter those logs down, add those spans to datasets, and inform what you're doing in an offline way, it creates that flywheel effect that becomes really powerful from a development perspective. Again, a little bit more on the customer side.
Here are a few of our customers and some of the outcomes they've seen using Braintrust, whether it's moving a little bit faster, pushing more AI features into production, or just increasing the quality of the applications they have.

So let's start talking a little bit about the core concepts of Braintrust. Obviously, we're here to talk about evals. You see these arrows going one way and the other; again, this is that flywheel effect I described earlier. There's the prompt-engineering aspect: in Braintrust, think of the playground we have as an IDE for LLM outputs. The playgrounds allow for rapid prototyping: as we make changes, as we change the underlying model, what is the impact on that particular task or application? And there's the observability aspect: the logs we're generating in production, the ability to have a human review those logs in a really easy, intuitive interface, and then having user feedback from actual users be logged into the application as well.

So what is an eval? You've probably heard this several times today, or throughout the week if you stopped by the booth, but our definition is a structured test that checks how well your AI system performs. It helps you measure the things that are important: quality, reliability, correctness. So what are the ingredients of an eval? I've been talking a little bit about the task: this is the thing, the code or the prompt, that we want to evaluate. The really cool thing about Braintrust is that this can be as simple as a single prompt, or it can be a full agentic workflow where we're calling out to tools. There's no limit on the complexity we put into this task; the only thing it requires is an input and an output. The second ingredient is a dataset. These are our real-world examples; this is essentially what we're going to run the task against to understand how well our application is performing. And how we do that is via scores. The score is really the logic behind your evals. There are a couple of ways to think about this. There's the LLM-as-a-judge type of score: you give it the output and some criteria, and it's able to assess, based on this output, is this excellent, is this fair, is this poor, and those choices then correspond to, say, one, 0.5, or zero. You also have code-based scores; these are maybe a little more heuristic or binary. We can use both of these to aid in the development of that eval and ensure we're building a really good application.

I sort of mentioned this already, but there are two mental models here for evals: offline and online. Offline is pre-production. This is us actually doing that iteration, identifying and resolving issues before deployment; it's where we define those tasks and those scores. Online evals are the real-time tracing of the application in production: logging the model inputs and outputs, the intermediate steps, the tool calls, everything that's happening. It allows us to diagnose performance and reliability issues, latency.
Based on how you instrument your application with Braintrust, we can pull back lots of different metrics related to cost and tokens and duration, and all of these help inform how we build the application. I'll jump into a little bit more of how we can instrument our app for online evals later; first, a little more on the offline side.

Before doing that, maybe just a quick level-set on how to improve. One of the things I've seen in the last few days, in the conversations I've had, is questions like "how do I even get started?" or "what do I do if X?" Another thing I heard from Ankur very early on is: just get started. Create a baseline that you can then iterate and build from. A lot of people get caught up in creating a golden dataset of test cases they can iterate from. You don't necessarily have to do that. Start, build that baseline, establish the foundation you can then improve upon. And this is a really good matrix: if I have good output but a low score, what do I do? Improve your evals. If I have bad output and a high score? Improve your evals, your scoring. It's a really good high-level guide to where to target your efforts when you're building these apps and creating evals.

So let's jump into the actual components I just talked about. Within the Braintrust platform we have a task. Again, this could be a prompt or a full agentic workflow. Very basic: you see that GIF running; this is a prompt within the platform. You specify the underlying model you want to use, you give it a system prompt, and you can also give it access to tools. It supports Mustache templating, so you can pass in variables like the user's question, the input from the user, chat history, or metadata. When we actually want to parse through these logs later, the metadata becomes really beneficial, enabling us to do that in an easy way. Going further, maybe we have a multi-turn chat scenario where we want to add additional messages for the system, the assistant, and the user, plus tool calls; the platform allows that via the plus-messages button, and then you're able to add those different messages to the prompt. We can also add tools. Oftentimes the prompt will need access to something: maybe it's a RAG-type workflow, maybe it's doing web search. Whatever it is, we can use those tools as part of the prompt, so when you encode in the prompt "make sure you use X tool," the prompt has access to that tool while it's running. The last one is a feature that's in beta right now: creating more of an agentic workflow within the Braintrust platform itself. Right now, at least, it's a way to chain prompts together, where the output of one becomes the input of the other. And if you think back to the earlier slide, if I have a prompt that has access to tools, you can create a pretty powerful system here, where you go from a first step that has access to a certain tool and get some output from it.
We can then go to the next step, which has access to maybe some other tools; it's the sort of multi-agent workflow we can create with the underlying tools within those prompts.

The second thing I talked about is datasets. These are the test cases we want to give to the task to run, so we can iterate over them, get the output, and then, going a little further down, actually score it. This is obviously really important when we're running our evals. And when we're trying to pull from production, the actual logs that are happening, we can add those spans, those traces, to the datasets in a really easy way; I can show you what that looks like. If you look there at the bottom, the only thing that's required is the input. You also have the ability to add an expected value: what is the expected output for that input? You can then create a score that compares the output with the expected value; there's a score called Levenshtein that measures the difference between the two. So you can do different things based on what you provide in that dataset. You also have metadata; again, being able to filter on different things, or pulling the dataset into your own codebase and filtering by X, Y, or Z via the metadata, is all possible. I mentioned this a little while ago, but just start small and iterate: you don't have to create a golden dataset to get started here. Just start, and then continue to iterate and build from that baseline. The human-review portion also becomes really powerful here: when we have things being logged in production, humans can go through those logs (there are lots of ways to filter down to the things they should be looking at), and then we can decide to add those things to certain datasets that inform the offline evals we're running.

The last ingredient we need for our evals is our scores. We have code-based scores; these are more like binary conditions, and you can write them in TypeScript or Python. You can do that within the UI, as you see on the bottom left, or you can create the score in your own codebase and push it into Braintrust so we can use it in the platform, and other users who maybe aren't in the codebase can use that score as well. The other type of score we have access to is LLM-as-a-judge. This allows us to use an LLM to judge the output. We give it a set of criteria that indicates what a good, fair, or bad result looks like — you get to decide what that is — and it says if it's good, score a one; if it's bad, score a zero. This starts to create the scores we can use in both the offline and the online sense. The other thing to call out here is that we've internally built a package called autoevals. This is something you can pull into your project: out-of-the-box scorers, both LLM-as-a-judge and code-based, and it lets you get started very, very quickly.
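As a sketch of those pieces in TypeScript (the row contents and the `ends_with_period` check are made-up examples; `Levenshtein` and `LLMClassifierFromTemplate` are assumed to come from the autoevals package described above):

```typescript
import { Levenshtein, LLMClassifierFromTemplate } from "autoevals";

// A dataset row needs at least `input`; `expected` and `metadata` are optional
// but unlock comparison scorers and filtering later.
const row = {
  input: "Summarize the release notes for v1.2",
  expected: "v1.2 fixes the pagination bug and adds CSV export.",
  metadata: { source: "production-log", reviewed: true },
};

// Code-based scorer: cheap, deterministic, easy to reason about.
const endsWithPeriod = ({ output }: { output: string }) => ({
  name: "ends_with_period",
  score: output.trim().endsWith(".") ? 1 : 0,
});

// LLM-as-a-judge scorer: named choices map to numeric scores (excellent/fair/poor).
const summaryQuality = LLMClassifierFromTemplate({
  name: "SummaryQuality",
  promptTemplate:
    "Question: {{input}}\nAnswer: {{output}}\n\nRate the answer as Excellent, Fair, or Poor.",
  choiceScores: { Excellent: 1, Fair: 0.5, Poor: 0 },
  useCoT: true,
});

// Out-of-the-box string-similarity scorer from autoevals, e.g.:
// await Levenshtein({ output, expected: row.expected });
```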
Another thing I heard Ankur mention: starting with Levenshtein may not be the best score in a lot of cases, but again, it establishes a baseline with very little development work, and it creates something you can build from; now you have a direction to go in to build a more custom score.

Here are some tips we've heard from our customers that are important to think about. A lot of our customers use higher-quality models for scoring, even if the prompt uses a cheaper model. It just makes a lot of sense to run the application on the cheaper model but use the more expensive one to actually go score it. Also, break your scoring into very focused areas. The example I'll show is an application that generates a changelog from a series of commits. I could create one score that says "assess my accuracy, my formatting, and my correctness," or I could create three different scores that assess accuracy, then formatting, then correctness. Have your scores be very targeted to the thing they're supposed to be doing. Test your score prompts in the playground before use, and avoid overloading the score prompt with context; focus it on the relevant input and output.

A couple of things here. Over on the left we have our playgrounds. This is where we do that rapid iteration: we can pull in those prompts, pull in those agents, add our datasets, and add our scores, and we can click run, and it will churn through the dataset we've defined and give you a sense of how well your task is performing against it with the scores we defined. This is the place where developers and PMs — we even have a healthcare company with doctors coming into the platform — interact with the playground and even do human review as well; it depends a bit on the organization. The thing on the right is our experiments. This is our snapshot-in-time view of those evals. Imagine, as we're doing this development, we're trying to understand over the last month or so: are we getting better? The changes we're making, the model changes, whatever it is, are we improving our application? Experiments are a really great way to understand that. Also really important to call out: you can see in the bottom right that evals can run from the application, the Braintrust platform itself, as well as via the SDK.

Cool. Maybe a really quick demo, because nobody likes looking at slides all the time; I certainly don't. If you haven't seen Braintrust yet, this is maybe a good quick look. The idea is that I have this application; I'll show you over here. You give it a GitHub repository URL, it grabs the most recent commits, and then it creates a changelog from them. Once it completes, you can even provide some user feedback. This is the thing we want to evaluate. So what I can do is go into my playground; this is the place where I can start to run those experiments, or start to iterate on the prompt I have for my project. I've actually loaded in two different prompts.
Maybe before going into the playground: I've created these two prompts within my codebase and pushed them into the Braintrust platform. I've also created a dataset in that codebase, and I've also created some scores. These are the ingredients we need to run our evals. Now that we have those different components, we're able to start iterating here. So I'll create a net-new playground and load in one of these prompts. Here's my first prompt; it has a model associated with it. What becomes really cool here is the ability to iterate on the underlying model. A lot of us have access to a lot of providers, and we want to understand: if I change this, or if a provider adds a new model, what is the impact on my application? So I can duplicate this prompt and maybe change this one to GPT-4.1. Before I run it, I have to add all of my components, so I'll add my dataset and then the different scores I've configured for my changelog. Then I can click run, and now we'll see the effect of changing the model for this particular task, with the scores I've configured, against this dataset. This will churn through all of these in parallel and we'll start to get some results back.

There are lots of ways to look at this data. I always like coming over to the summary layout, because I can see: this is my base task right here, and then my comparison task. It looks like, on average, the base is performing a little better than the comparison on my completeness score; it's faring a little worse on my accuracy score; and both of them are at 0% on my formatting, so I probably have some work to do there. But you can start to see how you can use this type of interface to iterate very quickly.

The other thing, which maybe I shouldn't do but can't resist because we just released it today, is this new Loop feature. Imagine you're a user within Braintrust; before this, you would manually iterate here, creating net-new prompts, making modifications, changing the model. What if you could now use AI to go do that for you? In a sort of Cursor-like interface, we can ask it to optimize a prompt, and I think the really unique thing here is that it has access to those evaluation results. So when it goes to change that prompt, it changes the prompt, runs the evaluation, and understands whether it got better relative to the scores we defined. You can see it's going to go through here; it'll fetch some eval results. You'll probably see a diff here very soon. If we don't, I won't hang out here too long, but I do want to highlight one of the things we're releasing that really enables our users to iterate quickly. So here's my change; we can click accept, and then it will actually go out and run that eval again. Or it would — I think I have an issue with my Anthropic API keys. But the idea is that we can create a very rapid, iterative feedback loop here within the playground. The other thing is that we can run these as experiments.
Experiments are where we can start to create those snapshots in time of the eval, and again see, as I make changes to the application, how it has performed over time. I want to make sure I don't decrease the performance of my scores relative to the last time it ran, or relative to the last month, six months, whatever it is we're tracking.

Cool. So that was a very brief intro to evals via the UI. Just to summarize: we need a task, we need a dataset, and we need at least one score. We can pull those into the playground and start to iterate, and we can save the results via experiments. Now we have a way to understand how well this application is performing. It's no longer qualitative — no more "hey, I think this got better, that output looks better" — there's actual rigor behind it now.

Customers often ask, though: I don't really want to use the platform as much; I'd rather drive this from my codebase. Is that possible? It is. We have a Python SDK and a TypeScript SDK; there are some others as well — Go, Java, Kotlin — but for the most part our users are using Python or TypeScript. Here are a couple of examples of what this might look like from an SDK perspective. Actually, if you all aren't opposed to looking at some code, here's a really basic example of defining a prompt within my codebase and then pushing it into Braintrust, just leveraging the Python SDK. Another example over here: creating a score. You give it the things it's looking for, but now I've defined this score within my codebase, and it's version controlled. (The prompts you create within the UI are version controlled as well.) This is just another way to start interacting with Braintrust. So: scores, datasets, and prompts up here as well, I believe. Here's my eval dataset, here's my changelog prompt. Being able to start from the codebase and push them into the platform is possible; it just depends on where the organization wants to start.

That's the components side: define those assets in code, run a Braintrust push, and now you have access to them in your Braintrust library. The other piece is actually defining the eval code, which looks slightly different. Over here we have our Eval; this comes from the Braintrust SDK. It's looking for exactly the same things I just described: a dataset, which we can pull from Braintrust itself; the task we want to invoke; and the scores. Defining this within your codebase is certainly possible, and then I can run a command that actually runs that eval within Braintrust. From there, I go into Braintrust, see the eval running, and this becomes an experiment I can view over time. So you saw two different types of workflows here, catering to maybe two different personas, or to the way organizations want to work; it's up to them. Braintrust is very flexible in how we let our users consume the platform.
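A minimal sketch of that eval-in-code workflow in TypeScript (the project and dataset names and the `generateChangelog` task are hypothetical; `Eval` and `initDataset` are assumed from the Braintrust SDK, and `Levenshtein` from autoevals):

```typescript
// changelog.eval.ts -- define the eval next to the application code.
import { Eval, initDataset } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Changelog generator", {
  // Pull the dataset that was pushed to (or curated in) Braintrust.
  data: initDataset("Changelog generator", { dataset: "eval-dataset" }),
  // The task is the application's own code.
  task: async (repoUrl: string) => generateChangelog(repoUrl), // hypothetical app function
  // Scores: out-of-the-box or custom.
  scores: [Levenshtein],
});
```

Running this through the SDK's eval command (something like `npx braintrust eval changelog.eval.ts`) records the run as an experiment you can then compare over time in the UI.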
I probably jumped ahead a little bit, but this is a recap of what I just showed you. From your code, you create your prompts, your scores, your datasets, and you can push them in. Maybe it's important to highlight why you would do this: you want to source-control your prompts, and the big one to call out is online scoring (I have a section diving deeper into that in a bit). If you want to use the scores you've defined against production data, you should push them into Braintrust so we can create online scores and understand how the application is performing in production relative to the same scores we're using in our offline evals. The other piece is what I just showed you: the eval itself within our code, defining that dataset, that task, those scores. It becomes very easy to connect these two things; it's just up to you to decide where you want to do it. The other thing to call out here is that this can be run via CI/CD. We do have customers who want to run their evals as part of the CI process, so they understand in a more automated way whether the scores for A, B, and C, whatever they've configured, have gotten better or worse; it becomes a check as part of CI. If you look in our documentation, there's a GitHub Action example that shows how you could set this up.

Cool. Let's move to production. Moving to production entails setting up logging; it entails instrumenting our application with Braintrust code: being able to say, I want to wrap this LLM client, I want to wrap this particular function when it goes to call that tool. It's very easy to do. But why should you do it? I've probably said it numerous times here, but we want to measure quality on live traffic. We actually want to understand how well our application is performing against those scores. They're great during offline evals — our aid in ensuring we build really good applications and don't create regressions — but it's also really important to monitor live traffic. The other really important thing is the flywheel effect it creates. We have these datasets we use to inform our offline evals, and it's very easy to take the logs generated in production and add them back to those datasets. This also speaks to the human-review component, where we want to bring humans in: they can review the logs that are relevant (maybe user feedback equals zero, maybe there's a comment, whatever it is), filter down to those particular things, and, as they find interesting test cases, very easily add them back to the datasets we use in our offline evals. I think the feedback loop, the flywheel effect this creates, is one of the really fundamental value props of the platform.

So how do we do this? There are a couple of different ways. We're first going to initialize a logger. This just authenticates us into Braintrust and points us at a project. You may have seen, when I opened up the platform, that I had numerous projects in there.
You can almost think of a project as a container for a feature. You probably have multiple AI features you're building: I want a container for feature A, for its prompts, its scores, its datasets. You can certainly use those things across projects, but it's a really good way to containerize what's important for that feature. Then you can start really basic: you can wrap an LLM client. When you saw some of those metrics earlier — tokens, duration, cost — very basically, in the code here, I just wrapped the OpenAI client, and now I'm ingesting all of those metrics into my logs. That's the easiest way to get started. You'll probably want to do a little more: maybe you want to understand when the LLM invokes a tool, so I want to trace it; I can add a trace decorator on top of a function. I can even use some of Braintrust's low-level span elements to create custom logs, where I customize the input, the output, and the metadata we log to that span. So you can start very basic by wrapping a client and then go all the way down to the individual span, specifying its input and output (there's a small sketch of these pieces just below).

This leads us to online scoring. I talked a little bit about this: as our logs come in, we can configure, within the platform, the scores we want to run, and we can specify a sampling rate, so we don't necessarily run that score across every single log that comes in; maybe it's 10%, 20%, and so on. But it creates that really tight feedback loop I've been talking about. It's also worth mentioning early regression alerts: we can create automations within the Braintrust platform, so if my score drops below a certain threshold, we create an alert with our automation feature. And custom views: there's a lot of really rich information in these logs, and it becomes really important, again for the human-review component, to filter them down to the things people care about. We can create custom views within Braintrust with the appropriate filters, and then it's very easy for that human to go into what we call human-review mode and parse through the logs that are going to be most meaningful to them.

Let me connect some of those dots. Again, showing you some code — maybe good, maybe bad, but I'm guessing there are some technical people in the room who don't mind. You may have seen in one of those slides the Vercel AI SDK: I want to wrap the AI SDK model. This allows us to create all of those metrics within Braintrust with zero lift from us as developers; it becomes really easy to do. You can also see where I've specified the span itself: I actually want to define its inputs and outputs. The reason you would do that is because you have a specific dataset with a structure that you want to make sure this maps to.
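Here's a small sketch of those instrumentation layers, assuming the Braintrust SDK's `initLogger`, `wrapOpenAI`, and `traced` helpers (the project name and the `getCommitsFromGitHub` helper are hypothetical; there's a similar `wrapAISDKModel` wrapper mentioned above for the Vercel AI SDK):

```typescript
import { initLogger, wrapOpenAI, traced } from "braintrust";
import OpenAI from "openai";

// Point logs at a project -- the "container for the feature" described above.
const logger = initLogger({ projectName: "changelog-generator" }); // hypothetical name

// Easiest start: wrap the LLM client so tokens, latency, and cost land in the logs.
const client = wrapOpenAI(new OpenAI());

// One level deeper: trace an individual function (e.g. a tool call) as its own
// span, customizing the input and output that get logged.
async function fetchCommits(repoUrl: string) {
  return traced(
    async (span) => {
      const commits = await getCommitsFromGitHub(repoUrl); // hypothetical helper
      span.log({ input: repoUrl, output: commits });
      return commits;
    },
    { name: "fetchCommits" }
  );
}
```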
So when you are within those logs parsing through them, it becomes really easy to add those spans back to that data set. Ensuring that the data structure is consistent across offline and online becomes really important, again to create that feedback loop. So that's a very high-level view of how we can start to create those spans. Now that we do, we can go into the platform and start to configure our online scoring. Here, within this configuration pane, I can click online scoring and create a new rule. Here's where we can add different scores. Obviously I have a few here that I've been using for the offline evals. I don't necessarily need to select all of them, but I certainly can. And then I want to apply a sampling rate. I want to give you an example of what this looks like, so I'm going to do 100%. The other thing to call out here is that you can apply these to individual spans themselves and not just the entire root span. Where this becomes beneficial is when you are invoking tool calls or a RAG workflow and you actually want to create a score on whether the thing it gave back is relevant to the user's query. We can create a score specifically for that and specify what that span is here. So it's very flexible in how you apply these scores to the things that are happening online. Now when I come back to the application and run this again, creating that change log, you'll start to see this show up within the logs, and then you'll start to see these scores be generated. And again, this is where you can start to understand over time in production how these things are doing. How are they faring? Where can we get better? Now we can connect the things that are happening in our offline evals with the things that are happening online. The other thing to call out here is the feedback mechanism. We certainly have the ability to do human review, but oftentimes you want your users to provide feedback as well. This is just a basic example of a thumbs up / thumbs down; you can even provide a comment here, and this can be logged to Braintrust. So I should see over here my user feedback: here's my comment, and then I have my user feedback score. But now I can also do something like this: maybe I want to filter my logs down to where user feedback is zero. So I click that button and change this to zero. I don't have any rows like that yet, but now I can save this as a view, and people who are using this for human review can filter down to where user feedback equals zero and figure out what's going on: what are the things that fell down within this application that we need to go fix? The other thing I'll highlight here is the human review component. You click that button, or you can just hit R, and it opens up this different pane of your log. It's a pared-down version of what you just saw; it's a little bit easier for a human to go through and actually look at that input and that output. And you, as a user of Braintrust, can configure the human review scores that you would like to use. So I have this "add something here"; maybe this is a little bit more free text.
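For the thumbs up / thumbs down feedback shown a moment ago, here is a hedged sketch of how it might be attached to a logged request, assuming the SDK exposes a logFeedback method on the logger (the span-id plumbing and the score name are illustrative):

```typescript
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "changelog-generator" });

// Hypothetical handler for the thumbs up / thumbs down widget. `spanId` is the
// id of the logged request being rated, captured when the request was originally logged.
async function onUserFeedback(spanId: string, thumbsUp: boolean, comment?: string) {
  logger.logFeedback({
    id: spanId,
    scores: { user_feedback: thumbsUp ? 1 : 0 }, // powers the "user_feedback = 0" filter
    comment,
  });
}
```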
I have a "better" score. Again, these are the things you can add in the platform that map to the scores you want your humans to add to those logs. Just really quickly, I'll highlight some of these things here. This is what it starts to look like when you instrument your application with those different wrappers or trace functions. I'm able to understand, at a very granular level, the things it's doing. I essentially have these tool calls where it's going out and grabbing the commits from GitHub; it's understanding what the latest release is, fetching the commits from that latest release, and now I can generate that change log. But the really unique thing here, and maybe a different example of this, is that I can start to score the individual things that are happening. So this is a different application as an example. If I open this up, I have this conversational analytics application: a user can ask a question and it returns back some data. But this application goes through various steps. The first step is to rephrase the question that the user asked. Imagine there's this chat history we load in as input, and the LLM needs to rephrase that user question. If the LLM does a really bad job of rephrasing the question, everything downstream of it will fall down; we're probably not going to get a right answer. So what I can do is create a score specifically for that span to understand how well the LLM did in rephrasing that question. I can also understand the intent that the LLM was able to derive from that question; is that right? As you start to think of the more complex types of applications that you build, you need to be able to understand the individual steps that are happening, and Braintrust allows for that very easily via these scores, and then lets you apply them not only while you're in offline-eval mode but also online. We want to understand these logs and be able to apply these scores at the individual span level; this becomes pretty powerful as well. I think I actually stole a little bit from my next section, my human-in-the-loop walkthrough. Maybe just another call-out if you happened to be at one of our workshops on Tuesday: Sarah from Notion, who's a Braintrust customer, talked a little bit about how they think about human in the loop. I think it's important to consider the size of her organization and what they're doing. She mentioned that she has a special type of role that they use for human-in-the-loop interaction; it's almost like a product manager mixed with an LLM specialist. They're the people who are going through and doing those human reviews. For smaller organizations, she made the comment that it actually makes a lot of sense for some engineers to go through and do this as well. It becomes really powerful to pair the automation with the human component of this; this is not going to go away. I think it adds value to the process. Again, I think I just stole from myself the "why this matters": this is really critical for the quality and the reliability of your application, and it provides that ground truth for what you're doing.
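The rephrase-the-question step described above is the kind of span that can get its own LLM-as-a-judge score; here is a rough sketch using autoevals' LLMClassifierFromTemplate (the prompt wording and choice scores are invented for illustration, and this scorer would be pointed at the rephrase span when configuring online scoring):

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// Judges whether the standalone rephrased question preserves the meaning of the
// user's original question given the chat history.
const RephraseQuality = LLMClassifierFromTemplate({
  name: "RephraseQuality",
  promptTemplate: `Chat history:
{{input}}

Rephrased standalone question:
{{output}}

Does the rephrased question faithfully capture what the user was asking?
a) Yes, it is faithful and self-contained
b) No, it loses or changes important details`,
  choiceScores: { a: 1, b: 0 },
  useCoT: true,
});
```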
Two types of human-in-the-loop interactions here. I walked you through human review. Give me one second and I'll call on you. The two types: human review, being able to create an interface within Braintrust that allows a user to parse through the logs in a really easy manner, as well as configuring scores that allow them to add the relevant scores to a particular log; and then user feedback, which is actually coming from our users in the application. Again, being able to create views on top of that feedback then powers the human review and creates the flywheel effect that we want. That's all I have today. Appreciate you all coming out here and listening to me. But yeah, you have a question. Thanks. Yeah, the question is around how we're using human review and some of the logs to inform the offline-eval portion of this; that's largely the goal. One thing I maybe didn't highlight here: back within Braintrust, I'm going to go back to my initial project. Imagine now that we have all of these logs and we've filtered them down to a particular... oh, are we still showing on the screen? Awesome, thank you. So imagine we have this process now where we're doing that human review and we've filtered down to the records that are meaningful for whatever reason. It becomes really easy again to connect what's happening within production: maybe I select all of these rows, or I select individual rows, but I can add these back to the data set that we're using within those offline evals. I think I've said this a hundred times over this conference: this flywheel effect is what's missing oftentimes when we're building these AI applications, and it's what Braintrust allows for really seamlessly. Yeah. I have two questions. The first one is about production: is it possible to have multiple models in production and compare how they behave? Yeah, I don't see why not. My guess is that in the underlying application you're swapping them out. Like having A/B tests, you know, I can have two or three or four and easily compare? Absolutely. Yeah. Let's see if I have an example here. You're able to group some of these scores; maybe this is an example of what you're talking about. So maybe within production we have different models running, and this view here allows us to understand the models that we're using under the hood. You could do this within production as well and do that A/B testing. Cool. My second question is about humans in the loop. Let's suppose that I have multiple humans and they behave slightly differently as scorers. Do you have anything, or what is the vision, for that? Is there a way I can compare how they're scoring, or not really? So different users can have different criteria for scoring. Maybe the first thing I would say to that is there should be a rubric for your users who are interacting with human review so you're not creating that divergence. You certainly have the ability to see who is scoring different things within the platform.
I'm not sure if you're able to pull that out as a data set to assess the differences there, but maybe before it gets to that place, have a rubric, have a guideline of what scoring looks like for your humans. Okay, thank you. Yeah, of course. Hi. So the scores I'm used to working with for LLM-as-a-judge are relativistic, right? They can't tell you whether answer relevancy is good or bad for a single run, but they can tell you how it compares to a previous iteration on the same test set, for example. Do you use LLM-as-a-judge scores online, and are they relativistic like that, or do you have some way to say this is a good answer in and of itself for this sample? Because you have new data coming in, right? Yeah, I think a lot of our customers who are thinking about this are almost doing evals on their evals: trying to understand whether the LLM as a judge actually did a good job there. When that score actually runs, there's a rationale behind it, and so you can run an eval of those LLM-as-a-judge scores. I think Sarah from Notion, in our workshop, described a process like that within Notion, but that's where I would aim it. Okay, cool. Awesome. Are any of your customers doing evals before they launch? I'm working with a government; they don't want to launch until we show some accuracy levels. So we're getting our subject matter experts to enter in all the questions that they have; they have huge data sets of thousands of questions, believe me, as a government. And then we're using measures like you're talking about. Do you have a way to do that? I guess it's the same, is it? So what you're describing is what we call offline evals. This is development: we can do this testing before we get into production. This is what I was talking about with establishing that baseline, using those scores, using that data set that you've already created, and this all happens before we get into production. And then, like one of the things I heard from somebody earlier, one of the challenging parts of building an AI application is establishing trust, creating trust in this thing. That's part of what this is: it's showing those people the scores of that application. You start to iterate on this thing; maybe it starts at 20%, then it goes to 30, then 40, and so on. That, to me, is the thing you use to create that trust and create the groundswell to push it into production. Okay. Yeah, that is what we're trying to do, but I wondered if the tool does that. Thank you. Yeah, of course. Time for one more. Thanks. Quick question. I love the CI/CD components. We're trying to build ML as a platform for our team, so we get into evals and stuff like that. How much of the monitoring dashboard you have in Braintrust can actually have its data taken out and posted in a unified dashboard somewhere else? Yeah, all of this is available via the SDK: you can pull down experiments, you can pull down data sets.
So you're able to pull this down. We have a customer that is actually building their own UI on top of the SDK itself: they built their own components utilizing the SDK, pulling the things that we've logged and the experiments that we have in the application into their own UI. So certainly possible. Cool, thanks everybody. [Applause] CEO as well as... I don't... Yeah, I'm going to go. All right. Yeah. Check check check. Hey, nice job. Good to see you. Not too bad. Was tolerable. Okay, great. It's time. Yeah. Okay. So, a couple of tips on your video, and testing one, two, three. Awesome. So make sure you have no copyrighted material in your slides, because it is being streamed on YouTube, and a couple of people had a slide with a stock photo. I don't think we have any stock photos. Great. Are you both speaking? He's speaking. I'm the handler. Oh, okay. Great. The other thing is, because we are streaming, it's mostly a slide and then I have a closeup of you, so hang around the podium. I will. You know, some people like to pace. No, I don't like to pace. But what happens is it looks like this. I'm just going to put my backpack over here. Oh, over there. Perfect. Yeah, I'm not cool enough to pace. And we also start on time, out of respect for the next speaker, and that includes the Q&A. Excellent. I don't think there's any Q&A. We'll see. And if somebody does shout something from the audience, because people do, just incorporate the question in your answer, please; somebody on the stream might have the same question. You are? Yeah, I'm John. Good to meet you. Yes. Let's make sure the computer's working. Okay. And I do have a hard line for the internet. Great. I don't think I need any internet. Let me hide some windows before it resizes again. It's not sharing right now, is it? No, it's not. Okay. I just want to hold. Perfect. Okay. What's up, Alan? Thank you. Oh, hey. I saw Sydney yesterday or the day before. Yeah. How's the conference been for you guys so far? Turn this off basically right now. Right. Great. Nice. Slip it in the pocket. Oh, yeah. Like you guys on TV; it's all you see, like this. Mr. President, get down. Yeah. Thank you. Are you guys doing an event tonight? Hello. Hello. Okay. Is it related to the conference? Okay. Okay. Nice. Oh, I'm on now. Well, okay. Whatever. Okay, I'll stop the small talk. So, how many people saw Manu's keynote? Nice. I'm Manu's less cool brother. Awesome. All right, welcome y'all. Welcome again to the second part of the evals track here at the AI Engineer World's Fair conference. We're doing something interesting in this room where I've actually asked the AI to write up the introductions for our speakers. So, without further ado, I've asked Gemini 2.5 Flash Preview to introduce Ankur Goyal here. It says: please welcome the CEO of Braintrust, drawing from his extensive experience including leading the product for Google Cloud AI Platform. I don't think that's true. So Ankur is here to share invaluable lessons from the eval front lines for every AI engineer. So without further ado, thank you, Gemini. Awesome. Really excited to share some stuff with you all today. I'm going to talk about five hard-earned lessons that we've learned over the last year or so about evals.
So first, just a little bit about us and where we learned some of these lessons. At Braintrust we build a product that helps people do evals and observability. We work with some really great companies that are building products in the space. We pulled up some stats, and I find some of these numbers kind of interesting. Across all of our users, and this includes free users who just sign up to use our free product, the average org runs almost 13 evals a day. I think that's pretty cool. It's really what our product is about, but it's great to see just how much people are evaluating stuff. Some of the most advanced organizations that we work with are actually running more than 3,000 evals a day; they're really testing and experimenting with quite a bit. And some of the users who are deep into evaluation are spending around two hours a day in their data, understanding what's working and what's not working. It just speaks to how much impact evaluation can have when you really take it seriously and build it into your process. So with that said, let's talk about some of the interesting things we've learned over time. The first thing: I think it's super important for you to understand and define whether evals are actually providing value for your organization or not. I tried to come up with three signs that you should look for. The first is: if a new model comes out, you should be prepared, via your evals, to launch an update to your product within 24 hours that incorporates the new model. Sarah from Notion talked about this yesterday, specifically that for the past several model releases, every time something comes out, Notion has been able to incorporate the new model within 24 hours. I think that's a really good sign of success. If you can't do that, it means you have some work to do on your evals. Another sign of success: if a user complains about something, do you have a very clear and straightforward path to take their complaint and add it into your evals? If you do, then you have a shot at actually incorporating user feedback, pulling it into your evals, and ultimately doing better. If you don't, then you're going to lose a lot of valuable information into the ether. So again, I think this is a really important threshold or milestone to hit. And the last one, which I'm going to talk about a little bit more throughout the presentation: you should really start using evals to play offense and understand which use cases you can solve, and how well you can solve them, before you actually ship things; not like unit tests, which only allow you to test for regressions. And so if you really adopt evals, then before you launch a new product you have a really good idea of how well the product might work, given what your evals say. The second lesson is that great evals have to be engineered. They don't just come for free with synthetic data sets and random LLM-as-a-judge scorers that you read about online. There are maybe two ways of thinking about this. There's no data set that is perfectly aligned with reality. In the cases where there is, there's basically nothing to do and the use case already works; there are a few that are kind of like that, like solving competition math problems, for example.
But for most real-world use cases, any data set that you can come up with ahead of time is not going to represent what users are actually experiencing. I think the best data sets are those that you can continuously reconcile as you actually experience what happens in reality, and doing that well requires quite a bit of engineering. Of course Braintrust can help you with that, but the point is that you have to think about a data set as an engineering problem, not just something that's given to you. And the same is true with scorers. A lot of people we talk to ask, hey, what scorers does Braintrust come with, and how can we use those so that we don't need to think about scoring? We actually have a really powerful open-source library called autoevals, but it's open source and flexible for a reason, which is that every company we work with that's sufficiently advanced is writing their own scoring functions and modifying them constantly. One way to think about scorers is that they're like a spec, or a PRD, for your AI application. If you think about them that way, one, it actually justifies making an investment in scoring beyond just using something off the shelf; and two, hopefully it's fairly obvious that if you just use an open-source or generic scorer, that's a spec for someone else's project, not yours. There's been a real shift towards context in prompts that's not just the system prompt that you write. I actually think traditional prompt engineering, and people say this in different ways, is evolving quite a bit, and it's very important to think about context, not just a prompt. So this is an example of what a modern prompt looks like for an agent: usually you have a system prompt and then a for loop which runs LLM calls, issues tool calls, incorporates the tool results into the prompt, and then iterates and iterates. I actually took a few trajectories from agents that we see in the wild and summarized these numbers, and as you can see, the vast majority of the tokens in the average prompt are not from the system prompt. So yes, it's very important to write a good system prompt and continue to improve it, but if you're not very precise about how you define tools and how you define their outputs, then you're leaving a lot on the table. And one of the most important things we've learned together with some customers is that you can't just take tools as a reflection of your APIs or your product as it exists today. You have to think about tools in terms of what the LLM wants to see, and how you can use exactly what you present to the LLM to make it work really well. In most projects it's actually very disruptive when you write good tools; it's not something that's just an API layer on top of the stuff you already have. And the same is true with their outputs. There's one example we worked on recently for an internal project where shifting the output of a tool from JSON to YAML actually made a significant difference. I know that's a little bit of a meme in the AI universe, but it's just so much more token-efficient and easier for an LLM to look at YAML-shaped data while doing analysis than extremely verbose JSON.
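As a small illustration of that JSON-versus-YAML point, here is a sketch that serializes the same hypothetical tool result both ways and compares rough footprints (js-yaml is assumed for the YAML serialization, and the token estimate is a crude character-based proxy, not a real tokenizer):

```typescript
import yaml from "js-yaml";

// The same tool result, serialized two ways before being appended to the prompt.
const toolResult = {
  rows: [
    { region: "EMEA", revenue: 1204331.55, growth_pct: 12.4 },
    { region: "APAC", revenue: 983220.1, growth_pct: 8.9 },
  ],
  generated_at: "2025-06-04T12:00:00Z",
};

const asJson = JSON.stringify(toolResult, null, 2);
const asYaml = yaml.dump(toolResult);

// Crude proxy for token count (chars / 4); use a real tokenizer in practice.
console.log("JSON chars:", asJson.length, "~tokens:", Math.round(asJson.length / 4));
console.log("YAML chars:", asYaml.length, "~tokens:", Math.round(asYaml.length / 4));
// YAML drops the braces, quotes, and most commas, which is where the savings come from.
```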
Now, if you're writing code and you're plugging something into a charting library, it makes no difference, because to JavaScript, YAML and JSON are both structured data. But to an LLM, they're very different. So I think you have to be very thoughtful about how you construct the definition of a tool and how you construct its output for the LLM to maximally benefit from it. One of the most important things we've learned, and I would actually credit some of the folks at Replit for really pioneering this pattern, is that every time a new model comes out, everything might change. You need to engineer your product, engineer your team, engineer your mindset so that when a new model comes out, if it changes everything for you, you can jump on that opportunity and ship something that maybe wasn't possible before. I'm going to show you some numbers for a product feature that we're actually launching, and I'm going to show you a little bit of it today. We've had an eval for a while that tells us how well this feature might work, and we run it every few months, and you can see it wasn't that long ago that GPT-4o was the best model out there. But things have changed: progressively, GPT-4.1 did a little bit better, Claude 3.7 Sonnet is much better, and Claude 4 Sonnet is even more remarkably better. What that's meant for us is that this feature, which at 10% would really not be viable for our users, suddenly becomes viable. Claude 4 Sonnet actually came out two weeks ago, and we're shipping the first version of this feature today, just two weeks later. We were able to jump on that opportunity because we ran this eval; we were ready to do it, and we saw that, okay, great, we've actually finally crossed this threshold. So everyone that I personally work with or talk to, I encourage to create evals that are very ambitious and likely not viable with today's models, and to construct them in a way that when a new model comes out, you can just plug the new model in and try it. In Braintrust we have this tool called the Braintrust proxy. There are a lot of similar tools; you could use ours or you could use something else. But really the point is that you don't need to change any code to work across model providers. So, you know, Google just launched the newest version of Gemini. Actually, Gemini 2.5 Pro 05-20 scores 1% on this benchmark, so we didn't even put it on here. But maybe the thing they launched today actually does a lot better; we can find out with just a few keystrokes, maybe right after this talk. And the last thing: it's super important, if you think about optimizing your prompts, to optimize the entire system. That means thinking holistically about your AI system as the data that you use for your evals, the task, which is the prompt, the agentic system, tools, etc., and the scoring functions. Every time you think about making your app better, you need to think about improving this overall system. We actually ran a benchmark, the same benchmark that I showed previously.
It auto-optimizes prompts using an LLM, and we ran it once by just giving it the prompt and saying, hey, please optimize the prompt, and a second time giving it the prompt, the data set, and the scores and saying, please optimize this whole system. And you can see there's a very dramatic difference; again, something goes from unviable to viable. It's just super important to optimize the entire system, not just the prompt. And actually, this is a new product feature that we are starting to launch today. If you're a Braintrust user, you can go to the feature flag section of Braintrust and turn on a new feature flag called Loop. Loop is this amazing, cool new feature that auto-optimizes your evals directly within Braintrust. You can work in our playground and give it a prompt, a data set, and some scores, and it can actually create prompts, data sets, and scores too, and just work with it. The kinds of things that we've seen work really well are: optimize this prompt; or, what am I missing from this data set that would be really good to test for this use case; or, why is my score so low; or, why is my score so high; or, can you please help me write a score that is harsher than the one I have right now? You can also try it out with different models. As you could see from this, we've definitely seen the best performance with Claude 4 Sonnet and Claude 4 Opus; Opus performs a couple of percentage points better. But we encourage you to try it out with different models: you can use o3, you can use o4-mini, you can use Gemini, and if you're building your own LLM or fine-tuned model, you can try that as well. We're very excited for this. I'm going to talk about it a little bit later, and I'm happy to cover it in Q&A as well, but I really think that the workflow around evals is going to dramatically change now that LLMs are capable of looking at prompts and looking at data and actually making constructive improvements automatically. A lot of the manual labor that went into iterating with evals doesn't need to be there anymore. So it's really exciting; we're excited to ship this and start to get some feedback. So just to recap, five lessons that I think are really important. Effective evals speak for themselves; it's important to understand whether you've reached a point of eval competence in your organization or not. It's okay if you haven't. It's not easy, but it's important to be honest about that and work towards it. When you're working on evals, it's very important to engineer the entire system. So don't just think about the prompt; don't just think about improving the prompt. Please don't just use synthetic data or Hugging Face data sets; I know they're awesome, but please use more than just that. Please don't use off-the-shelf scorers only; write your own. Think very deliberately about how you can craft the spec of what you're working on into your scoring functions. Think very carefully about context. In particular, what helps me personally is to think about writing tools the way I would think about writing a prompt: it's my opportunity to communicate with an LLM and set it up for success, and how I define the API interface of the tool and how I define its output has a very dramatic impact on that.
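One way to read "scorers are a spec or PRD for your application" is to encode concrete product requirements directly as a scoring function; here is a minimal hand-rolled sketch, with invented requirements standing in for a real spec:

```typescript
// A hand-rolled scorer in the shape Braintrust and autoevals expect: a function
// that receives an eval row and returns a name and a score between 0 and 1.
interface ScorerArgs {
  input: string;
  output: string;
  expected?: string;
}

function ChangelogSpec({ output }: ScorerArgs) {
  // Invented product requirements acting as the "spec":
  const checks = [
    output.length <= 1200,                      // stays concise
    /^#+\s/m.test(output),                      // uses markdown headings
    !/as an ai language model/i.test(output),   // no boilerplate disclaimers
  ];
  return { name: "ChangelogSpec", score: checks.filter(Boolean).length / checks.length };
}
```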
Make sure that you're ready for new models to come out and to just change everything. If a new model comes out, you want to be prepared to know that immediately, ideally the day that it comes out, and also be prepared to rip out everything and replace it with a fundamentally new architecture that takes advantage of that new model. Part of that is obviously having the right evals; part of it is engineering your product in a way that actually allows you to do that. And then finally, when you think about optimizing or improving your eval performance, you have to think about optimizing the whole system: the data and how you get that data, the task itself, which is the prompt, tools, etc., and the scoring functions. And with that, we have some time for Q&A. Yeah, there are two microphones up here, one on the left side, one on the right side; feel free to stand up and ask your questions. Hi, this is Joti. One of your slides said take feedback and turn it into an eval. Are you concerned about overfitting evals at that point, where every piece of feedback then turns into an eval? Oh, that's a great question. Also nice to see you. So the question was: one of the slides was about taking feedback from real data, adding it to a data set, and incorporating it into an eval; are you worried about overfitting? And I think the answer is that I'm actually way more worried about overfitting to the data set without the user's feedback than I am about adjusting the fit to incorporate the user's feedback. The most important thing about a data set is not the state of the data set at any point in time; it is how well you are equipped to reconcile the data set with the reality that you want. And I actually think one of the things that we discourage in the product, and some people complain to us about this, I get it if you're one of those people, is that we don't automatically take user feedback and add it to data sets right now. We actually want a human who has some taste, and maybe can build some intuition about the problem, to find the data points from users that are interesting and add them to the data set. I think that is your opportunity as a user to apply some judgment: oh, okay, this user is trying to do something that should obviously work; it's really sad that it doesn't work in my product; let me add it to the data set so I can make sure it does. Excuse me, you had a slide, I think about the tool descriptions, with some percentages on it. Yeah, this one. What is that? Yeah, so we took a few agents, since we have a lot of traces, and we analyzed the relative number of tokens for different message types. So the system prompt is one message type. Tool definitions are the spec of what tools the model can call. User and assistant are tokens from user and assistant text interactions, and then tool responses are the tokens that the tool generates itself. Oh, this is the percentage of tokens? Correct, this is the relative percentage of those tokens. Yeah. So the point that we're trying to make here is that in modern agentic systems, tools very significantly dominate the token budget of the LLM, and it's very important to think about how you define the definition of tools and how you define their outputs so that you engineer the LLM for success,
not just take your GraphQL API and hand it over as a bunch of tool calls to the LLM. First off, that point about the thumbs down is such a good point. I'm working with the government, and people don't like the answer they got, for example about taxes, and they give it a thumbs down. Yeah. Right. So adding that human aspect is a really good idea. We actually even added a little thing that said: the answer is right, but I just don't like it. That's awesome. But my question is about your point that a new model changes everything. We've updated our models several times, and we use Claude and OpenAI, and we haven't found huge differences, other than recently someone really cheap wanted to use 4.1 mini, and it seemed to ignore everything; I swear it ignored the system prompt completely. Yeah. But what kind of things, when you say it changes everything, can you tell me a little more about what kind of changes you're seeing? For sure. I think the use case that we just shipped with Loop is a really good example of that. This is a very ambitious agent: it's looking at prompts and data sets and scores and automatically optimizing the prompts based on the data sets and scores. And this is something that we wrote a benchmark for a while ago and ran with every consecutive model launch, and the numbers looked more like what you see for GPT-4o for a very long time. This isn't true for every benchmark. As part of this exercise, we actually have a bunch of evals that Loop optimizes; that's our eval set. And there are some evals, like taking movie quotes and figuring out what movie they're coming from, that have worked really well since GPT-3.5. So there are certain use cases where it just doesn't matter. There are other use cases that are so ambitious that they just don't work today. And I think you want to create evals so that if there's something ambitious that you want to do in the future, you are very well prepared, when a new model comes out, to just push a button and find that out. Okay, thank you. So we are out of time, but I'm sure Ankur would be happy to take more questions outside as well. For sure. So, in the interest of time, we'll call it done on that one. And yeah, if you don't mind. All right, let's hear a round of applause. So, while you get going, I'll introduce you. All right. So, in case people are not aware, I fed the information about the speakers to an AI, and we're doing an eval on the output of the AI. So here I have GPT-4o mini, and it says: please welcome John Dickerson. It says it's pronounced "Dick Earth son." I don't know why, but okay. The CEO of Mozilla AI, exclamation point, with a rich background as a co-founder of Arthur AI. Please let us know if it's true or not. That is true. And a former tenured professor, John is at the forefront of building an open-source AI ecosystem and has extensive experience in responsible AI practices. Today, he'll share insights on the urgent need for robust AI evaluations in enterprises, drawing from a decade of experience in the field. That's roughly right. Thank you. All right, cool. Thanks, everyone, for being here. I'm going to give this talk mostly from the point of view of having been a co-founder and chief scientist at Arthur AI, where I was for six years prior to joining Mozilla AI as CEO.
I do want to say Mozilla AI operates in the open-source world, where we're providing open-source AI tooling and supporting the open-source AI stack. Our end goal is to enable the open-source community to be at the same table as a Sam Altman when talking about AI moving forward. So if you're interested in that, that's not what this talk is going to be about, but we can talk about it offline. This talk is going to be about 2025 finally being the year of the evals. And as was written, or as was spoken, in my introduction, I've been in the space for a very long time. Arthur AI, for example, was, and still is, in observability, evaluation, and security, in both traditional ML and AI, and then into the deep learning revolution, then the GenAI revolution, and then the agentic revolution. And I think we're finally at the point where all of these companies are going to start seeing hockey-stick growth, which is exciting. So the thesis of this talk: one thing is that I see AI/ML monitoring and evaluation as two sides of the same sword, or ruler. You can't do monitoring or observability without being able to measure, and measurement is the core functionality of evaluation. This was not top of mind with the C-suite until two things happened concurrently. One is that AI became a thing that people who aren't a CIO or CTO could understand; the CEOs, the CFOs, the CISOs began to understand it, basically when ChatGPT came out. And simultaneously there was a perfectly timed budget freeze across the enterprise, at least in the US, that happened due to a fear of an impending recession. This was right before ChatGPT launched; this is like October, November, when most enterprises set up the budget for the next year. At that time there was a freeze, except for money that could be opened up for a specific pet project, and that pet project, because CEOs and CFOs then knew about it, was GenAI. So that happened, and now we have the final vertex on this triangle, which is going to force evaluation to be top of mind this year: we have systems that are now acting for humans, acting for teams, as opposed to just providing inputs into larger systems. So these three things together are why, as you saw with Braintrust, as we're seeing at Arthur, and as you're seeing with Arize AI or Galileo and other big players in this space, we're starting to see big takeoff. Cool. So right now, this is what's happening: the year of the agent. We all hear about it; we're hearing about it here at this conference. Agents are starting to make decisions and take actions, complex steps that lead toward an action, either autonomously or semi-autonomously. As a question on the last talk brought up, bringing humans into the loop is still obviously a very good idea in many systems, but we're getting closer and closer to full automation, and agentic systems are going into deployment now, in enterprises, in SMBs, in pet projects, and so on. And what that means, per that last slide, is this is also the year that we need evals. Flash back to a year ago, and every year before, when we were asking: hey, is this the year of ML? Is this the year of evaluations? Prior to these agentic systems coming out, we would have machine learning models basically spitting out numbers that would then be ingested into a more complex system.
And that complexity would sort of erase the top-of-mind need to think about what's coming out of the model itself. Except for people in this audience; we know that's very important. But when it comes to decision makers, it would often get wrapped up into this opaque box, and that meant the ML part didn't really bubble up beyond the CIO or whatever org the system was going into, which meant that typically that year was not the year for evals, because the need for evals was not obvious to the entire C-suite. Cool. So let's take a quick step back in time. Before November 30th of 2022, the ChatGPT launch, ML monitoring was certainly a thing. Data science teams have long used statistical methods as part of larger systems to understand what's going on; this is core. Like I mentioned, though, there's a tenuous connection to downstream business KPIs, and at the end of the day, that's what gets your product bought in the enterprise: being able to make a sale about dollars saved or dollars earned. So it was hard to connect the machine learning components specifically to a downstream business KPI. There was a lot of lip service around AI/ML and its ROI from the C-suite, including CIOs and CEOs, but that was just lip service; in our experience, at least, it was still basically selling into the CIO, which made it hard to sell outside of that. Now, obviously, this is a large space. This has been happening since about 2012, I would say, which is when AI/ML monitoring really started up, with H2O and Algorithmia and Seldon, sort of the first generation of these companies, and then WhyLabs, Aporia, Arize, Arthur, Galileo, Fiddler, Protect AI, and so on and so on. I've put the cutoff here at about mid-2022, sort of before the GenAI revolution happened; there have obviously been companies founded after that, and we just saw Braintrust talking as well. And then the big players here too: Snowflake, Databricks, Datadog, SageMaker, Vertex, Microsoft's products, and so on. But evals have rarely been top of mind for the CEO, and that's the issue. When we would talk to people, it was always: yes, we understand that we need this, but security is going to be a bigger issue, or latency is going to be a bigger issue, or some of these more traditional technology problems are going to be the issue. It wasn't the machine learning model itself. So, you know, I do a lot of due diligence for venture capitalists as well now, as a multi-time founder and so on, and basically every pitch deck in this space from the mid-2010s onward had a slide that said: this is the year that a CEO is going to get fired because of an ML-related screw-up. And to my knowledge, it just still hasn't happened. There have been some forward-thinking leaders. So I have here the annual report from Jamie Dimon, head of JPMC. This came out in April of 2022, so it covered JPMC basically up through their fiscal year 2021, and he's talking about the spend they have going into AI. But if you squint and look at these numbers, they're still comically small. One of these, in the consumer world: he makes the statement that from 2017 up through the end of 2021, they had put $100 million into AI/ML. That's not a huge amount of money for JPMC. So keep sitting back in time, pre-ChatGPT and so on, but let's now flip to the macroeconomic side of things.
So the economy started getting pretty dicey right up until about when ChatGPT launched. That was the end of November for ChatGPT; a lot of enterprise budgets are set in October and November for the following year, and toward the middle to end of 2022 there were very deep fears about an impending recession. That didn't end up happening, but those fears basically made it so that most enterprises either froze or shrank their IT budgets for 2023. So what that meant is, were it not for a particular tailwind called ChatGPT, we probably wouldn't have seen a lot of new technology being developed in the IT departments of these large enterprises in 2023. So this is sort of a bittersweet mix. I guess I just talked about a lot of this; it was bittersweet in the sense that it did set us up to put the eye of Sauron on one specific small pet project where a small amount of budget could be applied, and that small pet project came from our friends at OpenAI launching ChatGPT right before the holiday break. And I'm convinced, just from talking to some C-suite folks across enterprises here in the US, that basically what happened is that CEOs and CFOs, people less technical in the computer science sense of the word, were now able to interact swiftly and easily with a single UI on the internet and get wowed by AI. I was also wowed by AI. In fact, Arthur was hosting, with our Series A lead Index Ventures, an event at NeurIPS, the major machine learning conference, the night that ChatGPT came out, and it basically took over: a bunch of nerds in a room saying, wow, this is very, very impressive. And that happened to everybody else. Flip back to November 30th: you can make Eminem rap like Taylor Swift; oh hey, isn't this funny; oh hey, my mom did this joke about, you know, I want to have poetry but written in the words of a rapper or whatever. This started to happen over and over again. And what that meant is that the discretionary budget that did exist was unlocked specifically for the CEO's pet projects, which were called GenAI. So in 2023 we still had austerity forced on us because of those frozen, or even reduced, budgets, but the thing that was happening is that the only money going around that could be allocated was going specifically to GenAI. And so everybody focused on this; it's a cool technology; everybody focused on this, and the science projects started to float around within the enterprise. In 2024 we started to see GenAI-based applications going into production: chat applications are the obvious one, internal chat applications, internal hiring tools, things like that. And that's because basically the only budget going into new projects in 2023 was going to GenAI, and now, as things go into production, primarily internally, in 2024, we have the folks who tend to dress in business suits asking questions around ROI, governance, risk, compliance, brand optics, and that kind of thing. So now we're starting to get a little bit closer to people outside of the machine learning, data science, and computer science world, outside the CIO's office, caring about evaluation: if I need to have a quantitative estimate of risk, then I need to do evaluation. In 2025, we've seen scale-ups: look at the revenue numbers for any frontier model provider.
Look at the revenue numbers for a lot of us in this room; everything is really going up right now, and that's because of usage, which is great. That's also a function of the C-suite becoming comfortable talking about, and putting large, real budget into, AI. So 2023's IT budgets, 2024's IT budgets, set for the following year: those weren't frozen; those were earmarked specifically for AI applications and things along those lines. So we had science projects in 2023 go into production in 2024, and they're now shipping and scaling in 2025. And also, frankly, the technology has just gotten really amazing. All of us in this room are technical, but even I'm amazed every time a new model is dropped. The community has also really gotten behind this; open source has gotten behind this; venture capital and big tech are writing huge checks into frontier model providers and so on. So everything is coming together in 2025. And also remember that third vertex: we have machine learning systems now moving toward autonomy. Okay, so 2025, we all hear it: it's the year of the agent. No longer is a question mark needed here; it's clearly the year of the agent. Now, a quick 30-second definition of an agent. As defined from the late '50s onward, agents need to perceive the environment, they need to learn, they need to abstract and generalize, and, unlike traditional machine learning, they're going to reason and act. We have reasoning models out there; we have systems that are acting in virtual environments or cyber-physical environments. And what that means is you have a lot of complexity introduced into the system, and a lot of risk introduced into the system, and that's great for those of us in evals. So at the end of the day, the thing that really matters, like I mentioned, when you're selling any product, not just our products in this room, into an enterprise or SMB, is being able to attach your product to some sort of downstream business KPI: risk mitigation, revenue gains, losing less money, whatever. With evaluations you're able to do this, because you're quantifying things, and they're finally a first-class discussion point, which is fantastic. So we have the CEO, who, from November 30th, 2022 onward, now at least knows what the tech is. I'm not saying they know what attention means, but I am saying that they know some of the capabilities of these generative models, they know some of the capabilities of agentic and multi-agent systems, and they're comfortable talking to experts about it, allocating budget for it, and talking to their board of directors and shareholders about it as well. We have the CFO, who, because the CEO cares about this stuff, also obviously needs to care about it, but she's going to care about the impact to the bottom line; that's what a CFO does. They're doing allocation, they're doing budget planning, and they're going to need to write some numbers into an Excel spreadsheet, and those numbers have to come, in part, from quantitative evaluation. CISOs now see this as a huge security risk and opportunity.
And for those who haven't sold into enterprises before, CISOs are typically willing to write checks, smaller checks, especially for startups, more quickly and with less overhead than a CIO would. CIOs tend to have a bigger org, for one, and also tend to have a lot more process; CISOs tend to be a little more scrappy and willing to try out tools. And this actually happened before the agentic revolution; this happened when GenAI started coming up, when CISOs were saying, hey, hallucination detection, prompt injection, things like that: that's firmly in the security space, which is why you've seen a lot of guardrail products, including Arthur's and the ones that come from our competitors, going into the CISO's office and being able to sign a lot of deals. The CIO, poor CIO, has been on board the entire time, and they're just trying to keep the ship sailing. So that's great; they're still on board; they want to keep their job. And the CTOs: they always want standards. They need to make these decisions based on numbers, and those numbers are coming in part from standards, things like OTel and standards like that. And that's great. So I've listed a lot of the C-suite here. I haven't talked about chief strategy officers or others, but the CEO, the CFO, the CTO, the CIO, and the CISO control a lot of budget, and now they are all willing to talk about, and are all aligned on, the need to understand evaluation of AI. Great. So, a quick reminder, and you should hold me truthful here: all the evaluation companies, observability companies, monitoring companies, security companies, whatever you want to call them, have shifted into agentic and multi-agent systems monitoring. The point that you should monitor the whole system, and not just the one model being used by one particular agent, is well taken and well understood in industry and in government, and I think it's great to have that as a top-line discussion point. But, keep me honest here: there was an article that came out in mid-April in The Information showing some leaked revenue numbers for a variety of startups in the evaluation space, Weights & Biases, Galileo, Braintrust, and so on, but they were lagged by about six or eight months, and just from talking to friends in the space, those numbers are no longer representative of what folks in this area are making. So let's see what The Information leaks in early 2026, and maybe we'll see something like "revenue no longer lags at AI evaluation startups," because this is the year of the agent. One last mention, which is not firmly in the evaluation space: if you're playing around with different multi-agent system frameworks, check out any-agent. We implement a lot of them for you under a unified interface, so for people in this room, that might be a fun project to play around with. So thank you. I'll have three minutes for questions. [Applause] I go first. Okay. Thanks. Thanks, great presentation. I just have a question about the enterprise value: most of the evaluations in GenAI require domain expertise. For example, if you're building a multi-agent system to do financial investment analysis, to do something like a discounted cash flow spreadsheet, is the agent doing it correctly or not? I'm just trying to understand how that problem is getting solved.
Because most of them are, you know, coming from an ML background where it was structured data, but this is a lot of unstructured data, and you have to measure the quality, like, is it acting like a human, right? Yeah, it's a great question. Actually, I have a paper in Nature Machine Intelligence talking about some of the problems that can come up when you do persona-based agents, where I say, act like a farmer in Ohio in your mid-40s, and so on. There's value in that, but you can't do it perfectly. And my gut reaction to this is: there was a leaked spreadsheet from Mercor, which is a company that can hire in experts, showing, you know, $50 an hour, $100 an hour, $200 an hour for experts to be hired by, for example, Google, or Meta, or large banks, to do kind of what you're saying, which is to have an expert sitting alongside the multi-agent system. Basically sitting next to the intern who's going to come in and take your job in a year or two, or change your job; maybe "take" is not the right word. But they're basically doing that expensive human validation in lockstep with the multi-agent system, which, if you're doing discounted cash flow analysis, the kind of thing where (a) you can either make or lose a lot of money and (b) you lose your job if you get it wrong, is worth spending that large amount of money on. It's a question for everyone, though: what does that look like in five years, once that data is incorporated into the systems themselves? That can be a moat as well. When you talk to anyone in the eval space, it's the data set creation and the environment creation that matter more than anything, which is the point you're getting at as well. So if I spend a bunch of money to have a very good, competitive DCF environment or whatever, that can help me versus my competitors, so there is capex going into that. Yeah, thanks for the presentation. Do you have a rough timeline on when you think evals will primarily be driven by GenAI, or even LLMs? Yeah, the LLM-as-a-judge paradigm, and I think this was talked about in the previous talk as well: we see it getting used in practice, but there are issues with it. We have a paper in ICLR from last month talking about some of the biases that LLMs as judges have versus humans, on things like conciseness or helpfulness and some of those Anthropic words. But the long and short of it is that it solves the data set creation problem in some sense, in that you can give a persona to an LLM and it is like a poor man's version of a human doing the judging, and so we see a lot of people using that as a crutch right now. But, toward the last question that was asked, you do need to make sure you're validating this and making sure you're not going off in some weird, biased direction. But we do see them in practice. That's time. Oh, happy to chat offline. Yep. Thank you so much, John. He is available as well; feel free to reach out to him. Up next, we have, let's see... let's use Claude 4 Sonnet. That should be good, right? All right. So, let's see. Yeah, I actually asked the AI to go ahead and create the introduction for everybody, so make sure to tell us if anything is hallucinated.
Please welcome Jeff Huber, CEO and co-founder of Chroma, the widely adopted open-source vector database that's been making waves across the tech media from TechCrunch to Forbes, exclamation point. Joining him is Jason Liu, principal at 567 Studio and a machine learning engineer who wears many hats as a consultant and educator, ready to show you how to turn your messy conversation data into actionable insights and killer evals. There you go. Pretty good. Pretty good. All right. Can you hear me? Okay. All right. Welcome everybody. I'm Jeff Huber, the co-founder and CEO of Chroma, and I'm joined by Jason. We're going to do a two-parter here. We're really going to pack in the content. It's the last session of the day, and so we thought we'd give you a lot. Everything in this presentation today is open source and the code is available, so we're also not selling you any tools, and there'll be QR codes and stuff throughout to grab the code. So let's talk about how to look at your data. All of you are AI practitioners. You're all building stuff, and these questions probably resonate quite deeply with you: what chunking strategy should I use? Is my embedding model the best embedding model for my data? And more. And our contention is that you can really only manage what you measure. I think Peter Drucker originally coined that, so I can't take too much credit, but it certainly is still true today. So we have a very simple hypothesis here, which is you should look at your data. The goal is to say "look at your data" I think at least 15 times in this presentation. So that's two. And great measurement ultimately is what makes systematic improvement easy. And it really can be easy; it doesn't have to be super complicated. So I'm going to talk about part one, how to look at your inputs, and then Jason is going to talk about part two, how to look at your outputs. So let's get into it. All right, looking at your inputs. How do you know whether or not your retrieval system is good? And how do you know how to make it better? There are a few options. There is guess and cross your fingers; that's certainly one option. Another option is to use an LLM as a judge: you're using, you know, some of these frameworks where you're checking factuality and other metrics like this, and they cost $600 and take three hours to run. If that is your preference, you certainly can do that. You can use public benchmarks, so you can look at things like MTEB to figure out, oh, which embedding model is the best on English. That's another option. But our contention is you should use fast evals, and I will tell you exactly what fast evals are. All right, so what is a fast eval? A fast eval is simply a set of query and document pairs. The first part is: if this query is put in, this document should come out. A set of those is called a golden data set, and the way that you measure your system is you put all the queries in and you see, do those documents come out? And obviously you can retrieve five or retrieve 10 or retrieve 20; it kind of depends on your application. It's very fast and very inexpensive to run, and this is very important because it enables you to run a lot of experiments quickly and cheaply. You know, I'm sure all of you know that experimentation time and your energy to do experimentation go down significantly when you have to click go and then come back six hours later.
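To make that concrete, here is a minimal sketch of what a fast eval could look like against a Chroma collection: a hypothetical golden set of query-to-document-id pairs scored with recall@k. The collection contents, ids, and golden pairs are illustrative assumptions, not the actual code from the talk.

```python
# Minimal sketch of a "fast eval": a golden set of (query, expected document id)
# pairs, scored by recall@k against a Chroma collection.
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
    ids=["doc-pergola", "doc-chunking"],
    documents=[
        "A pergola in a garden provides shade and supports climbing plants.",
        "Chunking splits long documents into smaller retrievable passages.",
    ],
)

# Golden set: if this query goes in, this document should come out.
golden_set = [
    ("what is a pergola used for in a garden", "doc-pergola"),
    ("how should I split long documents for retrieval", "doc-chunking"),
]

def recall_at_k(golden, k: int = 10) -> float:
    hits = 0
    for query, expected_id in golden:
        results = collection.query(
            query_texts=[query],
            n_results=min(k, collection.count()),
        )
        if expected_id in results["ids"][0]:
            hits += 1
    return hits / len(golden)

print(f"recall@10 = {recall_at_k(golden_set):.2f}")
```

Because a run like this costs pennies, it can be repeated for every chunking strategy or embedding model you want to compare.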
All of these metrics should run extremely quickly, for pennies. So maybe you don't have queries yet: you have your documents, you have your chunks, you have your stuff in your retrieval system, but you don't have queries yet. That's okay. We found that you can actually use an LLM to write questions, and write good questions. You know, just doing the naive thing, "Hey LLM, write me a question for this document," is not a great strategy. However, we found that you can actually teach LLMs how to write queries. These slides are getting a little bit cropped, I'm not sure why, but we'll make the most of it. To give you an example: this is actually an example from one of the MTEB golden data sets around embedding models, the benchmark data sets. This also points to the fact that many of these benchmark data sets are overly clean, right? "What is a pergola used for in a garden?" And then the beginning of the target document is "A pergola in a garden..." Real-world data is never this clean. So what we did in this report, the link is in a few slides, is a huge deep dive into how we can actually generate queries that are representative of real-world queries. It's too easy to trick yourself into thinking that your system's working really well with synthetic queries that are overly specific to your data. And so what these graphs show is that we're actually able to semantically align the specificity of synthetically generated queries with real queries that users might ask of your system. So what this enables is, you know, if a new cool sexy embedding model comes out and it's doing really well on the MTEB score and everybody on Twitter is talking about it, instead of just going into your code and changing it and guessing and checking and hoping that it's going to work, you now can empirically say whether or not it's better for your data. The example here is quite contrived and simple, but you can actually look at the actual success rate. Okay, great: these are the queries that I care about, do I get back more of the right documents than I did before? If so, maybe you should consider switching. Now, of course, you need to re-embed your data. That service could be more expensive, it could be slower, the API for that service could be flaky. There are a lot of considerations, obviously, when making good engineering decisions. But clearly the north star of success rate, how many of the right documents do I get back for my queries, is super fast and super useful and makes improving your system much more systematic and deterministic. All right. So we actually worked with Weights & Biases, looking at their chatbot, to ground a lot of this work. So what you see here, for the Weights & Biases chatbot, is four different embedding models and the recall at 10 across those four embedding models. And I'll point out that blue is ground truth: these are actual queries that were logged in Weave and then sent over. And then there's generated: these are the ones that are synthetically generated. What we want to see is a few things. We want to see that those are pretty close, and we want to see that they stay in the same order of accuracy, right? We don't want to see any big flips between ground truth and generated, and we were really happy to see that that's what we found.
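As a rough illustration of that alignment idea (not the actual pipeline from the report), one way to do it is to few-shot the LLM with real logged queries so the generated queries match their style and specificity. The example queries, prompt wording, and model name below are all assumptions.

```python
# Rough sketch: few-shot the LLM on real user queries so synthetic queries
# match their style and specificity, rather than being overly clean.
from openai import OpenAI

client = OpenAI()

REAL_QUERY_EXAMPLES = [
    "wandb sweep not logging val loss, how do I fix config?",
    "how to resume a crashed run from a checkpoint",
]

def generate_query(document: str) -> str:
    examples = "\n".join(f"- {q}" for q in REAL_QUERY_EXAMPLES)
    prompt = (
        "Here are real user queries from our system:\n"
        f"{examples}\n\n"
        "Write ONE new query, in the same style and level of specificity, "
        "that the following document should answer:\n\n"
        f"{document}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Each (generated query, source document id) pair becomes a row in the golden
# set scored by the recall@k sketch shown earlier.
```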
Now there are a few fun findings here, which of course are going to get cropped out, but that's okay. Number one, the original embedding model used for this application was actually text-embedding-3-small. This actually performed the worst out of all the embedding models that we evaluated, just for this case, and so probably wasn't the best choice. The second one was that, if you look at MTEB, Jina embeddings v3 does very well on English, like way better than anything else, but for this application it didn't actually perform that well. It was actually the Voyage 3 large model which performed the best, and that was empirically determined by actually running this fast eval and looking at your data. That's number three. All right. So, if you'd like access to the full report, you can scan this QR code. It's at research.trychroma.com. There's also an adjoining video, which is kind of screenshotted here, which goes into much more detail. There are full notebooks with all the code. It's all open source. You can run it on your own data. And hopefully this is helpful for you all, thinking about how you can systematically and deterministically improve your retrieval systems. And with that, I'll hand it over to Jason. Thank you. So, you know, if you're working with some kind of system, there are always going to be the inputs that we look at. And we talked about things like retrieval, how do the embeddings work. But ultimately we also have to look at the outputs, right? And the outputs of many systems might be the outputs of a conversation that has happened, an agent execution that has happened. And the idea is that if you can look at these outputs, maybe we can do some kind of analysis that figures out, you know, what kind of product should we build, what kind of portfolio of tools should we develop for our agents, and so forth. And so the idea is, if you have a bunch of queries that users are putting in, or even a couple of hundred conversations, it's pretty good to just look at everything manually, right? Think very carefully about each interaction, and then only use these models when they make sense. And oftentimes when I say this, people say, you know what, why don't we just put everything into o3? Generally, only use the language models if you think you're not smarter than the language model. Then, when you have a lot of users and an actually good product, you might get thousands of queries or tens of thousands of conversations, and now you run into an issue where there's too much volume to manually review. There's too much detail in the conversations, and you're not really going to be the expert that can figure out what is useful and what is good. And ultimately, with these long conversations with tool calls and chains and reasoning steps, these outputs are now really hard to scan and really hard to understand. But there's still a lot of value in these conversations, right? If you've used a chatbot, whether it's in Cursor or any kind of Claude Code system, oftentimes you do say things like, "Try again. This is not really what I meant. You know, be less lazy next time." It turns out a lot of the feedback you give is in those conversations, right? We could build things like feedback widgets or thumbs up or thumbs down, but a lot of the information already exists in those conversations.
And the frustration and the retry patterns that exist can be extracted from those conversations. The idea is that the data really already exists in these conversations. Think of a simple example from a different industry; we can imagine the analogy of marketing, right? Maybe we run our evals and the number is 0.5. I don't really know what that means. Factuality is 0.6. I don't know if that's good or bad. Is 0.5 the average? Who knows? But imagine we run a marketing campaign and our, you know, ad metric or our KPI is 0.5. There's not much we can do. But if we realize that 80% of our users are under 35 and 20% are over, and we realize that the younger audience performs well and the older audience performs poorly, what we've done is we've just drawn a line in the sand on who our users are. And now we can make a decision: do we want to double down on marketing to a younger audience, or do we want to figure out why we aren't successfully marketing to the older population, right? Do I find more podcasts to market on? Should I run a Super Bowl ad? Now, just by drawing a line in the sand and deciding which segment to target, we can make decisions on what to improve. Whereas "just make the ads better" is a very generic sentiment that people can have. And so one of the best ways of doing that is effectively just extracting some kind of data out of these conversations in some structured way and doing very traditional data analysis. And so here we have some kind of object that says I want to extract a summary of what has happened, maybe some tools that were used, maybe the errors that we've noticed in the conversations that happened, maybe some metric for satisfaction, maybe some metric for frustration. The idea is that we can build this portfolio of metadata that we can extract, and then what we can do is embed this, find clusters, identify segments, and then start testing our hypotheses. And so what we might want to do is build this extraction, put it into an LLM, get this data back out, and just start doing very traditional data analysis, no different from any kind of product engineer or any kind of data scientist. And this tends to work quite well. You know, if you look at some of the things that Anthropic's Clio did, they basically found that code use was 40x more represented among Claude users than in, you know, GDP value creation. They go, okay, maybe code is a good avenue. And obviously that's not really the whole story, but the idea is that by understanding how your users use a product, you can figure out where to invest your time. And so this is why we built a library called Kura that allows us to summarize conversations, cluster them, build hierarchies of these clusters, and ultimately compare our evals and KPIs across those clusters. So now, you know, if factuality is 0.6, that's really hard to act on. But if it turns out that factuality is really low for queries that require time filters, right, or factuality is really high when queries revolve around, you know, contract search, now we know something's happening in one area and something's happening in another, and then we can make a decision on what to do and how to invest our time. And the pipeline is pretty simple. We have models to do summarization, models to do clustering, and models that do this aggregation step. And so what you might want to do is just load in some conversations.
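Before the worked example that follows, here is a minimal sketch of that kind of extraction object, assuming a Pydantic schema and an OpenAI-style chat call. The field names and prompt are illustrative assumptions, not Kura's actual models.

```python
# Minimal sketch of extracting structured metadata from one conversation.
from openai import OpenAI
from pydantic import BaseModel

class ConversationSummary(BaseModel):
    summary: str
    tools_used: list[str]
    errors: list[str]
    satisfaction: int  # e.g. 1-5, illustrative scale
    frustration: int   # e.g. 1-5, illustrative scale

client = OpenAI()

def summarize(conversation: str) -> ConversationSummary:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any JSON-capable chat model
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize this conversation as JSON with the fields: "
                    "summary, tools_used, errors, satisfaction, frustration."
                ),
            },
            {"role": "user", "content": conversation},
        ],
        response_format={"type": "json_object"},
    )
    return ConversationSummary.model_validate_json(
        response.choices[0].message.content
    )

# Each summary can then be embedded, clustered, and compared against your KPIs.
```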
And here we've made a fake data set, fake conversations from Gemini. And the idea is that first we can extract some kind of summary model where there are topics that we discuss, frustrations, errors, etc. We can then cluster them to find cohesive groups. And here we can find that maybe, you know, some of the conversations are around data visualization, SEO content requests, and authentication errors, and now we get some idea of how people are using the software. And then as we group them together, we realize, okay, really there are some themes around technical support. Does the agent have tools that can do this as well? Do we have tools to debug these database issues? Do we have tools to debug authentication? Do we have tools to do data visualization? That's something that's going to be very useful. And at the end of this pipeline, we're presented with these printouts of clusters, right? We know what the tools are, how the chatbot is being used at a higher level, you know, SEO, content, data analysis, and at a lower level, you know, maybe it's blog posts and marketing. And just by looking at this, we might have some hypotheses as to what kind of tools we should build, how we should develop even our marketing, or how we can think about changing our prompts. We can do a ton of these kinds of things. And this is because the ultimate goal is to understand what to do next, right? You do the segmentation to figure out what kind of new hypotheses you can have, and then you can make these targeted investments within certain segments. If it turns out that, you know, 80% of the conversations that I'm having with the chatbot are around SEO optimization, maybe I should have some integrations that do that. Maybe I should reevaluate the prompts or have other workflows to make that use case more powerful. And again, the goal really is to make a portfolio of tools, of metadata filters, of data sources, that allows the agent to do its job. And oftentimes the solution isn't really making the AI better; it's really just providing the right infrastructure. Right? A lot of times, if you find that a lot of queries use time filters and you just didn't add a time filter, adding one can probably improve your eval by quite a bit. We had situations where we wanted to figure out if contracts were signed, and by extracting just one more step in the OCR process, now we can do these large-scale filters and figure out, you know, what data exists. And generally the practice of improving your applications is pretty straightforward, right? We all know to define evals, but not everyone that I work with has really been thinking about something like finding clusters and comparing KPIs across clusters. But once you do, then you can start making decisions on what to build, what to fix, and what to ignore. Maybe you have a two-by-two of quadrants, right? Maybe you have low usage and high usage, and you have high-performing evals and low-performing evals. If a large portion of your population is using tools that you are bad at, that is clearly the thing you have to fix. But if a large proportion of people are using tools that you're good at, that's totally fine. If a small proportion of people do something that you're good at, maybe there are some product changes you need to make. Maybe it's about educating the user. Maybe it's adding some, you know, pre-filled or automated questions to show them that we have these kinds of capabilities.
And if there are things that nobody does, but when we do them, they're bad, maybe that's a one-line change in the prompt that says, "Sorry, I can't help you. Go talk to your manager." Right? These are now decisions that we can make just by looking at what proportion of our conversations are of a certain category and whether or not we do well in that category. And as you understand this, then you can go out and build these classifiers to identify these specific intents. Maybe you build routers, maybe you build more tools. And then you start doing things like monitoring and having the ability to do these group-bys, right? So now you have different categories of query types over time, and you can just see what the performance looks like. Where 0.5 doesn't really mean anything on its own, whether or not a metric changes over time across a certain category can tell you a lot about how your product is being used. By doing this, we figured out that some customers, when we onboard them, use our applications very differently than our historical customers, and we can then make other investments in how to improve these systems. And ultimately, the goal is to create a data-driven way of defining the product roadmap. Oftentimes it is research that leads to better products now, rather than products justifying some research that we don't know is possible. And again, the real marker of progress is your ability to form high-quality hypotheses and your ability to test a lot of those hypotheses. If you segment, you can make clearer hypotheses. If you use faster evals, you can run more experiments. And by having this continuous feedback through monitoring, this is how you actually build a product, right? This is regardless of it being an AI product; this is just how you build a product. And so if you look at the takeaways: when you think about measuring the inputs, you really want to think about not relying on public benchmarks, building evals on your data, and focusing first on retrieval, because that is the one thing an LLM improvement won't fix, right? If the retrieval is bad, the LLM will still get better over time, but you need to earn the right to tinker with the LLM by having good retrieval. And then lastly, if you don't have any customers or any users, you can start thinking about synthetic data as a way of augmenting that. And once you have users, look at your data as well. Look at the outputs, right? Extract structure from these conversations. Understand, you know, how many conversations are happening, how often tools are being misused, what the errors are, and how people are frustrated. And by doing that, you can do this population-level data analysis, find these similar clusters, and have some kind of impact-weighted understanding of what the tools are. Right? It's one thing to say, you know, maybe we should build more tools for data visualization. It's another thing to say, hey boss, 40% of our conversations are around data visualization and the code engine, the code execution, can't really do that well; maybe we should build two more tools for plotting and then see if that's worth it. And you can justify that because we know that 40% of the population is using data visualization and we only do it well, you know, maybe 10% of the time, right? This is impact-weighted. And ultimately, as you compare these KPIs across these clusters, you can just make better decisions across your entire product development process.
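As a small, made-up illustration of that usage-versus-performance framing, the sketch below groups per-conversation eval scores by category, computes each category's usage share and average score, and buckets it into a rough decision. The categories, scores, and thresholds are invented for the example, not data from the talk.

```python
# Sketch of the usage-vs-performance quadrant: per category, compute usage
# share and average eval score, then bucket into a rough decision.
import pandas as pd

df = pd.DataFrame({
    "category": ["data_viz", "data_viz", "seo", "auth_errors", "seo", "data_viz"],
    "score":    [0.40, 0.50, 0.90, 0.30, 0.85, 0.45],
})

per_cat = df.groupby("category")["score"].agg(["mean", "count"])
per_cat["usage_share"] = per_cat["count"] / per_cat["count"].sum()

def decide(row, usage_cut=0.3, score_cut=0.7):
    high_usage = row["usage_share"] >= usage_cut
    good = row["mean"] >= score_cut
    if high_usage and not good:
        return "fix this first"
    if high_usage and good:
        return "fine, keep monitoring"
    if good:
        return "low usage but good: educate users / suggest it"
    return "consider declining gracefully in the prompt"

per_cat["decision"] = per_cat.apply(decide, axis=1)
print(per_cat[["usage_share", "mean", "decision"]])
```

The same group-by, run per week instead of overall, gives the over-time monitoring view mentioned above: a topline 0.5 means little, but a category whose score drifts week over week tells you something.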
So again, start small, look for structure, understand that structure, and start comparing your KPIs. And once you can do that, you can make decisions on what to fix, what to build, and what to ignore. If you want to find more resources, feel free to check out these QR codes. The first one is to Chroma, to understand a little bit more about their research, and the second one is actually a set of notebooks that we've built out that go through this process. So we load the Weights & Biases conversations, we do this cluster analysis, and we show you how we can use that to make better product decisions. There are three Jupyter notebooks in that repo. Check them out on your own time, and thank you for listening. We do have time for like one quick question, and of course outside as well. So thank you. If anybody wants to grab the mic, there and over there. Yep. What's the spicy take today? It's not the KPI stuff, by the way. That's not the spicy take. I think more agent businesses should try to price their services on the work done rather than the tokens used. Yeah. Price on success, price on value. Very unrelated to this talk, but all right. Well, if there are no other questions, definitely meet up with them for any other questions you may have. But yeah, thank you so much for coming to the eval track. Hopefully you all are sold and you're going to look into evals and into your data as well. So thank you so much. Nailed it. Hi. Yeah. Is there a reason the technique that I talked about, using fast evals, you know, query in and document out, could it work for images as well, right? So yeah, I think it would work sort of just the same. You just need a labeled pair, right, of like this input should yield this output, and it could be multimodal or anything. What's the relation to Chroma, for the talk? That's okay. [Music] Yeah, Chroma is a retrieval system. So Chroma does vector search, full-text search, metadata search, but this talk is in some ways separate from what Chroma does commercially. This is just an open-source set of Jupyter notebooks that show you how to set up and run benchmarks. I guess once you figure out a certain angle on the world. Yep. I'm not touching anything in case they come over in five minutes and we have a pop. I don't trust it. Right. So don't, I'm just saying to you, don't start taking batteries. Yeah. Yeah. I think that's also why they said don't do anything. They just wanted us to hang on and see what happens. Well, because we need to do this real quick. I think, can you go grab the Q&A mics? Bring them back here. So, check check. I don't even need for this to be open to send it to record. Okay. I wonder if I just cut it, mute it in the end.