Why building eval platforms is hard — Phil Hetzel, Braintrust

Channel: aiDotEngineer

Published at: 2026-04-28

YouTube video id: _fQ7Z_Wfouk

Source: https://www.youtube.com/watch?v=_fQ7Z_Wfouk

All right, it's 11:15.
We're going to go ahead and get started.
Before we do, everyone say evals. Evals.
I was telling my colleague Rose, who is at the door, that I was an adjunct professor for a number of years. The first year that I did it, I thought I was going to have a full class of 130 people every single week, eager to learn. And then as the weeks went on, 130 became 60, became 30, became 10. So I always tell myself whenever I give a talk that only about four or five people are going to show up, but I'm going to be really excited to teach those four or five.
Today is a real blessing because we have a packed house here. Everyone's excited to learn about evals, and I'm excited to teach it.
Here's what we're going to be talking about today: a little bit of intro about myself and the company that I work for, an overview of the problem statement, the different stages people go through when building eval platforms, and then, at least in my opinion, where I think eval platforms are going to go.
But yeah, this is me. My name is Phil Hetzel. I lead solutions engineering at Braintrust; I'll go into what Braintrust is in a second. Solutions engineering basically means my team and I make sure that people are getting the most value out of our platform, as quickly as possible. I'm fortunate because, across all of our customers, I see what the state of the art is in both evals and agent observability.
Prior to Braintrust, I spent 12 years in consulting and systems implementation. I worked for KPMG for 4 years and for a company called Slalom Consulting for 8 years, where I led the global Databricks business unit. I noticed that as I was helping my clients with those implementations, they were so good at generating generative AI proofs of concept, and none of them were getting to production. I wanted to be helpful in making sure those POCs could get to production. So I actually started using Braintrust as a user, because I knew it helped in this space. I liked the platform so much that I applied for a job, and I've been here for about a year.
Outside of work, I like to play chess, but I'm very bad at it. And I like to spend time with my wife and my dachshund, whose name is Pistol Pete. He's pictured here; he's the one in brown. He's not the one in black. The one in black is me.
Has anyone heard of Braintrust before? Anyone? A couple of hands. How many people have heard about Braintrust for the first time this week? Really? Okay, great. Wonderful.
Braintrust, just as a reminder: I think of us as an agent quality platform. There are a lot of things that can go into quality. The way we get to agent quality is through two main pillars: evals and observability, which we think of as really similar problems to solve. Evals are what you're doing with your agent before it gets to production, as you're experimenting, so that you can become confident in your agent. Observability is really similar, but you're already in production. Your agent is in front of real usage from real users, and you want to remain confident that your agent is performing the way you thought it would when you were building it.
So, that's Braintrust. I was specifically told not to make this a sales pitch, so that's really the last Braintrust slide you'll get today, although of course I'm very happy to answer questions about our company this week. Mainly, I wanted to talk more conceptually about how people start to mature and build these platforms, spoken from a place where we have a lot of experience in the space.
First, why evals are important. This sounds obvious, but LLMs have extreme variability. We love LLMs because they're highly variable; there are so many different types of problems that LLMs can reason through, and that's why we're so attracted to them as a technology. Agents, of course, use LLMs as the brain of the agent, and agents are becoming the norm in how customers interact with companies. People expect an agentic experience now. So if you combine both of those things, you really need to be confident in how your agent is going to perform once it's in production. Without that, you can potentially incur a great deal of risk, from a brand perspective, a compliance perspective, and even a cost, maintenance, and systems perspective. We want to avoid all of those things and make sure that our customers are having a great experience and that our agents are acting the way we thought they would act.
How many people are doing evals right now, but it's just on a Google Sheet or some spreadsheet? There's no shame in that, my friend. Raise that hand high. That's great. I think that's great; just making that step is really important. It's an acknowledgement of the problem space. A lot of folks will come to us and say, "Well, I don't really understand Braintrust, because all I need is to loop through my agent with a couple of different inputs and display some handwritten notes and scores about that agent." So, the things I mentioned there are three things: some way to execute your agent; some UI, sometimes as simple as a spreadsheet, to show those outputs and scores; and a way to gather input examples. What I mean by input example is the thing that can initiate a run of an agent, the thing that can invoke an agent, whatever information is necessary for that.
It would be a really short presentation if this were all there was to evals. I would thank you for your time and walk out of the room, but that's not what you're here for. There is a whole other part of the iceberg. It's way more complicated than that. There are a lot of things you end up having to build when you're really serious about evals. We're not going to talk about every single one of these things today, but we will touch on many of them. And of course, if there's anything here that I don't cover that you're interested in, I'll leave some time for questions. I also see a lot of phones up, so I'm going to pause for iceberg pictures.
A couple of things while that's happening. Why is this a complicated problem? We already talked a little bit about how the underlying technology is quite complex; LLMs are not a superficial engine. But building these agents is also a multi-persona problem. It's not just something that engineers do in isolation. It's something where engineers, whether they're product engineers or AI engineers or both, systems engineers to get the thing running, and SMEs with the domain knowledge all need to be involved. And then lastly, evals themselves become a systems problem. That'll be the last thing we touch on today.
So, what are the different stages of building an eval platform? My friend over there raised his hand proudly about starting out on a spreadsheet. This is a great place to start; the most important thing is that you just get started. So, you've got a spreadsheet and you've got a for loop. You've got a bunch of input examples that you can iterate through, and you have a way to execute your agent, so you can see, every time you tweak your agent, how the outputs differ over time.
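As a rough sketch of what this spreadsheet-and-for-loop stage amounts to, here is a minimal example in Python. The `run_agent` function, the CSV column names, and the exact-match scorer are all hypothetical placeholders, not anything Braintrust-specific.

```python
import csv

def run_agent(user_input: str) -> str:
    # Placeholder: swap in a real agent call (LLM request, tool loop, etc.).
    return f"echo: {user_input}"

def exact_match(output: str, expected: str) -> float:
    # Simplest possible scorer: 1.0 if the output matches the expected answer.
    return 1.0 if output.strip() == expected.strip() else 0.0

# Read input examples from a spreadsheet export, run the agent on each,
# and write the outputs and scores back out so each tweak can be compared.
with open("eval_inputs.csv") as f, open("eval_results.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["input", "expected", "output", "score"])
    for row in csv.DictReader(f):
        output = run_agent(row["input"])
        score = exact_match(output, row["expected"])
        writer.writerow([row["input"], row["expected"], output, score])
```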
This is a great place to start because there is no barrier to entry; everyone has some way to access some type of spreadsheet technology. But the returns can be diminishing for a couple of reasons. I would call this documenting rather than experimenting. You have this spreadsheet with a bunch of input examples, and maybe you keep track, each time you tweak your agent, of the different outputs it emitted. That can become cumbersome to manage over time, and it's really challenging to directly compare experiments over time. You're probably not doing a lot of analytics across those experiments, and the analytics you are performing are likely coming from some type of human scorer, which is really valuable but challenging to scale in practice.
Evals are a team sport, kind of what I was talking about before. We want to make sure we're bringing a ton of people into the fold, not just technical folks but also non-technical folks. They can add a lot of value to your agent because of their unique domain expertise and proximity to users, but they're probably not coming into the spreadsheet; that's my point. And it's slow. Each time you eval, you have to go through a somewhat cumbersome process to recreate or append to the spreadsheet.
One of the most fun conversations I have in my job is when a very proud product engineer gets on a call with me, puffs their chest out, smirks at me, and says, "Well, I can just vibe code Braintrust. It's no problem." And I think if you're just getting started on your journey, it's a really nice step to go to. Now, instead of being in spreadsheet land, you're making something a little more bespoke for other people, to bring them into the fold. You've probably still got a for loop. You have a nicer UI now, so it's more approachable. And hopefully you've graduated to some database that isn't Excel or Google Sheets; you probably roll a new database in something like Neon. So now you have a better story around persistence of evals. Because of this, you're bringing more people into the fold, and you're making UIs that are a little more bespoke for your specific users.
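A minimal sketch of what that graduation to a real database might look like, using SQLite as a stand-in for something like Neon or Postgres; the table and column names here are made up for illustration.

```python
import sqlite3  # stand-in for Postgres/Neon so the sketch runs anywhere

conn = sqlite3.connect("evals.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS experiments (
    id INTEGER PRIMARY KEY,
    name TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS eval_results (
    experiment_id INTEGER REFERENCES experiments(id),
    input TEXT,
    output TEXT,
    scorer TEXT,
    score REAL
);
""")

def record_result(experiment_id: int, example_input: str, output: str, scorer: str, score: float) -> None:
    # Persist one scored example so experiments can be compared across runs.
    conn.execute(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?)",
        (experiment_id, example_input, output, scorer, score),
    )
    conn.commit()
```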
The problem here is that you're still not really iterating yet. You're still performing work that is more just reporting, just documentation, rather than encouraging a lot of iteration. So it's more of a reporting tool here. How many people have vibe coded their own UI? Yeah. Makes sense.
The next step: you want to encourage a lot of experimentation, not just with technical users but with non-technical users. I'm showing an image that is more aligned with allowing experimentation for non-technical users, but of course, as you're building these platforms, you want to allow for an SDK-driven experience as well. That just doesn't make for a very nice image in a presentation.
Experimentation, to me, means that you can give a user access to an agent, a configuration of that agent, and a sandbox, and you allow them to tweak certain parameters within that agent. In my example here, I'm allowing a user in the UI to change the system instructions of an agent running outside of my eval platform, and letting them compare two different configurations of that system prompt. I'm running evals across those two different agent runs so that I can bubble up scores; you can see that in the image. I can bubble up different scores to understand both technically and functionally how my agent is behaving. You'll hear that a lot of platforms have a playground feature. You're going to want some type of playground feature for both technical and non-technical users.
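As a sketch of what that playground-style comparison boils down to, the example below runs the same inputs through two system-prompt configurations and bubbles up a mean score for each. The agent call and the scorer are stubs; in practice the agent runs outside the eval platform and the scorer is often an LLM-as-judge.

```python
from statistics import mean

def call_agent(system_prompt: str, user_input: str) -> str:
    # Stub for the real agent running outside the eval platform.
    return f"({system_prompt[:20]}...) answer to: {user_input}"

def scorer(output: str) -> float:
    # Stub scorer; replace with an LLM-as-judge or rule-based check.
    return 1.0 if "answer" in output else 0.0

inputs = ["How do I reset my password?", "Cancel my subscription, please."]
configs = {
    "baseline": "You are a helpful support agent.",
    "candidate": "You are a concise support agent. Always cite the relevant help article.",
}

# Compare the two configurations side by side, the way a playground UI would.
for name, system_prompt in configs.items():
    scores = [scorer(call_agent(system_prompt, i)) for i in inputs]
    print(f"{name}: mean score {mean(scores):.2f} over {len(inputs)} examples")
```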
This is where the rubber starts meeting the road, because the best way to perform evals is to really think about the failure modes your agent can fall into and build scoring functions around those failure modes. The best way to find those failure modes in the first place is to have access to production trace data, i.e., your agent in front of real users and real usage. So the next step here is a really important one: we want to make sure that we can connect what we, at least internally, call the flywheel. Observability and evals, to us, are actually the same problem from a systems perspective. Funny story: 3 years ago, when we started, we were only an evals platform. Then we noticed one of our customers was running this massive eval every hour of every day. We reached out to this person and they said, "Oh yeah, I'm just piping all of my production traffic into this database and I'm running an eval against it." So we thought, "Okay, we should probably just build the ability to trace and observe actual traffic and account for that use case without having to cram it into offline evals."
So this is really important: make sure that we can observe things in production, understand the actual behavior of our agents, and understand the real lift that the changes we're making to our agents are having. We're analyzing that data, we pull those actual examples back into an offline environment, and then we improve upon them using offline evals. This is a loop, not just a process. You're going to be performing this loop hopefully for the lifetime of the agent you're pushing to production. You should be iterating this loop as many times as possible. That's how you improve.
As a result of that, you've changed your scope a little. You've widened your scope a lot, actually. You are now a tracing platform and a logging platform, in addition to being an offline evals platform. The benefit is that you're starting to get far higher signal from how users are interacting with your agents, and you can use those real interactions; you can almost think about evals like you're rerunning production in a safe environment. You're now getting to that point with this example.
You can also perform online evals: you can point scoring functions at your observability traffic and do things like alerting, all things you could build in when you're at this phase of maturity for running evals.
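A rough sketch of what an online eval with alerting could look like: score a sample of incoming production traces and alert when a rolling average drops. The sample rate, threshold, and scorer are arbitrary illustrative choices.

```python
import random

SAMPLE_RATE = 0.1      # score roughly 10% of production traffic
ALERT_THRESHOLD = 0.8  # alert if the rolling mean drops below this

recent_scores: list[float] = []

def refusal_score(trace: dict) -> float:
    # Stub scorer; in practice this is often an LLM-as-judge call.
    return 0.0 if "i can't help with that" in trace["output"].lower() else 1.0

def on_trace_ingested(trace: dict) -> None:
    # Online eval: attach a score to sampled production traces as they arrive.
    if random.random() > SAMPLE_RATE:
        return
    recent_scores.append(refusal_score(trace))
    window = recent_scores[-100:]
    if len(window) >= 20 and sum(window) / len(window) < ALERT_THRESHOLD:
        print("ALERT: online eval score dropped below threshold")  # page someone here
```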
The bad here: if you build it, you have to manage it. Just because you've vibe coded a platform, guess what? You might get a promotion for it, but that's also going to be your job now: to manage and continue to grow your eval platform at the pace the industry is moving, which can be an exciting challenge. That's kind of the bet our company is making, and we're excited to solve that problem.
The more important challenge, though, is that agent traces specifically, if you look on the screen, are really nasty. They're not like normal application traces. They are really semi-structured, and a lot of times they're unstructured. There's just a ton of text inherent to the LLM problems we're solving. They're very large in addition to being complicated, so if you're trying to cram a 1 GB trace into a Postgres row, that can lead to a lot of performance problems. And they're numerous: it's high velocity because there's so much usage happening in production, hopefully, with the agent that you've pushed.
This is how we used to solve this problem, just as an example. If you're at this stage of maturity, you've got traces coming in, and you're going to need to account for two query patterns. One: if you're performing observability, you need a way for folks to instantly see their traces. It's very important to people, so you'll need a very low-latency way to ingest data. And then you also need a second layer of persistence for the query pattern of "I want to be able to analyze these data in aggregate." We used to use an open-source data warehouse for this.
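Very roughly, serving those two query patterns means every trace takes two paths: into a low-latency store so it can be viewed immediately, and into a batched pipeline headed for an analytical store. The in-memory dict and batch queue below are stand-ins for real infrastructure, not an actual architecture.

```python
from collections import deque

hot_store: dict[str, dict] = {}         # trace_id -> trace, for instant single-trace lookups
warehouse_batch: deque[dict] = deque()  # buffered rows bound for the analytical warehouse

def flush_to_warehouse() -> None:
    # In a real system this would be a bulk insert into a columnar store
    # that serves the "analyze these data in aggregate" query pattern.
    batch = list(warehouse_batch)
    warehouse_batch.clear()
    print(f"flushed {len(batch)} traces to the warehouse")

def ingest_trace(trace: dict) -> None:
    hot_store[trace["id"]] = trace      # visible to the UI with low latency
    warehouse_batch.append(trace)
    if len(warehouse_batch) >= 1000:
        flush_to_warehouse()
```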
We used to stitch these two sources together through a domain-specific language that we created called BTQL, which no one liked, including us; we hated it. And then we would perform a third level of aggregation using DuckDB in the browser.
This worked for us for a bit, and then it didn't. I'll just use one of our customer examples: a customer like Notion was sending us a ton of unstructured data, and they want to be able to perform things like full-text search across a trace. None of these technologies are really equipped to perform text-style analytics, which is a challenge in the LLM domain because there's just so much text.
So, that leads us to this: measuring agent quality, performing evals, performing observability, is actually a systems problem. It's not just a UI/UX problem. We recognize that it's quite easy to vibe code the UI of evals, but it's way more challenging to create the data layer of a successful evals and observability platform. And not just from a scale perspective, although that matters, but mostly from a functional perspective of allowing people to do the things they would expect to do, like performing full-text search across millions of traces in their platform of choice.
I talked about this a little bit. The reason this is such a novel problem to solve spans a lot of these dimensions, which I won't drain the slide on, but: the data comes in really fast, and the data are really large when they come in. A traditional span in a trace, a span being just one part of a trace, would be a couple of kilobytes; here, we've seen spans that are 10 or 20 megabytes in size, with so much context within them. Highly unstructured. And then there are a lot of different types of read patterns: you might be performing aggregate read patterns, but you also want very low-latency read patterns. None of these problems is individually unique, but together they make for a very unique problem from a systems perspective.
So what we've done, and what you would have to endeavor to do if you were building this yourselves, is really think about making the right data platform for traces so that you can meet some of the more functional requirements that eventually come down the line. The example I have here: let's say you want to let a coding agent loose on your evals platform so that you can be a little more self-healing, grabbing data in aggregate from your evals platform, using a coding agent to pull that into context, and changing your agent within a coding agent session. That's going to be really challenging to do unless you can run plain SQL against the data backend of your evals platform. We've actually noticed a lot of these headless-style use cases come up, where people aren't interested in the UI at all. The only thing they're interested in is: how can I perform evals in a way where I can use Codex or Claude Code to help increase the quality of my agent for me?
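A small sketch of that headless pattern, using DuckDB as the SQL engine; the table, its contents, and the query are invented purely to show the shape of the interaction.

```python
import duckdb  # assumes the duckdb Python package; any SQL engine over your eval data works

con = duckdb.connect()  # in-memory for the sketch; point at your real eval store in practice
con.execute("CREATE TABLE eval_results (scorer TEXT, input TEXT, output TEXT, score DOUBLE)")
con.execute("INSERT INTO eval_results VALUES ('faithfulness', 'What is the refund policy?', 'made something up', 0.1)")

# Headless pattern: instead of a UI, expose eval results as plain SQL so a coding
# agent (Codex, Claude Code, etc.) can pull the worst failures into its context,
# propose prompt or tool changes, and re-run the eval.
worst_failures = con.execute(
    "SELECT * FROM eval_results WHERE score < 0.5 ORDER BY score LIMIT 20"
).fetchall()
print(worst_failures)
```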
The last problem I'll talk about is the "so what" problem, and we'll skip this for now for the sake of time. This is how Braintrust does it; we have a blog about this, which just got released, if you're interested.
But what kind of comes next, what you can expect to build into your evals platform, is the ability to tell folks the unknown unknowns of your agent. I.e., don't make me look across a whole bunch of traces; just tell me how people are using our agent. You want to be able to uncover those unknown unknowns through topic modeling techniques so that you know where to spend your engineering time.
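As an illustrative sketch of surfacing those unknown unknowns, the example below clusters user messages pulled from traces. The TF-IDF vectors are a toy stand-in for a real embedding model, and the cluster count is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# User messages extracted from production traces (toy examples).
user_messages = [
    "how do I export my data",
    "export to csv is broken",
    "can I invite a teammate",
    "add another user to my workspace",
]

vectors = TfidfVectorizer().fit_transform(user_messages)
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(vectors)

# Each cluster is a candidate "topic" showing how people actually use the agent.
for cluster in sorted(set(labels)):
    examples = [m for m, label in zip(user_messages, labels) if label == cluster]
    print(f"topic {cluster}: {examples}")
```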
You want to make sure that you are building your platform not just for humans but also for agents, because that's one of the main media through which people are creating technology now. We didn't even talk about the non-functional requirements that go into building these platforms, like role-based access control and data masking; those are also super important and come up when you want to operate at scale.
And lastly, consider adding automatic tracing through some type of AI proxy or gateway, so that people don't even have a choice but to trace their LLMs. You can govern very centrally by adding tracing automatically to your eval platform.
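A minimal sketch of that gateway idea: route every LLM call through one function that records a span automatically. It assumes an OpenAI-style chat client; `log_trace` and the span fields are hypothetical.

```python
import time
import uuid

def log_trace(span: dict) -> None:
    # Placeholder: ship the span to your tracing backend.
    print("traced:", span["id"], f'{span["latency_ms"]} ms')

def proxied_completion(client, **request):
    # Gateway pattern: every LLM call goes through this one function, so tracing
    # happens automatically and can be governed centrally (models, keys, masking).
    start = time.time()
    response = client.chat.completions.create(**request)  # OpenAI-style client assumed
    log_trace({
        "id": str(uuid.uuid4()),
        "request": request,
        "response": response.choices[0].message.content,
        "latency_ms": round((time.time() - start) * 1000),
    })
    return response
```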
So, I appreciate the time. I've got about a minute and 20 seconds left for questions; I can probably take two of them. If anyone has any questions. Yes.
Okay. I'm not sure about Braintrust, but with Braintrust and these kinds of tools, the problem is often when you create dynamic prompts, not only string interpolation but also files and videos going into LLMs; solutions struggle, and then you often build custom versions. How does Braintrust get around that? So, how does Braintrust specifically handle multimodal inputs and outputs in traces?
Yeah, just very technically, we put them in some object storage, reference them, and then display them directly in the trace. So if you have an audio file or a video file, you can play it in the trace when someone's reviewing the trace itself. We don't want people to have to exit the platform for that.
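As a sketch of that approach, the example below uploads a media file to S3-style object storage and keeps only a reference in the span; the bucket name and span fields are hypothetical.

```python
import uuid
import boto3  # assumes S3-compatible object storage; the bucket name is made up

s3 = boto3.client("s3")

def attach_media(span: dict, file_path: str, content_type: str) -> None:
    # Upload the raw audio/video/image to object storage and keep only a
    # reference in the trace, so the span itself stays small.
    key = f"attachments/{uuid.uuid4()}"
    s3.upload_file(file_path, "my-trace-attachments", key, ExtraArgs={"ContentType": content_type})
    span.setdefault("attachments", []).append({
        "url": f"s3://my-trace-attachments/{key}",
        "content_type": content_type,  # lets the UI render an inline player
    })
```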
And is prompt management in Braintrust? It could be, yeah. The question was, is prompt management in Braintrust? It could be, or it doesn't have to be. Yeah.
Okay, perfect. Thank you so much for
your attention today.