The Future of Evals - Ankur Goyal, Braintrust

Channel: aiDotEngineer

Published at: 2025-08-09

YouTube video id: MC55hdWLq4o

Source: https://www.youtube.com/watch?v=MC55hdWLq4o

[Music]
[Applause]
Awesome. So today we're going to talk
a little bit about evals to date and
where we think evals are going to be
going in the future.
Also, for those of you who saw my brother
earlier, I'm going to do my best to
live up to his energy and
charisma.
But yeah, you know, it's been an
amazing almost two-year journey for us
at Braintrust. We have had the
opportunity to work with some of the
most amazing companies building, I
think, the best AI products in the world.
I'm blown away by how many evals
people actually run on the product. The
average org that signs up for Braintrust
runs almost 13 evals a day. Some
of our customers run more than 3,000
evals a day, and some of the most
advanced companies that are running
evals are spending more than two hours
in the product every day working through
their evals. And I think one of the
things that stands out to me is while we
have customers building some of the
coolest, most automated
AI-based products and agents in the
world, with evals
the best thing you can do is look at a
dashboard. And I think we have a pretty
cool dashboard in Braintrust, but still
it's just a dashboard that you look at,
and you walk away and think, okay, what
changes can I make to my code or to my
prompts so that this eval does better?
And I actually think that is all
going to change.
So today I'm excited to talk about
something called Loop. Loop is an agent
that we've been working on for some time
now that's built into Braintrust,
and it's actually only possible because
of evals. Every quarter for the last two
years, we've run evals on the frontier
models to see how good they are at
actually improving prompts, improving
data sets, and improving scorers. And
until very, very recently, they actually
weren't very good. In fact, we think
that Claude 4 in particular was a real
breakthrough moment. It performs
almost six times better than the
previous leading model.
So, Loop runs inside of Braintrust and
it can automatically optimize anything from your
prompts all the way to very complex
agents. But just as importantly, it
also helps you build better data sets
and better scorers, because it's really
the combination of these three things
that makes for really great evals.
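As a rough illustration of how those three pieces fit together, here is a minimal Python sketch of an eval: a data set of cases, a task standing in for the prompt plus model call, and a scorer. Every name here (Case, toy_task, exact_match, run_eval) is hypothetical and made up for this sketch; it is not Braintrust's API or how Loop works internally.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One row of a data set: an input and the output we expect."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Hypothetical scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def toy_task(text: str) -> str:
    """Stand-in for 'prompt + model call' (an assumption; no model is called)."""
    return text.upper()


def run_eval(
    task: Callable[[str], str],
    dataset: List[Case],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task on every case and return the mean score."""
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores) if scores else 0.0


# Tiny illustrative data set; the last case is written to fail the scorer.
dataset = [
    Case("hello", "HELLO"),
    Case("eval", "EVAL"),
    Case("loop", "Loop!"),
]

score = run_eval(toy_task, dataset, exact_match)
print(f"mean score: {score:.2f}")
```

In this framing, changing the task corresponds to editing a prompt, adding rows corresponds to improving the data set, and swapping exact_match for something stricter or looser corresponds to improving the scorer.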
This is a little preview of the UI.
You can actually start using it
today if you are an existing Braintrust
user or you sign up for the product.
There's a feature flag that you can just
flip on called Loop and start using it
right away. By default it uses Claude
4, but you can actually pick any model
that you have access to and start using
it. Whether it's an OpenAI model, a
Gemini model, or maybe some of you are
building your own LLMs, you can use
those as well. And as you can see,
it runs directly inside of Braintrust.
One of the things that we learned
from working with a lot of users is how
important it is to actually look at data
and look at prompts while you're working
with them. And we didn't want that to go
away when we introduced Loop. So
every time it suggests an edit to your
data or it suggests a new idea for
scoring or it suggests an edit to one of
your prompts, you can actually see that
side by side directly in the UI. Of
course, for the more adventurous among
you, there's also a toggle that you can
turn on that says, like, "just go for it,"
and it will go and optimize away,
which actually works really well.
So, just to recap: to date, evals
have been a critical part of building
some of the best AI products in the
world, but the task of actually doing
evaluation has been incredibly manual.
And I'm excited about how, over the next
year, evals themselves are going to be
completely revolutionized by the latest
and greatest that's coming out from,
you know, the frontier models themselves,
and we're very excited to incorporate
that into Braintrust. Please, if you're
not already using the product, try it
out. Try out Loop, give us your
feedback. We have a lot of work to
do, and we'd love to talk to you.
We're also hiring. So if you're
interested in working on this kind of
problem, whether it's the UI part of it,
the AI part of it, or the infrastructure
side of it, we'd love to talk to you.
You can scan this QR code. It
should be over there. Yeah, you can scan
the QR code and get in touch with
us. We'd love to chat. Thank you.
[Music]