2025 is the Year of Evals! Just like 2024, and 2023, and … — John Dickerson, CEO Mozilla AI

Channel: aiDotEngineer

Published at: 2025-08-06

YouTube video id: CQGuvf6gSrM

Source: https://www.youtube.com/watch?v=CQGuvf6gSrM

Thanks everyone for being here. I'm going to give this talk mostly from the point of view of having been a co-founder and chief scientist at Arthur AI for six years prior to joining Mozilla AI as CEO. I do want to say that Mozilla AI operates in the open source world: we provide open source AI tooling and we support the open source AI stack. Our end goal is to enable the open source community to be at the same table as a Sam Altman when talking about AI moving forward. If you're interested in that, it's not what this talk is about, but we can talk offline. This talk is about 2025 finally being the year of evals. As was said in my introduction, I've been in this space for a very long time. Arthur AI, for example, was and still is in observability, evaluation, and security, in both traditional ML and AI, through the deep learning revolution, then the GenAI revolution, and now the agentic revolution. And I think we're finally at the point where all of these companies are going to start seeing hockey-stick growth, which is exciting.
So here's the thesis of this talk. First, I see AI/ML monitoring and evaluation as two sides of the same coin: you can't do monitoring or observability without being able to measure, and measurement is the core functionality of evaluation.
This was not really top of mind for the C-suite until two things happened concurrently. One: AI became something that people who aren't a CIO or CTO could understand. The CEOs, the CFOs, the CISOs began to understand it, basically when ChatGPT came out. Simultaneously, there was a perfectly timed budget freeze across the enterprise, at least in the US, driven by fear of an impending recession. This was right before ChatGPT launched, around October or November, when most enterprises set their budgets for the next year. At that time there was a freeze, except for money that could be opened up for a specific pet project, and that pet project, because CEOs and CFOs now knew about it, was GenAI.
So that happened, and now we have the final vertex on this triangle, the one that is going to force evaluation to be top of mind this year: we have systems that are now acting for humans and acting for teams, as opposed to just providing inputs into larger systems. These three things together are why, as you saw with Braintrust, as we're seeing at Arthur, as you're seeing with Arize AI or Galileo and other big players in this space, we're starting to see a big takeoff.
Cool. So right now, this is what's happening. Year of the agent: we all hear about it, and we're hearing about it here at this conference. Agents are starting to make decisions and take actions, complex steps that lead toward an action, either autonomously or semi-autonomously. As a question on the last talk brought up, bringing humans into the loop is still obviously a very good idea in many systems, but we're getting closer and closer to full automation. And agentic systems are going into deployment now, in enterprises, in SMBs, in pet projects, and so on. What that means, per that last slide, is that this is also the year that we need evals.
Flash back to anytime up until about a year ago, when every year we were asking: hey, is this the year of ML? Is this the year of evaluations? Prior to these agentic systems coming out, we had machine learning models basically spitting out numbers that would then be ingested into a more complex system. That complexity erased the top-of-mind need to think about what was coming out of the model itself. People in this audience know that's very important, but for decision makers it would often get wrapped up into an opaque box. And that meant the ML part didn't really bubble up beyond the CIO's office, or beyond whatever system it was going into, which meant that typically that year was not the year for evals, because the need for evals wasn't obvious to the entire C-suite.
Cool. So let's take a quick step back in time. Before November 30th of 2022, the ChatGPT launch, ML monitoring was certainly a thing. Data science teams have long used statistical methods as part of larger systems to understand what's going on; this is core. Like I mentioned, though, there was a tenuous connection to downstream business KPIs. And at the end of the day, that's what gets your product bought in the enterprise: being able to make a sale about dollars saved or dollars earned, being able to connect the machine learning components specifically to a downstream business KPI.
There was a lot of lip service around AI/ML ROI from the C-suite, including CIOs and CEOs, but it was just lip service; in our experience, at least, it was still basically selling into the CIO, which made it hard to sell outside of that office. Now, obviously this is a large space. AI/ML monitoring really started up around 2012, I would say, with H2O, Algorithmia, and Seldon as the first generation of these companies, then WhyLabs, Aporia, Arize, Arthur, Galileo, Fiddler, Protect AI, and so on and so on. I've put the cutoff here at around mid-2022, before the GenAI revolution happened; there have obviously been companies founded after that, like Braintrust, who we just saw talking. And then there are the big players here as well: Snowflake, Databricks, Datadog, SageMaker, Vertex AI, Microsoft's products, and so on.
So people have been thinking about it, but it was never the thing. Again, it was rarely top of mind for the CEO, the CFO, and the CISO; it was never the issue. When we would talk to people, it was always: yes, we understand that we need this, but security is going to be a bigger issue, or latency is going to be a bigger issue, or some other more traditional technology problem is going to be the issue. It wasn't the machine learning model itself. I do a lot of due diligence for venture capitalists now, as a multi-time founder, and basically every pitch deck in this space from the mid-2010s onward had a slide that said: this is the year a CEO is going to get fired because of an ML-related screw-up. To my knowledge, it still hasn't happened. There have been some forward-thinking leaders, though. I have here the annual report from Jamie Dimon, head of JPMC. This came out in April of 2022, so it covered JPMC through their 2021 fiscal year, and he talks about the spend they have going into AI. But if you squint and look at these numbers, they're still comically small. In the consumer world, for example, he makes the statement that from 2017 through the end of 2021 they had put $100 million into AI/ML. That's not a huge amount of money for JPMC.
So keep sitting back in time, pre-ChatGPT, but let's now flip to the macroeconomic side of things. The economy started getting pretty dicey right up until ChatGPT launched at the end of November. A lot of enterprise budgets are set in October and November for the following year, and toward the middle and end of 2022 there were very deep fears about an impending recession. It didn't end up happening, but those fears made it so that most enterprises either froze or shrunk their IT budgets for 2023. So what that meant is, were it not for a particular tailwind called ChatGPT, we probably wouldn't have seen a lot of new technology being developed in the IT departments of these large enterprises in 2023.
Right? So this is sort of a bittersweet mix; I guess I just talked about a lot of this. It was bittersweet in the sense that it set us up to put the eye of Sauron on one particular, specific small pet project where a small amount of budget could be applied, and that small pet project came from our friends at OpenAI launching ChatGPT right before the holiday break. I'm convinced, just from talking to some C-suite folks across enterprises here in the US, that what happened is that CEOs and CFOs, people less technical in the computer science sense of the word, were now able to interact swiftly and easily with a single UI on the internet and get wowed by AI. I was also wowed by AI. In fact, Arthur was hosting an event with our Series A lead, Index Ventures, at NeurIPS, the major machine learning conference, the night ChatGPT came out, and it basically took over: a bunch of nerds in a room going, wow, this is very, very impressive. And that happened to everybody else too. Flip back to November 30th: you can make Eminem rap like Taylor Swift; oh hey, isn't this funny; oh hey, my mom did this joke where she wanted poetry written in the words of a rapper or whatever. This started to happen over and over again. And what that meant is that the discretionary budget that exists was unlocked, specifically for the CEO's pet projects, which were called GenAI.
So in 2023 we still had austerity forced on us because of those frozen, or even reduced, budgets. But the thing that was happening is that the only money going around that could be allocated was going specifically to GenAI. So everybody focused on this; it's a cool technology, everybody focused on it, and the science projects started to float around within the enterprise.
In 2024 we started to see GenAI-based applications going into production. Chat applications are the obvious ones: internal chat applications, internal hiring tools, things like that. That's because basically the only budget going into new projects in 2023 was going to GenAI. And as things went into production, primarily internally, in 2024, we had the folks who tend to dress in business suits asking questions around ROI, governance, risk, compliance, brand optics, and that kind of thing. So now we're starting to get a little bit closer to people outside of the machine learning, data science, and computer science world, outside the CIO's office, caring about evaluation. If I need a quantitative estimate of risk, then I need to do evaluation.
In 2025 we've seen scale-ups. Look at the revenue numbers for any frontier model provider; look at the revenue numbers for a lot of us in this room. Everything is really going up right now, and that's because of usage, which is great. It's also a function of the C-suite becoming comfortable talking about and putting large, real budget into AI. So 2023's and 2024's IT budgets, each set the preceding year, weren't frozen; they were earmarked specifically for AI applications and things along those lines. So we had science projects in 2023 go into production in 2024, and they're now shipping and scaling in 2025. And frankly, the technology has just gotten really amazing. All of us in this room are technical, but even I'm amazed every time a new model drops. The community has also really gotten behind this: open source has gotten behind it, and venture capital and big tech are writing huge checks into frontier model providers and so on. Everything is coming together in 2025.
And remember that third vertex: we have machine learning systems now moving toward autonomy. 2025, we all hear it, is the year of the agent; no question mark is needed here anymore. Now, a quick 30-second definition of an agent. As defined from the late '50s onward, agents need to perceive the environment, they need to learn, and they need to abstract and generalize. And unlike traditional machine learning, they're going to reason and act: we have reasoning models out there, and we have systems acting in virtual environments or cyber-physical environments. What that means is that a lot of complexity is introduced into the system, and a lot of risk is introduced into the system, and that's great for those of us in evals.
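That perceive-learn-reason-act loop can be sketched in a few lines. Everything below (the toy environment, the escalating policy) is a hypothetical stand-in for illustration, not any particular framework's API:

```python
# Minimal sketch of the classic agent loop: perceive -> reason -> act -> learn.
# All names here are illustrative stand-ins, not a real framework's API.

class EchoEnvironment:
    """Toy environment whose observation is the last action taken."""
    def __init__(self):
        self.last_action = 0

    def observe(self):
        return self.last_action

    def step(self, action):
        self.last_action = action
        # Reward larger actions, capped at 3.
        return min(action, 3)

class SimpleAgent:
    def __init__(self):
        self.memory = []           # crude "learning": remember past rewards

    def reason(self, observation):
        # Policy: escalate the action one step past the last observation.
        return observation + 1

    def learn(self, reward):
        self.memory.append(reward)

def run_episode(agent, env, steps=5):
    rewards = []
    for _ in range(steps):
        obs = env.observe()         # perceive
        action = agent.reason(obs)  # reason
        reward = env.step(action)   # act
        agent.learn(reward)         # learn
        rewards.append(reward)
    return rewards

rewards = run_episode(SimpleAgent(), EchoEnvironment())
print(rewards)  # -> [1, 2, 3, 3, 3]: rewards cap once the action exceeds 3
```

The point of the sketch is the shape of the loop, and that evaluation has to happen across the whole loop, not just at the model call inside `reason`.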
So at the end of the day, the thing that really matters, like I mentioned, when you're selling any product, not just the products in this room, into an enterprise or an SMB, is being able to attach your product to some downstream business KPI: risk mitigation, revenue gains, losing less money, whatever. Evaluations let you do this, because you're quantifying things. And they're finally a first-class discussion point, which is fantastic. So the CEO, from November 30th, 2022 onward, now at least knows what the tech is. I'm not saying they know what attention means, but I am saying they know some of the capabilities of these generative models, they know some of the capabilities of agentic and multi-agent systems, and they're comfortable talking to experts about it, allocating budget for it, and talking to their boards of directors and shareholders about it as well.
We have the CFO who, because the CEO cares about this stuff, obviously also needs to care about it, but she's going to care about the impact to the bottom line. That's what a CFO does: allocation and budget planning. They're going to need to write some numbers into an Excel spreadsheet, and those numbers have to come, in part, from quantitative evaluation.
CISOs now see this as a huge security risk and opportunity. For those who haven't sold into enterprises before: CISOs are typically willing to write checks, smaller checks especially for startups, more quickly and with less overhead than a CIO would. CIOs tend to have a bigger org, for one, and also tend to have a lot more process; CISOs tend to be a bit more scrappy and willing to try out tools. And this actually happened before the agentic revolution. When GenAI started coming up, CISOs said: hey, hallucination detection, prompt injection, things like that are firmly in the security space. That's why you've seen a lot of guardrail products, including Arthur's and our competitors', going into the CISO's office and signing a lot of deals.
The CIO? Corporate; they're still on board, and they want to keep their jobs. And the CTOs always want standards: they need to make these decisions based on numbers, and those numbers are coming in part from standards like OTel. And that's great. I've listed a lot of the C-suite here; I haven't talked about chief strategy officers or others, but the CEO, the CFO, the CTO, the CIO, and the CISO control a lot of budget, and now they're all willing to talk about, and all aligned on, the need to understand evaluation of AI.
Great. So, a quick aside, and you should keep me honest here. All the evaluation companies, observability companies, monitoring companies, security companies, whatever you want to call them, have shifted into multi-agent systems monitoring. The point that you should monitor the whole system, not just the one model being used by one particular agent, is well taken and well understood in industry and in government, and I think it's great to have that as a top-line discussion point.
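To make that whole-system point concrete, here's a toy sketch, with all names hypothetical, of why per-agent metrics can look healthy while the end-to-end system is not:

```python
# Toy illustration of system-level vs. per-model monitoring in a
# multi-agent pipeline. All names are hypothetical stand-ins.

from dataclasses import dataclass

@dataclass
class Event:
    agent: str       # which agent produced this step
    ok: bool         # did the step pass its own check?
    latency_ms: int

def per_agent_success(events):
    """Per-model view: success rate for each agent in isolation."""
    rates = {}
    for e in events:
        total, good = rates.get(e.agent, (0, 0))
        rates[e.agent] = (total + 1, good + e.ok)
    return {a: good / total for a, (total, good) in rates.items()}

def system_success(traces):
    """System view: a trace only succeeds if *every* step succeeded."""
    ok = sum(all(e.ok for e in t) for t in traces)
    return ok / len(traces)

traces = [
    [Event("planner", True, 120), Event("coder", True, 800)],
    [Event("planner", True, 100), Event("coder", False, 950)],
]
events = [e for t in traces for e in t]
print(per_agent_success(events))  # -> {'planner': 1.0, 'coder': 0.5}
print(system_success(traces))     # -> 0.5: only half the traces succeed end to end
```

Each agent can look fine in isolation while the composed pipeline fails, which is exactly why monitoring the whole system matters.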
But keep me honest here. There was an article that came out in mid-April in The Information showing some leaked revenue numbers for a variety of startups in the evaluation space, Arize, Galileo, Braintrust, and so on, but they were lagged by about six or eight months. Just from talking to friends in the space, those numbers are no longer representative of what folks in this area are making. So let's see what The Information leaks in early 2026, and maybe we'll see something like "revenue no longer lags at AI evaluation startups," because this is the year for AI evaluation.
Great. So I'll leave some time for questions. I did mention Mozilla AI is not firmly in the evaluation space, but we do have a very nice open source, not-at-all-monetized project, what we're calling a LiteLLM for multi-agent systems. If you're playing around with different multi-agent system frameworks, check out any-agent. We implement a lot of them for you under a unified interface, so for people in this room, that might be a fun project to play around with. Thank you; I'll take three minutes for questions.
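The value of a unified interface over agent frameworks can be sketched generically. To be clear, this is not any-agent's actual API; it's a hypothetical illustration of the adapter pattern such libraries tend to use:

```python
# Generic sketch of a unified interface over multiple agent frameworks,
# in the spirit of tools like any-agent. This is NOT any-agent's real API;
# the framework adapters here are hypothetical stand-ins.

from abc import ABC, abstractmethod

class AgentAdapter(ABC):
    @abstractmethod
    def run(self, prompt: str) -> str: ...

class FrameworkAAdapter(AgentAdapter):
    def run(self, prompt: str) -> str:
        # A real adapter would invoke framework A's agent loop here.
        return f"[framework-a] {prompt}"

class FrameworkBAdapter(AgentAdapter):
    def run(self, prompt: str) -> str:
        return f"[framework-b] {prompt}"

_REGISTRY = {"framework-a": FrameworkAAdapter, "framework-b": FrameworkBAdapter}

def create_agent(framework: str) -> AgentAdapter:
    """One factory call, regardless of which framework backs the agent."""
    try:
        return _REGISTRY[framework]()
    except KeyError:
        raise ValueError(f"unknown framework: {framework!r}")

# Swapping frameworks is a one-string change; evaluation code stays identical.
for name in ("framework-a", "framework-b"):
    print(create_agent(name).run("summarize this document"))
```

The design payoff is that evaluation harnesses are written once against `AgentAdapter` and reused across every framework behind the registry.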
>> Okay, thanks; great presentation. I have a question about enterprise value: most of the evaluations in GenAI require domain expertise. For example, if you're building a multi-agent system to do financial investment analysis, say a discounted cash flow spreadsheet, is the agent doing it correctly or not? I'm trying to understand how that problem is getting solved, because most of us are coming from an ML background where it was structured data, but this is a lot of unstructured data, and you have to measure the quality, like, is it acting like a human, right?
>> Yeah, it's a great question. Actually, I have a paper in Nature Machine Intelligence talking about some of the problems that come up when you do persona-based agents, where I say: act like a farmer in Ohio in your mid-40s, and so on. There's value in that, but you can't do it perfectly. My gut reaction is this: there was a leaked spreadsheet from Mercor, a company that can hire in experts, showing $50 an hour, $100 an hour, $200 an hour for experts to be hired by, for example, Google or Meta or large banks to do kind of what you're saying: an expert sitting alongside the multi-agent system, basically sitting next to the intern who's going to come in and take your job in a year or two, or change your job; maybe "take" is not the right word. They're doing that expensive human validation in lockstep with the multi-agent system. If you're doing discounted cash flow analysis, the kind of thing where (a) you can make or lose a lot of money and (b) you can lose your job if you get it wrong, it's worth spending that large amount of money on human validation. The question for everyone, though, is what that looks like in five years once that data is incorporated into the systems themselves, and that can be a moat as well. When you talk to anyone in the eval space, it's the data set creation and the environment creation that matter more than anything, which is the point you're getting at. If I spend a bunch of money to build a very good, competitive DCF environment or whatever, that can help me versus my competitors. So there is capex going into that.
>> Yeah, go ahead.
>> Thanks for the presentation. Do you have a rough timeline for when you think evals will primarily be driven by GenAI or LLMs?
>> Yeah, the LLM-as-a-judge paradigm; I think this was talked about in the previous talk as well. We see it getting used in practice, though there are issues with it. We have a paper in ICLR from last month talking about some of the biases that LLMs as judges have versus humans, on things like conciseness or helpfulness and some of those anthropic words. But the long and short of it is that it solves the data set creation problem in some sense: you give a persona to an LLM, and it's a poor man's version of a human doing the judging. So we see a lot of people using that in products right now. But toward the last question that was asked, you do need to make sure you're validating this and making sure you're not going off in some weird biased direction.
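A minimal LLM-as-a-judge harness looks something like the sketch below. `call_llm` is a hypothetical stand-in for whatever model client you use (stubbed here so the example runs), and the persona and rubric are illustrative:

```python
# Sketch of the LLM-as-a-judge pattern described above.
# `call_llm` is a hypothetical stand-in for a real model client; it is
# stubbed here so the example is self-contained and runnable.

JUDGE_PROMPT = """You are a meticulous financial analyst (persona).
Rate the ANSWER to the QUESTION on a 1-5 scale for correctness.
Reply with only the integer score.

QUESTION: {question}
ANSWER: {answer}"""

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here. This toy
    # heuristic "judges" answers that mention discounting as correct.
    return "5" if "discount" in prompt.lower() else "2"

def judge(question: str, answer: str) -> int:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score

# Per the caveat above: spot-check judge scores against human labels to
# catch judge biases (e.g., toward verbosity or a particular style).
q = "How do you value future cash flows?"
print(judge(q, "Discount them back to present value."))  # -> 5
print(judge(q, "Just add them up."))                     # -> 2
```

Even in production, the validation step in the last line of comments is the part that keeps the judge honest.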
>> That's time.
>> Oh! Happy to chat offline.