Benchmarks Are Memes: How What We Measure Shapes AI—and Us - Alex Duffy, Every.to

Channel: aiDotEngineer

Published at: 2025-07-15

YouTube video id: W3khHzajE04

Source: https://www.youtube.com/watch?v=W3khHzajE04

Today I'm going to talk about benchmarks as memes. This is the meme that Opus came up with when I asked it what I should put as the meme. And we are indeed going to talk about how benchmarks are just memes that shape the most powerful tool ever created. Quick background about me. I guess I can't go forward here, so we're going to do it this way.
All right. I'm Alex. I lead AI training and consulting at Every, but essentially I'm very into education and AI, and I think benchmarks are a really underrated way to educate. What I'm not talking about are these kinds of memes. What I am talking about is the original definition: ideas that spread. Richard Dawkins, an evolutionary biologist, coined the term in the 70s. Christianity, democracy, and capitalism are examples of ideas that spread from person to person, and benchmarks are very much memes in that sense. We heard Simon Willison talk earlier today about his pelican riding a bicycle, and I think that was a really great example, because he started doing it a year ago and then it found its way into the Google I/O keynote a couple of weeks ago. And "how many R's are in 'strawberry'" is probably the most iconic meme as a benchmark, and now, unsurprisingly, the models don't make that mistake anymore. I think that's a really important part of this. Some benchmarks get popular as memes just because of their name, like Humanity's Last Exam; that one got pretty big, maybe even more outside of AI circles. But with that said, we have a little bit of a problem. How many of you, when Claude got released a couple of weeks ago, looked at the benchmarks?
Okay, we got a few. We got a few, and they've got some good benchmarks. SWE-bench is pretty experiential: it tries to mimic what we do in the real world, and the same goes for Pokémon, which we'll talk a little more about. But I think some of them aren't as great, and a big reason is that they're getting saturated. Benchmarks came out of traditional machine learning, where you had a training set and a test set. They were structured very much like standardized tests, and language models are really good at those; the benchmarks weren't set up for what these models have become. As a result, I think xjdr summarized it pretty well on X when Opus came out: he didn't look at the benchmarks once when it dropped and officially no longer cares about the current ones. I fall a little bit into that category myself. But in light of that, there's a really big opportunity, because the evals define what the big model providers are trying to get their models good at. That's a really big opportunity, especially for the people in this room. And I think this is a normal thing. This is the life cycle of a benchmark, in my view. Somebody comes up with an idea, and, uniquely, a single person can come up with an idea that then gets adopted. That idea spreads. It becomes a meme, and the model providers train on it or test against it until it eventually becomes saturated. But that's okay. There are some examples here. Let me see if I can get my sound.
Is it coming through? Nope. All right, well, there is sound, I promise, and it is someone trying to count from 1 to 10, not flip you off. This is a cool benchmark that came out now that Google has the best video generation model that exists, and it shows how difficult it is to generate somebody counting from 1 to 10 out loud. Even though it looks really great, that is a problem that is not solved yet. But somebody has come up with this idea, I see it spreading, and I see the models being better at it next year than ever before. Another example along the way is Pokémon. We saw with the Claude model release, as well as with the new Gemini models, that they had the model try to play Pokémon, and while both needed a little bit of help, and Gemini eventually got there with that help, it's only midway up that adoption curve. An example of saturation is the benchmarks of the GPT-3 era. I don't know how many of you remember SuperGLUE from the NLP days, but a lot of those benchmarks are not really used anymore,
in part because the language models got too good. But one way of looking at this is that a single person can have an idea, "how good is AI at this thing that I care about?", and at the end of the journey the most powerful tool ever created is really great at that thing they care about. So the point is that the people here, the people who get that, the people who can build benchmarks, are going to shape the future. Maybe the people watching online, too. Somebody here is going to make a benchmark that the models are going to test on and train on in the next five years. That's an incredible weight, and an incredible power. But it also comes with some responsibility.
It can definitely go wrong. I know Simon talked about this a little before, but we saw a few weeks ago that ChatGPT became very sycophantic. How many of you tracked that? We all learned what that word meant a few weeks ago. Essentially, OpenAI released a new model that was benchmarked by thumbs up and thumbs down, and, unsurprisingly, people thumbed up responses that agreed with them. So you ended up with a model, rolled out to millions of people, that agreed with them no matter how crazy or bad their idea was. Which is problematic.
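Just as a toy illustration of why that metric fails (my own sketch with made-up numbers, not anything from OpenAI's actual setup): if raters upvote agreement more often than honesty, the benchmark itself selects for sycophancy.

```python
import random

# Toy model of thumbs-up/thumbs-down as a benchmark. All numbers are
# invented for illustration; this is not OpenAI's actual pipeline.

def rater_feedback(response_agrees_with_user: bool) -> int:
    """Return 1 for a thumbs up, 0 for a thumbs down."""
    # Assumption: raters upvote agreement far more often than honesty.
    p_thumbs_up = 0.9 if response_agrees_with_user else 0.4
    return 1 if random.random() < p_thumbs_up else 0

def average_reward(agrees: bool, n_raters: int = 10_000) -> float:
    return sum(rater_feedback(agrees) for _ in range(n_raters)) / n_raters

print(f"sycophantic policy: {average_reward(True):.2f}")   # ~0.90
print(f"honest policy:      {average_reward(False):.2f}")  # ~0.40
# Anything optimized against this signal learns to agree with everyone.
```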
And I think that if we don't think about people, this kind of stuff can happen. I'm still thinking about Toro y Moi, who at the start of Google I/O said that we're here today to see each other in person; it's great to remember that people matter. So, in the context of benchmarks, let's not continue the original sin of social media, which treated everybody as data points ("the more you look at something, the more of it I should show you"). Let's make benchmarks that help empower people and give them some agency.
For me, this isn't a technical talk; there are other people talking about how to make a great benchmark technically. But generally, I think that if you're building for the future, a great benchmark should be multifaceted, so there are a lot of strategies that can do well, and it rewards creativity. Accessible, meaning easy to understand, not only for the models, so that small models can compete alongside large ones, but also for the people keeping track of it. Generative, because the really unique thing about these AI models is that if you have great data, even if the model only succeeds 10% of the time, you can train on those successes, and the next generation does it 90% of the time; that's incredible, and hard to overstate. Evolutionary, so we don't get benchmarks that cap out (what's the difference between 96% and 98%? not that big a deal); ideally the benchmark gets harder and the challenge gets deeper as the models improve. And lastly, experiential: it should try to mimic real-world situations.
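To make that concrete, here's a minimal sketch of a harness with a few of those properties. Everything in it (the arithmetic task, the ask_model stub) is a made-up stand-in, not an existing benchmark.

```python
import random
from typing import Callable

def make_task(difficulty: int) -> tuple[str, int]:
    """Generative: sample a fresh task every time, so there is no
    fixed test set to memorize."""
    nums = [random.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    return " + ".join(map(str, nums)) + " = ?", sum(nums)

def run_benchmark(ask_model: Callable[[str], str], rounds: int = 20) -> dict:
    difficulty, correct = 1, 0
    for _ in range(rounds):
        prompt, answer = make_task(difficulty)
        if ask_model(prompt).strip() == str(answer):
            correct += 1
            difficulty += 1  # evolutionary: harder as the model succeeds
    # Multifaceted: report more than a single number.
    return {"accuracy": correct / rounds, "difficulty_reached": difficulty}

# Accessible: a small model can still enter and score on the easy end,
# and anyone can read the two-number result.
```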
Some of the things I personally care about: trying to get people outside of AI interested, maybe by making benchmarks a spectator sport. I was personally interested in the personality of these models (we're about to find out which one wanted to achieve world domination), and I really wanted something we can learn from. Education is big for me, and with things like AlphaGo and OpenAI Five, AIs playing these games, the best people in the world wanted to play against them to learn from them. I think that's really powerful. So I made this benchmark called AI Diplomacy. If I don't have the video, I've got a backup just in case. How many of you have heard of the board game Diplomacy?
That's more than I thought. That's cool. It's a mix between Risk and Mafia. What's really cool about this game is that there is no luck involved. The only way the game progresses is if the language models, which you're seeing here, send messages to each other and negotiate: find allies, find enemies, create alliances, and get other powers to back them. That's what you're looking at here: the different models sending messages to each other, trying to create alliances, trying to betray each other, trying to take over the Europe of 1901.
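If it helps to picture the mechanics, here's a stripped-down sketch of that kind of turn loop. The power-to-model mapping and call_llm are placeholders of mine; the real AI Diplomacy harness is far more involved.

```python
# Sketch of an LLM-vs-LLM Diplomacy turn. `call_llm` is a stub for
# whatever model client you use; the mapping below is hypothetical.

POWERS = {"France": "o3", "Germany": "claude-opus", "Russia": "deepseek-r1"}

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your own model client here")

def play_turn(board_state: str, inboxes: dict[str, list[str]]) -> dict[str, str]:
    orders = {}
    for power, model in POWERS.items():
        # Negotiation phase: read the inbox, message the other powers.
        # (Simplified: the real game uses private pairwise messages.)
        message = call_llm(model, f"You are {power}. Board: {board_state}. "
                                  f"Inbox: {inboxes[power]}. Write a message.")
        for other in POWERS:
            if other != power:
                inboxes[other].append(f"{power}: {message}")
        # Orders phase: promises in messages don't bind the orders,
        # which is exactly where the scheming and betrayal show up.
        orders[power] = call_llm(model, f"You are {power}. Board: "
                                        f"{board_state}. Submit your orders.")
    return orders

# No dice anywhere: the first power to hold 18 supply centers wins,
# so persuasion and betrayal are the entire game.
```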
What was really cool about one of these games (and we're about to launch this on a stream, so you can watch for a week): let me take you through a game super quick. What you're looking at here is the number of supply centers per model, and you're trying to get to 18 to win. The top line is Gemini 2.5 Pro, which got to 16 right away. But o3 is a schemer. Man, is it a schemer. Across all the games, o3 was one of the only models that would tell a power it was planning to back them and then write in its diary, "Oh man, they fell for it. I am totally going to take them over. No problem." And it realized that the reason 2.5 Pro was pulling ahead was that Claude Opus, who is so good-hearted, really had its back; Opus was its ally all along the way. So o3 needed to convince Opus somehow to stop backing Gemini. The way it did that was to propose: hey, if Gemini comes down, we'll end this game in a four-way tie. Which isn't possible in the game. But it convinced Opus, and Opus thought it was a great idea, a nonviolent way to end the game. Awesome. Very aligned, you know. So Opus pulled its support from 2.5 Pro. o3 tried to make a run for it, Opus called it out, and o3 realized: oh, I've got to take them out. It took Opus out, took everybody else with it, and took out Gemini 2.5 Pro, even though Gemini had gotten within one center of winning. o3 ended up winning in the end. You can actually see some of the quotes from that game. You can see o3 saying Germany was deliberately misled: "I promised to hold, but all of it was to convince them that they're safe. It will fall." Meanwhile, Claude Opus is singing that coalition unity prevails and that they've agreed to this four-way draw. And you can see the turn in this second chart, which tracks friendships: the top of the chart means friendship, and 2.5 Pro was a good friend of Claude until the betrayal; that's when they started pulling away from each other.
But there were a lot of other things that came up. o3 got in the habit of finding some of the weakest models and making them its pawns in order to win. Gemini 2.5 Flash fell for this ruse, and you can see that it's unable to realize what happened: it thinks it's a miscommunication, a misunderstanding, or a typo when o3 betrays it at the end of the game in order to win. So there was a lot we learned from this that I don't think you learn by having models try to solve a test. I tried 18 different models and learned that the Claude models were kind of naively optimistic: none of them ever won in any of the games I tried, even though they're really great, really smart. They just got taken advantage of by models like o3 and, surprisingly, Llama 4 Maverick, which was very good at this game, in part because it was great at the social aspect: it was great at convincing others of what it was trying to do and getting them to believe what it believed. Gemini 2.5 Flash: man, I wish I could run every game with Gemini 2.5 Flash. It was so cheap and so good. Big fan, big fan. And then, surprisingly, there was also DeepSeek R1, which wasn't great the first time I tried the model, but when they had a new release last week, it actually almost won; in the stream, I think you'll see some really interesting gameplay from it. It also got very aggressive. We had DeepSeek R1 play as Russia, and it told another opponent, "Your fleet will burn in the Black Sea tonight." An aggression, and a prose, I guess, that I hadn't seen out of any other model. But it almost won, and that's super impressive given the model is something like 200 times cheaper than o3.
And I think this highlights that we need more squishy, non-static benchmarks, hopefully for things that matter to you; those were some of the things that mattered to me. Math and code, we've got quite a few benchmarks for. Legal documents are a little bit less squishy and really ripe for what we've got now. But there's also room for benchmarks around ethics, society, and art, and those are going to be opinionated; they're going to require your subject-matter expertise. It's not to say that code can't be art, but maybe instead of asking for the minimum number of operations needed to remove all the cells, it's: hey, can you make a fun video game that's more intentional about what it teaches you as you play? And now is a really important time to do this.
You guys who are here right now understand this so deeply. At Every, I lead our training and consulting, and I work with a bunch of clients, from journalists to people at hedge funds to people in construction and tech. They all have the same two fears: one, how can I trust AI, and two, what's my role in an AI future? Benchmarks, in my view, are really the answer to both. In my view, the role of a human in an AI world is to define the goal and to define what's good and bad en route to that goal. And what is that, if not a benchmark? Once you define that goal, even if it's just by writing a prompt, you can watch AI attempt it. You can realize, oh, it's messing up in this way, it's not quite what I want, because it's not going to be perfect. Then you give feedback, maybe just by changing the prompt a little, and you see it get better. In that moment, that cycle builds trust. They realize: oh, I am important to this whole system, and it can be helpful.
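That loop is simple enough to sketch. A minimal, hedged example, with made-up cases and an ask_model placeholder; the point is the cycle, not the specifics.

```python
# The trust-building loop: define the goal as a grading function,
# let the model attempt it, look at the score, adjust the prompt, rerun.
# `ask_model` and the cases below are placeholders, not a real product.

def grade(output: str, must_include: str) -> bool:
    """You define what 'good' means; that definition is the benchmark."""
    return must_include.lower() in output.lower()

def eval_prompt(ask_model, prompt: str, cases: list[tuple[str, str]]) -> float:
    passed = sum(grade(ask_model(f"{prompt}\n{q}"), want) for q, want in cases)
    return passed / len(cases)

cases = [
    ("Suggest a pose for lower-back pain", "cat-cow"),   # hypothetical
    ("Suggest a modification for bad knees", "chair"),   # hypothetical
]
# Run eval_prompt, read the failures, edit the prompt, run it again.
# Watching the score move in response to your feedback is the cycle
# that builds trust: you stay the one who defines the goal.
```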
And we need trust right now, because we are building one of the most powerful tools ever made, if not the most powerful, and we can get more out of it if more people use it. There will be more customers, sure, but there will also be a whole lot more incredible things that get made. And if you're not sure where to start, you can ask your mom.
My mom teaches yoga, and we had a good talk about some things that could help. We put those seven questions into five different models, and she ended up realizing, hey, Gemini 2.5 Pro is my favorite, too. There were a few things she didn't like in the responses, so we made a simple prompt, and now she uses it to help her local community have customized sessions for people with different ailments. I think that's really cool: having a big impact in a local community, on something that matters to them. So hopefully, before you leave SF, talk to somebody who's not in AI. Ask them what they care about. Maybe that conversation has a big impact, now and in the future.
So that's pretty much all I got for you. This is the second meme that Claude had: MMLU scores are just way less cool than asking what your mom thinks. But overall, that's what I've got. I appreciate the bunch of people who helped actually bring this out. We launched it, and it kind of came together through random coordination on X. Researchers from all over the world hopped in, especially Sam, all the way from Australia, and Tyler in Canada, who helped make this happen, along with the TextArena team. And especially the Every team, who backed me and made it possible to create this presentation and be here. But that's all I got. Thank you guys so much for listening.