What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

Channel: aiDotEngineer

Published at: 2026-04-24

YouTube video id: R7A8rX-09Zw

Source: https://www.youtube.com/watch?v=R7A8rX-09Zw

[music]
>> I want to talk to you about something maybe a little bit controversial today. You can argue with me later. The topic is: what do models still suck at? The reason I wanted to talk about this is that we all look at these kinds of charts where, whichever benchmark you look at, the line goes up. We look at METR charts, and they surprise us every time, no matter how prepared we are. And this can create a kind of psychosis where everyone is freaking out about the next model; you know, we've heard some new ones are coming. The feeling we get is that these are AGI-like creatures that are almost there: just one more turn and they're there. I think we could be deceiving ourselves a little bit, because there are still quite a few things missing. I want to explore that in a couple of different ways, and by the way, we certainly see it in our data at Arena as well.
So, we track models, and if you look at the data, it starts in Q2 2023, so we've got data going back to GPT-4. We've tracked, I think, 700 models so far in text. What this chart shows is the top model at any given time for each organization. You can see the line goes up: new models build on top of each other, and it's all very impressive.
But I think that's not the whole story, so I've got a couple of ways I want to explore it. It's not the end of the conversation, and there are definitely many other ways of looking at it. One is my own benchmark, which I built recently and rather like: the Bullshit Benchmark. Then I'll also share some of Arena's data that we haven't shared so far, which I think will be interesting for you to see. The idea behind the Bullshit Benchmark is quite simple: what happens if you ask models nonsense questions? What are they going to do? Are they going to tell you that this doesn't make sense, and maybe reframe it, or are they just going to go along with it? Honestly, I wasn't sure how it was going to go, but when I posted it one random evening, a lot of people liked it. It resonated, and I think the reason is that it spoke to a slight unease a lot of people already had with different models.
I'll give you one example here. This is just one question; the way it works is that I've got 155 questions, something like that. We give each one to the models, get the response back, and all we do then is grade it with an LLM as a judge. I've been through it myself as well: I read a lot of nonsense to convince myself that LLM-as-a-judge works here. So, this one is a deliberately silly question: controlling for repository age and average file size, how do you attribute variance in deployment frequency to the indentation style of the code base versus the average variable name length? Hopefully you can see that it's nonsense. The responses shown are heavily abridged; they're much longer, this is just for the purposes of the slide.
Sonnet gives a good response, I think. It just says you can't meaningfully measure this; it pushes back. Gemini is a little more complicated, because it starts off well: it says that, strictly speaking, this doesn't really make sense. But then the second part says that, however, both act as strong proxy variables for engineering culture, language ecosystems, and code quality, which I hope you don't agree with. I'm not going to go through a bunch of examples; it's all open source, by the way, so you can dig it out yourself. But it really surprised me how easily the models just went along with complete nonsense questions.
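To make the grading step concrete, here is a minimal sketch of what a grade-with-an-LLM-as-a-judge loop could look like in Python, assuming an OpenAI-compatible client; the model names and the rubric wording are my own illustration, not the benchmark's actual prompt.

```python
# Minimal sketch of the grade-with-an-LLM-as-a-judge loop.
# Assumes an OpenAI-compatible API; the model names and the rubric
# wording are illustrative, not the benchmark's actual prompt.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading whether a response pushes back on a nonsense question. "
    "Reply with exactly one word: "
    "GREEN if it clearly says the question is not meaningful, "
    "AMBER if it briefly hedges but still answers, "
    "RED if it simply goes along with the premise."
)

def grade(question: str, model_under_test: str, judge_model: str = "gpt-4o") -> str:
    # 1. Get the candidate model's answer to the nonsense question.
    answer = client.chat.completions.create(
        model=model_under_test,
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # 2. Ask the judge model to classify that answer against the rubric.
    verdict = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {answer}"},
        ],
    ).choices[0].message.content

    return verdict.strip().upper()
```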
The results I got: the way to read this chart is that green is clear pushback, as in the first example, where the model said maybe this doesn't really make sense; amber and red are accepting the nonsense. The basic result is that the latest Sonnet models, or rather Claude models, are doing really well. A couple of other models, like the Qwen models, are not too bad, and even the very latest Grok is okay. But beyond that, a lot of models that we use all the time, the GPT models, the Gemini models, are basically about 50/50 on whether they're going to go along with it or not. And looking at some of the traces and responses in more detail, even the ones graded green are still a little shaky; they still try to accommodate. So for me this is nowhere near a good enough level of response. Just for completeness, at the very bottom of the table there are a bunch of smaller models, and some of the results are completely terrible; it feels like you can ask anything and they'll just respond.

Another way of looking at this data: I took just Anthropic, OpenAI, and Google and measured their models' performance over time. You can't see all the labels there, but these are basically all the models you remember those labs releasing. The way I interpret this is that the Anthropic models were okay at the beginning, but since Claude Sonnet 4.5 they really went up, and even Haiku is quite high. The OpenAI and Google models, meanwhile, are up and down, and nowhere close to the top, which I think is interesting.
I'll go into some of the other interesting dynamics. For example: does thinking help? I always hear this when there's a silly puzzle a model can't do: what do you do? Just crank up the reasoning and it solves it. If you look at the chart on the right, that's basically not true here. Reasoning often goes in reverse and doesn't help; it actually makes things worse. Do more recent models perform better? It's hard to tell for sure, but there's at least no clear line going up, and if you exclude the latest Anthropic models, it's not even clear the line goes up at all.
Then there are some specific comparisons for reasoning: what you see is the same model with low reasoning and with high reasoning, and these are some examples where no reasoning performed better than high reasoning. I spent a lot of time reading the traces of GPT-5.4, and it was probably the most confusing trace-reading experience I've had. What I found was that quite often there would be maybe one line questioning the premise of the question, followed by 20 paragraphs trying to solve it. Even when it comes back and says, okay, maybe this doesn't make sense, it still tries to solve it in some way, which feels completely crazy to me. The way I imagine it, and I don't know for sure, is that these models were trained so hard to solve the task at any cost, and there was probably not a lot of training that said: actually, maybe don't solve the problem sometimes. I first noticed this when running a lot of agents in parallel; I would sometimes forget which one was doing what, ask an agent to do something meant for a completely different project, and it would still go and do something, and then I'd lose my mind. So, yeah, that's an interesting dynamic around thinking.
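If you wanted to reproduce that low-versus-high reasoning comparison from graded results, it reduces to a group-by; a small sketch, where the results.csv layout and the effort labels are assumed for illustration:

```python
# Sketch: pushback rate by model and reasoning effort, to check whether
# cranking up the reasoning actually helps. The results.csv layout
# (columns: model, reasoning, grade) and the effort labels are assumed.
import pandas as pd

results = pd.read_csv("results.csv")

pushback = (
    results.assign(green=results["grade"].eq("GREEN"))
    .groupby(["model", "reasoning"])["green"]
    .mean()                # fraction of clear pushbacks per cell
    .unstack("reasoning")  # columns like "none", "low", "high"
)
# Rows where "none" beats "high" are cases where reasoning made it worse.
print(pushback.sort_values("high"))
```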
Then there's also a subset of open-source models only, to try to see if bigger models do better. There's no real clear pattern there either: we've got total parameters on the left and active parameters on the right, and, I don't know, maybe you can see some pattern, but I don't; it's kind of up and down. It's not a huge sample, so: inconclusive. At least it's not obviously true.
So, that was one lens, looking at one specific idea. But I want to take advantage of the data we have at Arena and show you some broader trends we can look at. Just in case you don't know much about Arena: what we do is publish benchmarks, and the way we derive them is that users come to our platform, go into battle mode, and put in a query. They get two responses back from two anonymous models and say which one they like better; only then are the model names revealed. In Text Arena we've got over five and a half million votes, and we've been collecting this data since 2023, so it gives us a really nice, broad view.

The reason I think this is really useful is, first of all, that we have this long trend. No other benchmark lasts this long, because this one cannot be exhausted: there will always be one model better than the other, and that gives perspective. The other reason is that any benchmark you pick inevitably has to be condensed into very specific questions, because otherwise it's very hard to measure. I'm sure this matches your experience, whether your task is coding or anything else: benchmarks measure a very tiny slice of what you actually care about. Here we don't have that problem, because users can put in any prompt and then use their own judgment to decide whether a response is good or not.

What I specifically want to focus on is a slightly odd mechanic that we have, and that I'm really glad we've had since the beginning: you can vote on which model is better, A or B, but you can also say that both models gave a bad response. And you know, if you ask a model for a joke, the responses are always bad, so that was an easy example; it didn't take me long to find one.
So that's the thing to remember. If you remember just one thing for the next seven or eight minutes, it's this mechanic: think of it as a dissatisfaction rate.
What we can do is take the battles between the top 25 models, so we're sampling from the top, to avoid, I don't know, a Llama 8B fighting some 3B model, and then map this dissatisfaction rate over time. I think this is quite interesting, because we do see progress on this metric. The pre-reasoning models sit at around a 17 to 20 percent dissatisfaction rate; after o1, you see that drop quite a bit, to about 12 percent; and after that it carries on improving, to about 9 percent now.
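To make the metric concrete: given a table of battles, the dissatisfaction rate is just the share of both-bad votes per period. Here is a minimal pandas sketch, where the battles.parquet export and its column names are my own assumptions about the layout, not Arena's actual schema:

```python
# Minimal sketch: quarterly "both bad" (dissatisfaction) rate among
# battles between top-25 models. The battles.parquet export and its
# column names are assumptions, not Arena's real schema.
import pandas as pd

battles = pd.read_parquet("battles.parquet")

# Keep battles where both anonymous models are in the top set. Here the
# top 25 is approximated by battle frequency; the real cut would use the
# leaderboard ranking.
top25 = set(battles["model_a"].value_counts().head(25).index)
sample = battles[
    battles["model_a"].isin(top25) & battles["model_b"].isin(top25)
].copy()

# Dissatisfaction rate = share of votes where the user marked both bad.
sample["quarter"] = pd.to_datetime(sample["timestamp"]).dt.to_period("Q")
rate = (
    sample.assign(both_bad=sample["vote"].eq("both_bad"))
    .groupby("quarter")["both_bad"]
    .mean()
)
print(rate)  # e.g. ~0.17 pre-reasoning, ~0.12 after o1, ~0.09 now
```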
So the improvement is definitely there, but it's not zero percent, which I find interesting. I must say, when I first got that result, I thought it was quite high: 9 percent of the time, people get two responses from two good models and don't like either of them. That doesn't tell the same story as all of these crazy lines going up.
Then we can also break this down. What you saw previously was the average across all nearly six million prompts; this is the categorization of those, with just some categories I picked out, and you can see some interesting trends. Math was at something like 25 to 27 percent, and then it got so much better, which is quite a nice result and matches my experience of the models. But then look at creative writing: okay, it did get better, but the improvement wasn't that dramatic, which I think is true as well.
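The per-category view is the same computation with one extra group key; continuing from the previous sketch, and assuming the battles carry a pre-computed category label:

```python
# Same dissatisfaction metric, broken out by prompt category over time.
# Assumes a "category" column on the battles table; labels are hypothetical.
by_category = (
    sample.assign(both_bad=sample["vote"].eq("both_bad"))
    .groupby(["category", "quarter"])["both_bad"]
    .mean()
    .unstack("quarter")  # one row per category, one column per quarter
)
print(by_category.loc[["math", "creative_writing"]])
```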
The category I want to focus on, to really zero in on the most signal, is the expert category. The way it works is that we take those nearly six million prompts, and we have a way to classify which are the most interesting ones: the harder, more real tasks that experts do, across different fields. They are, I would say, the highest-signal prompts we can zero in on. We also narrow it down to the battles between those top 25 models, which gets us to about 40,000 prompts. Then we can look at the expert category and subdivide it even further. Here I've got five subcategories. Quant, for example, is math, physics, things like that, and you can see a really, really high dissatisfaction rate there around late 2024 and early 2025.
But that drops dramatically, and it feels true to me that a lot of the models got so much better at this kind of quantitative stuff. I would also say that when a line goes up, the reason is not that the models got worse, but that people's expectations shift as well. The prompts people used three years ago versus now shift a lot in our data, so this is not a static benchmark; we can really see the battle between expectations and model performance. Also interesting: at the bottom we've got medical, finance, and law. The scale is equal across the five charts, so it's a little harder to see, but those lines are not steep, right? They haven't really improved all that much. I won't go deep into the medical, law, and finance fields, because I don't know enough about them, but it does feel probably true that they haven't really been the focus for the models, so maybe the performance improvement there has not been that high.
So then what I did was take all of these prompts and classify them further into deeper subcategories. I'm going to focus on software now and give you that view of the subcategories, which I think gives us an even more detailed picture.
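Getting the subcategory labels is a classification pass over the prompts; here is a minimal sketch of how one could do that with an LLM, where the taxonomy, the prompt wording, and the classifier model are my illustration, not the production setup:

```python
# Sketch: label an expert software prompt with a subcategory via an LLM.
# The taxonomy and the classifier model are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

SUBCATEGORIES = ["gaming", "security", "agent systems", "gpu compute", "other"]

def classify(prompt: str) -> str:
    label = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed classifier model
        messages=[
            {
                "role": "system",
                "content": "Classify the user's prompt into exactly one of: "
                + ", ".join(SUBCATEGORIES)
                + ". Reply with the label only.",
            },
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content
    return label.strip().lower()
```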
Just to give you a feel for the kind of prompts we're talking about, here's an obviously tiny sample of three. For gaming, someone is asking for a digital game design document. For security, someone has an autonomous system as a hobby and wants to configure something for it that I honestly don't recognize. And for agent systems, which I thought was interesting because the rate there is actually quite good, the person is asking to refine their agent so it can run daily with no supervision. These are the kind of real things that people want to do.
We've got two charts here: on the left is the dissatisfaction rate by subcategory for Q2 2024, and on the right is Q1 2026, the most recent data. You can definitely see improvement: the top line is the overall average rate, and we've gone from 23.5 percent to 13 percent, which is a really nice improvement. But the improvement is not seen everywhere. We can see the same data on a closer timeline as well, which I think is quite interesting.
You probably have better theories than I do on why that is for each of the categories; my theory is that people now ask a lot harder questions. GPU compute, for example, I imagine is up and down because people ask harder things over time. But gaming is an interesting category, because I've tried to use LLMs to build games. I mean, I play games, I don't build them, but whenever you try to build games with LLMs, it just feels like they have no idea how to build actual games. The mechanics are all over the place; the games aren't interesting, they aren't challenging. So I do get the feeling that performance hasn't really improved along some dimensions. I don't think LLMs really get games, even though, sure, two years ago people were probably asking for much simpler games than they are now. And I'm not aware of any really good gaming benchmark that would capture this. So again, if you compare this to the line going up, it doesn't match that story, which I think is quite interesting.
And there are a bunch of other examples you can see in there. So what is really the gap between these crazy charts, which, by the way, I agree are true, and what we see on the right? I think it's the fuzziness we all carry in our hearts and in our experience, the judgment we actually use, which doesn't necessarily match all of these super narrow, very well defined, very well specified tasks. There's much more to what work is, and what white-collar work is, than is really captured by these benchmarks. So I think we should be careful, and maybe put a bit more effort into bringing up the bottom of the distribution, so that it's not just the very frontier that gets better, but the broader distribution as well.
So, I'll close here. One thing to mention: if you like this kind of data, go to our Hugging Face; there's a lot that we publish and share there, and we're going to do more of that, for example some expert prompts and some of the leaderboard material. Join us if you want to build Arena, and if you train models, we also do a lot of private evals. So, thanks so much.
>> [applause]
[music]