Measuring AGI: Interactive Reasoning Benchmarks for ARC-AGI-3 — Greg Kamradt, ARC Prize Foundation

Channel: aiDotEngineer

Published at: 2025-07-16

YouTube video id: 3XmFPwjG8pg

Source: https://www.youtube.com/watch?v=3XmFPwjG8pg

Today we are going to talk about why AI benchmarking is about to get a lot more fun. But before we do, we need to go over some cool demos. I love the Claude Plays Pokemon demo. There's something really special about seeing this little embodied agent make its own decisions and go play Pokemon, a game from our childhood. Then OpenAI got in on the same game, which I thought was awesome. And just the other day, Gemini beat Pokemon. I like seeing the little robots with "AGI" over their heads. That must mean we're already there, right? If we have these agents playing Pokemon and they're already beating it, the game's over, right? We're all done. Well, not quite. With Claude Plays Pokemon, we saw that it would get stuck in the same place for three days, it would need interventions, and it would hallucinate actions. And on top of that, there was a ton of Pokemon training data inside the model itself. So although this is a really cool example of an agent exploring a world, there is still a lot we can improve on.
As Kyle was saying, my name is Greg Kamradt, president of the ARC Prize Foundation. We are a nonprofit with the mission to be a north-star guide towards open AGI. We were founded last year by François Chollet and Mike Knoop, and just last December we were invited by OpenAI to join them on their livestream to co-announce the o3 preview results on ARC-AGI. Now, there are a lot of AI benchmarking companies out there, but we take a very opinionated approach as to how benchmarking should be done. Our opinion is that the best target we should be aiming for is actually humans. The reason we think that is because humans are the one proof point of general intelligence that we know about. If we use humans as the target, it does two things for us, because what we do is come up with problems that are feasible for humans but hard for AI. Number one, it creates a gap, and once you have that gap, you can start to measure: how many problems can we come up with that humans can still do but AI can't? Number two, it guides research. You can quantify that class of problems and then tell researchers, hey, there is something really interesting going on on this side of the problem space that we need to go check out.
All right. So if we're going to measure artificial general intelligence against humans, we need to actually define what general intelligence is. There are two definitions I love to quote. The first one is from John McCarthy, who says that AI is the science and engineering of making machines do tasks, and this is the important part, that they have never seen beforehand and have not prepared for beforehand. This is very important, because if you've seen a class of problem beforehand, if it's already in your training data, then you're simply repeating memorization. You're not actually learning anything new on the fly. The second person I like to quote on this is François himself, and he put it very eloquently in just three words: he calls intelligence "skill-acquisition efficiency." This is really beautiful, because skill acquisition asks, can you learn new things? And not only that, how efficiently can you learn those new things? Spoiler: humans are extremely efficient at learning new things. François proposed this definition in his 2019 paper, "On the Measure of Intelligence." But he went further than that. He didn't just define it, he actually proposed a benchmark to see whether a human or an AI can learn something new and then go repeat what it learned. And this is where the ARC-AGI version 1 benchmark came out.
Over here on the left-hand side is the learn-the-skill portion, what we call the training portion. We show you a transformation from an input grid to an output grid, and the goal for the human or the AI is to look at it and ask, hm, what's going on here? Then on the right, we ask you to demonstrate that skill. So it's a little mini skill you learn on the left, and we ask you to demonstrate it on the right. If you can successfully do it, and this is what it looks like, it's just a grid editor, then yes, you've learned what the transformation is and you've applied it, and you're showing a nonzero level of generalization as you go through this.
Our most recent benchmark, ARC-AGI-2, has over 1,000 tasks in it. The important part is that each one of these tasks is novel and unique. What I mean by that is the skill required for one task will never be the skill we ask you to apply on another task. This is very important because we're not testing whether you can just repeat a skill you've already learned; we want to test all the little mini skills you can pick up over time and see whether you can actually demonstrate them. And if we're going to back up the claim that humans can actually do this, we need first-party data. So our group, as a nonprofit, went down to San Diego and tested over 400 people. We rented a bunch of computers and did this in person to preserve data privacy, and we made sure that every single task included in ARC-AGI was solvable by people. This isn't just an aim; we're actually doing the work to verify it. But if we think about it, there's quite a bit of human-like intelligence that's out of scope for what we call a single-turn benchmark. With ARC-AGI, all the information you need is presented right at test time. You don't need to do any exploring, and it's all a single turn. So if we're going to measure human-like intelligence, I would argue it needs to be interactive by design.
What you need is the ability to test an agent, whether biological or artificial, on exploring an open world, understanding what goals it needs to pursue, and ultimately looking at the rewards and going from there. This is very much in line with what Rich Sutton recently published in his paper "Welcome to the Era of Experience." He argues that if we want agents that are readily adaptable to the human world, they need to engage with the open world, collect observational data, and use that data to build a world model, make their own rules, and really understand what is going on; otherwise you just hit the human-data ceiling going forward. If we're going to build this, we need a new type of benchmark that gets out of the single-turn realm, and this is where interactive reasoning benchmarks come in. An interactive reasoning benchmark is a benchmark where you have a controlled environment, you have defined rules, and you may have sparse rewards, where an agent needs to navigate to understand what is going on, explore, and complete the objective.
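As a rough sketch of what that setup looks like in code, the interface below (Python, with every name hypothetical rather than any actual ARC-AGI-3 API) exposes only observations, a fixed action space, and an occasional sparse reward; the agent has to work out the rules and the goal by exploring.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepResult:
    observation: List[List[int]]  # e.g. a small grid of color indices
    reward: float                 # sparse: almost always 0.0
    done: bool                    # True once the hidden objective is met

class InteractiveBenchmarkEnv:
    """Hypothetical interface: controlled environment, defined rules, sparse reward.
    The agent is told nothing about the goal; it only sees observations."""

    ACTIONS = ["up", "down", "left", "right", "interact"]

    def reset(self) -> List[List[int]]:
        """Start a fresh episode and return the initial observation."""
        raise NotImplementedError

    def step(self, action: str) -> StepResult:
        """Apply one turn-based action and report what the agent can now see."""
        raise NotImplementedError

def run_episode(env: InteractiveBenchmarkEnv, agent, max_actions: int = 10_000) -> int:
    """Return how many actions the agent needed, the raw efficiency signal."""
    obs = env.reset()
    for t in range(1, max_actions + 1):
        action = agent.act(obs)      # the agent must explore to learn the rules
        result = env.step(action)
        obs = result.observation
        if result.done:
            return t
    return max_actions
```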
Now, there's an open question: if our aim is interactive reasoning benchmarks, what is the medium in which we actually execute them? It turns out that games are quite suitable for interactive reasoning benchmarks. The reason is that games sit at a unique intersection of complex rules and defined scope, and they give you large flexibility in creating environments that you can then put different artificial or biological systems into. Now, I know what you may be asking: wait, Greg, didn't we already do games? Didn't we do this ten years ago? We already went through the Atari phase. Well, yes, we did, but there were a huge number of issues with what went on during that era, even just starting with all the dense rewards that come with the Atari games. There was a ton of irregular reporting, so everybody would report their own performance on different scales, and it was tough to compare models. There was no hidden test set. And one of my biggest gripes with the Atari phase is that all the developers already knew what the Atari games were, so they were able to inject their own developer intelligence into the models themselves. Then all of a sudden the intelligence behind the performance is being borrowed from the developer; it's not actually coming from the model itself. So if we were able to create a benchmark that overcame these shortcomings, we'd be able to make a capabilities assertion about the model that beat it that we've never been able to make before.
To put it another way that's a bit more visual: we know that AI can beat one game. This is proven; AI can beat chess, AI can beat Go, and we've seen it many, many times. We also know that AI can beat 50 known games with unlimited compute and unlimited training data; we've seen this happen with Agent57 and MuZero. But the assertion we want to make is: what if AI beat 100 games that the system has never seen beforehand and that the developer has never seen beforehand either? If we were able to successfully put AI to this test, then we could make a capabilities assertion about that AI that we don't currently have in the market right now. And I'm excited to say that's exactly what ARC is going to build. This is going to be our version 3 benchmark. Today is a sneak preview of what that's going to look like. It's going to be our first interactive reasoning benchmark from ARC, and I want to jump into three reasons why it's unique.
First, much like our current benchmark, we're going to have a public training set and a public evaluation set. The reason this is important: the public training set, call it on the order of about 40 different novel games, is where the developer and the AI can get familiar with the interface and understand what's going on. But all performance reporting will happen on the private evaluation set. This is very important, because on the private evaluation set there is no internet access allowed, so no data about it gets out. The scores that come out of the private evaluation set will have been achieved by an AI that has never seen these games before, and neither has the developer, so we can authoritatively say that the AI has generalized to these open domains. The second important point about ARC-AGI-3 is that it's going to force understanding through exploration.
One of my other gripes with current game benchmarks is that you give a lot of instruction to the AI itself: hey, you're in a racing game, or hey, you're in an FPS, go control the mouse and do all these things. We're going to drop AIs and humans into this world, and they won't know what's going on until they start exploring. Even as I look at this screenshot, this is actually one of our first games, we call it Locksmith; we give all of our games a cool little name like that. As I look at it, I don't know what's going on, right? But I start to explore and I start to understand: oh, there are certain things I need to pick up, there may be walls, there may be goals and objectives. I'm not sure what those goals and objectives are when I first start, but that's the point. So not only are we going to ask humans to explore and make up their own rules about how the game works, we're going to require the same thing of AI as well. And that's something we're not currently seeing from the reasoning models we have today.
Now, the third key point is that we're only going to require core knowledge priors. This is something we carry over from ARC-AGI-1 and 2 as well. What this means is you'll notice that in the ARC tasks there's no language, there's no text involved, there are no symbols, and we're not asking you any trivia. Some other benchmarks take the opposite route: they try to make the hardest problems possible and hire the best people in the world to write them; I call those PhD++ problems. That's great, but AI is already superhuman there; it's way smarter than me in a lot of different domains. We take the alternative approach, which is to look more at the floor, at the reliability side: take anything outside of core knowledge and strip it away. The four core knowledge priors are things that humans are either born with or hardwired to acquire almost immediately after birth. The first is basic math, meaning counting up to around 10. The second is basic geometry, understanding different shapes and topology. The third is agentness, a basic theory of mind: understanding that there are other agents out there in the world that are interacting. And the fourth is objectness. As we create our benchmark, these are the four principles we go after when we try to test the abstraction and reasoning piece.
Now, I was reading the recent Dwarkesh essay, and he put it really well in one of his paragraphs. He was talking about one of the reasons why humans are great, and he says it's their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task. We don't yet have this type of environment that can test this from a benchmark perspective for AI, and this is exactly what ARC-AGI-3 is going to build.
Before we wrap up, I want to talk about how we're going to evaluate AI, because you might ask: okay, cool, they go play the game, but what does it mean? How do you know if it's doing well or not? We're going to bring it back to François's definition, skill-acquisition efficiency, and we're going to use humans, which again are our only proof point of general intelligence, as the baseline. We're going to test hundreds of humans on these exact ARC tasks and measure how long it takes them and how many actions it takes them to complete each game. That gives us a human baseline, and we can then measure AI in the same exact way. So, can the AI explore the environment, build intuition about it, create its own goals, and complete the objectives faster than humans? If it cannot, I would go as far as to assert that we do not yet have AGI. And as long as we can come up with problems that humans can still do but machines cannot, I would again assert that we do not have AGI. So we're going to be looking at skill-acquisition efficiency as our main output metric.
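One way to picture that metric, as a hedged sketch rather than the actual ARC-AGI-3 scoring formula, is to compare the number of actions an AI needed to finish a game against the median number of actions humans needed on the same game:

```python
from statistics import median

def efficiency_vs_humans(ai_actions: int, human_action_counts: list[int]) -> float:
    """Illustrative skill-acquisition-efficiency score, not the official metric.
    1.0 means the AI matched the median human; above 1.0 means it needed fewer
    actions than the median human; values near 0 mean it needed far more."""
    human_baseline = median(human_action_counts)
    return human_baseline / ai_actions

# Example: three humans took 120, 150, and 200 actions; the AI took 600.
print(efficiency_vs_humans(600, [120, 150, 200]))  # 0.25 -> well below human efficiency
```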
Today, here at World's Fair, we're giving a sneak peek at what this looks like. Next month in San Francisco, we're going to give a sandbox preview: we're going to release five games. We know better than to wait until the end; we want to make contact with reality. We'll put out those five games, and we're actually going to host a mini agent competition too, because we want to see the best possible agent people can build, and we'll put up a little prize money. Then we're looking forward to launching about 120 games; that's the goal by Q1 of 2026. Now, that may not sound like many games, and you might think it's not many data points, but the richness of each one of these games goes really, really deep. There are multiple levels, and each game goes deep in its own right. It's quite the operational challenge to make all of these, and that's a whole other side of the benchmarking process, which I'm happy to talk about later.
If this mission resonates with you: again, ARC Prize is a nonprofit, and one of the best ways to get involved is by making a direct tax-deductible donation. If anybody in the room knows philanthropic donors, whether LPs or individuals, I'd absolutely love to talk to them. We're also looking for adversarial testers; we want to pressure-test ARC-AGI-3 as best we can. So if anybody is interested in participating in the agent competition, whether online or offline, let me know, I'm happy to chat. And then, kind of a cool story: we originally started with Unity to try to make these games, and we quickly found out that Unity was way overkill for what we needed when you're just doing 2D, 64x64 games. So we're actually building a very lightweight Python engine ourselves. If there are any game developers out there, anybody who wants to get involved with this and knows Python well, we're looking for game developers and game designers as well.
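To give a sense of why a full engine like Unity is overkill at this scale, here is a toy sketch of what a turn-based 64x64 grid game can look like in plain Python; it is only an illustration of the general idea, not the actual ARC engine or its API.

```python
import numpy as np

class GridGame:
    """Toy sketch of a turn-based 64x64 grid game (not the real ARC engine).
    The whole state is one small integer grid; each action advances one turn."""

    SIZE = 64

    def __init__(self):
        self.grid = np.zeros((self.SIZE, self.SIZE), dtype=np.uint8)
        self.player = (0, 0)
        self.goal = (self.SIZE - 1, self.SIZE - 1)
        self.grid[self.goal] = 2    # goal marker
        self.grid[self.player] = 1  # player marker

    def step(self, action: str) -> bool:
        """Move the player one cell; return True once the goal cell is reached."""
        dr, dc = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        r, c = self.player
        nr = max(0, min(self.SIZE - 1, r + dr))
        nc = max(0, min(self.SIZE - 1, c + dc))
        self.grid[r, c] = 0
        self.player = (nr, nc)
        self.grid[nr, nc] = 1
        return self.player == self.goal
```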
That is all we have today. Thank you
very much.
Kyle, do we have... Yeah, I think we have time for a couple of questions if anyone wants to come up. There are microphones, one, two, three of them. Maybe a couple of questions, and I'm going to kick that off. Sure, yeah. Question for you: it's famously very hard to make estimates about timelines, but if you had to guess, how long do you think this new version of the benchmark will take before it gets saturated? Well, the way I think about it is: I'm counting in years, not decades. We'll put it that way. Okay, interesting. All right, we'll take one question at each mic. Looks like it's well distributed, so starting over here.
Sure. Hi. You mentioned efficiency as part of the requirements, so I'm wondering, for the benchmarks, whether you're considering things like wattage or time or other criteria like that. Yeah, I love that question, and I would have put it in if I had more time, but I'm very opinionated about efficiency for measuring AI systems. If I could have two denominators for intelligence on the output, number one would be energy, because we know how much energy the human brain takes, and that's our proof point of general intelligence; you can measure how many calories the human brain uses. So I'd love to use energy. The number two denominator would be the amount of training data you need. Neither of those is very accessible for closed models today, so we use proxies, and the proxy is cost. But with interactive evals like this, you get another proxy, which is action count: how many actions does it take you to actually do it? We're not going to have a wall clock within these games; they're turn-based, so we won't have wall-clock time. Awesome. All right, question two and then we'll do three. Please keep them both very short.
Yeah, very quick question: could you define more what you mean by objectness? Yes, that one's actually quite simple. It's just understanding, when you look out into the world, that there is a mass of things that may act together. The crude version is: you have one pixel, but it's surrounded by a whole bunch of other pixels, and they all move together, so you understand all those pixels as one body acting together rather than as individuals. Evolutionarily, it's recognizing that's a tree over there, and all of this is part of the same tree, that kind of thing.
Final question, and I'll keep this one super short: in the games you're developing, how do you distinguish between tasks that humans cannot do and that an AGI also cannot do? What is the north star there? It's a good question. Tasks that humans cannot do are a bit out of scope for our thesis on how we want to drive towards AGI, so I would say that's not really the aim we're looking for. That's a whole different conversation around superintelligence, and maybe that's for another time. Thank you.