Are We at the End of AI Progress? — With Gary Marcus

Channel: Alex Kantrowitz

Published at: 2025-05-07

YouTube video id: 3MygnjdqNWc

Source: https://www.youtube.com/watch?v=3MygnjdqNWc

What's going to happen now that AI
research houses are coming up against a
scaling wall? We'll find out with a
leading AI skeptic, author, and founder,
Gary Marcus. And I am thrilled to be
joined by Gary here in studio today.
Gary, great to see you. Great to see
you. Welcome to the show. Thanks for
having me. So, the genesis of this
episode is that I did an episode with
Mark Chen from OpenAI about GPT-4.5. And you came into my DMs and you said, "Listen, I want to give a rebuttal. Scaling is basically over and it's not exactly what OpenAI has said." Now, for
those who don't know about the scaling
laws, uh basically the idea is that the
more compute and data you put into these
large language models, the better
they're going to get basically
predictably, linearly. Um well,
exponentially was the idea, right? And
so the context here is now we've seen
almost every research house all but
admit that that has hit the point of
diminishing returns. I think Mustafa Suleyman was here. He pretty much admitted it. Uh, Thomas Kurian, CEO of Google Cloud, said that diminishing returns are happening. Uh, Yann LeCun has also talked about the fact
that you're just not going to see as
many returns from AI scaling as you
would beforehand. So just describe the
context of what we're seeing right now.
How big of a deal is it? And then what
are the implications for the AI
industry? Because this is the big
question. I mean, how much better can
these things get? Right? That is the big
question with AI today. Well, I mean, I
have to laugh because I wrote a paper in
2022 called "Deep Learning Is Hitting a Wall." And the whole point of that paper
is that scaling was going to run out,
that we were going to hit diminishing
returns. And everybody in the field went
after me. A lot of the people you
mentioned, I mean, LeCun did, Elon Musk went after me by name, Altman did, and, like, Altman said, "Give me the strength of a mediocre deep learning skeptic." So, people were
really pissed when I said that deep
learning was going to run out. So it's
amazing to me that a bunch of people
have conceded that these scaling laws are not working the way they used to, and they're also doing a bit of backpedaling. I think that Mark Chen interview, I can't quite remember the details, but I think it was a version of backpedaling and redefining things.
So if you go back to 2020, there were these papers by Jared Kaplan and others at OpenAI and they
said look we can just mathematically
predict how good a model is going to be
from how much data there is. And then
there were the so-called Chinchilla
scaling laws. And everybody was super
excited. And basically people invested
half a trillion dollars assuming that
these things were true. You know, they
they made arguments to their investors
or whatever. They said if we put in this
much data, we're going to get here. And
they all thought that here in particular
was going to mean AGI eventually. And
what happened last year is everybody was
disappointed by the results. So we got one more iteration of scaling after 2022 that worked really well, and we call that GPT-4 and all of these models that are sort of like that.
So I wrote that paper around GPT-3. We got another iteration of scaling. So GPT-3 was scaled up compared to GPT-2 and it was much better. GPT-2 was scaled up compared to GPT-1 and it was much better. So, sorry, much more data meant much better. But what is much better? Well, I mean, one way to think about it is you didn't need a magnifying glass to see the difference between GPT-2 and, we didn't call it GPT-1, but the original GPT. And you didn't need a magnifying glass for GPT-4 as opposed to GPT-3. It
was just obviously better. A lot of
people thought that we would pretty quickly see GPT-5, and a lot of people raced to build it. So, OpenAI tried to build GPT-5 and they had a thing called Project Orion, and it actually failed and eventually got released as GPT-4.5. So, what they thought was going to be GPT-5 just didn't meet expectations.
Now, they could slap any name on any
model they want. And in fact, lately,
nobody understands how they're naming
their models. But they haven't felt like
any of the models that they've worked on
since GPT-4 actually deserve the name GPT-5, and it didn't meet the performance
that these so-called mathematical laws
required. What I said in that paper is
they're not really mathematical laws.
They're not physical laws of the
universe like gravity. They're just
generalizations that held for a little
while. Like a baby may double in weight
every couple of months early in its
life. That doesn't mean that by the time
you're 18 years old that you're going to
be 30,000 lbs. And so we had this
doubling for a while and then it
stopped. And we can talk about why, but
the reality is that's not really
operative anymore. So there's been
efforts to kind of misdirect and shift
direction. So I think everybody in the
industry quietly or otherwise
acknowledged that, hey, we're not
getting the returns that we thought
anymore. And nobody's been able to build
a so-called GPT5 level model. That's a
big deal, right? I'm a scientist, or was originally a scientist. As a scientist, we have to pay attention to negative results as well as positive results. So when 30 people try the same experiment and it doesn't work,
nature is telling you something. And
everybody tried the experiment of
building models that would be 10x the size of GPT-4, hoping to get to something they could call GPT-5 or that was like a quantum leap better than GPT-4. They
didn't get there. So now they're talking
about scaling inference time compute.
That's a different thing. Wait, but before we get there, I want to test your theory here. So, it's not that scaling is over, right? I don't think anyone we're talking about says scaling is over. Basically, what they're saying is if you want to make the model better, and I think that means more intelligent, more conversational, even more personable, you can still do it by scaling. The thing that they admit, though, is that it takes much, much more compute and much more data to get the same results that you would have gotten previously. So let's
clarify two things. One is that what
people talked about with scaling originally was a mathematically predictable relationship between performance and amount of data. You can go back and look at the Chinchilla paper, the Jared Kaplan paper, and lots of things that were posted on the internet. There were papers, or t-shirts, saying "scale is all you need." You looked at that t-shirt and it had equations from the Jared Kaplan paper and it said, you know, here's the exponent. You can fit the equation. If you have this much data, this is the performance you're going to get. And there were a bunch of papers, a bunch of models, that actually seemed to fit that curve, but it was an exponential curve.
And what's happening now is, yeah, you
add more data, you get a little bit
better, but you're not fitting that
curve anymore. We've fallen off the
curve. That's what it really means to say that scaling isn't working anymore: if I drew the curve for you, it was going up and up really fast as a function of how much data or how much compute you had, and now it's not going up like that. You added a bunch of compute and you got this much better performance. And this is how people justified running these experiments that cost a billion dollars: they're like, I know what I'm going to get for the billion dollars. And then they ran the billion-dollar experiments and they didn't get what they thought they would. Yeah, you get a little bit
better, but that's what diminishing
returns means. Diminishing returns means
you're not getting the same bang for
your buck as you used to. That's where
we are now. So, anytime you add a little
piece of data, the model is going to do
better, excuse me, on that piece of
data. But the question is, does it
generalize and give you significant
gains across the board? And we were
seeing that and we just aren't anymore.
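For reference, the Kaplan and Chinchilla "scaling laws" being discussed here are empirical power-law fits of loss against model size and training data. A schematic version of the Chinchilla-style fit looks roughly like the following; the constants and exponents are placeholders for the published fitted values, not the real numbers:

```latex
% Schematic Chinchilla-style fit: predicted loss L as a function of
% parameter count N and training tokens D. E, A, B, alpha, beta are
% empirically fitted constants (placeholders here, not the published values).
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The "falling off the curve" point Gary makes is that recent frontier-scale training runs have stopped landing on the extrapolation of fits like this, which is what "diminishing returns" means in this conversation.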
So, is there still a path for these
models to become much more performant? I
mean, let's say you do supersize these clusters to the point that they are insanely bigger than they were previously. Let's talk about, like, Elon Musk's one million GPU cluster. Well, let's look at what Elon
got for his money, right? So, he built
Grok 3, and by his own testimony, it was 10 times the size of Grok 2. It's a little better, but it's not night and day, right? Grok 2 was night and day better than the original Grok. GPT-4 was night and day better than GPT-3. GPT-3 was night and day better than GPT-2. Grok 3 is like, yeah, you can measure it. You can see that there's some performance gain, but for 10x the investment of data and compute, not to mention the cost of energy to the environment, it's not 10 times smarter by any reasonable measure. It
just isn't. Okay? And so this would be
the point where I say, "Well, then this
entire AI moment is done." However... Well, it's this moment. There will be other AI moments, but this one... I know, I'm setting it up to say that it's not, because, like you mentioned, you're talking about test time compute. That's another way to say reasoning, I think, which is these models... Well, I'm going to give you a hard time about that, but people do do that. But with reasoning or test time compute, you'll help me figure out the finer details. What these models are
doing is they're coming to try to find
an answer and they're checking their
progress and deciding whether it's a
good step or not and then taking another
step and another step. Yeah. And we've seen that they have been able to perform much better when you put those reasoning capabilities on top of these large models, which has enabled these research houses to continue the progress and give you... Well, it's not really "you," it's these companies. Some pushback on that.
So it is true that you can build a model
that will do better if you put more
compute on it, but it's only true to
some degree. So um then I'll get to
whether it's actually reasoning or not,
but it turns out that on some problems
you can generate a lot of data in
advance and for those problems adding
more test time compute seems helpful.
There was a paper this weekend that's calling some of this into question. By the way, just to explain to folks, test time is when the model is giving an answer. That's what test time is. That's right. So, you have these models now, like o3 and o4, that will
sometimes take like 30 seconds or 5
minutes or whatever to answer a question. And sometimes it's absurd because you ask it, like, what's 37 × 11,
and it takes, you know, 30 seconds.
You're like, my calculator could have
done it faster. But we'll put aside that
absurdity. Um, in some cases it seems
like time well spent, sometimes not.
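To make "test time compute" a bit more concrete: the rough idea is to spend extra compute while answering, by sampling candidate answers or reasoning steps and checking them before committing. Below is a minimal propose-and-check sketch; the `propose` and `verify` functions are toy stand-ins for a model's sampler and a checker, not any lab's actual implementation:

```python
import random

def solve_with_test_time_compute(problem, propose, verify, budget=16, seed=0):
    """Spend extra inference-time compute: sample candidate answers and return
    the first one a verifier accepts (a simple best-of-N style search).

    `propose` and `verify` are toy stand-ins for a model's sampler and a
    checker; real reasoning models interleave step generation with
    self-checking, but the compute-versus-quality trade-off is the same idea.
    """
    rng = random.Random(seed)
    last = None
    for _ in range(budget):
        candidate = propose(problem, rng)
        if verify(problem, candidate):
            return candidate          # verified answer found within budget
        last = candidate              # otherwise fall back to the last attempt
    return last

if __name__ == "__main__":
    # Toy usage: "solve" 37 x 11 with a noisy proposer and an exact checker.
    problem = (37, 11)
    propose = lambda p, rng: p[0] * p[1] + rng.choice([-1, 0, 1])
    verify = lambda p, answer: answer == p[0] * p[1]
    print(solve_with_test_time_compute(problem, propose, verify))
```

The `budget` parameter is the knob being "scaled" in inference-time scaling: more attempts and more checking per answer, at the cost of more compute and latency.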
But if you look carefully, the best
results for these models are almost always on the same things, which are math and programming.
And so when you look at math and
programming, you're looking at domains
where it's possible to generate what we
call synthetic data. And to generate
synthetic data that you know are
correct. So for example on
multiplication, you can train the model
on a bunch of multiplication problems
and you can figure out the answer in
advance. You can train the model what it
is that it should predict. And so on
these problems in what I would call
closed domains where we can do
verification as we create the synthetic
data, we verify that the answer we're
teaching the model is correct. The
models do better. But if you go back and
you look at the o3, uh, sorry, the o1 paper, even then you could already see that the gains were there but not across the board. They reported that on some problems o1 was not better than GPT-4. It's only on other problems, these cut-and-dried problems with the synthetic data, that you actually got better performance. And I've now seen like 10 models, and it always seems to be that way.
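A minimal sketch of what the "verified synthetic data in a closed domain" idea above means in practice: because the generator computes the target itself, every training label is correct by construction. The function name and record format below are purely illustrative, not any lab's actual pipeline:

```python
import random

def make_verified_multiplication_examples(n=1000, max_val=999, seed=0):
    """Generate synthetic training pairs for a closed domain (multiplication).

    The answer is computed while the example is created, so every label is
    verified correct before the model ever sees it -- the property Gary
    describes for math and programming data.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a, b = rng.randint(2, max_val), rng.randint(2, max_val)
        examples.append({
            "prompt": f"What is {a} * {b}?",
            "answer": str(a * b),     # ground truth known by construction
        })
    return examples

if __name__ == "__main__":
    for example in make_verified_multiplication_examples(n=3):
        print(example)
```

Open-ended domains, essays, business advice, biology claims, have no such cheap verifier, which is the asymmetry Gary is pointing at.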
Um, we're still waiting for all the
empirical data to come in, but it looks
to me like it's a narrow trick that
works in some cases. The amazing thing
about GPT-4 is that it was just better than GPT-3 on almost anything you could imagine. And GPT-3, the amazing thing is it was better than GPT-2 on almost anything you can imagine. Models like o1 are not systematically better than GPT-4. They're better in certain use
cases, especially ones where you can
create data in advance. Now, the reason
I wouldn't call them reasoning models,
though you're right that many people do,
is what I think they're doing is
basically copying patterns of human
reasoning. They're getting data about
how humans reason certain things, but
the depth of reasoning there is not that
great. They still make lots of stupid
mistakes all the time. I don't think
they have the abstractions that we
think, for example, a logician has when they're reasoning. So, it has the appearance of reasoning, but it's really just mimicry, and there's limits to how far that mimicry goes. I'll give you just one more example: o3 apparently
hallucinates more than the models that
came before it, which is stunning. Like,
how does that happen?
I mean that's a a good broader question
which is our understanding of these
models is still remarkably limited. So
the technical term, or one technical term, is interpretability. Well, I was going to give you a different one, which is black box. Okay. Um, but they're closely related, those two terms. You need interpretability to figure out what's going on in the black box, if you can at all. I mean, I'd almost put it another way, which is that... But isn't the black
box the thing in the plane that tells
you what actually happened? Well that's
a different thing right. So a black box
in a plane is actually a flight recorder
that records a lot of data. But what we
mean in machine learning by blackbox is
you have a model where you have the
inputs and you have the outputs. You
know how you calculate them, but you
don't really understand how the system
gets there. So in this case, you're
doing all this matrix multiplication.
Nobody really understands it. And so
nobody can actually give you a
straightforward answer for why o3 hallucinates more than GPT-4. We can just
observe it. That's what happens with
black boxes is you you empirically
observe things and you say, "Well, it
does that, but you don't really know why
and you don't really know how to fix it
either." Another example just in the
last couple days is apparently Sam Altman reported, I forget, the new model is stubborn or what was it? I forget. No, it's not stubborn. It's a bro. It's a bro, but that's GPT-4o. It's just like it became very fratty. Became very fratty. And like, right, you would be like, "What's going on? Like, help me with this." And it's like, "Yo, that's a hella good question, bro." And they're like, "We don't know why this happened." And they rolled it back completely. Yeah. Exactly. Or I thought they only partly rolled it back. Whatever. No, no. Sam said the latest iteration, it's been completely rolled back. Completely rolled back. So, right, that
was what I would call again empirical,
like they tried it out and it didn't
work or it worked in the way that it
irritated people. Right. And so, we
don't know in advance. like there's a
lot of just like try it because that's
how black boxes work and we have some
things but those things are not very
strong. So the scaling quote laws were
empirical guesses about how these models
worked and they were true for a little
while which was amazing and they're not
true anymore which is also amazing in a
way. Um, so we don't know what's going
to happen from the black boxes, right?
Okay. So, but let me now sort of and
sorry, let me come back to one other
thing quick, which is interpretability.
So, that's a very closely related
notion. So, um, let's say you look at a
GPS navigation system. That's a piece of
AI that's very interpretable. So, you
can say it is plotting this route. It
says, you know, you can go this way, you
can go that way. This is the function
that it's maximizing. This is the
database it's using. This is how it
looks up the data. We don't have any of
that in these so-called blackbox models.
We don't really know what the database
is that it's consulting. It isn't
exactly consulting a database at all and
we don't know how to fix it. And so, you
know, um, Dario Amodei, who's the CEO of Anthropic, we just talked about this on the show. You actually praised his post on interpretability. I'll be honest, I haven't read the paper yet. I just read the title. So, bad on me. But the title of his paper was something like the desperate need for interpretability, or the urgency of it, and I
think he's right. I've said this too
myself like in my last book I I talked
about interpretability being really
important. The only difference between
Dario and me on this point is we both
think that we're screwed as a society if
we stick with uninterpretable models. He
just thinks that LLMs will eventually be
interpretable. And his company, to be fair, has done the best work on interpretability of LLMs that I'm aware of. Um, Chris Olah, I think, is
brilliant, but they haven't got that
far. They've gotten further than anybody
else, but I don't think we're ever going
to get very far into the black box. And
so, I think we need to start over and
find different approaches to AI
altogether. Right. So, Gary, if I'm
listening to what you're saying on this
show so far, it is basically after GPT4,
we haven't made a lot of progress.
However... A little bit. But let me just do the pushback here, which is, I mean, if you think about what it's like using these models after GPT-4, they are significantly better. I'll give you one example. I was using o3, this new reasoning model or test time model, whatever you want to call it, and I'm in it and I'm doing crazy things and it's exceptionally helpful. So, I put a photo of myself on a rock climbing wall and said, "What's going on?" And it was able to look at the form, where my body was, what my posture was, and, like, analyze all these things and give actually helpful coaching tips, which you never would have had with GPT-4. Then you think about what Claude is doing, the Anthropic bot. I
was with some friends last night, and
this is what we do for fun. I vibe coded
a retirement calculator uh directly in
Claude. It took like 10 minutes. We took a bank statement, we got a line graph of the person's balances, a bar graph of their expenses, a financial plan, and then we coded a retirement calculator based off of the data that we had there. And then you also have PhDs that are now adding their unique insights into these models for training. They
just basically are sitting and writing
down what they know and the model is
absorbing it. So we are seeing I would
call it vast improvement over the GPT4
models. So I mean there's a couple
different ways to think about that. So
one is on a lot of benchmarks there's
improvements but there's also issues of
data contamination. Alex Reisner wrote an excellent piece in The Atlantic about
the issues of data contamination and
we've seen a lot of studies where people
are like well we tried it in my company
it's not really that much better. So
they're better on the benchmarks. Are
they better in general? Not so clear.
There was a new
benchmark released by a company called Vals AI, or something like that, that the Washington Post talked about yesterday, where they looked at things like: can you pull out a chart based on a series of financial statements, SEC statements, from a bunch of companies? And these systems all claimed to do it, but accuracy was under 10%. And overall on this new benchmark accuracy was at 50%. Would these new models be better than GPT-4? Maybe. They weren't that good. So
I think people tend to notice when they
do well, they don't notice as much when
they do poorly. And although I think
there's been some improvement, there has
not been the quantum leap that people
are expecting. We have not moved past
hallucinations. We have not moved past
stupid reasoning errors. If you go back
to my 2022 paper, "Deep Learning Is Hitting a Wall," I didn't say there'd be
no progress at all. What I said is we're
going to have problems with
hallucinations. We're gonna have
problems with reasoning, with planning
until we have a different architecture
in some sense. And I think that that's
still true. We're still stuck on the
same kinds of things. So if you have,
you know, deep research write you a
paper, it's going to make up references. Okay? It's probably going to make up numbers. Like, you know, did you actually go back and check? So for example, with, I think it's called, they all have similar names now, whatever Grok's version is, deep search or deep research. Yes. I don't know, deep research. I won't be convinced that we have AGI until these companies learn how to call deep research something other than deep research. They all use the same exact name. It's really bizarre. So whichever version Grok has, I asked it,
for example, to list all of the major
cities that were west of Denver. And to
somebody who wasn't paying attention,
it'd be super impressive. But because I
really wanted to know how well it was
working, I checked and it left out Billings, Montana. All right. So, you got a list that looks really good and then there are errors. This often happens. And then I had a crazy conversation with it after that. I said, "What happened to Billings?" And it said, "Well, there was an earthquake there on February 10th," or whatever. And I looked it up in, you know, the seismological data. I used Google because I want to have a real source, or DuckDuckGo, and there was no earthquake then. And I pushed it on it, and it said, "Well, I'm sorry for the error," and whatever. So,
we're still seeing those kinds of
things. We may see them less, but they
are still there. That we still have
those kinds of problems. So, I don't
doubt that there's been some
improvement, but the quantum across the
board that people were hoping for is not
there. The reliability is still not
there. There's still lots of subtle
errors that people don't notice. And
then, you know, if you want to talk to
me about retirement calculators, there
are a lot of those on the web. So, the
easy cases for these
systems are the ones where the source
code is actually already there on the
web. Like Kevin Roose talked about this
example of having, quote, vibe coded a system to look in a refrigerator and tell them what recipe to make.
But it turns out that app is already
there on the web. And there are demos of
that with source code. And so like if
you ask a system to do something that's
already been done, that's always been
true with all of these systems. That's
their sweet spot is regurgitation. And
so, yeah, they can build the stuff
that's out there, but if you want to
code things in the real world, you
usually want to code something that's
new. And these systems have a lot of
problems with that. And another recent
study, excuse me, showed that they're
good at coding, but they're not good at
debugging. And like coding is just the
tiniest part of the battle, right? The
the real battle is debugging things and
maintaining the code over time. And
these systems don't really do that yet.
But you know, search has made them more
reliable when these bots are able to
search the web and they are now starting
to give you lots of uh links in the
actual answers. I still like get daily
people sending me examples of, you know,
it hallucinated these references. I'm
not saying hallucinations have been
solved, but for me like I will use it.
It's an incredible research assistant.
And then when it links out to things and
I'm not sure of those uh figures, I'll
then go to the primary sources and start
reading. I mean, good on you that you go
to the primary sources. I worry the most
about people who don't. And we've seen
countless lawyers, for example, get in
trouble using these systems. Has it been countless? I just heard of one. Oh, no, no, no. There's many more than that. There's some in the US. There's some in Canada. I think there was just one in Europe. Um, I mean, it's not really countless, one could sit there and count them, but it's got to be at
least a dozen by now. And whether this
is going to be all right, I think we can
both agree on this that whether this is
the end of progress or towards the end
of progress or whether there's a lot
more progress, there's a real problem of
people outsourcing their thinking to
these bots. Well, Microsoft did a study
in fact suggesting that critical
thinking was getting worse as a
function of them. And that wouldn't be
too surprising. We have a whole
generation of kids who basically rely on
these bots and who don't really know how
to look at them critically. You know, in
previous years, we were starting to get
too many kids relying on whatever
garbage they found on the web basically.
And I mean, chat bots are basically
synthesizing um the the garbage that
they find on the web. And so, we're not
really teaching kids critical thinking
skills. And nowadays, like the idea for
many kids of writing a term paper is I
typed a prompt into ChatGPT and then maybe I made a couple edits and I turned it in. You're obviously not
learning how to actually think or write
um in that fashion. A lot of these tools
I think are best used in the hands of
sophisticated people who understand
their limits. So you know coding has
actually been I think one of the biggest
applications and that's because coders
understand how to debug code and so they
can take the system basically it's just
typing for them and looking stuff up and
if it doesn't work then they can fix it
right the really dangerous applications
are like when somebody asks for medical
advice and they can't debug it
themselves and you know something goes
wrong. Okay. So, I'm going to take into
consideration all the things that you've
said so far and see if I can get a sense
as to where you think we're heading. It
seems like there was a push to just make
these models better based off of scale.
That could be things like the 300,000 GPU cluster I think Meta used for Llama 4, or it could be the million GPU cluster that Elon's built for Grok. And what you're saying is that's been maxed out pretty much, like no one's... Hold on. It's not maxed out, but there's diminishing returns. There's diminishing returns, exactly. So, the point that I'm trying to make here is
you don't believe that there's going to
be anyone that's going to build a bigger
GPU data center than that because if
you're seeing diminishing returns from
something that costs billions of
dollars, doesn't make sense to invest.
Well, wait a second. I'm not saying
people are rational. I think that people
will probably try at least one more
time. They'll build things; probably Elon will build something that's 10 times the size of Grok 3, which will be huge and it will, you know, have a serious impact on the environment and so forth. I just don't think... It's not just GPUs, also it's data, right? Like how much more data... Let's come to the data separately in a second. So I think people will actually try, right? I think Masa has just bankrolled Sam to try. I just don't think they're going to get that much for it. I don't think they'll get zero. I mean, there will be tangibly better performance on certain benchmarks and so forth, but I don't think that it's going to be wildly impressive, and I don't think it's going to knock down the problems of hallucinations and boneheaded errors. So here's what I'm getting at:
that's not going to feel much better
than what we have today. It doesn't seem
like you believe that reasoning is going
to make the bot feel much better than we
have today. Um, not the kind of reasoning they're doing. No, there's no emergent coding. So, are you
basically saying that what we have in AI
today, this is it like this for
generative AI for a while, I guess. I
mean, look, I put out some predictions
last year, in March, that people can look up; they're on Twitter. And those predictions include, I said there'd be no GPT-5 this year, or if it came out, it would be disappointing. It was supposed to come in the summer. Well, this was last year. So I said in 2024 we won't see this. And that was a very contrarian prediction at that point, right? This was a few weeks after people had said, "Oh, I bet it's going to drop, like, right after the Super Bowl. Won't that be amazing?" So, people really
thought it was going to come last year
if you go back and look at, you know,
what they said on Twitter, etc. And it
didn't. And I correctly anticipated that
it wouldn't. And I said we're gonna have
a kind of pileup where we're gonna have
a lot of similar models from a lot of
companies. I think I said seven to 10,
which was sort of roughly right. Um, and
I said we were gonna have no moat
because everybody's doing the same thing
and the prices were going to go down.
We'd have a price war. All of that stuff
happened. Now, maybe we get to so-called
GPT5 level this year. Keeps getting
pushed back. Um, I don't know if we'll
get much further than that without some
kind of genuine innovation. And I think
genuine innovation will come. But what I
think is we're going down the wrong
path. Yann LeCun used this notion of, you know, we're on the exit ramp. How do you say it? Large language models are the off-ramp to AGI. You know, they're
not really the right path to AGI. And I
agree with him. Or you could argue he
agrees with me because I said it, you
know, for years before he did, but we
won't go there. The broader notion is
sometimes we make mistakes in science. I
think one of the most interesting ones
was people thought that genes were made
of protein for a long time. So the early
20th century lots of people tried to
figure out what protein is a gene made
of. It turns out it's not made of a
protein. It's made of a nucleic acid that everybody now knows, called DNA. So
people spent 15 years or 20 years like
really looking at the wrong hypothesis.
I think that giant blackbox LLMs are the
wrong hypothesis. But science is
self-correcting. In the end, if people put another $300 billion into this and it doesn't get the results they want,
they'll eventually do something
different. Right. But what you're
forecasting is a basically an enormous
financial collapse because That's right.
I don't think LLMs will disappear. I
think they're useful, but but the
valuations don't make sense. I mean, I
don't see open AI being worth $300
billion. And you have to remember that
venture capitalists have to like 10x to
be happy or whatever. Like I don't see
them, you know, IPOing at $3 trillion. I
just don't. No, it's interesting because
I almost see the Open AI valuation as
the one that makes the most sense
because they have a consumer app. The
where the place that I start to get if
if what you're saying is correct that
we're not going to see any more if we're
seeing real diminishing results from
scaling and this is basically where we
are, then there's real worry for
companies like Nvidia, which has
basically risen on the idea of scaling.
They're down a third this year. They were at 2 point something, 2.5 trillion, last year.
They're a genuinely good company. They
have a wonderful ecosystem. They're
worth a lot of money. I mean, I don't I
don't want to put an exact figure, but
I'm not surprised that they fell and I'm
not surprised that they're still worth a
lot. No, but this is the
thing. If we end up seeing the fact that
this next iteration, the 10 billion
dollars that Sam is going to spend
seemingly on the next set of GPUs, uh,
if that doesn't produce serious results,
that's going to hurt, that will cause a
crash in Nvidia because so much of the
company's demand is coming based off of
this idea that scaling is going to work.
So, they have multiple problems, both
open AI and Nvidia. So, one is it does
look to me like we're hitting
diminishing returns. It does not look to
me like this inference time compute
trick is really a general solution. It
doesn't look like hallucinations are
going away. And it does look like
everybody has the same magic formula. So
everybody's basically doing the same
thing. They're building bigger and
bigger LLMs. And what happens when
everybody's doing the same thing? You
get a price war. So Deep Seek came out
and OpenAI dropped its prices quite a
bit, right? And so every because
everybody I mean not literally everybody
but you know 10 20 different companies
all basically have the same idea and are
trying the same thing. You have to have
a price war. Nobody has a technical
moat. OpenAI has a user moat. They have more users and that's the most valuable thing they have. For them, the API, I would say, is close to worthless. I don't know if worthless is the right word, but it's not worth very much. ChatGPT is the thing; it's the brand name that is
most valuable. I also think it's the
best bot right now. It might be. I mean, I think people go back and forth. Some people some days say it's Claude. I've been on the Claude train for a long time. And now you're on the... And I'm on ChatGPT. I think what's going to happen is you have leapfrogging, but the
leaps aren't going to be as big as they
were. So GPT-4 was a huge leap. I mean, this is a different way of saying what I said before: it was a huge leap over GPT-3. You know, let's say, I can't even keep up with the naming scheme, GPT-4.1, right? Let's say it's better than Grok 3 or Claude 3.7, let's just say
hypothetically. And so people run to
this side of the room and then, you
know, Claude whatever
3.8.1 or whatever will be a little
better and then some people will run to
that side of the room. Um, but nobody's
going to be able to charge that much
money because the the advances are going
to be smaller and people start to say,
well, you know, I use this one for
coding and this one for brainstorming
and whatever. But nobody anymore says
this is just like dominant. Like GPT4
was just dominant when it came out.
There was nothing as good as it for
anything. If you wanted this kind of
system, you used it, right? I mean,
that's my memory of it. I don't hear any
of the ChatGPT, or whatever, I can't even keep up with the names anymore, any of those products, any of the OpenAI products being referred to in the same kind of hushed tones, like they're just
better. And like, you know, Google's
still in this race and they may undercut
on price. Meta is giving stuff away.
People are building on it. DeepSeek, I hear, has something new that's going to, you know, be better than ChatGPT. Um,
and, you know, maybe it's true, maybe it's not, but we're in this era where the differences between the models are just getting really small. I want to ask you when you're going to admit that you were wrong about things, or if you ever will. Which things? Which things? I think that... But I also realize that the question doesn't really hit, because I just want to say, we spoke the last time you were... I think you've been on the show two times. Once with Blake Lemoine, once one-on-one. Yeah. And because it's interesting, I think you're one of the most outspoken
AI critics and you say a lot of the
things that we say here on the show,
which is that AGI is marketing. And even
if we don't hit AGI, there's still a lot
to be concerned about, whether that's
the BS that people are talking about or
being able to use these models for, um, you know, nefarious purposes by churning out content. Like, I don't know if you saw, there was this study where the University of Zurich tried to fool people on Reddit, or tried to convince people on Reddit, based off of answers by GPT, and it still convinced more people than... This is the new persuasion study. The persuasion study. I'm aware of it; I read it. So I guess, to me, it does seem like
it's kind of tough to be a critic of
LLMs right now because they have been
getting so much better. But I don't know
just sort of like I mean people say Gary
you're wrong and I say well here are the
predictions I actually made. Like I've
actually reviewed them in print and I've
asked people who say that I'm wrong to
like point what did I say that was
wrong. I think that sometimes people
confuse my skepticism with other
people's skepticism. Um, but I think if
you look at the things that I have said
in print, they're mostly right. And it,
you know, like Tyler Cowen said, you're
wrong about everything. You're always
wrong. And I said, Tyler, can you point
to something? And he said, well, you've
written too much. I can't do it. Well, I looked through some of your stuff and I do think that sometimes it seems like you might have put this enormous burden of proof on the AI industry.
Like you do pick out sometimes like
everyone that says like AGI is coming
this year and you're like these people
are liars. But that being said, like I
think your core argument, most people
are wrong. I've offered to put up money
and I offered Elon Musk a million,
right? And I offered criteria and I'll
tell you about that. In 2022 in May, I
offered him $100,000 bet. Later, I upped
it to a million dollars. And I put out
criteria on Twitter. I said, I'm going
to offer these. Do these make sense to
you? And everybody on Twitter, not
everybody, nearly everybody on Twitter
at the time said those were fine. Like
people accuse me of goalpost shifting,
but my goal posts are the same, right? I
have a 2014 paper in the New Yorker,
article in the New Yorker where I talk
about a comprehension challenge. I've
stuck by that. That is part of my AGI
criteria. I made a bet with Miles
Brundage on the same criteria, which he
actually took the bet to his credit. Um,
but when I put them out in 2022, this is
the important part, everybody was more
or less in agreement that those were
reasonable criteria. And I said, "If you
could beat my comprehension challenge,
which is say, you know, watch movies,
know when to laugh, understand what's
going on, if you could do the same thing
for novels, if you could translate math from English into stuff you could formally verify, if you could go into a random kitchen, you know, teleoperating a robot, and, you know, make a dinner. If you could... what was the other criterion? Um, oh, and if you could write, I think it was 10,000 lines of bug-free code, and you could do debugging to get there, whatever. You know, okay, if
you could do like three out of five,
we'll call that AGI. And at the time,
everybody said that's fine. Now people
are backtracking. Like Tyler Cowen said o3 is AGI. Like, by what measure? I felt that was kind of a stretch. It was cheesy. And he said the measure was him. It looked like AGI to him. That's... he invoked the, you know, classic line about pornography: I know it when I see it. But people have pointed out lots of problems with o3. I think it's absurd to call o3 AGI. I wouldn't call it AGI. So,
you know, you you a minute ago said,
"Gary, you're wrong." But then you
ticked off a bunch of things I'm
actually right about. I didn't say,
"Gary, you're wrong." I said, "What is
there a point you'll admit you're
wrong?" Like But what I'm trying is it's
the point at which I'm wrong. So, let me
clarify one other thing. But, let me
just say I didn't say that you're you're
wrong. I just said like it when was the
point of advance that you would say okay
yeah I've been wrong about this stuff
because I have listened to some of your
let me clarify something but I also
right after I said that I was like you
know it's kind of like a tough question
and then I explained where I agreed with
you yeah that's what happened um so
so some people take me as saying that AI
is impossible and that's not me right I
actually love AI want it to work I just
want us to take a different approach
approach, right? I want us to take a
neurosy symbolic approach where we have
some classical elements of classical AI
like explicit knowledge, formal
reasoning and so forth that people like
Kinton have kind of thumbmed their nose
at, but the say Demis has used very
effectively in Alpha Fold. So, we get
into that if you want. If we get to AI,
the question about whether I'm right or
not depends on how we get there. So,
I've made some pretty particular um
guesses about it and I have guessed that
pure LLM will not get us there. Pure
large language model. So, will I concede that I'm wrong when we get to AI that actually works? Depends on how it works.
Okay. Yeah. And I think it's clear that.
I mean, I don't know. We could watch
this back in a couple years. If we get
to pure LLM, if another round of
scaling, you know, gets us to AGI by the
criteria that I laid out, then I will
have to concede that I was wrong. Okay.
All right. All right, I'm going to take
a quick break and then let's come back
and talk a little bit more about the
current risks and maybe read some of
your tweets and have you expand upon
them. We'll be back right after this.
And we're back here on Big Technology
Podcast with AI skeptic Gary Marcus.
Gary, let me ask you this. So, you know,
one of the things we talked about last
time you were here was that AI doesn't
have to reach a the AGI threshold to be
something that we should be concerned
about. Absolutely not. And a lot of the
focus was on hallucinations. You and I
both I think we have a little bit of a
diverging opinion on hallucinations. I
think they've gotten much better. You
think it's still a big problem? Those
could both be true. By the way, that
could both be true. All right. So, let's
let's put a pin in that for now. I think
where I'm seeing the most concern uh is
virology. We just had a study that came out that showed that AI is now at PhD level in terms of virology. We had Dan Hendrycks from the Center for AI Safety here. We talked about the fact that AI can now walk virologists through how to create or enhance the function of viruses. And we're starting to see some of these AI programs, like you mentioned DeepSeek, be available to everybody, be
pretty smart and uh be released without
guardrails or not enough guardrails
especially if they're open source. So
what are you worried about here? Is that
the core concern or is there other
stuff? I think there's actually multiple
worries and the different worries from
different architectures and
architectures used in different ways and
so forth. So dumb AI can be dangerous.
So if dumb AI is empowered to control
things like the electrical grid and it
makes a bad decision, that's a risk,
right? If you put a bad driverless car
system in, you know, a million cars, a
lot of people would die, right? The main
thing that has saved a lot of people
from dying in driverless cars is there
aren't that many of them. And so, you
know, even though they're not actually
super safe at the moment, um, you know, we restrict where we use them and so forth; we don't put them in situations where they wouldn't be very bright. Um, so dumb AI can cause problems. Super
smart AI could, you know, maybe lock us
all in cages if it wanted to. I mean we
have to talk about the likelihood of it
wanting to, but there are definitely worries there and we need to take them seriously, and then you have things that are in between.
So for example the virology stuff is
AI that's not generally all that smart
but it it can do certain things and in
the hands of bad actors it can do those
things and I think it is true either now
or will be soon enough that these tools
can be used to help bad actors create
viruses that cause problems. And so I
think that's a legitimate worry even if
we don't get to AGI. So we have dumb
dumb AI right now is a problem. Smarter
AI even if it's not AGI can cause a
different set of problems and you know
if we ever got to super intelligent that
that might open a different can of
worms. I mean you can think like you
know human beings of different degrees
of brightness and with different skills
if they choose to do bad things can you
know cause different kinds of harm. And
so what's your view on open source then?
I worry about it. I do worry about it
because bad actors are using these
things already. They're mostly using
them for misinformation. Not sure how
much biology they're doing. Um, but they
will and they're going to be interested
in that. You know, state actors that
want to do terrorist kinds of things
will do that. Um, I am worried about
open sourcing at all. And I think the
fact that Meta could basically make that decision for the whole world is not good. Like, I think there should have been much more government oversight. Scientists should have contributed more to the
discussion. But now those kinds of
models are open source. They've been
released. We can't put that genie back
in the bottle. And over time, just like
people, I should have said this earlier,
even if the models don't get any better,
we will still find new uses for them.
And some of those new uses will be
positive and some of them will be
negative, right? We're still exploring
what these technologies can do. And
people are finding you know ways to make
money in dubious ways and to cause harm
for various reasons and so forth. And so
you know giving those tools very broadly
has problems. On the other hand I think
what we've learned in the last three
years is that the closed companies are
not the ethical actors that they once
were. So you know Google famously said
don't do evil and they took that out of
their platform. Um, you know, Microsoft
was all about AI ethics and then, you
know, when Sydney came out, they're
like, "We're not taking this away. We're
gonna stick with it." Well, they did
kill Sydney, right? Sydney was this
Well, they
very, I don't know, runchy AI that tried
to steal Kevin Rooster's wife. So, yeah.
I mean, they they reduced what it could
do. But, um, but they stuck with it in
some sense. But, you know, and like
OpenAI said that we're, you know,
nonprofit for public benefit. Now
they're desperately trying to become a
for-profit that is really not
particularly interested in public
benefit. It's interested in money, and they may become a surveillance company, which I don't think is... Because what you're talking about is the advertising side. So basically they have
a lot of private data because they have
a lot of users and people type in all
kinds of stuff and they may have no
choice but to monetize that and you know
they've been showing signs of that. They
hired Nakasone, who used to be at the NSA.
They bought a share in a webcam company
and they recently announced they're
trying to build a social media company.
They want, you know, they look like
they're on a a path to sell your data,
your very private data to, you know,
whoever they care about. It's concerning
because whatever data I gave to
Facebook. Like I always used to think
that this conversation around Facebook
data was a little ridiculous because uh
I didn't think I was giving that much
information to Facebook. But I am giving
OpenAI a lot of information. I mean, there are a lot of people that treat it as a... Well, that's the number one use, as a therapist or companion. I don't use it as a therapist, but I'm, like, putting a lot of my work in. I read a great book called, uh, Privacy and Power, I'm blanking slightly on the title, by Carissa Véliz, um,
and she had examples in there like
people were taking data from Grindr and extorting people, right? Grindr is an app for gay people, if you don't know, and, um, you know, that's still, in our society, like, in some places it's acceptable and in other places, um, you know, people don't necessarily want to come out if they're gay, whatever. And so people have been extorting people with data from Grindr. Imagine what they're going to do. You know, people type into ChatGPT, like, their very specific sexual desires, maybe crimes they've committed. Like, people are typing a lot of stuff. Crimes they want to commit. Crimes they want to commit. You know, we have a political climate where, you know, crime or conspiracy might be treated in a different way than it once was. And so just typing it into ChatGPT might, you know, get somebody
deported. Who knows? Um, now I'm freaked
out. It's I wouldn't personally use the
system because the writing is on the
wall and I think that they they make
some promises to their business
customers, but not to their um, you
know, consumer customers. And that stuff
is available for them to do what they
want with it. And they probably will
because that's how they're going to make
money. Here, here's another way to put
it is suppose I'm right about the things
I've been arguing and they can't really
get to, you know, the GPT7 level model
that everybody dreamed of. They can't
really build AGI, but they're sitting on
this
incredible treasure chest of data. What
are they going to do? Well, if they
can't make AGI, they're going to sell
that data.
This is why I always thought like when
you take in a lot of money, it's always
you always have to pay that money back
in some way and that changes the way you
operate. That's right. I mean, look at
23andMe. They're out of business and now that data is for sale, and who knows what's going to happen with the 23andMe data. I hope you're wrong about this
one, but with the history... Exactly. I'm not saying you are. I'm just saying I hope you are, cuz that would be... I hope I'm wrong too. But there is a level of... A lot of things I hope I'm wrong about, Gary. If people got that freaked out about what Facebook was doing with your data, if they overstepped, there's going to be
a major societal backlash.
I think maybe I mean sometimes people
just accommodate to these things. I've
been amazed at how willing people are
to, you know, give away all that
information to Facebook. I don't use it
anymore. But let me ask you this. You uh
quote tweeted one of these. So we'll get
into a tweet here. You quote tweeted one
of these tweets: "Is the push to optimize AI for user engagement just metric-chasing Silicon Valley brain, or an actual pivot in business model from 'create a post-scarcity society god' to 'create a worse TikTok'?" This is basically what we're talking about: that might be the pivot. Yeah, that's
right. I think that was someone else's tweet that I... Yeah, Daniel Litt. And you said, "I've been basically telling you
about this." Yeah, exactly. So, that's
what it is.
Um, you also wrote this saying the quiet
part out loud: "The business model of GenAI will be surveillance and hyper-targeted ads, just like it has been for social media." That's right. And we were just talking about that, and what I was quote tweeting there was something from Aravind Srinivas, if I pronounce his name correctly, who's the CEO of Perplexity, and he basically... I said he's saying the quiet part out loud. He basically said we're going to use this stuff to hyper-target. Hyper-target. You also
said that companies like Johnson and
Johnson will finally realize that GenAI was not going to deliver on its
promises. Have there been companies that
have pulled back? Like is are you just
using Johnson and Johnson as an example?
That was based on a Wall Street Journal
thing, and I may have failed to include the link because of Elon Musk's crazy notions around links. Elon, you got to put the links in. Elon, you got to put the links in, whatever else. That's right. So anyway, I was alluding to a
Wall Street Journal report that had just
come out. um which showed that J&J had
basically said in so many words I'll
paraphrase it they tried Gen AI in a lot
of different things generative AI and a
few of them worked and a lot of them
didn't and they were going to like stick
to the ones that did like customer
service and maybe not do some of the
others I mean you have to go back you
know a year and a half in history to
when people thought Gen AI was going to
do everything that an employee was able
to do basically and I think what J&J and
a bunch of companies have found out is
that's not really true you know they can
do a bunch of things that employees do,
but they can't typically do everything
that a single employee does and you know
they're reasonably good at triaging
customer service and they're not
necessarily good at creating, say, a careful financial projection. Okay. So
Gary, you have like 5 minutes left. You said something, I think in the first half, about the path that
you think needs to be taken to AGI. Can
you explain what that is in like as
basic of a way as you can to like you
know make it as simple to understand for
anyone who's not caught up with the
systems that you spoke about. Sure. So a
lot of people will have read Danny Kahneman's book Thinking, Fast and Slow, and there he talked about system one and system two cognition. So system one was fast, automatic, reflexive. System two was more deliberate, more like reasoning.
I would argue that the neural networks
that power generative AI are basically
like system one cognition. They're fast.
They're automatic. They're statistically
driven, but they're also error-prone. They're not really deliberative. They can't sanity check their own work. And I would say we've done that
pretty well. But system 2 is more like
classical AI where you can explicitly
represent knowledge, reason over it.
Looks more like computer programming.
And these two schools have both been
around since the
1940s, but they've been very separate
for what I think is sociological and
economic reasons. Either you work on one
or you work on the other. People argue
or fight for graduate students and fight
for grants and stuff like that. So
there's been a great deal of hostility
between the two, but the reality is they
kind of complement each other. Neither
of them has worked on its own. So the
classical AI failed, right? People built
all these expert systems, but there were
always these exceptions and they weren't
really robust. You'd pay graduate
students to patch up the exceptions. Now
we have these new systems. They're not
really robust either, which is
why OpenAI is paying Kenyans and PhD
students and so forth to kind of fix the
errors. The advantage of system one is it learns very well from data. The disadvantage is it's not very accurate... sorry, not very abstract. No, I should have said that slightly differently. The
large language models and that kind of
approach
transformers are very good at learning
but they're not very good at
abstraction. You can give them billions
of examples and they still never really
understand what multiplication is and
they certainly never get any other
abstract concept. Well the classical
approach is great at things like
multiplication. You write a calculator
and it never makes a mistake but it
doesn't have the same broad coverage and
it can't learn new things. So you can
wire multiplication in, but how do you
learn something new? The classical
approaches have had trouble with that.
And so I think we need to bring them
together. And this is what I call
neurosymbolic AI. And it's really what
I've been lobbying for for decades. And
I think it was hard to raise money to do
that in the last few years because
everybody was obsessed with generative
AI. But now that they're seeing the
diminishing returns, I think investors
are more open to trying alternatives.
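To make the neurosymbolic idea concrete, here is a deliberately tiny, illustrative sketch, not Gary's proposal or any real product: a learned component handles open-ended language, while exact sub-problems get routed to a symbolic engine that never makes arithmetic mistakes.

```python
import re

ARITHMETIC = re.compile(r"[0-9\s+\-*/().]{3,}")

def symbolic_calculator(expression):
    """System-2-style component: exact, verifiable arithmetic on a tiny grammar."""
    if not re.fullmatch(r"[0-9\s+\-*/().]+", expression):
        raise ValueError("not a pure arithmetic expression")
    return eval(expression)  # acceptable here only because the regex restricts input

def neural_stub(prompt):
    """Stand-in for a learned model (system 1): flexible but unverified output."""
    return f"[model draft] Here is my best guess about: {prompt!r}"

def neurosymbolic_answer(prompt):
    """Route exact sub-problems to the symbolic engine and everything else to
    the neural component -- a toy version of the hybrid being described."""
    match = ARITHMETIC.search(prompt)
    if match:
        try:
            return f"Computed exactly: {symbolic_calculator(match.group().strip())}"
        except (ValueError, SyntaxError, ZeroDivisionError):
            pass  # fall through to the neural component
    return neural_stub(prompt)

if __name__ == "__main__":
    print(neurosymbolic_answer("What is 123456789 * 987654321?"))
    print(neurosymbolic_answer("Summarize the plot of this movie."))
```

Real hybrids like the ones Gary points to are far more sophisticated, but the division of labor, statistical pattern-matching plus explicit, verifiable computation, is the basic shape he is describing.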
And also, AlphaFold is actually a neurosymbolic model. And it's probably the best thing that AI ever did. And so, decoding proteins, protein folding. Yeah. Figuring out the three-dimensional structure of a protein from a list of its nucleotides. Um, and so, are you
going to raise money to try to do this?
I'm very interested in that. Let's put
it that way. Masa Son, if you want to make use of your money... No, I'm kidding. You talking to
him? Uh, not at this particular moment.
Okay, Masa, if you're
watching, I don't know. Trying to help.
Okay, great. Well, Gary, can you shout
out where to find your Substack? So, if
anybody wants to read your uh longer
work on the state of AI, where should
they go? Sure. So people might want to
read uh my last two books by the way.
Taming Silicon Valley, which is really about how to regulate AI, and Rebooting AI, which was 2019 and is a little bit old, but I still think anticipates a lot of
the problems around common sense and
world models that we're still facing
today. And then for kind of almost daily
updates uh I write a Substack which is
free although you can pay if you like um
to support me. Um, and that's at garymarcus.substack.com.
Okay. Well I'm a subscriber Gary. Great
uh to have you on the program. Thanks so
much for coming. Thanks a lot for having
me again. Yet again. Yet again. Yet
again. Well, we'll keep doing it. It's
always nice to hear your perspective on
the world of AI. So, I always enjoy our
conversations. Thanks for having me.
Yes. Same here. All right, everybody.
Thank you for listening. We'll be back
on Friday breaking down the week's news.
Until then, we'll see you next time on
Big Technology Podcast.