How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR

Channel: aiDotEngineer

Published at: 2026-01-19

YouTube video id: k1t2xyWMUdY

Source: https://www.youtube.com/watch?v=k1t2xyWMUdY

>> Here's the very simple argument. If you look at some notion of compute over time, this could be R&D spending on compute, experimental compute, training compute, whatever a particular lab is using, it looks like this, no surprise. If you have another chart of log time horizon over time, say the measure from the figure many of you will have seen on Twitter, it looks like that.
Now say these two things were not merely coincident but causally proportional, in the sense that if compute growth were to halve, then time horizon growth would halve. So, for the sake of argument, say that starting from 2028 or so the compute curve begins to bend: this would be no growth, this would be the original growth, and something like half in between. Then, if they were causally related, and in particular causally proportional to one another, you'd expect the time horizon curve to bend the same way. And for some milestone you care about, say a one-month-of-work time horizon up here, the implied delay in AI capabilities is potentially enormous.
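The argument above can be put in a toy numerical model. Everything here is an illustrative placeholder (a 1-hour current horizon, a 7-month doubling time, one work-month as the milestone), not METR's actual fit:

```python
import math

# Toy model of the proportionality argument. All numbers are illustrative
# placeholders, not METR's actual estimates.
H0 = 1.0                # current time horizon, in hours (assumed)
doubling_months = 7.0   # assumed historical doubling time of the horizon
milestone = 167.0       # ~1 work-month of task length, in hours

growth = math.log(2) / doubling_months  # log-growth of horizon per month

def months_to_milestone(rate):
    """Months until the horizon reaches `milestone` at log-growth `rate`."""
    return math.log(milestone / H0) / rate

baseline = months_to_milestone(growth)    # trend continues unchanged
slowed = months_to_milestone(growth / 2)  # compute growth halves, so under
                                          # proportionality horizon growth halves
print(round(baseline, 1), round(slowed, 1), round(slowed - baseline, 1))
```

Under these placeholder numbers the one-month milestone arrives in roughly four years on trend, and halving the growth rate doubles that, a delay comparable to the entire original timeline.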
Now, why might compute growth slow down? Lots of people have circulated the idea. I'm not an expert in those forecasts, but the prior reasons do seem somewhat strong to me. One is physical constraints we might hit, power constraints as mentioned, or the various others that Epoch have reported on, all of which seem not to bite through 2030 but could potentially bite sometime after 2030. The more likely one, I think, is simply that dollars are a constraint: large tech companies can only spend so much, and at a certain point even large nation states can only spend so much. There are some scenarios in which you can keep going, but this seems to naturally imply a slowing down.
The additional point this paper is trying to make is that under a very contestable but standard assumption from economics, you should in fact expect these two to be causally proportional. In particular, you should expect them to be causally proportional to the extent that, or for the period that, a software-only singularity is not possible, and that's another discussion we can have. But at least in this somewhat business-as-usual scenario, or until that scenario no longer applies, I think this is maybe a reasonable model, and it does imply some slowing of AI capabilities in the near future.
I have no plan for this session whatsoever.
>> That also assumes that we don't have a technological advance that dramatically improves capabilities relative to compute, an unpredictable technological advance, right?
>> Yeah. I mean, all predictions assume no unpredictable
>> [laughter]
>> Time horizon, and straight lines on log-linear plots in AI generally, have been a highly underrated forecasting tool, I think. They've done extremely well over many orders of magnitude now. I think it's reasonable to have the default expectation that the log-linear lines continue through approximately the same number of orders of magnitude, except maybe if there's some significant break in the inputs. Of course, on the upside there could be something quite dramatic. A software-only singularity is the first thing that comes to mind, but another transformer-style moment seems like another candidate, actually.
>> Of course, one of the problems with testing this is that the models will eventually eclipse the maximum possible amount of time that the tasks in the evaluation set can take.
>> Yeah, there are some ways around this that we're working on, and I'd be excited to talk about that. They all feel pretty early.
But yeah, I think it's right that if time horizons are doubling, eventually the doubling time is such that you can't possibly make long enough tasks in the relevant
>> It's also that we actually hit a place where time horizon is no longer a useful measure, because now what you want is total time to decrease: you want the same results in less time.
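The saturation worry above can be put in back-of-envelope numbers, all illustrative (a 1-hour current horizon, a 7-month doubling time, and one work-month as the longest task a suite could realistically contain):

```python
import math

# Back-of-envelope, illustrative numbers only: with an exponentially
# growing time horizon, when does it pass the longest task an evaluation
# suite can realistically contain?
h0_hours = 1.0         # assumed current 50%-horizon
doubling_months = 7.0  # assumed doubling time
ceiling_hours = 167.0  # ~1 work-month, assumed longest feasible task

# Solve h0 * 2^(t / doubling) = ceiling for t (in months).
months_until_saturation = doubling_months * math.log2(ceiling_hours / h0_hours)
print(round(months_until_saturation, 1))
```

Under these placeholders the suite saturates in a few years, after which longer tasks (or a different measure) are needed.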
>> So what you want is higher reliability at a lower time horizon. One thing to say about time horizon is that there are two notions of time here: a human time axis and a calendar time axis. The time that the model is working for, I think you should approximate to zero. It's not actually zero, they are taking actions, but to the extent they're going to be successful on tasks, they largely do their successful work pretty early on. So my guess is that it will continue to be the case that there's not so much extra value on that margin of making models complete tasks more quickly, although reliability very much so, obviously.
>> So most of the time is spent in the human-machine iteration loop: the humans working without AIs, and the AIs working without humans. So for the humans, I guess it's all human
>> Yeah, yeah, exactly.
>> Cool. Any questions for me? I can go through some upcoming things that we're excited about, if people are excited about those things.
>> Yeah, I did have one, that time-perception one, one of those kinds of things.
>> Yeah, yeah.
>> One thing I thought you brought up a little bit in the paper is whether or not familiarity is a confounding factor, tool familiarity in particular. And of course you also brought up that tool capability has dramatically changed. But there was an interesting presentation from Meta at the developer community engineering summit this year. They have probably the best infrastructure of any company in the world for quantitative measurement of developer experience, and they can tell you basically how long it actually takes to make a PR, they call them diffs at Meta, in actual human time and effort. What they saw when they gave people agents was a J curve, and that J curve lasted, I don't know how long, maybe 3 or 6 months. So one of the things I wonder is whether there's a cutoff in how much familiarity the person has. Have they been using this as their full-time daily driver for a period of months? And is there an interesting cutoff that occurs once they're at a certain level of familiarity?
>> Yeah, I'm totally on board with J-curve-like explanations being a real thing, not just in this case but in many economically relevant cases outside of software engineering. Developers, and not just developers, experiment with tools. You tend to be slower the first time you're experimenting with a tool, but you do it because you get some investment benefit: later on you might be more proficient with the tool, or in the case of AI, maybe you just expect the models will get better, so even if you don't become more proficient it's the kind of thing you want to do. Those explanations broadly make sense to me. I can give you some reasons why I'm interested in this.
>> [laughter]
>> As background, we're continuing with this work, and we'll see. Another thing to say is just that, quantitatively, the difference between this and this is very large.
>> [laughter]
>> So how much is the J curve explaining? I think it's not explaining that much.
>> Let me explain why, because we see this over and over in software engineering studies: the one question you can't ask people on a survey is how long a task took. You can ask people how much more productive they felt, and they will give you an accurate response that correlates with quantitative data. But ask anybody the amount of time something takes, and they are almost always wrong. So when I shared this with my colleagues, I said I wasn't surprised about that part at all. What was interesting was the magnitude of the slowdown.
>> Yeah, point well taken. That makes a lot of sense.
Despite this, we were interested in time estimates, because we're interested in providing
>> I mean, it's the perceptual aspect. I do think that's relevant too, because the perceptual aspect is also the hype aspect, right? Developers will tell you that they were faster when they weren't, and I think that is worth knowing.
>> And to the extent that we're interested in measuring the possibility and timing of capability explosions, or of R&D being automated, one commonly proposed measure is just to ask developers or researchers how much they've been sped up. For exactly the reasons we're pointing out, I don't put a lot of faith in those estimates. So it's nice to see it measured like this. Here are some more J-curve things.
So, the forecasters, who are not predicting time to complete, right, they are just predicting the effect size, the non-developers, the expert forecasters: they are told the degree of experience these developers have. In thinking about how this population might differ from other populations, some of the forecasters point out various facts about the study. The developers are more experienced, and I'd expect experienced people to get less speedup; the repositories are larger, and I think AIs are less capable of working on large repositories, so I'd expect less speedup. They never mention familiarity with tools. My sense is that they shared the sense I had at the time, which is that most of the action is in understanding what kinds of things AIs are good or bad at in the first place. And all of these developers have experience with LLMs in their core development workflow; it's just Cursor that three quarters of them are totally unfamiliar with at the start of the study. So I just wasn't seeing much margin there. I do think it's an open question. But also, we watched so many hours of screen recordings of these developers working, and I think they're working very reasonably. In some cases worse than me and my colleagues, in some cases better. I'm not seeing some advanced workflow that they're failing to access.
>> Yeah, and my experience is not that far off from this: there are times when I am dramatically slowed down, and times when I am accelerated.
>> Yep.
>> And although my familiarity with the tool increases, I definitely don't see a speedup. I improve a lot, because I learn over time what I can tell it to do and what I can't tell it to do.
>> Yeah.
>> In addition to just getting better with it, like understanding, okay, now I need to plan before doing such-and-such.
>> Yeah. That's why, before you make a high-level architectural decision that's going to blow up in your face ten conversation turns down the line, you really try to think about it.
>> Yeah, exactly. And also scope it down to a smaller problem. At first I would try problems that were too large, and it can't handle that.
>> But just for the future, if you ever do, I mean, I think it's obviously really hard with the 16-person sample size, but it'll be interesting.
>> Great, great.
>> Because in the future, I think having a cutoff, trying to figure out if there is a familiarity cutoff where the number changes, would be interesting, to see if that Meta result generalizes outside of Meta.
>> We are on it. I think the AIs have been getting better during this period, which is going to compound a lot of what's going on, obviously. But yeah, it'll be interesting.
>> The thing is, the projects themselves are very optimized for people coming onto new projects and figuring them out. The ones that struggle to be organized well, for humans to come on board and build and navigate them quickly, don't survive very long in the open source ecosystem. And these are fairly mature open source projects. That's a little different from an enterprise setting, where things survive because they make money, even if they're a pain to develop on, right? So the context is a bit different.
>> Yeah, it's interesting.
>> Yeah, that is a really interesting point, because actually some of the repos I was helped the most with were ones I was completely unfamiliar with, which had no decent documentation of any kind, where I had to come into a legacy code base that had existed for years and make a change, and the developer who owned it was only partially available to answer my questions. In that case, Claude Code was a huge help. Legacy code bases don't exist because they work well; they exist because they make money.
>> [laughter]
>> Interesting point.
>> So the question I had was: did all the developers have the same level of familiarity with Cursor, or was there some variance? And is there a plot of each of their familiarity
>> There's always a plot.
>> [laughter]
>> There's always a plot. So here's some evidence on the question of whether there's a J curve. I can show you some plots, but I think the sample size is just small enough that you shouldn't really believe any of them. The plots aren't going to show much, but I don't want to say that's strong evidence that this is not something that's going on; I just think the evidence is kind of weak. The thing that really convinced me was watching the videos of them working: often they're better at using Cursor than I am, and I'm the one working on this project using Cursor.
>> [laughter]
>> But here are some graphs. This one is split by whether they have various types of AI experience coming into the study, and basically you see no movement in the point estimates. People for whom Cursor was their primary IDE before: not a huge difference versus people for whom it was not.
Then the next one: you might think some J-curve cutoff comes after that point, but still, within the study there's some variation in how much AI exposure people have, because they complete multiple issues; after the first AI issue they're slightly more exposed than before, and more again after the second. So you can try excluding the earlier data points over time and seeing what pops up, and they don't seem to get better at using AI over time.
>> Although I think there's probably a statistical issue with that plot right there. Those bars are very, very wide.
>> Oh, yeah. I think all of the plots outside of the main plots, all of these subset things, you should not put a lot of stock in. I totally agree.
[laughter] Okay, and then a lot has been made of this plot. This graph is the reason we filed it under unclear evidence, because things point in different directions. A lot has been made of it suggesting something J-shaped: in particular, that at the end, once people have more experience, they do experience some speedup. Here are some issues. First, the other plots don't show this, and I think that's important to include. Second, these hours are coded very conservatively. For instance, someone in the 30-to-50-hours bucket had Cursor as their primary IDE in 2024. They had recorded themselves on their time-tracking software as having spent 140 hours using Cursor, conservatively estimated that they'd spent 50 hours, and so ended up in our 30-to-50-hours bin. This is someone whose primary IDE was Cursor last year. People have been commenting that the developers had been using Cursor for less than a week, and I think that's not a very fair assessment. If you were to move that developer from the penultimate effect-size estimate to the last one, and again, you shouldn't believe this because of the statistics, you'd see some balancing out, where that last bucket gets back to essentially zero. But again, don't rule anything out; J-curve explanations are still on the table.
>> Is it not likely that the 50-plus-hour group is similarly underestimating the time they've spent using Cursor, and that if you just had a longer scale, you would still see a trend?
>> Oh, that is an interesting point. That seems plausible to me. I'm not sure it's an underestimate, given that we're using this very conservative coding, but yes, that seems plausible. And then, for this not to be strong evidence, I'd retreat back to: I think you shouldn't really believe any of these plots.
>> I think the basic issue is that it's a small sample size, and there's also a lot of bias in the data set, effectively. It's a certain kind of data set.
>> You mean the kinds of developers
>> Yeah: open source developers, working on open source projects that are pretty mature. For working with open source developers on mature projects, this is probably reasonably indicative, maybe, though the sample size is pretty small. Outside of that, it gets a little harder.
>> Yeah, and talking about this, I think this group is really weird, and it's interesting for the same reason it's weird, right? We were interested in studying the possible effects of AI on R&D speedup or automation. There, if any type of developer is not being greatly sped up, it implies the whole thing isn't being sped up. So it is kind of curious to see even particular weird populations. And you might imagine that large production code bases have a bit more of this shape than scrappy experiment scripts. But I think it's right that it's very interesting and just hard to generalize; we just don't know. We're doing a larger study, which includes more greenfield projects, and I think, unfortunately, even after the large study it's still going to be hard to generalize, for not totally dissimilar reasons.
>> Although I don't feel like your results particularly contradict any actual independent research that's been conducted. The only research I've seen that would contradict yours is research funded by model shops or agent shops.
>> [laughter]
>> What can I say about that? I do think that most of the research that's put out is associated with large tech companies, and I think there are methodological concerns there that are reasonable. I have methodological concerns with some of that work as well, and I know people who work at some of those places who have methodological concerns with the work that was output. Though I mean, there are concerns about ours as well.
>> Sure, sure. But actually, I remember somebody sent me your paper, and when I saw the headline I was like, no way.
>> [laughter] Me too.
>> I thought it sounded like BS. Then I read the paper and thought, oh, this doesn't suck at all.
>> [laughter]
>> I feel like your high-level conclusion is both intuitive, to a person who has read a lot of software engineering research, and well justified. I have had people argue with me about the 16-developer thing, but I don't think that actually matters in this particular case, because the developers are actually a fairly good control set, more or less, for an experiment. They remove a lot of validity concerns by being experts. It's true that they don't represent the broad population of developers, but they also remove a lot of the variance you would expect from that population, and they let you isolate that factor away and see what happens. That's what I liked, and I thought the way the study was conducted was completely sufficient to draw the high-level conclusion that it drew.
>> Thank you very much. Here's a curiosity. We haven't published this, for organizational reasons that we won't go into, but
>> [laughter]
>> we did conduct it. People would throw out their various explanations for what's going on here, many of which have lots of merit, some of which I'm more skeptical of. A natural one is brownfield versus greenfield projects. So we ran this kind of enormous hackathon, where we randomized half of the teams to use AI versus not, maximally greenfield or something like it. Then we had a bunch of judges score the projects, many judge scores per project, to try to even out the noise, and looked at whether, say, the bottom 50% were all the AI-disallowed teams and the top were all the AI-allowed teams, or something like that. Unfortunately, it was even smaller than the main study; that's part of the reason we're not publishing it. I think the evidence is really quite weak, and the degree of overlap is enormous. I'm a bit nervous about saying this, because it hasn't gone through the kind of review processes that something like the main study goes through, so maybe I've messed something up, but I think the point estimate is something like four percentile points higher if AI is allowed versus if it's not, after controlling for everything else. That is extremely noisy and you shouldn't draw any conclusions, but it seemingly suggests maybe small effects from allowing AI.
>> So the question I have, and I guess this is related to other research that you've done: have you explored the effect of AI in domains other than software engineering? And if so, have you also found this kind of surprising result, that maybe there isn't as much of a speedup?
>> No, no new directions; that's work we have not done. We're interested in understanding the possibility of accelerating R&D, and coding is not the only kind of thing that happens at major AI companies; much more conceptual work happens. I'd be very excited about working with math PhD students, or very different types of software developers, or running these kinds of studies inside of major AI companies, or large tech companies, something like that. We're very interested in settings that are, not necessarily directly, but in somewhat close analogy to, the large AI company case. To the extent that something really deviates from that, we're probably less interested.
>> Good, interesting. So it sounds like you're interested in measuring capabilities for things like math research and other kinds of research.
>> I'd say I'm interested in what the hell is going on in AI, and in how I'm going to learn the most about what the hell is going on in AI. Something a bit more conceptual, something where fewer humans are currently working on it, so it appears less in training data, will help me better triangulate the truth about what's going on in AI. Even if I don't care about math research in particular, it'll still yield helpful qualitative lessons, is the sense I have.
>> Yeah. If I were going to pick areas where I would expect AI to be more successful, but where I think it is actually being less successful, I would probably pick data science as an interesting one. How much are data scientists helped by AI today?
>> Say more about why you expect it to be less successful.
>> So, let me give you a real example.
At LinkedIn, there are 5,000 tables with the name "impressions" in the table name, right? So if an analyst wants to understand how many impressions happened on a page, where the hell do they go? AI can't figure that out. Today there is no existing AI system that could be hooked into a corporate environment like that and process through it; there are trillions of rows in those tables. What a data scientist needs to do is say, I need to analyze a bunch of data and come to a conclusion, right? And I hear lots of thoughts about building systems; people talk about ML and SQL, and the models are much better at writing SQL than they used to be. But I believe the state of the underlying data is so bad that the actual data scientist is going to get way less value out of the AI than software engineers thought they were going to.
>> Hm, that is interesting. That's very curious. One view that some more bearish people have, looking at the future of AI, is that there's so much tacit knowledge around, so much knowledge embedded inside of companies, that you're not going to pick it up from these RL training environments, something something. Maybe it's not the state of nature that there need to be many specialized AIs; indeed, much of the lesson of the past few years is that one big general AI seems to be more performant. But at some point in the future, when data is locked up inside of companies, maybe we get more of a proliferation of specialist models, a GPT-N fine-tuned on LinkedIn data in particular, something like that. My reaction to that kind of story, I don't know; I do have a disbelief-like reaction. I'm like, ah, what is science, you know?
>> [laughter]
>> But also: contradictory facts. The problem with these problems is that all of these data sets contain contradictory facts. The name of the field will be, say, "date_started" or "time_started", right? And then it will contain only a date, except it will only contain the date up until November of last year; after that it will contain only the month; and after that, maybe the seconds at which the thing finished. And in order to actually successfully query the data, you, the data analyst or the data scientist, have to know what those cutoff dates were, which is not written anywhere.
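The situation being described could be sketched like this; the column name, cutoff date, and formats are all hypothetical, invented for illustration:

```python
from datetime import datetime

# Hypothetical illustration of the "time_started" column described above:
# the format silently changed at an undocumented cutoff, so a correct
# parse needs tribal knowledge baked in. Names and dates are made up.
CUTOFF = datetime(2024, 11, 1)  # assumed, undocumented format change

def parse_time_started(raw: str, row_written: datetime) -> datetime:
    """Parse a value whose format depends on when the row was written."""
    if row_written < CUTOFF:
        return datetime.strptime(raw, "%Y-%m-%d")        # date only
    if len(raw) == 7:
        return datetime.strptime(raw, "%Y-%m")           # month only
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")   # full timestamp

print(parse_time_started("2024-03-05", datetime(2024, 3, 6)))
```

The point is that nothing in the schema tells you which branch applies; the `CUTOFF` constant is exactly the tribal knowledge that "is not written anywhere."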
>> Although what you could do, theoretically, is import a bunch of the SQL that other analysts have written, to try to figure out how they triangulated these things, or work backwards from those reports.
>> Sorry, I just haven't worked at a large company. People don't fix this at the source?
>> [laughter]
>> No. No. The lesson I feel like I learn over and over again is that data specs really matter. Really, really matter.
>> I've also been working in data analysis and research, developer research, so yeah.
>> And the problem is that their job is to produce this report for this executive, right? Not to go build infrastructure to produce this report.
>> Yeah.
>> Oh, but I'm like, if I
Okay.
>> [laughter]
>> I'm with you, I live that dream every day. But you just end up having to, right? You have to build out infrastructure for it; that has to be part of the job description. And the other part is that you have to fix the problem at the source. I still remember a conversation where someone said it's too difficult to fix at the source because there's too much complexity across all the systems and all the sources. I said, okay, wait a minute: you're saying a problem that is too big for the entire organization to solve at the source is somehow easier to solve downstream? Come on. That doesn't make any sense.
I just think there's so much potential here, and I have not seen many studies on how people working in that data space are experiencing AI. What's fascinating is that real ML is mostly data work. Especially outside of LLMs, the majority of ML engineers spend most of their time doing feature curation, and trying to clean up bad data for feature creation, rather than direct model training. So, theoretically, the potential even for improving ML, by enabling AI to be a better data scientist, is huge. My hypothesis is that if you went into this space, you would discover: it is great at telling me how to write SQL, or how to write pandas, or Polars, or whatever you're using; it is okay at doing very trivial things; and it fails at all complex tasks. Fails completely on complex tasks. I don't even see a benchmark on it.
Mhm. Can you give me an example of a complex task? Sure. Let's say a complex task is: give me the P90 of time between deployments, for all deployments that happened at Capital One.
It struggled at that? Yeah, that doesn't seem surprising to me.
>> Not surprising, right? Yeah. So I'm like, you know, if it has sort of reasonable context about where it would find this...
>> So if I had that data, right? Sure, sure, makes sense. And then, okay, so fine. So give me that number, and then also make sure that I can break that down by team hierarchy. So can you give me that in a table, so I can break it down by team hierarchy? Where is the team hierarchy data?
Oh, here's a funny thing. What PRs were in those? How do I actually determine when the deployment started and ended? Because it turns out that's not clear in the base telemetry. You have to know magic to figure out when the deployment started and ended.
Um
Oh, and also tell me, for my ability to analyze it, how many PRs were in each of those deployments, and which PRs went to each of the deployments. Well, guess what? The deployment system only... This is being recorded, right?
I think it is being recorded. Okay. Yes, but before you...
>> [laughter]
>> Um, so then, you know, imagine the deployment system doesn't contain sufficient information about that data, right?
Then where do I get that data? Well, that data doesn't exist in any other system. So maybe I have to go to GitHub and call the GitHub API, and the chance of the LLM, or any agent, figuring that out today is pretty minimal.
Mhm.
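For concreteness: the core of that task is a few lines of pandas once the data is clean. The `deploys` table below is a made-up illustration, and, as the conversation points out, reconstructing those start times from real telemetry is precisely the hard part that this sketch assumes away.

```python
import pandas as pd

# Hypothetical, already-cleaned deployment telemetry. In the real setting
# described above, recovering started_at from the base telemetry is the
# hard part; this sketch assumes that work is already done.
deploys = pd.DataFrame({
    "service": ["api", "api", "api", "web", "web"],
    "started_at": pd.to_datetime([
        "2025-01-01", "2025-01-04", "2025-01-10",
        "2025-01-02", "2025-01-09",
    ]),
}).sort_values(["service", "started_at"])

# Time between consecutive deployments, computed per service.
deploys["gap"] = deploys.groupby("service")["started_at"].diff()

# P90 of time between deployments across all services.
p90 = deploys["gap"].dropna().quantile(0.9)
print(p90)  # 6 days 19:12:00 for this toy data
```

The team-hierarchy breakdown discussed next would just be another `groupby` key, if the hierarchy data existed anywhere, which is the punchline.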
Yeah. You know, relative to my colleagues, I'm pretty bearish on AI progress, but I do still have some reaction that's like, ah, can't you spend the day getting this into a Cursor rules file?
>> [laughter]
>> You know, like, where the hierarchy exists. I would go...
I think that's why I think it's interesting. Like what you were saying, I have not seen any real comprehensive study on the experience data scientists are having.
If you have any ins to run studies at large tech companies, then I'm all ears.
There is a fellow at OpenAI that I was talking to, who was one of the speakers, who does internal evals, and he has mentioned that he's done some work with data scientists. So he might know some people who have that data. But it's all been internal, between him and Cursor, between him and, you know, the product team or whatever, right?
Um
Yeah, and I also think... one of the ones I'm curious about too is lawyers. Curious about more traditional, older professions: lawyers, doctors, and mathematicians are all very interesting to me. Just because both lawyers and doctors are so constrained by a legacy history of the constraints around them and how they work. Yeah, legal issues; I imagine those need to be a significant factor. Yeah.
Yeah.
And there's stodginess.
Like, I'm also interested in that, but the stodginess I feel like I'm less bought into as a long-term explanation for economic effects. The legal restrictions sort of continue to be the case through time; the stodginess, well, I can set up a new law firm that's less stodgy and then challenge the previous law firm.
It seems so, I agree. I agree. I mean...
I don't think it's persistent. I just think it's interesting to see. One thing that would be interesting to see is whether that affects the mental model that they have today. Like, whether how they've been talked to about it, or how their trust in it, affects how they use it. It'd be interesting to know, to me. I don't know if it's a worthwhile study. It's more one of those things that I wonder about idly.
You take a lawyer who just got out of school and, you know, has spent a lot more time using ChatGPT, and you take a lawyer who's been in the business for 50 years and has a giant file folder full of Word docs that contain all the briefs that all their junior associates have written for decades and decades, and he just opens up those briefs, changes a few words in them, and sends them out to the judge, and he's known those judges for 30, 40 years. He knows exactly what they want. Is he getting any value? But is there value he should get? Is there some way that he would be helped?
By AI... I certainly know discovery. Discovery and AI in law is like a huge, huge problem.
And I know that there's Harvey. I don't know anything about the success they've had.
I know a lot of people working in that space specifically.
It's an ongoing thing, right? There's always technology for it, but the adoption of it is a very different thing.
That's the thing, right? Because I have a little bit of a legal background, and one of the first things that I thought of, the first time, when ChatGPT came out, I was like, oh, this could totally change discovery.
Because discovery is the most painful and most difficult and most expensive part. You could have serious social consequences by making discovery less expensive. That is the expensive part of a lawsuit. And so you could actually have significant impact on a society if you could make discovery cheaper and instantaneous and reliable.
Yeah, I have a question on your graph. Cursor. Mhm. I'm not sure. Mhm. Mhm.
You missed it. Keep on going. Like two more. Yeah. Oh, sorry. All right. It was a scatter plot, right?
It was what?
Cursor and 50 hours.
Sorry, what did you say? Yeah. Yep. Yep.
I'd say it's this one.
Yes. That one.
So you're saying that for the developers there was no difference? With Cursor. Are we talking about the idea of vibe coding,
>> [clears throat]
>> and they used it for 50 hours? Oh.
I was very intrigued by that, because everyone talks about vibe coding and how Cursor is instrumental.
Why did you... how did you get to 50 hours? I was curious.
So this is including time in the experiment: the time that developers have spent in the experiments, plus their past experience. So for some developers, working on some issues partway through the experiment, some of them have gotten to more than 50 hours of Cursor experience. And that's who's counted in that bucket at the end.
And was it the same task for each?
No, these are actual tasks that pop up on the GitHub repositories.
Which, as I mentioned, are kind of... I'm a little bit nervous about saying they're weird, because of what it implies. I want to say it's very interesting and it's very weird, and it's interesting for the same reasons it's weird. These are projects in which the developers have an enormous amount of mental context built up, that the AIs might not have, that they've worked on for many, many years. I'm not sure this is always the case, but, you know, I imagine that they basically know how to execute on the particular task they have before they even go about attempting it. Because they're such experts in the project.
What do you mean by the speedup? Is it like 5%? How do you quantify the speedup?
So you might think about it like this. Let's go to this one instead. So, on the left-hand side here, we have the averages for what the developers say will happen, in terms of their time to complete, if their issue or task gets assigned to the AI-disallowed or the AI-allowed group. They think that if AI is disallowed, it'll take them a bit more time, closer to two hours, and more like an hour and a half, or a little bit less, if AI is allowed. But then we randomize each particular task to allow AI or not allow AI. And it turns out that if we randomize to AI-allowed, then the times are more like a bit above two hours, rather than a bit below two hours. And then you can think of the change in time as sort of being one divided by the other here. It's not quite that, for reasons I can go into, but effectively... what is actually the transformation? It's something like AI-disallowed over AI-allowed, minus one.
So to draw that out: you might ask, what's the speedup? Is it like 1.1x? Strictly, saying these developers are going 1.1 times faster puts us on a time-to-complete scale rather than a speed scale, but ignoring that detail: is it 1.5x? Is it 0.5x? Are they actually going twice as slow? How would we get that information? Well, we'd do something like take the AI-disallowed times divided by the AI-allowed times. If the disallowed times were, let's say, 1.1 times as long as the allowed times, then we'd get to a 1.1x speedup. It's something like that that's going on. And in fact, we find a slowdown. Obviously.
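The arithmetic described here can be written out directly. The hours below are illustrative placeholders, not the study's actual estimates.

```python
# Speedup metric as described: AI-disallowed time over AI-allowed time,
# minus one. Positive means AI sped developers up; negative means a
# slowdown. The hours below are made-up illustrative values.
disallowed_hours = 2.0   # mean time to complete when AI is disallowed
allowed_hours = 2.4      # mean time to complete when AI is allowed

speedup = disallowed_hours / allowed_hours - 1
print(f"{speedup:+.1%}")  # -16.7%: developers took longer with AI allowed
```

Note the asymmetry this ratio creates: allowed times 1.1x longer than disallowed reads as roughly a -9% "speedup", which is why the speaker flags the time-to-complete versus speed scale distinction.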
I just read a fascinating article, I can't remember the company, but basically a journalist was allowed to do a pull request using vibe coding. Meaning, there was some feature, AI was used to assist with building out the requirements, and he, according to the article, practically just did a couple of tweaks and then signed off on it. And it was fairly fast, and it happened with the whole vibe coding thing.
Yeah. He didn't code; like, that was the whole thing. He didn't have any software development background. That was the whole thing. I was just curious if you've tried to do a study on that.
So I definitely do want to share this out. But, you know, if you've got like no idea what's going on, then probably there's going to be some significant speedup. You know,
I will say, I guess, number one, it's not a priori obvious. In fact, we went out and did this hackathon with very experienced people and much less experienced people and tried to see what happened. And [clears throat] what we found is that the judge scores were extremely noisy, and I think you shouldn't believe them. But the judge scores were not that much higher when AI was allowed versus when it was not; the people weren't actually making that much more progress. And then another thing to say is, I think there's going to be more expertise in this room than I have, but my understanding, from sitting with these open source developers for a while, and not being a very capable developer myself, is that the quality bar on the repositories in the study is just very high, typically.
And so I would be very surprised if a journalist, or even frankly a good software engineer without lots of experience on the repository, but certainly someone who wasn't a software engineer, was able to get up a clean PR on these repositories first time. In fact, I think that's a lot of the story for what's going on here: the AIs actually kind of do make progress in the right direction some good fraction of the time. But for various reasons, sometimes reasons of correctness, but sometimes reasons like how they've tried to solve the problem, whether that's the typical way of solving the problem, or how various parts of the project speak to one another, they haven't properly accounted for these kinds of considerations. And so the humans not only need to spend expensive time verifying, but also need to clean up all the stuff. And my sense is that someone who didn't have all that experience basically wouldn't know how to do that step, and so wouldn't be able to submit a clean PR to these repositories. That's it. Like, relative to these people at least, I suck at software development. [laughter]
And I'm getting up PRs internally all the time. I think they're worse quality, and they're getting better over time. You know, I do believe that people are coding where they wouldn't otherwise be able to code; they are submitting PRs at a lower quality standard where they wouldn't be able to do that at all before. But getting up these expert-level PRs, I do feel kind of skeptical.
And that's actually part of what I was getting at: PRs from more novice folks often get rejected on these bigger, high-quality projects for no reason other than the developer-ergonomics impact of the PR, right? The fact that it makes it harder for me to maintain in the future. Because for an open source project, almost all the incentive is biased towards making it easier for me to maintain the project. So every time a PR comes in, if it doesn't make it easier for me to maintain the project, I have a tendency to reject it. Yeah. If it does make it easier to maintain the project, then yay, I'm into it. That is unlike what you have in a typical business context, right? Where the most important thing actually is to get something done. Because, you know, the fact that someone's going to spend a lot of time maintaining it is almost job security, right? But for open source, it's the opposite: what actually causes people to leave projects is when they become difficult to maintain. So it is a different bias on what you accept for pull requests.
Can you remind me of the name of the English gentleman who maintains the Haskell compiler?
Simon something? Yeah, I know. I can't remember the name at all. Okay. So here's one story that might be relevant. There are a bunch of repositories in the study that all have, broadly, these characteristics. One of them is the Haskell compiler. Famously, on the Haskell compiler, there's some chance, I don't know if it's 50% or 30% or whatever, but there's some chance that if you submit a PR, the... I'm being recorded... Simon, Simon Marlow maybe? I'm not sure. The creator of the Haskell compiler will come into the comments and argue with you for many, many hours, much longer than you spent working on the pull request, until the PR hits exactly his specifications.
Combine that fact with the remarkable fact, I think, that for the median PR in the study, the time developers spend working on the code post-review is zero minutes. That is, the median PR is perfect the first time around, because the professional incentives of these developers are like that. Now, there's a very long tail, and on one of them I think literally Simon, this gentleman, pops up and argues in the comments for many hours, and that one's a lot longer.
>> [laughter]
>> But yeah, they are maintaining this extremely high bar.
I'm interested in the other upcoming
stuff that you have in your doc.
Yeah, there is.
Um
>> [snorts]
>> So, yeah. So one thing... what to say? I guess let's go in order.
As I think you mentioned, if capabilities as measured by time horizon keep doubling, it does seem very, very challenging to keep up with that. In the short term, we have a number of directions for getting on top of it, and I think that will last through the year. But through two years, that seems challenging, though I think still possible. Through three years, I think still possible, but it starts to get harder and harder.
Anyway: in the short term, building these much longer tasks, and also finding ways we might get around the problem entirely. For instance, here's one thing that might be somewhat...
You could also raise the accuracy bar.
You could raise the accuracy bar. Although, you know, the reason we're interested in this in the first place is that we want to know: is GPT-5 existentially dangerous? Okay, and the answer is no, I think.
Yeah. But why do we think the answer's no? I think there are multiple reasons, but at least we can say, you know, GPT-5 is just not that good at stuff. Like, you're trying to get it to do data science on these columns that have very similar names, where it's not clear exactly the logic that led to the columns, and it's not doing that sort of thing. And I'm like, wouldn't you need to do that sort of thing, not exactly that, but be capable of stuff like that, to be existentially dangerous? And it's not capable of stuff. So I see that the AIs are failing at these hard tasks, and I'm like, great, whatever. But if [laughter] it's succeeding 90% of the time, and not 99% of the time, on very challenging long tasks, then I think there are other things we can retreat to, and I'm prepared to retreat to those things. But at least the naive story, that it's not existentially dangerous because it's not capable of stuff, starts to get pushed on as an explanation. So here's another possibility for something we could do.
Whoop. So, what if instead of time horizon here, we had time horizon whilst being watched very closely? And then, okay, maybe the time horizon whilst not being watched very closely is like this, but the time horizon whilst being watched very closely is like this.
I hope that major AI companies have monitoring setups going where, for example, AIs are looking over transcripts of AIs performing work, to see if anything suspicious is going on. And it won't be impossible, perhaps, I conjecture, to fool those monitors, but it would be challenging to fool them. So maybe you can do 40-hour tasks, 40-hour dangerous-capability-relevant tasks, without any monitoring, but only 2-hour-long tasks if you're
being watched very very closely. How do
you mean, watched? Like, if there's another model evaluating the output?
>> That's the kind of thing I'm talking about. And maybe you escalate to humans.
With and without safety controls?
Yeah, yeah, yeah. Roughly, yeah. And then, you know, now we've bought a load more doublings, and we can [laughter] sort of keep doing the capability-extrapolation thing. That's one of the kinds of things I'm thinking about, in addition to creating ever longer tasks.
Yeah, I mean, I imagine some of the model shops do have evaluations of capability with and without safety, because I'm sure there's an argument between their researchers and their safety teams.
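The "doublings bought" framing is just logarithm arithmetic. The 40-hour and 2-hour figures are the illustrative numbers from the discussion, not measured values.

```python
import math

# If close monitoring cuts the effective dangerous-capability horizon
# from 40 hours to 2 hours (illustrative numbers from the discussion),
# the gap corresponds to log2(40 / 2) trend doublings of headroom before
# the *monitored* horizon reaches today's unmonitored one.
unmonitored_hours = 40.0
monitored_hours = 2.0

doublings_bought = math.log2(unmonitored_hours / monitored_hours)
print(round(doublings_bought, 2))  # ~4.32 doublings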
Yeah, yeah. I feel like I have seen something about this, but not a lot.
Yeah. I guess I think this might be an especially quantitatively important consideration. I expect that it would reduce the effective time horizon by maybe an order of magnitude or two.
Yeah, I agree that there are some important senses in which it's not really a difference in kind. But of course, then I would also worry that publishing that encourages people to focus less on safety, or to try to argue against safety because somehow it impacts capability.
Yeah, I think there are lots of landmines in all sorts of safety work, not just in AI.
Um
Okay, next thing. We have this trend; I spoke about this at the beginning. Is it going to continue forever? Is this a fact of the universe, or does it somehow depend on inputs, or on what you think about intelligence explosions, or something like that? Trying to think about where this line is actually going is a pretty active area of work. Also, the ways in which this line, or the particular points, don't correspond to the thing I care about. So one obvious way is that these models are being judged according to... [laughter] I think the algorithmic scoring that we use on METR tasks is importantly more robust, or covers more of the relevant concerns, than might be the case with just SWE-bench and unit tests, but it still has a lot of the same character.
There are considerations, like being able to build on this work in future, outside of the immediate problem facing you, that aren't being captured by METR scoring. And maybe if you did capture that, you'd get something a little bit like going from 50% success to 80% success: you can do hour-long tasks if it doesn't matter whether you can build on the work, but only 30-minute tasks if it does matter. So, bringing these numbers closer to something I care about a little bit more; and then projecting out both compute slowdowns, and, if we are going to enter some regime where AIs are building AIs, the steepening of the curve that leads to. These kinds of considerations.
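The 50%-versus-80% distinction has a concrete reading if success probability is fit with a logistic-style curve against log task length, one common way to define a time horizon. The model form and coefficients below are assumptions purely for illustration, not fitted values.

```python
import math

# Time horizon at success threshold p, assuming success probability
# follows a logistic model in log2(task minutes):
#   logit(p_success) = a + b * log2(minutes), with b < 0
# so longer tasks are less likely to succeed. a and b are invented.
a, b = 4.0, -0.8

def horizon_minutes(p: float) -> float:
    """Task length at which the predicted success probability equals p."""
    logit = math.log(p / (1 - p))
    return 2 ** ((logit - a) / b)

h50 = horizon_minutes(0.5)  # ~32 minutes for these coefficients
h80 = horizon_minutes(0.8)  # shorter: a stricter bar shrinks the horizon
print(h50, h80)
```

Raising the threshold from 50% to 80% always shortens the horizon when `b < 0`, which matches the "hour-long tasks at one bar, 30-minute tasks at a stricter one" intuition in the discussion.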
That's another thing I'm thinking about.
Um
Da da da da da da da.
Oh, and then: capabilities measurement from new angles. So here's one history of METR that I think is not the accepted history, and also probably not a very accurate history, certainly not the most accurate history, but here's one possible telling.
You know, near the beginning, METR had early access to GPT-4. I wasn't there and I have sort of no internal knowledge of this. At the time there were just Q&A datasets going on everywhere, eval sets or something. And you're like: GPT-4 seems so smart relative to stuff that went before. Can it do stuff? So you try it out on some tasks. Can it do stuff? And the answer is, it can do some stuff and can't do other stuff.
And people are like, "Oh, that's cool," and you try this neat new kind of thing, getting models to do stuff instead of answering questions. And then later you're like, well, different models come out over time: this model comes out in January, this model comes out in February. Can they do different kinds of stuff? If we test them on the same stuff, then we'll try and think of the most obvious, in some ways, summary statistic of whether they can do stuff, this single data point or number that reflects whether they can do stuff, that's time horizon, plotted over time, and see what happens. You're like, "Oh, that's kind of interesting."
And then you're like, well, what's the next, in some sense dumbest or most obvious thing you can do? Well, we'll run the most obvious RCT design, allow AI or don't allow AI, and we'll see what happens. And it'll be messy; there are a lot of methodological problems that people point out, as there are with this work, but they're different kinds of problems. They have different pros and different cons, and maybe with these two different things giving two different answers, with two different sets of pros and cons, we can kind of triangulate the truth.
And now I'm like, well, can we pull that rabbit out of the hat one more time? Or multiple more times? Are there other sources of evidence that have different pros and cons, that I won't believe in fully, but that might give different answers, and so on and so forth? Here are two suggestions of things I'm curious about at the moment.
The first is in-the-wild transcripts. So agents in Cursor, in Claude Code, and in whatever other products or services [clears throat] leave behind traces: [snorts] traces of the diffs they've contributed to codebases, or logs of their actions and their reasoning chains, and so on and so forth. The traces they leave in the wild are importantly different from this, where it's more contained and the task is neatly packaged and stuff. This is going to be, like the example with the many confusing columns, whatever real crap shows up in the wild. How do models handle that? There are important reasons why you shouldn't believe that kind of information: it's not very experimental, and it's hard to know exactly what to make of it. But it does have these important pros: it's more real, and the data on transcripts is perhaps enormous. Perhaps there's a lot you can learn there. That's one thing.
And then here's another one. There's this group, which you guys should check out, called Agent Village. AI Village, sorry. They have a lot of different models, or agents, kind of living in this village, occasionally talking to humans, trying to accomplish fuzzy goals that are set for them, basically using computer use. They try to do stuff like, you know, organize this event at the park, or run a human-subjects experiment, or run this merch store; stuff like that, that's not so clearly specified.
And basically all the time they find that the models fall on their faces and suck. Now, there are lots of reasons not to believe this evidence. Here are some of them. Number one, it is using computer use, and I think computer-use capabilities are just considerably worse than CLI-based stuff at the moment, or text-based things in general. And maybe we care more about text-based things, because that's more relevant to various other things you care about, and also lots of GUI-based things can be converted into text things. And there are all these different models hanging around in the village. I'm like, why are there so many models? Why is there a village, instead of just some big agent-orchestration setup? I don't really understand what's going on there.
Anyway, lots of reasons not to believe it. But on the other hand, it is models doing stuff in the world. It's not benchmark-style tasks; it's trying to accomplish some goal, and they can't accomplish even very basic subsets of the goal. And I feel like that's extremely interesting. I wonder if you could get rid of some of the most obvious cons: make this only text-based, give them some relevant text-based tools, work a bunch on the elicitation to make these models more performant, get rid of the less performant models in the village, and so on and so forth, but then try to get them to do these fuzzy goals, and just observe: where do they mess up?
Like, they went about step one and it went great, but then they became incoherent, or they went into a strange psychological basin with one of the other models, or they weren't able to interact with external services in an appropriate way, or to figure out their resource use. I'd be very interested, just kind of qualitatively, in what goes on when you do that. Again, keeping in mind that, at least at the moment, I'm most interested in the ability of AIs to automate R&D, and in speaking to why that's not the case at the moment and why that might not be the case in the near future. Something shaped like this seems like it might curiously point to why that's not the case. Not sure exactly what's there, but yeah.
And my observation is that they are effectively neurodivergent individuals, right? And our world was not built for that.
>> Yeah. And everything that we have is designed for a human to do; things are shaped and sized to humans. Just like, you know, the military: how big are packs? Well, it's based on how much they think a person can reasonably carry, right? And how much we expect someone to handle for their taxes, that's based on what we think a human can do well.
And if you think about neurodivergent individuals, they struggle with the ways the world's expectations don't align with them. And compared to a neurodivergent individual, these intelligences are really, really different, right? And so, with all of their rough edges where they don't align with our world, that's why they need human assistance in order to accomplish anything real in our world. It's just too hard for them currently.
Currently? Yeah, yeah, yeah.
>> [laughter]
>> Someday they'll change. Okay.
For now they're just hopeless. Yeah.
They have to get really, really good, or our world will have to change. One of those two things. You know, I agree, and I feel that pretty strongly in a sense. But if you ask me to really pin down why that's the case, when they're beating GPQA experts on these extremely hard science questions, and so on, why are they not able to accomplish things in the world?
Have you ever met a neurodivergent individual who wasn't terribly good at something,
>> [laughter]
>> yet completely useless at getting through life?
>> Yeah, yeah. They're all very good at reading books.
>> [laughter]
>> There are a lot of those people in the world. It's not that surprising.
My only feeling about AI abilities is like, well, today is the 200th day my car didn't rocket off the Earth at escape velocity and fly to the moon.
Like
that's because you didn't build a rocket
yet.
Yeah, I mean, maybe I'm
mischaracterizing, but I thought there
was a lot of talk a year ago about
computer use capabilities being
impressive today.
There was. There was a lot of talk about
it, and yet I have talked to almost
nobody who has used them for anything
practical.
>> [laughter]
>> Um, yeah, but if we move this to text
only, and it seems reasonable to keep it
text only, would you still have the
rocket concern? No, I wouldn't have it.
I wouldn't really.
Well, it depends on what the task was.
Sure. Yeah.
Meaning, yeah, the kind of thing that a
human could do over CLI only.
So I think this relates to the talk from
earlier, where they talked about how one
way to use these models effectively is,
if you have a task, to figure out a way
to present the task, or transform the
task into something that is
in-distribution for the model. And I
feel like this conversation ties in with
that. Like, interacting with Chrome is
less in-distribution than a CLI. So I
think that could be an interesting area
of research: if you're interested in
exploring how well they can really do
Chrome tasks, first creating harnesses
and creating an interface that is much
more in-distribution for them, so that
that's less of a concern.
Yeah, I mean, I think it also speaks to
the points about, quote unquote,
near-to-benchmark models. You know, it's
not so different from management giving
appropriately scoped tasks to your very
talented interns, or very talented
neurodivergent interns, something like
that. I do think that's right. Sorry to
be so pessimistic, but from the
perspective of capability explosions and
automating R&D, I think maybe the models
will get extremely good at scoping tasks
for themselves so that they're
benchmark-style, or something like that.
But if they can't do that, I'm like,
well, there are a lot of things that
don't look like benchmarks that crop up
in the real world, and you do need to be
able to flexibly work with that if
you're to do something as complicated as
automating a major AI company.
um
And so I do think it can both be the
case that the AIs are incredibly
performant on some particular type of
problem, or when you make other types of
problems similar in scope or shape to
the type of problem they're best at, and
also that they can't flexibly substitute
for human workers, because that requires
setting up the problem yourself in a way
that's appropriate, or not having those
constraints in the first place.
Yeah, it is interesting, though, just
your point about new capabilities.
Thinking about this like another axis on
the graph that you have.
Because I wonder if there's not just a
time horizon issue, but also a task
category, or type-of-work category.
Like your example of computer use is one
of those, right? If we think about the
capability of computer use, versus a
capability that would require computer
use, versus a capability that can be
accomplished entirely in text.
Yeah, sure.
Well, but almost all these benchmarks
are basically text.
Um, yes, yes. And indeed, the ones that
aren't, the ones that require vision
capabilities, are notably lacking in
benchmarks. Yeah, I'm not sure exactly
what to make of this graph. One thing I
make of it is that there's probably
maybe not so much variation in slope, or
doubling time, across distributions. I
think it's only weak evidence for that.
But in intercepts, you know, the base of
where we are now, there's possibly a
great deal of variety, especially on
image-like capabilities versus not, to
say nothing of physical abilities, which
are even further behind.
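The slope-versus-intercept point can be made concrete with a toy extrapolation (the numbers here are illustrative placeholders, not METR's actual measurements): two task distributions sharing the same doubling time but with bases two orders of magnitude apart hit a given time-horizon milestone years apart.

```python
import math

def years_to_reach(target_minutes, base_minutes, doubling_months=7):
    """Years until a time horizon with a fixed doubling time grows
    from base_minutes to target_minutes."""
    doublings = math.log2(target_minutes / base_minutes)
    return doublings * doubling_months / 12

# One "work-month" milestone: ~167 working hours, in minutes.
one_month = 167 * 60

# Same slope (assumed ~7-month doubling time), different intercepts:
text_tasks = years_to_reach(one_month, base_minutes=120)   # text/CLI base: 2 hours
image_tasks = years_to_reach(one_month, base_minutes=1.2)  # image-heavy base: ~1 minute

print(f"text: {text_tasks:.1f} yr, image: {image_tasks:.1f} yr")
```

Under this model, a constant 100x ratio between intercepts translates into a fixed delay of log2(100) doublings, roughly 3.9 years at a 7-month doubling time, regardless of where the milestone is set.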
Yeah, right. Exactly. So I mean, you
could even go through sensors, like, you
could go through something tactile.
Like, today they would all score zero.
Nothing has tactile sensing. So it can't
tell you anything about anything
tactile.
Um, well, you know, in producing this
graph, we're trying to make the models
as performant as possible on some
held-out set. So we, you know, try and
give them some tactile stuff.
>> [laughter]
>> I'm not sure they'd perform at zero.
Sure, sure, sure. But spatial, we do
have some examples of that. Yeah, yeah.
And spatial judgments.
Yeah.
Um
You know, we've obviously seen fine
motor control and stuff like that in
robotics.
It's just, I don't even know if anybody,
maybe somebody has, listed out all of
the capabilities that we would expect in
the future. Like, if we actually wanted
AGI, what is the entire list of key
capabilities?
That's a way to start a debate that
doesn't end.
I think [clears throat and laughter]
Hazel Hopper and Arjun Ramani hopefully
have a paper on this, on some small
number of problems. Yeah.
And then maybe if we think about where
we're at: all the capabilities that we
currently measure, do they follow the
same log trend? Yeah,
it does seem like a reasonable null
hypothesis to me as well, I think. Not
a certainty. I mean, who knows?
Yeah, yeah.
Um
>> [snorts]
>> Um, oh, there was something I wanted
to add there. Yeah, so here's another
thing I'm thinking about, not super in a
research capacity, although kind of.
So, you know, some people like me are
sort of skeptical of a software-only
singularity, that is, the idea that you
could automate AI research without also
automating chip design, and maybe chip
production as well, because you'd
quickly get bottlenecked by compute.
For fixed hardware, there are only so
many experiments you can run that will
be sufficiently productive to fuel
progress upwards.
But you know, even for people like me
who are skeptical of that, you might
think that in fact chip production is
going to get automated. You know, the
robots, like
>> [laughter]
>> they're coming. They can do the stuff
that humans do, and then maybe you
really do have a fully self-sustaining
robots-plus-AI economy. And so you have
some slow trend from compute slowing
down, but then a sort of bounce back
once the whole thing is in a tight loop.
One interesting debate that I heard
about recently and would like to think
more about is, you know, in the public
discussion there's some sense of, why
are robotics capabilities lagging
LLM-like capabilities so much? Well,
it's to do with training data, or
something like that, or maybe it's to do
with hardware constraints.
I'm curious if it's not to do with
hardware constraints. All right, what
exactly are these hardware constraints?
If we put a superintelligence inside of,
you know, hardware parts that exist
today, could it build chip production
facilities? I have no idea, because I'm
beyond novice here, but it's not obvious
to me what the answer is. I think it's
kind of plausible. I'm not sure you need
very flexible fine motor control in
order to do it. Also, I think maybe the
fine motor control is there, subject to
having superintelligence controlling it.
>> To be fair, the key aspects of chip
production are done by robots.
Oh, but I'm also thinking, like,
building the robots, and, yeah, the
whole thing, you know. And, as far as I
know,
I have a friend who spent most of his
career doing software development, but
during COVID started working on
manufacturing things like PPE to help
people, and he found out how hard the
manufacturing world is and how slow the
iteration process is. Like he put it, he
knew it was going to be worse; he didn't
understand that it was next-level, like
an order of magnitude, worse. I think
that, from our perspective, people that
don't do it, it seems like, oh, how bad
can it be, right? The feedback I've had
from everybody who actually works in
that space is it's way, way different.
>> That's what I've heard as well. I've
only talked a little bit with people who
work in fabs and stuff, but I was
surprised when I did talk to them at the
level of human expertise required. Yeah.
In order to work at the fabs, a lot of
those jobs are fairly high-paying,
actually. Oh yeah, very high-paying
jobs. And also the rate of improvement
is actually glacial, right? Compared to
software, right?
>> I think also because it costs
billions of dollars to build a fab.
>> And so each generation is a huge cost
to fund. It's brutal.
Right.
So I think that's why it's been hard to
get it all the way there. It's just,
like, give them a couple more centuries.
Maybe they can get it done.
>> [laughter]
>> Is that really your view? Centuries?
Centuries? I do think I'm skeptical,
like you, about how easy some of these
tasks are. Yeah.
We think they're easy, but in my
experience, I remember when the
self-driving thing came out and people
were pushing it out. I actually worked
in that space for a while, and it was
like: I get that we can get really
close, but getting all the way to
something that is acceptable is
extremely difficult, right? And we
underestimate how much work is involved
in getting that last little bit done.
The first time I ever saw it, I knew we
could do it with computers, like, 10
years ago, pretty much, but getting to
the last bit, where everyone's happy
with it,
Yeah, needs a lot of work. I feel this
myself, you know. I didn't get a
driver's license, because I expected
self-driving cars to come.
Yeah, I think it's taken a while, but it
hasn't been that long, you know? And
they're expanding to the entire Bay
Area. I don't think it's going to take
that long. Is the robot economy building
the chip production going to take
centuries?
I don't know about that. Yeah, I could
see that it might. So part of the trick
with self-driving is the economic
incentive is moving it along faster,
right? And probably the robots-building-
robots kind of thing would also, but
like
>> Yeah, you know, where we're at right
now is, RepRap is kind of as far along
as we've got with robots building
robots, right? Which is... Oh, but I
feel like, you know, is that paying
sufficient attention to the charts?
GPT-2, 2019.
>> [laughter]
>> It's so recent. You know, maybe this
is nonsensical, but I'm like, maybe
we're in a sort of GPT-2 moment. Yeah,
no, it's a fair point. I could be wrong.
It's just my guess is it's going to take
a lot longer than we think.
I think, at least to be able to do real
mass production. Yeah. At a scale that
causes the kind of global impact you're
talking about.
>> Yeah. I think they can already do a
great job building one-offs, right?
Robots are very good at doing one-off
builds. Yep.
At small scale. But it's totally
impractical for doing it on a large
scale.
There is
one fact I think is kind of remarkable.
Maybe it's this one. Is it this? Yeah,
yeah.
The rate of growth of compute put into
robotics models is about the same, but
the levels are two orders of magnitude
apart.
I am kind of curious, if that gap
closed, what we'd see.
It does seem like at least somewhat more
capable robots are, in some sense, very
much on the table as something that
could happen very soon, if this
>> [laughter]
>> No, I'm not saying all the way. I'm
certainly not saying chip production. It
just does seem like there's some sort of
data hang-up.
Yeah, yeah. Just something.
It's interesting.
Also thinking: you don't just need to
scale data, you can also scale
parameters and use the same amount of
data, you know. That's a way to use
compute to close some gap.
Interesting.
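As a back-of-the-envelope check on that two-orders-of-magnitude compute gap (the growth factors below are made-up placeholders, not measured figures): if robotics training compute grows at the same rate as LLM training compute, the gap in levels stays constant forever; it only closes if robotics compute grows faster, and then it takes log(gap)/log(growth ratio) years.

```python
import math

def years_to_close_gap(gap, llm_growth=4.0, robotics_growth=10.0):
    """Years for robotics training compute to catch up to LLM training
    compute, given a multiplicative head start `gap` and assumed annual
    growth factors. Same growth rate means the gap never closes."""
    ratio = robotics_growth / llm_growth
    if ratio <= 1:
        return math.inf
    return math.log(gap) / math.log(ratio)

print(years_to_close_gap(100))            # faster robotics growth: gap closes
print(years_to_close_gap(100, 4.0, 4.0))  # same slope: constant gap, never closes
```

The point of the sketch is just that equal slopes on a log plot preserve the ratio; whether robotics compute actually grows faster is an empirical question.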
Yeah. So one of you just gave me a very
interesting overview of where AI is
going in fabrication, and where it's
not.
And what does it say?
So it says there are a lot of areas
where, right now, it's going to help,
probably pretty dramatically, in the
near future, and a lot of it is the
computational aspects. There are a lot
of computational aspects that are
extremely expensive when you're
designing, like, a mask, the pattern
that you're using with the laser to
print the transistors. Mhm. And
calculating that, how to build it, and
ensuring that it conforms to the spec
that you've written, is extremely
computationally expensive. And there's a
lot of opportunity for AI to help there.
And there's also, theoretically, the
possibility: chip manufacturing is
obviously extremely precise, but also
fragile. And the opportunity for an AI
to detect parameters that are out of
whack and leading to potential failure
in, say, imaging a wafer could
theoretically dramatically improve
yield, and yield is a big problem in
chip manufacturing. Like, the reason
that you get different speeds out of
your CPUs is because there's actually
just the one line that produces all
those CPUs, and some of the chips come
out worse. And that's why the
higher-gigahertz models are more
expensive than the lower-gigahertz ones.
Like, if you look at your Nvidia home
GPUs, your 5050 or 5060, 5070, 5080,
5090, they're all the same chip. Right.
They just have different quality,
different levels of fault tolerance,
essentially. Yeah.
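The binning story can be sketched as a toy simulation (a made-up Gaussian process-variation model, not real fab data): every die comes off the same line, gets graded by the fastest clock it sustains, and the scarcer top bins are what sell at a premium.

```python
import random

def bin_dies(n_dies, seed=0):
    """Toy model of speed binning: each die's max stable clock is a
    nominal clock minus a random process-variation penalty; dies are
    then graded into speed bins, with the worst scrapped (yield loss)."""
    rng = random.Random(seed)
    bins = {"3.0 GHz": 0, "2.6 GHz": 0, "2.2 GHz": 0, "scrap": 0}
    for _ in range(n_dies):
        max_clock = 3.2 - abs(rng.gauss(0, 0.4))  # GHz after variation
        if max_clock >= 3.0:
            bins["3.0 GHz"] += 1
        elif max_clock >= 2.6:
            bins["2.6 GHz"] += 1
        elif max_clock >= 2.2:
            bins["2.2 GHz"] += 1
        else:
            bins["scrap"] += 1
    return bins

print(bin_dies(10_000))
```

In this framing, the price tiers are just slices of one quality distribution from one production line, which is why catching out-of-whack parameters early, and shifting that distribution upward, is worth so much to yield.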
Um, but the problem is that they're...
Cut the recording. They're going to kick
us out soon, but feel free to continue
the discussion. Yeah, cool. You can also
hang out. Yeah, sure.
>> [music]