How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR
Channel: aiDotEngineer
Published at: 2026-01-19
YouTube video id: k1t2xyWMUdY
Source: https://www.youtube.com/watch?v=k1t2xyWMUdY
[music] >> Here's the very simple argument. If you look at some notion of compute over time (this could be R&D spending on compute, experimental compute, training compute, whatever some particular lab is using), it looks like a straight line on a log plot, no surprise. And if you have another chart of log time horizon over time, say the measure from the figure that many of you will have seen on Twitter, it looks the same. Now suppose this was not merely a coincidence, but that these things were causally proportional, in the sense that if compute growth were to halve, then time horizon growth would halve. For the sake of argument, say that starting from 2028 or so the compute curve begins to bend, somewhere between no growth and the original growth rate, something like half. Then, if they really were causally related, and in particular causally proportional to one another, you'd expect the time horizon curve to bend the same way. And then for some milestone that you care about, say a one-month time horizon up there, the implied delay in AI capabilities is potentially enormous.

Now, why might compute growth slow? Lots of people have circulated the idea that it might. I'm not an expert in those forecasts, but the prior reasons do seem somewhat strong to me. One is physical constraints we might hit: power constraints, as mentioned, or the various others that Epoch have reported on, all of which seem not to bite through 2030 but potentially could bite sometime after 2030. The more likely one, I think, is simply dollars as a constraint. Large tech companies can only spend so much; at a certain point, even large nation states can only spend so much. I guess there are some scenarios in which you can continue going, but this seems to naturally imply a slowing down. And the additional point this paper is trying to make is that under a very contestable but standard assumption from economics, you should in fact expect these two to be causally proportional; in particular, to the extent that, or for the period that, a software-only singularity is not possible. That's another discussion, and we can talk about it. But at least in this somewhat business-as-usual scenario, or until that scenario no longer applies, I think this is maybe a reasonable model, and it does imply some slowing of AI capabilities in the near future. I have no plan for this session whatsoever.

>> That also assumes that we don't have a technological advance that dramatically improves capabilities relative to compute. An unpredictable technological advance, right?

>> Yeah. I mean, all predictions assume no unpredictables. [laughter] But time horizon, and in general in AI, straight lines on log-linear plots, have been, I think, a highly underrated forecasting tool. They've done extremely well over, by now, many orders of magnitude.
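As a rough illustration of the delay argument above, here is a minimal sketch. All numbers are illustrative assumptions rather than METR estimates: a seven-month time-horizon doubling time, a 15-minute current horizon, and a one-month milestone of roughly 167 work hours.

```python
# Toy sketch of the causal-proportionality argument above.
# All numbers are illustrative assumptions, not METR estimates.
import math

doubling_months = 7.0         # assumed current time-horizon doubling time
current_horizon_hours = 0.25  # assumed current 50% time horizon (~15 min)
milestone_hours = 167.0       # ~one month of work

def months_to_milestone(growth_fraction: float) -> float:
    """If time-horizon growth is causally proportional to compute growth,
    cutting compute growth to `growth_fraction` of its current rate
    stretches the doubling time by 1 / growth_fraction."""
    effective_doubling = doubling_months / growth_fraction
    doublings_needed = math.log2(milestone_hours / current_horizon_hours)
    return doublings_needed * effective_doubling

baseline = months_to_milestone(1.0)  # compute growth continues as-is
slowed = months_to_milestone(0.5)    # compute growth halves
print(f"baseline: {baseline:.0f} months to milestone; "
      f"halved growth: {slowed:.0f} months; delay: {slowed - baseline:.0f}")
```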
I think it's reasonable to have the default expectation that the log-linear lines continue through approximately the same number of orders of magnitude, except maybe if there's some significant break in the inputs. Of course, on the upside there could be something quite dramatic. A software-only singularity is the first thing that comes to my mind, but another transformer-style moment seems like another candidate, actually.

>> Of course, one of the problems with testing this would be that most of the tasks you have will eventually eclipse the maximum possible amount of time that tasks in the evaluation set can take.

>> Yeah, there are some ways around this that we're working on; I'd be excited to talk about them, but they all feel pretty early. I think it's right that if time horizons keep doubling, eventually the doubling time is such that you can't possibly make long enough tasks in the relevant window.

>> It's also that we eventually hit a place where time horizon is no longer a useful measure, because what you actually want is total time to decrease: the same results in less time. Higher reliability at a lower time horizon.

>> One thing to say about time horizon is that there are two notions of time here: a human-time axis and a calendar-time axis. The time that the model spends working, I think you can approximate to zero. It's not actually zero, they are taking actions, but they largely do their successful work pretty early on, to the extent that they're going to be successful at all. So my guess is that there will continue to be not much extra value on the margin of making models complete tasks more quickly, although on reliability, very much so, obviously.

>> So most of the time is spent in the human-machine iteration loop: the humans working without AIs, and the AIs working without humans.

>> Yeah, exactly. Cool. Any questions you want to ask? I can also go through some upcoming things we're excited about, if people are excited about those things.

>> I did have one, about that perceived-time, time-perception kind of thing. One thing you brought up a little bit in the paper is whether familiarity is a confounding factor, tool familiarity in particular; and of course you also noted that tool capability has dramatically changed. There was an interesting presentation from Meta at the developer community engineering summit this year. They probably have the best infrastructure in the world, of any company, for quantitative measurement of developer experience. They can tell you basically how long it actually takes to make a PR (they call them diffs at Meta): how much actual human time and effort it took.
And what they saw when they gave people agents was a J curve. I don't know how long it lasted, maybe 3 months, maybe 6. So one thing I wonder is whether there's a cutoff in how much familiarity a person has (have they been using this as their full-time daily driver for a period of months?), and whether something interesting happens once they pass a certain level of familiarity.

>> Yeah, I'm totally on board with J-curve explanations being a real thing, not just in this case but in many economically relevant cases outside of software engineering. Developers, and not just developers, experiment with tools. You tend to be slower the first time you're experimenting with a tool, but you do it anyway because there are investment benefits: later on you might be more proficient with the tool, or, in the case of AI, maybe you just expect the models will get better, so even if you don't become more proficient it's the kind of thing you want to do. Those explanations broadly make sense to me. But I can give you some reasons why I'm still interested in this. [laughter] As background, we're continuing with this work, so we'll see. And quantitatively, the difference between the perceived speedup and the measured slowdown is very large. So I ask: how much is the J curve explaining? I think it's not explaining that much.

>> Let me explain why, because we see this over and over in software engineering studies: the one question you can't ask people on a survey is how long a task took. You can ask people how much more productive they felt, and they'll give you a response that correlates with quantitative data. But ask anybody the amount of time that something takes and they are almost always wrong. So when I shared this with my colleagues, I said: I'm not surprised about the misperception at all. What's interesting is the size of the slowdown. That was what was interesting.

>> Yeah, point well taken; that makes a lot of sense. Despite this, we were interested in time estimates because we're interested in providing...

>> I mean, the perceptual aspect is also the hype aspect, right? Developers will tell you that they were faster when they weren't, and I think that is worth knowing.

>> And to the extent that we're interested in measuring the possibility, timing, and nature of capability explosions, or of R&D being automated, one commonly proposed measure is just to ask developers or researchers how much they've been sped up. For exactly the reasons you're pointing out, I don't put a lot of faith in those estimates. So it's nice to see it measured like this. Some more J-curve things: the forecasters, who are not predicting time to complete, are just predicting this effect size.
These non-developer, expert forecasters are told the degree of experience these developers have. In thinking about how this population might differ from others, some of the forecasters point out various facts about the study: these developers are more experienced, and I expect experienced people to get less speedup; the repositories are larger, and I think AI is less capable of working on large repositories, so I expect less speedup. They never mention familiarity with tools. My sense is that they shared the sense I had at the time, which was that most of the action is in understanding what kinds of things AIs are good or bad at in the first place. And all of these developers have experience with LLMs in their core development workflow; it's just Cursor that three quarters of them are totally unfamiliar with at the start of the study. So I wasn't seeing much margin there. I do think it's an open question. But also, we watched so many hours of screen recordings of these developers working, and I think they're working very reasonably: in some cases worse than me and my colleagues, in some cases better. I'm not seeing these advanced workflows that they're failing to access.

>> And my experience is not that far off from this: there are times when I am dramatically slowed down and times when I am accelerated. Although as my familiarity with the tool increases, I definitely don't see a speedup. I improve a lot because I learn over time what I can tell it to do and what I can't, in addition to the tools just getting better. Like understanding: okay, now I need to plan before making a high-level architectural decision that, ten conversation turns down, is going to blow up in your face. You really try to think about it.

>> Yeah, exactly.

>> And also scope things down to a smaller problem. At first I would try problems that were too large, and it can't handle that. But just for the future, if you ever do this again (and I think it's obviously really hard with a 16-person sample size), trying to figure out whether there's a cutoff of familiarity where the number changes would be interesting, to see if that Meta result generalizes outside of Meta.

>> We are on it. The AIs have also been getting better during this period, which is going to compound a lot of what's going on, obviously. But yeah, it'll be interesting.

>> The thing is, these projects are themselves very optimized for people coming onto new projects and figuring out how to navigate them. In the open source ecosystem, the projects that aren't organized well for humans to come on board, build, and navigate quickly don't survive very long. And these are fairly mature open source projects. They're a little different from an enterprise setting, where things survive because they make money even if they're a pain to develop on, right?
So the context for these repos is a bit different.

>> Yeah, that's a really interesting point, because actually some of the repos where I was helped the most were ones I was completely unfamiliar with, which had no decent documentation of any kind, where I had to come in on a legacy code base that had existed for years and make a change, and where the developer who owned it was only partially available to answer my questions. In that case, Claude Code was a huge help.

>> Yeah, legacy code bases don't exist because they work well; they exist because they make money. [laughter]

>> Interesting point. The question I had was: did all the developers have the same level of familiarity with Cursor, or was there some variance? Is there a plot of their familiarity?

>> There's always a plot. [laughter]

>> There's always a plot. The question is whether you accounted for it; whether there is a J curve.

>> So here's some evidence. I can show you some plots, though I think the sample size is just small enough that you shouldn't really believe any of them. The plots aren't going to show much, but I don't want to say that's strong evidence nothing is going on; I just think the evidence is weak. The thing that really convinced me is watching the videos of them working. [laughter] Often they're better at using Cursor than I am, and I'm like, wow, I work on projects using Cursor all the time. [laughter] But here are some graphs. This one splits by whether they have various types of AI experience coming into the study, and basically you see no movement in the point estimates. People for whom Cursor was their primary IDE before: not a huge amount of difference versus people for whom it was not. Then the next one: you might hold the view that some J-curve cutoff comes after that point, but still, within the study there's some variation in how experienced people are with AI, because they work on multiple issues; after the first AI-allowed issue they're slightly more exposed than before, and so on. So you might try excluding early data points and seeing what pops up, and they don't seem to get better at using AI over time.

>> Although I think there's probably a statistical issue with that plot right there. Those bars are very, very wide.

>> Oh yeah. I think all of the plots outside the main ones, all of these subset analyses, you should not put a lot of stock in. I totally agree. [laughter] Okay, and then lots has been made of this one. This graph is the reason we filed this under unclear evidence, because things point in different directions. A lot has been made of this plot suggesting something J-shaped: in particular that at the end, once people have more experience, they do experience some speedup. Here are some issues.
First, the other plots don't show it; I think that's important to include. And second, these hours are coded very conservatively. For instance, one person in the 30-to-50-hours bucket had Cursor as their primary IDE in 2024. They had recorded themselves on their time-tracking software as having spent 140 hours using Cursor. They conservatively estimated that they'd spent 50 hours using Cursor, and so they end up in our 30-to-50-hours bin. This is someone whose primary IDE was Cursor last year. And people have been commenting, "they've been using Cursor for less than a week." I think that's not a very fair assessment. If you were to move that developer from the penultimate effect-size estimate to the last one (and again, you shouldn't believe this, because of the statistics), you'd see some balancing out, where you get back to essentially zero in that bucket. But again: don't rule anything out. J-curve explanations are still on the table.

>> Is it not likely that the 50-hour group is similarly underestimating the time they've spent using Cursor, and that if you just had a longer scale you would still see a trend?

>> Oh, that is an interesting point. That seems plausible to me. Though I'm not sure it's an underestimate, exactly, because we're using this very conservative coding. For this not to be strong evidence, I'd retreat back to: I don't think you should really believe any of these subset plots.

>> Yeah, I think the basic issue is that it's a small sample size, and there's also a lot of bias in the data set, effectively. It's a certain kind of data set.

>> You mean the kinds of developers?

>> Yeah: open source developers, and also working on open source projects that are pretty mature. For working with open source developers on projects that are pretty mature, this is probably reasonably indicative, maybe, though the sample size is pretty small. But outside of that, it gets a little harder.

>> Yeah. Talking about this, I think this group is really weird, and it's interesting for the same reason it's weird. We were interested in studying possible effects of AI on R&D speedup or automation. There, if any types of developers are not being greatly sped up, it implies the whole pipeline isn't being sped up. So it is interesting to look even at particular weird populations. You might imagine that large production code bases have a bit more of this shape than scrappy experiment scripts do. It's very interesting; it's just hard to generalize. We just don't know. We're doing a larger study, and unfortunately I think that even after that study, which includes more greenfield projects, it's still going to be hard to generalize, for not totally dissimilar reasons.

>> Although I don't feel like your results are particularly contradictory with any actual independent research that's been conducted.
The only research I've seen that I would say is contradictory to yours is research that has been funded by model shops or agent shops. [laughter]

>> What can I say about that? I do think that most of the research that's put out is associated with large tech companies, and I think there are other methodological concerns there that are reasonable.

>> I have methodological concerns with that work as well. I know people who work at some of those places who have methodological concerns with what was put out.

>> I mean, there are concerns about ours as well.

>> Sure, sure. But actually, I remember somebody sent me your paper, and when I saw the headline I was like: no way.

>> Me too. [laughter]

>> It sounded like BS. Then I read the paper and I was like: oh, this doesn't suck at all. [laughter] Your high-level conclusion is both intuitive, to a person who's read a lot of software engineering research, and well justified. I've had people argue with me about the 16-developer thing, but I don't think that actually matters in this particular case, because the developers are a fairly good control set, more or less. They remove a lot of validity concerns by being experts. It's true that they don't represent the broad population of developers, but they also remove a lot of variance you'd expect from the population, and they serve an epistemological function: let's isolate that factor away and see what happens. I thought the way the study was conducted was completely sufficient to draw the high-level conclusion that it drew.

>> Thank you very much. Here's a curiosity. We haven't published this, for organizational reasons that we won't go into [laughter], but we did conduct it. People would throw out various explanations for what's going on here, many of which have lots of merit, some of which I'm more skeptical of. A natural one is brownfield versus greenfield projects. So we ran this enormous hackathon where we randomized half of the teams to use AI versus not: maximally greenfield, or something like it. Then we had a bunch of judges score the projects, many judge scores per project, to try to even out the noise, and looked at whether, say, the bottom 50% were all in the AI-disallowed group and the top were all in the AI-allowed group, or something like that. Unfortunately, the sample was even smaller; that's part of the reason we're not publishing it. I think the evidence is really quite weak, and the degree of overlap is enormous. I'm a bit nervous about quoting the point estimate, because it hasn't gone through the kind of review processes that something like this normally goes through, so maybe I've messed something up. But I think the point estimate is something like four percentile points higher if AI is allowed versus if it's not, after controlling for everything else. That is extremely noisy and you shouldn't draw any conclusions from it, but it's seemingly a maybe-kind-of-small effect, slightly favoring allowing AI.
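A minimal sketch of that hackathon design, with simulated data; the team count, judge count, score scale, and the planted four-point effect are assumptions for illustration, not METR's actual analysis:

```python
# Simulated version of the randomized hackathon described above.
# Assumptions for illustration: 40 teams, 5 judges per project,
# percentile-style scores, and a planted +4-point AI effect.
import numpy as np

rng = np.random.default_rng(0)
n_teams, n_judges = 40, 5
ai_allowed = rng.permutation([True] * (n_teams // 2) + [False] * (n_teams // 2))

true_quality = rng.normal(50, 15, n_teams) + 4 * ai_allowed
# Several noisy judge scores per project, averaged to even out judge noise.
judge_scores = true_quality[:, None] + rng.normal(0, 20, (n_teams, n_judges))
project_score = judge_scores.mean(axis=1)

effect = project_score[ai_allowed].mean() - project_score[~ai_allowed].mean()
print(f"estimated AI-allowed effect: {effect:+.1f} points")
```

With judge noise this large and this few teams, the estimate swings widely across seeds, which is the "extremely noisy, enormous overlap" point made above.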
>> So the question I have (this is related to other research you've done): have you explored the effect of AI in domains other than software engineering? And if so, have you also found this kind of surprising result, that there's maybe not as much of a speedup?

>> No, not yet; those are new directions we have not pursued. We're interested in understanding the possibility of accelerating R&D, and coding is not the only kind of thing that happens at major AI companies; much more conceptual work happens too. I'd be very excited about working with math PhD students, or very different types of software developers, or about running these kinds of studies inside major AI companies or large tech companies. We're very interested in things that are, not necessarily directly, but somewhat closely analogous to the large AI company case. To the extent that something really deviates from that, we're probably less interested.

>> Interesting. So it sounds like you're interested in measuring capabilities for things like math research.

>> I'd say I'm interested in what the hell is going on in AI, and in how I'm going to learn the most about what the hell is going on in AI. Something a bit more conceptual, something where fewer humans are currently working, so it appears less in training data, will help me better triangulate the truth about what's going on. Even if I don't care about math research in particular, it'll still let me draw helpful qualitative lessons. That's the sense I have.

>> If I were going to pick the areas where I'd expect AI to be less successful than people assume, I would probably pick data science as an interesting one. How much are data scientists actually helped by AI today?

>> Say more about why you expect it to be less successful.

>> Let me give you an example. At LinkedIn there are 5,000 tables with the name "impressions" somewhere in the table name. So if an analyst wants to understand how many impressions happened on a page, where the hell do they go? AI can't figure that out. Today there is no existing AI system that could be hooked into a corporate environment like that and process through it; there are trillions of rows in those tables. What a data scientist needs to do is analyze a bunch of data and come to a conclusion, right? And I hear lots of thoughts about building systems; people talk about LLMs and SQL.
The models are much better at writing SQL than they used to be, but I believe the state of the underlying data is so bad that the actual data scientist is going to get way less value out of the AI than software engineers thought they were going to.

>> Hm. That is interesting; that's very curious. One view that some more bearish people have, looking at the future of AI, is that there's so much tacit knowledge, so much knowledge embedded inside companies, that you're not going to pick it up from these RL training environments, and so on. And maybe it's not the state of nature that there need to be many specialized AIs; indeed, much of the lesson of the past few years is that one big general AI seems to be more performant. But at some point in the future, when data is locked up inside companies, we might see more of a proliferation of specialist models: a GPT-N fine-tuned on LinkedIn data in particular, something like that. My reaction is kind of like that. I don't know; I do have a disbelief-like reaction. I'm like, ah, what about the science, you know? [laughter]

>> But also: contradictory facts. The problem with these problems is that all these data sets contain contradictory facts. The name of the field will be "date started", or rather "time started", and it will contain only a date, except it will only contain the date up until November of last year, and after that it will contain only the month, and after that maybe the seconds the thing finished. In order to successfully query the data, you, the data analyst or data scientist, have to know what those cutoff dates were, which is not written anywhere. Although what you could do, theoretically, is import a bunch of the SQL that other analysts have written and try to figure out how they triangulated these things, or work backwards from their reports. But today...

>> Sorry, I've just never worked at a large company. [laughter] People don't fix this at the source?

>> No. No.

>> So the lesson I learn over and over again is: data specs really matter. Really, really matter. I've also been working in data analysis and developer research, so...

>> Yeah. The problem is that their job is to produce this report for this executive, right? Not to go build infrastructure to produce this report.

>> Okay. [laughter] I'm with you. I live that dream every day.

>> Well, you just end up having to, right? You have to build out infrastructure for it; that has to be part of the job description. And the other part is that you have to fix the problem at the source. I still remember having a conversation where someone said it's too difficult to fix at the source because there's too much complexity across all the systems and all the sources. I said: okay, wait a minute.
You're saying it's too complicated to solve at the source, but somehow, downstream, you can solve a problem that's too big for the entire organization to solve? It's easier to solve there? Come on. That doesn't make any sense. I just think there's so much potential here, and I have not seen a lot of studies on how people working in that data space are experiencing AI. And what's fascinating is that real ML is mostly data work. Especially outside of LLMs, the majority of ML engineers spend most of their time doing feature curation, and trying to clean up bad data for feature creation, rather than direct model training. So theoretically, the potential even for improving ML by enabling AI to be a better data scientist is huge. My hypothesis is that if you went into this space, you would discover it is great at telling me how to write SQL, or how to write pandas, or Polars, or whatever you're using; it is okay at doing very trivial things; and it fails at all complex tasks. Fails completely on complex tasks. I don't even see a benchmark for it.

>> Can you give me an example of a complex task?

>> Sure. A complex task: give me the P90 of time between deployments, for all deployments that happened at Capital One.

>> It struggled at that? That doesn't seem surprising to me. I mean, if it has reasonable context about where it would find this data...

>> Sure, if I had that data, makes sense. Okay, fine: so give me that number, and then also make sure I can break it down by team hierarchy. Can you give it to me in a table so I can break it down by team hierarchy? Well, where is the team hierarchy data? And here's a funny thing: how do I actually determine when a deployment started and ended? It turns out that's not clear in the base telemetry; you have to know magic to figure out when a deployment started and ended. Oh, and also, for my analysis, tell me how many PRs were in each of those deployments, and which PRs went into each. Well, guess what, the deployment system only... this is being recorded, right? [laughter] Okay. Imagine the deployment system doesn't contain sufficient information about that. Then where do I get that data? It doesn't exist in any other system. So maybe I have to go to GitHub and call the GitHub API, and the chance of an LLM or any agent figuring that out today is pretty minimal.

>> Yeah. Relative to my colleagues I'm pretty bearish on AI progress, but I do still have some reaction that's like: ah, can't you spend a day getting this into a Cursor rules file? [laughter] You know, where the hierarchy exists.

>> That's why I think it's interesting. I have not seen any real comprehensive study on the experience data scientists are having.
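For concreteness, here is a minimal sketch of the "easy" core of that task, assuming a hypothetical, already-clean deployments table. Everything the speakers describe as hard (inferring deployment boundaries from telemetry, locating the team hierarchy, mapping PRs to deployments) is exactly what this sketch gets to assume away:

```python
# P90 of time between deployments, per team, over a hypothetical clean table.
# The column names and the clean `team` field are assumptions; in the
# scenario above, that information is scattered across systems or missing.
import pandas as pd

deploys = pd.DataFrame({
    "service": ["a", "a", "a", "b", "b"],
    "team": ["payments", "payments", "payments", "cards", "cards"],
    "finished_at": pd.to_datetime([
        "2025-01-01", "2025-01-03", "2025-01-10",
        "2025-01-02", "2025-01-09",
    ]),
}).sort_values("finished_at")

# Time between consecutive deployments of the same service.
deploys["gap"] = deploys.groupby("service")["finished_at"].diff()

overall_p90 = deploys["gap"].quantile(0.9)
per_team_p90 = deploys.groupby("team")["gap"].quantile(0.9)
print(f"overall P90 gap: {overall_p90}")
print(per_team_p90)
```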
>> If you have any ins for running studies at large tech companies, I'm all ears.

>> There is a fellow at OpenAI, one of the speakers here, who does internal evals, and he's mentioned he's done some work with data scientists. He might know people who have that data. But it's all been internal, between him and the product team or whatever. I'm also curious about more traditional, older professions: lawyers, doctors, and mathematicians are all very interesting to me, because lawyers and doctors are so constrained by a legacy history of constraints around how they work.

>> Yeah, legal issues; I imagine those are a significant factor.

>> And there's stodginess.

>> The stodginess I'm less bought into as a long-term explanation for economic outcomes. The legal restrictions continue to be the case through time, but stodginess? I can set up a new law firm that's less stodgy and out-compete the previous one.

>> I agree. I don't think it's persistent; I just think it's interesting. One thing that would be interesting to see is whether it affects the mental model they have today: how they've been talked to about AI, how their trust in it affects how they use it. I don't know if it's a worthwhile study; it's more one of those things I wonder about idly. Take a lawyer who just got out of law school and has spent a lot of time using ChatGPT, and take a lawyer who's been in the business for 50 years, who has a giant file folder full of Word docs containing all the briefs that all his junior associates have written for decades and decades, who just opens up those briefs, changes a few words, and sends them out to the judge, and who has known those judges for 30 or 40 years and knows exactly what they want. Is he getting any value?

>> But is there value he should get? Is there some way he would be helped? Discovery, certainly; discovery and AI in law is a huge, huge problem. I know there's Harvey; I don't know anything about the success they've had.

>> I know a lot of people working in that space specifically. It's an ongoing thing; there's always technology for it, but adoption is a very different thing.

>> That's the thing, right? I have a little bit of a legal background, and one of the first things I thought of when ChatGPT came out was: oh, this could totally change discovery. Because discovery is the most painful, most difficult, most expensive part. You could have serious social consequences by making discovery less expensive. That is the expensive part of a lawsuit.
And so you could actually have a significant impact on society if you could make discovery cheaper, instantaneous, and reliable.

>> Yeah. I have a question on your graph, the Cursor one. The scatter plot; you missed it, keep going, two more. Yes, that one. You're saying that for those developers there was no difference, even the ones who had used Cursor for 50 hours? I was very intrigued by that, because everyone talks about vibe coding and how instrumental Cursor is. How did you get to 50 hours?

>> So this is including time in the experiment: the time developers have spent on experiment issues, plus their past experience. For some developers working on some issues, partway through the experiment they've gotten to more than 50 hours of Cursor experience, and that's what's coded in that bucket at the end.

>> And was it the same task for each?

>> No. These are actual tasks that pop up on the GitHub repositories, which, as I mentioned, are kind of... I'm a little nervous about saying they're weird, because of what that implies, so let me say: they're very interesting and very weird, and they're interesting for the same reasons they're weird. These are projects in which the developers have an enormous amount of mental context built up that the AIs might not have, projects they've worked on for many, many years. I'm not sure this is always the case, but I imagine that they basically know how to execute on the particular task in front of them before they even attempt it, because they're such experts in the project.

>> When you say speedup, what do you mean? How do you quantify the speedup?

>> Let's go to this plot instead. On the left-hand side, we have the averages for what the developers say will happen to their time to complete if their issue gets assigned to the AI-disallowed or the AI-allowed group. They think that if AI is disallowed it'll take them a bit more time, closer to two hours, and more like an hour and a half, or a little less, if AI is allowed. But then we randomize each task to allow AI or not, and it turns out that if we randomize to AI-allowed, the times are more like a bit above two hours rather than a bit below. You can think of the change in time estimate as being roughly one divided by the other; it's not quite that, for reasons I can go into, but it's effectively something like AI-disallowed over AI-allowed, minus one. To draw that out: you might ask, what's the speedup? Is it 1.1x? Strictly, "these developers are going 1.1 times faster" puts us on a speed scale, when we're actually on a time-to-complete scale; but ignoring that detail: is it 1.5x? Is it 0.5x, meaning they're actually going twice as slow? We'd take something like the AI-disallowed times divided by the AI-allowed times: if disallowed took 1.1 times as long as allowed, we'd call that a 1.1x speedup. It's something like that that's going on. And in fact, we find a slowdown. Obviously.
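A minimal sketch of that effect-size arithmetic, with made-up times; METR's actual estimator works on log completion times with more care, so this only shows the "disallowed over allowed, minus one" shape of the quantity:

```python
# Toy version of the speedup calculation described above, with made-up
# task completion times in hours. This is not METR's actual estimator.
import numpy as np

ai_disallowed = np.array([2.1, 1.9, 2.3, 2.0])  # hypothetical completion times
ai_allowed = np.array([2.4, 2.2, 2.6, 2.3])

def gmean(x):
    """Geometric mean: the natural summary on a log-time scale."""
    return np.exp(np.log(x).mean())

speedup = gmean(ai_disallowed) / gmean(ai_allowed) - 1.0
print(f"speedup: {speedup:+.0%}")  # negative means allowing AI slowed people down
```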
>> I just read a fascinating article, I can't remember the company, where a journalist was allowed to do a pull request using vibe coding. There was some feature, AI was used to assist with building out the requirements, and according to the article he practically just did a couple of tweaks and then signed off on it, and it was fairly fast. And he didn't code; that was the whole thing. He didn't have any software development background. I'm curious whether you've tried to do a study on that.

>> I definitely share the intuition: if you've got no idea what's going on, then probably there's going to be some significant speedup. But I'll say, number one, it's not a priori obvious. In fact, we went out and did this hackathon with very experienced people and much less experienced people and tried to see what happened. What we found is that the judge scores are extremely noisy, and I don't think you should believe them, but the judge scores were not that much higher when AI was allowed versus when it was not. People aren't actually making that much more progress. Another thing to say, and there's going to be more expertise in this room than I have: my understanding, from sitting with these open source developers for a while, and from not being a very capable developer myself, is that the quality bar on the repositories in the study is just very high, typically. So I would be very surprised if a journalist (frankly, even a good software engineer without lots of experience on the repository, but certainly someone who wasn't a software engineer) were able to get up a clean PR on these repositories the first time. In fact, I think that's a lot of the story for what's going on here. The AIs do actually make progress in the right direction some good fraction of the time. But for various reasons, sometimes reasons of correctness, but sometimes reasons like how they've tried to solve the problem, whether that's the typical way of solving it, or how various parts of the project speak to one another, they haven't properly accounted for these considerations. So the humans not only need to spend expensive time verifying; they also need to clean up all that stuff. And my sense is that someone who didn't have all that experience basically wouldn't know how to do that step, and so wouldn't be able to submit a clean PR to these repositories.
That's a lot of it. Relative to these people, at least, I suck at software development. [laughter] And I'm getting up PRs internally all the time. I think they're of worse quality, and they're getting better over time. So I do believe that people are coding where they wouldn't otherwise be able to code. They are submitting PRs at a lower quality standard where previously they couldn't have done it at all. But getting up these expert-level PRs, I do feel kind of skeptical about.

>> And that's actually part of what I was getting at: on these bigger, high-quality projects, PRs from more novice folks often get rejected for no other reason than the developer-ergonomics impact of the PR. For an open source project, almost all the incentive is biased towards making the project easier for me to maintain. So every time a PR comes in, if it doesn't make the project easier to maintain, I have a tendency to reject it; if it does, then yay, I'm into it. That is unlike a typical business context, where the most important thing is to get something done, and the fact that someone's going to spend a lot of time maintaining something is almost job security. For open source it's the opposite: what causes people to leave projects is when they become difficult to maintain. So it is a different bias on which pull requests you accept.

>> Can you remind me of the name of the English gentleman who maintains the Haskell compiler? Simon something?

>> I know who you mean, but I can't remember the name at all.

>> So here's one story that might be relevant. The repositories in the study all broadly have these characteristics, and one of them is the Haskell compiler. Famously, on the Haskell compiler, there's some chance, I don't know if it's 50% or 30% or whatever, that if you submit a PR, the... I'm being recorded... Simon Marlow, maybe? I'm not sure... the creator of the Haskell compiler will come into the comments and argue with you for many, many hours, much longer than you spent working on the pull request, until the PR hits exactly his specifications. Combine that with the remarkable fact that for the median PR in the study, the time spent working on the code post-review is zero minutes. That is, the median PR is essentially perfect the first time around, because the professional incentives of these developers are like that. Now, there's a very long tail, and on one PR I think this gentleman literally pops up and argues in the comments for many hours, and that one takes a lot longer. [laughter] But yes, they are maintaining this extremely high bar.

>> I'm interested in the other upcoming stuff you have in your doc.

>> Yeah, there is some. Let's go in order. As I think you mentioned, if capabilities as measured by time horizon keep doubling, it does seem very challenging to keep up with that.
In the short term, we have a number of directions for getting on top of it, and I think those will last through the year. Through two years, it seems challenging but still possible. Through three years, still possible, but it starts to get harder and harder. So in the short term: building these much longer tasks, plus ways in which we might get around the problem entirely.

>> You could also raise the accuracy bar.

>> You could raise the accuracy bar. Although the reason we're interested in this in the first place is a question like: is GPT-5 existentially dangerous? And the answer is no, I think. But why do we think the answer is no? There are multiple reasons, but at least we can say that GPT-5 is just not that good at stuff. You're trying to get it to do data science on these columns that have very similar names, where it's not clear exactly what logic led to the columns, and it's not doing that sort of thing. And wouldn't you need to do that sort of thing, not exactly that, but be capable of stuff generally, to be existentially dangerous? It's not capable of stuff. So when I see the AIs failing at these hard tasks, I'm like: great, whatever. But if a model is succeeding 90% of the time rather than 99% of the time on very challenging long tasks, then there are other arguments we can retreat to, and I'm prepared to retreat to them; the naive explanation, "it's not existentially dangerous because it's not capable of stuff", at least starts to get pushed on.

So here's another possibility for something we could do. What if, instead of time horizon, we measured time horizon whilst being watched very closely? Maybe the time horizon whilst not being watched very closely is way up here, but the time horizon whilst being watched very closely is much lower. I hope that major AI companies have monitoring setups going where, for example, AIs are looking over transcripts of other AIs performing work to see if anything suspicious is going on. It won't be impossible, perhaps, I conjecture, to fool those monitors, but it would be challenging. So maybe you can do 40-hour dangerous-capability-relevant tasks without any monitoring, but only 2-hour tasks if you're being watched very, very closely.

>> How do you mean, watched? Another model evaluating the output?

>> That's the kind of thing I'm talking about, and maybe you escalate to humans.

>> With and without safety controls.

>> Roughly, yeah. And then we've bought a load more doublings, and we can keep doing the capability extrapolation thing. [laughter] That's one of the kinds of things I'm thinking about, in addition to creating ever longer tasks.
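Rough arithmetic on "buying doublings", for illustration; the 40-hour versus 2-hour gap comes from the example above, while the seven-month doubling time is an assumption, not a METR figure:

```python
# If monitoring cuts the effective time horizon by a factor k, the
# extrapolation gains log2(k) doublings of headroom. Numbers illustrative.
import math

reduction = 40 / 2        # 40-hour tasks unmonitored vs 2-hour monitored
doubling_months = 7.0     # assumed time-horizon doubling time

doublings_bought = math.log2(reduction)
print(f"{doublings_bought:.1f} doublings "
      f"~= {doublings_bought * doubling_months:.0f} months of headroom")
```

With these assumptions that is about 4.3 doublings, or roughly two and a half years, which is consistent with the "order of magnitude or two" reduction discussed next.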
>> I imagine some of the model shops do have evaluations of capability with and without safety controls, because I'm sure there's an argument between their researchers and their safety teams.

>> I feel like I have seen something about this, but not a lot. I guess I think this might be an especially quantitatively important consideration: I expect it would reduce the effective time horizon by maybe an order of magnitude or two. And I agree there are important senses in which it's not really a difference in kind.

>> Of course, then I would also worry that publishing that encourages people to focus less on safety, or to argue against safety because it impacts capability.

>> Yeah, I think there are lots of landmines like that in all sorts of safety work, not just in AI. Okay, next thing. We have this trend; I spoke about it at the beginning. Is it going to continue forever? Is it a fact of the universe, or does it somehow depend on inputs, or on what you think about intelligence explosions? Trying to figure out where this line is actually going is a pretty active area of work. Also: the ways in which this line, or its particular points, don't correspond to the thing I care about. One obvious way is how these models are being judged. [laughter] I think the algorithmic scoring we use on METR tasks is importantly more robust, or covers the relevant concerns better, than might be the case with plain SWE-bench-style unit tests, but it still has a lot of the same character. There are considerations, like being able to build on the work in the future, outside the immediate problem facing you, that aren't captured by METR scoring. Maybe if you did capture that, you'd get something a bit like going from 50% success to 80% success: the model can do hour-long tasks if it doesn't matter whether you can build on the work, but only 30-minute tasks if it does matter. So: bringing these numbers closer to something I care about, and then projecting out both compute slowdowns and, if we enter some regime where AIs are building AIs and the curve steepens, those kinds of considerations. That's another thing I'm thinking about.

Oh, and then: capability measurement from new angles. Here's one history of METR. It's not the accepted history, and probably not a very accurate one, certainly not the most accurate one, but here's one possible telling. Near the beginning (I wasn't there, and I have no internal knowledge of this), METR had early access to GPT-4, and at the time there were just Q&A datasets, eval sets, everywhere. And GPT-4 seemed so smart relative to the stuff that went before that you ask: can it do stuff?
So you try it out on some tasks, and the answer is: it can do some stuff, and it can't do other stuff. And people are like, oh, that's cool; this is a neat new kind of thing, getting models to do stuff instead of answering questions. Then later, different models come out over time: this model in January, that model in February. Can they do different kinds of stuff? If we test them on the same stuff, we can compute what is in some ways the most obvious summary statistic of whether they can do stuff, a single number that reflects whether they can do stuff, which is time horizon, plotted over time, and see what happens. And you're like: oh, that's kind of interesting. And then: what's the next, in some sense dumbest, most obvious thing you can do? We'll run the most obvious RCT design, allow AI or don't allow AI, and see what happens. It'll be messy; there are a lot of methodological problems people point out, as there are with the time-horizon work, but they're different kinds of problems, with different pros and different cons. And maybe, with two different methods that give two different answers and have two different sets of pros and cons, we can triangulate the truth. And now I'm asking: can we pull that rabbit out of the hat one more time, or multiple more times? Are there other sources of evidence, with different pros and cons, that I won't believe fully, but that might give different answers, and so on?

Here are two suggestions of things I'm curious about at the moment. The first is in-the-wild transcripts. Agents in Cursor, in Claude Code, and in whatever other products and services leave behind traces: the diffs they've contributed to code bases, their actions, their reasoning chains, and so on. The traces they leave in the wild are importantly different from our setting, where things are more contained and the task is neatly packaged. In the wild it's going to be, like the example with the many confusing columns, whatever real crap shows up. What can we learn from that? There are important reasons not to believe that kind of information: it's not very experimental, and it's hard to know exactly what to make of it. But it has the important pros that it's more real, and the volume of transcripts is enormous; perhaps there's a lot you can learn there. That's one thing.

And then here's another one. There's this group, which you should check out, called AI Village, where they have a lot of different models, or agents, living in a village, occasionally talking to humans, trying to accomplish fuzzy goals that are set for them, basically using computer use.
And now I'm asking: can we pull that rabbit out of the hat one more time, or multiple more times? Are there other sources of evidence with different pros and cons, evidence I won't fully believe either, but that might give different answers? Here are two suggestions of things I'm curious about at the moment. The first is in-the-wild transcripts. Agents in Cursor, in Claude Code, in whatever other products and services, leave behind traces: the diffs they've contributed to codebases, their actions, their reasoning chains, and so on. The traces they leave in the wild are importantly different from our setting, where everything is contained and the task is neatly packaged. In the wild it's like the earlier example with the many confusing columns: whatever real crap actually shows up. How do models handle that? There are important reasons not to believe that kind of information: it's not experimental, and it's hard to know exactly what to make of it. But it has the important pro that it's more real, and the volume of transcript data is enormous; perhaps there's a lot you can learn there. That's one thing. And here's another. There's this group you should check out called AI Village, where a lot of different models, or agents, live in a village, occasionally talking to humans, trying to accomplish fuzzy goals, basically using computer use. They try to do stuff like organize an event at the park, run a human-subjects experiment, run a merch store; stuff that's not so clearly specified. And basically all the time, the models fall on their faces and suck. Now, there are lots of reasons not to believe this evidence. Number one, it uses computer use, and computer-use capabilities are considerably worse than CLI-based or text-based capabilities at the moment; and maybe we care more about text-based things anyway, because they're more relevant to other things we care about, and lots of GUI-based tasks can be converted into text. Number two, there are all these different models hanging around the village, and I'm like, why so many models? Why a village instead of one big agent-orchestration setup? I don't really understand what's going on there. Anyway, lots of reasons not to believe it. But on the other hand, it is models doing stuff in the world. These aren't benchmark-style tasks; they're trying to accomplish a goal, and they can't accomplish even very basic subsets of it. I find that extremely interesting, and I wonder if you could remove some of the most obvious cons: make it text-only, give them relevant text-based tools, work on elicitation so the models are more performant, drop the less performant models from the village, and so on, then try to get them to do these fuzzy goals and just observe where they mess up. Step one went great, but then they became incoherent, or went into a strange psychological basin with one of the other models, or couldn't interact with external services appropriately, or couldn't manage their resource use. I'd be very interested, just qualitatively, in what goes on when you do that. Again, keeping in mind that, at least at the moment, I'm most interested in the ability of AIs to automate R&D: something shaped like this seems like it might point, curiously, at why that's not happening now and might not happen in the near future. Not sure exactly what's there, but yeah.
>> My observation is that they're effectively neurodivergent individuals, right? And our world was not built for that.
>> Yeah.
>> Everything we have is designed for a human to do; it's shaped and sized to humans. Like the military: how big are packs? It's based on how much they think a person can reasonably carry. How much we expect someone to handle for their taxes is based on what we think a human can do well. Neurodivergent individuals struggle where the world's expectations don't align with them, and compared to a neurodivergent individual, these intelligences are really, really different, right?
And so all of their rough edges, where they don't align with our world, that's why they've needed human assistance to accomplish anything real in our world. It's just too hard for them currently.
>> Currently?
>> Yeah, yeah. [laughter] Someday that changes. Either they get really, really good, or our world will have to change. One of those two things.
>> I agree, in some strong sense. But if you ask me to really pin down why that's the case, when they're beating experts on GPQA, these extremely hard science questions, and so on: why, actually, are they not able to accomplish things in the world?
>> Have you ever met a neurodivergent individual who was terribly good at something but completely useless at getting through life? [laughter] They're all very good at reading books. [laughter] There are a lot of those people in the world. It's not that surprising.
>> My only feeling about AI abilities is, well, today is the 200th day my car didn't rocket off the Earth at escape velocity and fly to the moon.
>> That's because you didn't build a rocket yet. Yeah, I mean, maybe I'm mischaracterizing, but I thought there was a lot of talk a year ago about computer-use capabilities being impressive today.
>> There was. There was a lot of talk about it, and yet I have talked to almost nobody who has used them for anything practical. [laughter]
>> Yeah. But if we move this to text only, and it seems reasonable to constrain it to text only, would you still have the rocket concern?
>> No, I wouldn't, not really. Well, it depends on what the task was.
>> Sure. Yeah, the kind of thing a human could do over CLI only.
>> I think this relates to a talk from earlier, where they talked about how one way to use agents effectively is, if you have a task, to figure out a way to present or transform the task into something that's in-distribution for the model. And this conversation ties into that. Interacting with Chrome is less in-distribution than a CLI. So that could be an interesting area of research: if you want to explore how well models can really do browser tasks, first build harnesses and interfaces that are much more in-distribution for them, so that's less of a concern.
>> Yeah. I think it also speaks to the point about, quote unquote, bringing tasks nearer to the benchmarks. It's not so different from management giving appropriately scoped tasks to your very talented interns, or very talented neurodivergent interns, something like that. I do think that's right. But from the perspective of capability explosions and automating R&D, maybe the models will get extremely good at scoping tasks for themselves so that they're benchmark-style, or something like that.
But if they can't do that, well, there are a lot of things that don't look like benchmarks that come up in the real world, and you need to be able to work with them flexibly if you're going to do something as complicated as automating a major AI company. So I think it can both be the case that AIs are incredibly performant on some particular type of problem, or on other problems reshaped to the scope and shape of what they're best at, and also that they can't flexibly substitute for human workers, because that substitution requires setting up the problem appropriately yourself, or not having those constraints at all.
>> It is interesting, though, your point about new capabilities. It makes me think about another axis on your graph. I wonder if there's not just a time-horizon issue but also a task-category, type-of-work axis. Computer use is one example: think about the capability of computer use, versus a capability that would require computer use, versus a capability that can be accomplished entirely in text.
>> Yeah, sure. Although almost all of these benchmarks are basically text.
>> Yes, and indeed the ones that aren't, the ones that require vision capabilities, are notably lacking.
>> I'm not sure exactly what to make of this graph. One thing I make of it is that there's probably not so much variation in slope, in doubling time, across task distributions; I think it's only weak evidence for that. But in intercepts, the base of where we are now, there's possibly a great deal of variety, especially image-like capabilities versus not, to say nothing of physical abilities.
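As a sketch of what "same slope, different intercepts" would mean in practice, here's a minimal example with entirely invented horizon numbers, not METR data: fit log2(horizon) against year for each task distribution, then compare the implied doubling times and current levels.

```python
# Minimal sketch (invented numbers): per-category log-linear trend fits.
import numpy as np

years = np.array([2022.0, 2023.0, 2024.0, 2025.0])
horizons_min = {                      # hypothetical 50% horizons, in minutes
    "software":     [2.0, 8.0, 30.0, 120.0],
    "computer_use": [0.05, 0.2, 0.8, 3.0],
}

for name, h in horizons_min.items():
    slope, intercept = np.polyfit(years, np.log2(h), 1)  # doublings per year
    level_2025 = 2 ** (slope * 2025 + intercept)
    print(f"{name:12s} doubling time ~ {12 / slope:.1f} months, "
          f"2025 level ~ {level_2025:.2f} min")
```

With these made-up numbers both categories double roughly every six months, but the computer-use intercept sits far lower, which is the pattern being described.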
>> Right, exactly. You could even go to sensors, something tactile. Today they would all score zero; nothing has tactile, so the graph can't tell you anything about anything tactile.
>> Well, in producing this graph we do try to make the models as performant as possible on some held-out set... no, we don't give them tactile stuff. [laughter] Though I'm not sure they'd score zero.
>> Sure, sure. But we do have some examples of spatial judgments.
>> Yeah, yeah. Spatial judgments.
>> We've obviously seen fine control and stuff like that elsewhere in robotics. I just don't know if anybody has listed out all of the capabilities we would expect in the future. If we actually wanted AGI, what is the entire list of key capabilities?
>> That's a way to start a debate that doesn't end. [laughter] I think Hazel Hopper and Arjun Ramani hopefully have a paper on some number of these problems.
>> And then maybe, thinking about where we're at: do all of the capabilities we currently measure follow the same log-linear trend?
>> Yeah, that does seem like a reasonable null hypothesis to me. Not a certainty; who knows. Oh, there was something I wanted to add. Here's another thing I'm thinking about, not strictly in a research capacity, although kind of. Some people, like me, are skeptical of a software-only singularity: the idea that you could automate AI research without also automating chip design, and maybe chip production as well. You'd quickly get bottlenecked by compute, because for fixed hardware there are only so many experiments you can run that will be productive enough to fuel progress. But even for people like me who are skeptical of that, you might think chip production is in fact going to get automated. The robots, they're coming. [laughter] They can do the stuff that humans do, and then maybe you really do have a fully self-sustaining robots-plus-AI economy. So you have some slowing trend from compute slowing down, but then a bend back upward once the whole thing is in a tight loop. One interesting debate I heard about recently, and would like to think more about: in the public discussion there's some sense of, why are robotics capabilities lagging LLM-like capabilities so much? Well, it's said to be training data, or maybe hardware constraints. I'm curious whether it really is hardware constraints. What exactly are those constraints? If we put superintelligence inside the hardware that exists today, could it build chip production facilities? I genuinely don't know; I'm a total novice here, and it's not obvious to me what the answer is. I think it's kind of plausible. I'm not sure you need very flexible fine motor control to do it, and maybe the fine motor control is there, conditional on having superintelligence controlling it.
>> To be fair, the key steps of chip production are already done by robots.
>> Oh, but I'm also thinking of building the robots, the whole loop.
>> As far as I know... I have a friend who spent most of his career in software development but during COVID started manufacturing things to help people, and he found out how hard the manufacturing world is and how slow the iteration process is. He knew it was going to be worse; he didn't understand it would be next level, an order of magnitude worse. From the outside, for people who don't do it, it seems like, oh, how bad can it be? But the feedback I've had from everybody who actually works in that space is that it's way, way different.
>> That's what I've heard as well.
I've only talked a little with people who work in fabs and such, but when I did, I was surprised by the level of human expertise required. A lot of those jobs are fairly high-paying, actually.
>> Oh yeah, very high-paying jobs. And the rate of improvement is glacial compared to software, right?
>> I think that's also because it costs billions of dollars to build a fab, so each generation is a huge cost to fund. It's brutal.
>> Right. I think that's why it's been hard to get it all the way there. Give them a couple more centuries; maybe they can get it done. [laughter]
>> Is that really your view? Centuries?
>> I do think, like you, I'm skeptical about how easy some of these tasks are. We think they're easy, but in my experience... I remember when the self-driving push came; I actually worked in that space for a while, and it was clear we could get really close, but getting all the way to something acceptable is extremely difficult. We underestimate how much work is involved in that last little bit. I knew we could do it with computers, like, ten years ago, pretty much, but getting the last bit, to where everyone's happy with it, takes a lot of work.
>> I feel this myself. I didn't get a driver's license because I expected self-driving cars to come. I think I was basically right, and it hasn't been that long, you know? They're expanding to the entire Bay Area; I don't think it's going to take that long. Is the robot economy building chip production going to take centuries? I don't know.
>> Part of the trick with self-driving is that the economic incentive is moving it along faster, right? And robots building robots probably would too. But where we're at right now, RepRap is about as far as we've gotten on robots building robots, right?
>> Oh, but is that paying sufficient attention to the charts? GPT-2, 2019. [laughter] It's so recent. This is somewhat nonsensical, but maybe we're in a sort of GPT-2 moment.
>> Yeah, no, it's a fair point. I could be wrong. My guess is just that it's going to take a lot longer than we think, at least for real mass production at a scale that causes the kind of global impact you're talking about. They can already do a great job building one-offs; robots are very good at one-off builds at small scale, but that's totally impractical at large scale.
>> One fact I think is kind of remarkable... maybe it's this chart. Yeah. The rate of growth of compute going into robotics models is about the same as for LLMs, but the levels differ by about two orders of magnitude. I'm kind of curious what we'd see if that gap closed.
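As a back-of-envelope illustration (my numbers, not the chart's): if the levels differ by two orders of magnitude and the growth rates are identical, the gap never closes; it closes only if robotics compute grows faster, and the time to close is set by the growth-rate differential.

```python
# Toy gap-closing arithmetic (assumed numbers, not from the talk's chart).
import math

gap_ooms = 2.0               # assumed level gap: two orders of magnitude
extra_growth_per_year = 2.0  # assumed: robotics compute grows 2x/year faster

years_to_close = gap_ooms * math.log(10) / math.log(extra_growth_per_year)
print(f"~{years_to_close:.1f} years to close a {10 ** gap_ooms:.0f}x gap")
# ~6.6 years under these assumptions; with equal growth rates, never.
```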
It does seem like at least somewhat more capable robots are very much on the table, as something that could happen very soon, if this trend continues. [laughter] No, I'm not saying all the way; I'm certainly not saying chip production. It just seems like there's some sort of data bottleneck or something. It's interesting. Also worth thinking about: you don't just need to scale data; you can also scale parameters on the same amount of data. That's a way to use compute to close some of the gap.
>> Interesting. Yeah. So, one of you just gave me a very interesting overview of where AI is going in fabrication and where it's not.
>> And what does it say?
>> So it says there are a lot of areas where it's going to help, probably pretty dramatically, in the near future, and a lot of them are the computational aspects. There are computational steps that are extremely expensive, like designing a mask, the pattern you use with the laser to print the transistors. Calculating that, working out how to build it, and ensuring it conforms to the spec you've written is extremely computationally expensive, and there's a lot of opportunity for AI to help there. There's also, theoretically, the possibility... chip manufacturing is extremely precise but also fragile, and an AI that can detect parameters that are out of whack and heading toward failure while imaging a wafer could theoretically dramatically improve yield, and yield is a big problem in chip manufacturing. The reason you get different speeds out of your CPUs is that one line produces all of them, and some chips just come out worse; that's why the higher-gigahertz models are more expensive than the lower-gigahertz ones. Your home Nvidia GPUs, your 5050 or 5060, 5070, 5080, 5090, are all the same chip.
>> Right. They just have different quality.
>> Different levels of fault tolerance, essentially. Yeah. But the problem is that they're...
>> Cut the recording. They're going to kick us out soon, but feel free to continue the discussion.
>> Yeah, cool. You can all hang out. Yeah, sure.
>> [music]