Why Agent Hype can fall short of reality – Joel Becker, METR
Channel: aiDotEngineer
Published at: 2025-12-24
YouTube video id: RhfqQKe22ZA
Source: https://www.youtube.com/watch?v=RhfqQKe22ZA
Hey guys, thank you so much for having me. My name is Joel Becker. I work as a researcher, a member of technical staff, at METR, which stands for Model Evaluation and Threat Research. As we'll see in a second, I'm going to be talking about AI capabilities: how do we know how performant AIs are today, and how performant they might be in the near future, from two different sources of evidence that seem to give somewhat conflicting answers? I could have given this whole talk without reference to METR papers in particular, but we'll look at two papers I've been involved with as examples of benchmark-style evidence and then more economic-style evidence. On the benchmark side, "Measuring AI Ability to Complete Long Tasks" — the paper that comes with the charts many of you will have seen on Twitter and elsewhere, which METR is well known for. The second is an RCT measuring how allowing AI affects developer productivity. Then we'll talk about how to reconcile the gap implied between these two different kinds of measurements.

As I mentioned, METR stands for Model Evaluation and Threat Research. We are an independent research nonprofit that seeks to inform the public, policymakers, and labs about the degree to which AIs might pose catastrophic risks to society. The "model evaluation" part means that we seek to understand AI capabilities and propensities, and the "threat research" part means we try to connect those capabilities and propensities to potential catastrophic risks.

OK, the first paper we're going to talk about is associated with the chart that many of you might have seen. Taking a step back before we dive into the paper: how do we usually think about measuring AI capabilities using benchmarks — SWE-bench, GPQA, and so on? There's some notion of 0% performance, or random performance.
For GPQA that's 25%, which corresponds to the floor — the worst you can possibly do. Then there's a human baseline below 100%; for GPQA I think it's something like 75%, representing expert human performance. And of course you can go all the way up to 100% on these kinds of benchmarks. But what does it mean? If I'm getting 50% on GPQA — if I'm halfway from the floor to the expert baseline — what does that really say about how performant the AIs are? If I meet the human baseline, does that mean the AIs are now as performant as, or even more performant than, expert humans in a relevant sense that I care about? It's hard to interpret. Another thing you see from this graph is that benchmarks seem to have less and less time between coming online — giving any signal at all — and being fully saturated. It's harder and harder to create benchmarks with plenty of signal, benchmarks that might be informative about how capable models are, for an extended period of time.

So we're going to go about this a different way. First, we gather human baseline data for diverse tasks spanning a range of difficulties. You should think of these humans as experienced experts, but on their first day or first week on the job. These are not people with context on the particular tasks — it's not exactly the kind of thing that's come up in their work before — but if it's a software engineering task, they are relevantly skilled general software engineers. The same goes for the machine learning tasks and the cybersecurity tasks we'll talk about. The tasks come from three buckets, or task distributions.
HCAST is a collection of software-based tasks seemingly requiring autonomy — interacting with tools, interacting with environments, thinking through the problem — not just a Q&A-style dataset. The SWAA suite consists of atomic problems — problems that maybe GPT-2 can do, maybe it can't — like: here are four files, one of them is called passwords.txt; which file contains the passwords? And on the other end of the difficulty spectrum we have RE-Bench: challenging, novel, open-ended machine learning research engineering challenges that are very difficult even for top human experts.

In addition to gathering the human baseline data, we also measure AI performance for the models we're interested in on the same set of tasks, under conditions as close to identical as possible. Then we convert the time it takes humans to complete these tasks into an estimate of AI autonomous capabilities, as I'll show you in a second. Here's an illustrative diagram, in this case for Claude 3.7 Sonnet, which was the frontier model at the time this paper came out. You can see that for the very short tasks — something like 4 minutes or below — Sonnet is getting the answers correct essentially 100% of the time, or maybe even literally 100% of the time here. For the very hardest tasks it's struggling, and then there's some range in the middle where we're somewhere between 10% and 90%. I'll say that this empirical pattern — models being less performant at tasks that take humans longer — is not a fact of nature, but it's something we see pretty commonly and pretty robustly across models, at least on this task distribution, and I'd conjecture for other task distributions as well.
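This declining success-versus-human-time pattern can be captured with a simple logistic curve fit on log human completion time, whose 50% crossing is the quantity the paper cares about. The sketch below is purely illustrative — the task data are made up and the fitting procedure (plain gradient descent on the logistic likelihood) is a simplification of whatever estimation METR actually uses.

```python
import math

def fit_time_horizon(tasks, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(-k * (ln t - ln h)) by maximum likelihood,
    where t is the human time-to-complete for a task and h is the model's
    50% time horizon. `tasks` is a list of (human_minutes, succeeded) pairs
    for one model. Returns (h, k). Simplified sketch: plain gradient
    descent, no task weighting or uncertainty estimates."""
    log_h, k = 0.0, 1.0  # initial guesses: h = 1 minute, unit slope
    n = len(tasks)
    for _ in range(steps):
        g_logh = g_k = 0.0
        for t, y in tasks:
            x = math.log(t)
            p = 1.0 / (1.0 + math.exp(k * (x - log_h)))  # predicted P(success)
            err = p - y                  # gradient of logistic negative log-likelihood
            g_logh += err * k            # chain rule through z = -k * (x - log_h)
            g_k += err * (log_h - x)
        log_h -= lr * g_logh / n
        k -= lr * g_k / n
    return math.exp(log_h), k

# Hypothetical per-task outcomes for one model: it succeeds on short
# tasks, fails on long ones, with a mixed region in the middle.
tasks = [(1, 1), (2, 1), (4, 1), (8, 1), (16, 1), (16, 0),
         (32, 1), (32, 0), (64, 0), (128, 0), (256, 0)]
horizon, slope = fit_time_horizon(tasks)  # horizon lands in the mixed region
```

With this toy data the fitted 50% horizon comes out in the tens of minutes, between the reliably-solved short tasks and the reliably-failed long ones.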
So we try to fit this dark purple line to data like this, on how long it took humans to complete the tasks the models are attempting. We call the point on the x-axis — the human time-to-complete axis — at which we predict the model will succeed 50% of the time the time horizon of that model. There's much to debate in the 50% number; I can talk later about the reasons we chose it. Then we do the same exercise for the other models. Here, Claude 3 Opus has a time horizon of something like 4 minutes — that's where we predict it has a 50% success probability on this task distribution. For o1-preview I'm seeing something like 15 minutes, and so on. And of course, all these models come out over calendar time. So if we plot the time horizon — the x-coordinate on this set of plots — against calendar time, we find something like this. It looks like an exponential trend going up at some constant rate. In fact, it doesn't just look like one: a perfectly straight line here would indicate a perfectly exponential trend, and we see something remarkably steady — much steadier than we were anticipating when we set out on this research project. And that has continued to be the case. Many of you will have seen the updates we've made to this graph on Twitter; this one goes all the way up to GPT-5.1-Codex-Max, so extremely recent, and the predictions from this shockingly straight line have held up very well, I think.

Taking a quick step back: what is this benchmark-like evidence telling us? One thing is that AIs can succeed at what for humans would be exceedingly difficult tasks.
The tasks in RE-Bench are really far beyond my capabilities personally, and the AIs are having a good crack at them some decent percentage of the time. The second takeaway, which is kind of obvious, is that progress is rapid.

On the other hand, how much stock should we put in the evidence suggested by benchmarks? What limitations might they have? Lots, but here are three that I'll note. One is, as I mentioned, that these are humans who are expert in some relevant sense but low-context — it's something like their first week on the job. They haven't seen tasks exactly like this before; they just have relevant experience. Presumably people who were not just relevantly experienced but also highly familiar with the set of tasks would complete the tasks even sooner, and relative to those people the AIs would look less performant. The second is that benchmarks can be low-ceiling. Take GPQA again: we're getting to the point where that benchmark is totally saturated, providing no additional information about marginal models, whereas time horizon provides a nice way to chain benchmarks together over time. But it's still very hard to create ever-harder tasks when the time horizon of models is doubling every six to seven months — so even time horizon, or the benchmarks underlying it, might saturate before too long. And the third — not a concern limited to the METR tasks behind time horizon; it's also true for SWE-bench and for many of your favorite agentic benchmarks — is that the problems aren't very messy, in some sense.
They don't require a ton of coordination with humans. They're often in relatively small, contained environments where not much can go wrong — not massive open-source codebases, or other settings where the problems involve more interaction with the real world or are messy in some sense.

So we did this project, and then early this year we were trying to think about how to attack some of these limitations. What's a different source of evidence that might have its own pros and cons but, importantly, be more externally valid, in the scientific jargon? Perhaps field experiments are the answer.

So, more economic-style evidence. Here we might be interested in very high-context developers who are expert in the kind of tasks they're already doing, and in speed-up, or some notion of productivity boost. That measure retains signal even through a range that's superhuman according to benchmarks: perhaps GPQA is fully saturated and you're seeing a 1.5x or 2x speed-up, but you can still go on to achieve a 3x, 4x, or 5x speed-up after that — we maintain more signal. And lastly, the tasks are messier. They're tasks coming up in people's real work; they're not synthetic, not small and contained. This is a real deployment scenario.

Here's what we did for this paper. We gathered 16 experienced developers on large, mature open-source projects, which we'll go through in a second. Each of these developers would, on average, complete about 16 tasks from their real work — issues on the relevant GitHub repositories, the kind of thing they might otherwise have completed anyway, with the caveat that we excluded the very longest issues. The tasks were randomly assigned to AI-disallowed or AI-allowed.
AI-disallowed means what you think it means: software development in 2019. No AI-powered tab autocomplete, no Cursor agentic coding tools, no LLMs via the web UI. Or a task can be randomly assigned to AI-allowed, in which case everything's on the table — any of the AI tools I just mentioned, or not using AI tools at all. If you're in the AI-allowed condition, you're not compelled to use AI; you just have the option. And we bought these developers Cursor Pro, so for the most part that's the tool they were using, typically with Claude 3.6 or 3.7 Sonnet, the frontier models when we conducted this work. Then we recorded the time it took the developers to complete each task, to see the degree to which they save time when AI is allowed versus when it's not.

These are some of the repositories; many of you will be familiar with them. We've got the Haskell compiler represented; we have scikit-learn; we have Hugging Face Transformers. These are, on average, a million-plus lines of code, and they've been around for 10-plus years. The developers working on these repositories as part of the study are, on average, the third top contributor out of hundreds — in some cases thousands — of contributors, and they've personally been contributing to their repository for something like five years on average. These are top experts.

Some of you might have seen this graph too, so the punchline's been spoiled; for the rest of you: we asked economics experts and machine learning experts — people at major AI companies and labs, top academics, some graduate students, and so on — how much they expected developers to save time when using AI. They said something like 40%, or a little less.
We asked the developers themselves, the study participants, how much they expected to be sped up ahead of time, and they said something like 24 or 25%. Then we asked the developers, after the study was completed, how much they thought AI being allowed had sped them up on the issues they completed as part of the study, and they said something like 20%. And the punchline is that we found developers were slowed down by 19%: they took 19% more time when AI was allowed than when it was not.

When I first saw the data coming in — early versions of this plot — I thought, presumably, the same thing many of you might be thinking right now: that we'd messed something up, that something had gone wrong, that there was some issue in how we'd set up the experiment. How could it possibly be the case? At the very least, these developers have access to the zero point, because they can simply not use AI at any time. So we pored over many, many hours of screen recordings of these developers working on issues as part of the study. We dove into a bunch of hypotheses that might explain what's going on and tried to categorize the factors we think are contributing versus not; much of this is listed in the paper. I'll quickly go through some of the things we think are contributing.

First, over-optimism about AI usefulness. That seems like an obvious one: even after the study was completed, the developers thought AI was going to be helpful to their work, so it makes sense that they might overuse AI on that basis. Two more: implicit repository context, and high developer familiarity. These developers come to these problems already knowing the solution — they're that expert in this work.
I imagine them not spending a bunch of time thinking through a solution the AI could work through; instead, they're limited by how fast they can type. That means using AI — instructing the AI to do the work — comes with a significant time cost relative to how they might otherwise have spent their time. Next, I think many of us have the sense that AIs might be less performant on large and complex repositories, which is a difference from the benchmark-style evidence and from some previous work. And then, low AI reliability: maybe the AIs are performant on these kinds of tasks, but only 50% of the time, or 80%, or 20% — so at the very least you need to check their work afterwards, and perhaps even spend time correcting it, which is something we see quite a lot on these issues.

One thing from the factors with an unclear effect that I'll mention briefly — I'm happy to talk about it later — is below-average use of AI tools, which came up in the public discussion. It's in the unclear column because there's evidence both for and against, which is true for many of the factors here; we don't have anything conclusive to say, and we're still working on this line of work.

Here are some caveats, all important. First, obviously, we do not provide evidence about all software developers or tasks. These are extremely experienced developers working on extremely complex, long-lived open-source repositories. In my own work, I'm not as expert in the relevant sense as these people are, and I'm working on much smaller repositories; I feel more comfortable saying that, even at this time, I was sped up by AI tools, even if these developers weren't. This setting is weird.
It's weird for the same reasons it's interesting: this unusual developer population. Second, the experiment is concentrated in March 2025. As I mentioned, we know AI progress is rapid; perhaps this result will already have changed by the time I'm giving you this talk.

So there's a kind of puzzle suggested here: the benchmark-style evidence gives a very impressive sense of what AI capabilities look like today, whereas the more economic-style evidence — I'd include labor market impacts here too, in addition to our field experiments — looks somewhat more bearish, or unimpressive. Why is the former not translating into the latter? At least naively, there seems to be a clash. How might we go about resolving this puzzle?

One possibility is that we did in fact mess something up. That's still live and on the table. Maybe the developers really are not very capable at using AI, and if we continue to run this experiment — as in fact we are — they'll become more familiar with the tools and get productivity benefits they weren't getting at the time. I'm a little skeptical of that story, but it's one possibility. Another, which economists like to bring up, is that we're not incentivizing these developers to finish quickly: we pay them per hour, which we do for external-validity reasons. Looking through their videos, I really do not think they're developing differently in accordance with these incentives, but that's certainly one possibility on the table. Another possibility, more statistical in nature: this is a small study, and you shouldn't over-update from small studies. We are doing bigger things that I'm excited to release at some point.
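As an aside, the headline estimate is, at its core, a comparison of completion times between the randomized conditions. The sketch below illustrates the idea with a geometric-mean ratio on made-up numbers; the study's actual analysis is a regression with additional controls (issue difficulty forecasts, developer effects), so treat this as a toy version.

```python
import math

def ai_time_ratio(times_ai_allowed, times_ai_disallowed):
    """Geometric-mean ratio of completion times, AI-allowed vs. AI-disallowed.
    A value above 1 means tasks took longer when AI was allowed (a slowdown);
    below 1 would mean a speed-up. Toy estimator: no covariate adjustment."""
    mean_log_ai = sum(math.log(t) for t in times_ai_allowed) / len(times_ai_allowed)
    mean_log_no = sum(math.log(t) for t in times_ai_disallowed) / len(times_ai_disallowed)
    return math.exp(mean_log_ai - mean_log_no)

# Hypothetical issue completion times in minutes.
disallowed = [45, 90, 120, 30, 60]
allowed = [t * 1.19 for t in disallowed]  # constructed 19% slowdown, as in the study
ratio = ai_time_ratio(allowed, disallowed)  # ≈ 1.19
```

Working in log time is the standard choice here because completion times are heavily right-skewed, and a multiplicative effect ("19% more time") is the natural quantity to report.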
OK, but let's assume we haven't messed something up, and this is a result we think does hold up. How could we resolve the puzzle? One possibility, as I alluded to briefly, is that reliability needs to be very high to save time: the AI needs to get the problems developers are putting in correct something like 95 to 99% of the time in order for developers to tab-tab-tab through and not spend lots of time verifying the AI's work, which of course is pretty costly from a time perspective. Another possibility is SWE-bench-like, or algorithmic, scoring versus mergeability-like scoring. SWE-bench scores are not trying to account for whether the code is maintainable by other people in the future, or whether it matches quality considerations that aren't captured by the unit tests. Perhaps AIs really are performant according to SWE-bench-like scoring, but not according to the kind of more holistic scoring we actually care about. Low- versus high-context baseliners: as I mentioned previously, these developers are just much more skilled humans, and relative to those humans, perhaps the AIs are less capable. Task distribution: maybe these are just different kinds of tasks — in particular, messier than the benchmark-style tasks — and maybe that's explaining what's going on here. Suboptimal capability elicitation: a huge amount of work has gone in at METR to making the agents as performant as possible, given the underlying models, on our kinds of tasks, and that involves churning through a load of AI tokens. Perhaps that was less the case for Cursor in particular at the time we completed the study. And then, interdependence across tasks.
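To make the reliability point above concrete before moving on: a simple expected-time model shows why break-even reliability can sit very high when verification and correction are costly. All the numbers and function names here are hypothetical, purely for illustration.

```python
def expected_delegation_time(t_prompt, t_verify, t_fix, p_correct):
    """Expected human minutes when delegating a task to an AI: you always
    pay for prompting and for verifying the output; on a failure you also
    pay to fix it. (Toy model -- real workflows retry, re-prompt, etc.)"""
    return t_prompt + t_verify + (1.0 - p_correct) * t_fix

def breakeven_reliability(t_self, t_prompt, t_verify, t_fix):
    """Reliability at which delegating matches doing the task yourself:
    solve t_prompt + t_verify + (1 - p) * t_fix = t_self for p."""
    return 1.0 - (t_self - t_prompt - t_verify) / t_fix

# Hypothetical numbers: a 10-minute task for the expert, 2 minutes to
# prompt, 6 minutes to review the AI's diff, 40 minutes to untangle a
# bad attempt.
p_star = breakeven_reliability(t_self=10, t_prompt=2, t_verify=6, t_fix=40)
# p_star ≈ 0.95: below 95% reliability, delegation loses time on net.
```

The qualitative takeaway is robust to the exact numbers: whenever the human could do the task quickly themselves and failed AI attempts are expensive to untangle, the reliability bar for delegation to pay off is high.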
Maybe if humans can complete task A and task B, but AIs can only complete task A — and of course can do task A faster — it still makes sense for humans to do both task A and task B, and not delegate task A, because they need to know the outputs, to know how task A was completed, in order to reliably complete task B. I think that's part of what's going on: you need to maintain context as you work through these subtasks.

Lastly, I'll say that we are hiring — not just for the kind of work you've seen being extended to ever-longer tasks, ever more ambitious RCTs, and even more sources of evidence from which we can triangulate the truth about AI capabilities, but for much more besides. You can find this at metr.org/careers. In particular, I'm excited about research engineers and research scientists who might be hiding in the current audience — not just research types with academic experience, but very much scrappy startup people as well. And we're also hiring for a director of operations. With that, thank you very much for listening.