Can you prove AI ROI in Software Eng? (Stanford 120k Devs Study) – Yegor Denisov-Blanch, Stanford
Channel: aiDotEngineer
Published at: 2025-12-11
YouTube video id: JvosMkuNxF8
Source: https://www.youtube.com/watch?v=JvosMkuNxF8
So companies spend millions on AI tools for software engineering. But do we actually know how well these tools work in the enterprise, or are they just hype? To answer this, for the past two years we've been researching the impact of AI on software engineering productivity. Our research is time series, because we look at git historical data, meaning we can go back in time. And it's also cross-sectional, because we cut across companies.

The way we measure most of the impact is with a machine learning model that replicates a panel of human experts. Imagine a software engineer writes a code commit, and that commit is evaluated by multiple panels of 10 to 15 independent experts, who rate it across implementation time, maintainability, and complexity, and then produce an output evaluation. We took the labels from these panels across millions of evaluations and trained a model to replicate the panel of experts, meaning we can deploy this at scale. And if there's ever any doubt about the model's output, you can always assemble your own panel and see that it correlates well with reality.

Today we'll talk about four things. We'll start by looking at some of the things driving AI productivity gains in software. Then we'll look at an AI practices benchmark that we developed. We'll then look at how we propose to measure AI return on investment in software engineering. And lastly, we'll finish with a case study.

So here we took 46 teams that were using AI, matched them with 46 similar teams that were not using AI, and measured their net productivity gains from AI quarterly. The shaded area is the middle 50% of the data, and the dark blue line is the median, which as of July of this year stands at about 10% for this cohort. I'd like to direct your attention to the fact that the discrepancy between the top performers and the bottom ones is increasing: there's a widening gap. And if we very unscientifically and very illustratively project this forward, we might get something like this, where the top performers are part of a rich-get-richer effect: successful early AI adopters might compound their gains while the strugglers fall further behind. At some point this converges, and this is very directional. But my point is that if you're a leader in a company, you need to know which cohort you're in right now so you can course-correct, and without measuring the impact of AI on your engineers, you won't be able to do that.

So we started investigating the factors that drive these top teams to perform better, and the first thing we looked at is AI usage, or basically token spend. In this graph you have, on the vertical axis, the productivity increase, and on the horizontal axis, the token usage per engineer per month on a logarithmic scale. What you can see is that the correlation is quite loose, about 0.20 linearly, and there is a bit of a "death valley" effect around the 10-million-token mark, where teams using that amount of tokens seem to do worse than teams using a bit less. It's very directional, but interesting. Nevertheless, the conclusion here might be that AI usage quality matters more than AI usage volume.
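To make the usage-versus-lift analysis concrete, here is a minimal sketch of how one might test that correlation, assuming a hypothetical per-team dataset; the numbers and variable names are illustrative, not the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-team data (illustrative, NOT the study's dataset):
# monthly token usage per engineer and measured productivity lift
# relative to a matched non-AI team.
tokens_per_engineer = np.array([2e5, 1e6, 5e6, 1e7, 2e7, 8e7, 3e8])
productivity_lift = np.array([0.04, 0.08, 0.12, 0.02, 0.05, 0.15, 0.18])

# Usage spans orders of magnitude, so correlate against log10(tokens),
# mirroring the talk's logarithmic horizontal axis.
r, p = stats.pearsonr(np.log10(tokens_per_engineer), productivity_lift)
print(f"Pearson r = {r:.2f} (p = {p:.2f})")
# A loose r (~0.2 in the talk) suggests usage volume alone explains little;
# a dip near the 1e7-token mark would echo the "death valley" effect.
```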
We dug deeper and asked: does the environment in which the engineers work affect the productivity gains from AI? We came up with an environment cleanliness index. It's quite experimental: a composite score that looks at tests, types and documentation, modularity, and code quality. That index is on the horizontal axis here, from 0 to 1, and on the vertical axis, once again, you have the productivity lift relative to teams not using AI. What you can see is an R² of 0.40, meaning a pretty decent correlation between environment cleanliness and productivity gains from using AI. The takeaway here is to invest in codebase hygiene to unlock these AI productivity gains.

We dug deeper to illustrate this concept. On this graph, the vertical axis shows the percentage of tasks that might be completed by AI, in three colors. Green means AI can do most of the work for that task in that sprint. Yellow means AI can help someone. And red means AI is not very useful. This is quite illustrative, but it conveys the point. Any codebase at any point in time sits on a vertical line across this graphic. What you can see, first, is that clean code amplifies AI gains. Second, you need to manage your codebase entropy, your tech debt, because if you just use AI unchecked, it accelerates that entropy, which pushes your cleanliness to the left and degrades it; you as a human need to push on the other side to improve or maintain that cleanliness and keep reaping the benefits from AI. Third, engineers need to know when to use AI and when not to. What happens when they don't is the line on the left: AI outputs get rejected or need heavy rewriting, which leads engineers to lose trust in AI ("okay, this just doesn't work, I'm not going to use it"), which further collapses your AI gains.

Then we asked whether we could look not only at usage but at how these companies and engineers are using AI, and we came up with an AI engineering practices benchmark. The way this works is that we can scan your codebase and detect AI fingerprints, or artifacts: basically traces of how your team is using AI. It's quite directional at this point, but evolving. We can quantify this based on the percentage of your active engineering work that uses each AI pattern, and then we repeat this monthly using git history. There are a few levels. Level zero is where humans write all of the code and don't use AI at all. Level one is personal use, where engineers are not sharing prompts across the team or versioning them. Level two is team use, where teams are sharing prompts and rules. Level three is more sophisticated: AI autonomously does specific tasks, though maybe not the entire workflow. And level four is agentic orchestration, where AI runs the entire process. This is going to be an open-source tool, which you can leverage if you sign up on our research portal.
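To make these levels concrete, here is a minimal sketch of how detected fingerprints might map to a maturity level; the signal names and heuristics are hypothetical, not the benchmark's actual detection logic.

```python
def ai_maturity_level(signals: dict) -> int:
    """Map AI 'fingerprints' detected in a repo to a maturity level.
    Signal names and ordering are hypothetical heuristics, not the
    benchmark's actual detection logic."""
    if signals.get("agentic_orchestration"):    # level 4: AI runs the whole process
        return 4
    if signals.get("autonomous_task_commits"):  # level 3: AI completes specific tasks
        return 3
    if signals.get("shared_prompts_or_rules"):  # level 2: prompts/rules versioned and shared
        return 2
    if signals.get("individual_ai_usage"):      # level 1: personal, unshared AI use
        return 1
    return 0                                    # level 0: humans write all the code

# Example: a team that versions shared prompt files but runs no autonomous agents.
print(ai_maturity_level({"shared_prompts_or_rules": True, "individual_ai_usage": True}))  # -> 2
```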
We applied this benchmark to one of the companies in our research dataset, and we saw this. The company had two business units with equal access to AI tools: same licenses, same spend, same tools, same everything. But adoption and usage differed sharply by business unit. On the left, the first business unit, as you can see in the blue area, seemed to be using AI a lot more, for almost 40% of their work, whereas the second business unit, on the right, seemed to lag behind. The takeaway is that access to AI, and even AI usage, doesn't guarantee that AI is going to be used in the same way across a company. As a leader, you really want to understand not just whether your engineers are using AI, but how.

Great. Now let's dive into how we actually measure AI return on investment in software engineering. Ideally, we would measure this based on business outcomes: I give my engineers AI, and then I make more money, more revenue, higher net revenue retention, whatever business KPI you want to track. The problem is that there's too much noise between the treatment (giving AI) and the result (the business outcome), and on top of this there are confounding variables such as your sales execution, the macro environment, and your product strategy. So although that would be ideal, we need to find alternative paths, and the most logical one is simply to look at the engineering outcomes, because there is a clear signal. But here we need to go beyond measuring AI usage and into measuring engineering outcomes.

There are a few caveats, and this topic is heavily discussed, so I want to mention some of them. The first is that this assumes our product function can properly direct that increased capacity into something that generates value; if it isn't directing it, that's a product problem, which, although it sits quite close to engineering, is slightly different. The second caveat is that this assumes engineering is a meaningful bottleneck for value, which frankly it typically is, and that you can guard against Goodhart's law by using a balanced set of metrics and by having a company culture that doesn't weaponize them. And third, AI is still very new, and measuring proxy metrics is still better than not measuring. There are going to be winners and losers in this AI race, and progress is better than perfection here. Metrics don't need to be flawless to be useful is what I want to illustrate.

So there are two parts to getting the ROI from AI: you need to measure usage, and you need to measure engineering outcomes. Let's start with usage. For enterprises there are really two buckets; there are more in a research environment, but to keep it simple: access-based and usage-based. Access-based basically looks at when people got access to the tool. Here you can run a pilot group, give that group AI, and compare it to a similar group without AI, or you can measure the same team across time.
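As a minimal sketch of the access-based approach, assuming you already have a per-engineer output score (for example, from a model like the expert-panel one described earlier), a pilot-versus-control comparison could look like this; all names and values are illustrative.

```python
import statistics

# Hypothetical monthly output scores per engineer (illustrative values)
# for a pilot group given AI access and a matched control group without it.
pilot_output = [7.2, 6.8, 7.9, 8.1, 7.5]
control_output = [6.9, 7.0, 6.6, 7.1, 6.8]

# Relative lift of the pilot group over the control group.
lift = statistics.mean(pilot_output) / statistics.mean(control_output) - 1
print(f"Access-based estimate of AI lift: {lift:+.1%}")
# Caveat: access != usage, so treat this as a noisy estimate.
```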
The problem is that access-based measurement is noisy. The gold standard is usage-based, which uses telemetry from these coding assistants' APIs to give you the right data on who's using AI and where. The caveat is that vendor APIs differ: unfortunately, tools like GitHub Copilot aggregate the data, while tools like Cursor give you more granular data. The big takeaway is that you can measure impact retroactively by using git history, so you don't need to set up an experiment now and wait six months; if you've already adopted AI, you can go back in time and do this. It's quite easy.

Now that we've seen usage, let's look at how we actually measure engineering outcomes, and what metrics we propose. Our framework uses a primary metric and guardrail metrics. The primary metric is engineering output: it's not lines of code, it's not PR counts, and it's not DORA; it's based on the machine learning model that replicates the panel of experts. The second set are the guardrail metrics, which you want to maintain at a healthy level but not maximize; it truly doesn't make sense to maximize them. There are three categories within the guardrails: rework and refactoring; quality, tech, and risk; and people and DevOps. It's important to highlight that these guardrail metrics are not productivity metrics. They're useful, but you cannot just maximize them to maximize developer productivity; they fall off at some point. The goal is to keep your guardrail metrics healthy while increasing the primary metric to whatever degree possible.

Now let's dive into a case study. Here we worked with a large enterprise. We took a team of 350 people under a vice president, and we measured pull requests. The reason we did this is to illustrate that you cannot measure pull requests to understand whether AI is helping you. This team adopted AI in May of this year, and we measured the four months before and the four months after. We saw a 14% increase in PRs. Great, that's fantastic. But what about reviewer burden? What about code quality?

So we measured code quality. Think of it as maintainability on a scale from 0 to 10, with bands; it uses our methodology, which you can read about online. What you see is that in the pre-AI period, their code quality was quite stable and consistent. Once they adopted AI, two things happened: code quality decreased, and code quality became more erratic.

Next, we looked at our metric, engineering output; again, it's not lines of code. For every month, you see the sum of the output delivered that month, broken down into four buckets. Rework is when you're changing or editing code that's still fresh, so it's recent; refactoring is when you're changing code that's a bit older; added and removed are self-explanatory. You can also see benchmarks, so we can compare this company against similar companies in its industry. Here, AI usage had two effects. First, rework went up by 2.5 times, which is really bad. And second, effective output, which is roughly a proxy for productivity, didn't really change.
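Since the rework spike is the headline finding, it's worth noting that the rework/refactoring split hinges on the age of the edited code. Here is a minimal sketch with a hypothetical 21-day window; the talk gives no exact cutoff, so the threshold and function are illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical cutoff: the talk separates rework (edits to recent code)
# from refactoring (edits to older code) but gives no exact threshold.
REWORK_WINDOW = timedelta(days=21)

def classify_change(change_type: str, original_commit_date: datetime,
                    edit_date: datetime) -> str:
    """Bucket a change the way the output chart does: added, removed,
    rework, or refactoring (based on the age of the edited code)."""
    if change_type in ("added", "removed"):
        return change_type
    age = edit_date - original_commit_date
    return "rework" if age <= REWORK_WINDOW else "refactoring"

# Example: editing a line first committed 10 days earlier counts as rework.
print(classify_change("modified", datetime(2025, 5, 1), datetime(2025, 5, 11)))  # -> rework
```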
So what's the conclusion here? Let's recap. PRs went up by 14%, but this is inconclusive, because more PRs doesn't mean better. Code quality decreased by 9%, which is problematic. Effective output didn't increase meaningfully. And rework increased by a lot. So the question is: what is the ROI of this AI adoption? It might be negative. And what I want to point out is that had this company not measured this thoroughly and simply measured PR counts, they would have thought: hey, we're doing great, we increased our productivity by 14%. Run the numbers and that's many millions of dollars; does it offset the AI licenses? Sure it does. The other thing is that I don't think this company should abandon AI. They should use this data to understand what they're doing wrong and how they can improve, because AI is here to stay. It's a tool that's going to transform how engineers work, and you can't just abandon it.

Great. This concludes our insights for today. If you've enjoyed this talk and would like similar insights for your company, I invite you to participate in our research. Everything you've seen today can be accessed by participating in our research, some of it through live dashboards in our research portal. I'd especially like to invite companies that have access to Cursor Enterprise to participate, because we have a high need for this data so we can publish papers on the granularity of AI usage in software engineering. You can sign up at softwareengineeringproductivity.stanford.edu. Thank you so much.