Can you prove AI ROI in Software Eng? (Stanford 120k Devs Study) – Yegor Denisov-Blanch, Stanford

Channel: aiDotEngineer

Published at: 2025-12-11

YouTube video id: JvosMkuNxF8

Source: https://www.youtube.com/watch?v=JvosMkuNxF8

So companies spend millions on AI tools for software engineering. But do we actually know how well these tools work in the enterprise, or are they all hype? To answer this, for the past two years we've been researching the impact of AI on software engineering productivity. Our research is time series, because we look at Git history data, meaning we can go back in time. It's also cross-sectional, because we cut across companies. The way we measure most of the impact is with a machine learning model that replicates a panel of human experts. The way this works is that you imagine a software engineer who writes a code commit, and that commit is evaluated by multiple panels of 10 to 15 independent experts, who score it on implementation time, maintainability, and complexity, and then produce an output evaluation. We took the labels from these panels across millions of evaluations and trained a model to replicate the panel of experts, meaning we can deploy this at scale. And if there are ever any doubts about the model's output, you can always assemble your own panel and see that it correlates pretty well with reality.
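To make the setup concrete, here is a minimal sketch, not the actual Stanford model: it trains an off-the-shelf regressor on simulated panel scores so commits could be scored at scale. The features, data, and model choice are all hypothetical placeholders.

```python
# Minimal sketch (not the actual Stanford model): train a regressor to mimic
# averaged expert-panel scores for code commits, using made-up data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-commit features (e.g. diff size, churn, nesting depth).
n_commits = 5_000
X = rng.normal(size=(n_commits, 3))

# In the study, each commit is scored by a panel of 10 to 15 experts on
# implementation time, maintainability, and complexity; here we simply
# simulate the panel's mean score as a noisy function of the features.
panel_mean_score = 5 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.3, n_commits)

X_train, X_test, y_train, y_test = train_test_split(X, panel_mean_score, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("held-out R^2 against the simulated panel:", round(model.score(X_test, y_test), 2))
```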
Today we'll talk about four things.
We'll start off with looking at some of
the things that are driving AI
productivity gains in software. Then
we'll look at an AI practices benchmark
that we developed. We'll then look at
how we propose to measure AI return on
investment in software engineering. And
lastly, we'll finish things off with a
case study.
So here we took 46 teams that were using AI, matched them with 46 similar teams that were not, and measured their net productivity gains from AI quarterly. The shaded area is the middle 50% of the data, and the dark blue line is the median, which as of July of this year stands at about 10% for this cohort.
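As a rough illustration of that quarterly summary, here is a minimal sketch with made-up numbers: for each quarter it takes the distribution of net gains across the 46 AI-using teams and reports the median and the middle 50%.

```python
# Minimal sketch of the cohort summary with simulated (not real) gains:
# per quarter, report the median and the interquartile range across 46 teams.
import numpy as np

rng = np.random.default_rng(1)
quarters = ["2024Q3", "2024Q4", "2025Q1", "2025Q2"]
# Hypothetical per-team net gains (fraction, e.g. 0.10 == +10%) per quarter.
gains_by_quarter = {q: rng.normal(loc=0.05 + 0.02 * i, scale=0.08, size=46)
                    for i, q in enumerate(quarters)}

for q, gains in gains_by_quarter.items():
    p25, median, p75 = np.percentile(gains, [25, 50, 75])
    print(f"{q}: median {median:+.1%}, middle 50% [{p25:+.1%}, {p75:+.1%}]")
```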
I'd like to direct your attention to the fact that the discrepancy between the top performers and the bottom ones is increasing. There's a widening gap. And if we very unscientifically and very illustratively project this forward, we might get something like this, where the top performers are part of a rich-get-richer effect: the successful early AI adopters compound their gains while the strugglers fall further behind. At some point this is going to converge, and this is very directional. But my point here is that if you're a leader in a company, you need to know which cohort you are in right now so that you can course correct, and without measuring the impact of AI on your engineers, you're not going to be able to do that.
So we started investigating the factors that drive these top teams to perform better, and the first thing we looked at is AI usage, basically token spend. In this graph the vertical axis is again the productivity increase, and the horizontal axis is the token usage per engineer per month on a logarithmic scale. What you can see is that the correlation is quite loose, roughly 0.20 linearly, and there is a bit of a death-valley effect around the 10 million token mark, whereby teams using that amount of tokens seem to be doing worse than teams using somewhat fewer tokens. It's very directional, but interesting. Nevertheless, the conclusion here might be that AI usage quality matters more than AI usage volume.
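To show what such a loose relationship looks like numerically, here is a minimal sketch with simulated data, not the study's data: it correlates per-team productivity lift with log10 of monthly token usage and reports the linear R², which in the talk comes out around 0.20.

```python
# Minimal sketch with simulated data (not the study's data): how loose an
# R^2 of ~0.20 between productivity lift and log token usage looks.
import numpy as np

rng = np.random.default_rng(2)
tokens_per_eng_month = 10 ** rng.uniform(5, 8, size=46)   # 100k to 100M tokens
lift = 0.04 * np.log10(tokens_per_eng_month) - 0.15 + rng.normal(0, 0.08, 46)

log_tokens = np.log10(tokens_per_eng_month)
r = np.corrcoef(log_tokens, lift)[0, 1]
print("R^2 of productivity lift vs. log token usage:", round(r ** 2, 2))
```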
We dug deeper and asked whether the environment in which the engineers work affects the productivity gains from AI, and we came up with an environment cleanliness index. It's quite experimental; it's a composite score that looks at tests, at types and documentation, at modularity, and at code quality. That index is on the horizontal axis here, from 0 to 1, and on the vertical axis once again you have the productivity lift relative to teams not using AI. What you can see is that there's an R² of about 0.40, meaning a pretty decent correlation between environment cleanliness and the productivity gains from using AI. The takeaway here is to invest in codebase hygiene to unlock these AI productivity gains.
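Here is a minimal sketch of what such a composite index could look like; the sub-scores and the equal weighting are assumptions for illustration, not the actual formula behind the study's index.

```python
# Minimal sketch of a composite "environment cleanliness" index; the inputs
# and equal weighting are assumptions, not the study's actual formula.
def cleanliness_index(test_coverage: float,
                      typing_and_docs: float,
                      modularity: float,
                      code_quality: float) -> float:
    """Each input is normalized to [0, 1]; the index is their mean."""
    components = [test_coverage, typing_and_docs, modularity, code_quality]
    assert all(0.0 <= c <= 1.0 for c in components)
    return sum(components) / len(components)

# Example: decent tests, weak docs, middling modularity and quality.
print(cleanliness_index(0.7, 0.4, 0.5, 0.6))  # -> 0.55
```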
We dug deeper to illustrate this concept. On this graph, the vertical axis shows the percentage of tasks that might be completed by AI, split into three colors. Green means that AI can do most of the work for that task in that sprint. Yellow means that AI can help someone. And red means that AI is not very useful. This is quite illustrative, but it conveys the point. Any codebase at any point in time sits on a vertical line across this graphic, and what you can see is that clean code amplifies AI gains. Secondly, you need to manage your codebase entropy, your codebase tech debt, because if you just use AI unchecked, it accelerates this entropy, which pushes and degrades your cleanliness to the left, and you as a human need to push on the other side to improve or maintain that cleanliness to keep reaping the benefits from AI. Thirdly, it's important that engineers know when to use AI and when not to use it. What happens when they don't is the line on the left: you have AI outputs that are rejected or need heavy rewriting, which leads to engineers losing trust in AI, saying this just doesn't work, I'm not going to use it, which then further collapses your AI gains.
Now we asked whether we can look not only at usage but at how these companies and these engineers are using AI, and we came up with an AI engineering practices benchmark. The way this works is that we can scan your codebase and detect AI fingerprints, or artifacts, basically traces of how your team is using AI. It's quite directional at this point, but evolving. We can quantify this based on the percentage of your active engineering work that uses each AI pattern, and then we repeat this monthly using git history. There are a few levels. Level zero is where humans are not using AI and write all of the code. Level one is personal use, where engineers are not sharing prompts across the team and not versioning them. Level two is team use, where teams are sharing these prompts and rules. Level three is more sophisticated: AI autonomously does specific tasks, though maybe not the entire workflow. And level four is agentic orchestration, where AI runs the entire process. This is going to be an open-source tool which you can leverage if you sign up on our research portal.
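As a rough illustration of that maturity ladder, here is a minimal sketch that maps detected AI fingerprints to levels 0 through 4; the artifact names and decision rules are illustrative guesses, not the actual rules of the benchmark tool.

```python
# Minimal sketch of the benchmark's maturity ladder; inputs and rules are
# illustrative assumptions, not the open-source tool's actual detection logic.
def ai_practice_level(shares_prompts_in_repo: bool,
                      versions_prompts: bool,
                      autonomous_task_configs: bool,
                      agentic_orchestration: bool,
                      pct_work_with_ai: float) -> int:
    """Map detected AI 'fingerprints' in a codebase to levels 0-4."""
    if pct_work_with_ai == 0:
        return 0                      # humans write all of the code
    if agentic_orchestration:
        return 4                      # AI runs the entire process
    if autonomous_task_configs:
        return 3                      # AI autonomously does specific tasks
    if shares_prompts_in_repo and versions_prompts:
        return 2                      # team-level shared, versioned prompts
    return 1                          # personal, unshared AI use

print(ai_practice_level(True, True, False, False, 0.35))  # -> 2
```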
We applied this benchmark to one of the companies in our research dataset, and we saw this. The company had two business units with equal access to AI tools: same licenses, same spend, same tools, same everything. But the adoption rate and the usage rate were very different by business unit. On the left, the first business unit, as you can see in the blue area, seemed to be using AI a lot more, for almost 40% of their work, whereas on the right the second business unit seemed to lag behind. The takeaway here is that access to AI, and even AI usage, doesn't guarantee that AI is going to be used the same way across a company. As a leader, you really want to understand not just whether your engineers are using AI but also how they are using it.
Great. Now let's dive into how we actually measure AI return on investment in software engineering. Ideally, we would measure this based on business outcomes: I give my engineers AI, and then I make more money, more revenue, better net revenue retention, whatever business KPI you want to track. The problem is that there's too much noise between the treatment, giving AI, and the result, the business outcome, and on top of this there are confounding variables such as your sales execution, the macro environment, and your product strategy. So although that would be ideal, unfortunately we need to find alternative paths, and the most logical one is to look at engineering outcomes, because there is a clear signal there. But here we need to go beyond measuring AI usage into measuring engineering outcomes. There are a few caveats, and this topic is quite heavily discussed, so I want to mention some of them.
The first caveat is that this assumes our product function can properly direct that increased capacity into something that generates value. If they aren't directing it, then it's a product problem, which sits quite close to engineering but is slightly different. The second caveat is that this assumes engineering is a meaningful bottleneck for value, which frankly it typically is, and that you can guard against Goodhart's law by using a balanced set of metrics and by having a company culture that doesn't weaponize them. And thirdly, AI is still very new, and measuring proxy metrics is still better than not measuring. There are going to be winners and losers in this AI race, and progress is better than perfection here. Metrics don't need to be flawless to be useful.
So there are two parts to getting the ROI from AI: you need to measure usage, and you need to measure engineering outcomes. Let's start with usage. There are really two buckets for enterprises; there are more in a research environment, but to keep it simple, there's access-based and there's usage-based. Access-based looks at when people got access to the tool. Here you can run a pilot group, give that group AI, and compare it to a similar group without AI, or you can measure the same team across time. The problem is that access-based is noisy, and the gold standard is really usage-based, which uses telemetry from the APIs of these coding assistants to give you the data to know who's using AI and where. The caveat is that vendor APIs differ: unfortunately, tools like GitHub Copilot aggregate the data, while other tools like Cursor give you more granular data. The big takeaway is that you can measure impact retroactively by using git history, so you don't need to set up an experiment now and wait six months. If you've already adopted AI, you can go back in time and do this. It's quite easy.
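As a minimal sketch of the retroactive idea, the snippet below pulls commit timestamps from git history and splits activity around a known adoption date. The adoption date is hypothetical, and commit counts stand in for whatever outcome metric you actually care about.

```python
# Minimal sketch of a retroactive before/after split using git history.
# The adoption date is hypothetical; commit count is only a stand-in metric.
import subprocess
from datetime import datetime, timezone

ADOPTION_DATE = datetime(2025, 5, 1, tzinfo=timezone.utc)  # hypothetical

def commit_dates(repo_path: str) -> list[datetime]:
    # Strict ISO 8601 committer dates, one per line, parsed into datetimes.
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%cI"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return [datetime.fromisoformat(s) for s in out]

def before_after(repo_path: str) -> tuple[int, int]:
    dates = commit_dates(repo_path)
    before = sum(d < ADOPTION_DATE for d in dates)
    after = sum(d >= ADOPTION_DATE for d in dates)
    return before, after

# Usage: before, after = before_after("/path/to/repo")
```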
Now we've seen usage; let's look at how we actually measure engineering outcomes and what metrics we propose. Here we have the framework we propose, which uses a primary metric and guardrail metrics. The primary metric is engineering output. It's not lines of code, it's not PR counts, and it's not DORA; it's based on the machine learning model that replicates the panel of experts. The second set of metrics are the guardrail ones, which you want to maintain at a healthy level but not maximize; it truly doesn't make sense to maximize them. There are three categories within the guardrails: rework and refactoring; quality, tech, and risk; and people and DevOps. For the third bucket it's important to highlight that these are not productivity metrics. They're useful, but you cannot just maximize them to maximize developer productivity; they fall off at some point. The goal here might be to keep your guardrail metrics healthy while increasing the primary metric to whatever degree possible.
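Here is a minimal sketch of the primary-plus-guardrails idea: treat the primary output metric as the thing to improve, and only count a period as healthy if the guardrails stay within agreed bounds. The metric names and thresholds below are illustrative assumptions, not the study's definitions.

```python
# Minimal sketch of "keep guardrails healthy while growing the primary metric";
# metric names and bounds are illustrative assumptions, not the study's.
GUARDRAIL_BOUNDS = {
    "rework_share": (0.0, 0.15),        # fraction of output that is rework
    "code_quality": (6.0, 10.0),        # maintainability score, 0-10
    "review_latency_days": (0.0, 2.0),  # people / DevOps health proxy
}

def guardrails_healthy(metrics: dict[str, float]) -> bool:
    return all(lo <= metrics[name] <= hi
               for name, (lo, hi) in GUARDRAIL_BOUNDS.items())

period = {"rework_share": 0.22, "code_quality": 7.1, "review_latency_days": 1.3}
print("guardrails healthy:", guardrails_healthy(period))  # -> False (rework too high)
```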
Now let's dive into a case study. Here we worked with a large enterprise. We took a team of 350 people under a vice president and we measured pull requests. The reason we did this is to illustrate that you cannot rely on pull request counts to understand whether AI is helping you. This team adopted AI in May of this year, and we measured the four months before and the four months after. We saw a 14% increase. Great, that's fantastic. But what about reviewer burden? What about code quality? So we measured code quality. Think of code quality as maintainability on a scale from 0 to 10, with bands; it uses our methodology, which you can read online. What you see is that in the pre-AI period their code quality was quite stable and consistent, and once they adopted AI, two things happened: code quality decreased, and it became more erratic.
Next, we took a look at our metric, engineering output. It's not lines of code. Here, for every month, you see the sum of the output delivered that month, broken down into four buckets: rework, refactoring, added, and removed. Rework is when you're changing or editing code that's still fresh, so it's recent. Refactoring is when you're changing code that's a bit older. Added and removed are pretty self-explanatory. You can also see benchmarks, so we can benchmark this company against similar companies in their industry. And here AI usage had two effects. First, rework went up by 2.5 times, which is really bad. And effective output, which is roughly a proxy for productivity, didn't really change.
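To make the bucket definitions concrete, here is a minimal sketch that classifies a change by the age of the code it touches. The 21-day cutoff between rework (recent code) and refactoring (older code) is an assumed threshold for illustration, not the study's actual definition.

```python
# Minimal sketch of the four output buckets described above; the 21-day
# cutoff between "rework" and "refactoring" is an assumption.
from datetime import timedelta
from typing import Optional

RECENT = timedelta(days=21)  # hypothetical freshness cutoff

def classify_change(kind: str, touched_code_age: Optional[timedelta] = None) -> str:
    """kind is 'add', 'delete', or 'modify'; age applies to modified code."""
    if kind == "add":
        return "added"
    if kind == "delete":
        return "removed"
    # Modified code: bucket by how fresh the code being changed is.
    return "rework" if touched_code_age <= RECENT else "refactoring"

print(classify_change("modify", timedelta(days=5)))    # -> rework
print(classify_change("modify", timedelta(days=120)))  # -> refactoring
print(classify_change("add"))                          # -> added
```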
So what's the conclusion here? Let's do a recap. We saw that PRs went up by 14%, but this is inconclusive, because more PRs doesn't mean better. We saw that code quality decreased by 9%, which is problematic. We saw that effective output didn't increase meaningfully. And we saw that rework increased by a lot. So the question here is: what is the ROI of this AI adoption? It might be negative. What I want to point out is that had this company not measured this more thoroughly and simply measured PR counts, they would have thought, hey, we're doing great, we increased our productivity by 14%. Let's run the numbers: that's many millions of dollars. Does that offset the AI licenses? Sure it does. The other thing is that I don't think this company should abandon AI. They should use this data to understand what they're doing wrong and how they can improve, because AI is here to stay. It's a tool that's going to transform how engineers work, and you can't just abandon it.
Great. So this concludes our insights for today. If you've enjoyed this talk and would like similar insights for your company, I invite you to participate in our research. Everything you've seen today can be accessed by participating in our research, some of it through live dashboards in our research portal. I'd especially like to invite companies that have access to Cursor Enterprise to participate, because we have a high need for that data so we can publish papers on the granularity of using AI in software engineering. You can sign up at softwareengineeringproductivity.stanford.edu.
Thank you so much.