Leadership in AI Assisted Engineering – Justin Reock, DX (acq. Atlassian)
Channel: aiDotEngineer
Published at: 2025-12-19
YouTube video id: PmZDupFP3UM
Source: https://www.youtube.com/watch?v=PmZDupFP3UM
Thanks for joining me in one of the later-day sessions. Looks like we kept a lot of people here; this is a nice full room, great to see. We're going to go through a lot of content in a short amount of time, so I'm going to get right into it. If you want to go deeper into any of this, we've published an AI strategy playbook for senior executives. I won't have time to go quite as deep on a lot of this content, but the playbook is a nice PDF you can refer to later. If you missed this QR code, don't worry, I'll show it again at the end.

So, what is the current impact of GenAI? Nobody knows, right? We've got Google on the one hand telling us that everyone's 10% more productive. That's interesting. They're Google; they were already pretty productive to begin with. But then we have the now-infamous METR study, which has some flaws in how it was put together, showing a 19% decrease in productivity when using coding assistants. So there's a lot of volatility, a lot of variability. What was really interesting about that study, flaws aside, is that every engineer who took part felt more productive, yet the data showed they were actually less productive. Kind of interesting, right? There's an induced flow state that makes us feel really good about what we're doing. So we need to address this.

DORA has put out some really good research on this too, but it's based on industry averages: what do we see across a large sample when, in this case, AI adoption increases by 25%? We see modest but positive-leaning indicators: a 7.5% increase in documentation quality and roughly a 3.4% increase in code quality. At least that's not leaning in the other direction, right?

When we dug through some of DX's data, and we're the developer productivity measurement company, so we have lots of aggregate data to look at, we found the same thing. Looking at averages, we see about a 2.6% increase in overall change confidence, which is the percentage of people who answered positively that they feel confident in the changes they're putting into production. A similar positive-leaning average for code maintainability, another qualitative metric, and a 1% reduction in change failure rate, which, against an industry benchmark of 4%, is not insignificant.

But this is not the full story, because here's what we saw when we broke the same studies down per company. Every bar here represents a company. Some are seeing 20% increases in change confidence while others are seeing 20% decreases. We're seeing extreme volatility, which is why the averages look so innocuous: they're belying the greater story of variability. We see the same thing with code maintainability, and the same thing with change failure rate. That top bar is a 2 percentage point increase in change failure rate, and with an industry benchmark of 4%, that means shipping as much as 50% more defects than before. We want to make sure we're on the lower end of this.
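To make that defect arithmetic explicit, here's a trivial check; it's just a sketch using the figures from the slide (a 4% baseline and a 2-point increase):

```python
# Relative increase in shipped defects when change failure rate (CFR)
# rises by 2 percentage points from a 4% industry baseline.
baseline_cfr = 0.04   # industry benchmark: 4% of changes cause a failure
observed_cfr = 0.06   # baseline plus a 2 percentage point increase

relative_increase = (observed_cfr - baseline_cfr) / baseline_cfr
print(f"{relative_increase:.0%} more defects shipped")  # -> 50% more defects shipped
```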
But how? What should we be doing? Well, we found some patterns. Some organizations are seeing positive impacts to KPIs, but others are struggling with adoption and even seeing some of these negative impacts. Top-down mandates are not working. Driving toward "we must have 100% adoption of AI"? Great, I'll have the assistant update my README file every morning and I'll be compliant, right? We're not actually moving the needle anywhere when we do that. We also find that a lack of education and enablement has a big negative impact. Some organizations just turn on the tech and expect it to start working, and expect everybody to know the best ways to use it. And there's difficulty measuring the impact, or even knowing what we should be measuring. What metrics should we be looking at? Does utilization really tell us much of the full story of GenAI impact?

This is another graph from DORA. It's a Bayesian posterior distribution, which is an interesting way of representing data. Basically, you want the mass to be on the yellow side of the line, the right side from the audience's view, and you want a sharp peak, which tells you we're pretty confident that this initiative will have this impact. If we look at the top-line initiatives here, they're things like clear AI policies. We want to make sure we have those. We want time to learn: not just giving people materials, but actually giving them space to experiment. These are the types of factors that seem to be moving the needle the most.
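To make that chart concrete, here's a minimal sketch of the kind of posterior it shows, assuming a simple normal-normal model with made-up numbers; this is illustrative only, not DORA's actual methodology:

```python
import numpy as np

# Illustrative posterior over an initiative's effect size (normal-normal model).
# Prior: effect ~ N(0, 1). Observations: hypothetical noisy effect measurements.
prior_mean, prior_var = 0.0, 1.0
observations = np.array([0.9, 1.1, 0.8, 1.2, 1.0])
noise_var = 0.5

# Conjugate update: posterior precision is the sum of prior and data precisions.
post_var = 1.0 / (1.0 / prior_var + len(observations) / noise_var)
post_mean = post_var * (prior_mean / prior_var + observations.sum() / noise_var)

print(f"posterior: N({post_mean:.2f}, {post_var:.3f})")
# Mass well to the right of zero (positive impact) plus a small variance
# (a sharp peak) is exactly what you want to see on that DORA chart.
```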
So we're going to go over some quick tips on how to do all of these things, and again, the guide goes deeper. We want to integrate across the SDLC. For most organizations, writing code has never been the bottleneck. We can increase productivity a bit by helping with code completion, but our biggest bottlenecks are elsewhere in the SDLC; there's a lot more to creating software than just writing code. We want to unblock usage. We can't just say, "Well, we're worried about data exfiltration, so we can't try this thing." No, get creative about it. We've got really good infrastructure out there now, like Bedrock and Fireworks AI, that lets us run powerful models in safe spaces. We have to have open discussions about these metrics: evangelize the wins, and let our engineers know why we're gathering metrics and data and what we're trying to improve. We have to reduce the fear of AI. We have to make sure people understand that this is not a technology that's ready to replace engineers; it's a technology that's really good at augmenting engineers and increasing the throughput of the business. We have to establish better compliance and trust. And we need to tie all of this to employee success. These are new skill sets. AI is not coming for your job, but somebody really good at AI might take your job. As leaders, we have the opportunity to help our employees become more successful with this technology.

So, how do we reduce the fear? First of all, why do we need to? There are a lot of good reasons, but I love to point to Google's Project Aristotle, a 2012 study where Google wanted to figure out the characteristics of highly performant teams. They thought the recipe was just going to be what Google had: a combination of high performers, experienced managers, and basically unlimited resources. They were dead wrong. Overwhelmingly, the biggest indicator of productivity was psychological safety. And that very much applies now.

We also have data like SWE-bench. I'm sure a lot of you have seen it, and there are some impressive benchmarks: agents can do about a third of the things they're asked to do without any human intervention. That means they can't do the other two-thirds, right? Again, we are augmenting, not replacing. We're not ready; we may never be ready. So we need to be very transparent about what we're doing. We need to set very clear intent: we are using this to augment, not to replace. We need to be proactive in how we communicate that, not just wait for people to get upset and possibly scared. We need to say, "We are here to help you, to give you a better developer experience, and to increase the throughput of the business."

And again, we have to have these discussions about metrics. So what metrics should we be looking at? Well, DX again is a developer experience and productivity measurement company, and there are really two classes of metrics, two levers that matter here: speed and quality. We want to increase PR throughput, we want to increase velocity, but not by creating a bunch of slop that turns into tech debt we'll have to deal with later; that just kicks the bottleneck down the road. So we want to look at things like change failure rate, overall perception of quality, change confidence, and maintainability.

And we have three types of metrics we can use. First, telemetry metrics: the things coming out of the API. They're good for some things, but they're not always accurate. Accept-versus-suggest rate was all the rage until we realized that engineers need to click accept in the IDE for the API to even know about it, and even if they do click accept, who's to say they didn't go back and rewrite every line that was suggested? So that provides some context, but we also need to do experience sampling: for instance, add a new field to the PR form that says "I used AI to generate this PR" or "I enjoyed using AI to generate this PR," and get some data that way. And then there's self-reported, or survey, data. We are big on surveys, but let me underscore: we're big on effective surveys. Ninety-percent-plus participation rates, engineered around questions that treat developer experience as a systems problem, not a people problem, because that's what it is. As W. Edwards Deming put it, 90 to 95% of the productivity output of an organization is determined by the system, not the worker. So foundational developer experience and developer productivity metrics still matter the most. Our AI metrics, like utilization, tell us what's happening with the tech, but these core metrics we've learned to trust tell us whether the initiatives are actually working: are we actually moving the needle and getting the outcomes we want to see?
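As a concrete illustration of that experience-sampling idea, here's a minimal sketch that tallies an AI-usage checkbox from pull request descriptions; the checkbox label and data shape are hypothetical, not a DX or GitHub feature:

```python
# Sketch: measure the share of PRs whose authors checked an
# "I used AI to generate this PR" box in the PR template.
AI_CHECKBOX = "[x] I used AI to generate this PR"  # hypothetical template line

def ai_assisted_share(pr_bodies: list[str]) -> float:
    """Fraction of PRs self-reported as AI-assisted."""
    if not pr_bodies:
        return 0.0
    flagged = sum(AI_CHECKBOX.lower() in body.lower() for body in pr_bodies)
    return flagged / len(pr_bodies)

prs = [
    "Fixes login bug.\n[x] I used AI to generate this PR",
    "Refactors billing module.\n[ ] I used AI to generate this PR",
]
print(f"AI-assisted PRs: {ai_assisted_share(prs):.0%}")  # -> 50%
```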
So top companies are looking at different things. We're seeing adoption metrics coming out of Microsoft. They've also got this great metric called a "bad developer day." I won't go into it here, but there's a really good white paper showing all the different telemetry they look at to determine what makes a bad developer day. Dropbox is looking at similar things: adoption metrics like weekly and daily active users, but also quality metrics like change failure rate. And Booking.com is looking at similar things as well.

So we built a framework around this. We were first to market with what we call the DX AI Measurement Framework, very much inspired by DORA, the SPACE framework, and DevEx, just like our Core 4 metric set, which you can ask me about later. We take these metrics and normalize them into three dimensions: utilization, impact, and cost. You can think of this as a maturity curve, too. A lot of people start by just figuring out what's happening: who's using the tech, what percentage of pull requests are AI-assisted (maybe through experience sampling), how many tasks are being assigned to agents. Then we can mature that perspective and correlate that utilization to impact: what is this actually doing to velocity? What is it actually doing to quality? That's when we start getting a more mature picture of our impact. And finally, cost. I like to joke that we're 15 years past the last hype cycle, cloud, and we still have new companies spinning up to teach us how to understand and optimize our cloud costs, so we'll see if we get there. Although I also hear horror stories about people burning through $2,000 worth of tokens a day, so we probably do need to tackle cost as well.

What about compliance and trust? What can we do to ensure the generated output is something our engineers can trust? We have a lot of levers to pull here, but one I'd like to talk about is setting up a feedback loop for our system prompts. These could be called system prompts, Cursor rules, or agent markdown; pretty much all of the mainstream solutions have something like this where you can provide a set of rules to control how the models behave. I won't get too deep into the technical details, but we have an example where models were providing outdated Spring Boot code: we want Spring Boot 3, and it kept giving us Spring Boot 2. The big takeaway is to have the feedback loop. Have a gatekeeper: a person or group in the organization who can receive this feedback and who understands how to maintain and continuously improve these system prompts. That way we're always curating how these assistants, models, and agents affect the whole business.

It also pays to understand how temperature works, especially when we're building agents, because we do have some control over the determinism and nondeterminism of these models. When a model is predicting the next token, it doesn't just have one candidate; it has a distribution of tokens, each associated with a probability of being the "right" one. Temperature, which is heat, which is entropy, which is randomness, controls how much randomness is involved in actually picking that token. This is sometimes called increasing the creativity of the model, and it's a number between 0 and 1. For the reasons I just mentioned, don't use exactly 0 or exactly 1; weird things will happen. You want some decimal in between. With a low temperature, like the 0.001 shown here, we give the model the same task twice and it gives us the exact same output, character for character. With the temperature set higher, 0.9 in this example, I'm asking the agent to create a gradient for me, a simple task, and it gives me two relatively valid solutions. I did ask for a JavaScript method and only one of them actually is one, but the point is that they are wildly different approaches to the same problem once I've increased the creativity of the model. So think about, use case by use case, where you want more creativity and where you want more determinism; temperature is one setting that helps control that. You can experiment with all of this using Docker Model Runner, Ollama, LM Studio, that sort of thing.
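Here's a minimal sketch of the mechanics just described: temperature scales the model's raw token scores (logits) before they're turned into probabilities, so low values sharpen the distribution toward the top token and high values flatten it. The logits here are made up for illustration:

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float) -> str:
    """Softmax over temperature-scaled logits, then sample one token."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(weights.values())
    return random.choices(
        population=list(weights), weights=[w / total for w in weights.values()]
    )[0]

logits = {"return": 2.0, "const": 1.5, "let": 0.5}  # hypothetical next-token scores

# Near-zero temperature: the top-scoring token wins essentially every time.
print([sample_token(logits, 0.001) for _ in range(5)])  # ['return', 'return', ...]
# Higher temperature: lower-probability tokens get picked noticeably often.
print([sample_token(logits, 0.9) for _ in range(5)])
```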
How can we tie this to better employee success? We have to provide both education and adequate time to learn. So we put together a study where we sampled a bunch of developers who were saving at least an hour a week with AI, and we asked them to stack-rank their top five most valuable use cases. Then we built a guide around that: a guide that walks through code examples and prompting examples for the use cases where, according to that data, we should be getting more reflexive in our use of AI, along with the best practices around them. That guide has become required reading in certain engineering groups, and we're proud of that. It's another way we can help educate, but we also need to give people time. We don't have time to go through all of it here, but I do think it's interesting that the number one use case was stack trace analysis: not a generative use case, but an interpretive one. We see some other use cases here that are not too surprising, and there are examples of each of them in the guide.

What about unblocking usage? How can we creatively ensure that engineers can take the most advantage of this? Leverage self-hosted and private models; that's getting easier and easier to do. Partner with compliance on day one, and make sure that what you're doing is in line with your organization's compliance requirements. You may find you've been making a lot of assumptions about things you think you can't do that you actually can. And think creatively around the various barriers.
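Tying those two ideas together, here's a minimal sketch of that number-one use case, stack trace analysis, run against a self-hosted model so nothing leaves your machine; it assumes a local Ollama server on its default port and a model name you'd substitute with whatever you have pulled:

```python
import json
import urllib.request

# Sketch: interpret a stack trace with a locally hosted model.
# Assumes Ollama's default local endpoint; swap in your own model name.
OLLAMA_URL = "http://localhost:11434/api/generate"

stack_trace = """Exception in thread "main" java.lang.NullPointerException
    at com.example.billing.InvoiceService.total(InvoiceService.java:42)
    at com.example.billing.Main.main(Main.java:15)"""

payload = {
    "model": "llama3",  # any locally pulled model
    "prompt": f"Explain the likely root cause of this stack trace:\n{stack_trace}",
    "stream": False,
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```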
Finally, how can we integrate across the SDLC? What should we think about doing there? I'm a big Eliyahu Goldratt, Theory of Constraints fan; I probably have some company in the audience. An hour saved on something that isn't the bottleneck is worthless. And when we look at data across, in this case, almost 140,000 engineers, we find that there are real annualized time savings from AI, but they're being eclipsed by sources of context switching and interruption, meeting-heavy days, those kinds of things. Yes, we can save time here, but we're losing so much more time over there. So: find the bottleneck, fix the bottleneck.

Morgan Stanley has been very public about building a tool called DevGen.AI that looks at a bunch of legacy code, COBOL, mainframe Natural, and, I hate to admit, Perl, because I'm an old-school Perl developer, but apparently that's legacy now too, and creates specs that can be handed straight to developers to start modernizing the code without having to do all that reverse engineering. They're saving about 300,000 hours annually right now doing this. There's a Wall Street Journal article about it and a Business Insider article; they're very public about it.

Zapier should be the example for everyone. They have a whole series of bots and agents doing things like assisting with onboarding, and they can now make new engineers effective in two weeks; the industry benchmark on the good side is about a month, and in the middle it's more like 90 days. Because they're able to increase the effectiveness of the engineers they're bringing into the organization, they realized they should be hiring more, as opposed to trying to maintain the status quo by cutting headcount and squeezing more out of individual engineers. They said, "If we can get more value out of a single engineer, we should be hiring faster than ever." And they are, and it's really increasing their competitive edge. I think that's the right attitude.

Spotify has been helping out their SREs by pulling together context when incidents are detected, taking things like runbook steps and other relevant context and documentation and pushing them directly into SRE channels. Those critical minutes spent getting to the bottom of what's actually happening and deciding what to do to resolve the incident? They just eliminated that time, which has significantly improved their MTTR. So let's get creative about the areas of the SDLC that are our actual bottlenecks.

All right, next steps. Distribute this guide as a reference for integrating AI into the development workflows you have. Determine a method for measuring and evaluating GenAI impact; it's really important to make sure you're not on the bad side of those graphs I showed you earlier. Then track and measure AI adoption, see how it correlates with your overall impact metrics, and iterate on best practices and use cases. And here's the guide again. Thank you so much. [applause]