Leadership in AI Assisted Engineering – Justin Reock, DX (acq. Atlassian)
Channel: aiDotEngineer
Published at: 2025-12-19
YouTube video id: PmZDupFP3UM
Source: https://www.youtube.com/watch?v=PmZDupFP3UM
Thanks for joining me in one of the later-day sessions. Looks like we kept a lot of people here; this is a nice full room, great to see. We're going to go through a lot of content in a short amount of time, so I'm going to get right into it. If you want to go deeper into any of this, we've published an AI strategy playbook for senior executives. I won't have time to go quite as deep on a lot of this content, but the playbook is a nice PDF you can refer to later. If you missed this QR code, don't worry, I'll show it again at the end.

So, what is the current impact of GenAI? Nobody knows, right? We've got Google on the one hand telling us that everyone's 10% more productive. That's interesting. They're Google; they were already pretty productive to begin with. But then we have the now-infamous METR study, which has some flaws in how it was put together, showing a 19% decrease in productivity when using coding assistants. So there's a lot of volatility, a lot of variability. What was really interesting about that study, flaws aside, is that every engineer who took part felt more productive, yet the data showed they were actually less productive. Kind of interesting, right? There's an induced flow state that makes us feel really good about what we're doing. So we need to address this.

DORA has put out some really good research on this too, but it's based on industry averages: what do we see across a large sample when, in this case, AI adoption increases by 25%? We see modest but positive-leaning indicators: a 7.5% increase in documentation quality and roughly a 3.4% increase in code quality. At least that's not leaning in the other direction, right?

When we dug through some of DX's data, and we're the developer productivity measurement company, so we have lots of aggregate data to look at, we found the same thing. Looking at averages, we see about a 2.6% increase in overall change confidence, which is the percentage of people who answered positively that they feel confident in the changes they're putting into production. A similar positive-leaning average for code maintainability, another qualitative metric, and a 1% reduction in change failure rate, which, against an industry benchmark of 4%, is not insignificant.

But this is not the full story, because here's what we saw when we broke the same studies down per company. Every bar here represents a company. Some are seeing 20% increases in change confidence while others are seeing 20% decreases. We're seeing extreme volatility, which is why the averages look so innocuous: they're belying the greater story of variability. We see the same thing with code maintainability, and the same thing with change failure rate. That top bar is a 2 percentage point increase in change failure rate, and with an industry benchmark of 4%, that means shipping as much as 50% more defects than before. We want to make sure we're on the lower end of this.
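To make that defect arithmetic explicit, here's a trivial check; it's just a sketch using the figures from the slide (a 4% baseline and a 2-point increase):

```python
# Relative increase in shipped defects when change failure rate (CFR)
# rises by 2 percentage points from a 4% industry baseline.
baseline_cfr = 0.04   # industry benchmark: 4% of changes cause a failure
observed_cfr = 0.06   # baseline plus a 2 percentage point increase

relative_increase = (observed_cfr - baseline_cfr) / baseline_cfr
print(f"{relative_increase:.0%} more defects shipped")  # -> 50% more defects shipped
```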
But how? What should we be doing? Well, we found some patterns. Some organizations are seeing positive impacts to KPIs, but others are struggling with adoption and even seeing some of these negative impacts. Top-down mandates are not working. Driving toward "we must have 100% adoption of AI"? Great, I'll have the assistant update my README file every morning and I'll be compliant, right? We're not actually moving the needle anywhere when we do that. We also find that a lack of education and enablement has a big negative impact. Some organizations just turn on the tech and expect it to start working, and expect everybody to know the best ways to use it. And there's difficulty measuring the impact, or even knowing what we should be measuring. What metrics should we be looking at? Does utilization really tell us much of the full story of GenAI impact?

This is another graph from DORA. It's a Bayesian posterior distribution, which is an interesting way of representing data. Basically, you want the mass to be on the yellow side of the line, the right side from the audience's view, and you want a sharp peak, which tells you we're pretty confident that this initiative will have this impact. If we look at the top-line initiatives here, they're things like clear AI policies. We want to make sure we have those. We want time to learn: not just giving people materials, but actually giving them space to experiment. These are the types of factors that seem to be moving the needle the most.
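To make that chart concrete, here's a minimal sketch of the kind of posterior it shows, assuming a simple normal-normal model with made-up numbers; this is illustrative only, not DORA's actual methodology:

```python
import numpy as np

# Illustrative posterior over an initiative's effect size (normal-normal model).
# Prior: effect ~ N(0, 1). Observations: hypothetical noisy effect measurements.
prior_mean, prior_var = 0.0, 1.0
observations = np.array([0.9, 1.1, 0.8, 1.2, 1.0])
noise_var = 0.5

# Conjugate update: posterior precision is the sum of prior and data precisions.
post_var = 1.0 / (1.0 / prior_var + len(observations) / noise_var)
post_mean = post_var * (prior_mean / prior_var + observations.sum() / noise_var)

print(f"posterior: N({post_mean:.2f}, {post_var:.3f})")
# Mass well to the right of zero (positive impact) plus a small variance
# (a sharp peak) is exactly what you want to see on that DORA chart.
```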
So we're going to go over some quick tips on how to do all of these things, and again, the guide goes deeper. We want to integrate across the SDLC. For most organizations, writing code has never been the bottleneck. We can increase productivity a bit by helping with code completion, but our biggest bottlenecks are elsewhere in the SDLC; there's a lot more to creating software than just writing code. We want to unblock usage. We can't just say, "Well, we're worried about data exfiltration, so we can't try this thing." No, get creative about it. We've got really good infrastructure out there now, like Bedrock and Fireworks AI, that lets us run powerful models in safe spaces. We have to have open discussions about these metrics: evangelize the wins, and let our engineers know why we're gathering metrics and data and what we're trying to improve. We have to reduce the fear of AI. We have to make sure people understand that this is not a technology that's ready to replace engineers; it's a technology that's really good at augmenting engineers and increasing the throughput of the business. We have to establish better compliance and trust. And we need to tie all of this to employee success. These are new skill sets. AI is not coming for your job, but somebody really good at AI might take your job. As leaders, we have the opportunity to help our employees become more successful with this technology.

So, how do we reduce the fear? First of all, why do we need to? There are a lot of good reasons, but I love to point to Google's Project Aristotle, a 2012 study where Google wanted to figure out the characteristics of highly performant teams. They thought the recipe was just going to be what Google had: a combination of high performers, experienced managers, and basically unlimited resources. They were dead wrong. Overwhelmingly, the biggest indicator of productivity was psychological safety. And that very much applies now.

We also have data like SWE-bench. I'm sure a lot of you have seen it, and there are some impressive benchmarks: agents can do about a third of the things they're asked to do without any human intervention. That means they can't do the other two-thirds, right? Again, we are augmenting, not replacing. We're not ready; we may never be ready. So we need to be very transparent about what we're doing. We need to set very clear intent: we are using this to augment, not to replace. We need to be proactive in how we communicate that, not just wait for people to get upset and possibly scared. We need to say, "We are here to help you, to give you a better developer experience, and to increase the throughput of the business."

And again, we have to have these discussions about metrics. So what metrics should we be looking at? Well, DX again is a developer experience and productivity measurement company, and there are really two classes of metrics, two levers that matter here: speed and quality. We want to increase PR throughput, we want to increase velocity, but not by creating a bunch of slop that turns into tech debt we'll have to deal with later; that just kicks the bottleneck down the road. So we want to look at things like change failure rate, overall perception of quality, change confidence, and maintainability.

And we have three types of metrics we can use. First, telemetry metrics: the things coming out of the API. They're good for some things, but they're not always accurate. Accept-versus-suggest rate was all the rage until we realized that engineers need to click accept in the IDE for the API to even know about it, and even if they do click accept, who's to say they didn't go back and rewrite every line that was suggested? So that provides some context, but we also need to do experience sampling: for instance, add a new field to the PR form that says "I used AI to generate this PR" or "I enjoyed using AI to generate this PR," and get some data that way. And then there's self-reported, or survey, data. We are big on surveys, but let me underscore: we're big on effective surveys. Ninety-percent-plus participation rates, engineered around questions that treat developer experience as a systems problem, not a people problem, because that's what it is. As W. Edwards Deming put it, 90 to 95% of the productivity output of an organization is determined by the system, not the worker. So foundational developer experience and developer productivity metrics still matter the most. Our AI metrics, like utilization, tell us what's happening with the tech, but these core metrics we've learned to trust tell us whether the initiatives are actually working: are we actually moving the needle and getting the outcomes we want to see?
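As a concrete illustration of that experience-sampling idea, here's a minimal sketch that tallies an AI-usage checkbox from pull request descriptions; the checkbox label and data shape are hypothetical, not a DX or GitHub feature:

```python
# Sketch: measure the share of PRs whose authors checked an
# "I used AI to generate this PR" box in the PR template.
AI_CHECKBOX = "[x] I used AI to generate this PR"  # hypothetical template line

def ai_assisted_share(pr_bodies: list[str]) -> float:
    """Fraction of PRs self-reported as AI-assisted."""
    if not pr_bodies:
        return 0.0
    flagged = sum(AI_CHECKBOX.lower() in body.lower() for body in pr_bodies)
    return flagged / len(pr_bodies)

prs = [
    "Fixes login bug.\n[x] I used AI to generate this PR",
    "Refactors billing module.\n[ ] I used AI to generate this PR",
]
print(f"AI-assisted PRs: {ai_assisted_share(prs):.0%}")  # -> 50%
```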
So top companies are looking at different things. We're seeing adoption metrics coming out of Microsoft. They've also got this great metric called a "bad developer day." I won't go into it here, but there's a really good white paper showing all the different telemetry they look at to determine what makes a bad developer day. Dropbox is looking at similar things: adoption metrics like weekly and daily active users, but also quality metrics like change failure rate. And Booking.com is looking at similar things as well.

So we built a framework around this. We were first to market with what we call the DX AI Measurement Framework, very much inspired by DORA, the SPACE framework, and DevEx, just like our Core 4 metric set, which you can ask me about later. We take these metrics and normalize them into three dimensions: utilization, impact, and cost. You can think of this as a maturity curve, too. A lot of people start by just figuring out what's happening: who's using the tech, what percentage of pull requests are AI-assisted (maybe through experience sampling), how many tasks are being assigned to agents. Then we can mature that perspective and correlate that utilization to impact: what is this actually doing to velocity? What is it actually doing to quality? That's when we start getting a more mature picture of our impact. And finally, cost. I like to joke that we're 15 years past the last hype cycle, cloud, and we still have new companies spinning up to teach us how to understand and optimize our cloud costs, so we'll see if we get there. Although I also hear horror stories about people burning through $2,000 worth of tokens a day, so we probably do need to tackle cost as well.

What about compliance and trust? What can we do to ensure the generated output is something our engineers can trust? We have a lot of levers to pull here, but one I'd like to talk about is setting up a feedback loop for our system prompts. These could be called system prompts, Cursor rules, or agent markdown; pretty much all of the mainstream solutions have something like this where you can provide a set of rules to control how the models behave. I won't get too deep into the technical details, but we have an example where models were providing outdated Spring Boot code: we want Spring Boot 3, and it kept giving us Spring Boot 2. The big takeaway is to have the feedback loop. Have a gatekeeper: a person or group in the organization who can receive this feedback and who understands how to maintain and continuously improve these system prompts. That way we're always curating how these assistants, models, and agents affect the whole business.

It also pays to understand how temperature works, especially when we're building agents, because we do have some control over the determinism and nondeterminism of these models. When a model is predicting the next token, it doesn't just have one candidate; it has a distribution of tokens, each associated with a probability of being the "right" one. Temperature, which is heat, which is entropy, which is randomness, controls how much randomness is involved in actually picking that token. This is sometimes called increasing the creativity of the model, and it's a number between 0 and 1. For the reasons I just mentioned, don't use exactly 0 or exactly 1; weird things will happen. You want some decimal in between. With a low temperature, like the 0.001 shown here, we give the model the same task twice and it gives us the exact same output, character for character. With the temperature set higher, 0.9 in this example, I'm asking the agent to create a gradient for me, a simple task, and it gives me two relatively valid solutions. I did ask for a JavaScript method and only one of them actually is one, but the point is that they are wildly different approaches to the same problem once I've increased the creativity of the model. So think about, use case by use case, where you want more creativity and where you want more determinism; temperature is one setting that helps control that. You can experiment with all of this using Docker Model Runner, Ollama, LM Studio, that sort of thing.
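Here's a minimal sketch of the mechanics just described: temperature scales the model's raw token scores (logits) before they're turned into probabilities, so low values sharpen the distribution toward the top token and high values flatten it. The logits here are made up for illustration:

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float) -> str:
    """Softmax over temperature-scaled logits, then sample one token."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(weights.values())
    return random.choices(
        population=list(weights), weights=[w / total for w in weights.values()]
    )[0]

logits = {"return": 2.0, "const": 1.5, "let": 0.5}  # hypothetical next-token scores

# Near-zero temperature: the top-scoring token wins essentially every time.
print([sample_token(logits, 0.001) for _ in range(5)])  # ['return', 'return', ...]
# Higher temperature: lower-probability tokens get picked noticeably often.
print([sample_token(logits, 0.9) for _ in range(5)])
```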
How can we tie this to better employee success? We have to provide both education and adequate time to learn. So we put together a study where we sampled a bunch of developers who were saving at least an hour a week with AI, and we asked them to stack-rank their top five most valuable use cases. Then we built a guide around that: a guide that walks through code examples and prompting examples for the use cases where, according to that data, we should be getting more reflexive in our use of AI, along with the best practices around them. That guide has become required reading in certain engineering groups, and we're proud of that. It's another way we can help educate, but we also need to give people time. We don't have time to go through all of it here, but I do think it's interesting that the number one use case was stack trace analysis: not a generative use case, but an interpretive one. We see some other use cases here that are not too surprising, and there are examples of each of them in the guide.

What about unblocking usage? How can we creatively ensure that engineers can take the most advantage of this? Leverage self-hosted and private models; that's getting easier and easier to do. Partner with compliance on day one, and make sure that what you're doing is in line with your organization's compliance requirements. You may find you've been making a lot of assumptions about things you think you can't do that you actually can. And think creatively around the various barriers.
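Tying those two ideas together, here's a minimal sketch of that number-one use case, stack trace analysis, run against a self-hosted model so nothing leaves your machine; it assumes a local Ollama server on its default port and a model name you'd substitute with whatever you have pulled:

```python
import json
import urllib.request

# Sketch: interpret a stack trace with a locally hosted model.
# Assumes Ollama's default local endpoint; swap in your own model name.
OLLAMA_URL = "http://localhost:11434/api/generate"

stack_trace = """Exception in thread "main" java.lang.NullPointerException
    at com.example.billing.InvoiceService.total(InvoiceService.java:42)
    at com.example.billing.Main.main(Main.java:15)"""

payload = {
    "model": "llama3",  # any locally pulled model
    "prompt": f"Explain the likely root cause of this stack trace:\n{stack_trace}",
    "stream": False,
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```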
Finally, how can we integrate across the SDLC? What should we think about doing there? I'm a big Eliyahu Goldratt, Theory of Constraints fan; I probably have some company in the audience. An hour saved on something that isn't the bottleneck is worthless. And when we look at data across, in this case, almost 140,000 engineers, we find that there are real annualized time savings from AI, but they're being eclipsed by sources of context switching and interruption, meeting-heavy days, those kinds of things. Yes, we can save time here, but we're losing so much more time over there. So: find the bottleneck, fix the bottleneck.

Morgan Stanley has been very public about building a tool called DevGen.AI that looks at a bunch of legacy code, COBOL, mainframe Natural, and, I hate to admit, Perl, because I'm an old-school Perl developer, but apparently that's legacy now too, and creates specs that can be handed straight to developers to start modernizing the code without having to do all that reverse engineering. They're saving about 300,000 hours annually right now doing this. There's a Wall Street Journal article about it and a Business Insider article; they're very public about it.

Zapier should be the example for everyone. They have a whole series of bots and agents doing things like assisting with onboarding, and they can now make new engineers effective in two weeks; the industry benchmark on the good side is about a month, and in the middle it's more like 90 days. Because they're able to increase the effectiveness of the engineers they're bringing into the organization, they realized they should be hiring more, as opposed to trying to maintain the status quo by cutting headcount and squeezing more out of individual engineers. They said, "If we can get more value out of a single engineer, we should be hiring faster than ever." And they are, and it's really increasing their competitive edge. I think that's the right attitude.

Spotify has been helping out their SREs by pulling together context when incidents are detected, taking things like runbook steps and other relevant context and documentation and pushing them directly into SRE channels. Those critical minutes spent getting to the bottom of what's actually happening and deciding what to do to resolve the incident? They just eliminated that time, which has significantly improved their MTTR. So let's get creative about the areas of the SDLC that are our actual bottlenecks.

All right, next steps. Distribute this guide as a reference for integrating AI into the development workflows you have. Determine a method for measuring and evaluating GenAI impact; it's really important to make sure you're not on the bad side of those graphs I showed you earlier. Then track and measure AI adoption, see how it correlates with your overall impact metrics, and iterate on best practices and use cases. And here's the guide again. Thank you so much. [applause]