AI powered entomology: Lessons from millions of AI code reviews — Tomas Reimers, Graphite
Channel: aiDotEngineer
Published at: 2025-07-22
YouTube video id: TswQeKftnaw
Source: https://www.youtube.com/watch?v=TswQeKftnaw
[Music] Thank you all so much for coming to this talk, and thank you for being at this conference generally. My name is Tomas. I'm one of the co-founders of Graphite, and I'm here to talk to you about AI-powered entomology. If you don't know, entomology is the study of bugs. It's something that is very near and dear to our heart and part of what our product does. Graphite, for those of you that don't know, builds a product called Diamond. Diamond is an AI-powered code reviewer. You connect it to your GitHub, and it goes ahead and finds bugs.
The project started about a year ago. What we started to notice was that the amount of code being written by AI was going up and up and up, but so was the amount of bugs. And after really thinking about it, we thought that this might actually be part and parcel, and that what we need to do is find a way to better address these bugs in general. Given the technological advances, the first thing we turned to was AI itself. We started to ask: maybe AI is creating the bugs, but can it also find the bugs? Can it help us? We started to do things like ask Claude, "Hey, here's a PR. Can you find bugs on this PR?" And we were pretty impressed with the early results.
Here's an example actually pulled from this week from our codebase, where it turns out that in certain instances we'd be returning one of our database ORM classes uninstantiated, which would go ahead and crash our server. Here's an example that came up on Twitter this week from our bot, which found that in certain instances there would be math being done around border radiuses that would lead to a division by a negative number and would then go ahead and crash the front end. So to answer the question, it turns out AI can find bugs. That's the end of the talk. I'm kidding. If you've tried this, you've probably had a really, really frustrating experience.
We also went ahead and saw things like "You should update this code to do what ourity does," "CSS doesn't work this way" when it does, or my favorite, "You should revert this code to what it used to do because it used to do it." Getting those lost us a lot of confidence, but we started to think: we're seeing some really good things and we're seeing some really bad things, so maybe there's actually more than one type of bug. Maybe there's more than one type of thing an LLM can find.
And so we started with the most basic division: there's probably stuff that LLMs are good at catching and things that they're not good at catching. At the end of the day, LLMs ultimately try to mimic the thing that you're asking them to do. If you ask them, "Hey, what kind of code review comments would be left on this PR?", they go ahead and leave everything, both the comments that are within their capability and the ones that are not. And so we started to categorize those.
What we found, though, was that even when we categorized those, the LLM would start to leave comments like this: "You should add a comment describing what this class does." "You should extract this logic out into a function." "You should make sure this code has tests." While these are technically correct, to developers they're really frustrating. And I think one of the most insightful moments for us in building this project was when we sat down with our design team and started to actually go through past bugs, both those left by our bot and by humans in our own codebase. The developers were all pretty much on the same page: "Yep, I'd be okay if an LLM left that. No, I would not be okay if an LLM left that. Yes, I'd be okay." And our designers were actually kind of baffled by it. They're like, "Well, but that kind of looks like that other comment."
And I think that what's happening here in the mind of the developer is that if you read a type of comment like this, maybe you find it pedantic, frustrating, annoying when it comes from an LLM, and you're much more welcoming to it when it comes from a human. And so as we started to think more around that classification of bugs, we started to think around a second axis here: there's stuff LLMs can catch and stuff LLMs can't catch, but there's also stuff that humans want to receive from an LLM and stuff that humans don't want to receive from an LLM.
And so what we went ahead and did was take 10,000 comments from our own codebase and from open source codebases, feed them to various LLMs, and ask the LLMs to categorize them. We did that not just once but quite a few times, and then we summarized those comments. What we ended up with was this chart, which says there are actually quite a few different types of comments that you see left on codebases in the wild, ignoring LLMs for a second and just talking around humans.
You see things which are bugs. Those are logical inconsistencies that lead the code to behave in a way it isn't meant to behave. There's also accidentally committed code. This actually shows up more than you would expect. There are performance and security concerns. There's documentation, where the code says one thing and does another and it's not clear which one's right. There are stylistic changes, things like "Hey, you should update this comment" or "In this codebase we follow this other pattern."
And then there's a lot of stuff outside of that top right quadrant. In the bottom right, where humans want to receive it but the LLMs don't seem to be able to get there yet, are things like tribal knowledge. One class of comment that you'll see a lot in PRs is: hey, we used to do it this way.
We don't do it this way anymore because of blank. This documentation doesn't exist. It exists in the heads of your senior developers. And that's wonderful, but it's really hard for an AI to be able to mind-read that.
On the left side, where LLMs definitely can catch it but humans don't want to receive it, are those things I showed you earlier: code cleanliness and best practice. Examples of these that we've found are "comment this function," "add tests," "extract this type out into a different type," "extract this logic out into a function." While this is always correct to say, I think it's really hard for an LLM to know when to apply it. As a human you're applying some kind of barometer of "in this codebase this logic is particularly tricky, and I think someone's going to get tripped up, so we should extract it out" versus "in this codebase it's actually fine." But a bot can pretty much always leave this comment. I'd actually make the argument that a human can pretty much always leave this comment and be technically correct. The question is whether it's welcome in the codebase.
One thing I'm going to say, sort of outside of all of this, is that as you add more context, this area of what people are comfortable with seems to become larger. But for now, given the context that we have, meaning the codebase, the past history, your style guide and rules, we have what we have. And so we end up with this idea that these are basically the classes of comments that we think LLMs can create and humans want to receive.
Now, if you've worked with LLMs, you know that these kinds of offline passes and first passes are great for initial categorizations. But the much harder question is how do you know that you're right continuously? So, as the story goes, we went ahead and basically started to characterize the comments that LLMs leave.
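The two-axis categorization described above could be sketched roughly as follows. This is only an illustration, not Graphite's implementation: the category keys, the `QUADRANTS` table, and the helper names are hypothetical, and collapsing repeated labeling passes with a majority vote is an assumption (the talk only says the comments were categorized several times and then summarized).

```python
from collections import Counter

# Hypothetical table built from the categories named in the talk.
# Each value is (an LLM can catch it, humans want it from an LLM).
QUADRANTS = {
    "bug": (True, True),
    "accidentally_committed_code": (True, True),
    "performance_or_security": (True, True),
    "documentation_mismatch": (True, True),
    "stylistic": (True, True),
    "tribal_knowledge": (False, True),   # wanted, but needs knowledge the LLM lacks
    "code_cleanliness": (True, False),   # catchable, but unwelcome from a bot
}


def majority_label(labels):
    """Collapse several labeling passes over one comment into a single label.

    Assumption: a simple majority vote; ties go to the label seen first.
    """
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels).most_common(1)[0][0]


def should_post(category):
    """Keep only comments in the quadrant that is both catchable and wanted."""
    can_catch, wanted = QUADRANTS.get(category, (False, False))
    return can_catch and wanted


if __name__ == "__main__":
    passes = ["bug", "stylistic", "bug"]  # three hypothetical labeling runs
    label = majority_label(passes)
    print(label, should_post(label))
    print("code_cleanliness", should_post("code_cleanliness"))
```

Unknown categories default to not being posted, which errs on the side of fewer, more welcome comments.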
We updated our prompts to only prompt the LLM to do things that were in its capacity and that humans wanted to receive. And people anecdotally started to like it a lot more. But then we started to think: how do we know that this is going right? As we think around new LLMs, as we got into Claude 4, or Opus instead of Sonnet, how do we know that we're actually staying in this top right quadrant? And as we increase the context, how do we know that this isn't growing on us? Maybe there are even more types of comments that we could be leaving that we're not leaving already.
And so first and foremost, we started by just looking at what kinds of comments the thing is currently leaving. Your mileage may vary. For us, this is roughly the proportion we see of comments being left by the LLM right now, based just on what we've seen. But the deeper question for us was how do we measure the success? Given this quadrant, how do we know that we're in the top right?
The first one was easy for us. So think around what they can catch and what they can't catch. What we started to do was add upvotes and downvotes to the product. We let you emoji react on these comments, and the reactions pretty much tell us when the LLM's hallucinating. When we start to see a downvote spike, we know that we might be trying to extend this thing beyond its capabilities, and we need to tone it down. So upvote/downvote, we implemented it. We see a less than 4% downvote rate these days. We felt pretty good about that.
But the second one was a lot harder. What humans want to receive and what humans don't want to receive was something that we weren't really sure how to get at. As we started to think around it, what we realized was: what's the point of a comment? Why do you leave a comment in code review?
You leave a comment in code review ultimately so that someone actually updates the code to reflect that. And so our question was: can we measure that? Can we measure what percent of comments actually lead to the change that they describe? And so we started to do that. We started to ask, on open-source repos and on the variety of repos that Graphite, which is a code review tool, has access to, can we actually start to measure that number? And I think one of the most fascinating things we found was that only about 50% of human comments lead to changes. So we started to ask: could we get the LLM to at least this? Because if we get it to at least this, it's leaving comments at the level of fidelity that humans are.
Now you might be sitting in the audience thinking, "Well, why don't 100% of comments lead to action?" I want to caveat this number. I'm saying lead to action within that PR itself. A lot of comments are fixed forward, where people are like, "Hey, I hear you, and I'm going to fix this in a follow-up." A lot of comments are also like, "Hey, as a heads up, in the future, if you do this, maybe you can do it this other way," and don't need to be acted on then. And some of them are just purely preferential: I would do it this way, and someone disagrees. In healthy code review cultures, that space for disagreement exists.
And so we started to measure this, and we started to ask whether we could get the bot here. Over time we actually have. As of March we're at 52%, which is to say that if you start to actually prompt it correctly, you can get there. And I think our broader thesis is that finding bugs via an LLM does actually work. If you want to try any of these findings in production, Diamond is our product that offers it. We have a booth over there. Thank you. [Applause] [Music]
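The two measurements from the talk, the downvote rate and the share of comments that lead to a change within the same PR, could be computed along these lines. This is a minimal sketch under stated assumptions: the `ReviewComment` record and its fields are hypothetical, the downvote rate is assumed to mean the fraction of comments that received at least one downvote, and how a comment is judged to have "led to a change" is not specified in the talk, so it is taken here as a precomputed flag.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    """Hypothetical record for one review comment left on a PR."""
    upvotes: int = 0
    downvotes: int = 0
    led_to_change: bool = False  # the same PR was updated to reflect the comment


def downvote_rate(comments):
    """Fraction of comments with at least one downvote (assumed definition)."""
    if not comments:
        return 0.0
    return sum(1 for c in comments if c.downvotes > 0) / len(comments)


def change_rate(comments):
    """Fraction of comments that led to a change within the same PR."""
    if not comments:
        return 0.0
    return sum(1 for c in comments if c.led_to_change) / len(comments)


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    sample = [
        ReviewComment(downvotes=1),
        ReviewComment(upvotes=2, led_to_change=True),
        ReviewComment(led_to_change=True),
        ReviewComment(),
    ]
    print(f"downvote rate: {downvote_rate(sample):.0%}")
    print(f"change rate:   {change_rate(sample):.0%}")
```

Tracking both numbers over time, per model and per prompt version, is one way to notice a downvote spike or a drop in acted-on comments after a change.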