AI-powered entomology: Lessons from millions of AI code reviews — Tomas Reimers, Graphite

Channel: aiDotEngineer

Published at: 2025-07-22

YouTube video id: TswQeKftnaw

Source: https://www.youtube.com/watch?v=TswQeKftnaw

Thank you all so much for coming to this talk, and thank you for being at this conference. My name is Tomas. I'm one of the co-founders of Graphite, and I'm here to talk to you about AI-powered entomology. If you don't know, entomology is the study of bugs. It's something very near and dear to our hearts, and it's part of what our product does. Graphite, for those of you who don't know, builds a product called Diamond. Diamond is an AI-powered code reviewer: you connect it to your GitHub, and it goes and finds bugs.
The project started about a year ago. What we started to notice was that the amount of code being written by AI was going up and up and up, but so was the amount of bugs. After really thinking about it, we thought this might actually be part and parcel, and that what we needed was a better way to address these bugs in general. Given the technological advances, the first thing we turned to was AI itself. We started to ask: maybe AI is creating the bugs, but can it also find the bugs? Can it help us? We started to do things like ask Claude, "Hey, here's a PR. Can you find bugs on this PR?" And we were pretty impressed with the early results.
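To make that kind of prompt concrete, here's a minimal sketch, assuming the Anthropic Python SDK. The model id, prompt wording, and the review_pr helper are illustrative, not Graphite's production pipeline.

```python
# A minimal sketch of the "here's a PR, find bugs" prompt described above.
# Assumes the Anthropic Python SDK; model id and prompt are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def review_pr(diff: str) -> str:
    """Ask the model to flag likely bugs in a unified diff."""
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use whatever model you have access to
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Here's a PR. Can you find bugs on this PR? "
                "Only report logic errors that would change behavior, "
                "and give the file and line for each.\n\n" + diff
            ),
        }],
    )
    return message.content[0].text
```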
Here's an example, actually pulled from our codebase this week, where it turns out that in certain instances we'd be returning one of our database ORM classes uninstantiated, which would crash our server. Here's another example that came up on Twitter this week, from our bot, which found that in certain instances math being done around border radii could lead to a division by a negative number and crash the front end. So, to answer the question: it turns out AI can find bugs. That's the end of the talk.
I'm kidding. If you've tried this, you've probably had a really, really frustrating experience. We also saw things like "you should update this code to do what ourity does," "CSS doesn't work this way" (when it does), or my favorite, "you should revert this code to what it used to do, because it used to do it." Getting comments like those lost us a lot of confidence. But we started to think: we're seeing some really good things and some really bad things, so maybe there's actually more than one type of bug, more than one type of thing an LLM can find.
So we started with the most basic division: there's probably stuff that LLMs are good at catching and stuff that they're not good at catching. At the end of the day, LLMs ultimately try to mimic the thing you're asking them to do. If you ask them, "Hey, what kinds of code review comments would be left on this PR?", they leave everything, both comments that are within their capability and comments that are not. So we started to categorize those. What we found, though, was that even when we categorized those, the LLM would still leave comments like: "You should add a comment describing what this class does." "You should extract this logic out into a function." "You should make sure this code has tests." While these are technically correct, to developers they're really frustrating.
I think this was actually one of the most insightful moments for us in building this project: we sat down with our design team and started to go through past bugs, both those flagged by our bot and those flagged by humans in our own codebase. The developers were all pretty much on the same page: "Yep, I'd be okay if an LLM left that. No, I would not be okay if an LLM left that. Yes, I'd be okay." And our designers were kind of baffled by it: "But that kind of looks like that other comment." I think what's happening in the mind of the developer is that when you read a comment like this, you may find it pedantic, frustrating, and annoying when it comes from an LLM, and yet be much more welcoming of it when it comes from a human. So as we started to think more about that classification of bugs, we started to think about a second axis: there's stuff LLMs can catch and stuff LLMs can't catch, but there's also stuff humans want to receive from an LLM and stuff humans don't want to receive from an LLM.
So what we went ahead and did was take 10,000 comments from our own codebase and from open-source codebases, feed them to various LLMs, and ask them to categorize them. We did that not just once but quite a few times, and then we summarized those comments. What we ended up with was this chart, which says there are actually quite a few different types of comments that you see left on codebases in the wild, ignoring LLMs for a second and just talking about humans.
You see things that are bugs: logical inconsistencies that lead the code to behave in a way it isn't meant to behave. There's also accidentally committed code; this actually shows up more than you would expect. There are performance and security concerns. There's documentation, where the code says one thing and does another and it's not clear which one is right. And there are stylistic changes, things like "hey, you should update this comment" or "in this codebase we follow this other pattern."
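As a rough illustration of that offline categorization pass (not the exact pipeline from the talk), you could run each comment through an LLM classifier a few times and keep the majority label, then tally the labels. The label set, prompt, and model id here are assumptions.

```python
# Sketch of the repeated-categorization idea: classify each review comment
# several times, take the majority label, and summarize the counts.
from collections import Counter

import anthropic

client = anthropic.Anthropic()

LABELS = ["bug", "accidentally committed code", "performance/security",
          "documentation mismatch", "stylistic", "other"]


def classify_once(comment: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=16,
        messages=[{"role": "user", "content":
            f"Categorize this code review comment as one of {LABELS}. "
            f"Reply with the label only.\n\n{comment}"}],
    )
    return resp.content[0].text.strip().lower()


def classify(comment: str, runs: int = 3) -> str:
    # "not just once but quite a few times": majority vote across runs
    votes = Counter(classify_once(comment) for _ in range(runs))
    return votes.most_common(1)[0][0]


def summarize(comments: list[str]) -> Counter:
    return Counter(classify(c) for c in comments)
```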
Then there's a lot of stuff outside of that top-right quadrant. In the bottom right, where humans want to receive it but LLMs don't seem to be able to get there yet, are things like tribal knowledge. One class of comment you'll see a lot on PRs is "hey, we used to do it this way; we don't do it this way anymore because of blank." That documentation doesn't exist; it exists in the heads of your senior developers. That's wonderful, but it's really hard for an AI to mind-read its way to it.
On the left side, where LLMs definitely can catch it but humans don't want to receive it, are those things I showed you earlier: code cleanliness and best practices. Examples of these that we've found are "comment this function," "add tests," "extract this type out into a different type," "extract this logic out into a function." While this is always correct to say, I think it's really hard for an LLM to know when to apply it. As a human, you're applying some kind of barometer: "in this codebase, this logic is particularly tricky and I think someone's going to get tripped up, so we should extract it out," versus "in this codebase, it's actually fine." A bot can pretty much always leave this comment. I'd actually make the argument that a human can pretty much always leave this comment too, and it would be technically correct; the question is whether it's welcome in the codebase. And one thing I'll say, outside of all of this: as you add more context, this area of what people are comfortable with seems to become larger. But for now, given the context that we have, the codebase, the past history, your style guide and rules, we have what we have. And so we end up with this idea that these are basically the classes of comments that LLMs can create and humans want to receive.
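One way to picture that framing is as a simple data model: two booleans per category, with only the top-right quadrant allowed through. This is a sketch of the idea; the mapping of categories is my reading of the chart, not anything stated as code in the talk.

```python
# Illustrative model of the two-axis quadrant described above.
from dataclasses import dataclass


@dataclass
class CommentCategory:
    name: str
    llm_can_catch: bool   # axis 1: is this within the model's capability?
    humans_want_it: bool  # axis 2: is it welcome when it comes from a bot?


# Example placement of the categories from the talk (assumed mapping).
CATEGORIES = [
    CommentCategory("logic bug", True, True),
    CommentCategory("accidentally committed code", True, True),
    CommentCategory("performance/security concern", True, True),
    CommentCategory("documentation mismatch", True, True),
    CommentCategory("code cleanliness / best practices", True, False),  # left side
    CommentCategory("tribal knowledge", False, True),                   # bottom right
]

# Only prompt for, and surface, the top-right quadrant.
TOP_RIGHT = [c.name for c in CATEGORIES if c.llm_can_catch and c.humans_want_it]
```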
Now, if you've worked with LLMs, you know that these kinds of offline passes and first passes are great for initial categorization. But the much harder question is: how do you know that you're right, continuously? So, as the story goes, we went ahead and basically started to characterize the comments that LLMs leave. We updated our prompts to only ask the LLM to do things that were within its capacity and that humans wanted to receive, and people anecdotally started to like it a lot more. But then we started to think: how do we know that this is going right? As new LLMs come along, as we got onto Claude 4, or onto Opus instead of Sonnet, how do we know that we're actually staying in this top-right quadrant? As we increase the context, how do we know that this quadrant isn't shifting on us, and that maybe there are even more types of comments we could be leaving that we're not leaving already?
So first and foremost, we started by just looking at what kinds of comments the thing is currently leaving. Your mileage may vary; for us, this is roughly the proportion of comments we see being left by the LLM right now, based just on what we've seen. But the deeper question for us was how to measure success. Given this quadrant, how do we know that we're in the top right? The first axis was easy for us: what they can catch and can't catch. We started to add upvotes and downvotes to the product, letting you emoji-react to these comments, and they pretty much tell us when the LLM is hallucinating. When we start to see a downvote spike, we know we might be trying to extend this thing beyond its capabilities and we need to tone it down. The second axis, what humans want to receive and don't want to receive, was a lot harder, and it was something we weren't really sure how to get at. So upvote/downvote: we implemented it, we see less than a 4% downvote rate these days, and we felt pretty good about that.
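A minimal sketch of that feedback signal might look like the following; the rolling window and alert threshold are made-up numbers, since the talk only reports that the observed downvote rate is under 4%.

```python
# Illustrative rolling downvote-rate monitor for bot review comments.
from collections import deque


class DownvoteMonitor:
    def __init__(self, window: int = 500, alert_rate: float = 0.08):
        self.reactions = deque(maxlen=window)  # True = downvote, False = upvote
        self.alert_rate = alert_rate

    def record(self, is_downvote: bool) -> None:
        self.reactions.append(is_downvote)

    @property
    def downvote_rate(self) -> float:
        return sum(self.reactions) / len(self.reactions) if self.reactions else 0.0

    def spiking(self) -> bool:
        # A sustained jump suggests the prompts are pushing the model past its
        # capabilities and should be toned down.
        return (len(self.reactions) == self.reactions.maxlen
                and self.downvote_rate > self.alert_rate)
```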
As we started to think about the second axis, what we realized was: what's the point of a comment? Why do you leave a comment in code review? Ultimately, you leave a comment in code review so that someone actually updates the code to reflect it. So our question was: can we measure that? Can we measure what percent of comments actually lead to the change that they describe? We started to do that. We started to ask that question on open-source repos and on the variety of repos that Graphite, which is a code review tool, has access to: can we actually measure that number? And I think one of the most fascinating things we found was that only about 50% of human comments lead to changes. So we started to ask: could we get the LLM to at least that level? Because if we get it there, it's at least leaving comments at the level of fidelity that humans are.
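A sketch of that measurement, under the assumption that you can see which lines later commits on the PR touched, might look like this. The Comment and Commit shapes and the line-overlap heuristic are illustrative, not how Graphite actually computes it.

```python
# Sketch of the "did this comment lead to a change within the PR?" metric.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Comment:
    path: str
    line: int
    created_at: datetime


@dataclass
class Commit:
    pushed_at: datetime
    changed_lines: dict[str, set[int]]  # path -> line numbers touched


def comment_addressed(comment: Comment, later_commits: list[Commit], slop: int = 3) -> bool:
    """True if some later commit on the PR touched the commented lines (± slop)."""
    nearby = set(range(comment.line - slop, comment.line + slop + 1))
    return any(
        c.pushed_at > comment.created_at
        and nearby & c.changed_lines.get(comment.path, set())
        for c in later_commits
    )


def addressed_rate(comments: list[Comment], commits: list[Commit]) -> float:
    if not comments:
        return 0.0
    return sum(comment_addressed(c, commits) for c in comments) / len(comments)
```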
Now, you might be sitting in the audience thinking: well, why don't 100% of comments lead to action? I want to caveat this number: I'm saying lead to action within that PR itself. A lot of comments are fixed forward, where people say, "Hey, I hear you, and I'm going to fix this in a follow-up." A lot of comments are also like, "Hey, as a heads up, in the future if you do this, maybe you can do it this other way," but they don't need to be acted on right then; I think that's fair. And some of them are just purely preferential: "I would do it this way," and someone disagrees. In healthy code review cultures, that space for disagreement exists. So we started to measure this, and we started to ask: could we get the bot there? Over time, we actually have. As of March, we're at 52%, which is to say that if you start to prompt it correctly, you can get there. And I think our broader thesis is that finding bugs via an LLM does actually work. If you want to try any of these findings in production, Diamond is our product that offers it. We have a booth over there. Thank you.
[Applause]