How to Improve your Vibe Coding — Ian Butler

Channel: aiDotEngineer

Published at: 2025-08-03

YouTube video id: g03m-WFEu1U

Source: https://www.youtube.com/watch?v=g03m-WFEu1U

My name is Ian. I'm the CEO of Bismuth. We're an end-to-end agentic coding solution, kind of like Codex. We've been working on evals for how good agents are at finding and fixing bugs for the last several months, and we dropped a benchmark yesterday discussing our results. One thing to point out about agents currently is that they have a pretty low overall find rate for bugs, and they generate a significant number of false positives. You can see that agents like Devin and Cursor have a less than 10% true positive rate for finding bugs. This is an issue when you're vibe coding, because these agents can quickly overrun your codebase with unintended bugs that they're not able to actually find and later fix. Overall, it's also worth noting that in a needle-in-a-haystack setting, when we plant bugs in a codebase, these agents struggle to navigate more broadly across larger codebases and actually find the specific bugs.
So, here's the hard truth: three out of six agents on our benchmark had a 10% or less true positive rate out of 900-plus reports. One agent actually gave us 70 issues for a single task, and all of them were false. No developer is going to go through all of those, right? You're not going to sit there and try to figure out which bugs actually exist.
So, bad vibes. The implications: most popular agents are terrible at finding bugs. Cursor had a 97% false positive rate over 100-plus repos and 1,200-plus issues. The real-world impact is that when developers are actually building with this software, there's alert fatigue, which reduces how much these agents can be trusted, and that means bugs are going to go to prod.
So how do you clean up some of the vibes? We did this large benchmark, we've been doing this for months, and we have practical tips for when you're working in your IDE with these agents side by side. The first thing to note is bug-focused rules. Every one of these agents has a rules-type file, and you want to provide scoped instructions there with additional detail on security issues, logical bugs, things like that. The second issue is context management. The biggest issue we saw with agents navigating codebases was that after a little bit of time they'd get confused: they would lose logical links to things they'd already read, and their ability to reason and draw connections across a codebase stumbled significantly.
Obviously, when it comes to finding bugs, this is a problem, because the most significant, real bugs are complex, multi-step processes nested deeply in codebases.
And then finally, thinking models rock. Thinking models were significantly better at finding bugs in a codebase. So whenever you're using something like Claude Code or Cursor, try to reach for thinking models; they are just significantly better at this problem.
Okay. So, I mentioned rules earlier, and there are some practical tips you can take away for improving your vibe coding. OWASP is, give or take, the world's most popular security authority for bugs. When you're creating your rules files, try to feed specific security information, like the OWASP Top 10, to the model. What you're doing here is biasing the model, so that when it's actually looking at your code, it's considering these things in the first place. We find that when you don't supply models with security or bug-related information, their performance is significantly lower than otherwise.
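As a rough illustration, a bug-focused rules file along these lines could embed the OWASP Top 10 directly. The file path and wording here are hypothetical, not from the talk; adapt them to your agent's rules format (Cursor rules, CLAUDE.md, and so on):

```
# Hypothetical example: .cursor/rules/security.md

When reviewing or writing code, explicitly check for OWASP Top 10
issue classes:
- Broken access control (missing auth checks, IDOR)
- Cryptographic failures (weak hashing, hardcoded secrets)
- Injection (SQL, command, template)
- Insecure design and security misconfiguration
- Vulnerable or outdated dependencies
- Identification and authentication failures
- Software and data integrity failures (unsafe deserialization)
- Security logging and monitoring failures
- Server-side request forgery (SSRF)

Also check for logical bugs: off-by-one errors, unchecked nulls,
race conditions, and broken error handling.
```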
Second, you're going to want to prioritize naming explicit classes of bugs in those rules. Don't say, "Hey Cursor, just try to find me some bugs in this repository." Say, "Hey Cursor, I want you to examine my repository for auth bypasses, prototype pollution, or SQL injection." You want to be explicit about this; it primes the models to look for these issues. And finally with rules, you want to require fix validation. Always tell the model: you have to write tests and get them to pass before this is coming into the codebase. You have to ensure it's actually fixed the bugs.
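Putting those two tips together, a rules snippet might look something like this; the wording is our illustration, not Bismuth's actual rules:

```
# Hypothetical bug-hunting rules

When asked to find bugs, look specifically for:
- Auth bypasses (routes or handlers missing permission checks)
- Prototype pollution (unsafe merges of user-controlled objects)
- SQL injection (string-built queries, unparameterized inputs)

When fixing a bug:
1. Write a failing test that reproduces the bug first.
2. Apply the fix.
3. Run the full test suite and confirm the new test passes
   before proposing the change for the codebase.
```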
More broadly, across the hundred repositories we benchmarked and the thousands of issues we've seen from many agents, structured rules eliminate the vague "check for bugs" requests that produce alert fatigue; instead, they prime agents for much higher-quality output.
So okay, context is key too. I mentioned that agents struggle significantly with cross-repo navigation and understanding. In fact, when a lot of these agents reach their context limits, they summarize or compact files down, and when that compaction happens, their ability to detect and understand bugs drops significantly. So it's actually on you, as the user in the IDE, to manage context more thoroughly for these agents. You want to make sure you're feeding diffs of the code that was changed to the agent; they're able to understand cause and effect better from that. And you want to make sure key files aren't being summarized or dropped out of the context window.
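A minimal sketch of what feeding a diff looks like in practice, assuming a git repo and an agent chat you can paste into; the branch names and prompt wording are illustrative, not from the talk:

```
# In your terminal (illustrative):
git diff main...HEAD > review.patch

# Then in the agent chat, paste review.patch along with:
Here is the diff for the change under review. Walk through the
cause and effect of each hunk: what behavior changes, which
callers are affected, and whether any auth, validation, or error
handling was removed. Flag anything that could introduce a bug.
```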
And one thing we found really effective in the benchmarking was asking agents to come up with a step-by-step component inventory of your code: have it index the classes, the variables, and how each is used across the codebase. Once it's done that inventory, it becomes much more able to find bugs.
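A prompt along these lines (the wording is our illustration, not a quote from the benchmark) might be:

```
Before looking for bugs, build a step-by-step component inventory
of this repository:
1. List each module and the classes/functions it defines.
2. For each class, list its key fields and invariants.
3. Note where each component is used elsewhere in the codebase.
Keep this inventory in view, and use it to trace data flow when
you search for bugs.
```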
So okay, thinking models rock. We saw across our benchmarking, basically implicitly, that thinking models were far more able to find bugs. If you go through their thought traces, you can actually see them expand across a few different considerations in the codebase, and when they find those considerations, they dive deeper into the chain of thought to track down those bugs. In practice, that means they find deeper bugs across the benchmark than non-thinking models are able to.
However, I still want to note that even with thinking models, there's a pretty significant limitation in their ability to look at a file holistically. We found, again over hundreds of repos and thousands of issues, that when agents were re-run, the top-line number of bugs found would remain the same, but the bugs themselves would change run to run. So agents are never really looking at a file holistically the way you or I would; there's high variability across runs. We think that's a very big limitation of current agents, by the way. As a consumer, you shouldn't have to run your agent 100 times to get the whole holistic bug breakdown. But that's still an in-progress problem. So: thinking models are more thorough, and they just perform better across the benchmark than other models were able to.
I'm going to quickly plug us. We're bismuth.sh. We create PRs automatically; we're linked into GitHub, GitLab, Jira, and Linear; we scan for vulnerabilities; and we provide reviews. We also have on-prem deployments, which I know is a big sticking point for people.
May your vibes be immaculate. If you scan this QR code, it'll take you to our site, where we have a link to the full benchmark with a breakdown of methodology and results. You can dive into the actual data itself and see the SM-100 benchmark there, along with our full data set and exploration, so you can understand just how good current agents actually are at finding and fixing bugs. Yep, I'm Ian Butler. Thank you so much.