How to Improve your Vibe Coding — Ian Butler
Channel: aiDotEngineer
Published at: 2025-08-03
YouTube video id: g03m-WFEu1U
Source: https://www.youtube.com/watch?v=g03m-WFEu1U
My name is Ian. I'm the CEO of Bismuth. We're an end-to-end agentic coding solution, kind of like Codex. We've been working on evals for how good agents are at finding and fixing bugs for the last several months, and we dropped a benchmark yesterday discussing our results.

One thing to point out about agents currently is that they have a pretty low overall find rate for bugs, and they generate a significant number of false positives. You can see that something like Devin and Cursor have a less than 10% true positive rate for finding bugs. This is an issue when you're vibe coding, because these agents can quickly overrun your codebase with unintended bugs that they're not able to later find and fix. It's also worth noting, in needle-in-a-haystack terms, that when we plant bugs in a codebase, these agents struggle to navigate more broadly across larger codebases and actually find the specific bugs.

So here's the hard truth: three out of six agents on our benchmark had a 10% or less true positive rate out of 900-plus reports. One agent gave us 70 issues for a single task, and all of them were false. No developer is going to go through all of those, right? You're not going to sit there and try to figure out which bugs actually exist.

So: bad vibes. The implication is that the most popular agents are terrible at finding bugs. Cursor had a 97% false positive rate over 100-plus repos and 1,200-plus issues. The real-world impact is that when developers are actually building with this software, there's alert fatigue, which reduces trust in these agents, which means bugs are going to go to prod. So how do you clean up the vibes? We did this large benchmark, we've been doing this for months, and we have practical tips for when you're working in your IDE with these agents side by side.
So the first thing to note is bug-focused rules. Every one of these agents has a rules-type file. You want to provide scoped instructions with additional detail on security issues, logical bugs, things like that.

The second issue is context management. The biggest issue we saw with agents navigating codebases was that after a little while they'd get confused: they would lose the logical links to things they'd already read, and their ability to reason and draw connections across a codebase stumbled significantly. When it comes to finding bugs, this is obviously a problem, because most significant, real bugs are complex, multi-step processes nested deep in codebases.

And finally, thinking models rock. Thinking models were significantly better at finding bugs in a codebase. Whenever you're using something like Claude Code or Cursor, try to reach for thinking models; they are just significantly better at this problem.

Okay, so I mentioned rules earlier, and there are some practical tips you can take away for improving your vibe coding. OWASP is, give or take, the world's most popular security authority for bugs. When you're creating your rules files, try to feed specific security information like the OWASP Top 10 to the model. What you're doing here is biasing the model, so that when it actually looks at your code, it's considering these things in the first place. We find that when you don't supply models with security or bug-related information, their performance is significantly lower than otherwise. Second, you'll want to prioritize naming explicit classes of bugs in those rules. Don't say, "Hey Cursor, just try to find me some bugs in this repository." Say, "Hey Cursor, I want you to examine my repository for auth bypasses, prototype pollution, and SQL injection."
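As a concrete sketch of those rules tips, a bug-focused rules file might look something like the following. The filename, headings, and exact wording here are illustrative assumptions, not something prescribed in the talk (Cursor, for example, reads project rules from files such as `.cursorrules`):

```markdown
# Bug-focused rules (e.g. a .cursorrules or agent rules file)

## Security review scope
- Check changed code against the OWASP Top 10, in particular:
  - Injection (SQL injection, command injection)
  - Broken access control (auth bypasses, missing permission checks)
  - Prototype pollution and unsafe deserialization
- Flag logical bugs: off-by-one errors, inverted conditions,
  unhandled error paths, race conditions around shared state.

## Fix validation
- For every bug you fix, write a test that fails before the fix
  and passes after it.
- Do not report a bug as fixed until the full test suite passes.
```

The point is to replace a vague "check for bugs" request with scoped, named bug classes and an explicit validation requirement.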
Right? You want to be explicit about this; that primes the models to look for these issues. And finally, with rules, you want to require fix validation. Always tell the model it has to write tests and get them to pass before anything comes into the codebase, to ensure it has actually fixed the bugs. Across the hundred repositories we benchmarked and the thousands of issues we've seen from many agents, structured rules eliminate the vague "check for bugs" requests that produce alert fatigue; instead, they prime agents for much higher-quality output.

So okay, context is key too. I mentioned agents struggle significantly with cross-repo navigation and understanding. In fact, a lot of agents, when they reach their context limits, summarize or compact files down. When that compaction happens, their ability to detect and understand bugs drops significantly. So it's actually on you, as the user in the IDE, to manage context more thoroughly for these agents. Make sure you're feeding diffs of the changed code to the agent; they're better able to understand cause and effect from that. Make sure key files aren't being summarized or dropped out of the context window. And one thing we found really effective in the benchmarking was asking agents to come up with a step-by-step component inventory of your code: have it index the classes, the variables, and how they're used across the codebase. When it does that inventory, it becomes much better at finding bugs.

So okay, thinking models rock. We saw across our benchmarking, basically implicitly, that thinking models were far more able to find bugs.
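To make the component-inventory idea concrete, here is a minimal sketch of what such an inventory could look like for Python code, built with the standard-library `ast` module. The function name and the dictionary layout are assumptions for illustration; in practice you would ask the agent itself to produce an inventory like this for its own context.

```python
import ast

def component_inventory(source: str) -> dict:
    """Build a simple inventory of top-level classes (with their methods),
    functions, and module-level variables from Python source code."""
    tree = ast.parse(source)
    inventory = {"classes": [], "functions": [], "variables": []}
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            inventory["classes"].append({"name": node.name, "methods": methods})
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            inventory["functions"].append(node.name)
        elif isinstance(node, ast.Assign):
            # Record simple `NAME = value` module-level assignments.
            for target in node.targets:
                if isinstance(target, ast.Name):
                    inventory["variables"].append(target.id)
    return inventory

# Hypothetical module used only to demonstrate the inventory shape.
sample = """
MAX_RETRIES = 3

class PaymentService:
    def charge(self, amount): ...
    def refund(self, amount): ...

def validate(amount): ...
"""
print(component_inventory(sample))
```

An inventory like this gives the agent an explicit map of names and usage sites, which is the same effect the talk describes: the agent reasons over a structured index instead of re-reading (and re-summarizing) raw files.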
If you go through their thought traces, you can actually see them expand across a few different considerations in the codebase, and when they find those considerations, they dive deeper into the chain of thought to find those bugs. In practice, that means they find deeper bugs across the benchmark than non-thinking models do. However, even with thinking models, there's a pretty significant limitation in their ability to holistically look at a file. We found, again over hundreds of repos and thousands of issues, that when agents were re-run, the top-line number of bugs found would stay the same, but the bugs themselves would change run to run. So agents are never holistically looking at a file the way you or I would; there's high variability across runs. We think that's a very big limitation of current agents. As a consumer, you shouldn't have to run your agent a hundred times to get the whole holistic bug breakdown, but that's still an in-progress problem. So: thinking models are more thorough, and they just performed better across the benchmark than other models.

I'm going to quickly plug us. We're bismuth.sh. We create PRs automatically; we're linked into GitHub, GitLab, Jira, and Linear; we scan for vulnerabilities; we provide reviews; and we also have on-prem deployments, which I know is a big sticking point for people. May your vibes be immaculate. If you scan this QR code, it'll take you to our site, where we have a link to the full benchmark with a breakdown of the methodology and results, and you can dive into the actual data itself. You can see the SM100 benchmark there, along with our full dataset and exploration, so you can understand just how good current agents actually are at finding and fixing bugs. I'm Ian Butler.
Thank you so much.