Beyond the Prototype: Using AI to Write High-Quality Code - Josh Albrecht, Imbue
Channel: aiDotEngineer
Published at: 2025-07-25
YouTube video id: x_1EumTaXeE
Source: https://www.youtube.com/watch?v=x_1EumTaXeE
It's great to be here. So, I'm Josh Albrecht, the CTO of Imbue. Our focus is on making more robust, useful AI agents; in particular, we're focusing on software agents right now. The main product we're working on today is called Sculptor. The purpose of Sculptor is to help with something we've all experienced. We've all tried these vibe coding tools: you tell one to go off and do something, it goes off and creates a bunch of code for you, and voila, you're done, right? Well, not quite. At least today, there's a big gap between the stuff that comes back and what you want to ship to production, especially as you move away from prototyping into larger, more established codebases.

So today I'm going to go over some of the technical decisions that went into the design of Sculptor, our experimental coding agent environment, and walk through some of the context and motivations for the various ideas we've explored and the features we've implemented. It's still a research preview, so these features may change before we actually release it. But whether you're an individual using these tools or someone developing the tools yourself, I hope you'll find the learnings from our experiments useful.

If you're thinking about how to make coding agents better, there are a million different things you could build. You could build something that improves performance on really large context windows. You could make something cheaper or faster. You could make something that does a better job of parsing the outputs. But I don't think we should be building any of these things. What we really want to build is things that are much more specific to the use case, the problem domain, the thing we're really specialized in. Most of the things I just mentioned are going to get solved over the next, call it 3 to 24, months as models and coding agents get better. Just as you wouldn't want to build your own database, I don't think we should spend a lot of time on problems that are going to get solved for us. Instead, we want to focus on the part of the problem that really matters for our business.

At Imbue, the problem we're focusing on is basically this: what is wrong with this diff? You get a coding agent's output and it tells you, "Okay, I've added 59 new lines." Are those lines good? Right now you have an awkward choice between reviewing each of the lines yourself or just hitting merge and hoping for the best. Neither of those is a great place to be. So we try to give you a third option. The goal is to build user trust by letting another AI system come take a look and ask: are there any race conditions? Did you leave your API key in there? We want to leverage AI tools not just to generate the code but to help us build trust in that code. The way we think about it is identifying problems with the code, because if there are no problems, then it's probably high-quality code; that's essentially our definition of high-quality code.
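As a rough illustration of that third option, here is a minimal sketch of asking a model to review a diff before merging. The `git diff` invocation is real; `ask_llm` is a hypothetical stand-in for whatever LLM client you use, injected as a parameter, and the prompt wording is illustrative, not Sculptor's actual checks.

```python
# Minimal sketch: ask a model to review a diff before merging.
# `ask_llm` is a hypothetical stand-in for your LLM client, not a real API.
import subprocess
from typing import Callable

def review_diff(ask_llm: Callable[[str], str], base: str = "main") -> str:
    diff = subprocess.run(
        ["git", "diff", base],              # everything changed relative to `base`
        capture_output=True, text=True, check=True,
    ).stdout
    prompt = (
        "Review this diff. Flag race conditions, leaked secrets or API keys, "
        "and behavior changes with no tests. Reply 'LGTM' if you find nothing.\n\n"
        + diff
    )
    return ask_llm(prompt)
```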
If you think about it from an academic perspective, the way people normally measure software quality is by looking at the number of defects: how long does it take to fix a particular defect, or how many defects are caught by a particular technique? So that's the definition we're working from when we think about making high-quality software. And if we think about the software development process, what you want is to identify these problems as early as possible. That's why Sculptor does not work as a pull request review tool; that's much, much later in the process. Instead, we want something synchronous that gives you immediate feedback. As soon as you've generated that code, as soon as you've changed that line, you want to know whether something is wrong with it. That's easier both for you to fix and for the agent to fix.

So what are some ways you can prevent problems in AI-generated code? We'll go through four: learning, planning, writing specs, and having a really strict style guide. And we'll see how each manifests in Sculptor.

The first thing you want to do when using coding agents, if you're trying to prevent problems, is learn what's out there. We try to make this as easy as possible in Sculptor by letting you ask questions and have the agent do research: what technologies exist, how have other people solved similar problems, so that you don't end up reproducing work that's already out there.

Next, we want to think about how we can encourage people to start by planning. Here's a little example workflow: you kick off the agent to do something simple, say, implement a Scrabble solver, and change the system prompt to force the AI agent to first make a plan without writing any code at all. You wait a little while, it generates the plan, and then you change the system prompt again to say, okay, now we can actually write some code. We make it really easy to change these meta-parameters of the coding agent itself. Of course, you could just tell the agent to do that, but by changing its system prompt you force it to change its behavior in a much stronger way. And you can build up larger workflows by making customized agents: always plan first, then write the code, then run the checks, and so on (a minimal sketch of this plan-then-code loop appears at the end of this section).

Third, you want to think about writing specs and docs as a first-class part of the workflow. One of the main reasons I haven't normally written lots of specs and docs in the past is that it's annoying to keep them up to date, to spend all that time typing everything out when I already know what the code is supposed to be. But this is really important if you want coding agents to actually have context on the project you're working on, because they don't necessarily have access to your email, your Slack, and so on, and even if they did, they might not know exactly how to turn that into code. So in Sculptor, one of the ways we try to make this easier is by helping detect when the code and the docs have become outdated.
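Here is a minimal sketch of that plan-then-code workflow, under the assumption of a generic `run_agent(system_prompt, task)` call. The function name and the prompt texts are illustrative stand-ins, not Sculptor's actual API.

```python
# Sketch of a plan-first workflow via system prompt swaps.
# `run_agent` is a hypothetical stand-in for your coding-agent call.
from typing import Callable

PLAN_ONLY = (
    "You are a senior engineer. Produce a step-by-step implementation plan. "
    "Do NOT write any code."
)
CODE_NOW = "Implement the plan below exactly, then run the project's checks."

def plan_then_code(task: str, run_agent: Callable[[str, str], str]) -> str:
    plan = run_agent(PLAN_ONLY, task)                              # phase 1: plan only
    return run_agent(CODE_NOW, f"Task: {task}\n\nPlan:\n{plan}")   # phase 2: write code
```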
So it reduces the barrier to writing and maintaining documentation and docstrings, because now you have a way of automatically fixing the inconsistencies. It can also highlight inconsistencies, or parts of the specification that conflict with each other, making it easier to ensure your system makes sense from the very beginning.

And finally, you want to have a really strict style guide and try to enforce it. This is important even if you're doing regular coding without AI agents, just with other human software engineers. But one of the things that's special in Sculptor is that we make suggestions, which you can see towards the bottom here, that help keep the AI system on a reasonable path. Here it's highlighting that you could make this particular class immutable to prevent race conditions. This is something that comes from our style guide, where we encourage both the coding agents and our teammates to write in a more functional, immutable style to prevent certain classes of errors. We're also developing a style guide that's custom-tailored to AI agents, to make it even easier for them to avoid the most egregious mistakes they typically make.

But no matter how much you do to prevent the AI system from making mistakes in the first place, it's going to make some. And there are many things we can do to detect those problems and keep them out of production. We'll go through three here: first, running linters; second, writing and running tests; third, asking an LLM. We'll dig into each and see how it manifests in Sculptor.

First, running linters. There are many automated tools out there, like ruff, mypy, or pylint, that you can use to automatically detect certain classes of errors. In normal development this is somewhat obnoxious, because you have to go fix all these small errors that don't necessarily cause problems; it's a lot of churn and extra work. But one of the great things about AI systems is that they're really good at fixing these. So one of the things we've built into Sculptor is the ability for the system to detect these types of issues and automatically fix them for you, without you having to get involved. Another thing we've done is make these tools easy to use in practice, because a lot of them end up going unused. How many people here, show of hands, have a linter set up at all? Okay. How many people have zero linting errors in their codebase? Two. Great, we'll hire you. So it's not easy. What we've done in Sculptor is make the AI system understand which issues existed before it started and which existed after it ran, so you can at least prevent the AI system from creating new errors, even if you're not working in a perfectly clean codebase (a sketch of this before/after comparison follows below).

Second, testing. Why should you write tests at all? I was a pretty lazy developer for a long time and did not want to write tests, because it took a lot of effort: you have to maintain them, I already wrote the code, it works, okay? But one of the major objections to writing tests has disappeared now that we have AI systems. Generating tests is now so easy that you might as well write them, especially if you already have correct code.
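As a sketch of that before/after idea, here is one way to diff linter findings around an agent run, using ruff's JSON output. Sculptor's actual workflow is surely more involved; treat this as an illustration under those assumptions.

```python
# Sketch: record the linter baseline before the agent runs, then surface
# only the issues the agent introduced. Uses ruff's JSON output format.
import json
import subprocess

def ruff_issues() -> set[str]:
    out = subprocess.run(
        ["ruff", "check", ".", "--output-format", "json"],
        capture_output=True, text=True,
    ).stdout
    return {
        f"{i['filename']}:{i['code']}: {i['message']}"
        for i in json.loads(out or "[]")
    }

baseline = ruff_issues()           # before the agent touches anything
# ... the coding agent makes its changes here ...
for issue in sorted(ruff_issues() - baseline):
    print("introduced:", issue)    # only net-new problems, not pre-existing ones
```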
You can tell the agent: hey, just write a bunch of tests, throw out the ones that don't pass, and keep the rest. So there's no real reason not to write tests at all. And, as they say at Google, if you liked it, you should have put a test on it. This becomes much more important with coding agents. The reason is that you don't want your coding agent to change the behavior of your system in a way that you don't understand, don't expect, and don't want. At Google, this matters a lot for their infrastructure, because they don't want their site to crash when someone changes something. If you really care about the behavior of your system, you want to make sure it's fully tested.

So how do we actually write good tests? I'll go through several components of this.

First, write code in a functional style; by this I mean code that has no side effects. This makes it much, much easier for an LLM to understand whether the code is actually working. You really don't want to be running a test that has access to, say, your live Gmail environment, where a single mistake can delete all of your email. You want to isolate those kinds of side effects and focus most of the code on the functional transformations that matter for your program.

Second, write two different types of unit tests. Happy-path unit tests are the ones that show your code is working: it's happy, hooray, it worked. You don't need many of those, just a small number to show things work as you hope. The unhappy-path unit tests are the ones that find bugs, and here LLMs can be really, really helpful. Especially if you've written your code in a functional style, you can have the LLM generate hundreds or even thousands of potential inputs, see what happens to those inputs, and then ask the LLM: does that look weird? Often, when it says yes, that's a bug, and now you have a perfect test case replicating it (see the sketch just after this list of points).

Third, after you've written your unit tests, it can be a good idea to throw some of them away. This is a little counterintuitive. In the past, we took all this effort and spent all this time writing good unit tests, so we feel some aversion to throwing them away. But now that it's so easy to have an LLM regenerate the test suite from scratch, there's a good reason not to keep around too many unit tests of behavior you don't care much about. You might also want to refactor the generated ones into something slightly more maintainable. If you do keep them all around, they can confuse the LLM when you come back and change that behavior. So it's worth thinking about whether to keep the tests that were originally generated, whether to clean them up, how many of them to keep, and so on.
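Here is a minimal sketch of that unhappy-path loop under stated assumptions: `generate_inputs` and `looks_weird` stand in for LLM calls, and `fn` is the pure function under test; none of these names are real APIs.

```python
# Sketch of the unhappy-path loop: fuzz a pure function with many
# generated inputs, then let a model judge whether each output looks weird.
from typing import Callable

def find_suspicious_cases(
    fn: Callable[[str], str],                     # pure function under test
    generate_inputs: Callable[[int], list[str]],  # hypothetical: LLM-generated inputs
    looks_weird: Callable[[str, str], bool],      # hypothetical: LLM judge on (in, out)
    n: int = 500,
) -> list[tuple[str, str]]:
    suspicious: list[tuple[str, str]] = []
    for x in generate_inputs(n):
        try:
            y = fn(x)
        except Exception as exc:         # crashes are automatically suspicious
            suspicious.append((x, repr(exc)))
            continue
        if looks_weird(x, y):
            suspicious.append((x, y))    # each hit is a ready-made regression test
    return suspicious
```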
Fourth, you should probably focus on integration tests, as opposed to testing only the code-level functional behavior of your program. Integration tests are the ones that show your program actually works from the user's perspective: when the user clicks on this thing, does this other thing happen? AI systems can be extremely good at writing these, especially if you create nice test plans, where you write: when the user clicks the button to add the item to the shopping cart, then the item is in the shopping cart. You write that out, then you write the test. Then you write another test plan: when the user clicks to remove the item from the shopping cart, then it is gone. AI systems can almost always get this right, which lets you work at the level of meaning for your testing, and that can be much more efficient (a sketch of test-plan-driven tests appears after this section).

Fifth, think about test coverage as a core part of your testing suite. If you're having Claude Code write things for you, you don't just care whether the tests pass on their own; you also care whether there are enough tests in the first place. Think back to the original screenshot, where we get back a PR telling us how many lines have changed. If I tell you how many lines have changed, that's not very helpful. If I tell you how many lines have changed, and there's 100% test coverage, and all the tests pass, and something looked at the tests and thought they were reasonable, now you can probably click that merge button without quite as much fear.

And sixth, we try to make it easy to run tests in sandboxes, without secrets, as much as possible. This makes it a lot easier to actually fix things, and a lot easier to ensure you're not accidentally causing problems or creating flaky tests.

The third thing we can do to detect errors is ask an LLM. There are many different things we can check for: whether there are issues with your current change before you commit, whether the thing you're trying to do even makes sense, whether there are issues in the current branch you're working on, whether there are violations of rules in your style guide or architecture documents, whether details are missing from the specs, whether the specs aren't implemented or aren't well tested, or whatever other custom things you want to check for. One of the things we're trying to enable in Sculptor is for people to extend the checks we provide, so they can encode their own best practices and make sure they're continually checked.

After you've found issues, you have to fix them. Very little of this talk is about fixing issues, because it ends up being a lot easier for these systems to fix issues than you would expect. I think this quote captures it well: "A problem well-stated is half-solved." If you really understand what went wrong, it's much easier to solve the problem. This is especially true for coding agents, because really simple strategies work really well: even just trying multiple times, or trying a hundred times with a different agent, ends up working out quite well. One of the things that enables this is having really good sandboxing. If you have agents that can run safely, you can run an almost unlimited number in parallel, subject to cost constraints, and if any one of them succeeds, you can use that solution. And this is really just the beginning. There are going to be so many more tools released over the next year or two, and many of the people in this room are working on those tools.
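To make the test-plan idea concrete, here is a minimal sketch of two plan-driven tests written as pytest-style functions; the `ShoppingCart` class and its methods are hypothetical stand-ins for the real system under test.

```python
# Sketch: each test mirrors one line of a test plan. The ShoppingCart
# class here is an illustrative stand-in, not a real application.
class ShoppingCart:
    def __init__(self) -> None:
        self.items: set[str] = set()

    def add(self, sku: str) -> None:
        self.items.add(sku)

    def remove(self, sku: str) -> None:
        self.items.discard(sku)

# Plan: "When the user adds an item to the cart, the item is in the cart."
def test_added_item_is_in_cart() -> None:
    cart = ShoppingCart()
    cart.add("sku-123")
    assert "sku-123" in cart.items

# Plan: "When the user removes the item from the cart, it is gone."
def test_removed_item_is_gone() -> None:
    cart = ShoppingCart()
    cart.add("sku-123")
    cart.remove("sku-123")
    assert "sku-123" not in cart.items
```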
There will be things not just for writing code, like we've been talking about, but for after deployment: for debugging, logging, tracing, profiling, and so on. There are tools for automated quality assurance, where an AI system clicks around on your website and checks whether it can actually do the thing you want the user to do. There are tools for generating code from visual designs. There are tons of dev tools coming out every week. You'll have much better contextual search systems that are useful both for you and for the agent. And of course, we'll get better AI models as well. If anyone is working on these other sorts of tools, tools adjacent to the developer experience that help fix a much smaller piece of the process, we would love to work together and find a way to integrate them into Sculptor so that people can take advantage of them. I think what we'll see over the next year or two is that most of these things will become accessible, and the development experience will get a lot easier once all of them are working together.

So that's pretty much all I have for today. If you're interested, take a look at the QR code, go to our website at imbue.com, and sign up to try out Sculptor. And of course, if you're interested in working on things like this, we're always hiring and always happy to chat, so feel free to reach out. Thank you.