Beyond the Prototype: Using AI to Write High-Quality Code - Josh Albrecht, Imbue
Channel: aiDotEngineer
Published at: 2025-07-25
YouTube video id: x_1EumTaXeE
Source: https://www.youtube.com/watch?v=x_1EumTaXeE
It's great to be here. So, I'm Josh Albrecht, the CTO of Imbue. Our focus is on making more robust, useful AI agents; in particular, we're focusing on software agents right now. The main product we're working on today is called Sculptor. The purpose of Sculptor is to help with something we've all experienced. We've all tried these vibe coding tools: you tell one to go off and do something, it goes off and creates a bunch of code for you, and voila, you're done, right? Well, not quite. At least today, there's a big gap between the stuff that comes back and what you want to ship to production, especially as you move away from prototyping into larger, more established codebases.

So today I'm going to go over some of the technical decisions that went into the design of Sculptor, our experimental coding agent environment, and walk through some of the context and motivations for the various ideas we've explored and the features we've implemented. It's still a research preview, so these features may change before we actually release it. But whether you're an individual using these tools or someone developing the tools yourself, I hope you'll find the learnings from our experiments useful.

If you're thinking about how to make coding agents better, there are a million different things you could build. You could build something that improves performance on really large context windows. You could make something cheaper or faster. You could make something that does a better job of parsing the outputs. But I don't think we should be building any of these things. What we really want to build is things that are much more specific to the use case, the problem domain, the thing we're really specialized in. Most of the things I just mentioned are going to get solved over the next, call it 3 to 24, months as models and coding agents get better. Just as you wouldn't want to build your own database, I don't think we should spend a lot of time on problems that are going to get solved for us. Instead, we want to focus on the part of the problem that really matters for our business.

At Imbue, the problem we're focusing on is basically this: what is wrong with this diff? You get a coding agent's output and it tells you, "Okay, I've added 59 new lines." Are those lines good? Right now you have an awkward choice between reviewing each of the lines yourself or just hitting merge and hoping for the best. Neither of those is a great place to be. So we try to give you a third option. The goal is to build user trust by letting another AI system come take a look and ask: are there any race conditions? Did you leave your API key in there? We want to leverage AI tools not just to generate the code but to help us build trust in that code. The way we think about it is identifying problems with the code, because if there are no problems, then it's probably high-quality code; that's essentially our definition of high-quality code.
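As a rough illustration of that third option, here is a minimal sketch of asking a model to review a diff before merging. The `git diff` invocation is real; `ask_llm` is a hypothetical stand-in for whatever LLM client you use, injected as a parameter, and the prompt wording is illustrative, not Sculptor's actual checks.

```python
# Minimal sketch: ask a model to review a diff before merging.
# `ask_llm` is a hypothetical stand-in for your LLM client, not a real API.
import subprocess
from typing import Callable

def review_diff(ask_llm: Callable[[str], str], base: str = "main") -> str:
    diff = subprocess.run(
        ["git", "diff", base],              # everything changed relative to `base`
        capture_output=True, text=True, check=True,
    ).stdout
    prompt = (
        "Review this diff. Flag race conditions, leaked secrets or API keys, "
        "and behavior changes with no tests. Reply 'LGTM' if you find nothing.\n\n"
        + diff
    )
    return ask_llm(prompt)
```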
If you think about it from an academic perspective, the way people normally measure software quality is by looking at the number of defects: how long does it take to fix a particular defect, or how many defects are caught by a particular technique? So that's the definition we're working from when we think about making high-quality software. And if we think about the software development process, what you want is to identify these problems as early as possible. That's why Sculptor does not work as a pull request review tool; that's much, much later in the process. Instead, we want something synchronous that gives you immediate feedback. As soon as you've generated that code, as soon as you've changed that line, you want to know whether something is wrong with it. That's easier both for you to fix and for the agent to fix.

So what are some ways you can prevent problems in AI-generated code? We'll go through four: learning, planning, writing specs, and having a really strict style guide. And we'll see how each manifests in Sculptor.

The first thing you want to do when using coding agents, if you're trying to prevent problems, is learn what's out there. We try to make this as easy as possible in Sculptor by letting you ask questions and have the agent do research: what technologies exist, how have other people solved similar problems, so that you don't end up reproducing work that's already out there.

Next, we want to think about how we can encourage people to start by planning. Here's a little example workflow: you kick off the agent to do something simple, say, implement a Scrabble solver, and change the system prompt to force the AI agent to first make a plan without writing any code at all. You wait a little while, it generates the plan, and then you change the system prompt again to say, okay, now we can actually write some code. We make it really easy to change these meta-parameters of the coding agent itself. Of course, you could just tell the agent to do that, but by changing its system prompt you force it to change its behavior in a much stronger way. And you can build up larger workflows by making customized agents: always plan first, then write the code, then run the checks, and so on (a minimal sketch of this plan-then-code loop appears at the end of this section).

Third, you want to think about writing specs and docs as a first-class part of the workflow. One of the main reasons I haven't normally written lots of specs and docs in the past is that it's annoying to keep them up to date, to spend all that time typing everything out when I already know what the code is supposed to be. But this is really important if you want coding agents to actually have context on the project you're working on, because they don't necessarily have access to your email, your Slack, and so on, and even if they did, they might not know exactly how to turn that into code. So in Sculptor, one of the ways we try to make this easier is by helping detect when the code and the docs have become outdated.
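Here is a minimal sketch of that plan-then-code workflow, under the assumption of a generic `run_agent(system_prompt, task)` call. The function name and the prompt texts are illustrative stand-ins, not Sculptor's actual API.

```python
# Sketch of a plan-first workflow via system prompt swaps.
# `run_agent` is a hypothetical stand-in for your coding-agent call.
from typing import Callable

PLAN_ONLY = (
    "You are a senior engineer. Produce a step-by-step implementation plan. "
    "Do NOT write any code."
)
CODE_NOW = "Implement the plan below exactly, then run the project's checks."

def plan_then_code(task: str, run_agent: Callable[[str, str], str]) -> str:
    plan = run_agent(PLAN_ONLY, task)                              # phase 1: plan only
    return run_agent(CODE_NOW, f"Task: {task}\n\nPlan:\n{plan}")   # phase 2: write code
```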
So it reduces the barrier to writing and maintaining documentation and docstrings, because now you have a way of automatically fixing the inconsistencies. It can also highlight inconsistencies, or parts of the specification that conflict with each other, making it easier to ensure your system makes sense from the very beginning.

And finally, you want to have a really strict style guide and try to enforce it. This is important even if you're doing regular coding without AI agents, just with other human software engineers. But one of the things that's special in Sculptor is that we make suggestions, which you can see towards the bottom here, that help keep the AI system on a reasonable path. Here it's highlighting that you could make this particular class immutable to prevent race conditions. This is something that comes from our style guide, where we encourage both the coding agents and our teammates to write in a more functional, immutable style to prevent certain classes of errors. We're also developing a style guide that's custom-tailored to AI agents, to make it even easier for them to avoid the most egregious mistakes they typically make.

But no matter how much you do to prevent the AI system from making mistakes in the first place, it's going to make some. And there are many things we can do to detect those problems and keep them out of production. We'll go through three here: first, running linters; second, writing and running tests; third, asking an LLM. We'll dig into each and see how it manifests in Sculptor.

First, running linters. There are many automated tools out there, like ruff, mypy, or pylint, that you can use to automatically detect certain classes of errors. In normal development this is somewhat obnoxious, because you have to go fix all these small errors that don't necessarily cause problems; it's a lot of churn and extra work. But one of the great things about AI systems is that they're really good at fixing these. So one of the things we've built into Sculptor is the ability for the system to detect these types of issues and automatically fix them for you, without you having to get involved. Another thing we've done is make these tools easy to use in practice, because a lot of them end up going unused. How many people here, show of hands, have a linter set up at all? Okay. How many people have zero linting errors in their codebase? Two. Great, we'll hire you. So it's not easy. What we've done in Sculptor is make the AI system understand which issues existed before it started and which existed after it ran, so you can at least prevent the AI system from creating new errors, even if you're not working in a perfectly clean codebase (a sketch of this before/after comparison follows below).

Second, testing. Why should you write tests at all? I was a pretty lazy developer for a long time and did not want to write tests, because it took a lot of effort: you have to maintain them, I already wrote the code, it works, okay? But one of the major objections to writing tests has disappeared now that we have AI systems. Generating tests is now so easy that you might as well write them, especially if you already have correct code.
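As a sketch of that before/after idea, here is one way to diff linter findings around an agent run, using ruff's JSON output. Sculptor's actual workflow is surely more involved; treat this as an illustration under those assumptions.

```python
# Sketch: record the linter baseline before the agent runs, then surface
# only the issues the agent introduced. Uses ruff's JSON output format.
import json
import subprocess

def ruff_issues() -> set[str]:
    out = subprocess.run(
        ["ruff", "check", ".", "--output-format", "json"],
        capture_output=True, text=True,
    ).stdout
    return {
        f"{i['filename']}:{i['code']}: {i['message']}"
        for i in json.loads(out or "[]")
    }

baseline = ruff_issues()           # before the agent touches anything
# ... the coding agent makes its changes here ...
for issue in sorted(ruff_issues() - baseline):
    print("introduced:", issue)    # only net-new problems, not pre-existing ones
```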
You can tell the agent: hey, just write a bunch of tests, throw out the ones that don't pass, and keep the rest. So there's no real reason not to write tests at all. And, as they say at Google, if you liked it, you should have put a test on it. This becomes much more important with coding agents. The reason is that you don't want your coding agent to change the behavior of your system in a way that you don't understand, don't expect, and don't want. At Google, this matters a lot for their infrastructure, because they don't want their site to crash when someone changes something. If you really care about the behavior of your system, you want to make sure it's fully tested.

So how do we actually write good tests? I'll go through several components of this.

First, write code in a functional style; by this I mean code that has no side effects. This makes it much, much easier for an LLM to understand whether the code is actually working. You really don't want to be running a test that has access to, say, your live Gmail environment, where a single mistake can delete all of your email. You want to isolate those kinds of side effects and focus most of the code on the functional transformations that matter for your program.

Second, write two different types of unit tests. Happy-path unit tests are the ones that show your code is working: it's happy, hooray, it worked. You don't need many of those, just a small number to show things work as you hope. The unhappy-path unit tests are the ones that find bugs, and here LLMs can be really, really helpful. Especially if you've written your code in a functional style, you can have the LLM generate hundreds or even thousands of potential inputs, see what happens to those inputs, and then ask the LLM: does that look weird? Often, when it says yes, that's a bug, and now you have a perfect test case replicating it (see the sketch just after this list of points).

Third, after you've written your unit tests, it can be a good idea to throw some of them away. This is a little counterintuitive. In the past, we took all this effort and spent all this time writing good unit tests, so we feel some aversion to throwing them away. But now that it's so easy to have an LLM regenerate the test suite from scratch, there's a good reason not to keep around too many unit tests of behavior you don't care much about. You might also want to refactor the generated ones into something slightly more maintainable. If you do keep them all around, they can confuse the LLM when you come back and change that behavior. So it's worth thinking about whether to keep the tests that were originally generated, whether to clean them up, how many of them to keep, and so on.
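Here is a minimal sketch of that unhappy-path loop under stated assumptions: `generate_inputs` and `looks_weird` stand in for LLM calls, and `fn` is the pure function under test; none of these names are real APIs.

```python
# Sketch of the unhappy-path loop: fuzz a pure function with many
# generated inputs, then let a model judge whether each output looks weird.
from typing import Callable

def find_suspicious_cases(
    fn: Callable[[str], str],                     # pure function under test
    generate_inputs: Callable[[int], list[str]],  # hypothetical: LLM-generated inputs
    looks_weird: Callable[[str, str], bool],      # hypothetical: LLM judge on (in, out)
    n: int = 500,
) -> list[tuple[str, str]]:
    suspicious: list[tuple[str, str]] = []
    for x in generate_inputs(n):
        try:
            y = fn(x)
        except Exception as exc:         # crashes are automatically suspicious
            suspicious.append((x, repr(exc)))
            continue
        if looks_weird(x, y):
            suspicious.append((x, y))    # each hit is a ready-made regression test
    return suspicious
```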
Fourth, you should probably focus on integration tests, as opposed to testing only the code-level functional behavior of your program. Integration tests are the ones that show your program actually works from the user's perspective: when the user clicks on this thing, does this other thing happen? AI systems can be extremely good at writing these, especially if you create nice test plans, where you write: when the user clicks the button to add the item to the shopping cart, then the item is in the shopping cart. You write that out, then you write the test. Then you write another test plan: when the user clicks to remove the item from the shopping cart, then it is gone. AI systems can almost always get this right, which lets you work at the level of meaning for your testing, and that can be much more efficient (a sketch of test-plan-driven tests appears after this section).

Fifth, think about test coverage as a core part of your testing suite. If you're having Claude Code write things for you, you don't just care whether the tests pass on their own; you also care whether there are enough tests in the first place. Think back to the original screenshot, where we get back a PR telling us how many lines have changed. If I tell you how many lines have changed, that's not very helpful. If I tell you how many lines have changed, and there's 100% test coverage, and all the tests pass, and something looked at the tests and thought they were reasonable, now you can probably click that merge button without quite as much fear.

And sixth, we try to make it easy to run tests in sandboxes, without secrets, as much as possible. This makes it a lot easier to actually fix things, and a lot easier to ensure you're not accidentally causing problems or creating flaky tests.

The third thing we can do to detect errors is ask an LLM. There are many different things we can check for: whether there are issues with your current change before you commit, whether the thing you're trying to do even makes sense, whether there are issues in the current branch you're working on, whether there are violations of rules in your style guide or architecture documents, whether details are missing from the specs, whether the specs aren't implemented or aren't well tested, or whatever other custom things you want to check for. One of the things we're trying to enable in Sculptor is for people to extend the checks we provide, so they can encode their own best practices and make sure they're continually checked.

After you've found issues, you have to fix them. Very little of this talk is about fixing issues, because it ends up being a lot easier for these systems to fix issues than you would expect. I think this quote captures it well: "A problem well-stated is half-solved." If you really understand what went wrong, it's much easier to solve the problem. This is especially true for coding agents, because really simple strategies work really well: even just trying multiple times, or trying a hundred times with a different agent, ends up working out quite well. One of the things that enables this is having really good sandboxing. If you have agents that can run safely, you can run an almost unlimited number in parallel, subject to cost constraints, and if any one of them succeeds, you can use that solution. And this is really just the beginning. There are going to be so many more tools released over the next year or two, and many of the people in this room are working on those tools.
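To make the test-plan idea concrete, here is a minimal sketch of two plan-driven tests written as pytest-style functions; the `ShoppingCart` class and its methods are hypothetical stand-ins for the real system under test.

```python
# Sketch: each test mirrors one line of a test plan. The ShoppingCart
# class here is an illustrative stand-in, not a real application.
class ShoppingCart:
    def __init__(self) -> None:
        self.items: set[str] = set()

    def add(self, sku: str) -> None:
        self.items.add(sku)

    def remove(self, sku: str) -> None:
        self.items.discard(sku)

# Plan: "When the user adds an item to the cart, the item is in the cart."
def test_added_item_is_in_cart() -> None:
    cart = ShoppingCart()
    cart.add("sku-123")
    assert "sku-123" in cart.items

# Plan: "When the user removes the item from the cart, it is gone."
def test_removed_item_is_gone() -> None:
    cart = ShoppingCart()
    cart.add("sku-123")
    cart.remove("sku-123")
    assert "sku-123" not in cart.items
```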
There will be things not just for writing code, like we've been talking about, but for after deployment: for debugging, logging, tracing, profiling, and so on. There are tools for automated quality assurance, where an AI system clicks around on your website and checks whether it can actually do the thing you want the user to do. There are tools for generating code from visual designs. There are tons of dev tools coming out every week. You'll have much better contextual search systems that are useful both for you and for the agent. And of course, we'll get better AI models as well. If anyone is working on these other sorts of tools, tools adjacent to the developer experience that help fix a much smaller piece of the process, we would love to work together and find a way to integrate them into Sculptor so that people can take advantage of them. I think what we'll see over the next year or two is that most of these things will become accessible, and the development experience will get a lot easier once all of them are working together.

So that's pretty much all I have for today. If you're interested, take a look at the QR code, go to our website at imbue.com, and sign up to try out Sculptor. And of course, if you're interested in working on things like this, we're always hiring and always happy to chat, so feel free to reach out. Thank you.