Vision: Zero Bugs — Johann Schleier-Smith, Temporal

Channel: aiDotEngineer
Published at: 2025-11-24
YouTube video id: qLqttdO33UM
Source: https://www.youtube.com/watch?v=qLqttdO33UM
Please join me in envisioning a world
where software has zero bugs. Not just a
few bugs, but actually literally zero
bugs. Okay. Okay. Just bear with me now.
So for most people, let's just say
people who aren't software engineers,
bugs are actually just not a very big
part of their life. Period. Most of the
apps that we use on our phones, our
social media, our news, that stuff
pretty much works most of the time. The
camera works most of the time. Any of
those most popular apps, banking, they
work really well most of the time. So,
bugs are really not top of mind for most
people.
Now, anybody who makes software
is very familiar with a different world.
A world of constant stress about the
possibility of software errors creeping
into critical applications,
on-call responses to pagers, cloud
provider outages, the list goes on and
on. So there's a disconnect between what
most people are experiencing every day
in the world and the reality of making
software. Now I will say that even for
those of us who are not engineers,
the perils of broken software do crop up
from time to time.
Just yesterday, I took my seven-year-old
son to the mini golf place, and there
was just one reservation left.
Reservations were required.
And I dutifully whipped out my
smartphone,
snapped the QR code, went through the
process to grab the last reservation
spot, only to be told that it had been
grabbed by somebody else.
Well, I got to say I was very proud of
my son because most kids most of the
time would have probably melted and he
actually didn't. He handled it great.
And then can you imagine my surprise
when I checked my messages about 10
minutes later to find out that in fact
that last reservation slot had gone to
us. So we were thrilled. That roller
coaster journey still reinforces the
fact that bugs are real in the world and
they have real impact on real people
every day. Even if it is just a
momentary emotional swing for a
seven-year-old.
I'm Johann Schleier-Smith and today I'm going
to be talking to you about a vision of
zero bugs. Now I work at Temporal
Technologies. Temporal makes software
for durable execution, and it makes
software that's deployed to the cloud do
what it's supposed to do. But this talk
is not going to be about Temporal. There
are several other talks at the AI
Engineer Summit that do talk about
Temporal. My colleague Cornelia Davis
will be doing a workshop on Sunday. In
addition, Samuel Colvin from Pydantic
will be talking about building agents
that combine Temporal with Pydantic.
The push to build reliable software and
the vision of giving engineers time back
for innovation is tightly aligned with our
products at Temporal. However,
everything in this presentation is going
to be outside of the scope of our
current products. Let's return to the
vision of zero bugs.
There are quite a few objections, really
reasonable objections to this vision.
So, let's talk through them. First of
all, as we've started out saying,
incidents happen. Incidents happen
whether it's because of cloud outages or
problems with orders. They happen and
generally speaking, we pick ourselves up
and get through them. More broadly, the
world is imperfect and so a few software
bugs here and there might be okay. And
in fact, we already are solving for
reliability pretty well in many of the
situations where it matters. So maybe
software is good enough. Maybe we don't
need to push towards a vision of zero
bugs.
Here's another objection. You could give
perhaps good reasons, good theoretical
reasons even why eliminating all of the
bugs is just simply impossible, why
it's a preposterous idea. So you could
say there are millions of lines of code.
The code is just too big. We have too
much code and, as we know, as agents
generate more and more code, that
exacerbates the problem. It's all just
simply too complicated.
Furthermore, if we look at the
definition of a bug, it seems that the
specifications unavoidably have some
degree of ambiguity. I would say that
it's a bug whenever the way the program
works
does not match the end user's
expectations.
They don't care whether it was a problem
with a product specification or whether
the programmer forgot to check for a
null. It just doesn't matter, right? And
furthermore, unexpected things happen in
the real world. If we think about
control systems, for example, and there
is some aspect of the world that hasn't
been modeled correctly (you see this
frequently in the fears around the
capabilities of self-driving vehicles),
then, you could say, that just simply
can't be handled. It's hopeless.
Furthermore, we're going to talk about
some of the powerful techniques in
software verification, but we also know
and we can prove theoretically that
those have limits. There are problems
that are computationally intractable in
some cases.
Reason number three is economics. If you
have competitors who don't care much
about software quality and who will win
in the marketplace if you spend time on
it, then that reliable software may
never see the light of day. Also, you
might just say that the ROI just simply
isn't there for fixing every single bug.
Some of them maybe are just not so bad.
Maybe they have easy workarounds.
And finally, perhaps cynically,
some people think that there are
companies that are okay with shipping
buggy software because it helps them
sell support.
In this vision, this cynical and sad
vision of the world, the bugs win and
we'll never have bug-free software. Not
even close.
Now, I contend that there is hope.
And if we look, there are practices,
a whole slew of techniques that really
allow very reliable software. Let's look
at this example, which is the Airbus
A320. The control software for this
airplane was developed in the 1980s and
has been held up as a showcase for
reliability. There are in fact to this
date no serious incidents with Airbus
A320 aircraft that have been attributed
to problems with the software.
So what is their approach?
There are a bunch of ideas here that are
really pretty neat. So one of them is
N-version programming. The most critical
elements of the Airbus control system
were actually built with
different processors, say one x86 chip
from Intel, one Motorola processor,
different operating systems on that,
separate teams writing the software
providing a tremendous level of
redundancy against unexpected issues.
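
The idea of independently written versions guarding against each other can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three `clamp_*` functions and the `vote` helper are hypothetical names, majority voting is one common way to combine versions, and nothing here is taken from the actual Airbus design.

```python
from collections import Counter
from typing import Callable, Sequence


def clamp_a(x: int) -> int:
    # Version A: clamp to [0, 100] using min/max.
    return max(0, min(100, x))


def clamp_b(x: int) -> int:
    # Version B: written separately, using explicit branches.
    if x < 0:
        return 0
    if x > 100:
        return 100
    return x


def clamp_c(x: int) -> int:
    # Version C: contains a deliberate bug (forgets the upper bound).
    return 0 if x < 0 else x


def vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version and return the strict-majority answer."""
    results = [version(x) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError(f"no majority among results {results}")
    return value


if __name__ == "__main__":
    versions = [clamp_a, clamp_b, clamp_c]
    print(vote(versions, 150))  # the faulty version is outvoted
    print(vote(versions, 50))   # all versions agree
```

The point of the sketch is that a fault in one version only reaches the output if an independent team made the same mistake.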
They also used something called
specification-based design: tremendous
amounts of documentation, but also
documentation that could be analyzed in
order to understand
and make provable guarantees about the
behavior of the system and what the
software would do under a whole variety
of scenarios. They use independent
verification teams where the people
writing the code and the people checking
to make sure that the code had the
desired behavior were completely
separate teams. And they also used a
slew of defensive programming
techniques. So for example, not
allocating any memory at runtime. That's
all done statically. Not having
sophisticated exception handling. Just
keeping it really simple, very explicit
in the code, how any error conditions
are handled. And finally, static
analysis and verification. We'll talk
about those techniques more in just a
few minutes. So the mindset here is also
really important. The Airbus engineering
team had this idea of zero defect
tolerance of thinking of software as a
certified component that was engineered
to meet a certain specification just
like a turbine fan blade might be.
And they also had a system level
approach to reliability because when you
think about it with an airplane there
are all sorts of things that could go
wrong that need to be protected against.
It stands to reason that the decades of
experience engineering mission critical
mechanical systems crossed over into the
software development process and there's
a lot that we can learn from that.
So
core to the A320 was quality through
process. Now, I know for folks who are
banging out code, process is often times
the last thing that they want to think
about.
But as we're thinking about how agentic
coding works, thinking about how we keep
agents on the rails and doing what we
want them to do, process really is
something that we do want to think
about. There are quite a few steps to
the quality process. Many of these are
familiar to people who are writing
software today, say planning and
requirements. But there are also some
others that are a little bit different
like certification by an external
agency, maybe a regulator or the
government. Integration testing
becomes particularly important for an
airplane, where the software needs to
interact with a physical system. And as
we look ahead and think about where
things are going, the
software that we are going to have in
the future is interfacing more and
more with the physical world, so this is
something that is probably going to come
back. And the key thing too is that
there's a feedback process in refining
each of these processes and making sure
that it interfaces well with the steps
that come before and after.
The aerospace industry is particularly
rich in these examples of super super
reliable software being built. The
space shuttle is one, and really it's
quite stunning. The last three
versions of that software had 420,000
lines of code each, and the
result after inspection
was one error per version. Sadly,
some of the space shuttles have been
lost, but space shuttles have never been
lost to software problems. Over the last
11 versions, there were a total of 17
errors. And so, this is probably a
thousand times fewer bugs per line of
code than is typical in commercial
software. Another aerospace example is the
Curiosity rover. With a mission that
costs millions and with very little
ability to intervene once the system is
on Mars, it was critical to have a high
level of reliability. Now that said,
this software developed in the 2000s did
take a bit of a different approach that
really shows the evolution of reliable
systems. So, for example, while
redundant systems were used, they were
actually identical systems, and a
commercial off-the-shelf real-time
operating system was used
rather than a custom operating system.
Now, aerospace isn't the only industry
where high assurance software, high
quality software, software with
effectively zero bugs, has been
critical. So whether it's in the
chemical industry or the automotive
industry, medical software, nuclear
power industry or security systems, each
of these provides us with an opportunity
to learn something.
Let's take a moment to shift gears a
little bit. Let's look at the advances
in computer science that really set the
foundation for how reliable software is
built today. And in fact, as we look at
these, we'll find that they really are
the foundation for really all software
that's built today.
The biggest of these is high-level
languages. Here we go back to the 1950s
and 1960s.
From that period,
when people were mostly writing in
assembly language, up through the 1980s,
when assembly language more or
less went out of favor as a language
that people would write by hand and
was replaced by machine code
generated by machines for machines,
there was about a 5 to 10x productivity
gain.
And
the core idea with high-level languages is
around abstraction.
It's around data abstraction so that
instead of poking at memory locations,
you work with data structures that have
some relevance in the problem domain.
And it's about structured programming
which we'll talk about in a minute. At
the end of the day though, what is sort
of a unifying concept here is preserving
the essential complexity, which is those
aspects of the problem that are directly
relevant to whatever it is that the
software is supposed to do, and removing
as much as possible from the code those
aspects of the problem that have
something to do with the implementation
that have something to do with the
machine underlying that runs the code
like what its registers are or how you
lay out or access the memory or even
many aspects of the performance of that
machine. Structured programming, as
espoused by Edsger Dijkstra, was one of the
really big advances, coming in the 1960s
and being broadly accepted in the 1970s.
Today, programmers can be excused for
having forgotten about the debates about
whether goto statements were a useful
programming tool or something that
should be avoided at all costs. The
programming languages that we use today
largely don't have goto statements.
What is structured programming all
about? It's really quite simple. You
have a set of basic control structures.
So these are things like sequences
(statements that come one after the
other), selection (if-then-else), and
iteration, concepts that are completely
familiar to any programmer today. But
what's really important about structured
programming, versus what came before,
where people were modeling applications
in terms of flowcharts and had these
nonstructured concepts like gotos where
you could really jump around throughout
a program, was enabling this sort of
compositional reasoning and eliminating
spaghetti code in many cases. You could
still write spaghetti code, of course,
with structured programs, but if you look
at Fortran code and try to
understand it (go for it, it's a fun
time), you'll find that it's really
very different. So this hierarchical
decomposition of programs really
mitigates complexity. It allows
programmers to focus on one piece of the
code at a time. When you have LLMs
generating the code, this is just as
valuable as it was for the programmers
who were writing code decades ago.
Another
key idea that traces back to the 1970s
is David Parnas's
push to think about software systems in
terms of modules. What does modularity
mean? It's perhaps best known as an
aspect of object-oriented programming,
but it applies in a whole bunch of
situations, and you can
have modularity without object-oriented
programming. Libraries are one of the
obvious examples. And so when we think
about verifying a program, when we think
about making sure that that program does
what it's supposed to do, whether we're
verifying it as a person or as an LLM or
using some sort of formal verification
technique, modularity is a massive
boost. As you chain modules together,
the verification effort scales
subexponentially, perhaps even linearly,
rather than exponentially, because you
can apply local reasoning at every
level.
And the upshot of that is that you have
manageable complexity
regardless of the size of the system.
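
A rough way to see the scaling argument is to count cases. The model below is a deliberate simplification and an assumption of this sketch, not something from the talk: each module has a fixed number of internal states, a monolithic check must consider every combination, and a modular check looks at each module on its own behind its interface.

```python
def monolithic_cases(modules: int, states_per_module: int) -> int:
    # Without module boundaries, every combination of states is in play.
    return states_per_module ** modules


def modular_cases(modules: int, states_per_module: int) -> int:
    # With local reasoning, each module is checked once, independently.
    return modules * states_per_module


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, monolithic_cases(n, 4), modular_cases(n, 4))
```

Real systems fall somewhere in between, since interfaces leak and modules interact, but the gap between the two columns is the reason modularity helps any verifier, human, LLM, or theorem prover.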
You take that spaghetti and you turn it
into something that is very nicely
organized. I want to take a moment here
to reflect on why LLMs are not simply
generating machine code rather than
high-level language code. It's certainly a
reasonable question and I think that the
reasons that applied to human
programmers decades ago are just as
applicable to LLMs today. So for one
thing, we know that context is limited.
The context for an LLM, the context
window might be a lot larger than what a
human is able to hold in their head. It
depends a little bit on how you count
that context. Certainly, we have a lot
of awareness of background facts that
we've sort of compressed into our brains.
But
context is definitely a scarce resource
for LLMs just like attention and ability
to reason perhaps call it working memory
is a scarce resource for people. The
argument for libraries is as strong
today as it ever was. So while you could
make the argument, oh why don't we just
let the AI generate all the code for the
libraries since it's fast and cheap.
Maybe we can customize it to the needs
of our specific application.
Getting that code properly tested,
properly verified is going to be a huge
challenge. And so we really want the
ability to use reliable, trusted
components and modules to build our
systems. On that note, I do need to put
in a little pitch for Temporal. What
Temporal allows you to do is it allows
you to abstract away the reliability of
your software in the cloud. It provides
durable execution, which means that it's
shipping that reliability problem to a
separate piece of code that's outside of
your application that your application
doesn't need to worry about. Let's now
go ahead and dive in on the fun part,
which is formal methods. And I want to
shoot straight to a few demos. Now, in
these demos, I'm going to be using the
Dafny language. What Dafny allows you
to do is use a custom
programming language that compiles
to a whole variety of other
languages, whether it's JavaScript,
Python, or C#, you name it. What
Dafny also allows you to do is
put proofs in line with your
code,
allowing theorem-proving software to
come along and verify that that code
does exactly what you said you wanted it to
do. Okay. So I have a program here that
is written in the Dafny language and it
has one function. It's called a method
here and it does something very simple.
It does an index-of: what it's going
to do is search an array
to find the index of a particular number,
and I can write a number of assertions
about this. The array length is greater
than zero. The number returned in that
result is either negative 1 if it's not
found or some number that is less
than the length of the array, and so
forth. What I can now do is I can just
go ahead and I can run the Dafny
verifier on that program.
Great, no bugs.
Let's go ahead and generate a Python
program that exercises this
functionality. And we can see that the
program first verifies before it runs.
So I know that all of those assertions
that are proven about the program have
been checked before that program runs.
This is an extremely powerful technique
because it spits out a Python library.
It's something that can be integrated
into your code. Now suppose I come over
here and I make a small change to the
algorithm, which is to say I've
introduced a bug.
If I now go back and I try to run that
again, the verifier steps in and throws
an error and we are saved from seeing
that bug.
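
The demo code itself is not reproduced in this transcript. As a rough Python analogue, the sketch below writes the same kind of contract as runtime assertions. It is an assumption-laden illustration: the names `index_of`, `buggy_index_of` and `check_postconditions` are hypothetical, and runtime asserts only catch a bug on the inputs you actually run, whereas a verifier such as Dafny proves the conditions for all inputs before the program runs.

```python
from typing import Sequence


def check_postconditions(items: Sequence[int], target: int, result: int) -> None:
    # The result is -1 (not found) or a valid index into the array.
    assert result == -1 or 0 <= result < len(items), "result out of range"
    if result == -1:
        assert target not in items, "reported missing but target is present"
    else:
        assert items[result] == target, "index does not point at target"


def index_of(items: Sequence[int], target: int) -> int:
    # Precondition described in the talk: the array is non-empty.
    assert len(items) > 0, "precondition: array must be non-empty"
    result = -1
    for i, value in enumerate(items):
        if value == target:
            result = i
            break
    check_postconditions(items, target, result)
    return result


def buggy_index_of(items: Sequence[int], target: int) -> int:
    # Same algorithm with a small, deliberate off-by-one bug.
    assert len(items) > 0, "precondition: array must be non-empty"
    result = -1
    for i, value in enumerate(items):
        if value == target:
            result = i + 1  # bug: should be i
            break
    check_postconditions(items, target, result)
    return result


if __name__ == "__main__":
    print(index_of([3, 5, 7], 5))
    print(index_of([3, 5, 7], 4))
    try:
        buggy_index_of([3, 5, 7], 7)
    except AssertionError as error:
        print("caught:", error)
```

The contract is the valuable part: once it is written down, either a theorem prover or a test run has something precise to check the code against.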
All right, let's return to the
presentation here. So, one thing to keep
in mind is that verification is only as
good as the specification. If I leave
out anything that needs to be checked,
that creates an opportunity for bugs.
So I want to emphasize that in the last
few decades formal methods have become
commercially relevant on a really
impressive scale. For example, the seL4
microkernel is a fully verified
operating system. It's a simple
operating system typically used for
embedded systems and security-critical
applications, but it is an operating
system. The CompCert C compiler, again
often used in security-critical
applications as well as in the aviation
industry, is a fully verified
compiler. That is to say that formal
methods have been used to ensure that
the code that that compiler emits given
a C program does exactly what that C
program is supposed to do. Project
Everest works on libraries for
cryptography, including libraries that
are widely deployed today, protecting
internet traffic. And really
impressively in the microprocessor space
now for several decades, formal methods
have been used to ensure the
correctness of those designs. There has
been just a huge motivation to make sure
that these systems are performing as
expected. And one of the things that's
really really cool is that there has
been just tremendous progress in terms
of the size and speed with which
verification can be performed over the
last sort of 20 plus years. And this
really coincides with the rise of
benchmarks. Benchmarks can have a
tremendous role in shaping an industry.
It gives folks something to focus on.
And so we can see that success rates for
the benchmarks have gone from the 30%ish
range up to nearly 100% while at the
same time the runtime on those
benchmarks has gone down by a factor of
50 or more. So there are a handful of
verification tools that you can use
today and I want to break them down in a
few different categories. Static
verification is probably that which you
are most familiar with. I'm starting
from the bottom here. If you are using
type systems, that is a simple form of
static verification, but there are ways
to attach more checks to the type
system. Jumping up to the top, we just
saw Dafny run. SPARK Ada is another
example of tight coupling between the
theorems and the code. And then there are
other systems that are also well known,
Lean for example, that provide
theorem proving separate from the code.
The problem there, while those tools are
super, super powerful, is that you do need
to make sure that what the code does and
what you have written in terms of the
proof are the same. Model checking deals
with finite state machines and proving
properties about those finite state
machines. Theorem proving on the other
hand doesn't have that limitation
because it is able to take advantage of
more powerful reasoning techniques,
automated reasoning techniques.
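
To make the model-checking idea concrete, here is a minimal sketch of an explicit-state checker in Python. Everything in it is an assumption made for illustration: the toy model (two clients competing for one reservation slot, loosely inspired by the mini golf story), the state encoding, and the function names. Production model checkers are far more sophisticated, but the core loop is the same: enumerate every reachable state and test an invariant in each.

```python
from collections import deque


def explore(initial, successors, invariant):
    """Breadth-first search of all reachable states.

    Returns the first state that violates the invariant, or None.
    """
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


# State: (slots_left, pc_a, pc_b).
# pc values: 0 = idle, 1 = saw a free slot, 2 = finished.
def racy_successors(state):
    # Check and book are separate steps, so clients can interleave.
    slots, *pcs = state
    for i, pc in enumerate(pcs):
        new_pcs = list(pcs)
        if pc == 0:
            new_pcs[i] = 1 if slots > 0 else 2
            yield (slots, *new_pcs)
        elif pc == 1:
            new_pcs[i] = 2
            yield (slots - 1, *new_pcs)


def atomic_successors(state):
    # Check and book happen in one indivisible step.
    slots, *pcs = state
    for i, pc in enumerate(pcs):
        if pc == 0:
            new_pcs = list(pcs)
            new_pcs[i] = 2
            yield (slots - 1 if slots > 0 else slots, *new_pcs)


def no_overbooking(state):
    return state[0] >= 0


if __name__ == "__main__":
    print(explore((1, 0, 0), racy_successors, no_overbooking))
    print(explore((1, 0, 0), atomic_successors, no_overbooking))
```

The racy model yields a reachable state with a negative slot count, which is the counterexample. The atomic model has no such state, so the search returns `None`.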
All right, let's get to the good stuff.
Agentic coding. Now, I wanted to give
you a set of really practical things
that you can try in your day-to-day work
to see what sorts of benefits you can
get. These are probably not things that
you're going to apply across the code
base, but when you're struggling to get
the agent to do what you want it to do
on a very specific piece of code, these
could all be pretty valuable. So some of
these are things that we are probably
reasonably well versed in:
detailed specifications, using typed
languages, writing modular code. These are
all sort of things that we pretty much
do anyway. But some things that we might
not do are interacting with the LLM and
asking it to do explicit risk analysis,
asking it to write safety cases, which
are statements about things that could
go wrong and how that thing that could
go wrong is being mitigated in the code.
So this is separate from formal methods.
This is sort of a more qualitative
reasoning which is something that we
know that LLMs can do. Another
inspiration that you can take is from
the design of high assurance systems
where they have separate teams do the
coding and the verification. That means
that you can have separate prompts to
the LLM for testing versus for
writing the code in the first place. And
if you want to take that to another
level, you can use multiple model
providers. So you can use one foundation
model for the tests and one foundation
model to write the code. You can bring
in those formal methods techniques to
give proofs around sections of critical
code. And lastly, this is sort of the
timeless advice, keeping your code
small, outsourcing those things that can
be to libraries which can be separately
tested, validated, developed, and now
your code doesn't need to worry about
it.
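
The separation between the team that writes the code and the team that checks it can be sketched as a small harness. This is a sketch under stated assumptions, not a real integration: `Model` is a hypothetical callable type, the two stub functions stand in for calls to two different model providers, and no actual LLM API is used. The tests are generated from the specification alone, without seeing the code.

```python
from typing import Callable

# A "model" here is any callable that maps a prompt to generated text.
Model = Callable[[str], str]

SPEC = "add(a, b) returns the sum of two integers."


def build_and_check(spec: str, author: Model, verifier: Model) -> bool:
    """Ask one model for code and a different one for tests, then run both."""
    code = author(f"Write a Python function for this spec:\n{spec}")
    tests = verifier(f"Write assert-based tests for this spec only:\n{spec}")
    namespace: dict = {}
    # Caution: run generated code in a sandbox in any real setting.
    exec(code, namespace)
    try:
        exec(tests, namespace)
    except AssertionError:
        return False
    return True


# Stubs standing in for two independent model providers (hypothetical).
def stub_author(prompt: str) -> str:
    return "def add(a, b):\n    return a + b\n"


def stub_buggy_author(prompt: str) -> str:
    return "def add(a, b):\n    return a - b\n"


def stub_verifier(prompt: str) -> str:
    return "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"


if __name__ == "__main__":
    print(build_and_check(SPEC, stub_author, stub_verifier))
    print(build_and_check(SPEC, stub_buggy_author, stub_verifier))
```

Because the verifier never sees the implementation, it cannot inherit the author's blind spots, which is the same reasoning behind independent verification teams.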
All right, let's talk for a minute about
software 3.0. So this is the idea
promoted by Andrej Karpathy that prompts
can really function as programs that
what we're doing today is we are
programming through AI through LLMs
and it's a new world of coding whether
that means that the LLM directly solves
whatever problem you need solved or
whether it generates code or perhaps
loops and uses tools or any combination
thereof in order to get to whatever
behavior you want for the system. This
opens up a tremendous need for new
assurance techniques. Right? Because LLMs
are fundamentally non-deterministic
and because the state space is
absolutely huge, all of the verification
techniques that we have discussed have
basically no bearing on this form of
software.
That said,
it's not all gloom and doom. And I am
really excited by the idea that despite
having new and different failure modes,
there are also potentially new forms of
resilience. LLMs can respond to
unanticipated inputs. They have that
ability to deal with ambiguity. And you
can imagine lots of architectures,
from pure agentic
architectures, as we often have today, to
ones that maybe invoke LLMs once certain
error conditions are encountered, that
are actually getting ahead of and
protecting the world from all kinds of
software faults, and perhaps doing it in
really simple and interesting ways. So I
think this is just a tremendously
interesting idea. All right,
let's get to cost. This is one of the
big topics. So, what does agentic code
cost? I vibe coded up this very simple
game. I spent about 2 minutes prompting.
We can set that aside. I'm not going to
count my time towards the cost. With GPT-5
Codex, it's consuming 600,000 input
tokens and 3.5 million cached
input tokens, plus 48,000 reasoning tokens,
and then it's returning 28,000 tokens. The
cost to generate this game was about $2.
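
The arithmetic behind a figure like this can be sketched as follows. The per-token prices are assumptions of this sketch, not numbers given in the talk: they were picked because they roughly reproduce the quoted total, and reasoning tokens are assumed to be billed at the output rate. Check your provider's current price list before relying on them.

```python
# Assumed prices in USD per million tokens (not stated in the talk).
ASSUMED_PRICES_PER_MILLION = {
    "input": 1.25,
    "cached_input": 0.125,
    "output": 10.00,
}

# Token counts as quoted in the talk.
USAGE = {
    "input": 600_000,
    "cached_input": 3_500_000,
    "reasoning": 48_000,
    "output": 28_000,
}


def cost_breakdown(usage: dict, prices: dict) -> dict:
    """Return the cost in USD for each token category."""
    rates = {
        "input": prices["input"],
        "cached_input": prices["cached_input"],
        "reasoning": prices["output"],  # assumption: billed like output
        "output": prices["output"],
    }
    return {kind: tokens * rates[kind] / 1_000_000 for kind, tokens in usage.items()}


if __name__ == "__main__":
    breakdown = cost_breakdown(USAGE, ASSUMED_PRICES_PER_MILLION)
    total = sum(breakdown.values())
    for kind, cost in breakdown.items():
        print(f"{kind:>13}: ${cost:.4f} ({cost / total:.0%})")
    print(f"{'total':>13}: ${total:.4f}")
```

Under these assumed prices the total comes to a little under $2, and most of it is input, cached input, and reasoning rather than the final output tokens.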
And the thing that's interesting here is
that the cost to generate the output
tokens
is only about 15%
of the overall cost. The rest of which
is going into the repeated use of input
tokens as tests are being run and the
reasoning tokens as well. As with
human-written code, the amount of time
that you spend actually writing the code
is a small fraction of the overall time
that's spent to build the software. All
right, so let's bring it down and let's
look at the cost of code. So for high
assurance code, if we look at something
like the space shuttle or the Airbus
example, the numbers there, if you take
1990 dollars from the space shuttle, were
about $1,000 per line of code. If you
translate that into 2025 dollars, it's probably
more like $2,500. And in some cases, so
for example, for security high assurance
software, numbers as high as $3,000 per
line of code have been quoted. For
typical software development, it's more
like $10 to $100 for real production
software, but nothing that is developed
with the high assurance techniques.
If you have low-cost
contractors, you may be able to bring
that number down as low as $1 to $10.
This is all without considering any AI
or agentic codegen. For the agentic
coding, I've put a pretty broad range
that includes just cheap models spinning
out code. It could probably go even
lower than this if they're not iterating
on it very much up to more expensive
models that are working harder to
generate that code. Regardless of how
you slice the numbers, you're looking at
a factor of at least a thousand,
probably about 10,000. If you set aside
the cost of the people involved in the
agentic coding, if you just look at that
agentic coding piece, that code is being
generated far more cheaply than typical
software. And this is interesting
because the gap between the cost of high
assurance code and typical software is
only about 100x.
So
if we extrapolate
we could conclude that agentic coding
has the potential to produce high
assurance software
100 times more cheaply than typical
software is produced today.
That leads us to the vision of zero
bugs.
Software reliability is a solved
problem. It's solved in aerospace. It's
solved in other critical industries.
And with the deployment of agents geared
towards achieving high assurance code,
whether that's because they're using
formal methods, because they have
extensive processes, because they're
using adversarial testing, the list goes
on and on. We can believe that agents
will make high assurance code 100 times
cheaper and that in this context we will
see a proliferation of bug-free
experiences. I also want to emphasize
that this push towards a vision of zero
bugs serves to address many of the
limitations that agentic coding has
today, notably around the quality of the
software that's written. When developers
choose not to use the agentic coding tools
that are at their disposal, the reason
for doing so typically is that it's just
going to take them more time to fix the
bugs in that software than it would
take them to write the software
correctly in the first place.
As soon as we can get to the point where
agentic coding is routinely generating
software that has fewer defects than
software written by humans,
we can expect absolute takeoff in its
adoption.
We know how to do that. We've known how
to do that for decades.
Before we close, I want to emphasize
that tardigrades are not bugs.
This is Ziggy. Ziggy is Temporal's
mascot, and Ziggy belongs to the phylum
Tardigrada,
not to the insects.
Tardigrades are some of the most
resilient animals in the world. They
have even been known to survive in outer
space. And earlier this year, we
actually took Ziggy to space just to
prove that point. We are having a lot of
fun here at Temporal building durable
execution as the reliable foundation for
modern software. If anything that we
discussed here today resonates with you,
please reach out. We'd love to chat and
explore how to work together in any
possible way.