Vision: Zero Bugs — Johann Schleier-Smith, Temporal
Channel: aiDotEngineer
Published at: 2025-11-24
YouTube video id: qLqttdO33UM
Source: https://www.youtube.com/watch?v=qLqttdO33UM
Please join me in envisioning a world where software has zero bugs. Not just a few bugs, but literally zero bugs. Okay, okay. Just bear with me now.
For most people, let's just say people who aren't software engineers, bugs are just not a very big part of their life. Period. Most of the apps that we use on our phones, our social media, our news, that stuff pretty much works most of the time. The camera works most of the time. The most popular apps, banking included, work really well most of the time. So bugs are really not top of mind for most people. Now, anybody who makes software is very familiar with a different world: a world of constant stress about the possibility of software errors creeping into critical applications, on-call responses to pagers, cloud provider outages. The list goes on and on. So there's a disconnect between what most people experience every day and the reality of making software.
Now, I will say that even for those of us who are not engineers, the perils of broken software do crop up from time to time. Just yesterday, I took my seven-year-old son to the mini golf place, and there was just one reservation left. Reservations were required. I dutifully whipped out my smartphone, snapped the QR code, and went through the process to grab the last reservation spot, only to be told that it had been grabbed by somebody else. Well, I've got to say I was very proud of my son, because most kids most of the time would probably have melted down, and he didn't. He handled it great. And then can you imagine my surprise when I checked my messages about 10 minutes later to find out that, in fact, that last reservation slot had gone to us. So we were thrilled. That roller coaster journey still reinforces the fact that bugs are real in the world and they have real impact on real people every day, even if it is just a momentary emotional swing for a seven-year-old.
I'm Johann Schleier-Smith, and today I'm going to be talking to you about a vision of zero bugs. I work at Temporal Technologies. Temporal makes software for durable execution: it makes software that's deployed to the cloud do what it's supposed to do. But this talk is not going to be about Temporal. There are several other talks at the AI Engineer summit that do talk about Temporal. My colleague Cornelia Davis will be doing a workshop on Sunday. In addition, Samuel Colvin from Pydantic will be talking about building agents that combine Temporal with Pydantic. The push to build reliable software and the vision of giving engineers time back for innovation are tightly aligned with our products at Temporal. However, everything in this presentation is going to be outside the scope of our current products.
Let's return to the vision of zero bugs. There are quite a few objections, really reasonable objections, to this vision. So let's talk through them. First of all, as we started out saying, incidents happen. Incidents happen, whether because of cloud outages or problems with orders. They happen, and generally speaking we pick ourselves up and get through them. More broadly, the world is imperfect, and so a few software bugs here and there might be okay. In fact, we are already solving for reliability pretty well in many of the situations where it matters. So maybe software is good enough. Maybe we don't need to push towards a vision of zero bugs.
Here's another objection. You could give good reasons, good theoretical reasons even, why eliminating all of the bugs is simply impossible, why it's a preposterous idea. You could say there are millions of lines of code. The code is just too big. We have too much code, and as agents generate more and more code, that exacerbates the problem. It's all simply too complicated.
Furthermore, if we look at the definition of a bug, it seems that specifications unavoidably have some degree of ambiguity. I would say that it's a bug whenever the way the program works does not match the end user's expectations. They don't care whether it was a problem with a product specification or whether the programmer forgot to check for a null. It just doesn't matter, right? And unexpected things happen in the real world. If we think about control systems, for example, there may be some aspect of the world that hasn't been modeled correctly. You see this frequently in the fears around the capabilities of self-driving vehicles. That, you could say, simply can't be handled. It's hopeless. We're also going to talk about some of the powerful techniques in software verification, but we know, and can prove theoretically, that those have limits. Some problems are computationally intractable.
Reason number three is economics. If you have competitors who don't care much about software quality, and who will win in the marketplace while you spend time on it, then that reliable software may never see the light of day. Also, you might say that the ROI simply isn't there for fixing every single bug. Some of them are maybe just not so bad. Maybe they have easy workarounds. And finally, perhaps cynically, some people think there are companies that are okay with shipping buggy software because it helps them sell support. In this cynical and sad vision of the world, the bugs win and we'll never have bug-free software. Not even close.
Now, I contend that there is hope. If we look, there are practices, a whole slew of techniques, that really do allow very reliable software. Let's look at this example, which is the Airbus A320. The control software for this airplane was developed in the 1980s and has been held up as a showcase for reliability.
There are, in fact, to this date no serious incidents with Airbus A320 aircraft that have been attributed to problems with the software. So what is their approach? There are a bunch of ideas here that are really pretty neat.
One of them is N-version programming. The most critical elements of the Airbus control system were actually built with different processors, say an x86 from Intel and a Motorola processor, with different operating systems on them and separate teams writing the software, providing a tremendous level of redundancy against unexpected issues. They also used something called specification-based design: tremendous amounts of documentation, but documentation that could be analyzed in order to understand and make provable guarantees about the behavior of the system and what the software would do under a whole variety of scenarios. They used independent verification teams, where the people writing the code and the people checking that the code had the desired behavior were completely separate. And they used a slew of defensive programming techniques. For example, not allocating any memory at runtime; that's all done statically. Not having sophisticated exception handling; just keeping it really simple and very explicit in the code how any error conditions are handled. And finally, static analysis and verification. We'll talk about those techniques more in just a few minutes.
The mindset here is also really important. The Airbus engineering team had this idea of zero defect tolerance, of thinking of software as a certified component engineered to meet a certain specification, just like a turbine fan blade might be. And they also had a system-level approach to reliability, because when you think about it, with an airplane there are all sorts of things that could go wrong that need to be protected against.
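The N-version idea described above can be sketched in a few lines. This is a minimal illustration in Python, not the A320's actual architecture: the three `clamp_*` functions, the `vote` helper, and the limit values are all hypothetical names and numbers chosen for the example. Three independently written versions of the same small specification (limit a command to a safe range) run on the same input, and a voter accepts a result only when a strict majority agree.

```python
from collections import Counter
from typing import Callable, Sequence


def clamp_v1(cmd: int, lo: int, hi: int) -> int:
    """Version 1: written with min/max."""
    return max(lo, min(cmd, hi))


def clamp_v2(cmd: int, lo: int, hi: int) -> int:
    """Version 2: written with explicit branches, as a second team might."""
    if cmd < lo:
        return lo
    if cmd > hi:
        return hi
    return cmd


def clamp_v3_buggy(cmd: int, lo: int, hi: int) -> int:
    """Version 3: deliberately buggy, it forgets the lower bound."""
    return cmd if cmd <= hi else hi


def vote(versions: Sequence[Callable[..., int]], *args: int) -> int:
    """Run every version and return the strict-majority result.

    Raises RuntimeError when no result has a strict majority, which
    stands in for handing control to a fallback.
    """
    results = [version(*args) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among results {results}")
    return value


# Hypothetical set of three versions used in the examples below.
VERSIONS = [clamp_v1, clamp_v2, clamp_v3_buggy]

if __name__ == "__main__":
    # The buggy version returns -50 here, but is outvoted by the other two.
    print(vote(VERSIONS, -50, -15, 30))
```

Here the buggy third version is outvoted whenever the command falls below the lower limit. With only two versions that disagree there is no majority, and the sketch raises an error rather than guessing.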
It stands to reason that the decades of experience engineering mission-critical mechanical systems crossed over into the software development process, and there's a lot that we can learn from that. So core to the A320 was quality through process. Now, I know that for folks who are banging out code, process is oftentimes the last thing they want to think about. But as we think about how agentic coding works, about how we keep agents on the rails and doing what we want them to do, process really is something that we do want to think about.
There are quite a few steps to the quality process. Many of these are familiar to people who are writing software today, say planning and requirements. But there are also some that are a little bit different, like certification by an external agency, maybe a regulator or the government. Integration testing becomes particularly important for an airplane, where the software needs to interact with a physical system. As we look ahead, the software that we are going to have in the future will interface more and more with the physical world, so this is something that is probably going to come back. The key thing, too, is that there's a feedback process for refining each of these steps and making sure that it interfaces well with the steps that come before and after.
The aerospace industry is particularly rich in examples of super reliable software being built. The space shuttle is one, and really it's quite stunning. Each of the last three versions of that software had 420,000 lines of code, and the result after inspection was one error per version. Sadly, some of the space shuttles have been lost, but space shuttles have never been lost to software problems. Over the last 11 versions, there were a total of 17 errors.
This is probably a thousand times fewer bugs per line of code than is typical in commercial software. Another aerospace example is the Curiosity rover. With a mission that costs millions and with very little ability to intervene once the system is on Mars, it was critical to have a high level of reliability. That said, this software, developed in the 2000s, did take a bit of a different approach that really shows the evolution of reliable systems. For example, while redundant systems were used, they're actually identical systems, and a commercial off-the-shelf real-time operating system was used rather than a custom operating system.
Now, aerospace isn't the only industry where high assurance software, high quality software, software with effectively zero bugs, has been critical. Whether it's the chemical industry, the automotive industry, medical software, the nuclear power industry, or security systems, each of these provides us with an opportunity to learn something.
Let's take a moment to shift gears a little bit. Let's look at the advances in computer science that set the foundation for how reliable software is built today. In fact, as we look at these, we'll find that they are the foundation for really all software that's built today. The biggest of these is high-level languages. Here we go back to the 1950s and 1960s. From that period, when people were mostly writing assembly language, up through the 1980s, when assembly language more or less went out of favor as a language that people would use, there was about a 5 to 10x productivity gain. Assembly was replaced by machine code generated by machines for machines. The core idea with high-level languages is abstraction. It's data abstraction, so that instead of poking at memory locations, you work with data structures that have some relevance in the problem domain.
It's also about structured programming, which we'll talk about in a minute. At the end of the day, though, the unifying concept here is preserving the essential complexity, which is those aspects of the problem that are directly relevant to whatever the software is supposed to do, and removing from the code as much as possible of the rest: those aspects that have to do with the implementation, with the underlying machine that runs the code, like what its registers are, how you lay out or access memory, or even many aspects of that machine's performance.
Structured programming, as espoused by Edsger Dijkstra, was one of the really big advances, coming in the 1960s and being broadly accepted in the 1970s. Today, programmers can be excused for having forgotten the debates about whether goto statements were a useful programming tool or something to be avoided at all costs. The programming languages that we use today largely don't have goto statements. What is structured programming all about? It's really quite simple. You have a set of basic control structures: sequence, statements that come one after the other; selection, if-then-else; and iteration. These concepts are completely familiar to any programmer today. But what's really important about structured programming, versus what came before, where people were modeling applications in terms of flowcharts and had unstructured constructs like gotos that let you jump around throughout a program, was enabling compositional reasoning and eliminating spaghetti code in many cases. You can still write spaghetti code with structured programs, of course. But if you look at Fortran code and try to understand it (go for it, it's a fun time), you'll find that it's really very different. So this hierarchical decomposition of programs really mitigates complexity.
It allows programmers to focus on one piece of the code at a time. When you have LLMs generating the code, this is just as valuable as it was for the programmers who were writing code decades ago.
Another key idea that traces back to the 1970s is David Parnas's push to think about software systems in terms of modules. What does modularity mean? It's perhaps best known as an aspect of object-oriented programming, but it applies in a whole bunch of situations, and you can have modularity without object-oriented programming. Libraries are one of the obvious examples. When we think about verifying a program, about making sure that the program does what it's supposed to do, whether we're verifying it as a person, as an LLM, or using some sort of formal verification technique, modularity is a massive boost. As you chain modules together, you get subexponential scaling, perhaps even linear scaling, rather than exponential scaling, because you can apply local reasoning at every level. The upshot is that you have manageable complexity regardless of the size of the system. You take that spaghetti and you turn it into something that is very nicely organized.
I want to take a moment here to reflect on why LLMs are not simply generating machine code rather than high-level language code. It's certainly a reasonable question, and I think that the reasons that applied to human programmers decades ago are just as applicable to LLMs today. For one thing, we know that context is limited. The context window for an LLM might be a lot larger than what a human is able to hold in their head. It depends a little bit on how you count that context. Certainly, we have a lot of awareness of background facts that we've sort of compressed into our brains.
But context is definitely a scarce resource for LLMs, just as attention and the ability to reason, call it working memory, is a scarce resource for people. The argument for libraries is as strong today as it ever was. You could make the argument: why don't we just let the AI generate all the code for the libraries, since it's fast and cheap? Maybe we can customize it to the needs of our specific application. But getting that code properly tested and properly verified is going to be a huge challenge. So we really want the ability to use reliable, trusted components and modules to build our systems. On that note, I do need to put in a little pitch for Temporal. Temporal allows you to abstract away the reliability of your software in the cloud. It provides durable execution, which means that it's shipping that reliability problem to a separate piece of code that's outside of your application, so your application doesn't need to worry about it.
Let's now go ahead and dive in on the fun part, which is formal methods. I want to shoot straight to a few demos. In these demos, I'm going to be using the Dafny language. Dafny is a custom programming language that generates output in a whole variety of other languages, whether it's JavaScript, Python, or C#, you name it. What Dafny allows you to do is put proofs in line with your code, so that theorem-proving software can come along and verify that the code does exactly what you said you wanted it to do.
Okay. I have a program here that is written in the Dafny language, and it has one function. It's called a method here, and it does something very simple: an index lookup. It searches an array to find the index of a particular number, and I can write a number of assertions about this. The array length is greater than zero.
The number returned in the result is either negative 1, if the number is not found, or some number that is less than the length of the array, and so forth. What I can now do is go ahead and run the Dafny verifier on that program. Great, no bugs. Let's go ahead and generate a Python program that exercises this functionality. We can see that the program first verifies before it runs. So I know that all of those assertions about the program have been proven before the program runs. This is an extremely powerful technique, because it spits out a Python library, something that can be integrated into your code. Now suppose I come over here and make a small change to the algorithm, which is to say I've introduced a bug. If I go back and try to run that again, the verifier steps in and throws an error, and we are saved from seeing that bug.
All right, let's return to the presentation. One thing to keep in mind is that verification is only as good as the specification. If I leave out anything that needs to be checked, that creates an opportunity for bugs.
I want to emphasize that in the last few decades formal methods have become commercially relevant on a really impressive scale. For example, the seL4 microkernel is a fully verified operating system. It's a simple operating system, typically used for embedded systems and security-critical applications, but it is an operating system. The CompCert C compiler, again often used in security-critical applications as well as in the aviation industry, is a fully verified compiler. That is to say, formal methods have been used to ensure that the code the compiler emits for a given C program does exactly what that C program is supposed to do. Project Everest works on libraries for cryptography, including libraries that are widely deployed today, protecting internet traffic.
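The index-search demo from earlier can be approximated in Python with runtime contract checks. This is a sketch under assumptions, not the Dafny program shown in the talk: `SpecViolation`, `ensure`, `check_index_spec`, `find_index`, and `find_index_buggy` are hypothetical names, and the postconditions are a guess at the spirit of the assertions described. Unlike a Dafny proof, which covers all inputs before the program runs, these checks only cover the inputs that are actually executed.

```python
from typing import Sequence


class SpecViolation(Exception):
    """Raised when a precondition or postcondition does not hold."""


def ensure(condition: bool, message: str) -> None:
    """Explicit check that, unlike `assert`, is never stripped by `python -O`."""
    if not condition:
        raise SpecViolation(message)


def check_index_spec(arr: Sequence[int], key: int, result: int) -> None:
    """Postconditions for an index search (assumed, modeled on the talk)."""
    ensure(-1 <= result < len(arr), "result must be -1 or a valid index")
    if result == -1:
        ensure(all(x != key for x in arr), "-1 returned but key is present")
    else:
        ensure(arr[result] == key, "returned index does not hold the key")


def find_index(arr: Sequence[int], key: int) -> int:
    """Linear search with the specification checked on every call."""
    ensure(len(arr) > 0, "precondition: array must be non-empty")
    result = -1
    for i, x in enumerate(arr):
        if x == key:
            result = i
            break
    check_index_spec(arr, key, result)
    return result


def find_index_buggy(arr: Sequence[int], key: int) -> int:
    """Same search with a deliberate off-by-one: the last element is skipped."""
    ensure(len(arr) > 0, "precondition: array must be non-empty")
    result = -1
    for i in range(len(arr) - 1):  # bug: should be range(len(arr))
        if arr[i] == key:
            result = i
            break
    check_index_spec(arr, key, result)
    return result


if __name__ == "__main__":
    print(find_index([3, 1, 4], 4))
    try:
        find_index_buggy([3, 1, 4], 4)
    except SpecViolation as err:
        print("caught:", err)
```

The buggy variant is only caught when the key happens to sit in the last position, which is the gap that static verification closes. Removing the "key is present" check from `check_index_spec` would let the bug through entirely, a small instance of verification being only as good as the specification.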
And really impressively, in the microprocessor space, formal methods have been used for several decades now to ensure the correctness of designs. There has been a huge motivation to make sure that these systems perform as expected. One of the things that's really cool is that there has been tremendous progress over the last 20-plus years in the size of what can be verified and the speed with which verification can be performed. This coincides with the rise of benchmarks. Benchmarks can have a tremendous role in shaping an industry. They give folks something to focus on. We can see that success rates on the benchmarks have gone from the 30% range up to nearly 100%, while at the same time the runtime on those benchmarks has gone down by a factor of 50 or more.
There are a handful of verification tools that you can use today, and I want to break them down into a few different categories. Static verification is probably the one you are most familiar with. I'm starting from the bottom here. If you are using type systems, that is a simple form of static verification, and there are ways to attach more checks to the type system. Jumping up to the top, we just saw Dafny run. SPARK Ada is another example of tight coupling between the theorems and the code. Then there are other well-known systems, Lean for example, that provide theorem proving separate from the code. The problem there, while those tools are super powerful, is that you need to make sure that what the code does and what you have written in the proof are the same. Model checking deals with finite state machines and proving properties about those finite state machines. Theorem proving, on the other hand, doesn't have that limitation, because it can take advantage of more powerful automated reasoning techniques.
All right, let's get to the good stuff: agentic coding.
Now, I wanted to give you a set of really practical things that you can try in your day-to-day work to see what sorts of benefits you can get. These are probably not things that you're going to apply across the whole code base, but when you're struggling to get the agent to do what you want on a very specific piece of code, they could all be pretty valuable.
Some of these are things that we are probably reasonably well versed in: detailed specifications, using typed languages, writing modular code. These are all things that we pretty much do anyway. But some things that we might not do are interacting with the LLM and asking it to do explicit risk analysis, or asking it to write safety cases, which are statements about things that could go wrong and how each of them is being mitigated in the code. This is separate from formal methods. It's a more qualitative kind of reasoning, which is something that we know LLMs can do.
Another inspiration you can take is from the design of high assurance systems, where separate teams do the coding and the verification. That means you can have separate prompts to the LLM for testing versus writing the code in the first place. If you want to take that to another level, you can use multiple model providers: one foundation model for the tests and another to write the code. You can bring in formal methods techniques to give proofs around sections of critical code. And lastly, the timeless advice: keep your code small, and outsource what you can to libraries, which can be separately tested, validated, and developed, so that your code doesn't need to worry about it.
All right, let's talk for a minute about Software 3.0.
This is the idea, promoted by Andrej Karpathy, that prompts can function as programs: what we're doing today is programming through AI, through LLMs, and it's a new world of coding, whether that means the LLM directly solves whatever problem you need solved, generates code, loops and uses tools, or any combination thereof in order to get the behavior you want from the system. This opens up a tremendous need for new assurance techniques. Because LLMs are fundamentally non-deterministic and because the state space is absolutely huge, all of the verification techniques that we have discussed have basically no bearing on this form of software. That said, it's not all gloom and doom. I am really excited by the idea that, despite having new and different failure modes, there are also potentially new forms of resilience. LLMs can respond to unanticipated inputs. They have that ability to deal with ambiguity. You can imagine lots of architectures, from pure agentic architectures, as we often have today, to ones that invoke LLMs only once certain error conditions are encountered, that actually get ahead of and protect the world from all kinds of software faults, perhaps in really simple and interesting ways. So I think this is a tremendously interesting idea.
All right, let's get to cost. This is one of the big topics. What does agentic code cost? I vibe coded this very simple game. I spent about 2 minutes prompting. We can set that aside; I'm not going to count my time towards the cost. I used GPT-5 Codex. It consumed 600,000 input tokens and 3.5 million cached input tokens, used 48,000 reasoning tokens, and returned 28,000 tokens. The cost to generate this game was about $2. The thing that's interesting here is that the cost of generating the output tokens is only about 15% of the overall cost.
The rest goes into the repeated use of input tokens as tests are run, and into the reasoning tokens. As with human-written code, the amount of time spent actually writing the code is a small fraction of the overall time spent building the software.
All right, so let's bring it down and look at the cost of code. For high assurance code, if we look at something like the space shuttle or the Airbus example, the numbers, in 1990 dollars for the space shuttle, were about $1,000 per line of code. If you translate that into 2025 dollars, it's probably more like $2,500. In some cases, for example for high assurance security software, numbers as high as $3,000 per line of code have been quoted. For typical software development, it's more like $10 to $100 per line for real production software, but not software developed with high assurance techniques. If you have low-cost contractors, you may be able to bring that number down as low as $1 to $10. This is all without considering any AI or agentic code generation.
For agentic coding, I've put a pretty broad range, from cheap models spinning out code, which could probably go even lower than this if they're not iterating on it very much, up to more expensive models that are working harder to generate that code. Regardless of how you slice the numbers, you're looking at a factor of at least a thousand, probably about 10,000. If you set aside the cost of the people involved and just look at the agentic coding piece, that code is being generated far more cheaply than typical software. And this is interesting because the gap between the cost of high assurance code and typical software is only about 100x.
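The cost arithmetic in this part of the talk can be made explicit. The sketch below reuses the token counts quoted for the game; the per-million-token prices and the choice to bill reasoning tokens at the output rate are assumptions made for illustration, not numbers from the talk. The `relative_cost` helper is likewise hypothetical, and it assumes that the roughly 100x overhead of high assurance work carries over unchanged to agent-written code.

```python
# Token counts quoted in the talk for the vibe-coded game.
TOKENS = {
    "input": 600_000,
    "cached_input": 3_500_000,
    "reasoning": 48_000,
    "output": 28_000,
}

# ASSUMED prices in USD per million tokens (illustrative, not from the talk).
ASSUMED_PRICE_PER_MILLION = {
    "input": 1.25,
    "cached_input": 0.125,
    "output": 10.00,
}


def token_costs(tokens: dict, prices: dict) -> dict:
    """Return the USD cost of each token category."""
    per_token = {name: usd / 1_000_000 for name, usd in prices.items()}
    return {
        "input": tokens["input"] * per_token["input"],
        "cached_input": tokens["cached_input"] * per_token["cached_input"],
        # Assumption: reasoning tokens are billed at the output rate.
        "reasoning": tokens["reasoning"] * per_token["output"],
        "output": tokens["output"] * per_token["output"],
    }


def relative_cost(assurance_overhead: float, agentic_discount: float) -> float:
    """Cost of agent-written high assurance code, with typical code as 1.0.

    assurance_overhead: how much more high assurance code costs than typical.
    agentic_discount: how much cheaper agentic code is than typical code.
    """
    return assurance_overhead / agentic_discount


costs = token_costs(TOKENS, ASSUMED_PRICE_PER_MILLION)
total = sum(costs.values())
output_share = costs["output"] / total

if __name__ == "__main__":
    for name, usd in costs.items():
        print(f"{name:>12}: ${usd:.2f}")
    print(f"{'total':>12}: ${total:.2f} (output share {output_share:.0%})")
    # 100x overhead against the 1,000x and 10,000x discounts from the talk.
    for discount in (1_000, 10_000):
        print(f"discount {discount}x -> relative cost {relative_cost(100, discount)}")
```

Under these assumed prices the total comes to a little under $2, with output tokens at roughly 14% of it, which is consistent with the figures quoted. With a 100x assurance overhead, a 10,000x agentic discount gives a relative cost of 0.01, and the lower 1,000x discount gives 0.1.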
So if we extrapolate, we could conclude that agentic coding has the potential to produce high assurance software 100 times more cheaply than typical software is produced today.
That leads us to the vision of zero bugs. Software reliability is a solved problem. It's solved in aerospace. It's solved in other critical industries. With the deployment of agents geared towards achieving high assurance code, whether because they're using formal methods, because they have extensive processes, or because they're using adversarial testing (the list goes on and on), we can believe that agents will make high assurance code 100 times cheaper, and that in this context we will see a proliferation of bug-free experiences. I also want to emphasize that this push towards a vision of zero bugs serves to address many of the limitations that agentic coding has today, notably around the quality of the software that's written. When developers choose not to use the agentic coding tools at their disposal, the reason typically is that it's going to take them more time to fix the bugs in that software than it would take to write the software correctly in the first place. As soon as we get to the point where agentic coding routinely generates software that has fewer defects than software written by humans, we can expect absolute takeoff in its adoption. We know how to do that. We've known how to do that for decades.
Before we close, I want to emphasize that tardigrades are not bugs. This is Ziggy. Ziggy is Temporal's mascot, and Ziggy belongs to the phylum Tardigrada and is not an insect. Tardigrades are some of the most resilient animals in the world. They have even been known to survive in outer space. Earlier this year, we actually took Ziggy to space just to prove that point. We are having a lot of fun here at Temporal building durable execution as the reliable foundation for modern software.
If anything that we discussed here today resonates with you, please reach out. We'd love to chat and explore how to work together in any possible way.