On Engineering AI Systems that Endure The Bitter Lesson - Omar Khattab, DSPy & Databricks
Channel: aiDotEngineer
Published at: 2025-08-06
YouTube video id: qdmxApz3EJI
Source: https://www.youtube.com/watch?v=qdmxApz3EJI
[Music] So, thanks everyone for showing up, and thanks to the organizers for inviting me and having me here. I'm excited to talk to you all about engineering AI systems that endure the bitter lesson. So, I'm Omar; I guess the intro has already happened, so let's not repeat that. If you're here, it's probably because you engineer what we might call AI software, or you manage or work with people who do. It's not a term that has been used as a special thing this way for that long, so we're all trying to figure out what the right basics and fundamentals are here and which things are fleeting. That is largely what this talk will be about.
The name of the game, which is kind of a meme at this point, is that every week there's a new large language model. Maybe every week is actually too slow at this point. And each one changes something in terms of the trade-offs you can strike. It might not be the state of the art in quality, although sometimes it is, but maybe it's the best performance for a certain cost, or the best performance for certain types of applications, or maybe it's the speed that's really incredible. We've seen things like diffusion now. So every week there's a new LLM that you have to think about if you're engineering software in this space, which is really unusual. If you think back to normal software engineering, you change your hardware every two or three years, if that.
The other part that's a little bit weirder is that, if you're lucky, the LLM provider has recognized that they're not really building these LLMs. They're training them. The models emerge from a lot of nudging and data, from iterating on a lot of evals, and from a lot of vibes as well.
And they have realized, if you're lucky, that there are new quirks in their latest models that weren't there before. To the surprise of many people, to this day you still get longer and longer prompting guides for the latest models that are supposed to be closer and closer to AGI. If you're less lucky, you have to figure that out on your own, right? And if you're even less lucky, the prompting guides from the provider are not even that good, so you have to figure out what the right thing is yourself.
And every day, maybe at an even faster pace, someone is releasing an arXiv paper or a tweet that introduces a new learning algorithm, maybe some reinforcement learning bells and whistles, maybe some prompting tricks, maybe a prompt optimization technique, something or other that promises to make your system learn better and fit your goals better. Someone else is introducing search or scaling or inference strategies, or agent frameworks or agent architectures, that promise to finally unlock levels of reliability or quality better than what you had before. And I think if you're doing a reasonable job now, most likely you're scrambling every week. That's not if you're doing a bad job; that's if you're doing a good job, right? Because you're thinking, I've got to stay on top of at least some of this stuff so that I don't fall behind. And in many cases, model APIs actually change the model under the hood even though you're using the same name.
So you're actually forced to scramble. And I would say maybe the question isn't whether you will scramble every week. Maybe a different question is: will you even get to scramble for long? If you think about the rate of progress of these LLMs, are they going to eat your lunch? These are questions that are on a lot of people's minds, and this is what the talk is going to address.
The talk mentions the bitter lesson, which sounds like really ancient AI lore, but it's just six years old. This year's Turing Award winner, Rich Sutton, who's a pioneer of reinforcement learning, wrote this short essay on his website. It basically says that 70 years of AI have taught him, and from his perspective other people in the AI community, that when AI researchers leverage domain knowledge to solve problems like, I don't know, chess, we build complicated methods that essentially don't scale, we get stuck, and we get beaten by methods that leverage scale a lot better. What seems to work better, according to Sutton, is general methods that scale. He identifies two that work best: search, which is not retrieval but more like exploring large spaces, and learning, so getting the system to understand its environment, for example. Search here is what we'd call in LLM land inference-time scaling or something like that. I don't speak for Sutton, and I'm not suggesting that I have the right understanding of what he's saying or that I necessarily agree or disagree, but I think this is just a fundamental and important concept in the space. And it raises interesting questions for us as people who engineer AI systems, because if leveraging domain knowledge is bad, what exactly is AI engineering supposed to be about?
I mean, engineering is understanding your domain and working in it with a lot of human ingenuity, in repeatable ways, let's say, or with principles. So are we just doomed? Are we just wasting our time? Why are we at an AI engineering fair? I'll tell you how to resolve this. I've not really seen a lot of people discuss it, and a lot of people throw the bitter lesson around, so clearly somebody has to think about this, right?
Sutton is talking about maximizing intelligence, which is something like the ability to figure things out in a new environment really fast, let's say. All of us care about this to some degree. I'm also an AI researcher. But when we're building AI systems, I think it's important to remember that the reason we build software is not that we lack AGI. The way to understand this is that we already have general intelligence everywhere. We have eight billion of them. They're unreliable, because that's what intelligence is, and they've not solved the problems that we want to solve with software. That's why we're building software. So we program software not because we lack AGI but because we want reliable, robust, controllable, scalable systems, and we want these to be things that we can reason about and understand at scale. And actually, if you think about engineering and reliable systems, if you think about checks and balances in any case where you try to systematize things, it's about subtracting agency and subtracting intelligence in exactly the right places, carefully, and not restricting the intelligence otherwise. So this is a very different axis from the kinds of lessons that you would draw from the bitter lesson. Now, that does not mean the bitter lesson is irrelevant.
Let me tell you the precise way in which it's relevant. The first takeaway here is that scaling search and learning works best for intelligence. This is the right thing to do if you're an AI researcher interested in building agents that learn really well and really fast in new environments, right? Don't hard-code stuff at all, unless you really have to. But in building AI systems, it's helpful to think: well, sure, search and learning, but searching for what? What is your AI system even supposed to be doing? What is the fundamental problem that you're solving? It's not intelligence. It's something else. And what are you learning for? What is the system learning in order to do well? That is what you need to be engineering, not the specifics of search and not the specifics of learning, as I'll talk about in the rest of this talk.
So Sutton is saying that complicated methods get in the way of scaling, especially if you introduce them early, before you know what you're doing, essentially. Did you hear that before? I feel like I heard that back in the 1970s, although I wasn't around. This is the notion of structured programming, with Knuth stating his popular phrase in a paper: premature optimization is the root of all evil. I think this is the bitter lesson for software, and thereby also for AI software. So it's not that human ingenuity and human knowledge of the domain are harmful. It's that when you apply them prematurely, in ways that constrain your system and reflect poor understanding, they're bad. But you can't get away, in an engineering field, with not engineering your system. That would just be quitting, right?
So here's a little piece of code. If you follow me on X, on Twitter, you might recognize it, but otherwise I think it looks pretty opaque. I can't look at this for three seconds and tell exactly what it's doing.
And I also honestly don't really care. So, lo and behold, this is computing a square root in a certain floating-point representation on an old machine. The thing that jumps out at me immediately is that this is not the most future-proof program possible. If you change the machine architecture, with different floating-point representations and better CPUs, first of all, it'll be wrong, because it's just hard-coding some values here. And second of all, it'll probably be slower than a normal square root, which maybe is a single instruction, or maybe the compiler has a really smart way of doing it, or a lot of other things could be optimized for you, right? Whoever wrote this, maybe they had a good reason, maybe they didn't, but certainly if you're writing this kind of thing often, you're probably messing up as an engineer.
So premature optimization is maybe the square root of all evil or something, but what counts as premature? That's kind of the name of the game, right? We could just say the phrase, but on its own it doesn't mean anything. I don't think any strategy is guaranteed to work in tech. Nobody can anticipate what will happen in three years, five years, ten years. But I think you still have to have a conceptual model that you're working off of. And I happen to have built two things that are on the order of several years old and have fundamentally stayed the same over the years, from the days of BERT and text-davinci-002 up to o4-mini, and they're bigger now than they ever were. They're these stable, fundamental abstractions or AI systems around LLMs. So what gives? What has to happen for something like ColBERT or something like DSPy to endure a few years in this ecosystem, which is like centuries in AI land? I'll try to reflect on this, and again, none of this is guaranteed to be something that lasts forever.
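To make the square-root example above concrete, here is a minimal sketch of the contrast. It is not the code from the slide, which the transcript does not reproduce; it is a well-known bit-level approximation for IEEE-754 float32 values, written in Python for illustration. The function names are hypothetical, and the magic constant only makes sense under that one floating-point layout.

```python
import math
import struct


def sqrt_bit_hack(x: float) -> float:
    """Approximate sqrt(x) for x > 0 by editing IEEE-754 float32 bits.

    Tied to one number format: halving the bit pattern roughly halves the
    exponent, and the constant 0x1FC00000 re-centers the exponent bias.
    On any other representation this is simply wrong.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits = (bits >> 1) + 0x1FC00000
    (approx,) = struct.unpack("<f", struct.pack("<I", bits))
    return approx


def sqrt_by_intent(x: float) -> float:
    """State what you want and let the platform decide how to compute it."""
    return math.sqrt(x)


for value in (2.0, 4.0, 16.0):
    print(value, sqrt_bit_hack(value), sqrt_by_intent(value))
```

The hack happens to be exact for 4.0 and 16.0 but is several percent off for 2.0, and nothing in the code says "square root" unless a comment does. The second function carries the intent and stays correct when the hardware changes.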
So here's my hypothesis: premature optimization is what is happening if and only if you're hard-coding stuff at a lower level of abstraction than you can justify. If you want a square root, please just say, "Give me a square root." Don't start doing random bit shifts and bit manipulation that happens to appease your particular machine today. But actually, take a step back. Do you even want a square root, or are you computing something even more general? And is there a way you could express that more general thing? Only go down to a lower level of abstraction if you've demonstrated that a higher level of abstraction is not good enough.
So I think the bigger picture here is that applied machine learning, and definitely prompt engineering, has a huge issue. Tighter coupling than necessary is known to be bad in software, but it's not really something we talk about when we're building machine learning systems. In fact, the name of the game in machine learning is usually: hey, this latest thing came out, let's rewrite everything so that we're working around that specific thing. I tweeted about this 13 months ago, last May, in 2024, saying the bitter lesson is just an artifact of lacking good high-level ML abstractions. Scaling deep learning helps predictably, but after every paradigm shift, the best systems always include modular specializations, because we're trying to build software. We need those. And every time they basically look the same, and they should have been reusable, but they're not, because we're writing bad code.
So here's a nice example just to demonstrate this. It's not special at all. Here's a 2006 paper. The title could really have been that of a paper from today, right? A modular approach for multilingual question answering. And here's a system architecture.
It looks like your favorite multi-agent framework today, right? It has an execution manager. It has some question analyzers and retrieval strategists drawing from a bunch of corpora. If you colored the figure, you would think it's from a paper from last year or something. Now here's the problem. It's a pretty figure. The system, architecturally, is actually not that wrong. I'm not saying it's the perfect architecture, but in a normal software environment, you could just upgrade the machine, right? Put it on new hardware, put it on a new operating system, and it would just work. And it would work reasonably well, because the architecture is not that bad. But we know that's not the case for these ML architectures, because they're not expressed in the right way.
So I think I can express this most passionately against prompts. A prompt is a horrible abstraction for programming, and this needs to be fixed ASAP. I say for programming because it's actually not a horrible one for management. If you want to manage an employee or an agent, a prompt is reasonable; it's kind of like a Slack channel with a remote employee. If you want to be a pet trainer, working with tensors and objectives is a great way to iterate. That's how we build the models. But I want us to also be able to engineer AI systems, and I think for engineering and programming, a prompt is a horrible abstraction. Here's why.
It's a stringly typed canvas, just a big blurb, with no structure whatsoever, even if structure actually exists in a latent way. It couples and entangles the fundamental task definition you want to state, which is the really important stuff, the thing you're engineering, with some random, overfitted, half-baked decisions: hey, this LLM responded to this wording when I talked to it this way, or I put in this example to demonstrate my point and it kind of clicked for this model, so I'll just keep it in. And there's no way to really tell the difference between the fundamental thing you're solving and the random trick you applied. It's like the square root thing, except you don't call it a square root, and we just have to stare at it and go, wait, why are we shifting left by five bits or something?
You're also inducing the inference-time strategy, which is changing every few weeks, with people proposing new ones all the time, and you're baking it in, literally entangling it into your system. So if it's an agent, your prompt is telling it that it's an agent. Your system has no business knowing that it's an agent or a reasoning system or whatever. What are you actually trying to solve? It's as if you're writing a square root function and then you say, hey, here is the layout of the structs in memory. You're also talking about formatting and parsing: write in XML, produce JSON, whatever. Again, that's really none of your business most of the time. So you want to write a human-readable spec, but you're saying things like: do not ignore this, generate XML, answer in JSON, you are Professor Einstein, a wise expert in the field, I'll tip you $1,000. That is just not engineering, guys.
So what should we do? Trusty old separation of concerns, I think, is the answer. Your job as an engineer is to invest in your actual system design, starting with the spec. The spec, unfortunately or fortunately, cannot be reduced to one thing. And this is the point where I'll talk about evals. I know everyone hears about evals. This is the one line about evals that makes this a talk about evals.
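As a rough illustration of the separation of concerns described above, the sketch below keeps the task definition in one structured object and pushes formatting choices into interchangeable renderers. Everything here (`TaskSpec`, `render_json_style`, `render_xml_style`) is a hypothetical name invented for this example, not an API from the talk or from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """What the task is: the part that should survive a model change."""
    instruction: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def render_json_style(spec: TaskSpec, values: dict[str, str]) -> str:
    """One hypothetical adapter: asks for a JSON object with the output keys."""
    fields = "\n".join(f"{name}: {values[name]}" for name in spec.inputs)
    keys = ", ".join(spec.outputs)
    return f"{spec.instruction}\n{fields}\nReply with a JSON object with keys: {keys}"


def render_xml_style(spec: TaskSpec, values: dict[str, str]) -> str:
    """Another hypothetical adapter: wraps fields in XML-like tags instead."""
    fields = "\n".join(f"<{name}>{values[name]}</{name}>" for name in spec.inputs)
    tags = " ".join(f"<{name}>...</{name}>" for name in spec.outputs)
    return f"{spec.instruction}\n{fields}\nReply using these tags: {tags}"


spec = TaskSpec(
    instruction="Answer the question using only the given context.",
    inputs=("context", "question"),
    outputs=("answer",),
)
values = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
}
json_prompt = render_json_style(spec, values)
xml_prompt = render_xml_style(spec, values)
print(json_prompt)
print(xml_prompt)
```

The instruction never mentions JSON, XML, or any model-specific phrasing, so switching the output format means switching the renderer rather than editing the task definition.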
A lot of the time, you want to invest in natural language descriptions, because that is the power of this new framework. Natural language definitions are not prompts. They are highly localized pieces of ambiguous stuff that could not have been said in any other way. I can't tell the system certain things except in English, so I'll say them in English. But a lot of the time, I'm actually iterating to appease a certain model and make it perform well relative to some criteria I have, not telling it the criteria, just tinkering with things. Evals are the way to do this, because evals say: here's what I actually care about. Change the model, and the evals are still what I care about. It's a fundamental thing. Now, if you rely on evals alone to define the core behavior of your system, it will not learn it. Induction, learning from data, is a lot harder than following instructions. So you need to have both.
Code is another thing that you need. A lot of people say, "Oh, just ask it to do the thing." Well, who's going to define the tools? Who's going to define the structure? How do you handle information flow? Things that are private should not flow to the wrong places, right? You need to control these things. How do you apply function composition? LLMs are horrible at composition, because neural networks essentially don't learn things that reliably. Function composition in software is basically always perfectly reliable, by construction. So a lot of things are best delegated to code. But it's hard, and it's really important that you can juggle and combine these things, and you need a canvas that allows you to combine them.
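The point that evals stay fixed while models change can be sketched in a few lines. This is only an illustration under simple assumptions: the metric is exact match, the dataset is two toy examples, and `system_a` and `system_b` are stand-in functions playing the role of two different models or pipelines. None of these names come from the talk.

```python
from typing import Callable

Example = tuple[str, str]  # (question, expected answer)


def exact_match(predicted: str, expected: str) -> bool:
    """The criterion we care about, stated once, independent of any model."""
    return predicted.strip().lower() == expected.strip().lower()


def evaluate(system: Callable[[str], str], dataset: list[Example]) -> float:
    """Score any text-in, text-out system against the same fixed dataset."""
    hits = sum(exact_match(system(question), answer) for question, answer in dataset)
    return hits / len(dataset)


dataset: list[Example] = [("2 + 2", "4"), ("capital of France", "Paris")]


def system_a(question: str) -> str:
    """Stand-in for one model or pipeline."""
    return {"2 + 2": "4", "capital of France": "paris"}.get(question, "")


def system_b(question: str) -> str:
    """Stand-in for a different model that knows less."""
    return {"2 + 2": "4"}.get(question, "I don't know")


score_a = evaluate(system_a, dataset)
score_b = evaluate(system_b, dataset)
print(score_a, score_b)
```

Swapping the system changes the score, not the definition of success. The composition itself (`evaluate` calling `system` and then `exact_match`) is plain code, so it behaves the same way every time.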
Well, when you do this, the criterion for a good canvas is that it should allow you to express those three things in a way that's highly streamlined, and in a way that is decoupled and not entangled with the things that keep changing. Models are changing: I should just be able to hot-swap models. Inference strategies are changing: hey, I want to switch from a chain of thought to an agent, or from an agent to a Monte Carlo tree search, or whatever the latest thing that has come out is. I should be able to just do that. And new learning algorithms keep appearing. This is really important. We talked about learning, but learning is always happening at the level of your entire system if you're engineering it, or at least you've got to think about it that way, where you're saying: I want the whole thing to work as a whole for my problem, not for some general default. That's what the evals here are going to be doing. And you want a way of expressing this that allows you to do reinforcement learning, but also prompt optimization, or any of these things, at the level of abstraction that you're actually working with.
So the second takeaway is that you should invest in defining the things specific to your AI system and decouple from the lower-level swappable pieces, because they'll expire faster than ever. I'll conclude by telling you that we've built, and have been building for three years, this DSPy framework, which is the only framework that actually decouples your job, which is writing the AI software, from our job, which is giving you powerful, evolving toolkits for learning, for search, which is scaling, and for swapping LLMs through adapters. There's only one concept you have to learn. It's a new first-class concept, which we call signatures. If you learn it, you've learned DSPy.
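The idea of hot-swapping inference strategies and models can be sketched without any framework. To be clear, this is not DSPy's API; it is a plain-Python illustration with hypothetical names (`direct`, `reason_then_answer`, `stub_lm`), and the "model" is a stub that returns canned text so the example runs offline.

```python
from typing import Callable

LM = Callable[[str], str]        # any text-in, text-out model
Program = Callable[[str], str]   # question in, answer out

calls: list[str] = []            # records prompts so we can see what each strategy did


def stub_lm(prompt: str) -> str:
    """Stand-in model: canned replies, no network access."""
    calls.append(prompt)
    return "some notes" if prompt.endswith("Think step by step:") else "42"


def direct(task: str, lm: LM) -> Program:
    """Strategy 1: a single call that asks for the answer."""
    def run(question: str) -> str:
        return lm(f"{task}\nQuestion: {question}\nAnswer:")
    return run


def reason_then_answer(task: str, lm: LM) -> Program:
    """Strategy 2: one call for intermediate notes, a second for the answer."""
    def run(question: str) -> str:
        notes = lm(f"{task}\nQuestion: {question}\nThink step by step:")
        return lm(f"{task}\nQuestion: {question}\nNotes: {notes}\nAnswer:")
    return run


TASK = "Answer the arithmetic question."  # the task definition never changes

answers = {
    strategy.__name__: strategy(TASK, stub_lm)("6 * 7")
    for strategy in (direct, reason_then_answer)
}
print(answers, len(calls))
```

The task string and the calling code are identical across both strategies, and `stub_lm` could be replaced by any other callable with the same shape. A real framework would add typed fields, parsing, and optimizers on top, which this sketch leaves out.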
I'll unfortunately have to skip this because of the time for the other speakers, but let me give you a summary. I can't predict the future. I'm not telling you that if you do this, the code you write tomorrow will be there forever. But I'm telling you the least you can do. This is not the top level; it's just the baseline, I would say: avoid hand-engineering at a lower level of abstraction than today requires of you. That's the big lesson from the bitter lesson and from premature optimization being the root of all evil. Among your safest bets (they could all turn out to be wrong, I don't know) is that models are not going to read specs out of your mind anytime soon. I don't know if we'll ever figure that out. And they're not going to magically collect all the structure and tools specific to your application. So that's clearly stuff you should invest in. When you're building a system, invest in the signatures, which again you can learn about on the DSPy site, dspy.ai. Invest in essential control flow and tools, and invest in evals, by hand. And ride the wave of swappable models. Ride the wave of the modules we build; you just swap them in and out. And ride the wave of optimizers, which can do things like reinforcement learning or prompt optimization for whatever application you've built. All right, thank you everyone. [Music]