On Engineering AI Systems that Endure The Bitter Lesson - Omar Khattab, DSPy & Databricks

Channel: aiDotEngineer
Published at: 2025-08-06
YouTube video id: qdmxApz3EJI
Source: https://www.youtube.com/watch?v=qdmxApz3EJI
[Music]
So, thanks everyone for showing up and
uh thanks to the organizers for inviting
me and having me here. Um I'm excited to
talk to you all about uh engineering AI
systems that endure the bitter lesson.
So, I'm Omar, uh, I guess the intro has
already happened, so let's not repeat
that. So, I mean, if you're here, I
think it's probably because you engineer
what we might call AI software or you
might maybe manage people or work with
people that do. Um, it's not really a
term that's like been used as a very
special thing this way for that long.
So, we're all kind of trying to figure
out what are the right sort of uh basics
and fundamentals here and what are the
things that are fleeting. So this is
what the talk will be uh largely about
and you know like the name of the game
is which is kind of a meme at this
point. Every week there's a new large
language model. Maybe every week is
actually too slow at this point. Um that
actually changes something um in terms
of the trade-offs you can strike. It
might not be the state-of-the-art in
terms of the best quality necessarily
although sometimes it is. Um, but maybe
it's the best performance for certain
costs or it's the best performance for
certain types of applications or maybe
it's the, you know, the speed that's
really incredible. We've seen like
things like, you know, diffusion now.
Um, so every every week there's a new
LLM that you kind of have to think about
if you're engineering software in this
space, which is really unusual. Like if
you think back to normal software
engineering, you change your hardware
every two, three years maybe, um, if
that. So this is pretty unusual. Um, the
other part that's actually also a little
bit weirder is if you're lucky, the LLM
provider has recognized that they're not
really building these LLMs. They're
training them. They're emerging based on
a lot of nudging and data and iterating
on a lot of evals and a lot of vibes as
well. Um, and they have realized, you
know, if you're lucky, that there are
new quirks in their latest models that
weren't there before. And to the
surprise of many people, these days,
you know, you still get longer and
longer prompting guides for the latest
models that are supposed to be, you
know, closer and closer to AGI. Um, and
if you're less lucky, you have to figure
that out on your own, right? Um, if
you're even less lucky, the prompting
guides from the provider are not even
that good. So, you have to actually kind
of figure out what what the right thing
is. And every day uh maybe at an even
faster pace, someone is releasing an
arXiv paper or a tweet or something
that introduces a new learning
algorithm, maybe some reinforcement
learning uh bells and whistles, maybe
some prompting tricks, maybe a prompt
optimization, you know, technique,
something or the other that promises to
make your system learn better and sort
of fit your goals better. Someone else
is introducing some search or scaling or
inference strategies or agent frameworks
or agent architectures that are
promising to finally unlock levels of
reliability or quality better than what
you had before. And I think if you're
actually doing a reasonable job now,
most likely you're scrambling every
week. That's not if you're doing a bad
job, that's if you're doing a good job,
right? Because you're like, you know,
I've got to stay on top of at least some
of this stuff so that like I don't fall
behind. Um, and in many cases like you
know model APIs actually change the
model under the hood even though you
know you're using the same name. So
you're actually forced to scramble.
And actually I would say maybe the
question isn't whether you will scramble
every week maybe a different question is
will you even get to scramble for long
if you think about the rate of progress
of these LLMs like are they going to eat
your lunch right so these are I think
questions that are on a lot of people's
minds and this is what the talk is
going to be addressing. So the talk
mentions the bitter lesson, which
sounds like this, you know, really ancient
old kind of um AI lore, but it's just, you
know, six years old, uh where the current
year's uh Turing Award winner Rich Sutton
who's a pioneer of reinforcement
learning wrote this short essay uh on
his website basically that says 70 years
of AI has taught him and taught you know
other people in the AI community from
his perspective that when AI researchers
leverage domain knowledge to solve
problems like I don't know chess or
something we build complicated methods
that essentially don't scale and we get
stuck and we get beat by methods that
leverage scale a lot better. What seems
to work better according to uh
Sutton is general methods that scale
and he identifies search which is not
like retrieval more of like uh you know
exploring large spaces and learning um
so getting the system to kind of
understand its environment maybe for
example, um, work best. And search here is
what we'd call in LLM land maybe
inference-time scaling or something. So I
don't speak for Sutton and I'm not you
know suggesting that I have the right
understanding of what he's saying or
that I necessarily agree or disagree but
I think this is just a fundamental and
important kind of um concept in the
space. So I think it raises
interesting questions for us as people
who build uh you know engineer AI
systems because if leveraging domain
knowledge is bad what exactly is AI
engineering supposed to be about? I mean
engineering is understanding your domain
and working in it with a lot of human
ingenuity in repeatable ways let's say
or with principles. So like are we just
doomed like are we just wasting our
time? Why are we at an AI engineering
you know fair? And I'll tell you um how
to resolve this. I've not really seen a
lot of people discuss that, and a lot of people, you
know, throw the bitter lesson around. So
clearly somebody has to think about
this, right? Sutton is talking about
maximizing intelligence,
which is like something like the
ability to figure things out in a new
environment really fast, let's say. Um,
all of us kind of care about this to
some degree. I'm also an AI researcher.
Um, but when we're building AI systems,
I think it's important to remember that
the reason we build software is not that
we lack AGI. We build software um you
know and and the reason for this and the
way kind of to understand this is we
already have general intelligence
everywhere. We have eight billion of
them. Um they're unreliable because
that's what intelligence is. Um and
they've not solved the problems that we
want to solve with software. That's why
we're building software. Um so we
program software not because we lack AGI
but because we want reliable, robust,
controllable uh scalable systems. Um and
we want these to be things
that we can reason about, understand at
scale. Um and actually if you think
about engineering and reliable systems,
if you think about checks and balances
in any case where you try to systematize
stuff, it's about subtracting agency and
subtracting intelligence in exactly the
right places uh carefully uh and not
restricting the intelligence otherwise.
So this is a very different axis from
the kinds of lessons that you would draw
on from the bitter lesson. Now that does
not mean the bitter lesson is
irrelevant. Let me tell you the precise
way in which it's relevant. Um so the
first takeaway here is that scaling
search and learning works best for
intelligence. This is the right thing to
do if you're an AI researcher interested
in building, you know, agents that learn
really well really fast in new
environments, right? Don't hard code
stuff at all. Um, or unless you really
have to. But in building AI systems,
it's helpful to think about, well, sure,
search and learning, but searching for
what, right? Like what is your AI system
even supposed to be doing? What is the
the the fundamental problem that you're
solving? It's not intelligence. It's
something else. Um, and what are you
learning for, right? Like what is the
system learning in order to do well? And
that is what you need to be engineering.
Um not the specifics of search and not
the specifics of learning as I'll talk
about in the rest of this talk. So he's
saying um Sutton is saying complicated
methods get in the way of scaling uh
especially if you do it early uh like
before you know what you're doing
essentially. Did you hear that before? I
feel like I heard that back in the 1970s
although I wasn't around. This is you
know the notion of structured
programming, uh, with Knuth saying his
popular, you know, uh, phrase in a paper:
premature optimization is the root of
all evil. I think this is the bitter
lesson for software and thereby also for
AI software. So it's human ingenuity and
human knowledge of the domain. It's not
that it's harmful. It's that when you do
it prematurely in ways that constrain
your system in ways that reflect poor
understanding they're bad. But you can't
get away in an engineering field with
not engineering your system. Like you're
just quitting or something, right? So
here's a little piece of code. Um, if
you follow me on X on Twitter, you might
recognize it, but otherwise I think it
looks pretty opaque to me in like 3
seconds. And I can't really look at this
and tell exactly what it's doing. And I
also honestly don't really care. Um,
so lo and behold, um, this is computing
a square root in a certain floating
point representation on an old machine.
And I think the thing that jumps out at me
immediately is this is not the most
future-proof program possible. If you
change the machine architecture,
different floating-point representations,
better CPUs, first of all, it'll be
wrong because, you know, like it's just
hard-coding some values here. Um, and
second of all, it'll probably be slower
than a normal, you know, square root
that maybe is a single instruction or
maybe the compiler has a really smart
way of doing it or, you know, a lot of
other things that could be optimized for
you, right? So someone who wrote this,
maybe they had a good reason, maybe they
didn't, but certainly if you're writing
this kind of thing often, you're
probably messing up as an engineer.
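To make the point concrete, here is a minimal sketch of the same kind of trick. It is not the code from the slide, which the transcript does not reproduce. It approximates a square root by shifting the bits of an IEEE 754 32-bit float, and the function names `bit_hack_sqrt` and `plain_sqrt` are hypothetical. The constants only make sense for that one float layout and for positive normal values, which is the sense in which this style is not future-proof.

```python
import math
import struct


def bit_hack_sqrt(x: float) -> float:
    """Approximate sqrt(x) by manipulating IEEE 754 float32 bits.

    Illustrative assumption: 23 mantissa bits and an exponent bias of 127.
    On any other representation the constants below are simply wrong.
    Only intended for positive, normal inputs.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits -= 1 << 23   # take one off the biased exponent
    bits >>= 1        # halve the exponent and mantissa fields together
    bits += 1 << 29   # add back the bias that the halving removed
    (approx,) = struct.unpack("<f", struct.pack("<I", bits))
    return approx


def plain_sqrt(x: float) -> float:
    """State the intent and let the platform pick the implementation."""
    return math.sqrt(x)
```

The first function encodes a machine detail, the second only says what is wanted. On exact powers of four the two agree, but elsewhere the bit version is only a rough estimate.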
So premature optimization is maybe the
um square root of all evil or something,
but what counts as uh premature?
Like I mean that's kind of the name of
the game, right? Like we could just say
that, but it doesn't mean anything. Um
so I don't think any strategy will work
in tech. Nobody can anticipate what will
happen in three years, five years, 10
years. But I think you still have to
have a conceptual model that you're
working off of. And I happen to have
built two things that are, you know, on
the order of several years old that have
fundamentally stayed the same over the
years, from the days of BERT and
text-davinci-002 up to o4-mini, and they're
bigger now than they ever were. And
they're sort of like these um stable
fundamental kind of um abstractions or
AI systems around LLMs. So what
gives? What happens in order to get
something like ColBERT or something like
DSPy in this ecosystem, um, and sort of
endure a few years which is like you
know centuries in AI land um I'll try to
reflect on this and you know again none
of this is guaranteed to be something
that lasts forever. So here's my
hypothesis um premature optimization is
what is happening if and only if you're
hard coding stuff at a lower level of
abstraction than you can
justify. Um, if you want a square root,
please just say, "Give me a square
root." Don't start doing random bit
shifts and bit stuff like, you know, bit
manipulation that happens to appease
your particular machine today. Um, but
actually take a step back. Do you even
want a square root or are you computing
something even more general? And is
there a way you could express that thing
that is more general? Right? And only,
you know, uh, stoop down or go down to
the level of abstraction that's lower if
you've demonstrated that a higher level
of abstraction is not good enough.
So I think the bigger picture here is
applied machine learning and definitely
prompt engineering has a huge issue
here. Tighter
coupling than necessary is known to be
bad in software but it's not really
something we talk about when we're
building machine learning systems. In
fact the name of the game in machine
learning is usually like hey this latest
thing came out let's rewrite everything
so that we're working around that
specific thing. And I tweeted about this
a year ago, 13 months ago, last May in
2024, saying the bitter lesson is just
an artifact of lacking good
high-level ML abstractions. Uh,
scaling deep learning helps predictably,
but after every paradigm shift, the best
systems always include modular
specializations because we're trying to
build software. We need those. Um and
every time they basically look the same
and they should have been reusable, but
they're not, because
we're writing bad code. So
here's a nice example just to
demonstrate this. It's not special at
all. Here's a 2006 paper. The title
could have really been a paper of now,
right? A modular approach for
multilingual question answering. And
here's a system architecture. It looks
like your favorite multi-agent framework
today, right? It has an execution
manager. It has some question analyzers
and retrieval strategists from
a bunch of corpora. And it's like a
figure you, you know, if you color it,
you would think it's a paper maybe from
last year or something. Um, now here's a
problem. It's a pretty figure. The
system architecturally is actually not
that wrong. I'm not saying it's the
perfect architecture, but in a normal
software environment, you could actually
just upgrade the machine, right? Put it
on a new hardware, put it on a new
operating system, and it would just
work. And it would actually work
reasonably well because the architecture
is not that bad. But we know that that's
not the case for these ML sort of
architectures because they're not
expressed in the right way. So I think
fundamentally I can express this most um
passionately against prompts. A prompt
is a horrible abstraction for
programming and this needs to be fixed
ASAP. Um, I say for programming because
it's actually not a horrible one for
management. If you want to manage an
employee or an agent, a prompt is a
reasonable kind of, like, it's a Slack
channel. You have a remote employee. If
you want to be a pet trainer, you know,
working with tensors and, you know,
objectives is a great way to iterate.
That's how we build the models. But I
want us to be able to also engineer AI
systems. And I think for engineering and
programming, a prompt is a horrible
abstraction. Here's why. It's a stringly
typed canvas, just a big blurb, no
structure whatsoever, even if structure
actually exists in a latent way. Um,
that couples and entangles the
fundamental task definition you want to
say, which is really important stuff.
This is what you're engineering, with
some random, uh, overfitted, half-baked
decisions about, hey, this LLM responded
to this language, you know, when I talk
to it this way or I put this example to
demonstrate my point um and it kind of
clicked for this model, so I'll just
keep it in. And there's no way to really
tell the difference. What was the
fundamental thing you're solving and
like you know what was the random uh
trick you applied. It's like a square
root thing except you don't call it a
square root and we just have to stare at
it and be like wait why are we shifting
to the left by, you know, five bits
or something. Um you're also inducing
the inference time strategy which is
like changing every few weeks or people
are proposing stuff all the time and
you're baking it literally entangling it
into your system. So if it's an agent
your prompt is telling it it's an agent.
Your system has no business knowing
about the fact it's an agent or a
reasoning system or whatever. What are
you actually trying to solve? Right?
It's like if you're writing a square
root function and then you're like, hey,
here is the layout of the structs in
memory or something. Um you're also
talking about formatting and parsing
things you know write in XML produce
JSON whatever like again that's really
none of your business most of the time.
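As a rough sketch of the entanglement being described, the snippet below keeps the task definition in one small structure and treats output formatting as a separate, swappable function. Everything here (`TaskSpec`, `render_json_prompt`, `render_xml_prompt`, the example task) is a hypothetical illustration in plain Python, not an API from the talk or from any library.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """The part worth engineering: what goes in, what comes out, and why."""
    instructions: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def render_json_prompt(spec: TaskSpec, values: dict[str, str]) -> str:
    """One formatting choice: pass inputs as JSON and ask for JSON back."""
    payload = json.dumps({name: values[name] for name in spec.inputs})
    keys = ", ".join(spec.outputs)
    return f"{spec.instructions}\nInputs: {payload}\nReply with a JSON object with keys: {keys}"


def render_xml_prompt(spec: TaskSpec, values: dict[str, str]) -> str:
    """Another formatting choice: wrap inputs in tags and ask for tags back."""
    fields = "".join(f"<{name}>{values[name]}</{name}>" for name in spec.inputs)
    tags = " ".join(f"<{name}>...</{name}>" for name in spec.outputs)
    return f"{spec.instructions}\n{fields}\nReply using these tags: {tags}"


spec = TaskSpec(
    instructions="Answer the question using only the given context.",
    inputs=("context", "question"),
    outputs=("answer",),
)
values = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
}
```

Switching from `render_json_prompt(spec, values)` to `render_xml_prompt(spec, values)` changes the wire format without touching `spec`, which is the separation the monolithic prompt string does not give you.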
So you want to write a human readable
spec but you're saying things like "do
not ignore this", "generate XML", "answer in
JSON", "you are Professor Einstein, a wise
expert in the field of...", "I'll tip you
$1,000", right? Like that is just not
engineering, guys. Um, so what should we
do? Trusty old separation of concerns, I
think, is the answer. Your job as an
engineer is to invest in your actual
system design. And you know, starting
with the spec, the spec unfortunately or
fortunately cannot be reduced to one
thing. And this is the time I'll talk
about evals. I know everyone hears about
evals. This is the one line about evals
that makes this talk about evals. A lot
of the time um you want to invest in
natural language descriptions because
that is the power of this new framework.
Natural language definitions are not
prompts. They are highly localized
pieces of ambiguous stuff that could not
have been said in any other way. Right?
I can't tell the system certain things
except in English. So I'll say it in
English. But a lot of the time I'm
actually iterating to uh to appease a
certain model and to make it perform
well relative to some criteria I have
not telling it the criteria just
tinkering with things there. Evals are
the way to do this because evals say,
here's what I actually care about.
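A minimal sketch of that idea, under stated assumptions: the eval is a small dataset plus a metric, and it takes the model as an argument, so it stays the same when the model changes. The names `DEVSET`, `exact_match`, `evaluate`, `old_model` and `new_model` are hypothetical, and the two "models" are lookup tables standing in for real LLM calls.

```python
from __future__ import annotations

from typing import Callable, Sequence

# Hypothetical examples of what the system is supposed to get right.
DEVSET = [("2 + 2", "4"), ("3 * 3", "9"), ("10 - 7", "3")]


def exact_match(prediction: str, expected: str) -> bool:
    """The criterion we care about, written down once."""
    return prediction.strip() == expected


def evaluate(
    model: Callable[[str], str],
    devset: Sequence[tuple[str, str]],
    metric: Callable[[str, str], bool],
) -> float:
    """Fraction of examples the model gets right under the metric."""
    hits = sum(metric(model(question), expected) for question, expected in devset)
    return hits / len(devset)


def old_model(question: str) -> str:
    """Stand-in for last month's model: gets one answer right."""
    return {"2 + 2": "4", "3 * 3": "6"}.get(question, "")


def new_model(question: str) -> str:
    """Stand-in for this week's model: gets all three right."""
    return {"2 + 2": "4", "3 * 3": "9", "10 - 7": "3"}.get(question, "")
```

Calling `evaluate(old_model, DEVSET, exact_match)` and then `evaluate(new_model, DEVSET, exact_match)` compares the two without rewriting anything about what "good" means.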
Change the model. The evals are still
what I care about. It's a fundamental
thing. Now, evals alone are not enough
to define the core behavior of your
system. The system will not learn it that way: induction,
learning from data, is a lot harder than
following instructions. Right? So you
need to have both. Code is another thing
that you need. You know, a lot of people
are like, "Oh, it's just like a, you
know, just just ask it to do the thing."
Well, who's going to define the tools?
Who's going to define the structure? How
do you handle information flow? Like,
you know, like things that are private
should not flow in the wrong places,
right? You need to control these things.
Um, how do you apply function
composition? LLMs are horrible at
composition because neural networks
kind of essentially don't learn things
that reliably. Function composition in
software is always perfectly reliable
basically, right, by construction. So a
lot of the things um are often best
delegated to code, right? But it's it's
hard and it's really important that you
can actually juggle and combine these
things and you need a canvas that can
allow you to combine these things. Well,
when you do this, the definition
of a good canvas, or
the criteria for a good canvas, is
that it should allow you to express
those three in a way that's highly
streamlined and in a way that is
decoupled and not entangled with models
that are changing. I should just be able
to hot swap models. Uh inference
strategies that are changing. Hey, I
want to switch from a chain of thought
to an agent. I want to switch from an
agent to a Monte Carlo tree search.
Whatever the latest thing that has come
out is, right? I should be able to just
do that. Um and new learning algorithms.
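One way to picture this kind of decoupling, as a sketch only: the task stays fixed while the model and the inference strategy are passed in as interchangeable parts. This is plain Python with hypothetical names (`direct`, `reason_then_answer`, `build_program`, and two fake models), not the DSPy API, and the fake models return canned text so the example runs without any LLM.

```python
from __future__ import annotations

from typing import Callable

LM = Callable[[str], str]               # assumption: prompt text in, completion out
Strategy = Callable[[LM, str, str], str]


def direct(lm: LM, task: str, question: str) -> str:
    """Simplest strategy: ask once and return the reply."""
    return lm(f"{task}\nQuestion: {question}\nAnswer:").strip()


def reason_then_answer(lm: LM, task: str, question: str) -> str:
    """Chain-of-thought-style strategy: request reasoning, keep the last line."""
    reply = lm(
        f"{task}\nQuestion: {question}\n"
        "Think step by step, then give the answer on the last line."
    )
    lines = reply.strip().splitlines()
    return lines[-1].strip() if lines else ""


def build_program(lm: LM, strategy: Strategy, task: str) -> Callable[[str], str]:
    """Bind a model and a strategy to a task; callers only see question -> answer."""
    def program(question: str) -> str:
        return strategy(lm, task, question)
    return program


def terse_lm(prompt: str) -> str:
    """Fake model that always answers briefly."""
    return " 4 "


def verbose_lm(prompt: str) -> str:
    """Fake model that shows its work when asked to."""
    return "2 plus 2 makes 4.\n4" if "step by step" in prompt else "4"


TASK = "Answer the arithmetic question."
```

`build_program(terse_lm, direct, TASK)` and `build_program(verbose_lm, reason_then_answer, TASK)` are used the same way by calling code, so swapping either piece is a one-argument change.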
So, this is really important. We talked
about learning, but learning uh is, you
know, always happening at the level of
your entire system if you're engineering
it or at least you got to be thinking
about it that way where you're saying, I
want the whole thing to work as a whole
for my problem, not for some general
default. Right? So that's what the evals
here are going to be doing. And you want
a way of expressing this that allows
you to do reinforcement learning but
also allows you to do prompt
optimization but also allows you to do
any of these things at the level of
abstraction that you're actually working
with. So the second takeaway is that you
should invest in defining things
specific to your AI system and decouple
from uh the lower level swappable pieces
because they'll expire faster than ever.
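To illustrate learning at the level of the whole system, here is a toy sketch of prompt optimization driven by an eval: try a few candidate instructions and keep the one that scores best on a small dataset. This is a deliberately naive stand-in, not how any real optimizer works, and every name in it (`score`, `pick_best_instruction`, `fake_lm`, `ANSWERS`) is hypothetical.

```python
from __future__ import annotations

from typing import Callable, Sequence

ANSWERS = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}
DEVSET = list(ANSWERS.items())


def fake_lm(prompt: str) -> str:
    """Fake model: concise only when the instruction asks for one word."""
    instruction, question = prompt.split("\n", 1)
    answer = ANSWERS[question]
    return answer if "one word" in instruction else f"I believe it is {answer}."


def exact_match(prediction: str, expected: str) -> bool:
    return prediction.strip() == expected


def score(
    lm: Callable[[str], str],
    instruction: str,
    devset: Sequence[tuple[str, str]],
    metric: Callable[[str, str], bool],
) -> float:
    """Run the system with one candidate instruction and measure it."""
    hits = sum(metric(lm(f"{instruction}\n{q}"), gold) for q, gold in devset)
    return hits / len(devset)


def pick_best_instruction(
    lm: Callable[[str], str],
    candidates: Sequence[str],
    devset: Sequence[tuple[str, str]],
    metric: Callable[[str, str], bool],
) -> str:
    """Keep whichever candidate scores highest on the eval."""
    return max(candidates, key=lambda c: score(lm, c, devset, metric))


CANDIDATES = ["Answer the question.", "Answer the question in one word."]
```

The search procedure here could be replaced by something far more capable without changing the task, the dataset, or the metric, which is the point of keeping those separate from the swappable pieces.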
Um so I'll just conclude by telling you
we've built and been building for three
years this DSPy framework, which is the
only framework that actually decouples
your job, which is writing the lower
level AI software, from our job, which is
giving you powerful evolving toolkits
for learning and for search which is
scaling um and for swapping LLMs through
adapters um so there's only one concept
you have to learn. It is a new concept,
which we call signatures, a new
first-class concept. If you learn it,
you've learned DSPy. Um, I'll have to
unfortunately skip this
because of the time for the other
speakers. Uh, but let me give you a
summary. I can't predict the future. I'm
not telling you if you do this, you
know, this the code you write tomorrow
will be there forever. But I'm telling
you the least you can do. This is not
like the the kind of the top level. It's
just like the b the baseline I would say
is avoid the hand engineering at lower
levels than today allows you to do,
right? That's the that's the big lesson
from the bitter lesson and from
premature optimization being the root of
all evil. Um, among your safest bets,
they could all turn out to be wrong. I
don't know, is um, models are not
anytime soon going to read uh, specs off
of your mind. I don't know if like we'll
figure that out. Um, and they're not
going to magically collect all the
structure and tools specific to your
application. So, that's clearly stuff
you should invest in, right? When you're
building a system, invest in the
signatures, which again you can learn
about on the DSPy site, uh, dspy.ai.
Invest in essential control flow and
tools, and invest in evals
by hand, and ride the wave of
swappable models. Ride the wave of the
modules we build. Um you just swap them
in and out and ride the wave of
optimizers which can do things like
reinforcement learning or prompt
optimization for whatever application it
is that you've built. Uh all right,
thank you everyone.
[Music]