The Rise of Open Models in the Enterprise — Amir Haghighat, Baseten
Channel: aiDotEngineer
Published at: 2025-07-24
YouTube video id: 3WV1vT0B0cg
Source: https://www.youtube.com/watch?v=3WV1vT0B0cg
Hi everyone. My name is Amir. I'm co-founder and CTO of Baseten, the inference company. But I'm not here to talk about Baseten; I'm here to talk about the adoption of AI in the enterprise: why we should care about it and how it's going, based on what we've seen. So first, why should we care? We've all heard the question before: is there hype in this market? Is AI hyped? It probably is, but the evidence a lot of people point to is that adoption in the enterprise has been slow. I've heard it so many times: enterprises are slow to adopt. And if that's true, it has implications for the impact of AI, how large it can be, and whether it's truly hype or real. The reason is that enterprises are massive. Their reach is massive, and they have all the money. If they're slow to adopt, then the paradigm shift we're talking about will be slow to materialize. So why me? Because we happen to sit somewhere interesting: we sell to enterprises. The company is six years old, but over the past two years in particular I've talked to honestly a hundred-plus enterprises, from software companies that are public to literally soft drink companies in the Fortune 50, and I've seen patterns that I want to share with you. One bias I have: I don't sell verticalized AI tooling, I sell very horizontal AI tooling, and this is important. Enterprises are adopting vertical solutions: AI for sales, AI for marketing, AI for customer service. You just heard from Clay from Sierra.
That adoption is happening. But I think for the true value to get unlocked, we need to see enterprises actually build with AI. The analogy I use is this: if in the 2000s enterprises had not really been building tech themselves, and had just been buying Salesforce or verticalized products like it, then the tech industry would just not be as big. Companies like Snowflake, Databricks, and Datadog would not exist, or not in the shape they do. So I really think the value is ultimately unlocked once enterprises feel comfortable actually building with AI themselves, as opposed to just buying verticalized tooling. So let's talk about the journey they go through. They all start with OpenAI and Anthropic. Enterprises are like the rest of us, and for good reason: it's just so easy to get started. They do it differently from the rest of us in that they have their own dedicated deployments of these models on Azure or AWS, for reasons around security and privacy. Then they get their engineers, a lot of times predictive-ML teams, to become AI teams and build on top of these. And they're happy with that. If they can keep doing that, they will, because there's a lot of inertia in sticking with closed models if they actually work: so easy to use, API-based, build on top of them. But we're seeing cracks in that assumption. So let me tell you what I've seen, going back in time, and how that's changed. In 2023, I remember going out and trying to sell to enterprises, and the term "toying around" came up quite a bit. I heard this literally from the CIO of a massive insurance company back then.
He said: "Yeah, we put up a dedicated deployment of GPT-4, or GPT-3, so that our engineers can toy around with it." Almost dismissively, like: "Hey, go build something cute." That started to change in 2024. We saw actual production use cases, again built on top of these closed models. I'd say 40 or 50 out of the hundred had something in production that year. Then in 2025, this year, something changed, and it's palpable, at least from where I'm sitting. The change is that there are cracks in the assumption that you can build on top of these closed frontier models indefinitely. So what are those cracks? First, let me tell you what they are not, because there are some misconceptions here. People often say it's because enterprises don't want vendor lock-in. Honestly, I don't hear that when we go talk to them. I think I know why: there are a few providers now, OpenAI, Anthropic, and Google, which has been coming up pretty well, and they're somewhat interoperable at a certain level, since they all use the OpenAI spec. Yes, you might have to redo your evals and do some prompt tuning, but generally you can go from one to the other. So vendor lock-in is not something I hear about. Ballooning cost? I didn't hear that last year either, and I know why: when I asked, they said, "Look, the price per token is plummeting," as we just discussed right before this talk, "so that problem will just take care of itself."
Compliance, privacy, and security are also not the problems, because the frontier model companies take care of those with the help of the CSPs, the cloud providers, so that the models run in a dedicated way inside the enterprise's existing VPCs. So if those aren't the cracks in the "just use closed models" assumption, what are the cracks? These are the reasons I have seen, and I'll go through them one by one with examples, and at the end also talk about how you get around them, and where there be dragons. The first is quality. Look, none of these enterprises are under any misconception that they can build the next GPT-4 better than OpenAI can. That's just not the reality, not as a general model at least. But for specific use cases and specific tasks, we're seeing that the frontier models are not necessarily the right tool. One example I've seen at a couple of big health plans is medical document extraction. They have millions of medical documents, prior authorizations and medical claims, and they're trying to extract CPT procedure codes, diagnosis codes, and prescriptions. Just handing that to Claude or GPT doesn't do it. But they have the data: over the years they've collected a lot of labeled data, and they said, "We can do better." And they actually did. That's one example.
Another example is on the voice side, in particular transcription. Staying in the healthcare space: getting transcription models to understand medical jargon has been another reason to not just use a generic API-based model, but to bring it in-house and do better than what's possible with API-based models alone. The second crack is latency. These models, from OpenAI and Anthropic and even the big players that serve open-source models behind shared APIs, are inherently optimized for high throughput and high QPS at the expense of latency. But more and more we're seeing cases where latency is critical, especially with AI voice or AI phone calls, where time to first token and time to first sentence really start to matter. You have to think about things differently; you can't just use the frontier models as-is, because again, they're optimized for something else. The third crack is unit economics. Like I said before, they thought pricing would take care of itself. Then this year came, and as you saw in the previous talk from Michael, the agentic use cases ballooned. And when they balloon, it's dramatic: I've seen every single user action result in literally 50 inference calls. Suddenly the thing you thought would take care of itself is not taking care of itself. Costs are really ballooning, and enterprises think maybe they can do better on cost and unit economics. In order to show ROI, to show that the solutions they're pushing are economically viable, they need to reduce their cost somehow.
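To see why agentic workloads change the cost picture so sharply, it helps to do the arithmetic. The sketch below uses the talk's "50 inference calls per user action" figure; every other number (token counts, per-million-token prices) is an illustrative assumption, not a quoted price:

```python
# Illustrative sketch: why agentic use cases balloon API costs.
# Only the 50-calls-per-action figure comes from the talk;
# token counts and prices are assumptions for the arithmetic.

def cost_per_user_action(
    inference_calls: int = 50,       # calls per user action (per the talk)
    input_tokens: int = 4_000,       # avg prompt tokens per call (assumed)
    output_tokens: int = 500,        # avg completion tokens per call (assumed)
    price_in_per_m: float = 3.00,    # $ per 1M input tokens (assumed)
    price_out_per_m: float = 15.00,  # $ per 1M output tokens (assumed)
) -> float:
    per_call = (input_tokens * price_in_per_m
                + output_tokens * price_out_per_m) / 1_000_000
    return inference_calls * per_call

# A single chat turn (1 call) vs. one agentic user action (50 calls):
print(f"1 call:   ${cost_per_user_action(inference_calls=1):.4f}")
print(f"50 calls: ${cost_per_user_action():.4f}")
```

With these assumed numbers, one chat turn costs about two cents while one agentic action costs nearly a dollar; multiplied by thousands of daily users, that is the "ballooning" the talk describes.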
And they're realizing they can actually run these models themselves, pay for the compute, and have that be a lot cheaper than paying per token and covering someone else's margins: going from being a price taker to being the maker of the price, and being in control of it. And then lastly, destiny. This one is a bit vibey, but I'm hearing it more recently. Some CIOs and CTOs are saying: if we, the enterprise, use just the frontier models, and so do our competitors, what is our advantage? What is our alpha? Maybe we should bring some of this in-house, to differentiate not just at the workflow and application level but also at the AI level. So if those are the reasons they want to adopt open-source models, and iterate on them, fine-tune them, distill them, then what changes? What changes is that they go from a super simple world where you just call an API and run with it, to a world where you need to build inference infrastructure. You need to make sure it scales well, and that you can move fast, that your engineers can actually deliver, instead of having to hire a bunch of new kinds of people and then wait a long time for them to build that infrastructure in-house.
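The price-taker vs. price-maker argument above also comes down to simple arithmetic: compare the daily per-token bill to the daily cost of dedicated compute for the same workload. All figures in this sketch (token volume, blended price, GPU count, hourly rate) are illustrative assumptions:

```python
# Hypothetical comparison of paying per token vs. paying for compute.
# Every number here is an assumption for illustration, not real pricing.

def api_cost_per_day(tokens_per_day: float, price_per_m: float) -> float:
    """Daily bill at a blended $/1M-token API price."""
    return tokens_per_day / 1_000_000 * price_per_m

def gpu_cost_per_day(gpus: int, price_per_gpu_hour: float) -> float:
    """Daily bill for a fixed pool of dedicated GPUs."""
    return gpus * price_per_gpu_hour * 24

tokens = 2_000_000_000  # assumed workload: 2B tokens/day
api = api_cost_per_day(tokens, price_per_m=5.00)        # assumed blended price
gpu = gpu_cost_per_day(gpus=8, price_per_gpu_hour=4.0)  # assumed GPU pool

print(f"API bill:    ${api:,.0f}/day")
print(f"GPU compute: ${gpu:,.0f}/day")
```

The point is not these particular numbers, which a real workload would replace, but that once token volume is high and steady, owning the compute turns a metered bill into a fixed one you control.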
One thing I hear quite a bit at this point, from enterprises and actually from startups too, is: "Look, we've picked an open-source model, we've heard of vLLM or SGLang or TensorRT-LLM, we have some GPUs (for enterprises, in the data center; for startups, in some cloud). Put these together and you get production inference." I know for a fact that this is not true. I wish it were, but there's a lot more that goes into making inference, especially mission-critical inference, work well inside your company. So what are those things? These are the dragons. First, at the performance layer: we talked about situations that are very latency-sensitive. The way you optimize models for latency is actually quite involved, both at the model level and at the infrastructure level, and you have to attack it at both. As an example, at the model level: do you use speculative decoding? And if so, which route do you go? A good draft model? Medusa heads? EAGLE-3? MTP? There's a lot here, and new techniques come out all the time. The EAGLE-3 paper came out about six months ago, and it's already running in production and being very meaningful. So as an enterprise, can you hire the right folks to stay on top of the research? These are not just switches you flip in SGLang or vLLM to get the results. And some of these optimizations bleed out of the model level into the infrastructure level.
So as an example: prefix caching, and disaggregated serving. Doing those well really starts to matter, especially in agentic use cases where the prompts are massive but fairly similar to one another; it ends up mattering a lot in reliably hitting your time to first token and its P99. Another infrastructure concern, especially for mission-critical inference, which more and more is what I see, is: how do you guarantee four nines? The formula above does not guarantee you more than two nines, and I saw this firsthand. How do you make sure that when the hardware fails underneath you, you actually recover? How do you handle vLLM crashing, which happens often, or Triton crashing, which I also saw firsthand, when your tail latencies go through the roof while you wait for these things to come back, and during that time your users are feeling it? How do you build against those failures and still guarantee four nines without being wildly overprovisioned and wrecking the unit economics we just talked about? And when a big burst of traffic comes in, how do you make sure you scale up fast?
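The gap between "two nines" and "four nines" is easy to underestimate; the yearly downtime budget each level allows is simple arithmetic:

```python
# Downtime budget implied by each availability level ("nines").

def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * 365 * 24 * 60

for label, a in [("two nines", 0.99),
                 ("three nines", 0.999),
                 ("four nines", 0.9999)]:
    print(f"{label} ({a}): {downtime_minutes_per_year(a):,.1f} min/year")
```

Two nines allow roughly 5,256 minutes (over three and a half days) of downtime per year; four nines allow about 53 minutes total. A single eight-minute replica cold start during an outage consumes a meaningful slice of that entire annual budget, which is why naive setups stall at two nines.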
I was talking to a massive enterprise, the soft drink example, and they told me it takes them eight minutes to bring up a new replica of the same model. And I believe it, because if you add up all the different steps that go into doing that, that is how long it takes. But that's not okay: your tail latencies go through the roof as soon as there's a big spike of traffic. How do you account for that? Then there are other things around making sure your engineers move fast: tooling, lifecycle management, and observability. Observability is a massive iceberg. You think, "Oh, just put in some logs and metrics," and then you realize there's a lot more to do underneath, as Michael discussed in the previous talk. And then lots of things around controls and audits, which enterprises actually care about. So these are the dragons, and this is where enterprises have a decision to make. Once they get to this level, either they believe me when I tell them about these things, or they don't, and they go build it and run into them themselves. Either way, they face a build-or-buy decision, and it's my job to try to convince them they should buy this layer of infrastructure and platform as opposed to building it. That's sometimes harder than it seems. So I'm happy to talk more about these things; I'll be at our booth, and there are two topics I'd love to discuss. One, self-servingly: if you're an enterprise and those problems resonate, I'd love to chat with you.
And two, less self-servingly: if you're a startup trying to sell to enterprises, I'm happy to chat about the right decisions and the wrong decisions we made along the way, to build something that, when it comes to selling to enterprises and deploying into their own clouds, is actually possible, and not a massive other set of dragons. And one last thing: we have a happy hour, and we'd love to see you there. Thank you.