The Rise of Open Models in the Enterprise — Amir Haghighat, Baseten
Channel: aiDotEngineer
Published at: 2025-07-24
YouTube video id: 3WV1vT0B0cg
Source: https://www.youtube.com/watch?v=3WV1vT0B0cg
Hi everyone. My name is Amir. I'm co-founder and CTO of Baseten, the inference company. But I'm not here to talk about Baseten; I'm here to talk about the adoption of AI in the enterprise: why we should care about it and how it's going, based on what we've seen. So first, why should we care? We've all heard the question before: is there hype in this market? Is AI hyped? It probably is, but the evidence a lot of people point to is that adoption in the enterprise has been slow. I've heard it so many times: enterprises are slow to adopt. And if that's true, it has implications for the impact of AI, how large it can be, and whether it's truly hype or real. The reason is that enterprises are massive. Their reach is massive, and they have all the money. If they're slow to adopt, then the paradigm shift we're talking about will be slow to materialize. So why me? Because we happen to sit somewhere interesting: we sell to enterprises. The company is six years old, but over the past two years in particular I've talked to honestly a hundred-plus enterprises, from software companies that are public to literally soft drink companies in the Fortune 50, and I've seen patterns that I want to share with you. One bias I have: I don't sell verticalized AI tooling, I sell very horizontal AI tooling, and this is important. Enterprises are adopting vertical solutions: AI for sales, AI for marketing, AI for customer service. You just heard from Clay from Sierra.
That adoption is happening. But I think for the true value to get unlocked, we need to see enterprises actually build with AI. The analogy I use is this: if in the 2000s enterprises had not really been building tech themselves, and had just been buying Salesforce or verticalized products like it, then the tech industry would just not be as big. Companies like Snowflake, Databricks, and Datadog would not exist, or not in the shape they do. So I really think the value is ultimately unlocked once enterprises feel comfortable actually building with AI themselves, as opposed to just buying verticalized tooling. So let's talk about the journey they go through. They all start with OpenAI and Anthropic. Enterprises are like the rest of us, and for good reason: it's just so easy to get started. They do it differently from the rest of us in that they have their own dedicated deployments of these models on Azure or AWS, for reasons around security and privacy. Then they get their engineers, a lot of times predictive-ML teams, to become AI teams and build on top of these. And they're happy with that. If they can keep doing that, they will, because there's a lot of inertia in sticking with closed models if they actually work: so easy to use, API-based, build on top of them. But we're seeing cracks in that assumption. So let me tell you what I've seen, going back in time, and how that's changed. In 2023, I remember going out and trying to sell to enterprises, and the term "toying around" came up quite a bit. I heard this literally from the CIO of a massive insurance company back then.
He said: "Yeah, we put up a dedicated deployment of GPT-4, or GPT-3, so that our engineers can toy around with it." Almost dismissively, like: "Hey, go build something cute." That started to change in 2024. We saw actual production use cases, again built on top of these closed models. I'd say 40 or 50 out of the hundred had something in production that year. Then in 2025, this year, something changed, and it's palpable, at least from where I'm sitting. The change is that there are cracks in the assumption that you can build on top of these closed frontier models indefinitely. So what are those cracks? First, let me tell you what they are not, because there are some misconceptions here. People often say it's because enterprises don't want vendor lock-in. Honestly, I don't hear that when we go talk to them. I think I know why: there are a few providers now, OpenAI, Anthropic, and Google, which has been coming up pretty well, and they're somewhat interoperable at a certain level, since they all use the OpenAI spec. Yes, you might have to redo your evals and do some prompt tuning, but generally you can go from one to the other. So vendor lock-in is not something I hear about. Ballooning cost? I didn't hear that last year either, and I know why: when I asked, they said, "Look, the price per token is plummeting," as we just discussed right before this talk, "so that problem will just take care of itself."
Compliance, privacy, and security are also not the problems, because the frontier model companies take care of those with the help of the CSPs, the cloud providers, so that the models run in a dedicated way inside the enterprise's existing VPCs. So if those aren't the cracks in the "just use closed models" assumption, what are the cracks? These are the reasons I have seen, and I'll go through them one by one with examples, and at the end also talk about how you get around them, and where there be dragons. The first is quality. Look, none of these enterprises are under any misconception that they can build the next GPT-4 better than OpenAI can. That's just not the reality, not as a general model at least. But for specific use cases and specific tasks, we're seeing that the frontier models are not necessarily the right tool. One example I've seen at a couple of big health plans is medical document extraction. They have millions of medical documents, prior authorizations and medical claims, and they're trying to extract CPT procedure codes, diagnosis codes, and prescriptions. Just handing that to Claude or GPT doesn't do it. But they have the data: over the years they've collected a lot of labeled data, and they said, "We can do better." And they actually did. That's one example.
Another example is on the voice side, in particular transcription. Staying in the healthcare space: getting transcription models to understand medical jargon has been another reason to not just use a generic API-based model, but to bring it in-house and do better than what's possible with API-based models alone. The second crack is latency. These models, from OpenAI and Anthropic and even the big players that serve open-source models behind shared APIs, are inherently optimized for high throughput and high QPS at the expense of latency. But more and more we're seeing cases where latency is critical, especially with AI voice or AI phone calls, where time to first token and time to first sentence really start to matter. You have to think about things differently; you can't just use the frontier models as-is, because again, they're optimized for something else. The third crack is unit economics. Like I said before, they thought pricing would take care of itself. Then this year came, and as you saw in the previous talk from Michael, the agentic use cases ballooned. And when they balloon, it's dramatic: I've seen every single user action result in literally 50 inference calls. Suddenly the thing you thought would take care of itself is not taking care of itself. Costs are really ballooning, and enterprises think maybe they can do better on cost and unit economics. In order to show ROI, to show that the solutions they're pushing are economically viable, they need to reduce their cost somehow.
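To see why agentic workloads change the cost picture so sharply, it helps to do the arithmetic. The sketch below uses the talk's "50 inference calls per user action" figure; every other number (token counts, per-million-token prices) is an illustrative assumption, not a quoted price:

```python
# Illustrative sketch: why agentic use cases balloon API costs.
# Only the 50-calls-per-action figure comes from the talk;
# token counts and prices are assumptions for the arithmetic.

def cost_per_user_action(
    inference_calls: int = 50,       # calls per user action (per the talk)
    input_tokens: int = 4_000,       # avg prompt tokens per call (assumed)
    output_tokens: int = 500,        # avg completion tokens per call (assumed)
    price_in_per_m: float = 3.00,    # $ per 1M input tokens (assumed)
    price_out_per_m: float = 15.00,  # $ per 1M output tokens (assumed)
) -> float:
    per_call = (input_tokens * price_in_per_m
                + output_tokens * price_out_per_m) / 1_000_000
    return inference_calls * per_call

# A single chat turn (1 call) vs. one agentic user action (50 calls):
print(f"1 call:   ${cost_per_user_action(inference_calls=1):.4f}")
print(f"50 calls: ${cost_per_user_action():.4f}")
```

With these assumed numbers, one chat turn costs about two cents while one agentic action costs nearly a dollar; multiplied by thousands of daily users, that is the "ballooning" the talk describes.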
And they're realizing they can actually run these models themselves, pay for the compute, and have that be a lot cheaper than paying per token and covering someone else's margins: going from being a price taker to being the maker of the price, and being in control of it. And then lastly, destiny. This one is a bit vibey, but I'm hearing it more recently. Some CIOs and CTOs are saying: if we, the enterprise, use just the frontier models, and so do our competitors, what is our advantage? What is our alpha? Maybe we should bring some of this in-house, to differentiate not just at the workflow and application level but also at the AI level. So if those are the reasons they want to adopt open-source models, and iterate on them, fine-tune them, distill them, then what changes? What changes is that they go from a super simple world where you just call an API and run with it, to a world where you need to build inference infrastructure. You need to make sure it scales well, and that you can move fast, that your engineers can actually deliver, instead of having to hire a bunch of new kinds of people and then wait a long time for them to build that infrastructure in-house.
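The price-taker vs. price-maker argument above also comes down to simple arithmetic: compare the daily per-token bill to the daily cost of dedicated compute for the same workload. All figures in this sketch (token volume, blended price, GPU count, hourly rate) are illustrative assumptions:

```python
# Hypothetical comparison of paying per token vs. paying for compute.
# Every number here is an assumption for illustration, not real pricing.

def api_cost_per_day(tokens_per_day: float, price_per_m: float) -> float:
    """Daily bill at a blended $/1M-token API price."""
    return tokens_per_day / 1_000_000 * price_per_m

def gpu_cost_per_day(gpus: int, price_per_gpu_hour: float) -> float:
    """Daily bill for a fixed pool of dedicated GPUs."""
    return gpus * price_per_gpu_hour * 24

tokens = 2_000_000_000  # assumed workload: 2B tokens/day
api = api_cost_per_day(tokens, price_per_m=5.00)        # assumed blended price
gpu = gpu_cost_per_day(gpus=8, price_per_gpu_hour=4.0)  # assumed GPU pool

print(f"API bill:    ${api:,.0f}/day")
print(f"GPU compute: ${gpu:,.0f}/day")
```

The point is not these particular numbers, which a real workload would replace, but that once token volume is high and steady, owning the compute turns a metered bill into a fixed one you control.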
One thing I hear quite a bit at this point, from enterprises and actually from startups too, is: "Look, we've picked an open-source model, we've heard of vLLM or SGLang or TensorRT-LLM, we have some GPUs (for enterprises, in the data center; for startups, in some cloud). Put these together and you get production inference." I know for a fact that this is not true. I wish it were, but there's a lot more that goes into making inference, especially mission-critical inference, work well inside your company. So what are those things? These are the dragons. First, at the performance layer: we talked about situations that are very latency-sensitive. The way you optimize models for latency is actually quite involved, both at the model level and at the infrastructure level, and you have to attack it at both. As an example, at the model level: do you use speculative decoding? And if so, which route do you go? A good draft model? Medusa heads? EAGLE-3? MTP? There's a lot here, and new techniques come out all the time. The EAGLE-3 paper came out about six months ago, and it's already running in production and being very meaningful. So as an enterprise, can you hire the right folks to stay on top of the research? These are not just switches you flip in SGLang or vLLM to get the results. And some of these optimizations bleed out of the model level into the infrastructure level.
So as an example: prefix caching, and disaggregated serving. Doing those well really starts to matter, especially in agentic use cases where the prompts are massive but fairly similar to one another; it ends up mattering a lot in reliably hitting your time to first token and its P99. Another infrastructure concern, especially for mission-critical inference, which more and more is what I see, is: how do you guarantee four nines? The formula above does not guarantee you more than two nines, and I saw this firsthand. How do you make sure that when the hardware fails underneath you, you actually recover? How do you handle vLLM crashing, which happens often, or Triton crashing, which I also saw firsthand, when your tail latencies go through the roof while you wait for these things to come back, and during that time your users are feeling it? How do you build against those failures and still guarantee four nines without being wildly overprovisioned and wrecking the unit economics we just talked about? And when a big burst of traffic comes in, how do you make sure you scale up fast?
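The gap between "two nines" and "four nines" is easy to underestimate; the yearly downtime budget each level allows is simple arithmetic:

```python
# Downtime budget implied by each availability level ("nines").

def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * 365 * 24 * 60

for label, a in [("two nines", 0.99),
                 ("three nines", 0.999),
                 ("four nines", 0.9999)]:
    print(f"{label} ({a}): {downtime_minutes_per_year(a):,.1f} min/year")
```

Two nines allow roughly 5,256 minutes (over three and a half days) of downtime per year; four nines allow about 53 minutes total. A single eight-minute replica cold start during an outage consumes a meaningful slice of that entire annual budget, which is why naive setups stall at two nines.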
I was talking to a massive enterprise, the soft drink example, and they told me it takes them eight minutes to bring up a new replica of the same model. And I believe it, because if you add up all the different steps that go into doing that, that is how long it takes. But that's not okay: your tail latencies go through the roof as soon as there's a big spike of traffic. How do you account for that? Then there are other things around making sure your engineers move fast: tooling, lifecycle management, and observability. Observability is a massive iceberg. You think, "Oh, just put in some logs and metrics," and then you realize there's a lot more to do underneath, as Michael discussed in the previous talk. And then lots of things around controls and audits, which enterprises actually care about. So these are the dragons, and this is where enterprises have a decision to make. Once they get to this level, either they believe me when I tell them about these things, or they don't, and they go build it and run into them themselves. Either way, they face a build-or-buy decision, and it's my job to try to convince them they should buy this layer of infrastructure and platform as opposed to building it. That's sometimes harder than it seems. So I'm happy to talk more about these things; I'll be at our booth, and there are two topics I'd love to discuss. One, self-servingly: if you're an enterprise and those problems resonate, I'd love to chat with you.
And two, less self-servingly: if you're a startup trying to sell to enterprises, I'm happy to chat about the right decisions and the wrong decisions we made along the way, to build something that, when it comes to selling to enterprises and deploying into their own clouds, is actually possible, and not a massive other set of dragons. And one last thing: we have a happy hour, and we'd love to see you there. Thank you.