POC to PROD: Hard Lessons from 200+ Enterprise GenAI Deployments - Randall Hunt, Caylent
Channel: aiDotEngineer
Published at: 2025-07-23
YouTube video id: vW8wLsb3Nnc
Source: https://www.youtube.com/watch?v=vW8wLsb3Nnc
Everybody excited? So, what does Caylent do? We build stuff for people. People come to us with ideas, like "I want to make an app," or "I want to move off of Oracle onto Postgres," and we just do that stuff. We are builders. We created a company by hiring a bunch of passionate autodidacts with a little bit of product ADHD, and we jump around to all these different things and build cool things for our customers. We have hundreds of customers at any given time, everyone from the Fortune 500 to startups. It's a very fun gig; it's really cool, and you get exposed to a lot of technology. And what we've learned is that generative AI is not the magic pill that solves everything, the way a lot of people seem to think it is, and what your CTO read in the Wall Street Journal is not necessarily the latest and greatest thing. We'll share some concrete examples of that. But I'll just point out a couple of different customers here. One of them is BrainBox AI. They are a building operating system: they help decarbonize the built environment. They manage tens of thousands of buildings across the United States and Canada, North America, and they manage the HVAC systems. We built an agent for them to help with that decarbonization of the built environment. That was, I think, in TIME's best inventions of the year, or something like that, because it helps drastically reduce greenhouse emissions. Then Symmons is water management and conservation, which we also implemented with AI. There are a couple of other customers here: Pipes.ai, Virtual Moving Technologies, Z5 Inventory. But I thought it would be cool to just show a demo, and one of the things I'm most interested in right now is multimodal search and semantic understanding of videos. So this is one of our customers, NatureFootage.
They have a ton of stock footage of, you know, lions and tigers and bears (oh my), and crocodiles, I suppose. We needed to index all of that and make it searchable, not just over a vector index but also over captions. So we leverage the Nova Pro models to generate understanding, timestamps, and features of these videos, store all of that in Elasticsearch, and then we're able to search on it. One of the most important pieces is that we were able to build a pooled embedding: by taking frame samples and pooling the embeddings of those frames, we can build a multimodal embedding and search with text for the images. That's provided through the Titan V2 multimodal embeddings. So I thought we'd take a look at a different architecture. I hope no one here is from Michigan, because that's a terrible team; I hate them. Anyway, anyone remember March Madness? This is another customer of ours that I'm not going to name, but essentially we have a ton of sports footage that we're processing both in batch (archival) and in real time. What we'll do is split out the audio and generate the transcription. Fun fact: if you're looking for highlights, the easiest thing to do is use ffmpeg to get an amplitude graph of the audio and look for the audience cheering, and lo and behold, you have your highlight reel. Very simple hack right there. We'll take that and generate embeddings from both the text and from the video itself, and we'll be able to identify certain behaviors with a certain vector and a certain confidence, then store those in a database. (Oh, I think I paused the video by accident. My apologies. No, I didn't.) And then we'll use something like AWS End User Messaging or SNS or whatever to send a push notification to our end users and say, "Look, we found a three-pointer," or "we found this other thing."
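The frame-sampling-and-pooling idea can be sketched in a few lines. This is a minimal NumPy sketch under stated assumptions: mean pooling (rather than max or attention pooling), and the 1024-dimension default of the Titan V2 multimodal embeddings; it is not Caylent's exact pipeline.

```python
import numpy as np

def pool_frame_embeddings(frame_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool per-frame embeddings into one clip-level vector, then
    L2-normalize so cosine similarity reduces to a plain dot product."""
    pooled = frame_embeddings.mean(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

# Toy example: 8 sampled frames, each embedded to 1024 dims
# (stand-ins for Titan V2 multimodal embedding outputs).
frames = np.random.rand(8, 1024)
clip_vector = pool_frame_embeddings(frames)
```

A text query embedded into the same multimodal space can then be compared against `clip_vector` with a dot product, which is what makes text-to-video search work.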
And what we found is that you don't even have to take the raw video. A tiny little bit of annotation can do wonders for the video understanding models as they exist right now. The SOTA models, with just a tiny bit of augmentation on the video, will outperform what you can get with an unmodified video. What I mean by that is: if you have static camera angles and you annotate the court, drawing a big blue line where the three-point line is, and then you just ask the model questions like "did the player cross the big blue line?", lo and behold, you get way better results. It takes, you know, seconds, and you can even have something like SAM 2, which is another model from Meta, go and do some of those annotations for you. So that's one architecture. You'll notice I've put up a couple of different databases there. We had Postgres with pgvector, which is my favorite right now, and we had OpenSearch, which is another implementation of vector search. Anyway, why should you listen to me? Hi, I'm Randall. I got started out hacking and building stuff, playing video games and hacking into video games. It turns out that's super illegal; did not know that. Then I went on to do some physics stuff at NASA. I joined a small company called 10gen, which became MongoDB. They IPOed; I was an idiot and sold all my stock before the IPO. Then I worked at SpaceX, where I led the CI/CD team. Fun fact: we never blew up a rocket while I was in charge of that team. Before and after my tenure, we blew up rockets. I don't know what else I can say there. Then I spent a long time at AWS, and I had a great time building a ton of technology for a lot of customers. I even made a video about the Transformer paper in July of 2017, not realizing what it was going to lead to. The fact that we're all even here today is still "Attention Is All You Need." You can follow me on Twitter at @jrhunt.
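That pre-annotation step ("burn a big blue line into the frame, then ask about the line") can be sketched with plain NumPy. This is a toy sketch: the frame size, line position, and color are made-up values, and in practice the line would follow the real three-point arc, possibly located with SAM 2 masks.

```python
import numpy as np

def draw_horizontal_line(frame: np.ndarray, row: int, thickness: int = 4,
                         color=(0, 0, 255)) -> np.ndarray:
    """Burn a solid colored line into an RGB frame (H x W x 3, uint8)
    before sending it to a video-understanding model."""
    annotated = frame.copy()
    annotated[row:row + thickness, :, :] = color  # broadcast color over the band
    return annotated

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for a video frame
annotated = draw_horizontal_line(frame, row=500)  # hypothetical 3-point line row
```

With static camera angles this only has to be computed once per camera, which is why it "takes seconds": the prompt then becomes a simple visual question about the line rather than a geometry problem.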
It's still called Twitter; it will never be called X in my mind. And this is Caylent. We've won AWS Partner of the Year for a long time. We build stuff. Like I said, I like to say our motto is "we build cool stuff." Marketing doesn't like it when I say that, because I don't always say the word "stuff"; sometimes I'll sub in a different word. And we build everything from chatbots to copilots to AI agents, and I'm going to share the lessons that we've learned from building all of these things. This sort of stuff at the top here, these self-service productivity tools, these are things that you can typically buy. But certain institutions may need a fine-tune, or a particular application on top of that self-service productivity tool, and we will often build things for them. One of the issues we see organizations facing is how to administer and track the usage of these third-party tools and APIs. Some people have an on-prem network and a VPN where they can just measure all the traffic; they can intercept things, look for PII or PHI, and do all the fun stuff we're supposed to do with network interception. There's a great tool called SurePath AI. We use it at Caylent, and I recommend them. It does all of that for you, and it can integrate with Zscaler or whatever else you might need. In terms of automating business functions, this is typically about trying to get a percentage of time or dollars back, end to end, in a particular business process. We work with a large logistics management customer that does a tremendous amount of processing of receipts and bills of lading and things like that. This is a typical intelligent document processing use case: leveraging generative AI, with a custom classifier before we send documents into the generative AI models, we can get far faster, better results than even their human annotators can.
And then there's monetization, which is adding a new SKU to an existing product. It's an existing SaaS platform, an existing utility, and the customer says, "I want to add a new SKU so I can charge my users for fancy AI, because the Wall Street Journal told me to." That is a very fun area to work in. But if you just build a chatbot, you know, sayonara, good luck: you're the Polaroid. Do people still use Polaroid? Are they doing okay? I don't know. Anyway, I used to say Kodak. This is how we build these things, and these are the lessons that we've learned. I stole this slide; this is not my slide. I cannot remember where it's from, somewhere on Twitter. It might have been Jason Liu; it might have been from DSPy. But it's a great slide that very strategically identifies what the specifications are to build a moat in your business: the inputs to your system and what your system is going to do with them. That is the most fundamental part, your inputs and your outputs. Does everyone remember Steve Ballmer, the former CEO of Microsoft, and how he famously went on stage, on a tremendous amount of cocaine, and just started screaming, "developers, developers, developers, developers"? If I were to channel my inner Ballmer, what I would scream is "evals." When we do this eval layer, this is where we prove that the system is robust and not just a vibe check, that we're not just getting a one-off on a particularly unique prompt. Then we have the system architecture, and then we have the different LLMs and tools and things we may use. These are all incidental to your AI system, and you should expect them to evolve and change. What will not evolve and change is your fundamental definition and specification of what your inputs and outputs are. As the models get better and improve, you can get other modalities of output, and that may evolve.
But you're always going to figure out: why am I doing this? What is my ROI? What do I expect? This is how we build these things on AWS. On the bottom layer we have two services: Bedrock and SageMaker. These are useful services; SageMaker comes at a particular compute premium, and you can also just run on EKS or EC2 if you want. There are two pieces of custom silicon within AWS: one is Trainium, one is Inferentia. These come at about a 60% price-performance improvement over using NVIDIA GPUs. Now, the downside is that the amount of HBM is not as big as something like an H200. I don't know if anyone saw today, but it was great news: Amazon announced they were reducing the prices of the P4 and P5 instances by up to 40%, so we all get more GPUs, cheaper. Very happy about that. The interesting thing with Trainium and Inferentia is that you must use something called the Neuron SDK to program them. If anyone has ever written XLA for TensorFlow and the good old TPUs, and now the new TPU v7 and all that great stuff, the Neuron kernel interface for Trainium and Inferentia is very similar. One level up from that, we get to pick our various models: everything from Claude and Nova to Llama and DeepSeek, and then open-source models that we can deploy. I don't know if Mistral is ever going to release another open-source model, but who knows. And then we have our embeddings and our vector stores. Like I said, I prefer Postgres right now. If you need persistence in Redis, there's a great service called MemoryDB on AWS that also supports vector search. The good news about Redis vector search is that it is extremely fast; the bad news is that it is extremely expensive, because it has to sit in RAM. So if you think about how you're going to construct your indexes, say with IVFFlat, be prepared to blow up your RAM in order to store all of that.
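The "be prepared to blow up your RAM" point is easy to put numbers on. Here is a rough back-of-the-envelope sketch: it counts only the raw float32 vector payload, and real indexes (IVFFlat inverted lists, HNSW graph links) add overhead on top of this floor.

```python
def index_ram_gb(num_vectors: int, dim: int, bytes_per_float: int = 4) -> float:
    """Rough lower bound on memory for a flat float32 vector index:
    vectors * dimensions * 4 bytes, converted to GiB."""
    return num_vectors * dim * bytes_per_float / 1024**3

# 100 million 1024-dim float32 vectors: roughly 381 GiB of raw vector
# data before any index structure overhead, which is why an all-in-RAM
# store like Redis/MemoryDB gets expensive fast.
ram = index_ram_gb(100_000_000, 1024)
```

That arithmetic is the whole trade-off: RAM-resident stores buy latency with memory cost, while Postgres and OpenSearch can spill the same index to disk.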
Now, within Postgres and OpenSearch you can go to disk, and you can use things like HNSW indexes so that you have a better allocation and search mechanism. Then we have prompt versioning and prompt management. All of these things are incidental and, you know, not unique anymore. But this one, context management, is incredibly important. If you are looking to differentiate your application from someone else's application, context is key. If your competitor doesn't have the context of the user and additional information, but you're able to inject "the user is on this page, they have a history of this browsing, these are the cookies that I saw," then you can go and make a much more strategic inference on behalf of that end user. So here are the lessons that we learned. I'll jump into these, but I'm also going to run out of time, so I'll speed through a little bit of it and I'll make the deck available for folks. It turns out evals and embeddings are not all you need. Understanding the access patterns and the way people will use the product will lead to a much better result than just throwing out evals and embeddings and wishing for the best. Embeddings alone do not a great query system make: how do you do faceted search and filters on top of embeddings alone? That is why we love things like OpenSearch and Postgres. Speed matters. If your inference is slow, UX is a means of mitigating the slowness; there are other techniques you can use, like caching and other components. But if you are slower and more expensive, you will not be used. If you are slower and cheaper, and you're mitigating some of the effects by leveraging something like a fancy UI spinner that keeps your users entertained while the inference is being computed, you can still win.
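The "faceted search on top of embeddings" point can be illustrated with a toy brute-force stand-in for what a pgvector or OpenSearch query actually does: apply the metadata filters first (the SQL `WHERE` clause or OpenSearch filter), then rank the survivors by vector similarity. The document schema and filter keys here are hypothetical.

```python
import numpy as np

def faceted_vector_search(query_vec, docs, filters, top_k=3):
    """Metadata pre-filter (the 'facets'), then cosine-similarity ranking,
    mimicking a WHERE clause plus a kNN ORDER BY in pgvector/OpenSearch."""
    candidates = [d for d in docs
                  if all(d["meta"].get(k) == v for k, v in filters.items())]
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(candidates,
                    key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:top_k]

docs = [
    {"id": 1, "vec": np.array([1.0, 0.0]), "meta": {"species": "lion"}},
    {"id": 2, "vec": np.array([0.9, 0.1]), "meta": {"species": "tiger"}},
    {"id": 3, "vec": np.array([0.0, 1.0]), "meta": {"species": "lion"}},
]
hits = faceted_vector_search(np.array([1.0, 0.0]), docs, {"species": "lion"})
# The tiger doc never enters the ranking; the closest lion comes first.
```

A pure vector store gives you only the similarity half of this; a database that speaks both filters and vectors gives you the whole query.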
Now, knowing your end customer, as I said, is very important. And the other very important thing: the number of times I see people defining a tool called get_current_date is infuriating to me. It is literally time.now(); it's a format string; just throw it in the prompt. You control the prompt. Now, the downside of putting some of that information very high up in the prompt is that your caching is not as effective. But if you can put it at the bottom of the prompt, after the instructions, you can often get very effective caching. Then, I used to say we should fine-tune, we should do these things. It turns out I was wrong. As the models have improved and gotten more and more powerful, prompt engineering has proven unreasonably effective for us, far more effective than I would have predicted. From Claude 3.7 to Claude 4, we saw zero regressions. From Claude 3.5 to 3.7, we did see regressions on certain things when we moved the exact same prompts over for some of our users and some of our evals. But from 3.7 to 4, we got faster, better, cheaper, more optimized inference in virtually every use case. It was a drop-in replacement, and it was amazing. I'm hoping future versions will be the same; I'm hoping the era of having to adjust your prompt every time a new model comes out is ending. And then finally, it's very important to know your economics: is this inference going to bankrupt my company? If you think about the cost of the Opus models, they may not always be the best thing to run. Okay, in the interest of time, this is another great slide, this one actually from Anthropic. When we think about how to create our evals: the vibe check, the very first thing you do when you try to create a test, that vibe check becomes your first eval.
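On the get_current_date point and cache-friendly prompt ordering: a minimal sketch of interpolating the date yourself instead of defining a tool for it, with the long static instructions first (the cacheable prefix) and the volatile bits last. The company name and instructions are made up for illustration.

```python
from datetime import date

# Long, static, identical across requests: this is the part a prompt
# cache can reuse, so it goes first.
SYSTEM_INSTRUCTIONS = """You are a support assistant for ACME Corp.
Follow the style guide. Cite sources. Refuse out-of-scope requests."""

def build_prompt(user_question: str) -> str:
    """Append volatile context (user question, current date) after the
    static instructions rather than defining a get_current_date tool."""
    return (
        f"{SYSTEM_INSTRUCTIONS}\n\n"
        f"User question: {user_question}\n\n"
        f"Current date: {date.today().isoformat()}"
    )

prompt = build_prompt("When does my warranty expire?")
```

Putting the date at the top would change the prompt prefix every day and defeat the cache; putting it at the bottom keeps the expensive prefix stable and still avoids a needless tool round-trip.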
And then you change the data and the stuff that you're sending in, and lo and behold, twenty minutes later you do have some form of eval set that you can begin running. Then you can go for metrics. Now, metrics do not have to be a score like BERTScore or some calculated benchmark number; they can just be a boolean, true or false: was this inference successful or not? That is often easier than trying to assign a particular value and a particular score. And then you just iterate, you know, keep going. Like I said, speed matters, but UX matters more. This UX orchestration, prompt management, all of this great stuff is why we end up doing better than some of our competitors. And then one of our customers, CloudZero: we originally built a chatbot for them so you could chat with your AWS infrastructure and get cost information out of it. We are now using generative UI to render the information shown in those charts. Just in time, we will craft a React component and inject it into the rendering of the response, and then we can cache those components and describe them in the prompt: "hey, I made this for this other user, and maybe it's helpful one day for some other user's query." This generative UI allows the tool to constantly evolve and personalize to the individual end user. It's an extremely powerful paradigm that is finally feasible with some of these models and their lightning-fast inference speed. NatureFootage, we covered that earlier.
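The "vibe check becomes your first eval, and metrics can just be booleans" workflow fits in a few lines. This is a minimal sketch of such a harness; the toy system standing in for an LLM call, and the example cases, are invented so the block is runnable.

```python
def run_evals(system, cases):
    """Run each (prompt, boolean check) pair through the system and
    report a pass rate; every metric is just True/False, no scoring."""
    results = [check(system(prompt)) for prompt, check in cases]
    return sum(results) / len(results), results

# A fake 'system' standing in for a real model call.
def toy_system(prompt: str) -> str:
    return "Paris" if "France" in prompt else "I don't know"

# Each case is yesterday's vibe check, frozen into a boolean assertion.
cases = [
    ("What is the capital of France?", lambda out: "Paris" in out),
    ("What is the capital of Narnia?", lambda out: "don't know" in out),
]
pass_rate, per_case = run_evals(toy_system, cases)
```

Swapping `toy_system` for a real model client, and growing `cases` every time a vibe check surprises you, is the whole iteration loop described above.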
There's also knowing your end user. We had a customer with users in remote areas, so we would give text summaries of these PDFs and manuals and things, and that would be great, and then they would get the PDF and it would be 200 megabytes. What we found is that on the back end, on the server, we could take what is essentially a screenshot of the relevant PDF page and send just that one page. Even when users were in low-connectivity areas, we could still send the text summary of the full documentation and instructions, plus just the relevant parts of the PDF, without them having to download a 200-megabyte file. So that's knowing your end customer. We worked with a hospital system, for instance, where we originally built a voice bot for nurses. It turns out nurses hate voice bots, because hospitals are loud and noisy, the voice transcription is not very good, and you just hear other people yelling. They preferred a regular old chat interface. So we had to know our end customers and figure out exactly what they were doing day to day. And then: let the computer do what the computer is good at. Don't do math in an LLM; it is the most expensive possible way of doing math. Let the computer do its calculations. Then prompt engineering. I'm not going to break this down; I'm sure you've seen hundreds of talks over the last two days about how to engineer your prompts. But one of the things we like to do as part of our optimization is think about the output tokens and their associated costs, and how we can make that perform better. And then finally, know your economics. There are lots of great tools: there's prompt caching, there's tool usage, and there's batch. Batch on Bedrock is 50% off whatever model inference you're making, across the board. And then context management: you can optimize your context.
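The "know your economics" advice, including the Bedrock batch discount, reduces to simple arithmetic worth automating. A back-of-the-envelope sketch: the per-million-token prices below are hypothetical placeholders, not quoted Bedrock rates, so check current pricing before relying on the numbers.

```python
# Hypothetical per-million-token prices for illustration only.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def estimate_cost(input_tokens: int, output_tokens: int,
                  batch: bool = False) -> float:
    """Estimate inference cost in dollars; batch mode on Bedrock is
    billed at 50% of the on-demand price for supported models."""
    cost = (input_tokens / 1_000_000) * PRICE_PER_MTOK["input"] \
         + (output_tokens / 1_000_000) * PRICE_PER_MTOK["output"]
    return cost * 0.5 if batch else cost

# 2M input + 0.5M output tokens: $6.00 + $7.50 = $13.50 on demand.
on_demand = estimate_cost(2_000_000, 500_000)
batched = estimate_cost(2_000_000, 500_000, batch=True)  # half: $6.75
```

Note how output tokens dominate at these ratios, which is exactly why trimming output tokens, mentioned above, is a high-leverage optimization.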
You can figure out the minimum viable context needed to get the correct inference, and how to optimize that context over time. This again requires knowing your end user and what they're doing, and injecting that information into the model, but also identifying stuff that is irrelevant and taking it out of the context, so the model has less to reason over. If you want to learn more or want to talk, I'm always happy to hop on the phone with customers; you can scan this QR code. We like building cool stuff. I've got a whole bunch of talented engineers who are just excited to go out and build things for customers. So if you have a super cool use case, come at me. All right, thank you very much.