POC to PROD: Hard Lessons from 200+ Enterprise GenAI Deployments - Randall Hunt, Caylent

Channel: aiDotEngineer

Published at: 2025-07-23

YouTube video id: vW8wLsb3Nnc

Source: https://www.youtube.com/watch?v=vW8wLsb3Nnc

Everybody excited? So, what does Caylent do? We build stuff for people. People come to us with ideas, like "I want to make an app" or "I want to move off of Oracle onto Postgres," and we do that. We are builders. We created a company by hiring a bunch of passionate autodidacts with a little bit of product ADHD, and we jump around between all these different things and build cool things for our customers. We have hundreds of customers at any given time, everyone from the Fortune 500 to startups. It's a very fun gig; you get exposed to a lot of technology. And what we've learned is that generative AI is not the magical pill that solves everything, the way a lot of people seem to think it is, and that what your CTO read in the Wall Street Journal is not necessarily the latest and greatest thing. We'll share some concrete examples of that.
But I'll point out a couple of different customers here. One is BrainBox AI. They are a building operating system: they help decarbonize the built environment. They manage the HVAC systems of tens of thousands of buildings across North America, and we built an agent that helps them with that decarbonization and management work. I believe it made TIME's list of the best inventions of the year, because it helps drastically reduce greenhouse emissions. Then there's Symmons, a water management and conservation use case that we also implemented with AI. There are a few other customers here as well: Pipes AI, Virtual Moving Technologies, Z5 Inventory.

But I thought it would be cool to show a demo. One of the things I'm most interested in right now is multimodal search and semantic understanding of videos. This is one of our customers, Nature Footage. They have a ton of stock footage of, you know, lions and tigers and bears (oh my), and crocodiles, I suppose. We needed to index all of that and make it searchable over not just a vector index but also captions. So we leverage the Nova Pro models to generate understandings, timestamps, and features of these videos, store all of those in Elasticsearch, and then we are able to search on them. One of the most important pieces is that we were able to build a pooled embedding: by taking frame samples and pooling the embeddings of those frames, we can produce a multimodal embedding and search with text for the images. That's provided through the Titan V2 multimodal embeddings.
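A minimal sketch of that pooled-embedding step, assuming boto3 credentials and the Titan Multimodal Embeddings model id amazon.titan-embed-image-v1 (frame extraction itself is left out):

    import base64, json
    import boto3
    import numpy as np

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    def frame_embedding(jpeg_bytes: bytes) -> np.ndarray:
        # Embed one sampled frame with Titan Multimodal Embeddings.
        resp = bedrock.invoke_model(
            modelId="amazon.titan-embed-image-v1",
            body=json.dumps({"inputImage": base64.b64encode(jpeg_bytes).decode()}),
        )
        return np.array(json.loads(resp["body"].read())["embedding"])

    def clip_embedding(frames: list) -> np.ndarray:
        # Mean-pool the per-frame vectors into one clip-level vector and
        # L2-normalize it, so a text query embedded into the same space
        # can be compared by cosine similarity.
        pooled = np.mean([frame_embedding(f) for f in frames], axis=0)
        return pooled / np.linalg.norm(pooled)

Because the text and image embeddings share a space, the same model embeds the search query and matches it against these pooled clip vectors.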
So I thought we'd take a look at a different architecture. I hope no one here is from Michigan, because that's a terrible team and I hate them. Anyway, anyone remember March Madness? This is another customer of ours whose name I won't reveal, but essentially we have a ton of sports footage that we're processing both in batch (archival) and in real time. We split out the audio and generate the transcription. Fun fact: if you're looking for highlights, the easiest thing to do is use ffmpeg to get an amplitude plot of the audio and look for the audience cheering; lo and behold, you have your highlight reel. Very simple hack.
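A rough sketch of that hack, assuming ffmpeg is on the PATH; the window size and top-k are arbitrary knobs:

    import subprocess, wave
    import numpy as np

    def loudest_moments(video: str, win_s: float = 5.0, top_k: int = 10) -> list:
        # Extract mono 16 kHz PCM audio with ffmpeg.
        subprocess.run(["ffmpeg", "-y", "-i", video, "-ac", "1",
                        "-ar", "16000", "audio.wav"], check=True)
        with wave.open("audio.wav") as w:
            pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        # RMS amplitude per window: crowd-noise spikes mark candidate highlights.
        win = int(16000 * win_s)
        rms = [np.sqrt(np.mean(pcm[i * win:(i + 1) * win].astype(np.float64) ** 2))
               for i in range(len(pcm) // win)]
        # Start times (in seconds) of the loudest windows.
        return sorted(np.argsort(rms)[-top_k:] * win_s)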
We take that and generate embeddings from both the text and from the video itself, and we're able to identify certain behaviors with a certain vector and a certain confidence. We store those in a database. (Oh, I think I paused the video by accident. My apologies. No, I didn't.) Then, using something like AWS End User Messaging or SNS, we send a push notification to our end users saying, "Look, we found a three-pointer," or "We found this other thing." And what we found is that you don't even have to use the raw video: a tiny bit of annotation can do wonders for the video understanding models as they exist right now. The state-of-the-art models, given just a little augmentation on the video, will outperform what you can get with an unmodified video. What I mean is: if you have static camera angles and you annotate where the three-point line is on the court with a big blue line, and then you just ask the model questions like "Did the player cross the big blue line?", lo and behold, you get way better results. It takes seconds, and you can even have something like SAM 2, Meta's segmentation model, go and do some of those annotations for you. So that's an architecture.
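In the spirit of that trick, a sketch with OpenCV; the line coordinates are placeholders you would calibrate per static camera angle (or derive from a segmentation model like SAM 2):

    import cv2

    def annotate_line(in_path: str, out_path: str,
                      p1=(120, 540), p2=(1180, 540)) -> None:
        # Burn a thick blue line over the three-point line before the
        # frames go to the video-understanding model. p1/p2 are
        # hypothetical pixel coordinates for one camera angle.
        cap = cv2.VideoCapture(in_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
        out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            cv2.line(frame, p1, p2, color=(255, 0, 0), thickness=8)  # BGR blue
            out.write(frame)
        cap.release()
        out.release()

Then the prompt can literally ask, "Did the player cross the big blue line?"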
You'll notice I've put a couple of different databases up there. We had Postgres with pgvector, which is my favorite right now, and we had OpenSearch, another implementation of vector search.

Anyway, why should you listen to me? Hi, I'm Randall. I got started out hacking and building stuff, playing video games and hacking into video games. It turns out that's super illegal. Did not know that. Then I went on to do some physics stuff at NASA. I joined a small company called 10gen, which became MongoDB. They IPOed; I was an idiot and sold all my stock before the IPO. Then I worked at SpaceX, where I led the CI/CD team. Fun fact: we never blew up a rocket while I was in charge of that team. Before and after my tenure, we blew up rockets. I don't know what else I can say there. Then I spent a long time at AWS, where I had a great time building a ton of technology for a lot of customers. I even made a video about the Transformer paper in July of 2017, not realizing what it was going to lead to; the fact that we're all even here today is still "Attention Is All You Need." You can follow me on Twitter at @jrhunt. It's still called Twitter; it will never be called X in my mind.
And this is Caylent. We've won AWS Partner of the Year honors for a long time. We build stuff. Like I said, our motto is "we build cool stuff." Marketing doesn't like it when I say that, because I don't always say the word "stuff"; sometimes I'll sub in a different word. What we build is everything from chatbots to copilots to AI agents, and I'm going to share the lessons we've learned from building all of these things. The sort of thing at the top here, these self-service productivity tools, you can typically buy. But certain institutions may need a fine-tune, or a particular application on top of that self-service productivity tool, and we will often build that for them. One issue we see organizations facing is how to administer and track the usage of these third-party tools and APIs. Some people have an on-prem network and a VPN where they can measure all the traffic: they can intercept things, look for PII or PHI, and do all the fun stuff we're supposed to do with network interception. There's a great tool called SurePath AI; we use it at Caylent, I recommend it, it does all of that for you, and it can integrate with Zscaler or whatever else you might need.
In terms of automating business functions, this is typically about getting a percentage of time or dollars back, end to end, in a particular business process. We work with a large logistics management customer that does a tremendous amount of processing of receipts, bills of lading, and things like that. This is a typical intelligent document processing use case leveraging generative AI, with a custom classifier in front of the generative AI models, and we can get far faster, better results than even their human annotators can. And then there's monetization, which is adding a new SKU to an existing product. It's an existing SaaS platform or an existing utility, and the customer says, "I want to add a new SKU so I can charge my users for fancy AI, because the Wall Street Journal told me to." That's a very fun area to work in. But if you just build a chatbot, you know, sayonara, good luck: you're the Polaroid. Do people still use Polaroid? Are they doing okay? I don't know. Anyway, I used to say Kodak.
This is how we build these things, and these are the lessons we've learned. I stole this slide; it's not mine, and I can't remember where it's from. It's from Twitter somewhere; it might have been Jason Liu, it might have been from DSPy. But it's a great slide that very strategically identifies what the specifications are that build a moat in your business: the inputs to your system and what your system is going to do with them. That is the most fundamental part, your inputs and your outputs. Does everyone remember Steve Ballmer, the former CEO of Microsoft, and how he famously went on stage, on a tremendous amount of cocaine, and just started screaming "developers, developers, developers, developers"? If I were to channel my inner Ballmer, what I would scream is: eval.
When we do this eval layer, this is where we prove that the system is robust, not just a vibe check where we got a one-off result on a particularly unique prompt. Then we have the system architecture, and then we have the different LLMs and tools we may use. Those are all incidental to your AI system, and you should expect them to evolve and change. What will not evolve and change is your fundamental definition and specification of what your inputs and outputs are. As the models get better and improve, you may get other modalities of output, and that may evolve. But you're always going to come back to: why am I doing this? What is my ROI? What do I expect?
This is how we build these things on AWS. On the bottom layer we have two services, Bedrock and SageMaker. These are useful services; SageMaker comes at a particular compute premium, and you can also just run on EKS or EC2 if you want. There are two pieces of custom silicon within AWS: one is Trainium, one is Inferentia. These come at about a 60% price-performance improvement over using NVIDIA GPUs. The downside is that the amount of HBM is not as big as on something like an H200. I don't know if anyone saw today, but it was great news: Amazon announced that they were reducing the prices of the P4 and P5 instances by up to 40%, so we all get more GPUs, cheaper. Very happy about that. The interesting thing with Trainium and Inferentia is that you must use something called the Neuron SDK to target them. So if anyone has ever written XLA for TensorFlow and the good old TPUs (and now the new TPU v7 and all that great stuff), the Neuron kernel interface for Trainium and Inferentia is very similar.
One level up from that, we get to pick our various models. We have everything from Claude and Nova to Llama and DeepSeek, and then open-source models that we can deploy ourselves. I don't know if Mistral is ever going to release another open-source model, but who knows. Then we have our embeddings and our vector stores. Like I said, I prefer Postgres right now. If you need persistence in Redis, there's a great service called MemoryDB on AWS that also supports vector search. The good news about Redis vector search is that it is extremely fast; the bad news is that it is extremely expensive, because everything has to sit in RAM. So if you think about how you're going to construct your indexes and do something like IVFFlat, be prepared to blow up your RAM to store all of that. Within Postgres and OpenSearch you can go to disk, and you can use things like HNSW indexes to get a better allocation and search mechanism.
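For flavor, a sketch of those two pgvector index types, assuming psycopg (v3) and a hypothetical clips table:

    import psycopg

    with psycopg.connect("dbname=media") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS clips (
                id bigserial PRIMARY KEY,
                caption text,
                embedding vector(1024)
            )""")
        # HNSW: disk-resident graph index with a strong recall/latency
        # trade-off, and no training step required.
        conn.execute("""
            CREATE INDEX IF NOT EXISTS clips_hnsw ON clips
            USING hnsw (embedding vector_cosine_ops)""")
        # The IVFFlat alternative clusters first and scans a few lists;
        # it is cheaper to build, but recall depends on tuning `lists`:
        #   CREATE INDEX ON clips USING ivfflat (embedding vector_cosine_ops)
        #   WITH (lists = 100);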
Then we have prompt versioning and prompt management. All of these things are incidental and, frankly, not unique anymore. But this one, context management, is incredibly important. If you are looking to differentiate your application from someone else's, context is key. If your competitor doesn't have the context of the user and additional information, but you're able to inject it ("the user is on this page, they have a history of this browsing, these are the cookies that I saw"), then you can make a much more strategic inference on behalf of that end user.
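As a toy illustration of that kind of injection (the session fields and prompt shape here are made up):

    def build_prompt(question: str, session: dict) -> str:
        # Fold in whatever you legitimately know about the user:
        # current page, recent browsing, plan tier, and so on.
        context = "\n".join([
            f"Current page: {session.get('page', 'unknown')}",
            f"Recent pages: {', '.join(session.get('history', [])[-5:])}",
            f"Plan tier: {session.get('plan', 'free')}",
        ])
        return ("You are a support assistant for our product.\n\n"
                f"User context:\n{context}\n\n"
                f"Question: {question}")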
So here are the lessons we learned. I'll jump into these, but I'm also going to run out of time, so I'll speed through a bit and make the deck available for folks. It turns out evals and embeddings are not all you need. Understanding the access patterns, and the way people will actually use the product, will lead to a much better result than just throwing out evals and embeddings and wishing everyone the best of luck. Embeddings alone do not a great query system make: how do you do faceted search and filters on top of embeddings alone? That is why we love things like OpenSearch and Postgres.
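Concretely, with embeddings in Postgres the facets are just WHERE clauses around the vector ranking; this continues the hypothetical clips table above, with made-up facet columns species and duration_s:

    # Facet filters narrow the candidates; the embedding only ranks them.
    HYBRID = """
        SELECT id, caption
        FROM clips
        WHERE species = %(species)s
          AND duration_s BETWEEN 5 AND 30
        ORDER BY embedding <=> %(qvec)s::vector
        LIMIT 20
    """

    def search(conn, qvec: list, species: str):
        # pgvector accepts a '[x,y,...]' literal cast to vector.
        vec_literal = "[" + ",".join(str(x) for x in qvec) + "]"
        return conn.execute(HYBRID, {"species": species, "qvec": vec_literal}).fetchall()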
Speed matters. If your inference is slow, UX is a means of mitigating that slowness; there are other techniques too, like caching and other components. But if you are slower and more expensive, you will not be used. If you are slower and cheaper, and you're mitigating some of the effects with something like a fancy UI spinner that keeps your users entertained while the inference is being calculated, you can still win. Knowing your end customer, as I said, is very important. And the other very important thing: the number of times I see people defining a tool called get_current_date is infuriating to me. It is literally import time, time.now(); it's a format string, just throw it in the string: you control the prompt. Now, the downside of putting that kind of information very high up in the prompt is that your caching is not as effective. But if you put it at the bottom of the prompt, after the instructions, you can often get very effective caching.
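A minimal sketch of that placement trick; the system prompt content is a placeholder:

    from datetime import datetime, timezone

    SYSTEM = """You are a scheduling assistant.
    <long, stable instructions live here so the provider can cache them>"""

    def make_messages(user_msg: str) -> list:
        # No get_current_date tool: you control the prompt, so format the
        # date straight in. Keep the volatile part at the end, after the
        # instructions, so the stable prefix stays cache-friendly.
        today = datetime.now(timezone.utc).strftime("%A, %Y-%m-%d")
        return [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"{user_msg}\n\n(Current date: {today})"},
        ]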
Then, I used to say we should fine-tune, we should do these things. It turns out I was wrong. As the models have improved and gotten more and more powerful, prompt engineering has proven unreasonably effective for us, far more effective than I would have predicted. Going from Claude 3.7 to Claude 4, we saw zero regressions. From Claude 3.5 to 3.7, we did see regressions on certain things when we moved the exact same prompts over for some of our users and some of our evals. But from 3.7 to 4 we got faster, better, cheaper, more optimized inference in virtually every use case. It was a drop-in replacement, and it was amazing. I'm hoping future versions will be the same; I'm hoping the era of having to adjust your prompt every time a new model comes out is ending. And then finally, it's very important to know your economics: is this inference going to bankrupt my company? If you think about the cost of the Opus models, they may not always be the best thing to run.
Okay, in the interest of time: this is another great slide, from Anthropic actually. When you think about how to create your evals, the vibe check, the very first thing you do when you try to test the system, becomes your first eval. Then you change the data and the inputs you're sending in, and lo and behold, twenty minutes later you do have some form of eval set that you can begin running. Then you can go for metrics. Metrics do not have to be a score like BERTScore or some calculated benchmark number; they can just be a boolean, true or false: was this inference successful or not? That is often easier than trying to assign a particular value or score.
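A minimal sketch of a boolean eval harness; the cases and the run_inference callable are placeholders for your own system:

    # Each case pairs an input with a pass/fail check; no scoring model needed.
    EVAL_SET = [
        {"prompt": "Summarize this bill of lading: ...",
         "passed": lambda out: "container" in out.lower()},
        {"prompt": "What is our cancellation policy?",
         "passed": lambda out: "30 days" in out},
    ]

    def run_evals(run_inference) -> float:
        # run_inference: your system under test, mapping prompt -> output string.
        results = [case["passed"](run_inference(case["prompt"])) for case in EVAL_SET]
        return sum(results) / len(results)  # pass rate, 0.0 to 1.0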
And then you just iterate; keep going. Like I said, speed matters, but UX matters more. This UX orchestration, prompt management, all of this great stuff, is why we end up doing better than some of our competitors.
One of our customers is CloudZero. We originally built them a chatbot that lets you chat with your AWS infrastructure and get cost information out of it. We are now using generative UI to render the information shown in those charts: just in time, we craft a React component and inject it into the rendering of the response. We can then cache those components and describe them in the prompt ("hey, I made this for this other user, and maybe it's helpful one day for some other user's query"). This generative UI allows the tool to constantly evolve and personalize to the individual end user. It's an extremely powerful paradigm that is finally fast enough, given the lightning-fast inference speed of some of these models.
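A sketch of that caching idea, with everything hypothetical (generate stands in for the LLM call that emits React source):

    import hashlib

    component_cache: dict = {}

    def render_chart(query: str, data: dict, generate) -> str:
        # Key on a normalized form of the query so a component crafted
        # for one user can be reused for similar queries later.
        key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
        if key not in component_cache:
            component_cache[key] = generate(
                f"Write a React component that renders: {query}. "
                f"Available prop fields: {list(data)}"
            )
        return component_cache[key]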
Nature Footage we covered earlier. There's also knowing your end user. We had a customer with users in remote areas, and we would give them text summaries of PDFs, manuals, and things like that. That was great, but then they would get the PDF and it would be 200 megabytes. What we found is that on the back end, on the server, we could take what is essentially a screenshot of the PDF and send just the one relevant page. Even in low-connectivity areas, we could still send the text summary of the full documentation and instructions, plus the relevant parts of the PDF, without them having to download a 200-megabyte file.
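A minimal sketch of that using PyMuPDF, assuming the summary step already knows which page to reference:

    import fitz  # PyMuPDF

    def page_screenshot(pdf_path: str, page_no: int, out_path: str) -> None:
        # Render only the page the summary cites, instead of shipping the
        # whole multi-hundred-megabyte PDF to a low-connectivity client.
        doc = fitz.open(pdf_path)
        pix = doc[page_no].get_pixmap(dpi=110)  # modest DPI keeps the PNG small
        pix.save(out_path)
        doc.close()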
That's knowing your end customer. We also worked with a hospital system, for instance, where we originally built a voice bot for nurses. It turns out nurses hate voice bots, because hospitals are loud and noisy, the voice transcription is not very good, and you just hear other people yelling; they preferred a regular old chat interface. So we had to know our end customers and figure out exactly what they were doing day to day.
And then: let the computer do what the computer is good at. Don't do math in an LLM; it is the most expensive possible way of doing math. Let the computer do its calculations.
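One way to do that is to expose a calculator tool so the model routes arithmetic to code; the tool spec below uses the generic JSON-schema shape most providers accept:

    import ast, operator as op

    CALCULATOR_TOOL = {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression exactly.",
        "input_schema": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    }

    def calculate(expression: str) -> float:
        # Deterministic, cheap, and correct, unlike token-by-token math.
        ops = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}
        def ev(node):
            if isinstance(node, ast.Constant):
                return node.value
            if isinstance(node, ast.BinOp):
                return ops[type(node.op)](ev(node.left), ev(node.right))
            raise ValueError("unsupported expression")
        return ev(ast.parse(expression, mode="eval").body)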
And then prompt engineering. I'm not going to break this down; I'm sure you've seen hundreds of talks over the last two days about how to engineer your prompts. But one of the things we like to do as part of our optimization is think about the output tokens and the costs associated with them, and how we can make that perform better.
Finally, know your economics. There are lots of great tools: things like prompt caching, tool usage, and batch. Batch on Bedrock is 50% off whatever model inference you're trying to make, across the board.
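For reference, batch jobs on Bedrock take a JSONL file of requests from S3 and write results back when the job completes; the bucket, role, and model id below are placeholders:

    import boto3

    bedrock = boto3.client("bedrock")

    # One model request per JSONL line; results land in the output prefix.
    bedrock.create_model_invocation_job(
        jobName="nightly-captioning",
        modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
        roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
        inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/in/batch.jsonl"}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/out/"}},
    )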
Then there's context management. You can optimize your context: figure out the minimum viable context needed to get the correct inference, and how to optimize that context over time. This again requires knowing your end user and what they're doing, injecting that information into the model, and also taking irrelevant stuff out of the context so that the model has less to reason over.
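A toy sketch of trimming toward that minimum viable context, assuming your retriever scores chunks for relevance:

    def minimum_viable_context(chunks: list, budget_tokens: int) -> str:
        # chunks: [{"text": ..., "relevance": ...}] from your retriever.
        # Keep the most relevant material that fits the budget; what the
        # model never sees, it cannot get distracted by.
        kept, used = [], 0
        for c in sorted(chunks, key=lambda c: c["relevance"], reverse=True):
            cost = len(c["text"]) // 4  # rough token estimate
            if used + cost > budget_tokens:
                continue
            kept.append(c["text"])
            used += cost
        return "\n\n".join(kept)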
If you want to learn more or talk more, I'm always happy to hop on the phone with customers; you can scan this QR code. We like building cool stuff. I've got a whole bunch of talented engineers who are just excited to go out and build things for customers. So if you have a super cool use case, come at me. All right, thank you very much.