Building an Agentic Platform — Ben Kus, CTO Box

Channel: aiDotEngineer

Published at: 2025-08-24

YouTube video id: 12v5S1n1eOY

Source: https://www.youtube.com/watch?v=12v5S1n1eOY

Hello. I'm Ben Kus, CTO of Box, and I'm going to talk today about our journey through AI, and in particular our agentic AI journey. If you don't know much about Box, a little background: Box is an unstructured content platform. We've been around for a while, more than 15 years, and we concentrate on large enterprises. We have over 115,000 enterprise customers, including two-thirds of the Fortune 500. Our job is to bring everything you'd want to do with your content to these customers and to provide all the capabilities they might want. For many of these customers, their first AI deployment was actually with Box, because enterprises worry a lot about data security concerns and data leakage with AI; doing safe, secure AI is one thing we have specialized in over time.

The way we think about AI is at a platform level. There is the historic version of Box: the global infrastructure, everything you need to manage and maintain content at scale. We store over an exabyte of data and hundreds of billions of files that our customers have trusted us with, and we protect them in addition to providing the kind of services you expect from an unstructured data platform. For the last few years, one of the key things we've been investing in has been AI on top of that platform, and I'm here to tell you a bit about our journey.
We started our journey in 2023, shortly after AI became production ready in the generative sense, and everything I'm talking about today is generative AI. We ended up with a set of features: Q&A across documents, extracting data, AI-powered workflows. I'm happy to talk about any of these in general, but today I'm going to focus on one feature we built: data extraction. This is the idea of taking structured data from your unstructured data, and it is not the use case you might think of first when you picture interacting with AI; it's much less like a standard chatbot-style integration. But what we learned, and what I'll tell you about, is how the concepts of agentic capabilities apply well beyond end-user interactions.
So we'll talk about data extraction for a moment. Quick background: when we talk about metadata, or data, we mean the things inside unstructured data, be it documents, contracts, or project proposals, that then turn into structured data. A very common challenge in enterprises is that roughly 90% of their data is unstructured and only 10% lives in databases as structured data, and historically it was hard to utilize the unstructured part. Many customers have long wished for better ways to automate their unstructured data; there's a lot of it, and in some cases it's the most critical thing in the enterprise. The things you do with it: query your data, kick off workflows, and get better search and filtering across all of your data. The prototypical example is a contract, where you have an authoritative unstructured piece of data, but the key fields inside it are also very important.
This is not a new problem. For many years the world, Box included, has been interested in pulling structured data out of unstructured data, and there were a lot of techniques for it. There's a whole industry around it; if you've ever heard of IDP, intelligent document processing, it's a multi-billion-dollar industry whose job in life was this kind of extraction. But it was really hard: you had to build specialized AI models, focus on specific types of content, assemble a huge corpus of training data, and often build custom ML models, and the result was quite brittle. So much so that not a lot of companies ever thought about automating most of their critical unstructured data. That was the state of the industry for a very long time: don't bother trying too hard with unstructured data; do everything you can to get it into some structured format, but don't try too hard to deal with the unstructured data itself. Until generative AI came along.
This is where our journey with AI begins. We had been using ML models in different ways for a long time, and the first thing we tried, when confronted with GPT-2 and GPT-3 style models, was simply to ask: AI model, can you extract this kind of data? As we mostly all know, AI is not only great at generating content; it's also great at understanding the nuances of content. So we first started with some pre-processing, classic OCR steps, and then standard AI calls to extract these fields, single shot or with some decoration on the prompts. And this worked great. It was amazing: suddenly a standard, generic, off-the-shelf AI model, from multiple vendors, could outperform even the best specialized models you had seen in the past.
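As a rough sketch, the single-shot approach looks something like this. The names here are hypothetical and `ask_model` stands in for whatever LLM client call you use; the JSON-reply convention is my assumption for illustration, not Box's actual API.

```python
import json

def build_extraction_prompt(document_text, fields):
    """Single-shot prompt: the whole document plus the fields to pull out.
    `fields` maps a field name to a short description of what to extract."""
    field_list = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return (
        "Extract the following fields from the document below. "
        "Reply with a JSON object keyed by field name; use null for absent fields.\n\n"
        f"Fields:\n{field_list}\n\nDocument:\n{document_text}"
    )

def extract_fields(document_text, fields, ask_model):
    """`ask_model` is any callable that sends a prompt to an LLM and
    returns its text reply; we parse that reply as JSON."""
    reply = ask_model(build_extraction_prompt(document_text, fields))
    return json.loads(reply)
```

The appeal was exactly this simplicity: one prompt, one call, any off-the-shelf model, after OCR has turned the file into text.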
We supported multiple models just in case, and then it got better and better. This was wonderful. It was flexible, you could do it across any kind of data, and it performed well. Yes, you had to OCR and pre-process, but that was straightforward. We were thrilled; for us this was a new generation of AI. Interestingly, we would go to our customers and say we can do this across your data, they would give us some, it would work, and we'd think: great, AI models are awesome. Until they said: now that you do that well, what about this one? What about this 300-page lease document with 300 fields? What about this really complex set of digital assets with really complex questions attached to it? What about not just extracting data, but doing risk assessments and other more complex fields? You start to realize: as a human, if you asked me that question, I would struggle to answer it, and in the same way the AI started to struggle. Suddenly we were facing much more complex documents.
Also, OCR is just a hard problem. There's seemingly no end of heuristics and tricks you apply to get OCR right: a scanned document where somebody has written things in, somebody has crossed things out. It's just hard. And for anyone who has dealt with different file formats, PDFs and so on, it's a challenge. Whenever the OCR broke, it naturally fed bad information to the AI. Then languages were a big pain, and we kept getting more challenges as our international set of customers grew across different use cases.

There was also a clear limit to how much attention the AI could pay to many different fields at once. If you say, here are 10 fields and a 10-page document, figure it out, most models are great. If you say, here's a 100-page document and 100 fields, each of them complex with separate instructions, the model loses track, and I have sympathy, because people would lose track too. This became very problematic, because if you want high accuracy in an enterprise setting, it just starts to not work.

And then, what is accuracy anyway? What does it mean? The old ML world gave you confidence scores, say an 86.5 for this field versus that one, and of course large language models don't really know their own accuracy. So we implemented things like LLM-as-a-judge, and we would come back and say: here's your extraction, but we're not quite sure it's right. Our enterprise customers would say: that's helpful to know, but I want it to work right, not just to be told it doesn't work right. So this became the set of challenges we focused on. Customers were looking for speed and affordability; they were saying, if AI is this awesome future thing, then show it to me, and show it to me on these more complex documents.

At this point we hit our despair moment. We had thought LLMs were the solution to everything, that these AI models would just work, and we genuinely struggled: what do you do now, how do you fix this? One answer is to just wait for the next Gemini model, or OpenAI seems to be on top of this, so wait for the next one. And that's part of it; the models do get better. But the fragility of the architecture was something we weren't going to be able to solve on our own that way.

So one of the answers we came up with was bringing agentic approaches to everything we do. This is really one of the key things I want to bring out in this session: it certainly was not obvious that the way to fix all these problems in something like data extraction was an agentic style of interaction. When I say agentic, I mean an AI agent with instructions and objectives for the model, background tools it has secure access to, memory for advancing the task and looking up information inside the system, and a full directed graph: the ability to orchestrate it, to say do this, then this. Either it comes up with its own plan, or we orchestrate it ourselves, because we have knowledge of what we want it to do.
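A minimal sketch of that kind of directed-graph orchestration, with hypothetical node names; this illustrates the idea only, and frameworks like LangGraph provide production versions of the pattern.

```python
def run_graph(nodes, edges, state, start, max_steps=20):
    """Walk a directed graph of agent steps. `nodes` maps a name to a
    function that reads and updates the shared state (the agent's memory);
    `edges` maps a name to a router that picks the next node, or None to stop."""
    current = start
    for _ in range(max_steps):
        result = nodes[current](state)
        current = edges[current](result, state)
        if current is None:
            return state
    raise RuntimeError("graph did not terminate within max_steps")

# A toy extract -> judge loop: retry until the judge accepts the answer.
def extract(state):
    state["attempts"] = state.get("attempts", 0) + 1
    state["answer"] = "draft" if state["attempts"] < 2 else "final"
    return state["answer"]

def judge(state):
    return state["answer"] == "final"

nodes = {"extract": extract, "judge": judge}
edges = {
    "extract": lambda _result, _state: "judge",
    "judge": lambda approved, _state: None if approved else "extract",
}
```

The point of the abstraction is that "do this, then this" and "loop until good enough" both become edge definitions rather than rewrites of the pipeline.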
For us this was controversial. Our engineers said: what are you talking about, let's just make the OCR better, let's just add another step somewhere, let's just add post-processing regular-expression checks. And of course everybody had a way to do it based on the old ways of doing things: why don't we train an ML model, why don't we fine-tune? And then, patch after patch after patch, all of the genericness would get lost in the process.
So we came up with a mechanism; think LangGraph-style agentic capabilities. We still had the same inputs and outputs, a document and fields in, answers out, but the approach was agentic. We played with all the models, reflecting back and forth with criticism, separating the work into multiple tasks so that different subsystems could handle different pieces, and we ended up with something like this: there is a step where you prepare the fields, then you group the fields. We learned quickly that if there's a set of fields like the parties to a contract, and somewhere else the addresses of those parties, you need the AI to handle them together; otherwise you end up with three parties and two sets of addresses that don't match. So we had to break up the set of fields intelligently, and we had to do multiple queries on a document.
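The field-grouping step might look something like this sketch. The `group` labels and field names are hypothetical; in a real system the grouping itself could be decided by a model rather than declared by hand.

```python
from collections import defaultdict

def group_fields(fields):
    """Bucket fields by their declared group so related fields (e.g. the
    parties on a contract and those parties' addresses) land in the same
    model call instead of being extracted independently."""
    buckets = defaultdict(list)
    for field in fields:
        buckets[field["group"]].append(field["name"])
    return list(buckets.values())

def queries_for(document, fields):
    """One extraction query per group, rather than one giant 100-field prompt."""
    return [(document, group) for group in group_fields(fields)]
```

Keeping each query small sidesteps the attention limit from earlier: ten focused calls beat one overloaded one.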
Then, after we got that, we would use a set of tools to check and double-check the results. In some cases we use OCR and then double-check it by looking at pictures of the pages, and by using multiple models. Sometimes they vote: this is a hard question, three models from different vendors, two of them think this is the answer, so that's probably the answer. And then there's the idea of the LLM as a judge: not just a judge that tells you whether this is the answer, but a judge that says, here's some feedback, keep trying. Of course this takes a little longer, but it leads to the kind of accuracy you want overall. For us, this was the architecture that helped us solve a whole set of problems. And it became interesting, because every time there was a new set of challenges, the answer was not to rethink everything, or to say, give us six months and we'll come up with a new idea. It was: I wonder if we change the prompt on that one node, or I wonder if we add another double-check at the end; then we can actually start to solve the problem. We bring the power of AI intelligence to help us solve something we used to think of as a standard function.
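Cross-model voting and the judge-with-feedback loop can be sketched as below; the function names are hypothetical, and `run_extraction` and `judge` stand in for real model calls.

```python
from collections import Counter

def vote(answers):
    """Majority vote across models from different vendors; if no answer
    wins an outright majority, fall back to the first model's answer."""
    best, count = Counter(answers).most_common(1)[0]
    return best if count > len(answers) // 2 else answers[0]

def extract_with_judge(run_extraction, judge, max_rounds=3):
    """Extract, ask a judge model for feedback, and feed that feedback
    into the next attempt until the judge approves or rounds run out."""
    feedback = None
    for _ in range(max_rounds):
        answer = run_extraction(feedback)
        approved, feedback = judge(answer)
        if approved:
            break
    return answer
```

Each retry costs latency, which is the trade-off mentioned above: a little slower, a lot more accurate.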
And not only that, it helped us in other ways. As an unstructured content store, the first demo people always want is: I have a bunch of documents, I have a question. We did the same thing there: we had a judge that would tell us whether an answer was good or not, and if it wasn't, we would take another pass and tell the AI: before you show the user this answer, reflect on it for a second and try again. That kind of thing leads to higher accuracy, and it also enables much more complexity. We just announced deep research capabilities on your content: in the same way that OpenAI or Gemini does deep research on the internet, we let you do deep research on your data in Box. It looks roughly like this directed graph: first search for the data, do that for a while, figure out what's relevant, double-check, then make an outline, prepare a plan, and go through and produce the result. This is all agentic thinking, and this kind of thing wouldn't really have been possible if we hadn't laid the framework of an agentic foundation overall.
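That deep-research graph, reduced to a toy pipeline: the stage names and the simple substring search are illustrative assumptions, not the real system, which loops and double-checks at each step.

```python
def deep_research(question, stages):
    """Sketch of the deep-research flow as a pipeline over shared state:
    search for a while, keep what's relevant, outline, then draft.
    Each stage is a (name, function-from-state-to-state) pair."""
    state = {"question": question, "trace": []}
    for name, stage in stages:
        state = stage(state)
        state["trace"].append(name)  # record the path through the graph
    return state

def search_until_enough(corpus, want=3):
    """'Do that for a while': gather hits until we have enough sources."""
    def stage(state):
        hits = [doc for doc in corpus if state["question"].lower() in doc.lower()]
        state["hits"] = hits[:want]
        return state
    return stage
```

Because the flow is data, adding a stage (say, a final "summarize according to this format" node) is a one-line change rather than a redesign.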
I'll leave you with a few lessons learned, based on our last few years. The first, which wasn't obvious to us at first: the agentic abstraction layer is actually quite clean from an architecture perspective. Once you start to think this way, it is very natural to say, I'm going to run an intelligent workflow, an intelligent directed graph powered by AI models at every step, to accomplish a task. Not for everything, but sometimes that's a great approach. And it is independent of the high-scale distributed-systems design; both are important. At some point you have to deal with 100 million documents in a day, and at another point you have to deal with that one document. Being able to separate these into two systems, where one person thinks about the agentic framework and another thinks about how to scale a generic process, is very helpful; keep them distinct.

It's also just easy to evolve. In that deep research example, we built it and it worked really well, except the output was kind of sloppy. Instead of concluding that we had to redesign the whole thing, we added another node at the end to say, summarize this according to this format, and it would just take that in and redo the output. It took not long at all to fix.

Another thing that was not obvious to me until later: if you're going to use agentic AI with a team that's been around for a while, you need to get them into agentic-first, AI-first thinking. One way to do that is to let them build something, so they start to see not only how we can build more things, but also, because we're a platform for our enterprise customers, how to make things better for them. That means really doubling down on ideas like the MCP servers we publish: what do the tools look like for customers, what can we do to make things easier, how can we do agent-to-agent communications, and so on.
This all sums up to one lesson: if you're confronted with a challenge, and it's plausible that a set of AI models could help you solve that problem, then build the agentic AI architecture early. If I could go back in time, I would wish we had done this sooner, because then we could have kept taking advantage of it all along. So that's my journey and that's my lesson for you. Thank you.
How are we doing on time? Two minutes, okay.
>> Two questions, okay. If anybody has any questions, I'm happy to answer them.
>> The question being, is this available as an API? Yes. We are very API-first oriented, so we have an agent API that lets you call upon these agents to do things and give them their arguments. So yes, we provide agent APIs across everything, and tools to call our APIs.
>> [Audience question, partly inaudible, about how the agents are evaluated and whether a more manual approach is used as well.]
>> In terms of evaluating our agents and how we do that: we not only use LLM-as-a-judge, we also create eval sets. We have our standard eval sets, and then, since the models get so good over time, we created a challenge set of evals so we can better explore the things that not everybody asks, but that would be really hard if they did. That way we can decide whether we're prepared not only for now, but for the more challenging things people will ask over time; we know we can grow into that. So it's a mixture of evals, LLM as a judge, and the idea of just having people give feedback. As an enterprise company we have limited ability to look at what's happening, but having customers tell us whether something is useful still matters in all cases.
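A minimal sketch of that kind of eval harness; field-exact-match scoring is an assumed metric for illustration, and the real evals are presumably richer.

```python
def field_accuracy(predicted, expected):
    """Fraction of expected fields the extractor got exactly right."""
    if not expected:
        return 1.0
    hits = sum(1 for name, value in expected.items()
               if predicted.get(name) == value)
    return hits / len(expected)

def run_eval(extractor, eval_set):
    """`eval_set` is a list of (document, fields, expected_values) triples.
    Returns mean field accuracy, so a standard set and a harder challenge
    set can be tracked as separate numbers over time."""
    scores = [field_accuracy(extractor(doc, fields), expected)
              for doc, fields, expected in eval_set]
    return sum(scores) / len(scores)
```

Running the same harness over both the standard set and the challenge set is what lets you see whether you're only good at today's questions or also at tomorrow's.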
>> You can yell if you want; I'll hear you.
>> [Audience question, partly inaudible: it seems like you're mostly building agents rather than tuning models.]
>> So the question being, why bother with agents if you can fine-tune a model?
>> No. Have you tried fine-tuning agents?
>> We're pretty anti-fine-tuning at this moment, because of the challenge that once you fine-tune something, you then have to fine-tune all of its evolutions going forward. We support multiple models, Gemini, Llama, OpenAI, Anthropic, and it's just hard to fine-tune consistently across the board; and usually the next version of the model just gets better. So we've gotten to the point where we use prompts, or cached prompts, or agenticness, as opposed to fine-tuning. That's the approach that works quite well for our particular use cases.
Okay, thank you everyone.