Building an Agentic Platform — Ben Kus, CTO Box
Channel: aiDotEngineer
Published at: 2025-08-24
YouTube video id: 12v5S1n1eOY
Source: https://www.youtube.com/watch?v=12v5S1n1eOY
[Music] Hello. I'm Ben Kus, CTO of Box, and I'm going to talk today about our journey through AI, and in particular our agentic AI journey. If you don't know much about Box, a little bit of background: we are an unstructured content platform. We've been around for a while, more than 15 years, and we very much concentrate on large enterprises. We've got over 115,000 enterprise customers, and two-thirds of the Fortune 500. Our job really is to bring everything you'd want to do with your content to these customers and to provide them all the capabilities they might want. For many of these customers, their first AI deployment was actually with Box, because many enterprises worry a lot about data security and data leakage with AI and want to make sure they do safe and secure AI. This is one thing that we have specialized in over time.
The way that we think about AI is at a platform level. We have the historic version of Box, which is the global infrastructure: everything you need to manage and maintain content at scale. We've got over an exabyte of data, hundreds of billions of files that our customers have trusted us with, and we have the ways to protect them, in addition to the type of services that you provide when you're an unstructured data platform. But for the last few years, one of the key things we've been investing in has been AI on top of the platform, and I'm here to tell you a bit about our journey.
We started our journey in 2023, shortly after AI became production-ready in a generative AI sense, and everything I'm talking about here today is generative AI. We ended up with a set of features: Q&A across documents, data extraction, AI-powered workflows. I'm happy to talk about these in general, but today I'm going to focus on one aspect of the features that we built, which is data extraction. This is the idea of taking structured data from your unstructured data. It isn't the agentic sort of thing you might think of from other examples of how you interact with AI; it is much less like a standard chatbot-style integration. But what we learned, and what I'll tell you about, is how the concepts of agentic capabilities apply well beyond end-user interactions.
Just quick background: when we talk about metadata, or data, we're talking about the things in unstructured data, be it documents, contracts, or project proposals, that then turn into structured data. This is a very common challenge in enterprises: something like 90% of their data is unstructured and 10% is structured data in databases. Historically it was kind of hard to utilize the unstructured part. Many customers have for a very long time wished they had better ways to automate their unstructured data. There's a lot of it and it's really critical; in some cases it's the most critical thing in an enterprise. The things you do with the extracted data would be to query your data, kick off workflows, and do better search and filtering across all of your data.
The prototypical example is something like a contract, where you have an authoritative unstructured piece of data, but the key fields in there are also very important. This is not a new thing. For many years the world, Box included, has been interested in pulling structured data out of unstructured data. There were a lot of techniques to do this, and there's a whole industry: if you've ever heard of IDP (intelligent document processing), this is a multi-billion dollar industry whose job in life was to do this kind of extraction. But it was really hard. You had to build specialized AI models, you had to focus on specific types of content, you had to have a huge corpus of training data, and oftentimes you needed custom vendors and custom ML models. It was quite brittle, to the point that not a lot of companies ever thought about automating most of their critical unstructured data. So this was the state of the industry for a very long time: don't bother trying too hard with unstructured data. Do everything you can to get it into some sort of structured format, but don't try too hard to deal with the unstructured data itself.
That was until generative AI came along, and this is where our journey with AI begins. For a long time we'd been using ML models in different ways. The first thing that we tried, when confronted with GPT-2 and GPT-3 style AI models, was to just say: I have a question for you, AI model, can you extract this kind of data? And as we mostly all know, AI is not only great at generating content, it's also great at understanding the nuances of content.
So we first started out with some pre-processing, OCR steps, the classic ways to do this, and then said: I want to extract these fields. Standard AI calls, single-shot, or with some decoration on the prompts. And this worked great. This was amazing. Suddenly a standard, generic, off-the-shelf AI model, from multiple vendors, could outperform even the best models that you had seen in the past. We supported multiple models just in case, and then it got better and better. This was wonderful. It was flexible, you could do it across any kind of data, and it performed well. Yes, you had to do OCR and pre-process the documents, but that was straightforward. We were just thrilled. For us this was a new generation of AI.
Interestingly, we would go to our customers and say we can do this across any data, and they would give us some, and it would work, and we'd be like, great, AI models are awesome. Until they said: now that you do that well, and I get it, what about this one? What about this 300-page lease document with 300 fields? What about this really complex set of digital assets, with really complex questions associated with it? What if I don't want to just extract data, I want to do risk assessments and other more complex fields? You start to realize: as a human, if you ask me that question, I'm struggling to answer it. And in the same way, the AI started to struggle to answer it.
So suddenly we ended up with more complex documents. Also, OCR is just a hard problem. There's seemingly no end of heuristics and tricks that you do on OCR to get it right. I've got a scanned document, somebody writes stuff in it, somebody crosses stuff out. It's just hard.
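As an illustration of the single-shot approach described above, here is a minimal sketch in Python. It is not Box's implementation: the function names, the prompt wording, and the choice of a JSON reply are all assumptions made for the example, and the actual model call is left out because the talk does not name a specific API.

```python
import json


def build_extraction_prompt(document_text: str, fields: list[str]) -> str:
    """Build one prompt asking a model to return the requested fields as JSON.

    The wording and the JSON convention are illustrative assumptions.
    """
    field_list = "\n".join(f"- {name}" for name in fields)
    return (
        "Extract the following fields from the document below.\n"
        "Reply with a single JSON object keyed by field name; "
        "use null for any field that is not present.\n\n"
        f"Fields:\n{field_list}\n\n"
        f"Document:\n{document_text}"
    )


def parse_extraction(reply: str, fields: list[str]) -> dict:
    """Parse the model's JSON reply, keeping only the requested fields."""
    data = json.loads(reply)
    # Fields the model left out come back as None rather than raising.
    return {name: data.get(name) for name in fields}
```

The prompt would be sent to whichever model is in use, with `document_text` being the OCR output. Everything rides on that one call, which is why a long document with hundreds of fields puts so much pressure on it.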
And for people who have dealt with different file formats, PDFs and so on, it's a challenge. Whenever the OCR broke, it would just naturally give bad info to the AI. Languages were a big pain too. We started to get more and more challenges, since we have an international set of customers across different use cases.
There was also a clear limit to the AI in terms of how much attention it could give to so many different fields. If you say here's 10 fields, here's a 10-page document, figure it out, the models are great. Most of them are great. If you say here's a 100-page document and here's 100 fields, each of them complex with separate instructions, then the model loses track. I have sympathy, because people would lose track too. This became very problematic, because if you want high accuracy in an enterprise setting, this just starts to not work.
And then also: what is accuracy? What does it mean? In the old ML world, models give you confidence scores, a specific number for each value. But large language models don't really know their own accuracy. So we would implement things like LLM as a judge, and we'd come back and tell you: here's your extraction.
Also, we're not quite sure this is right. And our enterprise customers would kind of be like: well, that's helpful to know, but I want it to work right, not just have you tell me it doesn't work right. So this became the set of challenges that we focused on. Customers were looking for speed, they were looking for affordability, and they wanted this to work on these more complex documents. They were saying: if AI is this awesome future thing, then show it to me.
At this point we kind of hit our despair moment. We thought LLMs were the solution to everything. We thought we could have these AI models that just worked, and we actually struggled: what do you do now? How do you fix this? One answer was, I know, let's just wait until the next Gemini model, or OpenAI seems to be on top of this, so wait till the next one. Which is part of it, right? The models do get better. But the fragility of the architecture was not something that was going to get solved on its own.
So one of the answers that we came up with was bringing agentic approaches to everything that we do. This is really one of the key things that I want to bring out in this session: it certainly was not obvious that the way to fix all these problems in something like data extraction was to do an agentic style of interaction. When I say agentic, I mean an AI agent that has something like this: instructions and objectives, a model in the background, and tools that it can have secure access to. It has memory, for the purposes of advancing and being able to look up information inside of the system. But it also has a full directed graph: the ability to orchestrate it, where you say do this, then this.
Either it comes up with its own plan, or we can orchestrate it ourselves, because we have knowledge of what we want it to do. For us this was controversial. Our engineers were like: what are you talking about? Let's just make the OCR better. Let's just add another step somewhere. Let's just add post-processing regular expression checks. And of course everybody always has a way to do this based on the old way of doing it: why don't we train an ML model, why don't we fine-tune? And then suddenly all of the genericness of it would get lost in the process.
So we came up with a mechanism. Think LangGraph-style agentic capabilities. We still had the same inputs and outputs: a document and fields in, answers out. However, the approach was an agentic approach. We played with all the models: reflection, back-and-forth criticism, separating the work into multiple tasks, having different multi-agent systems work on it. We ended up with something like this, where you have a step where you prepare the fields, then you go through and group the fields. We learned quickly that if there's a set of fields like the parties to a contract, and then somewhere else there's the addresses of the parties, you need the AI to handle those together. Otherwise you get three parties and two sets of addresses, which don't match. So we had to intelligently break up the set of fields, and we had to do multiple queries on a document. Then, after we got that, we would use a set of tools to check and double-check the results.
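The field-grouping step described above can be sketched roughly as follows. This is a hedged illustration, not the method used at Box: the `Field` type, the `topic` label, and the `max_per_query` limit are hypothetical names invented for the example, and in practice the grouping itself might be decided by a model rather than by a fixed label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    topic: str  # hypothetical label tying related fields together, e.g. "parties"


def group_fields(fields: list[Field], max_per_query: int = 10) -> list[list[Field]]:
    """Batch fields so that related ones are sent to the model in the same query.

    Fields sharing a topic stay together; a topic is only split when it has
    more than max_per_query fields.
    """
    by_topic: dict[str, list[Field]] = defaultdict(list)
    for field in fields:
        by_topic[field.topic].append(field)

    batches: list[list[Field]] = []
    for topic_fields in by_topic.values():
        for start in range(0, len(topic_fields), max_per_query):
            batches.append(topic_fields[start:start + max_per_query])
    return batches
```

With this grouping, the party names and party addresses from the contract example land in one query, so the model sees them together and can keep the two lists consistent.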
In some cases we use OCR and then double-check it by looking at pictures of the pages, and then we use multiple models. Sometimes they vote: this is a hard question, three models from different vendors, two of them think this is the answer, so that was probably a good answer. And then on to the idea of the LLM as a judge: not just a judge to tell you that this is the answer, but a judge to tell you, here's some feedback, keep trying. Now, of course, this takes a little bit longer, but it leads to the kind of accuracy that you'd want overall.
For us, this was the architecture that helped us solve a set of problems. It became interesting because every time there was a new set of challenges, the answer was not rethink everything, or give us six months and we'll come up with a new idea. It was: I wonder if we change the prompt on that one node, or I wonder if we add another double-check at the end, and then we can actually start to solve this problem. So we bring the power of AI intelligence to help us solve something that we used to think of as a standard function.
Not only that, it helped us in other ways. We're an unstructured content store, so one of the first things you always see, if I were to give you a demo right now, is: I have a bunch of documents, I have a question. We had the same thing there. We had a judge, and it would tell us, that was a good answer, or that wasn't. So why not, if it's not a good answer, take another beat and tell the AI to try again? Before you tell the user this answer, I want you to reflect on it for a second. This kind of thing just leads to higher accuracy. And then it also leads to much more complexity.
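The voting and judge-with-feedback ideas above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than Box's code: `majority_vote`, `extract_with_feedback`, and the convention that the judge returns `None` to accept an answer are all invented for the example, and the model calls are passed in as plain functions.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(answers: list[str]) -> Optional[str]:
    """Return the answer given by a strict majority of models, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None


def extract_with_feedback(
    extract: Callable[[Optional[str]], str],
    judge: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Retry extraction, feeding the judge's feedback into the next attempt.

    judge returns None to accept an answer, or a feedback string to reject it.
    Returns the last answer and whether the judge accepted it.
    Assumes max_rounds is at least 1.
    """
    feedback: Optional[str] = None
    answer = ""
    for _ in range(max_rounds):
        answer = extract(feedback)
        feedback = judge(answer)
        if feedback is None:
            return answer, True
    return answer, False
```

The cap on rounds reflects the trade-off mentioned in the talk: each retry costs time, so the loop gives up after a few attempts and reports that the judge was never satisfied instead of looping forever.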
So we just announced our deep research capabilities on your content. In the same way that OpenAI or Gemini does deep research on the internet, we let you do deep research on your data in Box. It looks something like this. This would be roughly the directed graph that you'd have: first we search for the data, kind of do that for a while, figure out what's relevant, double-check, then make an outline, prepare a plan, and go through the process. This is all agentic thinking, and this kind of thing wouldn't really be possible if we hadn't laid the framework of having an agentic foundation overall.
So I will leave you with a few lessons learned, based on our time in the last few years. The first is one that wasn't obvious to us at first: the agentic abstraction layer, from an architecture perspective, is actually quite clean. Once you start to think this way, it is very natural to think: I'm going to run an intelligent workflow, an intelligent directed graph, powered by AI models at every step, to accomplish a task. Not for everything, but sometimes that's a great approach. And this is independent of the high-scale distributed system design, and both are important. At some point you have to deal with 100 million documents that day, and at another point you have to deal with that one document. Being able to separate these two systems, into somebody who thinks about the agentic framework and somebody who thinks about how to scale a generic process, is very helpful. It's also just easy to evolve.
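To make the "intelligent directed graph" idea concrete, here is a toy runner in Python. It is a sketch under stated assumptions, not a description of Box's system or of any particular framework: `run_graph`, the node names, and the stub node bodies are all hypothetical, each node would really call a model or a search tool, and a real graph would have branches and retry loops where this one is a straight line.

```python
from typing import Callable

State = dict
Node = Callable[[State], State]


def run_graph(nodes: dict[str, Node], edges: dict[str, str], start: str, state: State) -> State:
    """Run nodes in sequence by following one outgoing edge per node."""
    current = start
    seen = set()
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle at node {current!r}")
        seen.add(current)
        state = nodes[current](state)
        current = edges.get(current)  # no outgoing edge means we are done
    return state


# Stub nodes standing in for model-backed steps (search, outline, write).
nodes = {
    "search": lambda s: {**s, "sources": ["doc-1", "doc-2"]},
    "outline": lambda s: {**s, "outline": [f"Findings from {d}" for d in s["sources"]]},
    "write": lambda s: {**s, "report": "\n".join(s["outline"])},
}
edges = {"search": "outline", "outline": "write"}

result = run_graph(nodes, edges, "search", {"question": "example question"})
```

Evolving the workflow then means adding a node and an edge instead of redesigning the pipeline: registering a new formatting node and pointing `edges["write"]` at it changes the output without touching the earlier steps.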
In that deep research example, we built it and it worked really well, except the output was kind of sloppy. So we were like, I guess we've got to redesign the whole thing. Or: add another node at the end that says summarize this according to this format, and it would just take that in and redo the output. It took not that long to fix.
The next lesson was something that was not obvious to me until later. If you're going to be using agentic AI with a team that's been around for a while, you need to get them to think in an agentic-first, AI-first way. One way to do that is to let them build something, so they can start to think: this is not only how we can build more things, but, because we're also a platform for our enterprise customers, this is how we can make it better for them. So things like really doubling down on the idea that we publish MCP servers: what are the tools like for them, what can we do to make it easier, how can we do our agent-to-agent communications, and so on.
This is all kind of summed up as: if you're confronted with a challenge, the lesson that we learned is that if it's plausible that a set of AI models could help you solve that problem, then you should build this agentic AI architecture early. If I could go back in time, I would wish we had done this sooner, because then we'd have been able to keep taking advantage of it. So that's my journey, and that's my lesson for you. Thank you. Are we... two minutes? Okay. >> Two questions. >> Okay, if anybody has any questions I'm happy to answer them. >> The question being, is this available as an API? Yes. We are very API-first oriented. We have an agent API, so you can call upon these agents to do things and give them the arguments.
So yes, we provide agent APIs across everything, and tools to call our APIs. >> okay think when you start using a more manual approach as well. >> In terms of evaluating our agents, how do we do that? We not only use LLM as a judge, we also create eval sets. We have our standard set of eval sets. And then we've learned that, since the models get so good over time, we needed to create a challenge set of evals, so that we can better explore things that not everybody asks, but that would be really hard if they did. That way you can better decide whether you're prepared not only for now, but also, as people ask for more challenging things, whether you can grow into that. So it's a mixture of evals, plus LLM as a judge, plus the idea of just having people give feedback. As an enterprise company we have limited ability to look at what's happening, but customers telling us is still useful in all cases. >> You can yell if you want, I'll hear you. >> so apologies if you seems like you're mostly building agents but at least together you know by >> So the question being, why bother with agents if you can fine-tune a model? >> No. >> Have you tried fine-tuning agents? >> We're pretty anti-fine-tuning at this moment, because of the challenge that once you fine-tune something, you have to then fine-tune all of the evolutions of it going forward. We support multiple models, Gemini, Llama, OpenAI, Anthropic, and it's just hard to consistently fine-tune across the board, and usually just the next version of the model gets better. So we've got to the point where we use prompts, or cached prompts, or agenticness, as opposed to fine-tuning. That's the approach that works quite well for our particular use cases. Okay, thank you everyone. [Music]