Context Platform Engineering to Reduce Token Anxiety — Val Bercovici, WEKA
Channel: aiDotEngineer
Published at: 2025-11-24
YouTube video id: NTBX-wxUhHs
Source: https://www.youtube.com/watch?v=NTBX-wxUhHs
This is Val Bercovici, WEKA's chief AI officer, and I am joined by >> Callan Fox, out of the product management team here at WEKA >> and we're both thrilled to present context platform engineering to you at the AI Engineer Code Summit. Now, let's kick this off with an announcement we're making. We're open sourcing our context platform engineering toolkit. This toolkit features a really cool load generator that Callan wrote that lets you configure agent swarms and agent subtasks with very specific SLOs, cycle through deterministic and random prompt cycles, and engineer context platforms with all sorts of model parallelism options, disaggregated or aggregated prefill and decode options, and some really important memory tiering options we're going to be discussing here. If we advance to the next slide, we'll see that this is an open-source toolkit that's already available to you on GitHub. So Callan and I really encourage you to get on GitHub, download this, play with it, and give us your feedback. Let us know what you need changed. Feel free to contribute and fork the project and advance the field of context platform engineering, which we're going to be introducing to you today. Moving on, one of the key requirements for context platform engineering relates to the context engineering insight that our friends at Manus shared with us earlier this summer in their now well-known context engineering blog post, where they highlighted the fact that KV cache hit rate is the single most important metric for production-grade AI agents. And the reason context platform engineering is so important is that it dramatically simplifies reaching maximum KV cache hit rates, as we're about to show you. On a more personal level, if we think about token anxiety, I know that each and every one of us feels that anxiety.
The reason context platform engineering is so important is shared by the context engineering blog post from Manus earlier this summer, where they particularly emphasize that KV cache hit rate is the single most important metric for production-grade AI agents, and context platform engineering quite simply maximizes KV cache hit rates in a very straightforward manner. On a more personal note, if you think about the concept of token anxiety, as we all regularly hit token rate limits, context platform engineering helps to engineer platforms that eliminate token rate limits and help us be more productive in developing our software. Now, in the absence of context platform engineering, we often resort to context financial engineering, and that's fundamentally prompt cache arbitrage, where we balance pricing between the bookends of input and output tokens using the new token pricing categories that have appeared in the landscape over the past few months, focusing on cache writes and cache reads. We've got to be somewhat clairvoyant when we're doing the arbitrage to figure out how many cache writes we want to invest in for a five-minute time to live. In some cases, with Anthropic for example, we can do a one-hour time to live. And that's all balanced against the predictions we need to make on how many cache reads and cache hits we think we're going to have during those intervals. It becomes very, very tricky to be clairvoyant and predict the future. And I think it's much better to apply context platform engineering techniques to overcome token anxiety and prompt cache arbitrage than to continue doing the arbitrage and context financial engineering.
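To make the arbitrage described above concrete, here is a minimal sketch of the break-even arithmetic. The price multipliers (a cache write costing 1.25x or 2x the base input price, a cache read costing 0.1x) are illustrative assumptions rather than figures from the talk; check your provider's current pricing. The function names are hypothetical.

```python
def breakeven_reads(write_mult: float, read_mult: float) -> float:
    """Cache reads needed (within the TTL) before caching a prefix pays off.

    Without caching, the prefix is billed at full input price on the first
    request and on each of the n follow-ups: 1 + n.
    With caching, it is billed once as a write and n times as a read:
    write_mult + n * read_mult.
    Solving write_mult + n * read_mult <= 1 + n for n gives the result.
    """
    return (write_mult - 1.0) / (1.0 - read_mult)


def prefix_cost(prefix_tokens: int, base_price_per_mtok: float, n_reads: int,
                write_mult: float, read_mult: float, cached: bool) -> float:
    """Dollar cost of one prefix over 1 initial request plus n_reads follow-ups."""
    base = prefix_tokens / 1_000_000 * base_price_per_mtok
    if cached:
        return base * (write_mult + n_reads * read_mult)
    return base * (1 + n_reads)


if __name__ == "__main__":
    # Assumed multipliers: short-TTL write 1.25x, long-TTL write 2x, read 0.1x.
    for label, write_mult in (("short TTL", 1.25), ("long TTL", 2.0)):
        n = breakeven_reads(write_mult, 0.1)
        print(f"{label}: caching pays off after {n:.2f} reads")
```

The clairvoyance problem is visible here: the break-even point only helps if you can predict how many reads will actually land inside the time to live. This sketch ignores output tokens and assumes every follow-up arrives before the cache entry expires.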
And so one of the ways we're going to be doing that, and Callan's going to dive into this deeply, is looking at the cadence mismatch between the relatively slow human feedback loops for agents and the agent swarms and agent subtasks themselves, which iterate at a much higher cadence, often in parallel, waiting on humans but conducting a lot of really cool work in the background, consuming a lot of tokens in the background, many of which are cacheable, but we just never know how the platform is able to respond. And that's one thing we're going to be diving into here. If we go to the next slide, we're looking at what is fundamentally a token storage problem. What we're going to be doing is explaining how the service level agreements we sign up to when we subscribe to our various token tiers, or when we commit in our agentic instructions to specific cache writes and cache reads, convert to service level objectives delivered by the context platform itself. More particularly, one of the insights that Callan reached from his research at WEKA Labs is that what we're doing when we subscribe to our token tiers or pay for particular cache writes is really purchasing KV cache slots in token storage. So there's definitely a whole science in context platform engineering around how context platforms take those SLA requirements, optimize infrastructure, optimize KV caching and memory tiers, and deliver specific SLOs to try and meet those SLAs as much as possible. So with that, let me hand it over to Callan for actual research findings and lab and test results from WEKA Labs. >> Thanks, Val. So, look, what I want to do is go back to one of the slides that Val showed earlier. And what I'm going to do from now on is focus on that right-hand loop.
The first thing I'm going to do is start by visualizing what that loop actually looks like, and then we're going to go into a little more detail. So think about that loop as a column. I've got a graph here that shows a very, very common pattern that happens in agents. The salmon color is showing new tokens that the system is being exposed to. The gray is something that could be cached, given a limited amount of cache. We'll get into that shortly. The blue is the output tokens. And these blue dots down the bottom are showing when the user is actually giving responses in this particular case. This is a really common example, where basically you start off and you consume context all the way up until you hit a high watermark set by either the model's maximum length or by the inference provider itself. There's a summarization phase, and then you start a new cycle. Everybody knows that summarization phase, where sometimes the agent loses a little bit of its fidelity, a bit of its intelligence, and that's why we're trying to get more context engineering to a larger set of platforms, so we can raise that watermark. If we go into this in a little more detail, the question I often get is: okay, that's a lot of gray, what's that made out of? Here I'm able to get the data and look at individual prompts and what actually makes them up. When you look at agentic data, especially agentic coding, the actual user input is only a really small part of it. You can kind of see it here just visually: if you scan across, the lighter, whiter colors are the system prompt and the user text itself, and the rest of it is tool use and tool responses.
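The grow-until-watermark-then-summarize cycle described above can be sketched in a few lines. This is a toy model under stated assumptions, not the behavior of any particular agent: the token counts, the `high_watermark` threshold, and the fixed `summary_tokens` size are all hypothetical.

```python
def simulate_context(turn_tokens, high_watermark, summary_tokens):
    """Track context size per turn, compacting when the watermark would be exceeded.

    turn_tokens:     new tokens added by each turn (tool calls, tool results, user text)
    high_watermark:  maximum context size allowed by the model or provider
    summary_tokens:  assumed size of the summary that replaces the history
    Returns (context size after each turn, number of summarization events).
    """
    sizes = []
    context = 0
    compactions = 0
    for new_tokens in turn_tokens:
        if context + new_tokens > high_watermark:
            # Summarization phase: the full history is replaced by a short summary.
            context = summary_tokens
            compactions += 1
        context += new_tokens
        sizes.append(context)
    return sizes, compactions


if __name__ == "__main__":
    sizes, compactions = simulate_context([40, 40, 40, 40], high_watermark=100,
                                          summary_tokens=10)
    print(sizes, compactions)  # [40, 80, 50, 90] 1
```

Raising `high_watermark` reduces how often the summarization phase occurs, which is the point made above about fidelity. In a prefix-caching system, a summarization event typically also changes the start of the prompt, so the previously cached prefix no longer matches.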
This one in particular is from Claude Code, where a lot of the time the system is, for example, running a bash command: it's grepping something, it's getting a result, and then it's doing something else. Where this really shows up in the data is if you look at the median time between requests. For a conversation that looks like that (and we have data for billions and billions of tokens), the median time is 10 seconds, maybe 15 seconds. That heavily depends on whether the human is involved in checking every single tool use. But the mean time is in the minutes, or even hours, because the human time to respond is much, much higher. And that's what we were showing before with the two sides of the loop. The other thing that's interesting, and something that's very common today, is multi-agent. You might have a core agent, which I've shown here as the orchestrator, and then you've got these sub-agents that are spun up to do individual tasks. Depending on the type of agentic coding, or just any agentic software in general, these sub-agents may be short-lived, as in their context does not endure between one wake-up and the next, or there are some where it does endure. It's really important to use sub-agents because it allows us to effectively target more context at very particular parts of the problem you're trying to solve. But as a result, you do end up using more context, and I'll explain that very shortly. If you visualize this gray section a different way and I show you the colors, you can kind of see how there's this relationship of common context between all of them. Again, this varies a little bit depending on Codex versus Claude Code versus others.
But you can see how it changes over time, and how the agents relate to each other and have this common understanding, and then go back to the orchestrator to wake up the next agent. The thing that we're mainly here to talk about today, though, is that while there's a lot of gray that could be cached, the reality is very different. If you send this to an inference provider, what ends up happening is you don't actually get 100% of the cache hits that you could get. Now, why does this matter? Well, there are two ways to look at this. If you're paying for API tokens, it's literally costing you more money, because every time you see a yellow here, and this is just a simple example, you're paying input token cost. You're refreshing your cache and you're paying full price for that, so potentially 10 times more than you would if it was cached. If you're a subscription user and you're thinking, well, I don't care about the cost, I don't pay for that, I pay a flat rate: that is true, but like we said before, you're paying for a subscription, and that subscription is rate limited based on your cache usage, so you may actually hit rate limits sooner. So that's something we want to be able to address. We work with a lot of providers today to remove as much of this as possible. That's good for the user experience and it's also good for the provider. So why does this happen? Well, if you think about the last graph, where I show the columns, they don't take time into account. They're just one after the other after the other. But there's obviously a temporal way to look at this, and this is the way that I like to think about it. I know this is a more complex graph to look at, but bear with me for a second. On the left-hand side, I'm talking about working set.
That's the number of tokens that the cache system is holding in its memory, based on different time-to-live values for the cache itself. And the bit at the top, the dotted lines plotted against the right-hand secondary axis, is showing the cache hit rate as a result. The red is showing a one-minute time to live. What you can see is there are prompts here at the start, on the left, where it's thrashing up and down. The reason it's doing that is that the time between requests in that period is longer than one minute. So you're getting a period where you might fill the cache, get a hit or two, and then drop the cache, and then you get another request and you've got to refresh it. So it doesn't really make sense, right? You go to five minutes, which is the blue, and you can now ride out more and more of those gaps, and as a result you get a higher cache hit rate. You can see it at the very start up there, comparing the two. But then you're still missing many others. There are still many times where the time between requests is even larger. So the next one up is showing one hour. That requires the cache system to hold a few more tokens in cache at first, and eventually quite a lot more tokens in cache, and it's got to hold them for a longer period of time. But the result for the end user is a better experience, and for the inference provider, which we'll show very shortly, it's a much better experience as well. The problem, though, is that to do that you need to be able to hold a lot of tokens in cache, and you need good memory tiers to support that. The next thing I want to go into is that cache hit rate isn't really something a human is able to internalize well. So another way that I can visualize it is by thinking about it in terms of the number of times, on average, that a chunk of tokens, which is a group of tokens, is refreshed.
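The relationship between time to live, cache hit rate, and the number of times the same tokens are prefilled can be illustrated with a small simulation of a single conversation. This is a simplified sketch, not the analysis behind the graph: it assumes the whole prefix is either cached or not, that each request resets the time to live, and that cache capacity is unlimited. The request timestamps are made up.

```python
def ttl_stats(request_times, ttl_seconds):
    """Hit rate and prefill count for one conversation under a given TTL.

    request_times: sorted request timestamps in seconds
    ttl_seconds:   how long the cached prefix survives after its last use
    Returns (hit rate over follow-up requests, number of full prefills).
    """
    hits = 0
    prefills = 1  # the very first request always has to prefill
    for previous, current in zip(request_times, request_times[1:]):
        if current - previous <= ttl_seconds:
            hits += 1        # prefix still cached: only new tokens are prefilled
        else:
            prefills += 1    # entry expired: the same tokens are prefilled again
    followups = len(request_times) - 1
    hit_rate = hits / followups if followups else 0.0
    return hit_rate, prefills


if __name__ == "__main__":
    # Made-up trace: fast tool loops (10-15 s gaps) broken up by human pauses.
    times = [0, 10, 25, 115, 130, 530, 545, 2545]
    for ttl in (60, 300, 3600):
        rate, prefills = ttl_stats(times, ttl)
        print(f"TTL {ttl:>4}s: hit rate {rate:.2f}, prefills {prefills}")
```

With this trace, a 60-second TTL prefills the same prefix four times, a 5-minute TTL three times, and a 1-hour TTL once, which mirrors the pattern of longer time-to-live values riding out the human pauses.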
In this particular conversation that we're looking at here, you can see the relationship as I increase the time to live and how that affects my cache hit rate. But it also shows, on the secondary axis, that at one minute I'm literally prefilling the same tokens 15 or 16 times. Over time we can get that all the way down to approaching one, and make significant differences to, again, the experience of both the user and the inference provider. So with that, what I'd like to do now is go into the context engineering side of it, some of the lessons we learned, and really drive this home. I want you to think about what I think will be common in 2026 and onwards: people hosting their own systems, or having dedicated systems hosted for them. So now think of yourself as an inference provider. Maybe you've worked with us or one of our partners to build your own self-hosted instance, and you want to get the most out of it. What this graph is showing you is the relationship, at a certain context length, between the cache hit rate and how many output tokens you get as a result of that cache hit rate. The first thing you'll see is that it's not linear. The shape of this curve will change based on the context length and the accelerators you use. There are lots of things that come into it, such as how you do prefill/decode (PD) disaggregation, but the curve is more or less the same. And if I asked you, as an inference provider, where do you want to be? You'd obviously say C. If you're in A or B, you're not making money, or you're not getting enough value out of the system. The inference providers that we work with have the same answer, obviously. So the question is, well, how do they keep users in C?
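One way to see why the curve described above is not linear is a deliberately simple single-request model: cached prompt tokens cost nothing to prefill, uncached ones are prefilled at a fixed rate, and output is decoded at a fixed rate. This is a toy model, not the measured curve from the talk. It ignores batching, scheduling, and cache fetch time, and all rates and token counts are hypothetical.

```python
def output_tokens_per_second(hit_rate: float,
                             prompt_tokens: int = 100_000,
                             output_tokens: int = 500,
                             prefill_rate: float = 5_000.0,
                             decode_rate: float = 50.0) -> float:
    """Output throughput for one request as a function of cache hit rate.

    hit_rate:     fraction of prompt tokens served from the KV cache (0.0 to 1.0)
    prefill_rate: assumed prompt tokens processed per second
    decode_rate:  assumed output tokens generated per second
    """
    uncached_tokens = prompt_tokens * (1.0 - hit_rate)
    seconds = uncached_tokens / prefill_rate + output_tokens / decode_rate
    return output_tokens / seconds


if __name__ == "__main__":
    for hit_rate in (0.0, 0.5, 0.9, 1.0):
        tps = output_tokens_per_second(hit_rate)
        print(f"hit rate {hit_rate:.1f}: {tps:.1f} output tokens/s")
```

Because throughput is the inverse of a quantity that falls linearly with hit rate, each additional point of hit rate is worth more at the high end than at the low end. In this model, going from 50% to 100% gains more than going from 0% to 50%.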
This is where it goes back to a slide that Val showed earlier: what they're doing is incentivizing users to stay within C. And this is where we came to the realization that, because cache hit rate impacts your actual output so much, you're really buying cache allotments in storage when you're buying subscription services. It is so important to providers that you stay in a certain cache hit rate band, especially for agentic workflows. Otherwise you'll literally just melt the GPU clusters that they have. I think it's a really powerful thing to have in your head about how that works. So what we're going to do now is go through and think about what makes up this token storage. When you think about token storage, there are lots of things that the memory tiers supporting it need to be able to do. But to make it really, really simple: you need enough capacity in these memory tiers to hold an optimal amount of cache. If you think back to the slides I just showed, there's a point where having more cache helps you a little bit, but it gets to a point of diminishing returns. You need to get at least to that point, and you need to be able to store into it extremely fast, because if you can't, you're going to be dropping KVs before they're in the memory tier, or you're going to be blocking GPUs, which is probably even worse. And then the other thing is you need to be able to fetch from that token storage very, very rapidly so that, again, you don't block the GPUs. They're the first-class citizen of this whole system. So what does it look like? There are a few different types of memory tiers. The most common, obviously, is HBM, and Val and I would love it if all our sessions were in HBM at all times. It's just not reasonable.
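The capacity and fetch-speed requirements described above can be made concrete with the standard KV cache sizing formula: two tensors (key and value) per layer, per KV head, per head dimension, per token. The model shape below (80 layers, 8 KV heads, head dimension 128, 16-bit values) and the 100,000-token session are hypothetical illustration values, not figures from the talk. The 50 GB/s bandwidth is the ballpark the speaker quotes for a conventional storage system.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def session_kv_bytes(context_tokens: int, bytes_per_token: int) -> int:
    """Total KV cache footprint of one session's context."""
    return context_tokens * bytes_per_token


def fetch_seconds(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Time to move a session's KV cache from a memory tier back to the GPU."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                                   head_dim=128, bytes_per_value=2)
    session = session_kv_bytes(100_000, per_token)
    print(f"{per_token / 1024:.0f} KiB per token, {session / 1e9:.1f} GB per session")
    print(f"fetch at 50 GB/s: {fetch_seconds(session, 50):.2f} s")
```

Under these assumptions a single long session occupies tens of gigabytes, so only a handful fit in GPU memory next to the model weights. Fetching one back from a slower tier takes a noticeable fraction of a second, during which the GPU may be waiting.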
There are many reasons for this around how the batch works, which we're not going to go into today. But the point is that the most common way this is done today is DRAM. There's nothing really wrong with DRAM as such. It's sort of a means to an end, but it's quite limited in size, and it's okay in terms of performance. The other thing is it's tightly coupled with the compute, so if you want to expand your DRAM, there aren't really many good ways to do that. There are some technologies out there that kind of do this, but the way they're implemented, they just hurt your performance. That's what I'm showing with pooled DRAM. You could pool more together, but it doesn't help that much. So what we at WEKA did is we took all the durable advantages of our product, which has been tried and tested in AI training and HPC environments, and Augmented Memory Grid is basically a supported, optimized connector between the inference systems and our existing product. Because we're backed by NVMe, we're much denser, something like a thousand times denser depending on how you look at it. It's quite significant. And then I show another example of a storage system at the top there, not something sluggish, something that can still get 50 to 60 GB a second, and it has the capacity, but relative to what we're talking about it's still quite slow. Okay. So then moving on to how we test this. Again, we talked about how we've open sourced this. Val already covered the main part of it, which is the fact that it acts like an inference provider. It's trying to keep the load within two SLOs, time to first token and output tokens per request, if you enable them. You don't have to enable them, and then it'll just go as hard as it can regardless of any SLO.
But the main thing it can do is you can either set a static number of coding agent users, or you can increase the number of those users over time so that you slowly utilize more of the memory tiers and can compare different configurations. There are two ways that it works. I'll be quick through these sections because you can read about this. I have a blog that explains how I do the testing and goes through all of this in detail, and there's obviously the GitHub as well. Basically, it can build the initial working set and then sequentially go through those prompts. This will be very, very deterministic, because as soon as you overflow the memory tier even the slightest bit, you'll see a massive drop-off in performance. But the other way it can be done, and realistically the fairer way, is you can increase the size over time, meaning the number of concurrent users you're drawing from a pool, and you can randomly sample where in that sample set you'll get the prompt from. So sometimes you might be hitting HBM, sometimes you might be hitting your memory tier two, let's say that's DRAM, and you get a really nice blended number. So with that, let's go in and show you some results and explain why we're so excited about what we're talking about today. This is showing three comparisons. Comparison number one is HBM with WEKA. That's the purple. There's orange, which is HBM and DRAM. And there's the orangey-pinky color, which is HBM plus DRAM plus that other POSIX system that I talked about earlier. The dotted line is showing concurrent users, so the number of users that are in the pool, and that's increasing over time. In the initial shaded area you can see that all three of them get an advantage from HBM. The majority of cache hits are coming out of HBM.
But then over time, as we increase the users more and more, you're overflowing what the DRAM memory tier can do, and both the orange and the pinky color start to drop off quite dramatically. From a WEKA perspective, we also drop off, because we get less and less advantage from HBM, so we have to pull back our concurrency a little bit. The benchmarking tool does that automatically. But then once we've got down to the steady state, all three start to level out a little bit. The main difference is that once you get down to that steady state, we can maintain it at a much higher number of users and a much higher number of output tokens. The other way to look at this: that was a decode-focused role. If you look at a prefill-focused role, if you're doing disaggregated prefill, then it's an even better result for us, because the GPUs are so much more efficient when you're doing large batches of prefill tokens with a single decode. Then we can basically saturate things more fairly, and it continues. Now, the main difference between purple and orange is that we have a lot more cache, so we can hit a lot more. The interesting thing about the orangey-pinky color is that it also has the capacity to hit everything that's possible, but it's not fast enough at getting it into the GPU for it to make a difference. And that's why we're showing the difference between these three: with purple you're getting the advantage of capacity but at DRAM speeds, so you can maintain that benefit for longer periods of time. And then maybe, Val, I'll hand back to you. >> Absolutely. That was a great walkthrough, Callan, of all of your research and benchmark results in WEKA Labs. So once again, we're thrilled to be announcing the open sourcing of this context platform engineering toolkit today. Please do download it, use it, give us your feedback.
Again, feel free to fork it and improve it yourself. We look forward to contributing together to less token anxiety overall, less prompt cache arbitrage, and more context engineering and context platform engineering in the future. Here's a QR code for you to find out even more information. And at the end of this video, in the transcript section and so forth, there'll be links to all the blogs we referenced here. So thank you for joining us today, and we look forward to pairing with you on the context platform engineering conversation in the future.