Context Platform Engineering to Reduce Token Anxiety — Val Bercovici, WEKA

Channel: aiDotEngineer
Published at: 2025-11-24
YouTube video id: NTBX-wxUhHs
Source: https://www.youtube.com/watch?v=NTBX-wxUhHs
This is Val Bercovici, WEKA's chief AI
officer, and I am joined by
>> Callan Fox, out of the product
management team here at WEKA,
>> and we're both thrilled to present
context platform engineering to you at
the AI Engineer Code Summit. Now,
let's kick this off with an
announcement we're making. We're
actually open sourcing our context
platform engineering toolkit.
And this toolkit features a really cool
load generator that Callan wrote that
lets you configure agent swarms and
agent subtasks with very specific SLOs,
cycle through deterministic and random
prompt cycles, and engineer context
platforms with all sorts of model
parallelism options, disaggregated or
aggregated prefill and decode options,
and some really important memory tiering
options we're going to be discussing
here. So, if we advance to the next
slide, we'll see that this is an
open-source toolkit that's already
available to you on GitHub. So, Callan
and I really encourage you to just get
on GitHub, download this, play with it,
and give us your feedback. Let us know
what you need changed. Feel free to
contribute and fork the project and
advance the field of context platform
engineering, which we're going to be
introducing to you today. So moving on,
one of
the key requirements for context
platform engineering really relates to
the context engineering insight that
our friends at Manus shared with us
earlier this summer in their now pretty
infamous context engineering blog,
where they highlighted the fact that KV
cache hit rate is the single most
important metric for production-grade AI
agents. And the reason context platform
engineering is so important is that it
dramatically simplifies reaching maximum
KV cache hit rates, as we're about to
show you.
On a more personal note, think about
the concept of token anxiety. I know
that each and every one of us feels that
anxiety as we regularly hit token rate
limits. Context platform engineering
helps to engineer platforms that
eliminate token rate limits and help
us be more productive in developing
our software.
Now in the absence of context platform
engineering, we often resort to context
financial engineering, and that's
fundamentally prompt cache arbitrage,
where we balance pricing between the
bookends of input and output tokens with
these new token pricing categories that
have appeared in the landscape over the
past few months, focusing on cache
writes and cache reads. And we've got to
be somewhat clairvoyant when we're doing
the arbitrage to figure out how many
cache writes we want to invest in, at
either a five-minute time to live or, in
some cases, with Anthropic for example,
a one-hour time to live. And that's all
balanced against the predictions we need
to make on how many cache reads and
cache hits we think we're going to have
during those intervals. It becomes very,
very tricky to be clairvoyant and
predict the future. And I think it's
much better to apply context platform
engineering techniques to overcome token
anxiety and prompt cache arbitrage than
to continue to do the arbitrage and
context financial engineering.
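As a rough illustration of the arbitrage described above, here is a minimal Python sketch of the break-even between paying for a cache write and re-sending a prefix uncached. The price multipliers are illustrative assumptions expressed relative to the base input-token price, and the function names are hypothetical; check your provider's price sheet before relying on any of the numbers.

```python
# Illustrative price multipliers relative to the base input-token price.
# These values are assumptions for the sketch, not quoted from the talk.
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}
READ_MULTIPLIER = 0.1


def cached_cost(ttl: str, expected_reads: int) -> float:
    """Cost of writing a prefix to cache once and reading it back
    `expected_reads` times, in units of one uncached pass."""
    return WRITE_MULTIPLIER[ttl] + READ_MULTIPLIER * expected_reads


def uncached_cost(expected_reads: int) -> float:
    """Cost of sending the same prefix uncached on every request:
    the first request plus each repeat."""
    return 1.0 + expected_reads


def caching_pays_off(ttl: str, expected_reads: int) -> bool:
    """True when the cache write is cheaper than re-sending uncached."""
    return cached_cost(ttl, expected_reads) < uncached_cost(expected_reads)
```

The sketch only counts reads that actually land inside the time to live. Predicting how many will do so is the clairvoyance the speaker refers to.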
And so one of the ways we're going to be
doing that, and Callan's going to dive
into this deeply, is looking at the
cadence mismatch between the relatively
slow human feedback loops for agents and
the agent swarms and agent subtasks
themselves, which iterate at a much
higher cadence, often in parallel,
waiting on humans but conducting a lot
of really cool work in the background,
consuming a lot of tokens in the
background, many of which are cacheable,
but we just never know how the platform
is able to respond. And one thing we're
going to be diving into here, if we go
to the next slide, is the fact that
we're looking at fundamentally a token
storage problem. What we're going to be
doing is explaining how the service
level agreements we sign up to, when we
subscribe to our various token tiers or
commit in our agentic instructions to
specific token cache writes and cache
reads, convert to service level
objectives delivered by the context
platform itself. More particularly, one
of the insights that Callan reached from
his research at WEKA Labs is that what
we're doing when we subscribe to our
token tiers or pay for particular token
writes is really purchasing KV cache
slots in token storage. So there's
definitely a whole science around
context platform engineering: how
context platforms take those SLA
requirements, optimize infrastructure,
optimize KV caching and memory tiers,
and deliver specific SLOs to try and
meet those SLAs as much as possible. So
with that, let me hand it over to Callan
for actual research findings and lab and
test results from WEKA Labs.
>> Thanks Val. So, look, what I want to do
is just go back to one of the slides
that Val showed earlier. From now on I'm
going to focus on that right-hand loop.
The first thing I'm going to do is
visualize what that loop actually looks
like, and then we're going to go into a
little bit more detail.
So, think about that loop as a column.
I've got a graph here that shows a very,
very common pattern that happens in
agents. The salmon color is showing new
tokens that the system's being exposed
to. The gray is something that could be
cached, given a limited amount of cache.
We'll get into that shortly. The blue is
the output tokens. And these blue dots
down the bottom are showing when the
user is actually giving responses in
this particular case. This is a really
common example, where basically you
start off and consume context all the
way up until you hit a high watermark
set by either the model maximum length
or by the inference provider itself.
There's a summarization phase and then
you start a new cycle. And everybody
knows that summarization phase, where
sometimes the agent loses a little bit
of its fidelity, a bit of its
intelligence, and that's why we're
trying to bring more context engineering
to a larger set of platforms, so we can
raise that watermark.
So if we go into this in a little bit
more detail, the question I often get
is, okay, that's a lot of gray, what's
that made out of? So here I'm able to
get the data and actually look at
individual prompts and what actually
makes them up. When you look at agentic
data, especially agentic coding, the
actual user input is only a really small
part of it. You can kind of see it here
just visually: if you scan across, the
lighter, whiter colors are the system
prompt and the user text itself, and the
rest of it is tool use and tool
responses. This one in particular is
from Claude Code, where you're spending
a lot of time where the system is, for
example, running a bash command: it's
grepping something, it's getting a
result, and then it's doing something
else. Where this really shows up in the
data is if you actually look at the
median time between requests for a
conversation that looks like that. We
have data for billions and billions of
tokens. The median time is 10 seconds,
15 seconds maybe. That heavily depends
on whether the human's involved in
checking every single tool use. But the
mean time is in the minutes, or even
hours, because the human time to respond
is much, much higher. And that's what we
were showing before with the two sides
of the loop.
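The gap between median and mean described above is easy to reproduce. The following Python sketch uses made-up inter-request gaps (not WEKA's data): mostly fast tool-call turns, plus two long waits for a human.

```python
from statistics import mean, median

# Hypothetical gaps between consecutive requests, in seconds: many fast
# tool-call turns plus two long waits for a human to respond.
gaps_seconds = [8, 10, 12, 9, 15, 11, 10, 1800, 13, 7200]

# The median reflects the fast agent loop.
median_gap = median(gaps_seconds)

# The mean is dominated by the two slow human responses.
mean_gap = mean(gaps_seconds)

print(f"median gap: {median_gap} s, mean gap: {mean_gap} s")
```

A cache time to live sized from the median would cover the agent loop but expire during the human waits, which is where the misses come from.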
So the other thing that's interesting,
and something that's very common today,
is multi-agent. You might have a core
agent, which I've shown here as the
orchestrator, and then you've got these
sub-agents that are spun up to do
individual tasks. Depending on the type
of agentic coding, or just any agentic
software in general, these sub-agents
may be short-lived, as in their context
does not endure between one wake-up and
the next, or there are some where it
does endure. It's really important to
use sub-agents because it allows us to
effectively target more context at very
particular parts of the problem you're
trying to solve. But as a result, you do
actually end up using more context, and
I'll explain that very shortly. If you
visualize this gray section a different
way and I show you the colors, you can
kind of see the relationship of the
common context between all of them.
Again, this varies a little bit
depending on Codex versus Claude Code
versus others. But you can see how it
changes over time, how the agents relate
to each other and have this common
understanding, and then how control goes
back to the orchestrator to wake up the
next agent.
The thing that we're here to talk about
today, though, is mainly that while
there's a lot of gray that could be
cached, the reality is very different.
If you send this to an inference
provider, what ends up happening is you
don't actually get 100% of the cache
hits that you could get. Now why does
this matter? Well, there's two ways to
look at this. If you're paying for API
tokens, it's literally costing you more
money, because every time you see a
yellow here, and this is just a simple
example, you're paying input token cost.
You're refreshing your cache and you're
paying the full price for that, so
potentially 10 times more than if it was
cached. If you're a subscription user
and you're thinking, well, I don't care
about the cost, I pay a flat rate: that
is true, but like we said before, you're
paying for a subscription, and that
subscription is rate limited based on
your cache usage, so you may actually
hit rate limits quicker. So that's
something that we want to be able to
address. We work with a lot of providers
today to remove as much of this as
possible. That's good for the user
experience and it's also good for the
provider.
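To make the "potentially 10 times more" point concrete, here is a small Python sketch of blended input cost as a function of cache hit rate. The 0.1 cached-read discount is an assumption taken from the speaker's 10x figure, and the function name is hypothetical.

```python
def blended_input_cost(input_tokens: int, hit_rate: float,
                       price_per_token: float = 1.0,
                       cached_discount: float = 0.1) -> float:
    """Input cost when `hit_rate` of the tokens are served from cache.

    `cached_discount` is the assumed price of a cached token relative
    to an uncached one (0.1 means a miss costs 10x a hit).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached = input_tokens * hit_rate
    uncached = input_tokens - cached
    return price_per_token * (uncached + cached * cached_discount)
```

Under these assumptions, a 1,000-token prompt costs 1,000 units with no hits, about 190 units at a 90% hit rate, and about 100 units when everything is cached.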
So, why does this happen? Well, if you
think about the last graph where I show
the columns, they don't take into
account time. They're just one after the
other after the other. But there's
obviously a temporal way to look at
this. So this is the way that I like to
think about it. I know this is a little
bit more of a complex graph to look at,
but bear with me for a second. On the
left-hand side, I'm talking about
working set. That's the number of tokens
that the cache system is holding in its
memory, based on different time to lives
of the cache itself. And then the bit at
the top, the dotted lines, based on the
right-hand secondary axis, is showing
the cache hit rate as a result. The red
is showing a one-minute time to live.
What you can see is there's prompts here
at the start on the left where it's
thrashing up and down. The reason it's
doing that is the time between requests
in that period is longer than 1 minute.
So you're getting a period where you
might fill the cache, get a hit or two,
and then drop the cache, and then you
get another request and you've got to
refresh it. So it doesn't really make
sense, right? You go to 5 minutes, which
is the blue, and you can now ride out
more and more of those gaps, and as a
result you get a higher cache hit rate.
You can see it at that very start up
there, comparing the two. But then
you're still missing many others.
There's still many times where the time
between requests is even larger. So the
next one up is showing 1 hour. That
requires the cache system to hold a
little bit more tokens in cache, and
eventually quite a fair bit more tokens
in cache, because it's got to hold them
for a longer period of time. But the
result for the end user is a better
actual experience, and for the inference
provider, which we'll show very shortly,
it's a much better experience as well.
The problem though is that to do that
you need to be able to hold a lot of
tokens in cache, and you need good
memory tiers to support that.
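The time-to-live effect described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the toolkit's logic: a single conversation, a cache entry whose lifetime is refreshed on every request, and a hit whenever the next request arrives within the time to live. The timestamps are invented.

```python
def ttl_hit_rate(request_times: list[float], ttl_seconds: float) -> float:
    """Fraction of follow-up requests that arrive while the cache entry
    refreshed by the previous request is still alive."""
    if len(request_times) < 2:
        return 0.0
    hits = 0
    for previous, current in zip(request_times, request_times[1:]):
        if current - previous <= ttl_seconds:
            hits += 1
    return hits / (len(request_times) - 1)


# Hypothetical request timestamps in seconds: bursts of tool calls
# separated by longer waits for a human.
times = [0, 30, 50, 200, 230, 1000, 1020, 4000]

for ttl in (60, 300, 3600):
    print(f"TTL {ttl:>4} s -> hit rate {ttl_hit_rate(times, ttl):.2f}")
```

What the sketch leaves out is the other axis of the graph: a longer time to live also means every idle conversation keeps occupying cache capacity, which is why the working set grows.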
So the next thing I want to go into is
that a lot of people think cache hit
rate isn't really something that a
human's able to internalize well. So
another way that I can visualize it is
by thinking about it in terms of the
number of times on average that a chunk
of tokens, which is a group of tokens,
is refreshed. In this particular
conversation that we're looking at here,
this is showing the relationship of how
increasing the time to live affects my
cache hit rate. But it also shows, based
on the secondary axis, that at 1 minute
I'm literally re-prefilling the same
tokens 15 or 16 times. And over time we
can get that all the way down to
approaching one, and make significant
differences to, again, the experience of
both the user and the inference
provider.
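The "number of times a chunk is re-prefilled" view can be counted directly. The Python sketch below tracks only the earliest chunk of one conversation and assumes the cache lifetime is refreshed on every request; the timestamps are made up and the function name is hypothetical.

```python
def prefill_count(request_times: list[float], ttl_seconds: float) -> int:
    """How many times the shared prefix is computed from scratch:
    once for the first request, plus once for every gap that
    outlives the time to live."""
    if not request_times:
        return 0
    count = 1
    for previous, current in zip(request_times, request_times[1:]):
        if current - previous > ttl_seconds:
            count += 1
    return count


# Hypothetical request timestamps in seconds.
times = [0, 30, 50, 200, 230, 1000, 1020, 4000]

for ttl in (60, 300, 3600):
    print(f"TTL {ttl:>4} s -> prefix prefilled {prefill_count(times, ttl)}x")
```

As the time to live grows, the count falls toward one, which is the "approaching one" behavior the speaker describes.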
So with that, what I'd like to do now is
go into the context engineering side of
it, some of the lessons we learned, and
just sort of really drive this home. So
now I want you to think about what I
think will be common in 2026 and
onwards: people hosting their own
systems, or having their own dedicated
systems hosted for them. So think of
yourself as an inference provider. Maybe
you've worked with us or one of our
partners to build your own self-hosted
instance, and you want to get the most
out of it. What this graph is showing
you is a relationship, for a certain
context length, between the cache hit
rate and how many output tokens you get
as a result of that cache hit rate. Now
the first thing you'll see is it's not
linear, and the shape of this curve will
change based on the context length and
based on the accelerators you use.
There's lots of things that come into
it, like how you do prefill/decode
disaggregation. But the curve is more or
less the same. And if I asked you as an
inference provider, where do you want to
be? You'd obviously say C. If you're in
A or B, you're not making money or
you're not getting enough value out of
the system. And the inference providers
that we work with have the same answer,
obviously. So the question is, well, how
do they stay in C? This is where it goes
back to a slide that Val showed earlier,
where what they're doing is
incentivizing users to stay within C.
And this is where we came to the
realization that, because of how much
cache hit rate impacts your actual
output, you're buying cache allotments
in storage when you're actually buying
subscription services. It is so
important to them that you stay in a
certain cache hit rate band, especially
for agentic workflows. Otherwise you'll
literally just melt the GPU clusters
that they have. And I think it's a
really powerful thing to have in your
head about how that works.
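One way to see why the curve is not linear is a toy cost model, sketched here in Python. It is an assumption-laden simplification and not the model behind the speaker's graph: it ignores batching and disaggregation, and the per-token prefill and decode times are invented. Only uncached context tokens pay prefill cost, so output throughput is a hyperbolic function of hit rate.

```python
def output_tokens_per_second(hit_rate: float,
                             context_tokens: int = 100_000,
                             output_tokens: int = 500,
                             prefill_s_per_token: float = 1e-4,
                             decode_s_per_token: float = 2e-2) -> float:
    """Toy model: GPU time per request is prefill of the uncached part
    of the context plus decode of the output. All defaults are
    hypothetical numbers chosen for illustration."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    prefill_time = context_tokens * (1.0 - hit_rate) * prefill_s_per_token
    decode_time = output_tokens * decode_s_per_token
    return output_tokens / (prefill_time + decode_time)


for h in (0.0, 0.5, 0.9, 1.0):
    print(f"hit rate {h:.1f} -> {output_tokens_per_second(h):.1f} tok/s")
```

In this toy model the gain from moving between 50% and 100% hit rate is larger than the gain from 0% to 50%, which is consistent with providers wanting users to stay in the high-hit-rate band.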
So what we're going to do now is go
through and think about, okay, what
makes up this token storage.
When you think about the token storage,
there's lots of things that the memory
tiers that support it need to be able to
do. But to make it really, really
simple: you need enough capacity in
these memory tiers so that you can hold
an optimal amount of cache. If you think
back to the slides I just showed,
there's this point where having more
cache helps you a little bit, but it
gets to a point of diminishing returns.
You need to get at least to that point.
And you need to be able to store into it
extremely fast, because if you can't,
you're going to be dropping KVs before
they're in the memory tier, or you're
going to be blocking GPUs, which is
probably even worse. And then the other
thing you need is to be able to fetch
from that token storage very, very
rapidly so that you, again, don't block
the GPUs. They're the first-class
citizen of this whole system.
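The capacity and fetch-speed requirements above can be put into rough numbers. The Python sketch below uses the standard per-token KV cache size for a transformer with grouped-query attention (keys plus values, per layer, per KV head). The model shape, session length, and bandwidth are hypothetical values chosen for illustration, not figures from the talk.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """KV cache size for one token: keys and values (the factor of 2)
    for every layer and every KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def fetch_seconds(tokens: int, bytes_per_token: int,
                  bandwidth_bytes_per_s: float) -> float:
    """Time to load a session's KV cache at a given tier bandwidth."""
    return tokens * bytes_per_token / bandwidth_bytes_per_s


# Hypothetical model shape: 80 layers, 8 KV heads, head dim 128, 16-bit.
per_token = kv_bytes_per_token(80, 8, 128, 2)
session_tokens = 100_000  # assumed context length of one agent session

print(f"{per_token} bytes per token")
print(f"{session_tokens * per_token / 1e9:.1f} GB per session")
print(f"{fetch_seconds(session_tokens, per_token, 50e9):.2f} s at 50 GB/s")
```

With these assumed numbers a single long session is tens of gigabytes, so both how many sessions a tier can hold and how fast it can return one to the GPU matter.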
So what does it look like? There's a
few different types of memory tiers. The
most common, obviously, is HBM, and Val
and I would love it if all our sessions
were in HBM at all times. It's just not
reasonable. There's many reasons for
this around how the batch works, which
we're not going to go into today. But
the point is that the main common way
that this is done today is DRAM. And
there's nothing really wrong with DRAM
as such. It's sort of a means to an end,
but it's quite limited in size. It's
okay in terms of performance. But the
other thing is it's tightly coupled with
the compute. So if you want to expand
your DRAM, there's not really many good
ways to do that. There are some
technologies out there that kind of do
this, but the way they're implemented,
they kind of just hurt your performance.
And that's what I'm showing with pooled
DRAM. You could pool more together, but
it doesn't help that much. So what we at
WEKA did is we took all the durable
advantages of our product, which has
been tried and tested in AI training and
HPC environments, and Augmented Memory
Grid is basically a supported, optimized
connector between the inference systems
and our existing product. And because
we're backed by NVMe, we're much denser,
like a thousand times denser depending
on how you look at it. It's quite
significant. And then I show another
example of a storage system at the top
there: not something sluggish, something
that can still get 50 or 60 GB a second,
and it has the capacity, but relative to
what we're talking about it is still
quite slow.
Okay. So then moving on to how do we
test this? Again, we talked about how
we've open sourced this. Val already
covered the main part of it, the fact
that it acts like it's an inference
provider. It's trying to keep the load
within two SLOs if you enable them. You
actually don't have to enable them, and
then it'll just go as hard as it can
regardless of an SLO, be it time to
first token or output tokens per
request. But the main thing that it can
do is you can either set a static number
of coding agent users or you can
increase the number of those users over
time, so that you can slowly utilize
more of the memory tiers and be able to
compare different configurations.
So there's two ways that it works. I'll
just be quick through these sections
because you can read about this. I have
a blog that explains how I do the
testing and goes through all of this in
detail, and there's obviously the GitHub
as well. But basically it can take the
initial working set and then
sequentially go through those prompts.
This will be very, very deterministic,
because as soon as you overflow the
memory tier even the slightest bit,
you'll see a massive drop-off in
performance. But the other way that it
can be done, and realistically the more
fair way, is you can increase the size
over time, so the amount of concurrent
users that you're accessing out of a
pool, and you can randomly sample where
in that sample set you'll get the prompt
from. So sometimes you might be hitting
HBM, sometimes you might be hitting your
memory tier 2, let's say that's DRAM,
and you get a really nice blended
number.
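The two access patterns just described, sequential and ramped random sampling, can be sketched as request schedules in Python. This is a hypothetical illustration written for this transcript and is not code from the open-source toolkit; every function and parameter name here is made up.

```python
import random


def sequential_schedule(num_users: int, num_requests: int) -> list[int]:
    """Deterministic mode: cycle through the users in a fixed order."""
    return [i % num_users for i in range(num_requests)]


def ramped_random_schedule(start_users: int, max_users: int,
                           num_requests: int, seed: int = 0) -> list[int]:
    """Random mode: grow the active pool linearly from start_users to
    max_users, and pick a random user from the pool for each request."""
    rng = random.Random(seed)
    schedule = []
    for i in range(num_requests):
        # Linear ramp of the pool size over the length of the run.
        active = start_users + (max_users - start_users) * i // max(num_requests - 1, 1)
        schedule.append(rng.randrange(active))
    return schedule


print(sequential_schedule(3, 7))
print(ramped_random_schedule(2, 10, 20))
```

In the sequential schedule every user is revisited only after all the others, so once the pool no longer fits in a tier, every request misses it. In the random schedule some requests land on recently used sessions and some do not, which gives the blended number the speaker mentions.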
So with that, let's go in and show you
some results and just sort of explain
and show why we're so excited about what
we're talking about today.
So this is showing three comparisons.
Comparison number one is HBM with WEKA.
That's the purple. There's orange, which
is HBM and DRAM. And there's the
orangey-pinky color, with HBM plus DRAM
plus that other POSIX system that I
talked about earlier. The dotted line is
showing concurrent users, so the amount
of users that are in a pool, and that's
increasing over time. In the initial
shaded area you can see that all three
of them get an advantage from HBM. The
primary cache hit rate is coming out of
HBM. But then over time, as we increase
the users more and more, you're
overflowing what the DRAM memory tier
can do, and both the orange and the
pinky color start to drop off quite
dramatically. From a WEKA perspective we
also drop off, because we get less and
less advantage from HBM, so we have to
pull back our concurrency a little bit.
The benchmarking tool does that
automatically. But then once we've got
down to the steady state, all three
start to level out a little bit. The
main difference is that once you get
down to that steady state, we can
maintain it at a much higher number of
users and a much higher number of output
tokens.
The other way that you look at this:
that was a decode-focused role. If you
look at a prefill-focused role, if
you're doing disaggregated prefill, then
the result is actually even better for
us, because the GPUs are so much more
efficient when you're doing large
batches of prefill tokens with a single
decode. Then we can basically saturate
things more fairly, and it continues.
Now the main difference between purple
and orange is that we have a lot more
cache, so we can hit a lot more. The
interesting thing about the
orangey-pinky color is that it also has
the capacity to hit every single thing
that's possible, but it's not fast
enough at getting it into the GPU for it
to make a difference. And that's why
we're showing the difference between
these three, because with purple you're
getting the advantage of capacity but at
DRAM speeds, so you can maintain that
benefit for longer periods of time.
And then maybe, Val, I'll hand back to
you.
>> Absolutely. That was a great walk
through, Callan, of all of your research
and benchmark results in WEKA Labs. So
once again we're thrilled to be
announcing the open sourcing of this
context platform engineering toolkit
today. Please do download it, use it,
give us your feedback. Again, feel free
to fork it and improve it yourself. And
we look forward to together contributing
to less token anxiety overall, less
prompt cache arbitrage, and more context
and context platform engineering in the
future. There's a nice QR code for you
to find out even more information. And
at the end of this video, in the actual
transcript section and so forth,
there'll be links to all the blogs we
referenced here. So, thank you for
joining us today, and we look forward to
pairing on the context platform
engineering conversation with you in the
future.