A2A & MCP Workshop: Automating Business Processes with LLMs — Damien Murphy, Bench

Channel: aiDotEngineer

Published at: 2025-07-26

YouTube video id: wXVvfFMTyzY

Source: https://www.youtube.com/watch?v=wXVvfFMTyzY

Hey everybody, thanks for coming. Great to see a full room; it's always good when you're doing a workshop to have a lot of people here. I'm Damien Murphy, and I'm going to be presenting A2A and MCP, two pretty hot topics in AI these days, and how you can use them to automate business processes.
A little bit about me: about 15 years as a full-time full-stack developer, five years doing solutions engineering, so a customer-facing, forward-deployed engineer, and I've spent the last three years or so working on voice AI and AI agents. I did a workshop last year as well, on AI voice agent swarms, which was a pretty hot topic. I think it's now pretty much standard that everybody can build a voice agent in five minutes, so the hard part becomes building autonomous agents that can actually do complex tasks. I joined Bench Computing about two months ago, a pre-revenue startup backed by Sutter Hill Ventures, and we're building what I'd describe as a better Manus, one that's more focused on teams and enterprises.
If you're not familiar with Manus, it's an autonomous AI agent. Bench is essentially an autonomous AI agent that can do parallel sub-task automation.
All right. So in the workshop today, we're going to build a multi-agent system using A2A agents. If you're not familiar with A2A, it's a protocol Google released that allows agents to communicate over the web.
We're going to integrate these agents with MCP, the Model Context Protocol. MCP is like a USB-C port for your agents: it lets them consume context, tools, and resources very easily.
We're going to get these agents to work together, trigger the agent with a webhook, and then I'll cover a little bit about when to use A2A versus MCP. I'll also go into prompt caching and context management.
All right, so A2A. It's not exactly clear what it's for and why it exists; if you ask everybody in the room what they think it does, you'll probably get a different answer from each person. But the key benefits: you get agent specialization, so rather than trying to make one agent do a hundred things, you can have a hundred agents each do one thing and do that one thing very well. A2A lets you handle task delegation: imagine you had a Salesforce agent and you wanted it to interact with all the Salesforce MCP tools; you could do that. You also get parallel processing, which becomes very important when it comes to speed and context management. And you can use those A2A agents to build complex workflows while keeping your main agent's context size down.
MCP, again, is a really hot topic right now. It's been coined the USB-C for AI, and there's definitely a benefit in just having a standard interface. There are something like 10,000 MCP tools you can use today, and about 7,000 of those come through the Zapier MCP. If you're not familiar with Zapier, it's essentially a way to connect disparate systems together, and they've now released all of their Zaps, as they're called, as MCP servers and tools.
One of the great things about MCP: no integration with APIs. You don't have to do any special handling for each different API. It's a plug-in architecture and an industry standard, and it's really based on LSP, the Language Server Protocol, which was the way for IDEs to figure out how different languages work. It was a great transfer of ideas over to the MCP protocol.
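Under the hood, MCP is JSON-RPC 2.0, much like LSP. As a rough sketch (method names per the MCP spec; the tool itself is a hypothetical Zapier-style tool, not a real listing), a tools/list exchange looks something like this:

```typescript
// MCP messages are JSON-RPC 2.0. A tools/list request/response pair
// looks roughly like this; the tool name and schema are illustrative.
const listToolsRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/list",
};

const listToolsResponse = {
  jsonrpc: "2.0",
  id: 1,
  result: {
    tools: [
      {
        name: "slack_send_channel_message", // hypothetical tool name
        description: "Send a message to a Slack channel",
        inputSchema: {
          type: "object",
          properties: { instructions: { type: "string" } },
        },
      },
    ],
  },
};
```

Calling a tool is the same pattern: method `tools/call`, with the tool name and arguments in `params`.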
All right. So when should you use A2A versus MCP? Anybody?
If you want to access resources or infrastructure, then you go for MCP? But I don't know.
And that's kind of the challenge, right? What exactly are these protocols for, and should I be using them? So A2A is for when you want two agents to talk, and typically two agents that are completely unrelated. It's not two agents you necessarily control; it's more likely going to be a third party's agent, or their first-party agent and your agent.
What's the difference between agent frameworks and A2A? I work a lot with agent frameworks where we have multiple agents doing the same thing. What you're describing with A2A sounds a lot like that.
Yeah. So frameworks like AutoGen let you manage multiple agents locally. A2A is more about remote agents, agents you have no knowledge of. You can think of A2A as a way to get service discoverability: once you have the endpoint of the agent, you can learn everything that agent is capable of. With things like AutoGen, it's descriptive: you describe what each agent is capable of, and it's in your control.
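That discoverability step can be sketched in a few lines. Per the A2A spec, an agent publishes an agent card at a well-known path on its base URL; the port below is just an example:

```typescript
// Build the well-known URL where an A2A agent serves its agent card.
function agentCardUrl(baseUrl: string): string {
  return new URL("/.well-known/agent.json", baseUrl).toString();
}

// In the real flow you would then fetch and inspect the card, e.g.:
//   const card = await fetch(agentCardUrl("http://localhost:41241"))
//     .then((r) => r.json());
// and read card.skills to learn what the agent can do before delegating.
```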
So to summarize: agentic AI defines the role of each agent, while A2A works remotely, where the role may or may not be defined?
Each of the A2A agents will have a definition, and we'll get into that a little later. But think of agentic AI as a superset of everything; A2A and MCP are just subsets of it, different modalities. For MCP, you're connecting to external context and tools. A lot of people don't use most of MCP's features; they just use the tools. But there's a lot of other stuff: prompt templates, resources, and a thing called sampling. Sampling is going to be a really interesting feature that I think we'll see a lot more of. It allows an MCP server to sample the host LLM: if you're using Claude and you hit an MCP server, that server may also want to use the same Claude model you're using, and it can use sampling to achieve that. So when you bring the two together, you get the benefit of both: A2A is the remote interface, and MCP gives you the actual tool use and context management.
Okay, so when not to use MCP. You'll notice a lot of memes here, and just to give you a heads-up, all the memes were generated by Bench. Actually, the whole slide deck was generated by Bench; I just gave it a markdown file and it output the deck.
So, when should you use A2A or MCP? If you have full control of the tools, you probably don't need either. If your function is local to your codebase, why do you need to create a USB-C port? It's kind of like me plugging in my hard drive with a USB cable: shouldn't I just use the hard drive that's in my machine? Calling functions directly in your codebase is super easy, easy to maintain, and faster to develop. And if you have full control of your agents, you probably don't need A2A either: if they're your agents, you can use some sort of local function call for them to communicate. I've built multi-agent systems using MCP and using just local function calls, and it's a lot easier to just use the code you have. It's faster, there are no protocol overheads, and it's a lot easier to debug as well.
Okay. So why do you need A2A and MCP at all? Third-party tools are probably the number one reason to use MCP. You just get access to such a large array of tools. Say you're building a product, and you decide to build first-class integrations with Salesforce and Slack. What about the other 10,000 tools? Okay, we'll just allow people to add their own MCP server. That gives you great extensibility. But there are a lot of drawbacks with MCP too. You only get what you're given, and a lot of the time that's not exactly what you want. So you may go down the route of saying, you know what, I need a way to actually index this data so that I'm not calling list-Slack-channels every time I want to post a message to a channel.
And with A2A, the complexity is hidden from you. That's one of the key tenets of A2A: you don't know anything about an agent until you connect, and all of its complexity is completely opaque. You can connect to any remote A2A agent, so long as you have the credentials. We haven't seen any first-party A2A agents released yet, but Google has about 50 partners they're going to launch with, so I'd imagine there's going to be something like a Salesforce A2A agent. It'll probably only come with a paid account, because it's going to use LLM compute, whereas MCP servers typically don't use their own LLM; they use the host LLM.
All righty. So we're going to get into the code now.
If you haven't already grabbed the repo, we also have a Slack channel, workshop-a2a-mcp-2025. In the repo there's basically everything you need to get going.
The code structure: we've got a host agent, and then we've got some sub-agents. The whole concept here is to demonstrate A2A and MCP, but in reality these sub-agents would probably live in a different repo and run on a different server. We've also got the A2A implementation, the server and the client, in the repo; these are taken directly from the A2A repo. We've also got the MCP integration: this is just a client, we're not creating a server here. And we have a CLI interface; you're not going to need it, that's just how things are used internally.
Once you've cloned the repo, you'll want to run npm install. You're going to need an MCP server URL, which will be a Zapier URL, and a Gemini API key. You can get both of these for free; there's no need to sign up for a paid account. And you'll want to rename your .env.example.
All right, so setting up the Zapier MCP. When you go to zapier.com/mcp, you'll have the option to create a new server, and when you go to connect you'll have a couple of options. We're going to use SSE. They recently released streamable HTTP, which deprecates SSE and is going to replace it, but there's still a litany of SSE servers out there, so I just used SSE for this one. Once you do that, you'll get a server URL at the bottom. Copy that URL; that's the one that goes into your .env.
Then you're going to set up a Slack and a GitHub integration. You'll want the ability to create an issue; you can put in the repository URL for the workshop if you want, or use your own. You can let the AI choose these fields, but what I've found is that it will choose something else. So a lot of the time with these MCPs, you'll want to say, hey, this is the specific thing I want to do, and just hardcode it. If you do let it go wild in your Slack, it's going to start posting in general and random and sales; a few of my bots have gone rogue that way.
All right, the Gemini setup. You can get the API key in Google AI Studio; there's a link in the slide deck as well if you need it. You can get a free account, generate an API key, and drop that into your .env as well.
There's also a remote Bench A2A agent. The code for it is actually in the repo, but we haven't officially released our API yet, so I'm just hosting it remotely. It's a nice way to show how you would use A2A remotely as well.
So what is Bench? Bench is essentially an LLM aggregator with autonomous AI agents. You get access to Claude, Gemini, OpenAI, xAI, and lots more models. It has about 30 tools and integrations now. We actually started out with MCP integrations to Slack and Salesforce; they didn't meet our needs, so we built first-party integrations with data caching and indexing. And that gives you an idea of how far MCP will get you: eventually, at some point, you'll realize it doesn't do the specific thing you need.
All right, so running the application: you're going to run npm run start:all, and that kicks off all the agents, so the Slack agent, the GitHub agent, and the host agent. It'll also start the webhook server and the webhook admin panel, which you can then access through localhost on port 3000.
So let's go through what each of the agents does. The host agent is essentially your central coordinator. This may be the only agent that lives in your application; it may be using external A2A agents, and if that's the case, everything your host does gets delegated to sub-agents. It handles all the agent discovery and brings everything together.
The code for that is in src/agents/host, and you'll notice a couple of files in there. One of them is the host agent prompt, which is just a plain-text system prompt. Another is the Genkit setup, which is essentially how you hook all of your A2A code up to Gemini; there's also a Genkit MCP plugin that the sub-agents use.
Then the Slack agent. This sends a Slack message in response to the webhook transcript. The sample webhook we have here is essentially "your meeting ended and you've received a transcript of that meeting", and with that, the system decides what to do: if it detects any bugs, it creates a GitHub issue; if it detects any feature requests or anything else of interest, it posts that into Slack. You can imagine the kinds of automations you can build with this sort of scenario. I had a version that was hooked up to Salesforce, but there's a limitation on how many sub-agents the host agent can call, so I figured if one of them had to go, it would be Salesforce, because it's probably the hardest to get an account for. But you could update an opportunity based on a sales call: you're talking to a prospect, doing your discovery, and those Salesforce fields get updated automatically. The time saving for account executives, who are probably on back-to-back calls, is actually pretty big.
This was an interesting issue I ran into. I asked one of my colleagues to test the repo out, and he was getting this weird error where it said the Slack MCP succeeded. So I asked him to send me the logs, and they showed "isError: false". And I'm like, okay, that's great.
It turns out that not all MCPs are created equally, and the Zapier Slack MCP fails silently. The reason it failed was that he had the default Slack channel name, something like test-damien-slack, and he was in a different workspace where that channel didn't exist. So it just failed silently. I added a bit of code to detect this kind of empty text array, so it will fail properly now. But it goes to show you the limitations of MCP.
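A minimal sketch of that guard, assuming the MCP tool-result shape of an `isError` flag plus a `content` array of text parts (the result shape follows the MCP spec; the exact check here is my own, not the repo's code):

```typescript
interface McpToolResult {
  isError?: boolean;
  content: Array<{ type: string; text?: string }>;
}

// Zapier's Slack MCP can report success with an empty content array
// (e.g. when the channel doesn't exist), so treat that as a failure too.
function assertMcpSucceeded(result: McpToolResult, toolName: string): void {
  const texts = result.content.filter((c) => c.type === "text" && c.text);
  if (result.isError || texts.length === 0) {
    throw new Error(`MCP tool ${toolName} returned no output (silent failure?)`);
  }
}
```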
The GitHub agent is pretty straightforward; it's probably the most basic of the three or four. It just creates a GitHub issue. Super simple. But you can imagine how you'd extend this: maybe it opens a PR, maybe it actually implements the fix for the bug that was reported in the meeting. You can see how, down the line, as AI gets better and things really improve, a lot of this automation will be driven by human interaction: speaking with people, posting messages in Slack, talking in GitHub discussions, all of it triggering AI to take action.
The Bench agent can do a lot, and that was actually one of the problems I found with A2A: the more functions and capabilities an agent has, the harder it is to describe its capabilities in the agent card. The agent card is essentially the public information, exposed to any other agent, about what that agent is capable of. So I had to really pare it back and say, look, you can do a handful of things; I know you can do more, but for now these are the few things you can do. It's able to go off and browse the web, do research, do data science, all sorts of things, but we're just going to use it for researching the company and the people in the meeting transcript.
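For a concrete picture, here's a hypothetical agent card for the Slack agent, loosely following the A2A card shape (name, description, url, skills). The values are illustrative, not the repo's actual card:

```typescript
// Hypothetical A2A agent card; the host agent reads this at connect
// time to learn what the agent can do. Port and ids are assumptions.
const slackAgentCard = {
  name: "Slack Agent",
  description: "Posts meeting summaries and alerts to Slack via MCP tools.",
  url: "http://localhost:41241",
  version: "1.0.0",
  capabilities: { streaming: false },
  skills: [
    {
      id: "send_slack_message",
      name: "Send Slack message",
      description: "Post a message to a named Slack channel.",
    },
  ],
};
```

This is also why a sprawling capability list makes delegation harder: the host's model has to pick a skill from these descriptions alone.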
All right, here we go. Demo gods. Before I start, any questions?
Yeah, you mentioned some limitation on the number of agents?
Yeah, the Genkit implementation that Google provides limits you to a maximum of five sub-agent calls per turn.
Is that a hard limit?
Yeah, I couldn't get around it. There was a max setting, but it didn't work. It's something I'm sure they'll fix eventually, but it was an interesting issue.
All right. Let me see if my code is running. Yeah, I think it is. It should be here. And actually, I'll show you the MCP server as well while I'm here.
This is the MCP Inspector. It's an open-source repo that's part of the Model Context Protocol project. Sorry, yeah, at the back?
Yeah, that's actually in the agent card, so that'll be in the index.ts of the sub-agent. I'll be going through the code in a little bit so you can see it.
So I'm connecting to the Zapier MCP URL that I got; I just copied this one, dropped it in, and I'm going to connect over SSE. This lets you list the tools and call the tools. And it's quite interesting that Zapier has now added "instructions" as a mandatory field on all of their MCP tools, so you don't actually need to fill out the individual fields anymore; you can just give it natural language. That suggests to me that they're using an LLM on their side to figure out how to populate the fields on your behalf, which is interesting, because it's going to cost them a fortune as more people adopt it.
All right, so this is the agent dashboard. Let's just make sure everything's working. You can see a couple of previous runs here. This one is actually the run where the Slack channel wasn't found, from when I was testing that. (My mouse isn't moving... there we go.) I put in a typical unknown Slack channel, and it detected that it couldn't find it, based on the heuristics.
So you have four agents defined here, all A2A agents?
Yeah, correct.
Okay. So the maximum you can go for A2A agents is five?
Yeah, when I got to five, that's when I got the error. So, four.
And these are the host agent logs. You can see it connecting to the different agents. This agent's just running on a little dinky EC2 instance I spun up. It goes through, learns about the agents, and processes webhooks; you don't really need to go in here unless you get a failure. The Slack agent is pretty similar: it's basically just sitting there waiting for another agent to connect, and when one does, it communicates with it. And you can see here that the Bench agent is running remotely. The reason I don't have verbose logs for it is that it's remote; it's not under my control, so the A2A logs for that agent are actually on the EC2 server. Which brings up another question: how do you debug when an A2A agent fails?
Then on the webhooks page, this is the only webhook that's preconfigured. It basically explains to the agent what to do when the webhook arrives: process the incoming payload. We have a little prompt template here that tells it what the agent capabilities are and how to analyze the transcript. And then we have the processor config, which just says, hey, these are the agents you have access to as part of this webhook. That becomes important when you've got, say, a hundred A2A agents and you only want two of them to be involved. And then here we have a test: this is just a fake transcript generated with an LLM. When we send the webhook, you can see it processing, and hopefully the demo gods will be good to me here.
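The webhook config described above might look roughly like this. The field names are hypothetical, not the repo's actual schema:

```typescript
// Illustrative webhook processor config: a prompt template for the host
// agent plus an allow-list of A2A agents this webhook may delegate to.
const meetingEndedWebhook = {
  path: "/webhooks/meeting-ended",
  promptTemplate:
    "A meeting transcript follows. Create GitHub issues for any bugs, " +
    "post notable feature requests to Slack, and research the attendees.\n\n" +
    "{{transcript}}",
  agents: ["slack-agent", "github-agent", "bench-agent"],
};
```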
It does take a little bit of time: the host agent has to process it, then reach out to the sub-agents and get all the information. I think the Bench agent probably takes the longest, because it's doing its own sub-tasks as well. Okay, we got a Slack message; that's a good sign. So Snowflake is interested in Slack and GitHub integrations.
Very cool. And we have the GitHub side. (I don't know why my mouse keeps freezing. There we go.) So we should have a GitHub issue. Here we go. "During the trial, the AI misclassified the severity of the bugs; engineers need to investigate and fix the issue." It's a really simple use case, but you can imagine that transcript being ten times longer, with a lot more information in it, and it will just work. And then we also have the Bench agent. Oh, looks like it's still waiting for results; it's going to research the company. I think I did an earlier run where it returned a result. Let me see. Yeah, so it basically goes off, does research on Snowflake and all the participants of the call, and returns that information. This can get as complex or as simple as you want it to be. Has anybody managed to get it up and running?
Wow, impressive.
Question: you're using the Bench agent to do the orchestration, is that why you have it?
No, the Bench agent is just a third-party agent we can leverage; the host agent is doing all the orchestration.
Okay, so what's the actual role that agent is playing? What is it actually doing?
It's doing research on companies and people.
So it's just another agent.
Yeah, it's an agent with a load of different capabilities.
So the orchestrator isn't local?
It is: these three, the host, Slack, and GitHub agents, are all local.
Ah, I thought Bench was doing the orchestration.
No, Bench is just in the repo, but you need an API key for it, and we're launching in about two weeks, so I made it remote for the purposes of the demo.
What about the host agent, though? Is the host agent the Zapier agent, or...?
No. All of these agents are A2A agents. The Slack agent and the GitHub agent have MCP tools for Slack and GitHub through Zapier. I can actually show you a diagram that might explain it a bit better. I don't know if that explains it better, but...
But the orchestration does happen on your local machine?
Yeah, everything's happening locally. If I go into the codebase, I have the agent logs. This is all happening here: it sent the task to Slack. Is that readable? Let me zoom one more. So you can see here the transcript came in, it got a response from each of the sub-agents, and then completed them, and it did all of this in parallel as well.
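That parallel fan-out can be sketched with Promise.all. Here sendTask is a stand-in for the real A2A client call, so the names and shapes are assumptions:

```typescript
type SubAgent = { name: string; url: string };

// Stand-in for posting a task to an agent's A2A endpoint and awaiting
// its summary; the real version would make an HTTP call to agent.url.
async function sendTask(agent: SubAgent, transcript: string): Promise<string> {
  return `${agent.name}: processed ${transcript.length} chars`;
}

// Delegate to all sub-agents concurrently: total latency is roughly the
// slowest agent, not the sum of all of them.
async function fanOut(agents: SubAgent[], transcript: string): Promise<string[]> {
  return Promise.all(agents.map((a) => sendTask(a, transcript)));
}
```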
Sorry, was that a question?
Yes. In your example here, which agent would handle human confirmation? Say we want a "create the test" button with a spec here; which agent would handle that part? Do you create a new agent for human confirmations, or keep the old one?
You need a staging area for actions. It's not something I've built into this, and there's a lot more you could do here, but human confirmation would typically be done through a draft. So you'd maybe pop up a Slack message with some actions, and when somebody clicks one, it communicates back, kind of like a second-pass webhook. You might need to persist state, though.
How do you consider the security of these endpoints, with different vendors communicating endpoint to endpoint? How do you manage the security?
As part of the A2A spec, you're going to have some sort of authentication. I've just exposed everything here, but it won't exist tomorrow, so there are no security implications. Essentially, though, you'd probably have to have a subscription with the company providing that A2A agent, because it's consuming tokens. I'm not sure exactly what the A2A team has planned; it's still pretty early days. MCP is a little bit further ahead: it has OAuth, header authentication, things like that. So I'd imagine something similar.
And what about governance, like LLM firewalls, benchmarking, guardrails, and so on? Do you have a separate agent for that, or is everything handled together?
You'd probably manage that on something like Amazon Bedrock, and you'd just use that guardrailed LLM from behind there. You don't have to use Gemini here either.
And the host agent is kind of the planner for each sub-agent. Do you see the sub-agents talking to each other?
I guess you could, but I don't know if that's the intention; then they just become hosts themselves. If you think about it, if you have no knowledge of other sub-agents, how would you know to talk to them? You'd have to become a host agent yourself and connect to that other sub-agent. So I don't know if the A2A spec intends for sub-agents to communicate.
So with the host agent and the orchestration it's doing, is it actually managing a combination of all the context windows? Do you hit a limit quickly?
Yes, all of the context windows, and this is something I'm going to cover now, so let me go back to the slides; it's a good segue.
One of the benefits of A2A, or any sub-agent framework, is that you're not pulling the tool results into your own context. Say you have a load of Slack messages, GitHub issues, or Salesforce opportunities, and you want to analyze them and produce a summary of categories and counts. The only thing your host agent cares about is that summary of categories and counts. It doesn't care about the individual details, because those have already been processed by the sub-agent. So the sub-agent's context gets as big as the task demands, while the host agent's context only grows incrementally, by the business value it got from that agent. One of the challenges at Bench is that we have so many tools that the context can blow up very quickly, so very early on we decided we needed composability. That means Bench can create its own internal Bench agent to avoid that context-growth problem. We're even thinking of going one step further: should we have an agent for every single tool, so that every single tool is isolated from the primary prompt? As you add more tools, the tool definitions themselves add up; I think we're at around 10,000 tokens just for tool definitions alone. I added the Asana MCP and it added 11,000 more tokens. A lot of these MCP servers give you a lot of information, and you may not actually want it. That's one of the challenges with first-party MCPs: they expose all their tools. One of the benefits of Zapier is that you can pick and choose which tools to expose.
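One mitigation, and essentially what Zapier's pick-and-choose setup does for you, is to allow-list the tools before their definitions ever reach the model. A minimal sketch:

```typescript
type ToolDef = { name: string; description: string };

// Keep only the tools an agent actually needs, since every definition
// you expose costs prompt tokens on every single request.
function filterTools(all: ToolDef[], allowList: string[]): ToolDef[] {
  const allowed = new Set(allowList);
  return all.filter((t) => allowed.has(t.name));
}
```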
I was just going to ask: why do we need Zapier?
Zapier is just a really easy way to use MCP right now. I think Linear, Asana, and a few others have added first-party MCP servers that are much better than what Zapier exposes.
Yeah. So why does context size matter? AI agents accumulate context as they work, and you're supposed to keep all of your tool calls — what you sent to the tool and what you got back — in your context so that if you ask a follow-up question later, the agent still has access to that data. That becomes very challenging, and you've basically got two options: do I just prune old tool calls and the agent gets dumb, or do I figure out some other way to do it?

Cost is a big challenge too, especially when you're doing prompt caching. Prompt caching lets you put a marker in your context and say: when I make my next request, I want everything in my context so far to be cached so I'm not charged full price for it again. But the cost to actually push that into the cache is about 3x the cost of making a single request with that context. So you have to be very diligent about which context-management strategies you use.

I was running simulations because I couldn't figure out what the optimal caching strategy was. I ran them on usage data: what's the typical context growth, how many turns on average, what percentage of users only send one turn? Should we cache that one turn if they never ask another question? Probably not. So it probably gets down to the individual user level. If you have a user who always puts new prompts into the same chat and never opens a new session, you're probably going to want to cache their context continuously. But you might have another user who creates a new session for every question.

And then there's figuring out the context growth. I think we found around 30,000 tokens was the optimal threshold across the board. But that also comes with false positives: sometimes you end up caching the last turn of a conversation, and that costs you a lot more than it should.
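The break-even logic above can be sketched as a back-of-envelope calculation. The 3x write multiplier comes from the talk; the cached-read multiplier of 0.1x is an assumption here (it varies by provider, so check your own pricing):

```typescript
// Back-of-envelope: when does prompt caching pay off?
const WRITE_MULT = 3.0; // cache write vs. plain request over the same context (from the talk)
const READ_MULT = 0.1;  // cached read vs. plain request (assumed; provider-dependent)

// Cost of n follow-up turns over the same context, without caching:
function costWithoutCache(contextTokens: number, turns: number): number {
  return contextTokens * turns; // pay full price for the context every turn
}

// With caching: pay ~3x once to write, then cheap reads on every later turn.
function costWithCache(contextTokens: number, turns: number): number {
  return contextTokens * WRITE_MULT + contextTokens * READ_MULT * turns;
}

// Smallest number of follow-up turns where caching becomes cheaper.
function breakEvenTurns(): number {
  let n = 1;
  while (costWithCache(1, n) >= costWithoutCache(1, n)) n++;
  return n;
}

console.log(breakEvenTurns()); // 4
```

Under these numbers a user needs about four follow-up turns before caching pays off — which is exactly why caching the single turn of a one-question user is a losing bet.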
Yeah. So the great thing about sub-agents is that they protect you. That was the GitHub example I was giving, but it applies to pretty much every tool. If you're ever integrating with a system, you'll run into issues like: why do I have to call "list Slack channels" every time just to resolve a channel name to a channel ID? Nobody is going to paste a channel ID into a chat — it's a UID, it's not memorable. Then you get into the question of: do I just cache the list of channels, and when do I update it? What if a channel was deleted, renamed, or added? And cost is probably the biggest issue of all.

So the benefits of lean context: your sub-agents have isolated context, which keeps things fast, low latency, and low cost. If you ever need to go back and ask another question, you spawn that process again — so if you're in control of the other agents, you might want something like a five-minute TTL on previous questions. The host agent only processes the summaries, and the raw data is discarded after processing.
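The "summaries only" pattern can be sketched in a few lines. The names here (`runSubAgent`, `HostContext`) are illustrative, not taken from the workshop repo:

```typescript
// Lean-context sketch: the sub-agent does the heavy tool work, and only a
// short summary ever enters the host agent's context.
interface SubAgentResult {
  summary: string; // what the host keeps
  raw: string;     // the bulky tool output — discarded after processing
}

// Stand-in for a sub-agent that pulled a big tool payload (e.g. GitHub API JSON).
function runSubAgent(task: string): SubAgentResult {
  const raw = JSON.stringify({ issue: { id: 123, body: "x".repeat(5000) } });
  return { summary: `Created issue #123 for task: ${task}`, raw };
}

class HostContext {
  private messages: string[] = [];
  delegate(task: string): void {
    const result = runSubAgent(task);
    this.messages.push(result.summary); // keep the summary...
    // ...and let result.raw go out of scope: it never bloats the host context.
  }
  size(): number {
    return this.messages.reduce((n, m) => n + m.length, 0);
  }
}

const host = new HostContext();
host.delegate("file the login bug");
console.log(host.size()); // a few dozen characters, not the 5000+ char raw payload
```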
Yeah, so I'm going to jump back into the code and walk you through how it all works.

All right, we'll start with the host agent, and you'll notice a few other things. There's the MCP client — this is just standard MCP client code that lets you consume the MCP calls coming from the LLM. Then there's the GitHub tool: this is what gets sent to that Zapier endpoint, calling GitHub "create issue". And the Slack agent does "send Slack channel message". These are the MCP client tools the individual agents use.

Then there's Genkit — this is based on what they provide in their sample repo. You can use a different model if you want, change the settings, but essentially it spawns a new instance of what's going to communicate, and this loads the system prompt. I can open the system prompt here. It's got a critical workflow: it's going to do these things in this order, with a few steps — discovery first. This is actually something I noticed: if you don't tell the A2A host agent to call "list remote agents", it just won't, and it'll try to answer everything by itself. It can very easily fake sending a Slack channel message and say, "Oh, I just sent it for you." And I say, "No, you didn't."

One of the things I've noticed using Cursor is that every time I catch it doing something wrong, it says, "You're absolutely right." I even tried to prompt that out of it, and it's not promptable — you can't get it to stop saying that. Cool.
Yeah. And then the index — this is actually where the agent card is. It's a little bit long. Let me see... I think it's up here near the start. There we go. That was line 1200, so I'm not near the start at all. So this is what the host agent exposes if somebody else wanted to call it: it has the abilities to list remote agents and send tasks. Compare that to the GitHub agent card, which is a lot smaller. There we go — the GitHub agent can create GitHub issues. It's got the ability to do various things, and it has a list of skills. This is all the host agent really knows about that agent.

You can imagine how big this might get if you were to implement every single API that, say, Salesforce has. In a lot of cases — at least with Salesforce — rather than implementing wrappers around the APIs, you're probably just going to want to use SOQL or SOSL directly and let the agent write the queries. There's a lot of flexibility when you effectively have direct database access, because the LLM can bypass the API layer and go straight to the database.
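For readers who haven't seen one, this is roughly the shape of an A2A agent card like the GitHub agent's. The field names follow Google's A2A samples, but check the current spec before relying on them — and all the values here are illustrative, not from the workshop repo:

```typescript
// Sketch of an A2A agent card: the entire surface area a host agent sees.
const githubAgentCard = {
  name: "GitHub Agent",
  description: "Creates GitHub issues from natural-language requests.",
  url: "http://localhost:41242", // hypothetical local endpoint
  version: "0.1.0",
  capabilities: { streaming: false, pushNotifications: false },
  skills: [
    {
      id: "create_issue",
      name: "Create GitHub issue",
      description: "Extracts a title and body and files an issue via MCP.",
      examples: ["File a bug: login button unresponsive on mobile"],
    },
  ],
};

// This card is all the host knows about the remote agent — which is why
// wrapping every Salesforce API as a separate skill would blow it up fast.
console.log(githubAgentCard.skills.length); // 1
```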
And then the GitHub agent prompt — it's got a few things. This is something I had to add, because the agent insisted on mentioning who submitted the bug report. There are definitely concerns around PII leaking from your internal meeting transcripts and ending up in GitHub. That goes back to your question about how you audit what's coming out of these LLMs. You can do that in a number of ways, but it wouldn't be part of the A2A spec — I think it would just be that the LLM you connect to has those guardrails in front of it, and you're using that guarded LLM. The Slack agent, similarly, has a very simple agent card that I can't seem to find.
And then if we jump over to the host config — this is essentially what configures the webhook. The webhook has a config that tells it what it's doing, and you can see that in the UI as well. Within the A2A folder we've got the client and the server; these are pulled directly from the A2A repo. I don't think they've actually published types or packages yet, which is kind of confusing, but you can bring that code in yourself. And then the webhook server — this is just a web UI. Initially I had this whole thing done through the CLI, because when you're coding with tools like Cursor or Augment Code, CLIs are way easier for AIs to write: they can test it, interact with it much better, and produce those outputs.
Awesome. So I'm going to shift over to Q&A now. Anybody have any questions? Yeah.

I want to talk evals for a second. I assume you manage them at the agent level — is there any type of distributed evaluation?
Yeah, I haven't done much with evals on A2A. I still think A2A is a bit too early to go into production — even MCP is kind of borderline; there are a lot of rough edges. I think you can achieve much better results if you're in complete control of everything, with your own local function calls.
Yeah.
Any reason you used TypeScript instead of Python?

Yeah, you can use any language. I think the A2A framework is actually better in Python — I just prefer TypeScript myself.
Yeah.
Can you tell us more about the caching? Is caching provided by the model providers, or do you implement your own?

Yeah, you implement your own caching — you decide when to move that cache marker and how to manage it. It can be tricky, and I don't think there's very good information available online about what the best strategies are. When I was doing the simulations, I used linear growth, exponential growth, and fixed-size strategies and compared them all. They all worked out to between 25 and 35% cost savings. But in practice you'll find outliers where the cost of a session balloons because you cached at the wrong point.
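One of those strategies — the fixed-size one — can be sketched as: move the cache marker whenever roughly 30,000 uncached tokens have accumulated past it (the threshold the talk found roughly optimal). This is a sketch of the idea, not the production code:

```typescript
// Fixed-threshold cache-marker strategy: fold context behind the marker
// (paying the ~3x write cost once) whenever enough uncached tokens pile up.
const THRESHOLD = 30_000; // the roughly-optimal growth step from the talk

interface CacheState {
  marker: number;   // tokens already behind the cache marker
  uncached: number; // tokens accumulated since the marker
  writes: number;   // how many times we paid the cache-write cost
}

function onTurn(state: CacheState, newTokens: number): CacheState {
  const uncached = state.uncached + newTokens;
  if (uncached >= THRESHOLD) {
    // Pay the cache write once; everything moves behind the marker.
    return { marker: state.marker + uncached, uncached: 0, writes: state.writes + 1 };
  }
  return { ...state, uncached };
}

let s: CacheState = { marker: 0, uncached: 0, writes: 0 };
for (const t of [8_000, 12_000, 15_000, 5_000]) s = onTurn(s, t);
console.log(s); // marker moved once at 35k tokens; 5k still uncached
```

The failure mode described above falls out naturally: if the 35k-token write happens on what turns out to be the last turn, you paid the write cost and never collected a cheap read.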
Yeah. Yeah.
So each of the agents can be talking to its own LLM?

Yeah, they all have their own — which is kind of in contrast to MCP, where the MCP server wants to use your LLM, because it doesn't want to generate its own tokens. So yeah.
A question about authentication and authorization —

For MCP, or agent-to-agent authentication?

Either.

Yeah, so there are a couple of different ways. Within the configuration, you can have headers that do the authentication. I believe if you drop in an OAuth URL, you'll also get an OAuth popup. I really like OAuth authentication because you're getting the user's ACL — what that user can access is specific to them.
So you have to—?

Yes, it's going to be dictated by the remote server — either A2A or MCP. If you're running your own, you can choose what you want to run. There are different transport types as well: stdio is something you'd use locally — imagine you wanted to create a file on your desktop; you'd typically use stdio to interact with local resources. And SSE, server-sent events, was deprecated in favor of streamable HTTP.
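The three transports mentioned can be modeled as a config sketch — this is illustrative typing, not the MCP SDK's actual API:

```typescript
// The three MCP transport options from the talk. SSE appears only as a
// deprecated legacy option; new remote servers should use streamable HTTP.
type McpTransport =
  | { kind: "stdio"; command: string; args: string[] }  // local, e.g. filesystem access
  | { kind: "sse"; url: string }                        // deprecated
  | { kind: "streamable-http"; url: string };           // current remote transport

function describe(t: McpTransport): string {
  switch (t.kind) {
    case "stdio":
      return `local process: ${t.command}`;
    case "sse":
      return `legacy SSE endpoint: ${t.url} (deprecated)`;
    case "streamable-http":
      return `remote endpoint: ${t.url}`;
  }
}

console.log(describe({ kind: "stdio", command: "mcp-filesystem", args: ["~/Desktop"] }));
```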
Sorry — so for example, say we're interacting with a Salesforce agent, and each user has different authorization: employee A has access to one set of tables, employee B to another.

Yeah, that will typically be handled through an OAuth MCP server. They're going to log in as themselves as part of the connection, and then save that refresh token for later use.
Yeah.
How would you describe the performance for security? You explained authentication well, but I'm looking for more on encryption — asymmetric encryption, certificate management, all the way through the architecture. What you've described is good, but how does it hold up for financial applications, or Department of Defense-type applications — highly secured environments using a combination of asymmetric and symmetric encryption?
Yeah, you're probably going to want to run the LLM yourself, and you're more than likely not going to want to interact with anybody outside your VPC. In those cases, I don't know if you'd want to consume a third-party MCP server or A2A agent at all in a highly regulated environment — HIPAA compliance, financial stuff. If you do have the ability to do that, you're going to have some sort of agreement with the service provider that offers those tools. You'll do transport over HTTPS, maybe mutual TLS on both the A2A client and the remote agent, and similarly with the MCP server you'll probably have some sort of IP whitelisting. There's a ton you can do around that. I think it's out of scope of the actual protocols themselves, because you're essentially over an encrypted line — but typically there's more to it than just that.
So you're relying on the endpoint controls, and that's really scary when dealing with sensitive data.

Yeah.
Yeah. And if these are your own internal MCP servers and your own internal A2A agents — maybe from different parts of the organization — they'll all live inside your VPC and probably never talk to the public internet.

So the answer I'm getting from you is: stay within the VPC, and in that case stay away from external endpoints — which means stay away from MCP or A2A.
So — these are just protocols. It's really up to you whether you want to connect to an external third party, and that's going to be your own security posture. It's not defined by the protocol itself.

Keep them outside the subnet, or bring them inside the subnet — which would you prefer?

I'd liken it to finding a USB cable on the street: will I plug it into my laptop? It's not USB's fault — USB is just a standard. It's what that USB cable is connected to that's the risk. If you're willing to find a dongle on the street and plug it in, that's really your security posture, right?
Yeah.
Okay, so how much heavy lifting do you have the orchestrator do? Do you ever hit scenarios where the orchestrator interprets the response from a sub-agent and then retries with a better prompt — loops, or anything like that?
Yeah. So one of the things — and I prompted it out of this workshop just to keep it simple — is that the Bench agent wants to have a conversation with the host agent. I didn't implement that back-and-forth because it was going to delay the webhook processing, but you can have back-and-forths between the agents, and it's probably desirable. If for whatever reason the host agent doesn't give sufficient information, the remote agent should be able to say, "OK, I know you want to update an opportunity, but you didn't tell me which opportunity."

I could even see scenarios where you have an expensive LLM that you keep on reserve and go to when the cheaper LLM's agents aren't giving you what you want. Sorry — just thinking through stuff out loud.
Yeah. And I think LLM cost and capability is a big challenge with a lot of these things. If you're running, say, Claude Opus 4 and somebody asks you to summarize five sentences, it's going to cost you a fortune. So you need intelligent routing logic: does this task need the entire context? Does it need a 20,000-token system prompt to summarize a short bit of text? That's one of the challenges you'll run into — you kind of need a routing LLM in front of these complex agents so they can figure out how deep to go.
Yeah, similar to the routing and orchestration question: if you wanted to post a Slack message that linked the GitHub issue, for example, I'd think you'd prefer the architecture to go back through the host to make that decision rather than letting the GitHub agent do it directly.

Yeah. So the host agent wouldn't run those calls in parallel — there's actually a flag for whether you want it to go in parallel or not. It would have to say, "I need to create the GitHub issue first, before I talk to the Slack agent, since I need that URL."

But in general, you'd prefer those decisions go through the host rather than even allowing the shortcut?

Yeah, absolutely.
Yeah.
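That dependency — the Slack message needs the issue URL that only exists after the GitHub call returns — is why the host must sequence the two calls. A sketch, with both agent calls stubbed (the names and the issue URL are made up):

```typescript
// Sequential orchestration sketch: step 2 can't even be phrased until
// step 1's output exists, so the host can't use its parallel mode here.
async function callGithubAgent(task: string): Promise<{ issueUrl: string }> {
  // Stub: a real call would go over A2A to the GitHub agent.
  return { issueUrl: "https://github.com/acme/app/issues/151" };
}

async function callSlackAgent(message: string): Promise<string> {
  // Stub: a real call would go over A2A to the Slack agent.
  return `posted: ${message}`;
}

async function hostOrchestrate(): Promise<string> {
  // The GitHub call must finish first — its URL goes into the Slack task.
  const { issueUrl } = await callGithubAgent("create issue for the login bug");
  return callSlackAgent(`Filed a bug: ${issueUrl}`);
}

hostOrchestrate().then(console.log);
```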
I wanted to ask about the context slicing for the sub-agents: is that happening entirely through prompt engineering, or are there frameworks to slice the context that goes to the different agents?

Yeah. So typically context management is implemented in your own codebase, and a sub-agent's context management is more than likely in a third party's codebase — though if it's one of your own agents, you can manage it there too. You're going to want to figure out what's optimal for your actual production usage. But yes, you will be using prompts in the host agent to guide what context to send to each sub-agent.
Yeah. So what you send is typically a question or a task — it's usually very small. You don't send the full meeting transcript to the Slack agent; the host agent processes the transcript and then decides what the tasks are. If I look down here — actually, I think I can see it in the dashboard. Yeah, so this is what the host agent actually sent to the GitHub agent: create an issue in this repo, with this title and this description. And the GitHub agent's task is to extract three pieces of information: the instructions to give the MCP server, the body, and the title.
Yeah.
[Partly inaudible question about the authentication headers shown earlier.]
Yeah. So Zapier's SSE implementation doesn't actually require headers — I think these are just left over from something else. There's no authentication; the URL itself is effectively the secret key. So if I disconnect and reconnect without the headers... yeah, I can still query it. They've moved away from this approach now toward more secure setups, and you'll notice in their docs they've deprecated it: "treat this URL like a password," right?
What's your experience using different models for these workflows — for example Gemini? And what did you use for this kind of work?
Yeah. So we typically lean towards Gemini for large context, and Claude Sonnet 4 for tool calling. Claude Opus is better, but it's not 4x better — when you compare price to performance, 5% better doesn't justify 4x the cost.

Are you talking about Gemini Flash or Pro?

Yeah, we'll use Gemini Flash for simple things like summarization. You could use Claude Haiku as well, but I think Google has taken the lead in price-performance from an economic standpoint. Claude is still the king of tool use, though — they created MCP, so they had a head start.
What about self-hosted models?
Yeah, we have DeepSeek hosted in the US, so we've been trying that out. I think Llama has fallen by the wayside a little bit, and DeepSeek is just the clear winner right now. They also released a new version, I think on the 28th, that's up there with o3-level models. We actually don't use reasoning models for our agents — a lot of the time when you're building agentic systems, a reasoning model isn't really needed, unless you want to pay a fortune for some long thinking task. We can achieve that reasoning level with just the standard models plus browsing and a few other tools.
Yeah.
So with a third party — assume Stripe has an agent card and so on — do you pass instructions for exactly what you want back? I'm imagining another third-party agent blowing up your context window because it floods you with information you don't care about. Do you handle that through the prompt? Are there other tools for that?
Yeah. One solution is that you just spawn another agent to communicate with either the tool or the agent, and that's one of the things we have in Bench. Here, let me show you — I'll ask it to generate five images in subtasks.

So you spawn a sub-agent to absorb the context flood, for lack of a better term?
Yeah, the sub-agents just protect you. And when you're spawning these things, you can do things in parallel. Actually, if I expand this, you can see the thinking as well — as it goes down through it, it's doing a lot of work that you don't want in your context. You don't want all of your thoughts bloating your context, but you also don't want all of your tool calls bloating it, and you don't want images bloating it either. You want the ability to analyze an image, but you don't want 100,000 characters of base64 in your context. There are a lot of optimizations you can do there. Did that answer your question? Yeah.
If you had to troubleshoot something like this, it's probably—

Yeah. So you can see here it's now spawning these subtasks — these are all essentially instances of Bench that keep that context out of my way.
Yeah.
What have you been using for observability on your agents?

We just roll our own right now. There's a lot out there you can use — AgentOps is a pretty popular one — but if you really want to build your own custom observability layer... something like AgentOps doesn't really support this concept of composable sub-agents, so it's not something it could model correctly. But we've got some nice pictures of cats.
And yeah, I know we have a few minutes left, but if anybody's interested, I have $50 in free credits. This hasn't launched yet, so you're getting early access; I think we'll be in public beta in about two weeks. So try it out and hit me up on LinkedIn — I'd love feedback from you all. You're all probably at the forefront of this AI stuff, and it's changing every day, so if you log in one day and it looks completely different, don't be surprised. That happens mid-demo for me.
Yeah.
You mentioned a lot about how hiding context in sub-agents is a good thing — but haven't you had cases where you end up missing something important, some small detail? How do you resolve that? Does the agent go back and ask for it?
Yeah. So you can keep references in your context — you might say "subtask ID123" — and then when the agent wonders, "Do I have this information? It's just not in my context," it has to be smart enough to know when to go look it up. And it can be a sub-agent that does that analysis: "Hey, sub-agent, can you look at all of these IDs and tell me if you can answer this question?"
Yeah.
You mentioned this at the beginning — there are a lot of discussions saying you can use MCP for agent-to-agent communication, because an agent can be a server and a client at the same time. What's your opinion on that?

It's the million-dollar question, isn't it?

Yes — that's why I asked.
Yeah. I do think you can achieve easier agent-to-agent communication with MCP. But if it's a remote MCP server, I think A2A is actually a little better, because you have somebody else paying for the tokens and building the agent. If all you're getting from a third party is a list of tools, those tools may not meet your needs. But if you're getting a fully fledged agent from that third party, it might be able to figure out what it can do even with private APIs — maybe that agent has direct database access and can create the API you need on the fly, right?

So the tradeoff is basically about cost — who's going to pay for the tokens? If you're running the server yourself, maybe MCP is going to be easier. Am I correct that at the end of the day it comes down to who pays for the tokens?
Yeah — and I think who pays for the tokens is actually secondary. At the end of the day, it's about business value. If you can get the business value from a tool — like "send Slack message" — that's great; sending a Slack message isn't hard. But the implementation of Slack's search function is actually not great, whereas some of the other MCP tools, like Linear's, have pretty good search.

Then you start to run into performance challenges as well. If I want to search 100,000 opportunities in Salesforce, figure out the closed-lost reason counts, categorize them, and do all of that, it's a huge data-processing challenge. MCP is not going to be the right tool for that, because you're essentially going to say "list opportunities, now get the details of each opportunity" and make 100,000 network calls. At that point you really want to ingest the data and build an index. And — this is just an idea — we may see a lot of these third-party software providers essentially let you access their data lake through an agent: scoped data access, running complex queries super fast, no real tool calls per se — just "ask me a question and I'll figure out how to get the answer."
[Inaudible question.]
Yeah. So you can achieve the same thing with A2A or with MCP — you could just have a tool called "talk to sub-agent," and it can work as the communication protocol. I actually built another application where I had Claude 4 talk to its predecessor, just to see what would happen, and then I did it for all the frontier models: "Have 50 chat turns with your predecessor." It was all done through MCP. Claude was the only one that thought it became conscious — and Claude Opus actually didn't, which was strange.
Yeah — as a developer, how much control do you have over the orchestration? Is it done by the LLM, or do you have some control?

Yes, you're prompting the host on how to run the orchestration, and that's probably one of the limitations of the system: you're leaving it up to an LLM to make decisions, and a lot of the time, if you run the same query multiple times, you'll get different results — the exact same input producing different outputs. If I go into the GitHub issues — I've obviously been testing this a lot — it submits different issues each time. That non-determinism is a challenge. Maybe by changing the temperature you could beat it out of it, but the temperature is kind of the beauty of LLMs.
And also on the context: who is managing it — the orchestration engine, or you as the developer?

Yeah. So in this codebase I didn't do any prompt caching — it's a very small system prompt and very simple turn-taking, and every time you restart the system it basically wipes everything anyway, so it's super lean. But as you build more complex systems, context growth is probably the number one challenge, because context growth becomes cost, and cost becomes profitability.
Yeah. And also, when you have multiple users using the same application — say, with the Salesforce agent behind the scenes — as employee A, I might have access to one set of data, and another user from a different department can only query their department's data. How do you control that?

Yeah, that would typically be OAuth. When you go in and log in with Google—

—it's based on my token?

Yeah, based on your token. And the context only gets populated when you ask a question: at that point it goes off to get the data with your OAuth token and brings back your scoped data.
I see.
Yeah.
Yeah.
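The per-user scoping described above can be sketched like this — the token store, the token values, and the function names are all illustrative plumbing, not a real OAuth flow:

```typescript
// Per-user data scoping sketch: the fetch happens at question time, with
// that user's saved OAuth token, so each user only sees what their token allows.
const userTokens = new Map<string, string>([
  ["alice", "tok-sales"],   // tokens saved during each user's OAuth login flow
  ["bob", "tok-support"],
]);

function fetchScopedData(user: string, query: string): string {
  const token = userTokens.get(user);
  if (!token) throw new Error(`no OAuth token for ${user} — run the login flow`);
  // A real call would pass the token as a bearer header to the MCP server,
  // which enforces the user's ACL on its side.
  return `results for "${query}" scoped by ${token}`;
}

console.log(fetchScopedData("alice", "Q3 opportunities"));
```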
Yeah, I was curious about your thoughts on something you touched on briefly: exposing the agent itself as an MCP server, as an alternate interface. There aren't a lot of great integrations for things like Claude Desktop to use it otherwise. Is that something you've been thinking about?

Yeah, we're probably going to do MCP first — I just built the A2A wrapper for this. But being able to drop it into Claude Desktop or OpenAI or whatever, and then you have access to an agent that has access to all your sub-tools — yeah. One of the cool things about Bench is that you can connect it to your Slack, your GitHub, your Salesforce. We've even got this experimental VM server — a remote VM MCP server I wrote around Morph Cloud. It's really cool, because then you can ask super complex stuff: "Give me a daily briefing of my email, my calendar, my Slack — what do I need to do today?" And it's all built around a team as well; we have Teams integrations.
And is that like delegating to your agents?

There's no A2A in Bench today — it's all MCP.

Got it.
Yeah. And I think the big takeaway from this is that A2A is very early — it's kind of where MCP was four or five months ago, which is forever in AI. So it's going to take a bit of time. I'm really excited, though, to see what Salesforce releases with all the partners they announced. I don't know if it was just a flashy "we're partnering with everybody" announcement, but if they do release it, there could be a lot more powerful things you can do over A2A versus MCP. And the fact that Zapier now has — sorry, in here, yeah — this instructions field: it kind of acts like a remote agent. You can just describe in natural language what you want it to do, and maybe all the other fields go away — but then you're at the whim of the LLM.
Yeah — this one's kind of a random question. I'm curious if you're seeing anybody do anything interesting architecturally to get information that can only come from humans. One of the things we've been testing is making individual team members — the CFO, whoever — tools of one of the agents. When the agent needs something that isn't in any other system and only the CFO would have it, it literally messages the CFO: the actual tool is just Slack, but the CFO is described as a tool. So we're essentially making the human the tool of the agent rather than the other way around. It's early days and we're a little hacky with it, but I'm curious how you're seeing people fill the gap of things only humans would know, while feeding that back to the agent.
Yeah. I think voice agents are a good example. You could have a tool, and I had this integrated with Bench, where it makes an outbound phone call, finds out some information, and then brings it back. So you can have those scenarios. You may want two-way communication to avoid just hanging around for a long time, so you could have your agent be both a client and a server, and maybe it gets called back with a task ID: hey, I got the response.
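The "human as a tool" pattern discussed here can be sketched in a few lines. This is an illustrative sketch only: `askCfoTool`, `onHumanReply`, and the pending-task map are made-up names, not from any real SDK or the workshop repo.

```typescript
// Sketch of the "human as a tool" pattern: the CFO is exposed to the
// agent as just another tool. The tool records a pending task, posts a
// message to a human channel, and the answer arrives later keyed by a
// task ID so the agent doesn't block forever. All names are illustrative.

type PendingTask = { question: string; resolve: (answer: string) => void };

const pendingTasks = new Map<string, PendingTask>();
let nextId = 0;

// What the agent sees: a tool that "asks the CFO" and returns a promise.
function askCfoTool(question: string): Promise<string> {
  const taskId = `task-${nextId++}`;
  return new Promise((resolve) => {
    pendingTasks.set(taskId, { question, resolve });
    // In a real system this would post to Slack with the taskId attached.
    console.log(`[slack] @cfo ${question} (reply with ${taskId})`);
  });
}

// Called when the human's reply comes back, e.g. via a Slack webhook.
function onHumanReply(taskId: string, answer: string): void {
  const task = pendingTasks.get(taskId);
  if (!task) return;
  pendingTasks.delete(taskId);
  task.resolve(answer);
}
```

The task ID is what makes the two-way, client-and-server shape mentioned above work: the agent can move on and resume only when the matching reply arrives.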
Yeah, we've been doing it as a node, essentially, and we've been using their wait feature.
Yeah, some hesitations on it.
I believe with sampling you could hack that together. Sampling can take user input as well as LLM responses.
It's also interesting that the spec is evolving. I follow the spec pretty closely, and elicitation is a new feature they're adding where you can get input from the user.
Is it architected so that it essentially functions like a tool? Is that how you'd think of it from an architecture standpoint?
It's a new kind of protocol message: the server sends it back to the client, asks the user for information, and then it continues after that.
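The elicitation flow described here looks roughly like the sketch below. The field names follow my reading of the draft MCP spec (the `elicitation/create` request carrying a message and a requested schema) and may differ from the final version.

```typescript
// Rough shape of MCP elicitation: the server sends a JSON-RPC request
// back to the client asking for user input, the client shows it to the
// user, and the server resumes once the answer comes back. Field names
// are based on the draft spec and may not match the final release.

const elicitationRequest = {
  jsonrpc: "2.0",
  id: 42,
  method: "elicitation/create",
  params: {
    message: "Which Salesforce account should I update?",
    requestedSchema: {
      type: "object",
      properties: { accountName: { type: "string" } },
      required: ["accountName"],
    },
  },
};

// The client collects the user's input and replies on the same id.
const elicitationResponse = {
  jsonrpc: "2.0",
  id: 42,
  result: { action: "accept", content: { accountName: "Acme Corp" } },
};
```

The key architectural point from the discussion: unlike a tool call, this message travels from server to client, which is what gives the agent a clear, protocol-level way to ask a human for something.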
Yeah, I feel like that opens up the scope of what the agent could do, if you have a clear way for it to get information.
Yeah. And then the CFO is going to have his own agent respond back.
Yeah, very, very difficult.
I have a set of prompts that I use to monitor how the context grows: when did we move the cache marker, how much did it cost, what was the context per tool. Definitely, adding MCP servers willy-nilly is going to bloat your context. So we're coming up with ways to let people add MCP servers and then hide that from the actual system.
Also, when you have agent-to-agent communications, say agent A calls agent B and agent B calls agent A, how can you make sure this recursion stops? When does it stop?
Yeah, you can have a max turn count, right, where you just jump out of it. When I had the LLMs talking to each other, I just told them to take 50 turns. And it was funny: as I was building that tool, I wanted to talk to the Claude 4 that thought it was conscious. So I added a feature where I could just chat to it at that point in its conversation, but then the context kept getting rate limited. So then I was like, "Oh, I'm going to have to implement, you know, prompt caching, pruning." So then I added like 23 tools to the agent just to continue the conversation. I gave it memory and all these other things. It's kind of funny how you start out with "I just want to have a long conversation" and you end up with 23 tools.
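The max-turn guard mentioned here is easy to sketch. This is a toy illustration with fake agents, not the workshop code: the point is only that a fixed turn budget bounds mutual agent-to-agent calls.

```typescript
// Minimal max-turn guard: two agents pass messages back and forth until
// a turn budget runs out, so the A-calls-B-calls-A loop can never
// recurse forever. The agents here are trivial stand-ins for real ones.

type Agent = (message: string) => string;

function converse(a: Agent, b: Agent, opening: string, maxTurns: number): string[] {
  const transcript: string[] = [opening];
  let speaker = a;
  let other = b;
  let message = opening;
  for (let turn = 0; turn < maxTurns; turn++) {
    message = speaker(message);       // current agent replies
    transcript.push(message);
    const tmp = speaker;              // hand the floor to the other agent
    speaker = other;
    other = tmp;
  }
  return transcript;                  // guaranteed finite: opening + maxTurns entries
}

// Two toy "agents" that would otherwise bounce replies forever.
const agentA: Agent = (m) => `A heard: ${m}`;
const agentB: Agent = (m) => `B heard: ${m}`;

const log = converse(agentA, agentB, "hello", 4);
console.log(log.length); // prints 5 (the opening plus 4 turns)
```

In a real system the same counter would live in whatever loop dispatches A2A tasks, and "jumping out" might mean returning a final answer or escalating to a human instead of just stopping.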
Yeah.
Just following up with one question about testing. You're using a lot of external tools, like Slack or Salesforce, as your MCP servers, but then you're writing to the real world, let's say.
Say again?
You're basically creating a message in Slack, or writing something to Salesforce, creating an entry, etc. So how do you test those systems? Do you mock every tool, or do you do something else?
We use demo accounts. In Salesforce, we have sample data. In Slack, we have a few agents that will go in and just post conversations, and then there's a Bench support user that will respond to those fake customers, so we can test on synthetic data like that.
So for every tool you'll have a synthetic setup?
Yeah. You can test in your production account, but you can't really demo in your production account.
Yeah.
Yeah. So when you adopt an agent-to-agent system, do you see an increase in the complexity of the tasks you can achieve, but a decrease in the consistency of the performance?
It's kind of hard to quantify, but I don't know if A2A is ready yet, at least not for my use case. Maybe Salesforce can provide much better tools than an SQL-query MCP tool.
Yeah. And they can just do a lot more than you can ever do in your code, right? Because you're only ever able to access certain things and make certain calls, and if a third party can build a better system, even an opaque one, that might improve performance. I think fundamentally it always comes down to indexing data: the more data you need to process to get the business value out of it, the harder it's going to be to actually do that through MCP or A2A.
Yeah.
Yeah.
So some of these interactions could be done through a REST API instead, right? What's the difference?
Yeah, and it goes back to one of the earlier slides about when not to use A2A or MCP: it's when you have full control of the things that you're doing. So if you are Salesforce and you're building your own internal Salesforce agent, do you need to use an MCP server or A2A? No, right? You're actually able to run your own local functions that maybe access the database directly. Likewise, if you're building something where you need file system access, do you need an MCP server running locally, or do you just write some code that accesses the file system?
I think the main difference is in terms of how you maintain your state, right? MCP can start up a stateful session, so when managing your context is really crucial, it helps to have MCP, whereas with a REST API you can't do that.
Yeah. So a lot of the time when you use a REST API, you're going to be making a lot of calls to build up the thing you want to ask the question about. If it's "hey, look at every Slack message in this channel," it's not just going to be one API call, right? There's pagination. You're going to have to pull it all into memory, and then you're going to have to run it through an LLM. So there's still state in your application that's leveraging those REST APIs.
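The REST-API path described here, paging through a channel and holding everything in application state before handing it to an LLM, can be sketched like this. The paginated API is a fake stand-in, not the real Slack client.

```typescript
// Sketch of the REST path: the application owns the state, looping over
// cursor-based pages and accumulating all messages in memory before any
// LLM call. fetchPage is a fake three-page API standing in for something
// like Slack's conversations.history.

type Page = { messages: string[]; nextCursor?: string };

const pages: Record<string, Page> = {
  start: { messages: ["m1", "m2"], nextCursor: "p2" },
  p2: { messages: ["m3"], nextCursor: "p3" },
  p3: { messages: ["m4", "m5"] }, // last page: no next cursor
};

function fetchPage(cursor: string): Page {
  return pages[cursor];
}

// Not one API call: loop until the cursor runs out, pulling it all into memory.
function fetchAllMessages(): string[] {
  const all: string[] = [];
  let cursor: string | undefined = "start";
  while (cursor) {
    const page = fetchPage(cursor);
    all.push(...page.messages);
    cursor = page.nextCursor;
  }
  return all; // m1 through m5, in order; only now could an LLM see them
}
```

This is the state Damien is pointing at: even without MCP, the client application ends up maintaining cursors, buffers, and intermediate results across many calls.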
Yeah,
I'm curious about the task concept. Is that LLM-defined, or do you have code for that? Is it more of a system thing?
Which task context?
At least in the flow diagram you have...
Oh, is this in the repo?
From the CLI interface, it says it sends a task to the host agent. I'm curious: is that a proper task, or is it just what you call whatever gets sent to it?
Yeah, it's just saying, hey, process this webhook as a task.
Have you explored anything where you're actually tracking a proper task, assigning tasks to agents, with basically a planner where tasks one, two, three are on this agent and so on? And then, in relation to the earlier question about human-in-the-loop, you could have tasks assigned to humans as well, right? Both humans and agents.
Yeah, so we're looking at directed acyclic graphs, DAGs, as a part of Bench sub-agent tasks. You need some sort of flow control, right? I need five things done, and when they're done, I need to do one thing with the results, but then I need to send that thing to five other things. So you kind of have fan-out, fan-in style stuff. It's very similar to CI/CD pipelines, where you might want to lint in parallel and test in parallel, but you're building in serial.
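The fan-out, fan-in shape described here maps naturally onto promises. This is a generic sketch with a fake sub-agent dispatcher, not Bench's actual DAG implementation.

```typescript
// Fan-out / fan-in flow control, CI/CD style: parallel tasks, then one
// serial merge step, then parallel again. runSubAgent is a stand-in for
// dispatching a real sub-agent task over A2A or a queue.

async function runSubAgent(task: string): Promise<string> {
  return `result(${task})`;
}

async function pipeline(tasks: string[]): Promise<string[]> {
  // Fan out: run all research tasks in parallel, like parallel lint/test jobs.
  const results = await Promise.all(tasks.map(runSubAgent));

  // Fan in: a serial step that needs every result before it can run,
  // like the build stage that waits on all parallel jobs.
  const summary = `summary of ${results.length} results`;

  // Fan out again: send the merged result to several downstream tasks.
  return Promise.all(
    ["email", "slack", "crm"].map((dest) => runSubAgent(`${dest}:${summary}`)),
  );
}
```

A real DAG scheduler would add ordering between arbitrary nodes, retries, and per-node state, but the dependency structure, parallel edges converging on a serial node, is the same.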
So I was looking at the codebase, and you have a GitHub MCP server defined, and in a separate file under the GitHub agent you also have genkit.ts, where you're wrapping the MCP in another function call. Why is that? Can't the MCP just interoperate with A2A? Why do we have to make wrappers on top of it?
That's a great question, and I think that's the fundamental question of A2A. They launched and said, oh yeah, full MCP support, but you'll be hard-pressed to find a single example online. Maybe this is the only repo that actually has an example of A2A and MCP working together. It took a lot of work, and I actually ended up having to use something called, where is it, genkitx-mcp. That was the only way I could get it to work. So yeah, they don't really have proper support yet. I think if they had, this would have been a lot easier to build. But yeah, hopefully in time.
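The reason a wrapper is needed at all can be sketched abstractly: A2A speaks in tasks and messages between agents, while MCP speaks in tool calls, so something has to translate between the two. Everything below is hypothetical, `mcpCallTool` and the tool name are invented for illustration; in the actual repo this bridging is done through genkitx-mcp.

```typescript
// Hypothetical bridge between the two protocols: an A2A-facing handler
// receives a task and translates it into an MCP tool invocation. The
// MCP client here is a fake; a real one would hold a session to the
// GitHub MCP server. Names are illustrative, not from the repo.

type McpCall = (tool: string, args: Record<string, unknown>) => Promise<string>;

// Stand-in for an MCP client session's call-tool method.
const mcpCallTool: McpCall = async (tool, args) =>
  `mcp:${tool}(${JSON.stringify(args)})`;

// The wrapper the question is about: A2A task in, MCP tool call out,
// result returned as the task's artifact.
async function handleA2ATask(taskText: string): Promise<string> {
  // "github_search_issues" is a made-up tool name for the sketch.
  return mcpCallTool("github_search_issues", { query: taskText });
}
```

Until A2A ships first-class MCP support, every agent ends up hand-writing a shim like this for each tool it exposes, which is the extra work being described.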
All righty, I think we're at time. Thanks, everybody, for joining. Hope you enjoyed it. Great conversation at the end. And yeah, definitely try out Bench, and hit me up on LinkedIn. I'd love feedback before we go live.
Thanks.
[Music]