Bending a Public MCP Server Without Breaking It — Nimrod Hauser, Baz

Channel: aiDotEngineer

Published at: 2026-04-08

YouTube video id: U00AOI1eJUE

Source: https://www.youtube.com/watch?v=U00AOI1eJUE

Hi everyone, and welcome to our talk
today about bending a public MCP server
without breaking it.
But today we may have just broken it
because our MCP server seems to have
caught on fire. We're glad you're here.
We need all the help we can get.
Let's go through our talk and see how we
can improve whatever is going on right
here. I'm Nimrod Hauser. I work at Baz. We've
been building AI-powered code reviewers
for the past few years now, as well as a
bunch of other features, anything that
can make the lives of people in R&D
easier and better, whether they're
devs, PMs, or anyone else. If it's
agentic, we're probably tinkering with
it.
But let's jump right in and start
looking at what's going on with our MCP
server. I suspect it's the tools.
We're going to talk about third-party
tools and why they might blow up our
applications.
First of all, it's me.
As I said, I'm Nimrod Hauser, a founding
engineer at Baz. I've been with the
company since it was founded in 2023.
I've been in backend and data for the past
20 years or so. In my career, I had
a brief stint at Salesforce, and ever
since, mostly startups:
cyber, crypto, and now developer tools.
Nowadays, I mostly want to talk to you
guys about agentic tools.
All right.
Agentic tools, and specifically
third-party tools.
They can be a great force, a great
addition to our application, but they
don't always work out of the box. We
expect them to make our application
better. Sometimes, we'll see
degradation. And we'll try to understand
why that happens.
After that, we'll explore a framework of
five best practices that we can follow
in order to turn this around and make
our application kick ass. Along the way,
hopefully, we'll fix the busted MCP
server that we just saw, put out that
fire, make it work, and make our agents
behave the way we want them to.
Yep, this is looking kind of bad.
Think we should dive in.
So,
we're going to talk about agentic
tools.
When we use MCP servers, we get tools
coming from the MCP server. But as
long as we're talking about third-party
tools, I don't care if they're coming
from an MCP server, from a library, or
maybe we copy-pasted them from
somewhere else. If they're agentic
tools written by a different
team, they're relevant for this
discussion.
So, what are these tools?
Essentially, tools are just callable
functions wrapped with a nice
description. The description is
important because it lets the agents
know when to use the code and how to use
the code. And we'll dive deep into these
aspects of the description. But again,
it's kind of like glorified integration
code written by a third party. In
today's talk, we're going to take
Playwright's MCP server as an example.
So, essentially, we're looking at
integration code written by the good
people on the Playwright team, wrapped
with their descriptions.
And we'll see how we can make these
tools work better, kind of tailor them
for our use case.
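To make the idea concrete, here is a minimal, framework-free sketch of what a tool amounts to. The `Tool` class and the `press_key` example are illustrative stand-ins, not Playwright's actual code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """A tool: a callable function wrapped with a name and a description.

    The description is the important part: it's what tells the agent
    *when* to use the code and *how* to use it.
    """
    name: str
    description: str
    func: Callable[..., str]

    def invoke(self, **kwargs) -> str:
        return self.func(**kwargs)

# A toy stand-in for one of the browser-manipulation tools.
def _press_key(key: str) -> str:
    return f"pressed {key}"

press_key = Tool(
    name="browser_press_key",
    description="Press a key on the keyboard.",  # shallow and generic, as shipped
    func=_press_key,
)

print(press_key.invoke(key="Enter"))  # -> pressed Enter
```

The agent never sees `_press_key`'s source; it only sees the name and description, which is why tailoring that description matters so much later in the talk.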
Yeah, so
third-party tools have their challenges.
First and foremost, they might cause
our agents to behave unexpectedly. You
know, agents are already
non-deterministic, unpredictable things.
You give them tools, and you get
unpredictability at scale.
But also, they can just degrade
performance. You might want the agent to
do a certain thing, and you get subpar
results, wrong results, or maybe it just
does it, but in a way that's not
optimal.
And through today's best practices,
hopefully, we can see how we can make
the implementation of third-party tools
in our agentic workflows much, much
better. Last but not least, that bad
performance, that unexpected behavior,
can also mean full-blown security
issues. Just imagine a pretty classic
scenario, like a multi-tenant
architecture, where your agent might not
know all there is to know about your
architecture and the division into
folders or databases and schemas. It
just doesn't have the proper guardrails,
and it might leak one client's data to
another client, things of that nature.
You really want to guardrail your
agents, and this becomes even more
important when dealing with third-party
tools that are not aware of your
architecture. So, we'll cover that as
well.
All right. I think we're about ready
to look at a use case.
To look at some code, actually. But
we'll need a use case, and with your
permission, we'll choose one of
ours.
So, today our use case will be Baz's
spec reviewer. What is a spec
reviewer? It's one of our products:
essentially an agentic
reviewer that knows how to compare
requirements with implementation. So, as
a first step, it needs to kind of
collect requirements. It will go to your
ticketing systems like your Jiras or
Linear or anything of that nature, and
read a ticket.
And it can also go to Figma and look at
visual designs in a kind of multimodal
way of operation. It will actually see
the design that is intended, and that
part is the requirements. Once it
understands what a developer was tasked
with, that's when it will spin up
Playwright's MCP server to actually open
up a browser, go into your system, check
the branch, see the implementation, and
assess whether the implementation meets
the requirement. It will give us a kind
of verdict. It will take a snapshot as
evidence of whether this was fulfilled
or wasn't fulfilled, and it does all
this automatically and can save people,
mostly PMs, a lot of time doing menial
validation work. So, we've built a toy
example of our spec reviewer, and we're
going to see how we handle the tools to
get the most out of it.
I hope this makes sense.
At a high level, I think it's time to
look at some code, and hopefully,
everything will be much, much clearer.
All right. So,
we have a toy example of our spec
reviewer. We'll go through it kind of
quickly.
We don't need to dive into every aspect
of it. It's a pretty small project.
And we'll see what's going on and
focus on the parts that we care about.
So, we start here with our main
function.
And we have a directory where we want to
save snapshots. We'll get to that
later. But right off the bat, we have
our MCP server configuration. We have
just the one. We're using only
Playwright's MCP server. This is
pretty standard.
So, we have the one MCP.
As we go into our main function,
you can see that we're
defining our MCP client.
And we're going to use it in just a
little bit. We'll put it in our agent,
but I want to focus on this. This is
where the magic of this talk happens. We
have built a base class
for getting the tools. And all it does
is have one function called get tools.
As we go through the talk, we will
go in increasing complexity and improve
the way we handle the tools that are
coming from our third-party MCP server.
So, here it is.
We're starting with a baseline. We'll
look at it in just a second. And as we
start our session, this is where this
inheritance is going to take place.
Every time we run this, get tools will
do something a little bit more advanced.
So, we want to start our flow. We have
this function called login to Baz,
because for this talk, our example is
going to be logging into our system, and
we will talk about why we need this
towards the end. There's actually an
interesting point here.
We will define an LLM. We'll create an
agent. We will give it a system message
and a human message to kick it off.
These are the messages it's going to
get. And we will invoke it.
You're probably wondering,
maybe you want to see a little bit more
under the hood, maybe look at the
prompts.
So, this should be relatively
straightforward, you know.
System prompt, this is mostly AI
generated, saying things like you are a
meticulous QA agent. You need to review
requirements from the ticket, as well as
visual verification. Everything we
talked about at high level is right
here. Some guidelines: first read the
ticket, understand it, navigate through
the system, and then at the end, like
we said, it needs to give us a pass or
fail verdict, specific observations, and
reference everything with a screenshot
for evidence. The human prompt is very
similar. It does have a multimodal
aspect to it, where we take images and
we embed them in the human prompt. But
these days, it's very straightforward,
and any coding agent can just whip that
out for you if you need it. Speaking of
images, we have two images here.
We have a ticket that we took a snapshot
of. Our real product doesn't take
tickets as snapshots. We were just lazy.
But the agent can definitely read
this and understand the requirement.
There is an accompanying design,
which is this one. So,
the ticket states that we want to have a
configuration drawer for our spec
reviewer in our system in Baz. It
explains how it should look,
and a design is given. So, the agent
should understand that it's looking for
a drawer
inside our agents tab for spec reviewer,
and it should look roughly like this.
Amazing.
I think it's about time we just fire
this up, and hopefully it will make
everything so much clearer.
We have a breakpoint here right after we
get the tools. Almost forgot: our first
run is going to be with this V0, the
benchmark. What is our benchmark? If we
go to our get tools, we see that what we
do for V0 is classic out of the box. We
just use LangChain's load MCP tools
method. That is it. For the first
round, we're not tinkering with the tools
at all. Let's see how it behaves vanilla.
All right.
Okay. So, this is starting up, and we
have our tools.
Let's see what we have here. So, right
off the bat, the good people at
Playwright have given us 21 tools,
everything that has to do with
manipulating the browser: browser close,
browser resize, console messages, handle
dialog, file upload, fill form,
install, press key, and so on. And then
we can look at the descriptions. What is
the description for a tool called press
key? Press a key on the keyboard. What
is the description for something like
resize? Resize the browser window.
Browser close? Close the page. These
seem very shallow and very generic, but
we don't blame them. The people at
Playwright don't know what our specific
use case is. This MCP server will need
to cater to
I don't know how many different use
cases. It has to be generic. But for
us, and we'll see this going forward,
we might want to put in our own
descriptions that are really tailored
to our use case. But we're not there
yet. We're still at the baseline.
So, let's just continue. And we will see
that this is running.
Okay.
So,
Playwright is running. It's spinning up
a browser.
And now it's going to log in.
And once it's logged in, the agent is
going to
take over and start
running according to the prompt.
And there, it's off to the races. It's
logged in. This is our home page, which
is the changes screen. And now it's
going to need to find the relevant page,
which is the agents tab.
So, it's going to need to explore the
system a little bit.
And it might work, it might not work.
Remember, the tools are not optimized at
this point. And it's done. Let's see how
it did. So, looking at the results, it
tells me that the requirement is not
implemented; the status is
a failed verdict. It gives me an
observation, and it tells me that the
requirement is not met because
it couldn't navigate to a seemingly
made-up page called baz.co
/spec-reviewer. This might be a
hallucination, a lapse in judgment on
the agent's part, a bunch of other
things. And it gives as evidence a
404 screenshot, which it probably took
and we can probably check out in our
screenshots folder. It didn't even
manage to take the screenshot properly.
So, a lot of things went wrong, and this
is actually a great outcome for the
beginning of a talk whose whole concept
is optimizing our use of agentic tools.
So, let's see what we can do to improve
our tools, and we'll run this again and
see if we can turn this upside down.
All right. Cool.
Our MCP server is already starting to
look a little bit better. The fire is
put out. It's just a spark now. And
this is probably because we've gone
through some code. We're starting to
understand the problem, but we still
need to actually start implementing
our improvements and see what can be
done to really make the system
better.
So, time to introduce the five concepts
we're going to go over. We're going
to look at how we can curate third-party
tools;
wrap third-party tools with our own
descriptions and perhaps some additional
things;
add deterministic guardrails whenever
we feel it's necessary, and we'll give
an example; create new tools out of the
existing tools, actually using the
existing tools as building blocks. And
lastly, there's always the option to
treat tools as simple functions, just
calling them, using them as that
integration code we spoke about, written
for us by the good people on the
Playwright team.
You know, taking some parts of the
workflow outside of the agentic flow
whenever we feel it's necessary. We'll
talk about this towards the end.
So, it's also a tool in our arsenal.
I did kind of split these into two
buckets. One is more in the realm of
context engineering, the other
deterministic guardrails. It doesn't
really matter. At the end of the day,
whatever gets our application to work
as we want it, that's what we need to
use.
So, now we'll go over them one by one,
looking at code, see how we can improve
our toy example that we just saw.
Starting with our first point, curating
third-party tools.
All right. Let's see how this one looks.
All right, we're back here at our
familiar project. And through the magic
of video editing, we have now imported
V1. It used to be V0, original; it's now
V1, curated.
The only difference, as we've seen,
is that now we have this V1, and
as we said, it's that class that
inherits from the base class. It used to
have just get tools, vanilla, using
LangChain's function. Now, we can see
what we have implemented here.
So, we go in, and we used to return
this, right? But now, we have this big
list of all the tool names that we get
from our Playwright MCP.
And this small list, this is pretty, you
know, standard stuff in Python.
list comprehension. So, we just created
this list of tools that we want to
exclude.
We just went over them, and we know
the tools. We've been using this MCP
server for a while, and we decided that
for our use case, we might not need
resizing the browser. We don't want our
agent to drag things. We don't want it
to run code inside the browser on its
own. These are just not things that our
spec reviewer needs to do as part of its
operations. Maybe for your use case,
this is needed, but for ours, not so
much. So, all we do is get all the
tools, and instead of just returning
them, we simply exclude the ones that we
don't want. So, there are a bunch here,
six here, that we're going to simply not
use. We fire this up.
We have our breakpoint, and instead of
the 21 tools we used to have, I
expect to see fewer. And so, we have 16.
Amazing. So,
this means our context window already
has fewer tools in it. Our agent has less
to choose from. So, everything might
become simpler.
We'll see that not all the guidelines
we're going to go through will
necessarily reduce stuff from the context
window. Some will actually add to it,
but this is all part of the trade-off,
the juggling act that we're going to
talk about.
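The curation step itself really is just that one list comprehension. A minimal sketch, with illustrative tool names (the real exclusion list in the talk has six entries):

```python
# Tool names as exposed by a browser-automation MCP server (abridged, illustrative).
all_tool_names = [
    "browser_navigate", "browser_click", "browser_snapshot",
    "browser_take_screenshot", "browser_resize", "browser_drag",
    "browser_evaluate",
]

# Things our spec reviewer never needs to do: resize the browser,
# drag elements, or run arbitrary code inside the browser.
TOOLS_TO_EXCLUDE = {"browser_resize", "browser_drag", "browser_evaluate"}

# Instead of returning everything, exclude the ones we don't want.
curated = [name for name in all_tool_names if name not in TOOLS_TO_EXCLUDE]

print(curated)
# -> ['browser_navigate', 'browser_click', 'browser_snapshot', 'browser_take_screenshot']
```

In the real project the items are tool objects rather than bare names, so the comprehension would filter on `tool.name`, but the shape is identical: fewer tools in the list means fewer tools in the context window.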
Moving on to our next point,
the practice of wrapping third-party
tools. This one is amazing. We talked
about how the descriptions, specifically
those coming from the Playwright MCP,
are super shallow and very generic, and
that it's totally understandable,
because they need to cater to every
possible use case in the world that
might want to use the browser. But if
you really want to optimize, you might
want to start tailoring stuff for your
own use case. Let's see how this happens.
Okay, this is becoming familiar
territory by now. And as always, through
the magic of video editing, we have V2
imported, wrapped. So, we're wrapping
tools this time. Going down, we see that
we're calling the V2
class, which will implement get tools,
and we'll see what's going on here. If I
go to V2 wrapped, I see that, as
before, we get all the tools, but now we
have this new class called tool wrapper,
which has a method that we're calling
wrap Playwright tools. Let's see what's
going on here.
As before, we still have this list of
all the tool names. We'll do the
filtering a little bit further, but
instead of just the tool names, we also
have all these descriptions. And so, for
every tool, we want to specify what
needs to happen. And from experience, we
have our own kind of little emphasis
that we want to give our agent. We might
tell it, you know, before calling the
browser tool, first call this other
tool. One tool we found to be
especially helpful has kind of a
misleading name. It's called the
snapshot tool. It's actually not a
visual snapshot; it's the accessibility
snapshot, which kind of shows you all
the different buttons and all the
different menu items in text. And we
feel that the agent really gets a good
understanding of what is on a page when
it calls that tool. So, we tell it for a
bunch of tools: before calling hover,
before calling click, please use this
tool first. So, we can really affect its
behavior. We can make it more eager to
choose one tool over the other. We can
do a bunch of things.
For instance, this is the tool I just
talked about, the accessibility
snapshot. We will tell it: always prefer
this over taking an actual screenshot,
which is this tool. So, you can really
give a lot of guidance from your own
experience for your own particular use
case. And this is very, very powerful.
In here, we have this dictionary, which
just maps tool names to their new,
enhanced descriptions. Still, we have
our tools to filter. At the end, we have
the function we called wrap Playwright
tools, and it just goes through all the
tools that we get from Playwright out of
the box. We filter what needs to be
filtered. And for the other tools, we
get our enhanced description based on
the tool name, create the tool, and
append it to the list of wrapped tools.
So, we get enhanced tools.
What is this method that creates an
enhanced tool? Well,
it's a method that gets the original
tool and the enhanced description,
creates a new tool,
and returns it.
And so, what does this amazing new tool
do?
Exactly what the old tool did. It just
invokes the original tool.
It just has an enhanced description.
So, going back to main, we run this,
and we still have our breakpoint,
we can see that we still have fewer tools,
like we wanted, from before; even
fewer, we filtered a bunch more. But when
we look at the descriptions,
you see that they're much longer, and
they are the ones that we
wanted. For example, here is the tool we
spoke about, browser snapshot: capture
an accessibility snapshot of the current
page, yada yada yada, all the things we
said. If we look at another one, browser
click, here's our guideline: first
call the other tool, and then call this
one. Now, our agent knows how we want it
to behave.
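A sketch of the wrapping pattern, using a plain dataclass as a stand-in for the framework's tool objects. The descriptions here are illustrative paraphrases of the guidance described above, not the project's actual text:

```python
from dataclasses import dataclass, replace
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    func: Callable[..., str]

# Our use-case-specific descriptions, keyed by tool name (illustrative).
ENHANCED_DESCRIPTIONS = {
    "browser_click": (
        "Click an element on the page. Before calling this, first call "
        "browser_snapshot to get the accessibility snapshot of the page."
    ),
    "browser_snapshot": (
        "Capture an accessibility snapshot of the current page: all buttons "
        "and menu items, as text. Always prefer this over "
        "browser_take_screenshot when exploring the page."
    ),
}

def wrap_tool(original: Tool) -> Tool:
    """Return a tool that does exactly what the old one did,
    just with an enhanced description."""
    enhanced = ENHANCED_DESCRIPTIONS.get(original.name, original.description)
    # Same callable, new description: the agent reads our guidance,
    # but invocation still goes straight to the original function.
    return replace(original, description=enhanced)

click = Tool("browser_click", "Perform click on a web page.", lambda **kw: "clicked")
wrapped = wrap_tool(click)
print(wrapped.func is click.func)                            # True: behavior unchanged
print(wrapped.description.startswith("Click an element"))    # True: our description
```

Tools whose names are not in the dictionary keep their original description, which is why the filtering and wrapping steps compose cleanly.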
All right, on to the next one.
First of all, our MCP server, I don't
know if you can notice, but things are
looking even better. Some of the
interfaces seem to work, lights
blinking, things firing, but we're still
far from the home stretch.
We'll move on to point number three and
keep making this better.
Now, we're moving into the realm of
deterministic guardrails. And this is
putting in deterministic guardrails,
taking control of sensitive or
mission-critical aspects of our tasks
with deterministic logic that is not up
to agentic decision-making.
Sometimes, there are
aspects of your tasks that are just too
sensitive to leave in the hands of the
agents. We talked before about scenarios
like multi-tenant architectures,
and scenarios where the agent might not
be fully aware of your architecture,
things of that nature. And of course,
you need to specify everything you can
in the tool descriptions and the
prompts, but sometimes you really want
to enforce that it is not doing anything
funky. You know, agents are
non-deterministic things, and sometimes
they will ignore you. We know of all
these phenomena, such as needle in the
haystack and lost in the middle, and a
lot of instances where agents will just
not work as you intended them to.
This is where you want to put some
deterministic enforcement. We did this
around the tool that takes actual visual
snapshots. Not the accessibility
snapshot we talked about before, but the
actual visual snapshots.
We had a folder that we defined, and we
said, "This is the output folder. This
is where we want you to put images."
But there is a possibility that the
agent will go rogue and just store
images in other places. So, that's where
we want to draw the line and make sure
this never happens.
Okay.
So, as always, we have V3 now, which is
the one we want to look at. So, going
back to main, we see we have this here,
V3 guardrails. We dive in and we see
that we have again our wrapped
Playwright tools. Obviously, this time
it's going to do something a little bit
different, as we increment every time.
So, we still have the names, we still
have the descriptions, and we go down,
down, down. By the way, we can
already see that apart from the V3,
which we always have, which looks like
this, and the tool wrapper, which we had
before, we now have another class called
path validation. Let's see where we use
it. So, we're going down:
past the dictionary, past
the tools to filter. We're in that
method we're always
importing, wrap Playwright tools. And
wrap Playwright tools, as before,
goes over all the original tools we got
from our Playwright MCP,
filters what needs to be filtered,
gives the enhanced description for each
of the tools if we find one, and then, as
before, we have the same helper function,
create Playwright tool wrapper, that
function that takes a tool, gives it the
enhanced description, and creates a new
tool out of it. Same functionality, new
description. Let's see what's changed
now.
So, when we want to create the new tool,
right? So, this is the tool we're
creating.
As we said, a tool is just a callable
function with some description. We give
the new description,
and before, we had this part, because we
said, "What does the new tool do?
Exactly what the old one did." But we
added this part. We're saying: if the
tool that is now being activated is the
take screenshot tool (and we've
researched the tool, and we know that
under the hood it uses either
the path or the file name as keywords,
at least the keywords relevant for us),
then find these keywords
and validate them. And we have some
helper functions. This path validation,
these are just helper functions. We
don't need to go too much into
them, but they're just deterministic
logic where we take the path that the
tool chose, and we take our path where
we want to enforce things being stored.
We call it the screenshots root.
And we just use this method.
We want to know if the chosen path is
relative to the screenshots path. So, it
is a deterministic way: once we
understand where the tool intends to
store the image, we know if it is
inside the right folder.
So, going back to our helper
function: every time the tool is
invoked, before it actually gets
invoked, we extract the path
it intends to save to.
We try to validate it, and if it is not a
valid path, it will not reach
invocation. It will raise an error.
So, this is a deterministic way where we
can stop the tool right in its tracks
with a deterministic guardrail. Now,
another nice thing to note is that we
raise an exception; we don't return an
exception. We don't want
the whole agentic process to fail.
Instead, we handle this nicely by
creating a very nice agent-facing
explanation. And what the agent will get
back is this nice message saying,
"Listen, access is denied. You can't
save it there. You need to save it here.
Please provide a proper file name and
proper path." An agent that gets this
message is most likely to just try
again, but aligned. This is our way of
aligning it, and it will try again, give
a correct path, manage to traverse
this, get here, and save the image. So,
that's exactly what we want to happen in
these very mission-sensitive,
security-related scenarios. So, that is
amazing.
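A minimal sketch of this guardrail, assuming the screenshot tool accepts a file name; the root directory and the wording of the agent-facing message are illustrative:

```python
from pathlib import Path

# The one folder we allow screenshots to be written to.
SCREENSHOTS_ROOT = Path("/tmp/spec_reviewer/screenshots")

def validate_screenshot_path(chosen: str) -> Path:
    """Deterministic check: the chosen path must resolve inside SCREENSHOTS_ROOT.

    Raises ValueError with an agent-facing message rather than returning
    an error, so a rogue save never reaches the real tool invocation.
    """
    candidate = Path(chosen)
    resolved = (candidate if candidate.is_absolute()
                else SCREENSHOTS_ROOT / candidate).resolve()
    if not resolved.is_relative_to(SCREENSHOTS_ROOT.resolve()):
        raise ValueError(
            f"Access denied: you cannot save to '{chosen}'. Save screenshots "
            f"under '{SCREENSHOTS_ROOT}' and provide a proper file name."
        )
    return resolved

def guarded_take_screenshot(filename: str) -> str:
    # Validate BEFORE invoking; the return below stands in for the
    # original third-party tool call.
    path = validate_screenshot_path(filename)
    return f"screenshot saved to {path}"

print(guarded_take_screenshot("BAZ-42_drawer.png"))      # allowed
try:
    guarded_take_screenshot("../../etc/leak.png")        # escapes the root
except ValueError as err:
    print(f"blocked: {err}")
```

An agent that receives the denial message will most often just retry with a corrected path, which is exactly the realignment behavior described above.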
Okay.
Personally, I love this point. I think
it is so common and a tool everybody
must have in their arsenal. But it is
time to go forward. We have two more.
This might be a little bit more niche,
more on the advanced side, but it is
worth checking out, knowing that
it is a possibility. This one is about
composing new tools from existing tools.
And the place where we did that
was interesting. It also had to do
with the same take screenshot tool.
We felt that sometimes we want to take
screenshots in general; it's kind of a
generic thing.
But when we take screenshots
in the context of evidence at the end
of the flow,
we felt that maybe having a separate
tool for that was in order. Why?
We can create a new tool
whose functionality is essentially
pretty much the same as take regular
screenshot, but because it will have a
separate description,
the agent can choose either this or
that. And we can also tell it to behave
slightly differently in each case. And
we can also give it some additional
actions to do, maybe even
deterministic actions, just before
invoking the original tool.
So, we created a new tool called the
evidence tool. And let's take a look
at what we did there.
All right. This time around, we have V4
imported as we would expect by this
point. So, we can take a look at V4, see
how it implements get tools. We go in,
this opens up. This looks very similar
to before. Actually, everything looks
similar. We have tool wrapper, we have
path validation, very very similar.
What's the difference? Well, we inject a
new tool.
So, inside tool wrapper, we have what
we've come to expect by now: the tool
names, the tool descriptions, the
dictionary.
We have our tools to filter.
Everything's the same. And then we have
the same old wrap Playwright tools,
which iterates over all the tools,
filters what needs to be filtered, and
creates any tools that we need to create
with enhanced descriptions. So, this
is what we used to have. We finish the
loop, and then we have this part,
injecting a brand new tool. We
chose, and you don't have to go the same
route, but we said: because this is
building on the screenshot tool, only
create this new tool if the
screenshot tool has not been filtered
out. Totally optional, just what we
chose to do in this particular use case.
And so, we have our new
description for this new tool,
and we create a new tool. Let's take a
look at the description.
So,
it's a description as you might expect
from a description of this kind of tool.
Take a screenshot specifically for the
purposes of evidence. And because the
prompt, when it describes the flow,
tells it that this ends with taking
snapshots for evidence, the agent will
probably know to choose this tool over
the regular screenshot taking tool.
We tell it to use this only when
capturing things for evidence. And we
also specify how to go about collecting
evidence. We say, for example, that when
it creates an image and wants to
store it, we want it to identify the
relevant ticket and include the
ticket number in the file name being
saved. And so, you can see an example of
how we're going to have two tools. The
agent will know to differentiate whether
to use this one or the other
one, specifically in the context of
evidence taking. And when it chooses
that tool in the context of evidence
taking, this will cause it to
have different considerations when
choosing the image name, for example.
And you can do a bunch of other things.
You can add guardrails or deterministic
actions that are specific to this tool,
for example. So, really, the sky's the
limit, the world is your oyster, and
knock yourselves out.
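A sketch of the composition pattern: the new evidence tool delegates to the same underlying function as the screenshot tool, but carries its own name, its own description, and a deterministic extra step. The names and the ticket-prefix rule are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    func: Callable[..., str]

# Stand-in for the third-party screenshot tool we're building on.
def take_screenshot(filename: str) -> str:
    return f"saved {filename}"

screenshot_tool = Tool(
    name="browser_take_screenshot",
    description="Take a screenshot of the current page.",
    func=take_screenshot,
)

def make_evidence_tool(base: Tool) -> Tool:
    """Compose a new tool out of an existing one, used as a building block."""
    def capture_evidence(filename: str, ticket_id: str) -> str:
        # Deterministic action before delegating: make sure the ticket
        # number ends up in the file name being saved.
        return base.func(filename=f"{ticket_id}_{filename}")

    return Tool(
        name="capture_evidence",
        description=(
            "Take a screenshot specifically as evidence for a verdict. Use "
            "this ONLY when capturing evidence at the end of the flow; for "
            "general exploration, use browser_take_screenshot instead."
        ),
        func=capture_evidence,
    )

evidence_tool = make_evidence_tool(screenshot_tool)
print(evidence_tool.func(filename="drawer.png", ticket_id="BAZ-42"))
# -> saved BAZ-42_drawer.png
```

Because the two tools have distinct descriptions, the agent can pick the right one from the prompt's wording alone; the evidence variant just adds its deterministic file-naming step before delegating.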
So, that was this one.
Moving on to our fifth and final point.
Look at this. Our server: all the fires
are out. The system is booting.
And still, we have one final point,
which we actually touched on at
the beginning of the talk. This is
treating tools as deterministic,
callable functions.
Sometimes, we get all this wonderful
code from the people on these
third-party teams. In our case, the
Playwright team gave us all
this really nice code, and we can just
call this code
outside of the agentic flow. Just
call it. Just disregard the description
and just use the function they gave us.
Where do we use that? I don't know if
you remember, but when we first looked
at the code, we had this function called
login to Baz, and we said we'd get
back to that. This is us
kind of closing that loop.
Let's take a look.
Okay.
Back to the top of our main.py file. We
go down, down, down. Here is the
imported V4, but now we're interested in
something else, which is this part. When
we just start out, we define the
MCP client, we define a class that
implements get tools, we get our tools,
and the first thing we do is this
deterministic function called login to
Baz. Only after we log in to Baz do we
actually create an agent, give it all
the messages it needs, and invoke it.
So, why is the login deterministic
here? Well, it seems that logging in is
kind of tricky. This is a toy example
again, but in a real product, we need to
log in to our system, and we need to log
in to client systems.
And each client might have a
different login mechanism. And they
can be tricky, and they can have
secrets, and a lot of things can make
this kind of complicated. So, on the one
hand, it's complicated, and on the other
hand, it is an action that we will
always want to take. There is no agentic
flow for the spec reviewer that does not
begin with a login.
And so, we do use tools we
got from the MCP server
in order to achieve this. We're just not
letting the agent try to do it, because
we saw it gave us subpar results.
So, for these very specific, niche
use cases, sometimes you might want to
take matters into your own hands.
Here, if we go into the function,
there's not much going on.
Again, toy example; the real product
behaves very differently, but here I
just hid some JWT tokens as environment
variables, and this function accepts all
the tools from the MCP server. We kind
of pluck the ones we want.
And the gist of it is that we are going
to inject these JWT tokens into the
browser's local storage,
and then when we click the login
button, we will just log in like magic.
So, we just do this deterministically.
Once we have logged in, we take the
reins and give them to the agent. We say,
"You're logged in, you're off to the
races." And that's how we unburden
the agent from this
somewhat clunky action it needs to take,
which we can just take off its hands,
not bothering it, or its context, with
it.
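A sketch of the idea: the same tool functions the agent would use, called directly as plain functions before the agent ever starts. The tool behavior is stubbed out here, and the URL, environment variable name, and selectors are made up for illustration:

```python
import os

# Stand-ins for three tools plucked from the MCP server's tool list.
def browser_navigate(url: str) -> str:
    return f"navigated to {url}"

def browser_evaluate(script: str) -> str:
    return f"ran: {script}"

def browser_click(selector: str) -> str:
    return f"clicked {selector}"

TOOLS = {t.__name__: t for t in (browser_navigate, browser_evaluate, browser_click)}

def login_to_baz(tools: dict) -> list[str]:
    """Deterministic login, outside the agentic flow.

    Every spec-reviewer run starts with a login, so there's no reason
    to spend agent turns (or context) on it.
    """
    token = os.environ.get("APP_JWT", "dummy-token")  # secret never enters a prompt
    return [
        tools["browser_navigate"]("https://app.example.invalid/login"),
        # Inject the JWT into the browser's local storage...
        tools["browser_evaluate"](f"localStorage.setItem('jwt', '{token}')"),
        # ...then click the login button, and we're in, like magic.
        tools["browser_click"]("#login-button"),
    ]

actions = login_to_baz(TOOLS)   # runs before the agent is even created
print(actions[-1])              # -> clicked #login-button
```

Only after this returns do we hand the reins to the agent; it wakes up already logged in, with none of this in its context.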
Cool.
So, that simplifies things, at least for
us.
Oh, I almost forgot. I still owe you one
last thing.
I think we should fire this thing up now
with all the improvements,
and see where we land. We failed on the
previous run, didn't we?
Okay, let's see how it works.
So, going back to the familiar code
base, our main. And we run this.
We no longer need the breakpoint. I
think I took it out. Yep. And this is
just running.
Okay. So, this is going to start by
doing all that JWT magic in the
background, which we now know how it
works. And let's see how it goes.
I'll take the time to remind you that we
gave it a ticket and a design, and it
needs to see this drawer. That's what
it's actually doing now. Okay. So, it
kind of reloaded. I think maybe the JWT
tokens have kicked in. It's going into
our home page, called changes, and it's
probably going to do a little bit of
reasoning, and it's going to need to
find the way to navigate itself through
the system to the agents tab.
It doesn't always show it. It has a lot
of latency and lag, at least from
our experience, but it does take a
screenshot as evidence. It finished.
Let's see what it did. So, it says that
this passed. It claims that the
configuration drawer is present and
includes the sections that need to be
there. It has a screenshot. The
screenshot has a file name that includes
the ticket, like we said. It has
reasoning. Let's take a look at the
screenshot. So, you see, the image we
gave it was in dark mode, and it was
navigating in light mode, but it took
it, and it looks great. It looks like
it's done, and it's correct. Now, I will
say that it did see a configuration
drawer. It did see everything that's
supposed to be in it. It might not be
pixel perfect, but pixel perfect
verification is something that we worked
very hard on in our real product. So,
outside of this toy example, pixel
perfect does work, because it is super
important for front-end validations,
making sure the padding and the margins
and all the style are just as they need
to be.
Amazing. So, with that done, I think
it's almost goodbye time. We're
starting to get ready to wrap up. Look
at our server. Look at it. It's amazing.
It's beautiful. It's working. Green
lights are blinking. Engines roaring.
GPUs churning.
Millions of spec reviews being executed
at a great acceptance rate.
Thank you for all your help.
Let's just do one final summary
before we each go our separate ways. So,
as we said, we looked at agentic tools
today, whether they come from libraries,
MCP servers, or any other place. We saw
that sometimes they will fail out of the
box. They might be very generic. They
might not be tailored enough for our use
case. And tailoring them is what's going
to make our agents pop. Now, sometimes
we wanted to curate the tools and kind
of ease the load on the context window.
Sometimes we wanted long descriptions
and more verbose tools. So, there's no
one-size-fits-all. It's mainly a
question of how do I mold the tools to
best fit my use case. Sometimes they'll
be deterministic. Sometimes they'll be
flexible. It depends. And you're going
to need to tinker with it, make it your
own, and strive for the best possible
setup you can achieve your goals with. I
hope this has given you a few pointers,
things that you can try out. This has
been amazing and great fun. I'm going to
move up here now because I have a
shameless plug after I thank you for
listening. And so, always feel free to
reach out. I will see you guys in the
next one. Cheers.