Skill Issue: How We Used AI to Make Agents Actually Good at Supabase — Pedro Rodrigues, Supabase

Channel: aiDotEngineer

Published at: 2026-05-04

YouTube video id: GmAQKINjv1E

Source: https://www.youtube.com/watch?v=GmAQKINjv1E

Hello everyone. Is everyone excited for the conference?

>> Awesome. We've got a full house here. I'm very glad to be here, and very honored to be giving one of the opening workshops today. You may have noticed already that the title is slightly different from what's in the schedule. I've basically done a rebrand, but the theme of the workshop remains the same: we went from "Skill Issue" to "Level Up Your Skills". I moved the "Skill Issue" title to the keynote I'm giving tomorrow, so you'll have time to learn more about what that keynote is going to be about. This workshop is basically what I've been doing over the last two months at Supabase: writing our own skills. And tomorrow I'm going to present how we actually put this into production and the lessons we've learned.

For everyone who's been paying close attention: you've probably noticed that I'm running this slide deck on localhost. Some of you have already noticed. This is no coincidence at all; I essentially vibe coded the presentation. So if you see something off, it was not my fault, it was Claude's. If you don't believe me, consider that you'd have to be a real Google Slides guru to have dark mode enabled. Honestly, I like this layout better, so I think we're going with dark mode here. If there are any light mode fans out there, or if the majority of the room is light mode, I'm happy to switch back, but for now let's go with this one.
To introduce myself before starting the workshop: my name is Pedro, I'm from Lisbon, Portugal, and I work at Supabase as an AI tooling engineer. Essentially, my day-to-day is thinking about how we can make Supabase as agent-friendly as possible and improve the agent experience. You've probably heard about developer experience, DX; we're more focused on the same thing, but for agents. In this workshop we're going to talk a bit about skills, because that's essentially how we've been improving the performance of agents around a product like Supabase, or a company like Supabase that has multiple products. The secret sauce has basically been skills. So we're going to dive into how to write one, how to test it manually first, and then how to automate the testing with evaluations. To start with: how many of you have heard about skills?
All right, so almost everyone. Then what I'm going to say is probably no news to you. Skills are basically folders with instructions and files that let you run repeated workflows, give custom information to your agents, or provide a new set of tools in the form of scripts. There's a bit of a misconception about skills: usually the main file, SKILL.md, takes the spotlight, but skills can actually be more than just the main file. The main file is a markdown file named SKILL.md where the main information about the skill lives. It's composed of front matter at the top, which can have multiple fields, but the two required ones are the name, which identifies the skill, and the description, which tells the agent what the skill does.

What skills brought that tools like MCP didn't was this concept of progressive disclosure. Progressive disclosure means that all the information about a subject is not loaded straight into context; instead, you load just the exact amount of information that allows the agent to choose to load the rest once it actually needs it. The SKILL.md file is designed around this. Only the front matter is loaded into the agent's context at first, not the content of the file. It works as an envelope: from the description, the agent knows what the skill does and when it should load the rest, that is, when it should look for the information inside the file. Inside this file you can also reference other files; usually these are either markdown files or scripts (bash, Python, whatever you'd like).
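Put together, a minimal SKILL.md following that structure might look like this (names, wording, and the referenced file are illustrative, loosely based on the skill built later in this workshop):

```markdown
---
name: supabase-security
description: Use when creating views, tables, or RLS policies on a Supabase Postgres database.
---

# Supabase Security

When creating a view over a table that has row level security enabled,
add `with (security_invoker = true)` so the RLS policies still apply.

For more detail, see [references/views.md](references/views.md).
```

Only the `name` and `description` above are loaded into context up front; the body below the front matter is read on demand.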
Starting with the reference files: you usually put them inside a reference folder, and they provide more information. You can think of a skill in this format as a book. The SKILL.md is an index on steroids: besides having links to the other files, which you can think of as the pages or chapters of the book, it can hold custom information of its own and then reference out to the other files. The reference files have nothing special about them; they're basically normal, regular markdown files. You can think of them as similar to the SKILL.md file, except instead of being the main one, they're the ones that got referenced. Funny enough, you can also reference files inside of reference files, so you can basically make a graph out of a skill.

As for scripts: I've actually talked before about how MCP and skills differ from each other, and we're basically comparing apples to oranges when it comes to MCP and skills. One of the misconceptions has probably already been debunked; the debate now is more about MCP versus CLI. But when skills were released, back in October or November last year I think, they basically started this debate: should we use them instead of MCP? If I can provide more information, more context, to the agent without loading every tool into context the way MCP does, and I can also have scripts, so I have actions just like MCP tools, should we use skills instead? And the answer is, honestly, you should use both. If you're building anything that is an integration, you should use MCP. If your agent doesn't have access to bash, you should use MCP to integrate with your service. Skills just provide more context to your agent, and you can define workflows there, everything you don't have space to define in MCP tool descriptions.

Also, regarding the comparison between skill scripts and MCP tools: the main difference is that tools don't need an environment to run. The agent can just call a tool, it knows how to call a tool, especially if the MCP server is remote, and the tool runs on the server side. Scripts, on the other hand, are loaded onto your machine, run in your local environment, and are tied to whatever environment you have. If you're running on Linux, they have to be Linux compatible; same on macOS; and Windows, I'm not even going to get started on that. But essentially, those are the main differences between MCP tools and scripts.

I hope everything is clear. If you have any doubts, feel free to ask; I'm going to have a little demonstration. This workshop is going to be more of a walkthrough than a code-along, but feel free to tag in. I have a GitHub repo prepared, so you'll be able to visit and explore it. If you have any doubts at any moment of the workshop, feel free to interrupt me or raise your hand.
So, I tested this on a smaller screen and it was working; you can see it was vibe coded. So how do you test your skills? If a skill is just markdown, how do you test markdown files? Testing a piece of code is already straightforward: you have all sorts of test types, unit tests, integration tests, or testing the whole flow, which we call end-to-end testing. When you're testing a markdown file, you can basically do exactly the same, and be as granular as you want. But since we have an LLM in the loop, you'll usually use something called evaluations. For those of you who haven't heard about evaluations, or evals for short, they're essentially a way of testing the nondeterministic output or behavior of an agent or a model; you can test both an LLM and an agent with evals. At the end I'm going to present a very simple framework for you to start running your own evals, and I'll dive deeper into evaluations there. But essentially, they're usually made of an input and an expected output, just like a regular test, and in between you can evaluate the steps the agent took, the reasoning, the tools that it called, which is normally more interesting and easier to evaluate than, say, a regex on the exact output, since the output is nondeterministic.
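A minimal sketch of that shape in Python (no particular eval framework assumed; the case structure and grader here are illustrative):

```python
# An eval case pairs an input prompt with expectations about the run.
# Instead of regex-matching the nondeterministic final output, we grade
# a deterministic intermediate signal: which tools the agent called.

def grade_tool_calls(expected_tools, actual_tools):
    """Deterministic grader: did the agent call every expected tool?"""
    missing = [t for t in expected_tools if t not in actual_tools]
    return {"passed": not missing, "missing": missing}

eval_case = {
    "input": "Create a department stats view with headcount and average salary",
    "expected_tools": ["list_tables", "apply_migration"],
}

# A pretend transcript from one agent run; in practice you'd capture this
# from your agent harness or an observability platform.
actual_tools = ["list_tables", "execute_sql", "apply_migration"]

result = grade_tool_calls(eval_case["expected_tools"], actual_tools)
print(result["passed"])  # True: both expected tools were called
```

The same grader works for "should NOT load the skill" scenarios: list the tool (or skill load) you expect to be absent and assert on it.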
There's essentially a framework you can follow to test your skills. This one was proposed by OpenAI in a blog post called "Systematically evaluate agent skills"; I think they released it back in January or February. So not that long ago, but all of this is fairly new, so that's basically prehistory. You start by defining your metrics: what you want to evaluate in your skills. If you're building a skill for your product, for example, what exactly do you want the skill to highlight to your agent? Is it going to forward the agent to the documentation? Are you adding some specific instruction or workflow? Depending on what you want to evaluate, you start this eval-driven development, like test-driven development: you begin by defining the metrics, what exactly "good" means when it comes to the skill. Then you create the skill itself: you write the SKILL.md file, any scripts alongside it, and the reference files if you want; those are all optional, and the only required piece is the SKILL.md file. Then you move on to the testing part: you run the evaluations, or you run them manually.

I recently heard the CEO of Braintrust on a podcast. I don't know how many of you know Braintrust.

Okay, not as many; not as popular as skills. For those of you who don't know it, Braintrust is a platform that allows you to systematically run evals and gives you the full picture of the agent's behavior during the evaluation scenario. I'm trying to think of another platform to compare it with, but this is fairly new, to be honest. You can think of it as an observability tool for checking the behavior of your agents during a specific controlled scenario, which is what evaluations are.

So you move to the testing part. Basically, you run a set of evaluation scenarios, defined by the input, the expected output, and the tools that should be called: how you expect your agent to behave. Then you move to the grading part: how did the agent do? Essentially, this is very similar to a testing cycle, except that instead of a deterministic output you have an LLM in between, so it's nondeterministic. You can still have deterministic parts to evaluate on. And then you iterate and repeat. That's why it's a cycle pretty similar to any of the test development cycles that we have at the moment. All right.
So, jumping straight to what we're going to do during this workshop: we're going to write a skill. I've prepared a little demonstration app, a demo app. It's a performance review application with, I believe, four people: one employee, two managers, and one HR representative. There are some errors on the database side that we're going to find and fix, and we're going to build a skill to guide the agent in fixing them. Then at the end, as I said, I have a framework to automatically test the same scenario that we're going to test manually, using evals. Before moving on to the demonstration: how many of you have heard about or used Supabase?
All right, so almost everyone knows or has used Supabase. I've seen some hands down, so I'll still give you a brief intro. Supabase is essentially a backend as a service. You can think of it as the open source version of... I just can't...

>> Thank you: Firebase. Only "fire" was coming to my mind, sorry. The open source version of Firebase. And if you don't know Firebase, you're probably living under a rock. But essentially, it's a backend as a service; you can use it to build any backend you'd like. Straight out of the box we provide a database for you to just plug into your application, running on Postgres, one of the most popular, if not the most popular, open source database solutions out there. You can easily integrate authentication into your application, storage to save files, and many other things, such as edge functions, which are like Lambda functions for those of you coming from the AWS environment, and so forth.

The demo application I built is, of course, built on top of Supabase, and you can follow along. Here are the QR codes. Can everyone at the back scan the QR code, or should I make it bigger? Bigger. Right. Okay, just so everyone can see: I'm basically editing the presentation as we speak. Let's see what Claude has to offer us. Bigger. This is the cool thing about vibe coding your presentations. I really recommend it; I probably spent the same amount of time or more than if I'd just used something like Google Slides, but at least it's more fun, and Anthropic should be thrilled about it, for sure. All right, let me know once everyone can see the repo. If you cannot see or scan the QR code, I should probably make the link bigger as well.
So you asked for a demo; here's the demo on my slides. Mostly everyone in this room has used skills, so I probably won't have to sell you on the power of skills. But if you're still a bit skeptical: without skills, this whole presentation would be a lot less pleasant, let's say; much uglier, in a sense. Okay, you should be able to see it now.

So basically navigate to my GitHub, Hudripppn, which is my nickname, and the improve-skills-workshop-AIE-Europe repo. It's a very long name. All right, is everyone at the GitHub repo at the moment? Okay, everyone had no trouble. All right. Not this one; this is the repo that you should be looking at. I know it's a big repo, but we're going to break it down. Actually, I'm going to move to VS Code.

All right, so here we have two Next.js apps; the slides are actually also embedded here. The Next.js app that matters is inside demo, and to give you an idea of what that looks like, it's basically this.
It's a very simple application; you can see it's a vibe coded application, to be honest. The layout has nothing special about it. As I described earlier, you have several employees of this fictional company. You can think of it like an intranet or performance review application where, as an HR employee, you have all the information about the other employees of the company, and, for the sake of the presentation, you can switch between users.

So what we're going to do first, without a skill, is try to implement a new view. Here we're going to implement the reports view. This reports part of the application is going to show both the salary and the average performance review rating for each department, so HR can have an overview of the whole company. Before we start to vibe code, because during these workshops no one actually writes code anymore, and of course I'm going to vibe code it, let's break this application down. If we navigate to the dashboard, there's nothing special to see; it's basically the main page you've already seen. And then you have the reports page here, where we should have... yeah, we should have this view exist. I've prepared the backend side: we're just going to create the view as a SQL view on the database, and then we should be able to see it in the application.

So, where is it? Yeah. I've prepared the prompt, and we're going to live test it. Fingers crossed that this works. First let me navigate to the app. Okay, here we have more control. All right, so essentially, for the ones in the back: I'm going to ask Claude to create a department stats view that shows the headcount and the average salary broken down by department, so HR can have a full overview of what's going on in the company. We're going to hit the prompt and wait to see what it comes up with.
Right, I forgot about this part: I have this MCP server configured. Actually, I totally jumped over the README; sorry about that. If you're following along, you can follow the setup guide to get your application started locally. Essentially, it's going to clone the repo, install the dependencies, and start your Supabase project locally. You don't have to have the CLI installed, since we're using npx to run it as a binary. Then it resets the database state, so you start from scratch with the seeded data, and then you just run the app with npm run dev; it should be available on localhost:3000/dashboard.
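The setup steps described above translate roughly to the following (the repo URL is a placeholder and the exact commands may differ; follow the actual README):

```shell
git clone <repo-url>
cd <repo>/demo
npm install              # install dependencies
npx supabase start       # run Supabase locally; no global CLI install needed
npx supabase db reset    # reset to the seeded migration state
npm run dev              # app served at http://localhost:3000/dashboard
```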
You'll also have this mcp.json file prepared. It essentially points to the MCP server that we at Supabase enable for local projects. No authentication is required, so your agent should be able to just load it on demand. This MCP server exposes a set of tools. I don't know how many of you have used the Supabase MCP server, but I think the production one currently has something like 29 tools. This one is a smaller version with 20 tools, but you can perform essentially almost anything that the remote-project one does: list the tables that you have, execute SQL straight on your database, apply migrations, run the database advisors, and so forth.
So essentially what it started doing was listing my tables. I asked for a view, so it's going to review the schema that I already have implemented. I'll let it work. And now it's going to run the apply migration tool to create the view, so it's basically doing a schema change on my database. If we inspect the view, we're basically running create or replace view department_stats, the name we gave it, fetching all the information from... I think department... no, from profiles, exactly, and grouping by department. Okay, it made a mistake; it's going to try again. Okay, it's going to test it; that's actually something I really like about it. All right, and here's our view on the database. We currently have it on the database; let's see if it's also enabled in the app.
Okay, it's not yet. Then let's quickly check the SQL I created. What's the name I gave? This is essentially the problem with live demos: it usually doesn't go well on the first try. Let me repeat it and see if it implements it. If not, we can just run the SQL query so you can see, as different users, whether everything is working accordingly. For now it needs to implement it on the Next.js application, so we have a nice interface to check the results. I need to approve everything; wait, let me just put it on auto mode so I can continue to talk. So essentially the agent created the view, tested it, and said everything is working accordingly, the feature was implemented, all good. But we're actually going to see whether everything is actually good or not. Let's give it some space; no pressure while it creates the feature.
Let's just wait a bit more. In the meantime, if you're following along, you can also play with it: change the layout, actually using Claude Code. Let me run a brief survey here during the workshop: how many of you are using Claude Code as well? Oh, fairly... almost everyone, okay. How many of you are using Cursor, with Claude Code or with the plugin? Okay, yeah, at least one person. We're going to have some Cursor folks here, I think from Anthropic as well; OpenAI is going to be here too, and Gemini, of course, since Google DeepMind is sponsoring the event. So we're basically going to have the whole gang here.

Okay, so we should be... I'm trusting its word. All right, it says we should now have the department stats view correctly displayed. Let's see if that's actually true.
It looks like it, yeah. We now have these cards with the whole view of the company. I'm logged in as Julia from HR. We can see that we have five people on the engineering team with an average salary of 107K; HR has only one person, which would be Julia; and product has four people with that average salary. So far so good, looks okay. But this is sensitive information, right, the reports? So we're expecting that the other employees will not have access to it, and that even the managers only see their own departments. Let's see if that's the case. Let's navigate to Bob; Bob is the head of engineering.

Oh, okay. So Bob can also see the performance review information of both HR and product. Well, it's not that bad, right? It's not ideal, but at least he's a manager, so he should have access to privileged information anyway. And who doesn't like a transparent company? Let's see how it looks for a regular employee.
Okay, this is problematic. So: we basically created a view. Claude said everything was working because, as you can see, the information is there; the view was created. But it missed something, something that's basically missing from its training data, which is specific to Postgres: what happens when you create a new view and your underlying table has row level security enabled. For those of you who don't know, row level security allows you to define who can see the information in a specific row, at the database level. Without trusting the application, you can filter directly on the database. In this case, we should be limiting the visibility of the rows by user ID and user role: if a user has the employee role, they should not have access to rows that don't belong to them.

We do have row level security enabled. If you navigate to our Supabase migrations, you can see that we have row level security enabled both on profiles and on performance reviews. And on performance reviews it should be about right: we have reviewer ID equal to the current setting. So it should work. Why is it not working? Well, when you create a view on Postgres, by default the view is created with the permissions, the credentials, of the user who created it, not with the row level security of the underlying table. So by default it bypasses the row level security that you might already have in place on your table. To avoid this scenario, we have to use the security_invoker flag to enable the RLS policies on the view itself. That's why currently everyone can see everyone's data: the row level security policies were basically bypassed by the view.
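A minimal sketch of the fix just described, assuming the demo's profiles table has department and salary columns as shown earlier:

```sql
-- By default a Postgres view runs with the permissions of the user who
-- created it, silently bypassing row level security on the underlying
-- tables. security_invoker (Postgres 15+) makes the view run with the
-- caller's permissions, so the RLS policies apply again.
create or replace view department_stats
with (security_invoker = true) as
select
  department,
  count(*) as headcount,
  avg(salary) as avg_salary
from profiles
group by department;
```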
For the sake of this workshop's demonstration, I've already created a skill; I prepared it for the presentation. Essentially, the skill is three main security points about Postgres that the agent should be aware of. For this one specifically, I actually overfit it to the exact view that we're creating, but models right now are smart enough to generalize this; if I wanted to create a different view, it would still know it has to create it with this flag. This flag has been available since Postgres version 15, and whenever it's enabled, the RLS policies also apply to the view. As you can see, it's actually a quite human-readable document. Most of you have already written skills, so I'm not going to dive too deep into this.
But as you can see, we have both the title, I called it Supabase Security, and the description. The description uses the verb "use". This is an insight I got from some experiments I did: using verbs, mainly the verb "use", increases the chances of the skill being loaded, at least with Claude. I don't know if this is default behavior for Claude, whether it was trained to more easily recognize verbs, essentially "use", but I found it more effective to write "use" followed by the whole purpose of the skill, and then a regular markdown list. We have the view case there, but also another checklist of security points on RLS. For example, public schemas should have RLS enabled by default. Public, or exposed, schemas are the database schemas that provide information to the application that the user can see. For example, the users table, the profiles, the performance reviews: all this information is going to be fetched by the frontend. It's completely secure, because Supabase makes it secure while allowing you to fetch information from the frontend. But the key part is that if you don't enable row level security, you won't have this filter on the table, and you'll have to rely on application logic to do the filtering. Enabling row level security at least makes it safer for you as the backend engineer, so you only expose the information you actually want from the start.
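As an illustration of that checklist item (the table and column names are assumptions based on the demo schema, and auth.uid() is Supabase's helper for the current user's ID):

```sql
-- Enable RLS on an exposed table; without this, every row is visible
-- to any client that can query the table.
alter table performance_reviews enable row level security;

-- Example policy: employees can only select reviews that belong to them.
create policy "employees read own reviews"
  on performance_reviews
  for select
  using (employee_id = auth.uid());
```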
And then there are a couple more things that I'm not going into. So, we can install this skill in this project by running... where do I have the command? npx, yeah. I'll be using Vercel's npm package called skills. I'm curious to know how you've been packaging your skills. Have you ever used this package? Are you using plugins?

>> This one.

>> This one? Yeah, this one mainly. Yeah, it became very popular a few months ago.

>> I think the only problem is it doesn't really adhere to your project, so you only get it globally, not for your local project.

>> Yeah, you can install it both globally and on your project, and it also has support for multiple agents, while plugins for now are still tied to the agent that is going to load them. Cursor has plugins, Claude Code has plugins; I think other vendors have them as well, but they're distributed and made specifically for those agents. So we're using this one to install. You can install any skill from an online repo that has a SKILL.md file, or you can use it to install one locally. It will auto-detect the location you're trying to fetch from based on the format: in this case we don't have any GitHub reference or HTTP protocol there, we have a dot slash, so it will recognize that it's a local one. And for this I'm going to move to the main folder. Yeah. Okay. And, the good old-fashioned way, I'm going to run it in bash.
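The install command looks roughly like this (the local path is illustrative):

```shell
# Install a local skill with Vercel's `skills` package; the ./ prefix
# makes it treat the source as a local directory rather than a repo URL.
npx skills add ./supabase-security
```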
So, it's going to pop this up and ask me which agent I want to install it for. I'm using Claude Code, so I'm going to install it for Claude Code. If you're using any other agent harness, you can install it there as well, as long as it's supported. I'm going to install it at the project level, so in this case it's going to create an agents folder with the skill and link it to my .claude/skills folder as well, so Claude knows where to find them; it creates a symlink, and we're ready to install. Let's not expose my key; this is just for the workshop, so I'll delete it afterwards. Feel free to use my free credits for the time being. But essentially, it created the agents folder. Where is it? Yeah. I also have some more things in there that we're going to see later, but the essential part is that it has the skill I showed previously. There it is: the skill, and it also created a symlink, a symbolic link, for Claude. This is how the package works; it allows Claude to search either in the agents folder, which is becoming the standard, or in its own .claude folder.
So let's run the same prompt again in a new session. Let me go back to the apps demo and start a new session. We should have this one enabled — yeah, there it is. So Claude is aware of the Supabase security skill. Now, to run skills, you can either just run your prompt and pray that Claude imports your skill based on the description you gave; or you can include the keyword "use" followed by the name of the skill in your prompt, which will load your skill almost 100% of the time; or, if you're using Claude Code, you can just type slash and the name of your skill, and that 100% guarantees Claude will import the skill. For this presentation — because I can't afford for it not to load the skill — I'm going with the last one.
Wait — first I need to reset the database to create the view again. In the workshop folder, that's npx supabase db reset. Yes — I'm just resetting the database, applying the migrations from the start. It didn't create any migration file earlier; it applied the migration directly to the database. So now we should be good to go. It's going to bring down the database and create a new one based on the schema we defined in the migration files and the seed data.
>> Yes. Have you found ways to make skills load reliably?
Yeah, that's a fair point. So your question, or observation, is about the initial promise of skills as they were presented by Anthropic —
>> Yeah.
So, since this is on the agent side — the agent decides when to load the skill — the best thing you can do, without explicitly invoking it with the slash command or with "use" plus the skill name in your prompt, is to play around with the description and run a bunch of tests, either manually or automatically, to check what actually works for the cases where you expect the agent to behave a certain way. You define a bunch of scenarios where you think the skill should be loaded and scenarios where it shouldn't, and you test them. You can test on your machine: for a scenario where you don't want the skill to load, run the prompt in Claude Code, say, and check through the CLI whether the skill was loaded or not. Then play around with the description to see what actually works. Without explicitly calling the skill, this is the best way to test whether the skill is being loaded correctly.
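The load/no-load test matrix described above can be sketched as a small harness. In practice each prompt would go through a real agent run and you would inspect its transcript; here check_transcript is a stand-in that just searches captured output for the skill name, and the scenarios are illustrative:

```python
SKILL_NAME = "supabase-security"  # illustrative skill name

# Scenarios where we expect the skill to load, and where we don't.
scenarios = [
    {"prompt": "Create a view over the employees table", "expect_loaded": True},
    {"prompt": "Write a haiku about databases",          "expect_loaded": False},
]

def check_transcript(transcript: str, skill_name: str) -> bool:
    """Stand-in check: did the agent mention loading the skill?"""
    return skill_name in transcript

def grade(scenarios, transcripts):
    """Compare observed loading behavior against expectations."""
    results = []
    for scenario, transcript in zip(scenarios, transcripts):
        loaded = check_transcript(transcript, SKILL_NAME)
        results.append(loaded == scenario["expect_loaded"])
    return results

# Fake transcripts standing in for real agent runs:
transcripts = [
    "Loading skill: supabase-security ... created view",
    "Here is a haiku about rows and tables",
]
print(grade(scenarios, transcripts))  # [True, True]
```

Tweaking the skill's description and re-running a matrix like this is how you converge on a description that triggers in the right cases.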
Yeah, we're still at a very early stage of skills, and even of MCP — all of this agent stuff is fairly new. We're still standardizing things, still figuring out what works and what doesn't. Progressive disclosure was something no one was talking about six months ago, and now it's fair to say it's one of the north stars of agent development. So six months from now it could be something else: skills could be the standard, or maybe Anthropic or OpenAI or someone else will have found a more efficient way to manage context, or to provide more context to the agent. We'll see, basically.
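The progressive disclosure idea mentioned above reduces to a simple mechanic: only short descriptions sit in the context up front, and a skill's full body is pulled in on demand. This toy model illustrates the shape of it — the skill names, descriptions, and trigger rule are all illustrative, not how any particular agent implements matching:

```python
skills = {
    "supabase-security": {
        "description": "Checks for RLS and security_invoker on views.",
        "body": "...full checklist, workflow steps, SQL snippets...",
    },
    "pdf-report": {
        "description": "Generates PDF reports.",
        "body": "...full instructions...",
    },
}

def initial_context(skills):
    # Up-front cost: one short line per skill, not the whole SKILL.md.
    return [f"{name}: {s['description']}" for name, s in skills.items()]

def load_on_demand(skills, task):
    # Crude trigger: load the full body only if the task mentions the name.
    return [s["body"] for name, s in skills.items() if name in task]

print(len(initial_context(skills)))  # 2 — two one-line descriptions
print(load_on_demand(skills, "use supabase-security to audit this view"))
```

Only the matched skill's body enters the context; the rest stay as one-line descriptions, which is why many installed skills cost so little until they are actually used.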
All right, the database was reset. So at least now we have the view, but we don't have the information in the database. Now we should be able to run the same prompt again, but with the skill. So if we hit the prompt — you're saying it was quite fast? I don't think — yeah, but it didn't create one. Okay, let me try another thing: instead of this, let's use the slash command to create it. Let's see if it works now. Yeah, okay — it loaded the skill. So now it should at least have the context to know that RLS, or the security_invoker flag, should be included when creating the view, and the rest of the workflow should remain the same. It will list my tables — right, exactly — identify the tables. And now, if we look closely, we can see that we now have the flag here; it's going to be in the migration. So let's see if, with the flag, we get the expected result.
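The flag the skill adds is the security_invoker view option, which makes the view run with the querying user's RLS policies rather than the view owner's. A deterministic check for it — the kind of assertion the evals later in the workshop rely on — can be as simple as a regex over the generated migration. The SQL below is an illustrative migration, not the one Claude generated live:

```python
import re

migration_sql = """
create view public.employee_directory
  with (security_invoker = true) as
  select id, name, department from public.employees;
"""

def has_security_invoker(sql: str) -> bool:
    """True if the migration sets security_invoker on a view, so the view
    respects the querying user's RLS policies instead of the owner's."""
    return re.search(r"security_invoker\s*=\s*(true|on)", sql, re.I) is not None

print(has_security_invoker(migration_sql))  # True
```

A view created without the flag would fail this check, which is exactly the behavioral difference between the with-skill and without-skill runs.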
This is what happens when you vibe code a CLI — you now have the UI duplicated. Right, so it created the view. We should be able to see it, but Alice shouldn't. So what's happening? Wait. Okay — do I have to reset now? Interesting. There should be another one — let me just see if I have it here. Where did I put it? So, count — I'm going to cheat here and say that both of them, including the employee, should be able to see the information. Okay, so we're basically live troubleshooting. What's not happening is probably down to a different policy I've defined here. But now it's going to troubleshoot. Let's see if the skill actually improves the effort here. If not, I have something up my sleeve.
If you're not aware, Supabase has database advisors that you can use to identify, early on, potential vulnerabilities — schemas or information that might be exposed — before you run things in production. So if it can't figure this out by itself, I'm going to add to the skill that it should also run the advisors as a check. And this is the main point of skills — well, it's a very poorly written application, let me say that — the main point is not whether this specific demo works or not; it's that the behavior changed once the skill was loaded, right? It created the view with the security_invoker part. And that just shows how powerful this is: you can change the behavior of the agent, or guide it, on demand, based on the information you put in. You can think of the SKILL.md files as prompt templates that you give to your agent. So let's just quickly troubleshoot — oh, it's even offering to apply a migration. Let's see if it doesn't break my app.
All right, it seems too complicated.
>> I have a fair amount of skills — as you can see, I've been playing around with them. I also have some of the pre-installed MCP servers that Supabase enables. But essentially, it would be more interesting if I compared the context from before and after loading the skill. So right now, skills take about 1,300 tokens of my context. As you saw, I have more than just this one skill, but the skill was loaded, so the whole content of the SKILL.md was loaded into context. If we clear and run /context again — the skills amount — this screen is too small for you to see, but as you can see, the skills take up much less space than the MCP tools would.
All right — I have a newer version of Claude Code. For those of you who aren't aware, Anthropic recently released the tool search tool, which is a mechanism for Claude Code to load tools on demand, so it doesn't load everything up front — basically progressive disclosure, but for MCP tools. The main difference between this tool search tool in Claude Code and skills is that for skills, progressive disclosure is built in by design — it's already baked into the structure of the skill itself — while for MCP it's still not a standard across all tools. It works in Claude Code, but for many other clients it won't; they will just load all tools straight into your context. So for now this is a Claude Code-only thing. If you're interested, the founder — or one of the co-founders — of MCP is speaking on the 10th, on Friday. He's going to give a brief overview of the MCP roadmap, and — if nothing has changed since last week, when he presented it in New York at the MCP Dev Summit — it should bring this progressive disclosure idea for tools into the protocol itself.
So —
>> Yes. Let's say we have a very large database and we have to load the schema of this database into the context, because we have to query the database using agents. In your opinion, is it better to use a skill or an MCP server to load this schema — progressively?
Okay —
>> Is it possible to progressively disclose — progressively load — the schema of this big database? In your experience?
Yeah. Okay, so is your question more about how we should access it, or about the whole architecture of the pipeline to import the data?
>> I just want to ask an agent to query the database, and obviously the agent must know the schema of the database beforehand — or not. How can you teach the agent to query the database, using skills, using an MCP server, or something like that? And if you decide to use skills to load the agent's context with the schema of the database, is it possible to progressively load the schema into the context?
>> Okay, gotcha. Let me break down the situation for you. You'll have essentially two parts. One is what's going to be in the context — what gets loaded, the specific information you want to have in your scenario. The second part is the actual extraction mechanism you're going to use to load the information from the database. For that second part, you can either use a script — so, a skill that invokes a script — or an MCP tool. I would advise the MCP tool, because it works whether you're on production or on a remote project: you don't rely on your local environment, you don't have to manage keys, the tool is already standardized, and authentication is baked into the protocol, so the agent never handles the authentication token — it just runs the tool. As for progressively disclosing the information in the database: yes, you can include that in a skill. You'll be using the MCP tool, and in the skill you'll probably state "use this tool to load the schema", and in the tool implementation you enable it to load progressively — in chunks. It might be enough to have it in the tool parameters: the agent should figure out by itself that a parameter called, say, buffer lets it load in chunks instead of pulling the whole table. But if you want to be 100% sure it loads in chunks and uses them properly, I would also package it with a skill and describe how I intend the tool to be used. This is actually how skills and MCP play together: the tool enables the connection, the integration, and the skill describes how to use it.
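The pattern just described — a paginated extraction tool plus a skill that tells the agent to follow the cursor — can be sketched like this. list_tables is a toy stand-in for a real MCP tool, and the parameter names (limit, offset, next_offset) are assumptions for illustration:

```python
TABLES = [f"table_{i}" for i in range(250)]  # stand-in for a big schema

def list_tables(limit: int = 50, offset: int = 0) -> dict:
    """Tool side: return one page of the schema plus a cursor, so the
    agent never has to pull the whole schema into context at once."""
    page = TABLES[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(TABLES) else None
    return {"tables": page, "next_offset": next_offset}

# The skill side would say, in prose: "call list_tables repeatedly,
# following next_offset until it is None." That loop looks like:
pages, offset = 0, 0
while offset is not None:
    result = list_tables(limit=100, offset=offset)
    pages += 1
    offset = result["next_offset"]
print(pages)  # 3 — the 250-table schema arrives in three chunks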
Yeah, this is how I would implement this type of system. Thank you for the question — it gave me the opportunity to talk about how to use skills and MCP together rather than pitting one against the other. So now, as I promised, we should be moving on. I'd have to give it more time to figure this out, because when I was preparing the workshop I gave the app a bunch of vulnerabilities. If I had kept it simple, that demo would probably have worked; since I exposed more vulnerabilities, it would take time I don't have right now to solve them all. But you saw that, across both scenarios, the first one didn't have the security_invoker flag and the second one did. So at least we can infer that the skill was doing something: the agent saw the information in the skill, merged it with the system prompt — or stored it near the system prompt — and changed its behavior accordingly.
Now, to test this. Say you want to move this skill into production. It works on your machine — it's a tale as old as time, "it works on my machine" — but I don't know if it's going to work with your agent, on your machine, in your environment. To test this, and to automate that testing — which unlocks having a pipeline — the question is: if you change one thing in your skill, how can you reliably tell that it keeps doing what you expect and didn't break the previous flow? If I change one item of the checklist, how can I ensure the other ones still work? This is where evals step in. "Evaluations" is a very broad term: since a skill is a markdown file, a free-text file, you can evaluate basically anything. The most difficult part of creating evals, I would say, is actually coming up with the scenarios, because you first have to know the expected behavior of your agent. Coming up with representative, genuinely good scenarios that cover a fair share of the use cases you want to support is the hard part. And there's still no standardized structure for evaluations: you can test by importing a bunch of prompts and expected outputs from a CSV or JSON file, or you can use tools like Braintrust or Langfuse to run them and get an analytics and observability layer on top.
For this presentation, I followed what the Agent Skills open standard defines for designing test cases. If you're not aware of this website, it's the landing page of the Agent Skills open standard, which tries to standardize what a skill is and how it should behave. They propose a very simple, local way to test skills: you have an eval.json that essentially contains a set of evals — an array of eval scenarios. You put in the prompt you're going to give the agent and the expected output from the agent — that part is only needed if you have an LLM as a judge. LLM-as-a-judge is a technique for nondeterministic evaluation: instead of a human, you give the outputs of an evaluation run to another LLM, define success criteria, and let the LLM whose role is to judge — that's why it's called LLM as a judge — give it a grade. So that's one part you can automate in your evaluations for nondeterministic workflows: you can either assert whether a tool was called, or give the results to an LLM and have it nondeterministically grade the performance of the other agent. We basically have agents evaluating agents.
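One scenario in the spirit of that eval.json structure, with the deterministic half of the grading attached, might look like the following. The field names (prompt, expected_output, assertions) follow the standard loosely and should be treated as illustrative, not the exact schema:

```python
import re

scenario = {
    "prompt": "Create a view listing each employee's name and department.",
    "expected_output": "View created with security_invoker = true",
    "assertions": [
        {"type": "regex", "pattern": r"security_invoker\s*=\s*true"},
    ],
}

def run_assertions(scenario: dict, agent_output: str) -> bool:
    """Deterministic part of the grade: every assertion must hold.
    The nondeterministic part (LLM-as-judge) would additionally hand
    agent_output and expected_output to another model for a verdict."""
    for a in scenario["assertions"]:
        if a["type"] == "regex" and not re.search(a["pattern"], agent_output, re.I):
            return False
    return True

sample_output = "create view v with (security_invoker = true) as select 1;"
print(run_assertions(scenario, sample_output))  # True
```

Keeping the deterministic assertions separate from the judged expected_output makes it easier to see which half of the grade failed when a run looks wrong.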
So I followed this structure. I gave it the same input we had before — the agent running this evaluation gets the same prompt we used. The expected output is that security_invoker is true, so it's present on the view. And then I have a bunch of assertions that, in this case, I check deterministically. I prepared a Python script that essentially resets the state of the database, so we ensure — since we're running this locally and not in an isolated container like Docker — that the system always starts from the same ground. So I'm going to reset the app. If you want to run the evaluations yourself, you have to bring your own Anthropic key; you can follow the README inside the supabase-security folder, which explains how to set this up. Then I run Claude Code on it — I think it's called print mode, or I can't remember exactly what they call it, but essentially I run it headless, as a binary. The agent receives the prompt from the evaluation as the task to perform, and I also pass the condition: we're going to test two conditions, one with the skill and one without it. So, for you to see, in the part of the script that runs each condition — this is where Claude Code runs — if the condition is with-skill, we load the SKILL.md into the system prompt. If you actually wanted to mimic the real behavior, you would run this in a Docker container, put the agent skills in the .claude/skills directory inside the container, and let Claude Code find and use them organically. For this presentation it's a very simple setup: I've just appended it to the system prompt.
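The two-condition setup reduces to one small decision: with the skill, SKILL.md is prepended to the system prompt; without it, the base prompt runs alone. In the real harness the resulting prompt is handed to Claude Code in headless/print mode; this sketch only builds the prompt, and the prompt texts are illustrative:

```python
def build_system_prompt(base_prompt: str, skill_md: str, with_skill: bool) -> str:
    """With-skill condition: SKILL.md content precedes the base prompt.
    Without-skill condition: the base prompt runs unmodified."""
    if with_skill:
        return f"{skill_md.strip()}\n\n{base_prompt}"
    return base_prompt

BASE = "You are a coding agent working in this repo."
SKILL_MD = "# supabase-security\nAlways create views with security_invoker = true."

with_skill = build_system_prompt(BASE, SKILL_MD, True)
without_skill = build_system_prompt(BASE, SKILL_MD, False)
print("security_invoker" in with_skill, "security_invoker" in without_skill)
# True False
```

Everything else in the run — prompt, database state, assertions — stays identical between the two conditions, so any difference in the outputs can be attributed to the skill.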
So, we're going to run the evaluations. Do I have the other one? Yes, I do. Okay. I think we run it from the base folder. How is it not finding the skill? No. Okay — oh, wait, I know what's going on: I have the wrong name. Let me change it.
All right. So we started by running with the skill, so the first result we should get is the with-skill one. It stopped; now it's running without it, and then we're going to compare them. This will output a workspace iteration-one folder, and we can compare the outputs with the skill and without it. While the without-skill run is loading, let's quickly inspect what the with-skill output gave. Essentially, you can see that it created the view with security_invoker, and then we have this grading.json file with a bunch of information, like the assertions we put in the eval.json. We have them here, and we can see that it graded this one as failing, even though it created — where is it — "not found" — the view with security_invoker. Okay, I'm actually evaluating something wrong. The problem here is with the view check: since I was expecting it to check pg_class reloptions instead of just inspecting the view, it tells me it failed. But the key part — has it finished? It's not finished. Still running; it's taking a long time. Could be. Okay. And now we can inspect.
Okay, so this is actually a good insight. These results show the tricky part of writing evals. As with normal tests, the results will depend on how you implement them — it's just code. If you're evaluating the wrong thing, or not the expected behavior, you're going to get wrong results, and it might not be because the system isn't working. We tested manually and saw that with the skill it created the view with the security flag — we can actually inspect it here; with the skill, it was created. Let's see if on this one — surprisingly, this time it did. That's the nondeterministic behavior of Claude. But since I was evaluating the wrong thing — I was inspecting a catalog schema to check whether security_invoker was there, instead of inspecting the view directly — the results came out a bit off: it said that with the skill it failed and without the skill it passed. And if we inspect both outputs, they're basically the same. This is just to show you how tricky it is to write evals. Although this can also happen with regular tests, there it's easier to catch, because the output is deterministic — it's just code. Here, if you're handing it to an LLM to evaluate, it can sometimes hallucinate.
So, to finish — because we're almost out of time — to sum up the structure: this is the one they recommend, and I find it very easy to implement and get started with. Later on you can move to more complex evaluation scenarios, like running in Docker or in a sandbox, to guarantee a fresh environment with just the one skill you're testing. But essentially you run two conditions, with and without the skill, on the agent harness of your choice, and compare the results. That's basically your very first evaluation pipeline for testing a skill automatically.
From my end, that's all. I hope you found this workshop useful for getting your skills leveled up and ready for production. As I said at the beginning, I'm giving a keynote — no, a talk — tomorrow about how we implemented and created the Supabase skill for the product itself: how we're keeping it maintainable while ensuring it provides value, and how we're testing it in production. Thank you. Anyone have any doubts or questions? I'll also be — yeah.
>> So I have a question about the number of skills you typically install in your environment. With this progressive disclosure, it seems like we can basically keep adding different skills and the agent will automatically find them. Do you have any recommendation on how many skills to have? Is there any limit, or should we just keep adding and it will magically work?
>> Yeah. I'm probably not the best person to talk about this, because it's easy to fall into this rabbit hole — especially when you're experimenting — of collecting a bunch of skills. As you saw, I have plenty of them installed globally, and I don't think it's fair to say I use them all on a daily basis. It depends. If you're using them on your local machine, it's pretty easy to end up with a messy environment where you have all of them, or most of them, installed. For your local environment, for now — since this is all very experimental, and this is my personal opinion — I would not constrain myself over space management or context management. Progressive disclosure is a very powerful thing you can lean on here: sure, skills you don't use will still occupy your context window, but the descriptions are so small that you can afford not to delete them if you don't want to. In production, though, treat them like any artifact you would have in your CI: keep it clean. In production, in your CI, I would keep only the exact skills you're using for that specific case.
Another piece of advice on the production side: it's now more and more common to also export skills, or make them available in your repos, as another piece of documentation. So treat skills that you put into production as actual documentation: it's important to keep them updated. Include the update workflow in your CLAUDE.md or your AGENTS.md, so that if anything changes, you change the skill as well, just as you would update documentation when a feature or workflow changes. From time to time you can also run a job to check whether the skill still describes a valid workflow, or somehow check whether the skill has been loaded by your users — if it hasn't been loaded in a long time, does it still make sense to keep it there? So yeah, that's basically the advice I can give you for skills in production, based on my experience. For the rest of it, you'll have to come to the talk tomorrow to learn how we're putting it into production at Supabase.
Any more questions? I'm going to be around throughout the whole event, so if you cross paths with me, feel free to ask me anything. Tell me about what you're building — I'd love to see if it's with Supabase; even more thrilled to hear about it. And from my end, once again, thank you very much. You've been lovely today for 9:00 a.m. — pretty cool, good energy. So, from my end, enjoy the rest of the conference, and we'll see you around. Thank you.