Skill Issue: How We Used AI to Make Agents Actually Good at Supabase — Pedro Rodrigues, Supabase
Channel: aiDotEngineer
Published at: 2026-05-04
YouTube video id: GmAQKINjv1E
Source: https://www.youtube.com/watch?v=GmAQKINjv1E
Hello everyone. Is everyone excited for the conference? Awesome, we've got a full house here. I'm very glad to be here, and honored to be giving one of the opening workshops today. You may have noticed the title is slightly different from what's in the schedule. I've basically done a rebrand, but the theme of the workshop remains the same: we went from "Skill Issue" to "Level Up Your Skills." I've moved the "Skill Issue" title to the keynote I'm giving tomorrow, so you'll have time then to learn what that one is about. This workshop is basically what I've been doing over the last two months at Supabase: writing our own skills. And tomorrow I'm going to present how we put this into production and the lessons we learned. For everyone who's been paying closer attention, you've probably noticed I'm running this slide deck on localhost. Some of you have already noticed. This is no coincidence: I essentially vibe coded the presentation, so if you see something off, it was not my fault, it was Claude. If you don't believe me, you'd have to be a serious Google Slides guru to have dark mode enabled. Honestly, I like this layout better, so I think we're going with dark mode here. If there are any light mode fans out there, or if the majority of the room prefers light mode, I'm happy to switch back, but for now let's go with this one. A little introduction of myself before starting the workshop: my name is Pedro, I'm from Lisbon, Portugal, and I work at Supabase as an AI tooling engineer. Essentially, my day-to-day is thinking about how we can make Supabase as agent-friendly as possible and improve the agent experience.
You've probably heard about developer experience, DX. We're more focused on the agent equivalent: the same thing, but for agents. In this workshop we're going to talk a bit about skills, because that's essentially how we've been improving the performance of agents around a product like Supabase, or a company like Supabase with multiple products. The secret sauce has basically been skills. So we're going to dive into how to write one, how to test it manually first, and then how to automate the testing with evaluations. To start: how many of you have heard about skills? All right, almost everyone, so what I'm going to say is probably no news to you. Skills are basically folders with instructions and files that let you run repeated workflows, give custom information to your agents, or provide a new set of tools in the form of scripts. There's a bit of a misconception about skills: the main file, SKILL.md, usually takes the spotlight, but a skill can actually be more than just the main file. The main file is a markdown file named SKILL.md where the core information about the skill lives. It's composed of front matter at the top, which can have multiple fields, but the two required ones are the name, which identifies the skill, and the description, which tells the agent what the skill does. The main thing skills brought that tools like MCP didn't is this concept of progressive disclosure. Progressive disclosure is when not all the information about a subject is loaded straight into context.
Instead, you load just the exact amount of information that allows the agent to choose to load the rest once it actually needs it. The SKILL.md file is designed exactly like this: only the front matter is loaded into the agent's context at first, not the content of the file. It works as an envelope. From the description, the agent knows what the skill does and when it should load the rest, that is, when it should look for the information inside the file. Inside this file you can also reference other files. Usually these other files are either markdown files or scripts: bash, Python, whatever you'd like to reference. Starting with the reference files: you usually put them inside a references folder, and they provide more information. You can think of a skill in this format as a book. SKILL.md is the index on steroids: besides holding custom information of its own, it has links to the other files, which you can think of as the pages or chapters of the book. The reference files have nothing special about them; they're normal, regular markdown files, similar to SKILL.md, except they're the ones that got referenced rather than loaded first. Funny enough, you can also reference files inside reference files, so you can basically build a graph out of a skill. As for scripts: I've actually talked before about how MCP and skills differ from each other, and we're basically comparing apples to oranges when it comes to MCP versus skills. That misconception has probably already been debunked; the debate now is more about MCP versus CLI.
But when skills were released, back in October or November last year I think, they started this debate: should we use them instead of MCP? After all, if I can provide more context to the agent without actually loading every tool into context the way MCP does, and I can also have scripts, so I have actions just like MCP tools, should I use skills instead? The answer is: you should use both, to be honest. If you're building anything that's an integration, you should use MCP. If your agent doesn't have access to bash, you should use MCP to integrate with your service. Skills, meanwhile, provide more context to your agent: you can define workflows and everything you don't have space to define in MCP tool descriptions. Also, regarding the comparison between skill scripts and MCP tools: the main difference is that tools don't need a local environment to run. The agent knows how to call a tool, and, especially if the MCP server is remote, the tool runs on the server side. Scripts, on the other hand, are loaded onto your machine, run in your local environment, and are tied to whatever environment you have: if you're running on Linux they have to be Linux-compatible, same for macOS, and I'm not even going to start on Windows. Those are essentially the main differences between MCP tools and skill scripts. I hope everything is clear. If you have any doubts, feel free to ask. I'm going to have a little demonstration; this workshop is going to be more of a walkthrough than a code-along, but feel free to tag in.
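To make the structure concrete, here is a minimal sketch of what a skill folder's main file might look like. Everything in it is illustrative — the name, description wording, checklist items, and the referenced file path are my own placeholders, not the actual skill Supabase ships:

```markdown
---
name: supabase-security
description: Use when creating or modifying Postgres views, tables, or RLS policies in a Supabase project.
---

# Supabase Security

- When creating a view over a table that has row-level security, add
  `WITH (security_invoker = true)` so the RLS policies still apply.
- Every table in an exposed schema should have RLS enabled by default.

For the full checklist, see [references/rls.md](references/rls.md).
```

Only the front matter (the block between the `---` markers) is loaded into context up front; the body and the referenced file are read on demand, which is the progressive disclosure described above.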
I have a GitHub repo prepared, so you'll be able to visit and explore it. If you have any doubts at any moment of the workshop, feel free to interrupt me or raise your hand. So — exactly, I tested this on a smaller screen and it was working; you can see it was vibe coded. Now: how do you test your skills? If a skill is basically just markdown, how do you test markdown files? Testing a piece of code is already straightforward: we have all sorts of test types — unit tests, integration tests, or testing the whole flow, which we call end-to-end testing. When you're testing a markdown file, you can do exactly the same, and you can be as granular as you want. But since we have an LLM in the loop, you'll use something called evaluations. For those of you who haven't heard about evaluations — or evals for short — they are essentially a way of testing the output or behavior of a nondeterministic system; you can test both an LLM and an agent with evals. At the end, I'm going to present a very simple framework for you to run your evals, somewhere you can start, and I'll dive deeper into evaluations there. But essentially, an eval is usually made of an input and an expected output, just like a regular test, and in between you can evaluate the steps the agent took: the reasoning, the tools it called. That's normally more interesting and easier to evaluate than running a regex over the exact output, since the output is nondeterministic. So there's essentially a framework you can follow to test your skills.
This one was proposed by OpenAI in a blog post called "Systematically evaluate agent skills." I think they released it back in January or February — so not that long ago, but all of this is fairly new, so that's basically prehistory. You start by defining your metrics: what you want to evaluate in your skills. If you're building a skill for your product, for example, what exactly do you want the skill to highlight to your agent? Is it going to forward the agent to the documentation? Are you encoding a specific instruction, a specific workflow? Depending on what you want to evaluate, you start this eval-driven development — like test-driven development — by defining the metrics: what exactly "good" means for this skill. Then you create the skill itself: you write the SKILL.md file, any scripts alongside it, and the reference files if you want them; they're all optional, and the only required file is SKILL.md. Then you move to the testing part: you run the evaluations, or you run them manually. I recently heard the CEO of Braintrust on a podcast — I don't know how many of you know Braintrust. Okay, not as many as skills; not as popular. For those of you who don't know it, Braintrust is a platform that lets you systematically run evals and gives you the full picture of the agent's behavior during the evaluation scenario. I'm trying to think of another platform to compare it with, but this is fairly new, to be honest. You can think of it as an observability tool for checking the behavior of your agents in a specific, controlled scenario, which is what evaluations are. So, you move to the testing part.
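A single eval scenario, stripped to its essentials, can be sketched in a few lines. This is a hypothetical harness, not Braintrust's or OpenAI's API — the case fields, tool names, and the stubbed agent run are all illustrative:

```python
# Minimal eval sketch: each case pairs a prompt with the tool calls and
# output we expect from the agent. The grading here is deterministic
# (set membership and substring checks) even though the agent is not.
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected_tools: list[str]   # tools the agent should call, in any order
    output_must_contain: str    # substring check instead of an exact match


@dataclass
class AgentRun:
    tools_called: list[str]
    output: str


def grade(case: EvalCase, run: AgentRun) -> dict:
    """Grade one run against one case: tools used and output content."""
    return {
        "tools_ok": set(case.expected_tools) <= set(run.tools_called),
        "output_ok": case.output_must_contain.lower() in run.output.lower(),
    }


# Example: the view-creation scenario from this workshop (names assumed).
case = EvalCase(
    prompt="Create a department_stats view with headcount and average salary",
    expected_tools=["list_tables", "apply_migration"],
    output_must_contain="security_invoker",
)
# A stubbed run standing in for a real Claude Code session.
run = AgentRun(
    tools_called=["list_tables", "execute_sql", "apply_migration"],
    output="Created the view with security_invoker = true so RLS applies.",
)
print(grade(case, run))  # → {'tools_ok': True, 'output_ok': True}
```

In a real setup the `AgentRun` would be captured from an actual agent session, and you'd run a whole set of such cases and iterate on the skill until they pass.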
Basically, you run a set of evaluation scenarios. These are defined by the input, the expected output, and the tools that should be called — essentially, how you expect your agent to behave. Then you move to the grading part: how did the agent do? This is very similar to a testing cycle, except that instead of a deterministic output there's an LLM in between, so it's nondeterministic — but you can still have deterministic parts to evaluate. And then you iterate and repeat; that's why it's a cycle, pretty similar to any of the test-development cycles we have at the moment. All right, so, jumping straight to what we're going to do during this workshop. We're going to write a skill. I've prepared a little demo app: a performance-review application with four employees, I believe — one regular employee, two managers, and one HR representative. There are some errors on the database side that we're going to find and fix, and we're going to build a skill to help guide the agent to fix them. Then at the end, as I said, I have a framework to automatically test the same scenario we test manually, using evals. Before moving to the demonstration: how many of you have heard of or used Supabase? All right, almost everyone knows or has used Supabase. I've seen some hands down, so I'll still give you a brief intro. Supabase is essentially a backend as a service. You can think of it as the open source version of — thank you, Firebase; only "fire" was coming to my mind, sorry — of Firebase. And if you don't know Firebase, you're probably living under a rock.
No, but essentially it's a backend as a service; you can use it to build any backend you'd like. Straight out of the box, we provide a database for you to just plug into your application, running on Postgres, one of the most popular — if not the most popular — open source database solutions out there. You can easily integrate authentication into your application, storage for saving files, and many other things, such as Edge Functions, which are like Lambda functions for those of you coming from the AWS world, and so forth. The demo application I built is, of course, built on top of Supabase. So you can follow along: here are the QR codes. Can everyone at the back scan the QR code, or should I make it bigger? Bigger. Right. Just so everyone can see: I'm basically editing the presentation as we speak. Let's see what Claude has to offer us. "Bigger." This is the cool thing about vibe coding your presentations — I really recommend it. I probably spent the same amount of time or more than if I'd just used something like Google Slides, but at least it's more fun, and Anthropic should be thrilled about it, for sure. All right, let me know once everyone can see the repo. If you cannot see or scan the QR code, I should probably make the link bigger as well. So, you asked for a demo — here's the demo, on my slides. Most of you in this room have used skills, so I probably won't have to sell you on the power of skills. But if you're still a bit skeptical: without skills, this whole presentation would be a lot less pleasant — much uglier, in a sense. Okay, you should probably see it now. So basically, navigate to GitHub, to Hudripppn — which is my nickname — and the improve skills workshop AIE Europe repo.
It's a very long name. All right, is everyone at the GitHub repo at the moment? Okay, everyone had no trouble. So — not this one; this is the repo you should be looking at. I know it's a big repo, but we're going to break it down. Actually, let me move to VS Code. All right, here we have two Next.js apps — the slides are actually also embedded here. The Next.js app that matters is inside demo. And to give you an idea of what it looks like, it's basically this: a very simple application. You can see it's a vibe coded application, to be honest; the layout has nothing special about it. As I described earlier, you have several employees of this fictional company, and you can think of it as an intranet or performance-review application where, as an HR employee, you have all the information about the other employees of the company. And, for the sake of the presentation, you can switch between users. So what we're going to do first, without a skill, is try to implement a new view. Here we're going to implement the reports view. This reports part of the application is going to show both the salary and the average performance-review rating for each department, so HR can have an overview of the whole company. Before we start to vibe code — because during these workshops no one actually writes code anymore, so of course I'm going to vibe code it — let's just break this application down. If we navigate to the dashboard, there's nothing special to see.
It's basically the first page, the main page you've seen. And then here you have the reports page, where we should have — yeah — this view exist. I prepared the backend: we're just going to create the view as a SQL view on the database, and then we should be able to see it in the application. So I've prepared — where is it? — yeah, I've prepared the prompt, and we're going to live-test it. Fingers crossed this works. First, let me navigate to the app. Okay, here we have more control. All right, so, for the ones in the back: I'm going to ask Claude to create a department stats view that shows the headcount and the average salary broken down by department, so HR can have a full overview of what's going on in the company. We'll hit the prompt and wait to see what it comes up with. Right — I forgot about this part: I have this MCP server configured. I totally skipped the README; sorry about that. If you're following along, you can follow the setup guide to get your application started locally. Essentially, you clone the repo, install the dependencies, and start your Supabase project locally — you don't have to have the CLI installed, since we're using npx to run it as a binary. Then you reset the database state, so you start from scratch with the seeded data, and then just run the app with npm run dev; it should be available on localhost:3000/dashboard. You'll also have this MCP.json file prepared. It points to the MCP server that we at Supabase enable for local projects — no authentication required.
So your agent should be able to just load it on demand. This MCP server exposes a set of tools. I don't know how many of you have used the Supabase MCP server, but I believe the production one currently has something like 29 tools. This one is a smaller version with 20 tools, but you can do essentially almost everything the one that connects to your remote project does: list the tables you have, execute SQL directly on your database, apply migrations, run the database advisor, and so forth. So, what Claude started to do was list my tables: I asked for a view, so it's reviewing the schema I already have implemented. And now it's going to run the apply migration tool to create the view — it's basically doing a schema change on my database. If we inspect the view, it's doing a create or replace view, department_stats, the name we gave it, fetching all the information from — I think department, no, from profiles, exactly — and grouping by department. Okay, it made a mistake; it's going to try again. Okay, now it's going to test it — that's actually something I really like about it. All right, here's our view, on the database. We currently have it in the database; let's see if it's also enabled in the app. Okay, it's not. Then let's quickly — I've created SQL; what's the name I gave it? This is essentially the problem with live demos: it usually doesn't go well on the first try. I'm going to repeat it and see if it implements it. If not, we can just run the SQL query for you to see, as different users, whether everything is working accordingly.
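For reference, the migration the agent generated here was along these lines. This is a reconstruction from the demo, not the exact output — the column names (`headcount`, `avg_salary`) and the `salary` column on `profiles` are assumptions:

```sql
-- Naive version: aggregates per department, but a plain view like this
-- runs with the view owner's privileges by default, which (as we'll see)
-- silently bypasses row-level security on the underlying table.
create or replace view department_stats as
select
  department,
  count(*)           as headcount,
  round(avg(salary)) as avg_salary
from profiles
group by department;
```

The view works, in the sense that it returns the right numbers — the problem only shows up when different users query it.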
For now, it needs to implement this in the Next.js application so we have a nice interface to check the results. I need to approve everything — wait, let me just put it in auto mode so I can continue talking. So essentially, the agent created the view, tested it, and said everything is working accordingly: the feature was implemented, all good. But we're actually going to check whether everything really is good or not. Let's give it some space — no pressure to create the feature — and wait a bit more. In the meantime, if you're following along, you can also play with it: change the layout, actually, using Claude Code. Just doing a brief survey here during the workshop: how many of you are using Claude Code as well? Oh, almost everyone. Okay. How many of you are using Cursor, with Claude Code or with the plugin? Okay, at least one person. We're going to have some Cursor folks here, I think from Anthropic as well; OpenAI is going to be here, and Gemini, of course — Google DeepMind is sponsoring the event. So we're basically going to have the whole gang here. Okay. So — I'm trusting its word. It says the app should now correctly display the department stats view. Let's see if that's actually true. It looks like it, yeah. So we now have these cards with the whole view of the company. I'm logged in as Julia from HR. We can see that we have five people on the engineering team with an average salary of 107K; HR has only one person, which would be Julia; and product has four people with that average salary. So far so good — looks okay. But wait: this is sensitive information, right, the reports?
We're expecting that the other employees will not have access to it, and even the managers only to their own departments. Let's see if that's the case. Let's navigate to Bob. Bob is the head of engineering. Oh, okay — so Bob can also see the information for both HR and product. Well, it's not that bad, right? It's not ideal, but at least he's a manager, so he should have access to privileged information anyway. And who doesn't like a transparent company? Let's see if Alice is okay. Okay — this is problematic. So: we created a view, and Claude said everything was working because, as you can see, the information is there; the view was created. But it missed something — something its training data basically misses — that is specific to Postgres: what happens when you create a new view over a table that has row-level security enabled. For those of you who don't know, row-level security (RLS) lets you define who can see the information in a specific row, at the database level. Without trusting the application, you can filter directly in the database. In this case, we should be limiting the visibility of the rows by user ID and user role: if a user has the employee role, they should not have access to rows that don't belong to them. And we do have row-level security enabled: if you navigate to our Supabase migrations, you can see that we have RLS enabled both on profiles and on performance reviews. And the performance-reviews policy should be about right: we have reviewer ID equal to the current setting. So it should work. Why is it not working?
Well, when you create a view in Postgres, by default it is created with the permissions — the credentials — of the user who created the view, not with the credentials of the user querying the underlying table. So by default, the view bypasses the row-level security you may already have in place on your table. For RLS to apply, we have to use the security_invoker flag, which transfers the row-level security policies — enables the RLS policies — on the view itself. That's why currently everyone can see everything: the row-level security policies were basically bypassed by the view. For the sake of this workshop's demonstration, I've already prepared a skill for the presentation. Essentially, the skill is three main security points about Postgres that the agent should be aware of. For this one specifically, I actually overfit it to the exact view we're creating — but models right now are smart enough to generalize this, so if I wanted to create a different view, it would still know that it has to create it with this flag.
Since Postgres version 15, this flag has been available, and whenever it's enabled, the RLS policies also apply to the view. As you can see, it's actually a quite human-readable document. Most of you have already written skills, so I'm not going to dive deep into this. But as you can see, we have the title — I called it supabase-security — and the description, which uses the verb "use". This is an insight I got from some experiments I did: using verbs, mainly the verb "use", increases the chances of the skill being loaded, at least with Claude. I don't know if this is default behavior for Claude, whether it was trained to more easily recognize verbs, but I found it more effective to write "use" followed by the whole purpose of the skill, and then a regular markdown list. We have the view case there, but also another checklist of security points for RLS: public schemas should have RLS enabled by default. Public, or exposed, schemas are the database schemas that provide information to the application that the user can see. For example, the users table, the profiles, the performance reviews: all this information is going to be fetched by the front end. It's completely secure, because Supabase makes it secure while allowing you to fetch information from the front end. But the key part is: if you don't enable row-level security, you won't have this filter on the table, and you'll have to rely on application logic to do the filtering. Enabling row-level security at least makes it safer for you, as the backend engineer, in that you only expose the information you actually want from the start. And then there are a couple more things I'm not going into.
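The fix the skill encodes can be sketched in SQL like this — again assuming the hypothetical department_stats view and column names from earlier; it requires Postgres 15 or later:

```sql
-- security_invoker makes the view run with the querying user's
-- privileges, so the RLS policies on profiles apply to the view too.
create or replace view department_stats
with (security_invoker = true) as
select
  department,
  count(*)           as headcount,
  round(avg(salary)) as avg_salary
from profiles
group by department;
```

The only difference from the naive version is the `with (security_invoker = true)` clause, which is exactly the kind of detail a skill is good at pinning down for the agent.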
We can install this skill in this project by running — where do I have the command? npx — yeah. I'll be using Vercel's npm package called skills. Curious to know how you've all been packaging your skills. Have you ever used this package? Are you using plugins? [Audience: This one.] This one, mainly? Yeah, it became very popular a few months ago. [Audience: I think the only problem is it doesn't really adhere to your project, so you get it globally only, not for your local project.] Actually, you can install it both globally and in your project, and it also has support for multiple agents, while plugins, for now, are still tied to the agent that's going to load them. Cursor has plugins, Claude Code has plugins — I think other vendors do as well — but they're distributed and made specifically for those agents. So we're using this package to install. You can install any skill from an online repo that has a SKILL.md file, or you can use it to install one locally; it will auto-detect the location you're trying to fetch from based on the format. In this case, there's no GitHub or HTTP protocol in the path, just a dot-slash, so it will recognize that it's a local one. For this, I'm going to move to the main folder — yeah, okay — and, the good old-fashioned way, run it in bash. It's going to pop this up and ask me which agent I want to install it for. I'm using Claude Code, so I'm going to install it for Claude Code. If you're using any other agent harness, you can install it there too, as long as it's supported. I'm going to install it at the project level, so in this case it's going to create an agents folder with the skill and link it into my .claude/skills folder as well, so Claude knows where to find them.
It's symlinked, and we're ready to install. Let's not expose my key — I'll delete it; this is just for the workshop, so feel free to use my free credits for the time being. Essentially, it created the agents folder — where is it? Yeah. I also have some more things in there that we're going to see later, but the essential part is that it has the skill I showed previously — there it is — and it also created a symlink, a symbolic link, into .claude. That's how the package works: it allows Claude to find skills either under agents, which is becoming the standard, or in its own .claude folder. So let's run the same prompt again in a new session. Let me go back to the demo app — yeah — and start a new session. We should have this one enabled. Yeah, there it is: Claude is aware of the supabase-security skill. Now, to get skills to run, you can either just run your prompt and pray that Claude imports your skill based on the description you gave; or you can include the keyword "use" plus the name of the skill in the prompt, which will load your skill almost 100% of the time; or, if you're using Claude Code, you can just type slash and the name of your skill, which 100% guarantees that Claude will import the skill. For our use case — for the presentation — I'm doing the latter, because I cannot afford for it not to load the skill. Let's wait — I need to reset the database to create the view again. The command is npx supabase db reset — yes, I'm just resetting the database, applying the migrations from the start. Claude didn't create a migration file; it applied the migration directly to the database. So we should now be good to go.
So it's going to bring down the database and create a new one based on the schema we defined in the migration files and the seed data.
>> Have you found ways to make sure the agent actually loads the skills?
>> Yeah, that's a fair point. The observation is that the initial promise of skills, as Anthropic presented them, was that the agent loads them on its own. Since this is on the agent side — the agent decides when to load the skill — the best thing you can do, short of explicitly invoking it with the slash command or with "use" plus the skill name in your prompt, is to play around with the description and run a bunch of tests, manually or automatically, to check what actually works for the cases where you expect the agent to behave a certain way. You define a set of scenarios where you think the skill should be loaded and where it shouldn't, then test them: for a scenario where you don't want the skill loaded, run the prompt in Claude Code, say, and check through the CLI whether the skill was loaded or not. Then iterate on the description to see what works. Without explicitly calling the skill, that's the best way to test whether it's being loaded correctly. We're still at a very early stage of skills — of all this agent stuff, even MCP; it's fairly new. We're still standardizing things, still figuring out what works and what doesn't. Progressive disclosure was something no one was talking about six months ago, and now it's fair to say it's one of the north stars of agent development. Six months from now it could be something else.
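The description-tuning loop described here can be framed as a small scoring harness: a list of (prompt, should-load) scenarios scored against whatever tells you if the skill loaded. The loader is injected so the sketch stays self-contained; in practice it would shell out to a headless agent run and inspect the transcript.

```python
def score_description(scenarios, was_loaded):
    """Score a skill description against load/no-load scenarios.
    scenarios: list of (prompt, should_load) pairs.
    was_loaded(prompt): reports whether the agent loaded the skill for
    that prompt -- injected here; a real harness would run the agent
    headless and check its transcript."""
    hits = sum(1 for prompt, should in scenarios
               if was_loaded(prompt) == should)
    return hits / len(scenarios)
```

Iterate on the description, re-run, and keep whichever wording pushes the score toward 1.0 on both the should-load and should-not-load sides.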
Skills could become the standard, or maybe Anthropic or OpenAI or someone else will find a more efficient way to manage context or provide more context to the agent. We'll see, basically. All right — the database was reset. So at least now we have the view, but we don't have the information in your database. Now we should be able to run the same prompt again, but with the skill. If we hit the prompt — you're saying it was quite fast; I don't think — yeah, but it didn't create one. OK, let me try another thing: instead of this, let's use "create". Let's see if it works now. Yeah — it loaded the skill. So now it should at least have the context that the RLS — the security_invoker flag — should be included when creating the view, and the rest of the workflow should remain the same: it will list my tables. Right, exactly — identify the tables. And if we look closely, we can now see the flag here, in the migration. Let's see if, with the flag, this is the expected result. This is what happens when you vibe-code a CLI: you now have the UI duplicated. Right, so it created the view. We should be able to see it, but Alice shouldn't. So, what's happening? Wait. Do I have to reset now? Interesting — probably. Let me just see if I have it here — where did I put it? I'm going to cheat here and say that both of them, including the employee, should be able to see the information. OK, basically live troubleshooting. What's not working probably comes from a different policy I've defined here. Now it's going to troubleshoot; let's see if the skill actually improves the effort here. If not, I have something up my sleeve.
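For context on the flag being discussed: a Postgres view normally executes with its owner's privileges, which can bypass row-level security on the underlying tables, while WITH (security_invoker = true) makes the querying user's RLS policies apply. A sketch with stand-in table and view names (not the demo's actual schema), plus the kind of deterministic check an eval can make on the produced migration:

```python
# Without security_invoker, a Postgres view runs with its owner's
# privileges and can leak rows that RLS would otherwise hide; with the
# flag, the querying user's RLS policies apply. (Names are stand-ins.)
VIEW_WITHOUT_FLAG = """
create view employee_directory as
  select id, name, salary from employees;
"""

VIEW_WITH_FLAG = """
create view employee_directory
  with (security_invoker = true) as
  select id, name, salary from employees;
"""

def respects_rls(migration_sql: str) -> bool:
    """Deterministic check: does the migration set the flag the
    skill is supposed to enforce?"""
    return "security_invoker" in migration_sql.lower()
```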
Because, if you're not aware, Supabase has database advisors that you can use to identify potential vulnerabilities early on — schemas or information that might be exposed — before you go into production. So if it can't figure this out by itself, I'm going to include in the skill an instruction to also run the advisors as a check. This is the main point of skills — well, it's a very poorly written application, let me say that — but the main point isn't whether this specific demo works or not; it's that the behavior changed once it loaded the skill, right? It created the view with the security_invoker part. And that shows how powerful it is: you can change the behavior, or guide the agent on demand, based on the information you put in. You can think of SKILL.md files as prompt templates that you give to your agent. So, let's just quickly troubleshoot. Oh, it's even offering to apply a migration. Let's see if it doesn't break my app. All right, it seems too complicated.
>> [inaudible]
>> I have a fair amount of skills — as you can see, I've been playing around with them. I also have some of the pre-installed MCP servers that Supabase enables. But essentially, it would be more interesting if I compared the context from before and after loading the skill. Right now, skills take about 1.3k tokens of my context.
As you saw, I have more than just this one skill, but this skill was loaded, so the whole content of its SKILL.md went into context. If we clear and run the context check again — it's hard to see at this size, but the skills take quite a bit less space than MCP would. All right, OK — I have a newer version of Claude Code. For those of you who aren't aware, Anthropic recently released the tool search tool, which is a mechanism for Claude Code to load tools on demand, so it doesn't load everything — basically progressive disclosure, but for MCP tools. The main difference between this tool search tool in Claude Code and skills is that for skills, progressive disclosure is built in by design — it's baked into the structure of the skill — while for MCP it's still not a standard across all clients. It works for Claude Code, but many other clients will just load all tools straight into your context. So for now this is a Claude Code-only thing. If you're interested in it, one of the co-founders of MCP is speaking on the 10th, on Friday; he's going to give a brief overview of the MCP roadmap, and — if nothing has changed since last week, when he presented it in New York at the MCP Dev Summit — it should bring this progressive disclosure idea for tools into the protocol itself.
>> Yes. Let's say we have a very large database, and we have to load the schema of this database into context because we have to query the database using agents. In your opinion, is it better to use a skill or an MCP, for example, to load this schema — but progressively?
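A rough way to see why progressive disclosure saves context: every skill or tool pays only for its short description up front, and full bodies are charged on demand. A toy accounting sketch — the four-characters-per-token estimate is a crude assumption, not how any real tokenizer works:

```python
def context_cost(items, loaded, toks=lambda s: max(1, len(s) // 4)):
    """Toy context accounting for progressive disclosure. items is a
    list of (description, body) pairs; loaded is the set of indices
    whose full body was pulled in. Every item always pays for its short
    description; only loaded items pay for the body."""
    cost = sum(toks(desc) for desc, _ in items)
    cost += sum(toks(body) for i, (_, body) in enumerate(items)
                if i in loaded)
    return cost
```

An eager MCP client behaves like loaded covering everything; skills (and the tool search tool) behave like loaded covering only what this session actually needed.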
>> Is it possible to use a skill to progressively load the schema of this big database, in your experience?
>> Yeah. OK, so is your question more about how we should access it, or about the whole architecture of the pipeline to import the data?
>> I just want to ask an agent to query the database, and obviously the agent must know the schema of the database first, or not. How can you teach the agent to query the database — using skills, using an MCP server, or something like that? And if you decide to use skills to load the agent's context with the schema of the database, is it possible to progressively load the schema into the context?
>> OK, gotcha. Let me break down the situation for you here. You'll have essentially two parts. One is what's going to be in the context — what gets loaded, the specific information you want to have in your scenario. The second part is the actual extraction mechanism you're going to use to load the information from the database. For that second part, you can either use a script — a skill that invokes a script — or an MCP tool. I would advise using an MCP tool, because it works in production or on a remote project: you don't rely on your local environment, you don't have to manage the keys, the tool is already standardized, and authentication is baked into the protocol, so the agent never handles the authentication token — it just runs the tool. And for it to progressively disclose the information in the database, you can include that in a skill. You'd be using the MCP tool:
in the skill, you'd state "use this tool to load the schema," and in the tool implementation you'd enable it to load progressively — in chunks — rather than all at once. It might even be enough to expose this through the tool parameters: if you add a parameter called buffer, for example, the agent should figure out by itself that it can load in chunks instead of the whole table. But if you want to be 100% sure it loads in chunks and uses them properly, I would also package it with a skill and describe how I intend the tool to be used. This is actually how skills and MCP play together: the tool enables the connection — the integration — and the skill describes how to use it. That's how I would implement this type of system. Thank you for the question; it gave me the opportunity to talk about using skills and MCP together rather than pitting them against each other. So now, as I promised, we should be moving on. I'll have to give it more time to figure things out, because while preparing the workshop I gave the app a bunch of vulnerabilities; if I had kept it simple with just that one, the demo would probably have worked. Since I have more vulnerabilities exposed, if I had time I would try to solve it — it didn't work for the moment — but you saw that in both scenarios, the first one didn't have the security_invoker flag and the second one did. So at least we can infer that the skill was doing something: the agent saw the information in the skill, merged it with the system prompt — or stored it near the system prompt — and changed its behavior accordingly. Now, to test this: say you want to move this skill into production. It works on your machine.
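The chunked loading suggested here can be sketched as a simple paginator: the hypothetical buffer parameter controls how many tables the tool returns per call, so the agent can fetch on demand and stop once it has what it needs.

```python
def schema_chunks(tables, buffer=3):
    """Paginate a schema instead of dumping it whole: yield `buffer`
    tables per call so the agent can fetch chunks on demand and stop
    early. (`buffer` mirrors the hypothetical tool parameter above.)"""
    for i in range(0, len(tables), buffer):
        yield tables[i:i + buffer]
```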
It's a tale as old as time that it works on my machine — but I don't know if it's going to work on your agent, on your machine, in your environment. So, to test this — and to automate the testing, which unlocks having a pipeline: if you change one thing in your skill, how can you reliably tell that it keeps doing what you expect and didn't break the previous flow? If I change one item in the checklist, how can I ensure the other ones still work? This is where evals step in. Evaluation is a very broad term, and since this is a markdown file — a free-text file — you can evaluate basically anything. The most difficult part of creating evals, I would say, is actually coming up with the scenarios, because you first have to know what the expected behavior of your agent is. Coming up with representative, actually good scenarios that cover a fair amount of the use cases you want to support is the hardest part. And there's still no standardized structure for evaluations: you can test by importing a bunch of prompts and expected outputs from a CSV or JSON file, or you can use tools like Braintrust or Langfuse and get an analytics and observability layer on top. For this presentation, I followed what the Agent Skills open standard defines for designing test cases. If you're not aware of this website, it's the landing page of the Agent Skills open standard, which tries to standardize what a skill is and how it should behave. They propose a very simple, local way to test skills: you have an eval.json that essentially holds a set of evals,
an array of eval scenarios. You put in the prompt you're going to give the agent and the expected output from the agent — the latter only matters if you have an LLM as a judge, a technique used for non-deterministic evaluation: instead of a human, you give the outputs of an evaluation run to another LLM, define a success criterion, and let the LLM whose role is to judge — that's why it's called LLM-as-a-judge — assign a grade. That's the part you can automate for non-deterministic workflows: you can either assert whether a tool was called, or hand the results to an LLM and have it non-deterministically grade the other agent's performance — agents evaluating agents, basically. So I followed this structure. I gave the same input we had before. The expected output is that security_invoker is true — that it's present on the view — and then I have a bunch of assertions that, in this case, I check deterministically. I prepared a Python script that essentially just resets the state of the database: since we're running this locally and not in an isolated container — a Docker container, for example — we have to make sure the system always starts from the same ground, so I reset the app. If you want to run the evaluations as well, you have to bring your own Anthropic key; you can follow the README inside the supabase-security skill, which explains how to set this up. Then I run Claude Code on it.
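The eval.json structure being described might look roughly like the following — the field names here are assumptions, not the exact Agent Skills schema — together with the deterministic side of the check: each assertion is verified against the artifact the agent produced, and anything not deterministically checkable would be left to an LLM judge.

```python
EVAL = {
    # Shape loosely following the Agent Skills eval.json idea;
    # these field names are assumptions, not the exact schema.
    "prompt": "Create a view exposing the employee directory",
    "expected_output": "view created with security_invoker = true",
    "assertions": ["artifact_contains:security_invoker"],
}

def run_assertions(assertions, artifact):
    """Deterministic side of the pipeline: check each assertion against
    the artifact the agent produced (here, the migration SQL). Unknown
    assertion kinds stay None -- those would go to an LLM judge."""
    results = {}
    for spec in assertions:
        kind, _, needle = spec.partition(":")
        if kind == "artifact_contains":
            results[spec] = needle in artifact
        else:
            results[spec] = None
    return results
```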
I think it's called print mode — I can't remember exactly what they call it — but essentially I run it headless, as a binary. The agent receives the prompt from the evaluation as the task to perform, and I also pass the condition: we're going to test two conditions, one with the skill and one without. So you can see, in the condition run — this is where Claude Code runs — if the condition is "with skill," we load the SKILL.md into the system prompt. If you actually wanted to mimic the real behavior, you would run this in a Docker container, put the agent skills in the .claude/skills directory inside the container, and let Claude Code find and use them organically. For this presentation it's a very simple setup: I just append it to the system prompt. So, we're going to run the evaluations. Do I have the other one? Yes, I do. OK. I think we run it on the base — how is it not finding the supabase-security skill? No — oh, wait, I know what's going on: I have the wrong name. Changed it. All right, so it started by running with the skill, so the first result we should get is the with-skill one. It stopped; now it's running without it, and then we're going to compare them. This outputs a workspace iteration-one folder where we can compare the output with the skill and without it. While the without-skill run is loading, let's just quickly inspect what the with-skill output gave. Essentially, you can see that it created the view with security_invoker, and then we have this grading.json file with a bunch of information, like the assertions we put in the eval.json.
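The two-condition run described above can be sketched like this: same prompt, with or without the skill's SKILL.md appended to the system prompt. The runner is injectable so the sketch is testable offline; the default shells out to Claude Code's headless print mode, and the flag names are assumed — worth verifying against claude --help.

```python
import subprocess

def run_condition(prompt, skill_md=None, runner=None):
    """Run one eval condition: the same prompt, with the skill's
    SKILL.md appended to the system prompt ('with skill') or not
    ('without skill'). runner is injectable so this is testable
    offline; the default shells out to Claude Code headless
    (flag names assumed -- verify against `claude --help`)."""
    if runner is None:
        def runner(prompt, system):
            cmd = ["claude", "-p", prompt]  # -p: print/headless mode
            if system:
                cmd += ["--append-system-prompt", system]
            return subprocess.run(cmd, capture_output=True,
                                  text=True).stdout
    return runner(prompt, skill_md or "")
```

Run it once per condition with the same prompt, then diff the produced migrations (or feed both to the assertion checker) to compare the with-skill and without-skill behavior.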
We have them here, and we can see that it marked this one as failing, even though it created — where is it — "view not found"? OK, I'm actually evaluating the wrong thing. The problem now is with the check itself: since I was expecting it to query the pg_class reloptions instead of just inspecting the view, it's telling me it failed. But the key part: has it finished? It's not finished — still running. Taking a long time? Could be. OK, and now we can inspect. So this is actually a good insight. With these results — this is the tricky part of writing evals. As with normal tests, the results depend on how you implement them, right? It's just code. So if you're evaluating the wrong thing, or not the expected behavior, you're going to get wrong results — it might not be because the system isn't working. We tested manually and saw that with the skill it created the view with the security flag; we can actually inspect it here — with the skill, it created it. Let's see if on this one — surprisingly, this time it did; that's the non-deterministic behavior of Claude. But since I was evaluating the wrong thing — expecting it to inspect a meta-schema to check whether the security_invoker was there, instead of just inspecting the view directly — the results came out a bit off. It said that with the skill it failed and without the skill it passed. And if we inspect both outputs, they're basically the same.
So this is just to show you how tricky it is to write evals: although this can happen with regular tests too, there it's easier to catch because the output is deterministic — it's just code. Here, if you're handing things to an LLM to evaluate, it can sometimes hallucinate. To finish, because we're almost running out of time, let me sum up the structure. This is the one they recommend, and I find it very easy to implement and get started with. Later on you can move to more complex evaluation scenarios, like running in Docker or in a sandbox, to guarantee that you get a fresh environment with just the one skill you're testing. But essentially: you set up two conditions, with and without the skill; run them on the agent harness you like; and compare the results. That's your very first evaluation pipeline for testing a skill automatically. From my end, that's all. I hope you found this workshop useful for getting your skills leveled up and ready for production. As I said at the beginning, I'm giving a keynote — no, a talk — tomorrow about how we implemented and created the Supabase skill for the product itself: how we're keeping it maintainable while ensuring it provides value, and how we're now testing it in production. Thank you. Does anyone have any doubts or questions? I'll also be—
>> Yeah, I have a question about the number of skills that you typically install in your environment, because with this progressive disclosure it seems like we can basically keep adding different skills and the agent will automatically find them. Do you have any recommendation on how many skills to have? Is there any limit, or should we just keep adding and it will magically work?
>> Yeah. I'm probably not the best person to talk about this, because it's easy to fall into that rabbit hole, especially when you're experimenting — as you saw, I had plenty of them installed globally, and I think it's fair to say I don't use them all on a daily basis. But it depends. If you're using them on your local machine, it's pretty easy to end up with a messy environment where you have all of them, or most of them, installed. For your local environment, for now — since it's all very experimental, and this is my personal opinion — I wouldn't constrain yourself on space management or context management. Progressive disclosure is a very powerful thing you can exploit here: sure, if you have skills you don't use, they'll take up some of your context window, but the descriptions are so small that you can afford not to delete them if you don't want to. In production, treat them as you would any artifact in your CI: keep it clean. In production — in your CI — I would keep only the exact skills you're using in that specific case. Another piece of advice on the production side: it's now more and more common to export skills, or make skills available in your repos, as another piece of documentation. So treat skills that you put into production as actual documentation: it's important to keep them updated — include the update workflow in your CLAUDE.md or AGENTS.md — so you make sure that if a feature or workflow changes, you'll change the skill as well, just as you would with the documentation.
From time to time you can also create a job to check that the skill still describes a valid workflow, and somehow track whether the skill has been loaded by your users — if it hasn't been loaded for a long time, does it still make sense to keep it there? So yeah, that's basically the advice I can give you for skills in production based on my experience; for the rest, you'll have to come to the talk tomorrow to learn how we're putting it into production at Supabase. Any more questions? I'm going to be around throughout the whole event, so if we cross paths, feel free to ask me anything. Tell me about what you're building — I'd love to see if it's with Supabase; even more thrilled to hear about it. From my end, once again, thank you very much. You've been lovely for 9:00 a.m. — pretty cool, good energy. So, from my end, enjoy the rest of the conference and we'll see you around. Thank you.