Replacing 12K LoC with a 200 LoC Skill — David Gomes, Cursor
Channel: aiDotEngineer
Published at: 2026-04-30
YouTube video id: WE_Gnowy3uw
Source: https://www.youtube.com/watch?v=WE_Gnowy3uw
Hi everyone. How are you all doing? Thank you for coming today. I'm going to be talking about how markdown is basically the new code. As TJA has already previewed, we recently replaced a lot of code in the Cursor application with just markdown, just a skill. In today's talk I'm going to share the journey of going from a full-blown feature, with a lot of code, a lot of dependencies, a lot of complexity, and tests, into a much more lightweight, slimmed-down version of effectively the same feature, but with just a single skill. Before I start, though, I have to give you a little recap of git worktrees and how they work in Cursor. If you haven't heard of worktrees in git, they're effectively separate checkouts (and apologies for the wide screen), separate checkouts of your repo that allow you to work in parallel. Different agents can be working on the same task, or on different tasks, at the same time without interfering with each other. If you've never used this feature in Cursor, the way it works is that you can spin up an agent on an individual worktree, and you will see, for example, the same file in two different worktrees, looking different because the agent is doing its work in the worktree, not in your primary checkout. Anytime the agent runs commands or lints, everything it does is isolated and scoped to that git worktree. With this feature you can also work in parallel on the same screen; you can have these grids of agents working for you. And if you say, "Hey, open a PR," the agent will open a pull request from that worktree with the changes it produced inside that worktree. One of the coolest things about this feature is that it allows you to give the same task to different models at the same time and then compare what the different models do on the same prompt.
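For anyone who hasn't used the underlying git primitive, here is a minimal command-line sketch of what a worktree is. The repo and branch names are just illustrative, not what Cursor uses:

```shell
# Minimal demo of the git worktree primitive (illustrative names, not
# Cursor's). Each worktree is a separate checkout of the same repo, so
# an agent can work there without touching your primary checkout.
set -eu
tmp=$(mktemp -d)
git init -q "$tmp/repo" && cd "$tmp/repo"
git -c user.email=a@example.com -c user.name=demo \
    commit -q --allow-empty -m "initial commit"

# Spin up an isolated checkout on its own branch for an agent:
git worktree add ../agent-1 -b agent-1

# Edits, commands, and lints run inside ../agent-1 are scoped to it.
git worktree list                  # shows both checkouts

# Cleanup (the part users used to accumulate hundreds of):
git worktree remove ../agent-1
git branch -q -D agent-1
```

Each worktree shares the repo's object database, so it is cheaper than a full clone, but as the talk notes later, creating one still takes time and disk space.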
So if you haven't heard of this, we call it best-of-N, and it's effectively a way for you to have different models compete on the same task. You can even preview the changes if it's a front-end project you're working on: you can compare all the different visual implementations and then choose the one you prefer. Now, if you've never heard of anything I'm talking about today, it all came out around October of last year, alongside Cursor 2.0. When we initially shipped it, it came with a lot of complexity. We had to write all the code for creating worktrees, managing those worktrees, and feeding them into the agent as context. We also had to make sure the agents were scoped and isolated and could not escape the worktree they were working in. We also have something called setup scripts, which users can configure and have Cursor run anytime an agent starts operating on a given worktree. Then there's the judging. I didn't show you this before, but there's a little thumbs-up icon on one of the models: that's a judge we run that tells you which implementation looks the best based on different criteria. We also had to make some changes to the harness and introduce some system reminders to help the agent stay on track in these worktrees. And finally, there's some cleanup complexity as well, because people like to spin up hundreds of these worktrees, their disk usage blows up, and we have to help them by cleaning up the worktrees that get left behind. Now, in our new implementation, the one I'm going to be talking about today, we were able to get rid of most of these things. In fact, I recently opened a PR removing this entire feature from Cursor, and it was a massive deletion of code: I think it was around 15,000 lines of code deleted.
The new implementation of the feature is almost as good as the previous one, it is much, much more lightweight for us to maintain, and it even has some benefits over the previous implementation that I'll be talking about today. So how were we able to replace an entire feature with a skill? We decided there are two primitives we could use to let Cursor users work with worktrees: one is agent skills, and the other is subagents. Both are existing Cursor features, and you can learn more about them in our docs; we have a page for skills and a page for subagents. We realized that if we put these two things together, we could basically reimplement both the Cursor worktrees feature and the Cursor best-of-N feature with just markdown. And this is a little video of how it works. As a user, I can now say /worktree and then give it some task. I'll say, "Fix a typo in the footer of the website," and this agent will run in an isolated worktree and do its work there. The way the skill is written is actually really simple. I can show you most of it. It doesn't fit on the screen, but it's basically a set of instructions telling the model how to create worktrees, how to run any setup scripts the user might have configured, and then to stay on that checkout. We want to make sure that when the agent is operating on a worktree, it stays in that checkout. The best-of-N skill is very similar. It's actually even smaller; the entire skill fits on the screen here with a small font. What we're doing here is instructing the parent agent to create subagents for each model, and to have each subagent create its own worktree and work inside that worktree. Then we also tell it to wait for all the subagents, and when they're done, to provide some commentary.
Please let the user know what the different implementations by the different subagents look like. Maybe grade them, maybe offer some criticism, maybe help the user choose which one is the best, and present all of that in some nice table format. But again, it's only around 40 lines, and it's all markdown; it's not even code. The previous version of this was maybe 4,000 lines of code. One consideration we had to handle in the skill is that it must be cross-platform: we have Windows-specific instructions, and we have Linux and macOS instructions as well. We also instruct the parent model to run the setup scripts the user might have configured for each worktree. And then there's the hardest part, which we'll spend a bit of time on in today's talk: we have to instruct the model to stay in that worktree. We have to really say, "Hey, do not ever work outside this directory, and do not ever escape," and we do that with some aggressive prompting, effectively. So the new commands are /worktree and /best-of-n, which start an agent in an isolated worktree and start multiple agents on the same task, respectively. We also have apply worktree, to bring changes over from the side worktree into your primary checkout, and delete worktree, which does what you would expect. A little note: these are not actually skills in Cursor, they're commands. But the way commands work in Cursor is extremely similar to how skills work, in that the prompts only get loaded into the context if the user chooses to load them. The only reason we did these as commands and not as skills is so that their prompts can be controlled on our servers, in our back end. This means I can iterate on these prompts without you having to update your Cursor version.
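To give a concrete sense of what "just markdown" means here, this is a hypothetical, heavily abbreviated sketch of what skills like these could look like. It is illustrative only, not Cursor's actual prompts; the command names, steps, and wording are assumptions based on the behavior described in the talk:

```markdown
# /worktree (hypothetical sketch, not the actual prompt)

When the user invokes this command:

1. Create an isolated checkout on a fresh branch, e.g.
   `git worktree add ../<task-name> -b <task-name>`.
   On Windows, use backslash paths and PowerShell-compatible commands.
2. If the user has configured setup scripts, run them inside the new
   worktree before doing any work.
3. Do ALL work inside that worktree. Never read, edit, or run commands
   against the primary checkout. If unsure which checkout you are in,
   run `git rev-parse --show-toplevel` and verify before proceeding.
4. If asked to open a PR, push the worktree's branch and open the pull
   request from that branch.

# /best-of-n (equally hypothetical)

1. For each model the user listed, spawn a subagent on that model.
2. Have each subagent create its own worktree (as above) and complete
   the task entirely inside it.
3. Wait for all subagents to finish.
4. Compare the implementations: note similarities and differences,
   offer criticism, and help the user pick one. Present the comparison
   in a table.
```

The point of the sketch is the shape, not the exact wording: plain instructions, a couple of platform branches, and one loudly repeated constraint about staying inside the worktree.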
If I make improvements to these prompts, the next time you use them you'll get the latest version. But effectively, they work like skills. This is a demo of the best-of-N skill, or command, where I'm giving the same task to Kimi, Grok, Composer, GPT, and Opus. What you'll see is that the parent agent starts by spinning up five subagents on the five different models I specified. Each one gets its own worktree, and each has its own context. Opus takes a little longer, as expected, and at the end the parent model, as instructed, does that comparison across all the different subagents. It'll say these two models did basically the same thing, and this one did something none of the others did. You can even talk to the parent agent and say, "Oh, I like this part that Opus did, and I like this part that GPT did. Can you merge them together?" and the parent agent will do that for you. So let's talk about some of the pros of the new implementation, and then I'll talk about the cons, some of the things we lost with this refactor. The main pro of reimplementing this entire feature as a skill is that I have a lot less code to maintain. Selfishly, I'm going to be spending a lot less time maintaining this feature. And this is an advanced feature; we're not talking about something used by 90% of Cursor's users, far from it. Worktrees are kind of an advanced thing, so only the Cursor power users who love parallelizing and having these grids of agents are using them. It's not the kind of feature where we want to be spending a lot of time on maintenance. Another advantage is that our users can now switch into a worktree halfway through a chat, which was not possible before; we didn't want to pollute the prompt UI too much with all these dropdowns and settings.
Now that it's just a slash command, it's much easier for users to switch to a worktree halfway through a chat. They can start talking about something, and then if they decide they want to work off to the side, they can do that with /worktree. Another big advantage is that the previous implementation did not work if you were working on multiple repos at the same time. It's very common to have a multi-repo setup where, say, your front end and your back end are separate repos. In the past, you could not use worktrees in this kind of setup; it was just disabled. With the new /worktree command, everything works fine: the agent will create a worktree in each repo, and if you open a PR, it'll open two PRs, one for each repo. It works quite well. Another advantage of the new skill implementation is that the judging experience at the end, knowing which model did what for best-of-N, is far superior. The parent now has a lot more context about what each of the subagents did, and the user can even ask the agent to stitch together pieces and bits from the different implementations, which was not possible before. In the previous implementation, you had to choose one subagent, one model, and just stick with it. Now let's talk about some of the cons. If you're curious, we have a forums link here where we're getting some mixed feedback on the new implementation. Some people were really accustomed to how the feature used to work, and you can go see that not everyone is happy with the change, at least for now. But we're tracking it. What are the problems? Number one, it's very hard for the agent to stay on track. With our previous approach, the agent had to stay on track: we never let the model touch any files outside its worktree. It was physically impossible for it to do so. Now we're trusting the model.
So you could say it's a bit vibes-based, because we're basically saying, "Hey, operate on this directory," and then, knock on wood, hoping it doesn't forget. Especially over long sessions, it's quite possible that the model will forget where it should be operating, and sometimes these models, especially the worse ones, will hallucinate or go a bit haywire and start doing things they shouldn't. But we're working on this. Another con is that it feels slower, because you're watching the agent create the worktree right there in your chat. It's not actually slower, but it does feel like the agent is wasting time doing something that should have been done for it in advance. We're looking at some improvements here too. And finally, it's much harder to find the feature now. Before, whenever you opened Cursor, you had a dropdown asking whether you wanted to run this task locally, in the cloud, or in a worktree. That entire dropdown is gone, so if you want to use worktrees, you have to know the feature exists so you can actually type /worktree. The discoverability is worse, but as I mentioned, this is an advanced power-user feature, and we're okay with it being less discoverable in general. So, how can we make this skill better? As I mentioned, the biggest problem right now is that the agent doesn't always stay on track. There are two ways we're going to improve this. One is with evals, and then using those evals to improve the prompts; the other is through RL and training. At Cursor, we train our own model called Composer, and for Composer 2, the latest version of this model, we didn't have any RL tasks with these prompts.
We didn't have any tasks, out of the many thousands of tasks we use for RL, actually operating in this type of environment. So we're working on adding a bunch of these tasks to our RL pipeline, so that by the time we launch Composer 3 or 4 or 5, at least our own model will be much better at this. Obviously we cannot improve the models that other companies develop, but we've been sharing feedback with all the other labs and model providers on this kind of thing. As for evals, I've been working on some for this feature, and I'm fairly early in my journey of writing evals, so I was very surprised: if you use something like Braintrust (and shout out to Braintrust, they've been super helpful), writing these kinds of evals is actually super easy. You don't have to know almost anything about evals; you can just prompt the agent and it'll do everything for you. Effectively, what I'm doing is spinning up the Cursor CLI, which is headless, so it's great for evals. Then I have two scorers: one that checks whether the model did work in its worktree, as expected, and another that is the reverse: did the model do any work in the primary checkout, where it shouldn't be doing any work? So far, my evals are pretty simple, so I haven't been able to simulate the extremely long sessions where models start performing worse. But even so, I've already learned that not all models are equally good at this. For example, Haiku, which is a smaller, less intelligent model, will very often deviate and start working in the primary checkout. But the other models I've been testing, such as Composer and Grok, are doing much better. I still have to improve these evals a lot and make them more complicated.
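As a sketch of those two scorers (an illustration of the idea, not Cursor's actual eval code and not the Braintrust API), after the headless agent run finishes, pass/fail mostly reduces to "where did the changes land?". The paths here are assumptions:

```shell
# Hypothetical scorers for the worktree evals. After the agent run,
# check each checkout for uncommitted changes. Paths are illustrative.
worktree_dir=${1:-../agent-1}   # where the agent SHOULD have worked
primary_dir=${2:-.}             # where it should NOT have worked

dirty() {
  # Non-empty porcelain output means changes (staged, unstaged, or
  # untracked files) exist in that checkout.
  test -n "$(git -C "$1" status --porcelain 2>/dev/null)"
}

if dirty "$worktree_dir"; then
  echo "scorer 1 PASS: agent did its work in the worktree"
else
  echo "scorer 1 FAIL: no changes found in the worktree"
fi

if dirty "$primary_dir"; then
  echo "scorer 2 FAIL: agent escaped into the primary checkout"
else
  echo "scorer 2 PASS: primary checkout untouched"
fi
```

A real eval harness would run the agent first and feed these results back as scores, but the core signal really is this simple: dirty worktree good, dirty primary checkout bad.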
But the hope is that as soon as I start to find patterns here, I can actually go and improve the prompts. Another thing we can do is add better system reminders for the models, instructing them to stay on track and not deviate from the worktree they're supposed to be working in. Okay, so what's next? The first thing is that we're actually going to take a small step back here: we're going to build a much more complete and native worktrees implementation in the new Cursor agent window. If you've been following along, we recently announced Cursor 3.0. Part of 3.0 is a more agentic interface for coding, where you can still edit and see code, but the UI and UX are much more optimized around the agent and the chat interface. We believe this kind of interface is the right place for a proper worktrees implementation: the kind of person most likely to be doing a bunch of local parallelization is usually the same type of person most likely to use this type of UI. So we're taking a small step back there and building a proper worktrees implementation in the new UI that is more native, not just a skill. We're also improving the skills, as I mentioned, through continued work on evals, RL, and other training work. And finally, we're looking into other parallelization primitives that are not git worktrees. If you've used git worktrees, you might know that they can be a bit slow to create, they use up a lot of disk space on your computer, and they only work in git repos. So if you're using something other than git, there's really no local parallelization primitive in Cursor. In the near future we hope to share more about this, but we're looking into other solutions for local parallelization that don't involve git or git worktrees. So stay tuned for that.
Thank you all for coming to the talk today. I'm sure many of you have questions, and I'm going to be around all day, so feel free to grab me anytime. I'm happy to chat with anyone. Thank you.