Building Cursor Composer – Lee Robinson, Cursor
Channel: aiDotEngineer
Published at: 2025-12-02
YouTube video id: fL1iJHtl51Q
Source: https://www.youtube.com/watch?v=fL1iJHtl51Q
It's great to be back in New York, and I'm very excited to be here and talk on behalf of all of our engineering and research teams at Cursor about building Cursor Composer, our first agent model. My colleague Sasha actually gave a version of this talk recently, so I'm excited to give my own take on it.

Cursor Composer is a model designed for real-world software engineering, and it tries to be both fast and smart. As we've measured it against our own benchmarks, it's better than the best open-source models and competitive with recent frontier models, though slightly below the latest frontier of Sonnet 4.5 and GPT-5.1 Codex. Where it really shines is efficiency: it's about four times more efficient at token generation than models at a similar level of intelligence. So we're trying to combine speed with intelligence.

Why did we build this model? Cursor obviously has an IDE, so why are we getting into the model space? Our research and product teams had been building a model called Tab, which powers autocomplete; maybe some of you use that inside Cursor. We wanted to take that same very-low-latency approach and apply it to coding with agents. Honestly, we weren't sure it would work. So we started prototyping early versions of what this model could look like, put them out, and got feedback from users. We were pretty surprised that people really liked the prototype we released under the "cheetah" slug. They liked the speed, but the feedback was that it wasn't smart enough yet to be a daily driver for a lot of their coding. So we needed it to be smart and fast, and it definitely needed to be smart.

So we worked on an internal benchmark that represented usage on our own repos and how we actually build software. If we had a model that was both fast and smart, a checkpoint our developers would use every single day to build the product and all of our software, then we knew we would be on to something. One big change that pushed it to a checkpoint people would actually use was the ability to call tools in parallel and to use our semantic search tool very effectively. We'll talk about that a little more later.

If you haven't seen it, here's Cursor 2.0 in our new view, using the Composer 1 model. You'll notice it does a lot of things very quickly: it calls a bunch of tools in parallel, like grep, so it's reading a lot of files; it runs shell commands; it makes file edits; and it writes and manages a list of to-dos. You can work through tasks very quickly in the foreground; in this case, I'm investigating an issue in an open-source repo. I don't know about y'all, but this has been a quite different programming experience for me, having worked with coding agents for a while now, versus firing off an agent and waiting, let's call it 20 minutes, for it to complete while you context-switch away. This really does help keep you in the flow, and it's a different style of programming, I think.
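As an aside, here is roughly what "calling tools in parallel" can look like at the harness level. This is a minimal sketch with made-up tool functions (read_file, run_shell) and simulated latencies, not Cursor's actual agent code:

```python
import asyncio

# Hypothetical tools, standing in for the real agent's file reads and
# shell commands. The sleeps simulate I/O latency.
async def read_file(path: str) -> str:
    await asyncio.sleep(0.05)
    return f"<contents of {path}>"

async def run_shell(cmd: str) -> str:
    await asyncio.sleep(0.10)
    return f"<output of {cmd!r}>"

async def main() -> None:
    # A serial loop would pay for each call in turn; gathering them
    # means total latency is roughly that of the slowest single call.
    results = await asyncio.gather(
        read_file("src/app.py"),
        read_file("src/utils.py"),
        read_file("README.md"),
        run_shell("git log --oneline -5"),
    )
    for result in results:
        print(result)

asyncio.run(main())
```

That's the whole trick from the user's point of view: the agent fans out independent reads and commands, so the session keeps moving instead of stalling on one tool at a time.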
So I want to talk about how we did this in a way that's hopefully accessible for all of you. I'm not a machine learning researcher, but I really enjoy this stuff. I'll cover what we learned, some of the infrastructure challenges, and a little bit on where we're going.

In Cursor, a user submits a query to our backend. The agent reads that query and then decides to make a series of tool calls. Our agent has about 10 tools, give or take, but we'll focus on five here: reading files, editing files, searching your codebase, looking at lints, and running terminal or shell commands. The agent autonomously decides whether to call these serially or in parallel.

Our goal with reinforcement learning is to mirror the Cursor production environment as closely as we possibly can; with the data we have in training, we want it to look like we're serving real Cursor queries. To do that, we run a series of rollouts. In one rollout, we might call a series of tools, like reading files and editing files. When we run more rollouts from the same initial starting point, the model might call a completely different set of tools; in another rollout, it also does codebase search. We score the outputs, decide which one is better, and then update the parameters of our model based on that. Conceptually, it's a pretty simple idea. The challenges come when you take the simple idea and try to scale it up.

There are three challenges. The first is matching the training environment to the inference environment, where the model is actually used in the product. With Composer, we're training a large mixture-of-experts model parallelized across thousands of GPUs, and if we don't speed that up, it's going to take forever to train. So we want training to be really fast, and we want the training and sampling versions to match as closely as possible. The second challenge is that rollouts get pretty complex on real-world data. Models will use hundreds of thousands to millions of tokens and make hundreds of tool calls, and each rollout can take a very different amount of time: one might make a lot of tool calls, another not as many, and they'll complete at different times. We have to figure out how to deal with that. Finally, there's consistency. If we want to mimic the production Cursor environment as closely as possible, we need to use exactly the same tool formats and tool responses. But in training we have this really bursty amount of compute, where we're doing all of the training at once, which is different from production.

So it really is an infrastructure challenge: we have three machine learning challenges, and all of the solutions, coincidentally, are infrastructure problems. Let's talk through a few of these problems and how we solved them at the infrastructure layer. Our architecture is probably familiar to some of you who have been involved in this space, but I still think it's interesting to talk about at a high level. We have three kinds of servers: a trainer, which runs the standard ML stack with PyTorch; an inference server, which runs the rollouts I just talked about and is where we use Ray; and environment servers, which simulate the Cursor environment I described.
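To make that rollout-and-update loop concrete, here is a toy sketch in the spirit of group-relative policy-gradient training. The tiny linear "policy", the placeholder reward, and the one-step rollout are all invented for illustration; the real system scores full agent trajectories with hundreds of tool calls:

```python
import torch

# Toy policy: in reality a large mixture-of-experts model, here a tiny
# linear layer so the sketch runs end to end.
policy = torch.nn.Linear(8, 4)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def rollout_logprob(prompt: torch.Tensor) -> torch.Tensor:
    # Stand-in for "run the agent and sum the log-probs of everything
    # it sampled along the way (tokens and tool calls)".
    dist = torch.distributions.Categorical(logits=policy(prompt))
    action = dist.sample()
    return dist.log_prob(action)

def score_rollout() -> float:
    # Placeholder reward; in practice this judges the finished rollout
    # (did the edits accomplish the task, do checks pass, and so on).
    return float(torch.rand(()))

# Several rollouts from the same starting point, each possibly taking
# a different path through the tools.
prompt = torch.randn(8)
group = [(rollout_logprob(prompt), score_rollout()) for _ in range(8)]

# Group-relative advantages: how much better each rollout did than the
# average rollout from the same start.
rewards = torch.tensor([reward for _, reward in group])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Policy-gradient step: raise the log-probs of above-average rollouts.
loss = -(torch.stack([lp for lp, _ in group]) * advantages).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The scale problems in this talk all live inside those two placeholders: real rollouts are long, expensive, and uneven, which is what the rest of this section is about.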
All of these servers talk to each other. For example, the inference server sends advantages back to the trainer, which nudges the model up or down based on each rollout, updates the model, and produces new parameters.

This next part is a bit more on the ML side. We're trying to train a very large model, and to do it as fast as possible. One way our research team achieved this was to develop a library of custom kernels for very low-precision training. This speeds up the training process in a big way and also makes it much easier to ship weights to our inference server. If you're the type of person who loves this, we wrote a blog post going way in depth on our custom kernels. The TL;DR is that we saw about a 3.5x speedup on the mixture-of-experts layer on NVIDIA Blackwell chips, which made a pretty significant impact on our training runs.

Once we update the weights, we need to send them back over to the inference server during training, and the inference server is the one doing all the rollouts I talked about: calling the tools and managing what we send. The challenge is that rollouts all complete at different times, so a naive implementation wastes a lot of time. What we did was load-balance across the different threads and processes, shifting work around so there isn't a bunch of idle time. If one rollout makes a ton of tool calls, maybe installing some packages or a library, we're not just sitting there waiting for all the others to finish.

The inference server also spends a lot of time going back and forth with the environment servers, making tool calls and getting tool results back, and we want that environment to be as close as possible to the Cursor product. One thing that's nice about having both the coding agent and IDE and the model research and training in one place is that we can co-design them. As we were building out the RL work for this model, we were also building our cloud agents product, which lets you run a Cursor agent away from your machine: from your phone, on the web, or kicked off from Slack, for example. To do this, we spin up virtual machines in the cloud. Each VM loads the user's code and lets the agent make file changes, run tools, and edit code in a secure sandbox. Coincidentally, this is the perfect infrastructure for RL and our use in training: we have a fleet of cloud VMs and an environment that very closely matches the production Cursor environment, and we can use that for training.

This still has some challenges, though. As I mentioned, the training workload is very spiky, which is different from the steadier inference of the cloud agents product, so we needed to build infrastructure to support all of these VMs and orchestrate between them. We have many different clusters and hundreds of thousands of VMs, and you can see behind me one of the internal dashboards we built (with Composer, actually) to visualize all of the VMs in the fleet.
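Here's a minimal sketch of that scheduling idea: consume rollouts as they finish instead of blocking on the slowest one. The random sleep stands in for a real rollout's tool calls and edits; this is illustrative, not Cursor's infrastructure:

```python
import asyncio
import random

async def rollout(rollout_id: int) -> str:
    # Rollouts vary wildly: some make a handful of tool calls, others
    # install packages and run far longer.
    duration = random.uniform(0.1, 2.0)
    await asyncio.sleep(duration)
    return f"rollout {rollout_id} finished after {duration:.2f}s"

async def main() -> None:
    tasks = [asyncio.create_task(rollout(i)) for i in range(16)]
    # as_completed yields results in finish order, so downstream work
    # (scoring, advantage computation) starts immediately rather than
    # idling until the slowest rollout returns.
    for finished in asyncio.as_completed(tasks):
        print(await finished)

asyncio.run(main())
```

In the real system the freed capacity gets load-balanced onto other pending rollouts, but the shape of the problem, highly variable completion times, is the same.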
So why spend all this time trying to match the environment to production Cursor? I've mentioned it a few times; we could mock it or simulate it instead. One of the really nice benefits is that we get to give the model specific tools that we think are very valuable inside the agent. One of those is semantic search, for which we've trained our own embedding model. When you use Cursor, we index your codebase, which lets the agent make natural-language queries to find files it might want to edit. We did some research on this recently and found that semantic search helped basically every model inside the Cursor agent harness, but it was particularly helpful for Composer, which makes sense when you think about it: we trained Composer in the exact same environment we use at inference time, so the model becomes a power user of this tool, which is really effective.
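To show the shape of that idea, here's a toy sketch of embedding-based codebase search. The bag-of-words embed() is a deliberately crude stand-in for a trained embedding model, and the code chunks are invented; only the structure reflects what was described: embed chunks once at indexing time, embed the query, rank by similarity.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Crude stand-in for a learned code embedding: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Indexing time: every chunk of the codebase is embedded ahead of time.
chunks = {
    "auth/login.py": "def login user password verify credentials create session",
    "billing/invoice.py": "def render invoice order compute totals and tax",
    "auth/session.py": "class Session stores the logged in user token",
}
index = {path: embed(text) for path, text in chunks.items()}

# Query time: the agent asks in natural language, not exact keywords,
# and gets back the most semantically similar files first.
query = embed("where do we verify the user password")
for path in sorted(index, key=lambda p: -cosine(query, index[p])):
    print(f"{cosine(query, index[path]):.3f}  {path}")
```

With a real embedding model, the match survives wording changes ("check credentials" versus "verify password") that defeat plain text search, which is part of why the agent leans on it so heavily.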
So let's talk about how the release has been going and where we're going next. During the training process, we knew RL was working when we could continuously improve the model and see more and more gains after more and more rollouts. We started at about the same performance as the best open model, and as we trained and threw more compute at it, performance kept increasing, to the point where today we're close to the frontier of the best coding agents available. Personally, I think this is a great sign for taking RL and scaling it on very hard, specialized tasks: coding in our case, but it could be applied to other domains as well.

RL also allowed us to change properties of the model in ways that were very useful for the Cursor product. We wanted the model to be fast at generating tokens, but also fast at the end-to-end experience of getting a helpful result. For example, instead of reading files one by one, it can read ten files in parallel with tool calling, and as you saw in the demo earlier, that makes Composer feel much faster. We think this is just the start; there's a lot more we can do in this area to speed up the model. Second, the model learned to behave better as an agent. In the beginning, it made too many edits, sometimes unnecessarily, but as we trained more and more, it got surprisingly better at searching and reading files first, so it would find the right thing before trying to make edits. Overall, it became a bit more effective.

We released Composer last month in Cursor 2.0, and so far people seem to like it. Has anyone here tried the model, by chance? Okay, that's more than I expected; that's great to hear. From my perspective, having used this model and coding agents for some time, I'd describe the problem as airplane Wi-Fi. Airplane Wi-Fi works, but it's frustrating: you really want to do whatever you're trying to do, but it's just a little slow, almost to the point where sometimes you wish you didn't have Wi-Fi at all. For some of us who adopted coding agents very early, they can feel like airplane Wi-Fi, because if a task takes 10 or 20 minutes, you're in this weird place, what I think swyx called the semi-async valley of death, where you either want something that's really fast or you want the most powerful, most intelligent model that can run for a significantly long time, maybe in the background, maybe for 30 minutes or even days. When you're stuck in the middle, that's very, very painful. For me, and I think for other people, Composer has brought a lot of joy back to coding with agents; it feels more like writing code by hand, where you're very in the loop and very synchronous. I'm excited to see more people exploring this space as well. Day to day, I write a lot of plans with the latest frontier model; GPT-5.1 Codex is really great for plans. Then I use Composer to take that plan, kind of like what Dex talked about with context engineering, and actually go build the thing.

A few reflections from our research and product teams on building Composer. The first is that RL can work surprisingly well for training very specific models, given high-quality data and a decent amount of compute. At Cursor, we're not trying to build general intelligence or AGI; we're trying to build very good coding models, and RL has worked surprisingly well for that. The second is how much AI tools like Cursor (it doesn't have to be Cursor) speed up research and development. Of course our entire team uses Cursor to write and debug code more efficiently, but that speedup compounds across all of our engineering efforts: we're able to try more ideas, ship product faster, and try new research. The last one, which is personally pretty interesting to me, is how much of the ML work and the training process was actually also an infrastructure problem; the two were deeply correlated. Going back to my time at Vercel, we saw a very similar thing: a lot of the magic moments you can have working with frameworks in the JavaScript or Python space also depend on thinking about the infrastructure where they're actually deployed. These things are more related than people might think.

So those are some of our reflections. It sounds like some of you have tried it out. If this is something you're interested in and want to work on, we're hiring pretty much across the board at Cursor right now. We just opened up an office in New York, if you're based here, and we'd love to talk to you about building the best coding models in the world. Thank you.