Building Cursor Composer – Lee Robinson, Cursor

Channel: aiDotEngineer

Published at: 2025-12-02

YouTube video id: fL1iJHtl51Q

Source: https://www.youtube.com/watch?v=fL1iJHtl51Q

It's great to be back in New York, and I'm very excited to be here and talk on behalf of all of our engineering and research teams at Cursor about building Cursor Composer, our first agent model. My colleague Sasha actually gave a version of this talk recently, so I'm excited to give my own take on it. Cursor Composer is a model designed for real-world software engineering, and it tries to be both fast and smart. As we've measured it against our own benchmarks, it's better than the best open-source models and roughly in line with recent frontier models, though slightly below the latest frontier like Sonnet 4.5 and GPT-5.1 Codex. Where it really shines is efficiency: it's about four times more efficient at token generation than models at a similar level of intelligence. So we're trying to mesh speed with intelligence. So why
did we build this model? Obviously Cursor has an IDE, so why are we getting into the model space? Why do we care about this? Well, our research and product teams have been building a model called Tab, which you can use for autocomplete; maybe some of you use that inside of Cursor. We wanted to take that same approach of a very low-latency model and apply it to coding with agents. But honestly, we weren't really sure if it would work. So we started prototyping some early versions of what this model could look like, put them out, and got some feedback from users. We were pretty surprised: people actually really liked the "Cheetah" slug we released for this model. They really liked the speed, but the feedback we got was that it wasn't smart enough yet to be a daily driver for a lot of their coding. So we needed it to be smart and fast.
It definitely needed to be smart. So we really worked on building an internal benchmark that represented our usage on our own repos and how we actually build software. If we had a model that was both fast and smart, a checkpoint that our developers would use every single day to build the product and all of our software, then we knew we would be on to something. For example, one big change that helped push this toward a checkpoint people would actually use was being able to call tools in parallel and to use our semantic search tool very effectively. We'll talk about that a bit more later.
So if you haven't seen it, here's Cursor, in the new Cursor 2.0 view, and we're going to use the Composer 1 model. You'll notice that it is doing a lot of things very quickly. It's calling a bunch of tools in parallel, like grep, so it's reading a lot of files. It's running shell commands. It's making file edits. It's writing and managing a list of to-dos. And you can very quickly work through tasks in the foreground here; in this case, I'm investigating an issue in an open-source repo. I don't know about y'all, but this has been a quite different programming experience for me. Having worked with coding agents for a little while now, versus firing off an agent and waiting, let's call it 20 minutes, for it to complete, where you can context-switch away, this really does help keep you in the flow, and it's a different style of programming, I think. So I want
to talk about how we did this in a way that's hopefully accessible for you all. I'm not a machine learning researcher, but I do really enjoy this stuff. I'll cover what we learned, some of the infrastructure challenges, and then a little bit on where we're going moving forward. So, in Cursor, a user submits a query to our backend. The agent reads that query and then decides to make a series of tool calls. Our agent has about 10 tools, give or take, but we're going to focus on five here.
So reading files, editing files,
searching your codebase, looking at
lints, and then also running terminal or
shell commands. The agent then autonomously decides: do we call these tools serially, or do we run them in parallel?
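To make that tool set concrete, here's a rough sketch of what a registry of those five tools might look like; the function names and signatures are hypothetical, not Cursor's actual interface.

```python
import subprocess
from pathlib import Path

# Hypothetical sketch of the five tools discussed here; names and
# signatures are illustrative, not Cursor's actual internal API.
def read_file(path: str) -> str:
    return Path(path).read_text()

def edit_file(path: str, new_content: str) -> str:
    Path(path).write_text(new_content)
    return f"edited {path}"

def codebase_search(query: str) -> list[str]:
    return []  # stub: production would query a semantic index

def read_lints(path: str) -> list[str]:
    return []  # stub: production would run the project's linters

def run_shell(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

# The agent emits tool calls by name; the harness looks them up here and
# may execute several of them at the same time.
TOOLS = {f.__name__: f for f in [read_file, edit_file, codebase_search, read_lints, run_shell]}
```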
Our goal with reinforcement learning here is to mirror the Cursor production environment as closely as we possibly can, so that the data we use in training behaves like real Cursor queries. To do that, we run a series of rollouts. In one rollout, for example, the model calls a series of tools like reading files and editing files. When we run more rollouts, we can start from that same initial starting point but call a completely different set of tools; in another rollout, the model also does codebase search. We then score the outputs, decide which one is better, and update the parameters of our model based on that. Conceptually, it's a pretty simple idea.
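In a toy sketch, that loop looks something like the code below. The policy, rollout, and reward here are stand-ins to illustrate the idea of sampling several rollouts from the same query, scoring them, and nudging the model; this is not Cursor's training code.

```python
import torch

# Toy rollout -> score -> update loop; everything here is a stand-in.
policy = torch.nn.Linear(4, 2)                       # stand-in "policy"
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def sample_rollout(prompt):
    # Stand-in: a real rollout would have the model call tools (read,
    # edit, search, lint, shell) until it finishes the task.
    logits = policy(torch.randn(4))
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    return {"log_prob": dist.log_prob(action), "action": action}

def score(rollout):
    # Stand-in reward: did the rollout end in a good state?
    return float(rollout["action"].item() == 1)

def training_step(prompt, num_rollouts=8):
    # Run several rollouts from the same starting query; each may take a
    # completely different path through the tools.
    rollouts = [sample_rollout(prompt) for _ in range(num_rollouts)]
    rewards = torch.tensor([score(r) for r in rollouts])
    # Advantage = how much better this rollout did than the group average.
    advantages = rewards - rewards.mean()
    log_probs = torch.stack([r["log_prob"] for r in rollouts])
    loss = -(advantages * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

training_step("fix the failing test")
```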
The challenges come when you take that simple idea and try to scale it up to a very large degree. There are roughly three challenges. The first one
is matching the training and inference environments, that is, how the model is actually served in the product. In this case, Composer is a large mixture-of-experts model, parallelized across thousands of GPUs during training, and if we don't speed that up, it's going to take forever to train. So we want to make training really fast and keep the training and sampling setups as close as possible. The second challenge is
that the rollouts can get pretty complex
when you start to look at real world
data here. So, models are going to use
hundreds of thousands to millions of
tokens. They're going to make hundreds
of different tool calls. And each of these rollouts could take a pretty different amount of time: one might make a lot of tool calls, another might make far fewer, and they'll complete at different times. So we have to figure out how to deal with that. And finally, there's the
challenge of consistency. If we want to mimic the production Cursor environment as closely as possible, we need to use exactly the same tool formats and tool responses. But in training, compute is really bursty: we're basically doing all of this training at once, which is different from production. So it really is an infrastructure challenge. We have these three machine learning challenges, and all of the solutions, coincidentally, are actually infrastructure problems. So let's talk through a few of these problems and how we solved them at the infrastructure layer.
Our architecture is probably familiar to some of you who have been involved in this space, but I still think it's really interesting to talk about at a high level. We have three kinds of servers. We have a trainer server, with the standard ML stack built on PyTorch. We have an inference server, which handles the rollouts I just talked about; that's where we use Ray. And then we have environment servers, which simulate the Cursor environment I described. All of these servers talk to each other. So, for example, the inference server sends advantages back to the trainer, which nudges the model up or down based on each rollout, updates the model, and sends back new parameters.
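As a rough, hypothetical sketch of how those three servers could talk to each other with Ray (class names and methods are illustrative, not Cursor's infrastructure):

```python
import ray

ray.init()

@ray.remote
class EnvironmentServer:
    def run_tool(self, call):
        # Execute a tool call (read file, run shell, ...) in a sandbox.
        return {"tool": call, "result": "..."}

@ray.remote
class InferenceServer:
    def __init__(self, env):
        self.env = env
        self.weights = None

    def set_weights(self, weights):
        self.weights = weights

    def rollout(self, prompt):
        # Sample tool calls from the current policy, execute them against
        # the environment server, and return the trajectory.
        step = ray.get(self.env.run_tool.remote("read_file"))
        return {"prompt": prompt, "steps": [step]}

@ray.remote
class Trainer:
    def update(self, trajectories):
        # Compute advantages, take a gradient step, return new weights.
        return {"version": 1}

env = EnvironmentServer.remote()
inference = InferenceServer.remote(env)
trainer = Trainer.remote()

trajectories = ray.get([inference.rollout.remote("fix the bug") for _ in range(4)])
new_weights = ray.get(trainer.update.remote(trajectories))
ray.get(inference.set_weights.remote(new_weights))
```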
This one is a bit more on the ML side, but we're trying to train a very, very large model and to do it as fast as possible. One way our team was able to do this on the research side was to develop a library of custom kernels that allow for very low-precision training. Basically, this lets us speed up the training process in a big way and also makes it much easier to ship weights to our inference server. If you're the type of person who loves this, we wrote a blog post going way in depth on our custom kernels. The TL;DR is that for the mixture-of-experts layer we saw about a three-and-a-half-times speedup on NVIDIA Blackwell chips, so it made a pretty significant impact on our training runs.
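For intuition only, here's a toy illustration of block-scaled low-precision quantization, the general idea behind training in very low precision. This is not Cursor's kernel library, and the real speedups come from fused hardware kernels, not from anything like this snippet.

```python
import torch

# Toy block-scaled "fake quantization": one scale per small block keeps
# the dynamic range manageable when values are stored in few bits.
def fake_quantize_blockwise(x: torch.Tensor, block: int = 32, bits: int = 8) -> torch.Tensor:
    orig_shape = x.shape
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / (2 ** (bits - 1) - 1)
    q = torch.round(x / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (q * scale).reshape(orig_shape)

w = torch.randn(1024, 1024)
w_q = fake_quantize_blockwise(w)
print((w - w_q).abs().max())  # per-block scaling keeps the error small
```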
Once we update the weights, we need to send them back over to the inference server during this training process, and the inference server is the one doing all the rollouts I talked about, calling the tools and managing what we send. The challenge is that the rollouts all complete at different times, so a naive version of this leaves a lot of wasted time. What we were able to do is load-balance across the different threads and processes to shift the work around and avoid a bunch of idle time. If one rollout, for example, makes a ton of tool calls, maybe it installs some packages or a library, we're not just sitting there waiting for all of the other ones to finish.
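A minimal sketch of that idea, assuming a worker-pool scheduler rather than Cursor's actual system: workers pull new rollouts from a queue as soon as they finish their current one, so a single slow rollout doesn't leave the rest of the pool idle.

```python
import asyncio
import random

# Hypothetical rollout scheduler sketch; not Cursor's implementation.
async def run_rollout(task_id: int) -> dict:
    # Rollouts vary wildly in length (a few tool calls vs. installing packages).
    await asyncio.sleep(random.uniform(0.1, 2.0))
    return {"task": task_id, "tool_calls": random.randint(1, 100)}

async def worker(queue: asyncio.Queue, results: list):
    while True:
        task_id = await queue.get()
        results.append(await run_rollout(task_id))
        queue.task_done()

async def main(num_tasks: int = 32, num_workers: int = 8):
    queue, results = asyncio.Queue(), []
    for i in range(num_tasks):
        queue.put_nowait(i)
    # Workers pick up new rollouts as soon as they free up, instead of
    # waiting for the slowest rollout in a fixed batch.
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(num_workers)]
    await queue.join()
    for w in workers:
        w.cancel()
    return results

asyncio.run(main())
```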
The inference server spends all this time going back and forth, making the tool calls to the environment and getting the tool results back. So again, we're communicating between these servers, and we want that environment to be as close as possible to the Cursor product. One thing that's
nice about having both the coding agent,
the IDE, as well as what we're doing
with the model research and training our
own models is we can kind of co-design
these things together. So, as we were
building out a lot of our RL work for
this model, we were also building our
cloud agents product. Um, this is how
you can run a cursor agent kind of
offline. You can run it from your phone
or on the web or kick it off from Slack,
for example. And to do this, we spin up
virtual machines in the cloud. So each
one of these VMs loads up the user's
code. It allows the agent to kind of
like make file changes, run tools, and
edit code in a secure sandbox. And
coincidentally, this is the perfect
infra for RL and for our use in training. So
we have this like fleet of cloud VMs and
we have an environment that very closely
matches the production cursor
environment and we can then use that for
training. This does still have some
challenges though. I kind of talked
about how the training workload is very
spiky and it's different than the kind
of standard inference when you're
running the cloud agents product. So we
needed to build infrastructure to
support all of these VMs and
orchestrating between them. So you know
we have many different clusters,
hundreds of thousands of VMs here and
you can see behind me one of the
internal dashboards we built uh with
composer actually to visualize uh all of
the different VMs in the fleet.
So why spend all this time trying to match the environment as closely as possible to Cursor production? I've mentioned that a few times. We could mock it, we could simulate it. But one of the really nice benefits is that we get to give the model specific tools we think are very valuable inside the agent. One of those is that we've trained our own embedding model that enables semantic search. When you use Cursor, we index your codebase, and that allows the agent to make natural-language queries to find files it might want to edit. We did some research on this recently and found that semantic search not only helped basically every single model inside the Cursor agent harness, but was particularly helpful with Composer, which makes sense when you think about it: we trained Composer in the exact same environment we're using at inference time, so the model becomes a power user of this tool, which is really effective.
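As a rough sketch of the idea only (this is not Cursor's embedding model or index; the embedding function here is a throwaway stand-in): index code chunks once, then rank them by similarity to a natural-language query.

```python
import numpy as np

# Toy embedding-based codebase search; `embed` is a stand-in, not a real model.
def embed(text: str) -> np.ndarray:
    # Hash-based bag-of-words vector, for illustration only.
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def build_index(chunks: dict[str, str]) -> dict[str, np.ndarray]:
    # Index each code chunk (e.g., a function or file region) once.
    return {path: embed(text) for path, text in chunks.items()}

def semantic_search(query: str, index: dict[str, np.ndarray], k: int = 3) -> list[str]:
    q = embed(query)
    scored = sorted(index.items(), key=lambda kv: -float(q @ kv[1]))
    return [path for path, _ in scored[:k]]

index = build_index({
    "auth/login.py": "def login(user, password): verify credentials and create session",
    "billing/invoice.py": "def create_invoice(customer, items): compute totals",
})
print(semantic_search("where do we check the user's password?", index))
```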
So let's talk about how the release has been going and where we're going next. As we were doing the training, we knew RL was working when we were able to continuously improve the model and see more and more improvement after more and more rollouts. We started at about the same performance as the best open model, and then, as we trained and threw more compute at it, the performance continued to increase, to the point where today we're close to the frontier in terms of the best coding agents available. Personally, I think this is a great sign for being able to scale RL and apply it to very hard, specialized tasks, coding in our case, though it could be applied to other domains as well. RL also allowed us to change properties of the model in ways that were very useful for the Cursor product. We wanted the model to be fast at generating tokens, but also to deliver a helpful end-to-end result.
helpful. So for example, instead of
reading a file one by one, you can read
10 files in parallel with tool calling.
And as you saw in the demo earlier, it
makes composer feel much faster when you
have that. And we think this is kind of
just the start. there's a lot more we
can do in this area to speed up the
model. Uh, and the second one is the
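To illustrate why batching tool calls helps, here is a small, hedged sketch of serial versus parallel file reads; the files and harness here are made up, not Cursor's.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Toy comparison of serial vs. parallel tool execution (illustration only).
tmp = Path(tempfile.mkdtemp())
paths = []
for i in range(10):
    p = tmp / f"module_{i}.py"          # hypothetical files for the demo
    p.write_text(f"# file {i}\n")
    paths.append(p)

def read_file(path: Path) -> str:       # the "read file" tool
    return path.read_text()

# Serial: one tool call at a time, latency adds up across all ten reads.
serial = [read_file(p) for p in paths]

# Parallel: issue all ten reads at once; wall-clock time is roughly the
# slowest single read instead of the sum of them.
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(read_file, paths))

assert serial == parallel
```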
The second is that the model learned how to behave better as an agent. In the beginning, the model was making too many edits, and sometimes the edits were made unnecessarily. But as we trained more and more, the model got surprisingly better at learning to search and read files first, so it would find the right thing before it tried to make edits. Overall, it just became a bit more effective.
So, we released Composer last month in Cursor 2.0, and so far it seems like people like it. Has anyone here tried the model, by chance? Okay, that's pretty great; that's more than I expected, so that's great to hear. I
think, from my perspective, having used this model and coding agents for some time, I'd describe the problem as being like airplane Wi-Fi. When you're on airplane Wi-Fi, it works, but it's kind of frustrating: you really want to do whatever you're trying to do, but it's just a little slow, almost to the point where sometimes you wish you didn't have Wi-Fi at all. And I think for some of us who adopted coding agents very early, it can feel like airplane Wi-Fi sometimes, because if a task is taking 10 or 20 minutes, you're in this weird spot, I think swyx called it the semi-async valley of death, where you either want something that's really fast, or you want the most powerful, most intelligent model that can run for a significantly long time, maybe in the background, maybe for 30 minutes or days. When you're stuck in the middle, that's very, very painful. So for me, and I think for other people, Composer has brought a lot of joy back to coding with agents; it feels more like when you were writing code by hand, where you're very in the loop, very synchronous. I'm excited to see more people exploring this space as well. For me, day to day, I'm writing a lot of plans with the latest, highest-frontier models; GPT-5.1 Codex is really great for plans. Then I'm using Composer to actually execute that plan, kind of like what Dex talked about: take the context engineering work and then actually go and build the thing with it. So, a few
reflections from our research and product teams on building Composer. The first is that RL can work surprisingly well for training very specific models, given high-quality data and a decent amount of compute. At Cursor, we're not trying to build general intelligence; we're not trying to build AGI. We're trying to build very good coding models, and RL has worked surprisingly well for that. The second is how much AI tools like Cursor, it doesn't have to be Cursor, really help speed up research and development. Of course our entire team uses Cursor to help them write and debug code more efficiently, but that speedup really compounds across all of our engineering efforts, so we're able to try more ideas, ship product faster, and try new research. It's been really, really helpful there. And the last one, which is personally pretty interesting for me, is seeing how much of the ML work and the training process was actually also an infrastructure problem; the two were very correlated. Going back to my time at Vercel, we saw a very similar thing: a lot of the magic moments you can have working with frameworks in the JavaScript or Python space also require thinking a little bit about the infrastructure where they're actually deployed. So these things are more related than people might think. So those are some of
our reflections. It sounds like some of you have tried it out. If this is something you're interested in working on, we're hiring pretty much across the board at Cursor right now. We just opened an office in New York, if you're based here in New York, and we'd love to talk to you about building the best coding models in the world. Thank you.