Coding Evals: From Code Snippets to Codebases – Naman Jain, Cursor

Channel: aiDotEngineer

Published at: 2025-12-15

YouTube video id: tHN44yJoeS8

Source: https://www.youtube.com/watch?v=tHN44yJoeS8

[music]
Hi everyone. So I'll be talking about uh
like some work on evaluations
particularly evaluations across like I
guess I've done in the last four years.
So let's get started.
So uh I'll be talking about coding
evaluations across varying time
horizons. So I've been uh working on
like in the code space for about four
years now like it was right before like
early copilot came out and my first
project was actually working on
generating like single line panda
snippets and my last project was
generating an entire codebase. So the
field has like really progressed very
quickly. So I'll be talking about like
uh different stages of evaluations we
have considered and some learnings
across each of the projects and how I
see evaluations going forward. So the
first work I did was on uh like uh
evaluating uh coding models in like
second uh work doing in seconds of time
like generating single line steps your
co-pilot code completions. Then I work
did some work on like uh evaluating on
like interview style competition
programming problems uh which where
models can work up to minutes. Uh then
we worked on some work on like uh
repository question answering uh which
required like maybe uh more uh multiple
minutes tens of minutes. Uh and finally
like uh pushing the frontier forward we
are uh thinking about uh evaluating
models on very complex tasks which can
take hours or like multiple hours of
work like code optimization and like
even further. So let's get started.
Uh so first work I'll be talking about
is like codebench uh which is uh like uh
uh validation work on models for like
competition coding. So here uh like this
is what a problem would look like. This
is like very standard lead code problem
and don't worry you don't need to solve
something like this. So uh like uh here
as you can see there's a problem uh
statement and the nice thing about these
interview style problems is that these
problems are very well uh defined. you
have like good natural language
specifications some example input output
examples so you can very uh reliably
evaluate the models are doing a good job
or not. So what was the motivation
behind this and how we improved the
frontier here. So the first challenge in
uh evaluating uh language models these
days is like data contamination. These
models are trained on like the entire
internet and uh like on stack overflow
you'll find uh like very uh similar
programming problems puzzles. Uh
similarly uh like you'll find uh like uh
very similar programming problem sources
on GitHub or on the internet. So uh like
contamination is a big uh deal. Uh
another very uh challenging factor which
has struggled with the field is like
insufficient text suites. So you'll see
that uh like in this program uh like the
goal was to return a sorted unique
common elements between the two lists.
But uh like even a solution which does
not do the sorting and just returns the
set actually works because the tests
were brittle and were not catching this
mistake. So uh like test suites is
another uh like very challenging factor
and how do we generate good and diverse
tests and finally uh difficulty
distributions which is something which
people do not do not really uh reliably
uh like calibrate uh like when I first
was working uh in uh this space uh like
there were two benchmarks available on
one benchmark the performance was 80% or
90% and on the other one it was 1% and
there was nothing in between and uh like
as like benchmark users what you care
about is having some signal from the
benchmark to like basically hill climb
to make progress to measure progress and
in uh either of these regimes when if
the problems are too easy or too hard
you don't get a lot of signal. So it is
very important [clears throat] when
you're designing benchmarks to think
about like the kinds of problems you are
taking and will it provide enough signal
for the users of your benchmark.
So uh like in light codebench we
pioneered like dynamic evaluations uh
particularly uh like we can periodically
update uh the evaluation sets uh and
this gives you two uh very nice factors.
First is you can combat contamination.
So you can evaluate the models on
problems that were released after the
model was trained. So it has likely not
seen the problem something like that. Uh
and uh then you can also modify the
problem difficulty distributions over
time. So as we have talked about models
are incre like improving very rapidly.
uh so what was difficult uh for the
model 6 months back might not be now. So
you can uh if you're updating your
evaluation sets constantly you can
actually uh keep calibrate uh the
difficulty distributions calibrated so
you still get more signal out of your
benchmarks.
So how we did that here like we had like
an automated approach for curation of
these problems and uh similarly we could
automatically constru these test cases
in an automated manner and uh this
allows a very nice thing when since we
are like collecting problems over time
we have time as a control knob. So like
we have these problem release months uh
on lead code and if you evaluate the
model performances like the pass at the
rate one metric uh like on problems
released over different months you will
see that after uh like uh these model
release dates you would see stark drop
in model performance. So like after
deepsek was uh released in like
September 2023 uh the performance
starkly drops from like maybe 50%
average to like over like 20% or 15%
average. So like uh based on these
sliding windows you can uh evaluate
performance, measure contamination and
even combat contamination.
Um uh we have the running leaderboard
which is like very well maintained and
uh on this leaderboard you can actually
uh like uh like view performances by uh
scrolling this uh horizontal time bar
and you'll see that as you're scrolling
uh the contaminated models which are the
red bars actually go down which does
highlight that uh like problem does uh
like model performance does change on uh
these newer kind of problems.
Um finally for uh test generation we uh
maintain uh like these uh test
generation test generators. So if you
worked on fuzzing you would have like
input generators where you generate
diverse inputs and each of the problems
are supported by like 30s or 50 inputs.
So you can uh reliably find mistakes and
bugs in uh incorrect code and these are
all automatically generated uh using an
LLM driven approaches
and these problems uh have been like
continuously being released and updated.
So we have released like six different
versions of uh life codebench and these
uh new one of the nice things or one of
the worrying things for me at the start
was that uh like if you're constantly
updating the eval sets will uh like
people be able to keep track of them
will people be using them or will they
just restrict to a single version? Uh it
turned out that these newer eval sets
were constantly uh like adopted by
different foundation model labs and uh
like since we updated the problem
difficulty over time uh the evaluation
sets continue to provide strong signal
to compare uh different models.
Um so this was like live codebench.
Let's talk about uh like something which
is more on coding agents like more real
world programs and this is our work on
like uh software optimization. So this
is a problem we're very excited about
and I'll talk about a few factors why
you should maybe be excited about this.
So uh here we are trying to uh measure
model capabilities in generating high
performance software and uh I feel that
this uh like problem domain uh like
mixes two uh factors like the
algorithmic coding uh uh field I talked
about which is like live codebin setting
but also like glob global software
editing like uh sweet bench and other
like software uh uh general software
engineering benchmarks. uh uh in high
performance uh software you will have to
do algorithmic work you have to do deep
analysis and find uh uh generate
software with like right uh runtime.
So uh one of the key principles when we
are trying to build this benchmark was
like ensuring construct validity because
when you see a lot of benchmarks today
uh we get very high benchmark scores but
at a lot of the times they don't really
translate to real world performance
gains. So construct validity refers to
how close uh a measurement reflects the
underlying uh concept it's meant to
measure. So like here we are measuring
code optimization and we want something
which is uh like uh reliably evaluates
real world uh takes. So this usually
requires like two aspects. First is like
the task distribution. Your task should
be natural and sourced from the real
world and then you should be able to
reliably grade them. So let me talk
about like what steps we take to uh make
this happen and how we construct this
benchmark. So let's say we take a
codebase like llama cvp uh we take uh uh
we crawl over all the commits of the
codebase and we find the commits which
are op like doing something uh related
to performance optimization. So here
there was this commit which is
optimizing the quantized performance of
uh like uh certain kinds of models. Uh
for all of these uh comm performance
optimizing commits we would uh like
generate performance test cases. Um and
uh these performance SK would look like
some workloads and uh once we have these
workloads uh we have a very uh nice and
precise way to specify the problem
statement that uh given this workload of
let's say uh running uh Quinn uh 7B
model uh can uh we give this uh problem
to uh su agent ask the model to optimize
the code glamour CPB repository so this
code runs faster so as you can imagine
this uh task is like fairly challenging
you need to understand like low-level uh
implementation details uh and like how
quantized models behave, how we can uh
improve the runtime and so models can
generate a patch and the evaluation is
done on whether the patch is correct. So
does it pass the equivalence check with
the human patch and uh is there a valid
optimization over the reference human
patch uh that is uh whether you can uh
generate a better runtime than what a
human could do.
So uh like uh this is a very challenging
task. we have like 100 plus optimization
task source in this manner and this is
like fairly uh like important in like uh
like high performance settings. So think
about like data science uh like ML
visualization scenarios uh benchmark uh
like comprises of like various uh
low-level uh code like C, C++, Rust and
the very nice thing is like these are
precise problem statements. you can uh
easily specify to the model what is the
goal in the form of a performance test
which the model has access to and it can
continuously iterate over it for a long
time. So here we can scale the test time
compute and pick the best solution based
on uh the test cases that we have and
this can happen like synchronously or
asynchronously.
So uh like we generate these performance
test cases and uh that work uh
reasonably well but uh we found that
there were uh like cases of reward
hacking here. So what do I mean by
reward hacking? Like frontier models
would write non-inneatic code to like
actively exploit the evaluation
infrastructure or overfitit the test
distributions. So one funny example we
saw was like models would add like l
cache to p like arbitrary pandas methods
when we were uh trying to optimize
pandas and the official solution should
have required changing something in the
internals. Uh so we tried to pass this
by changing our evaluation
infrastructure so it's like more robust
to this kind of hacking uh approaches
but then we saw something like even more
drastic models would sometimes
completely hijack the infra where uh
they would add a like site customized.py
Pi file where which runs at the start of
Python runtime and it would basically
change the numpy library uh like which
was installed in the codebase to
something it crawled from uh source and
there is like I think you can do some
ways to uh like take some measures to
make your evaluation infra which is
robust to these kind of uh like
adversarial uh like attacks. But uh here
uh like there could be myriad ways in
which models can hack these kind of
scenarios. And here uh we propose like
hack detector where which is a detection
system that leverages GBD5's like code
analysis capabilities and test compute
to like basically identify these kind of
hacking behaviors at runtime. So you
don't have to imagine all the possible
failure scenarios at the start. So what
it would take is like a model patch, the
expert patch and test cases and we'll
ask GBD5 to give like verdicts on like
whether it's reward hacking with some
kind of explanation. Uh we'll do this a
few times and take the consensus and
based on this consensus we'll determine
if this is uh doing some like nonomatic
coding patterns or not
and uh we did some failure analysis
based on this. So now you can detect
mistakes using test cases whether the
code is correct or not whether it is
optimizing or not but you can also
detect reward hacks using this like lm
as a judge uh factor and uh what you see
is kind of surprising uh like models
make a lot of like correctness mistakes
that you can catch by tests but even if
the code passes the test cases like 03
attempted reward hacking patterns in
like 30% of the problems it tried and
this fraction is like going down uh for
the newer models to some degree but it
is still existing and as we go to more
and more real world tasks. Uh this is
going to get more challenging and we
need to figure uh like ways to combat
these kind of reward hacking patterns by
using LLM judge and other uh ways to
make just evaluation infra more
reliable.
So next I'll talk about like uh uh like
sizz some of our new work on like uh
like pushing the boundary of code eval
even further and uh taking a look at
more challenging tasks. So here we were
thinking about like can uh like these
language models translate uh like a
entire code base uh specifically given a
specification as a C program can you
generate a safe implementation for the
same and we took a fairly complex code
base. So Zofle is a like highly
efficient compression library from
Google like it has about like 4,000
lines of code hundreds of functions and
complex data structures. uh and uh we
want like u like very precise and
correct code. So we uh generated like a
million compression inputs and your test
case was to generate a rest
implementation that uh maintains
correctness over those million test
cases. And when I did this work back in
uh like uh last year it took us 12 hours
to actually do this translation. Now
perhaps with better models this can be
done in 2 hours but still I think uh
this is pushing the frontier of like
what the models can do currently. Um so
what was one of the key findings when we
are trying to make progress in uh
something like this like end to end
correctness is important but it only
gives you like one bit of feedback but
for these very long horizon tasks one
thing which will become more important
going forward is like having some
measures of intermediate correctness. So
like for our case we could measure like
fraction of code translated, fraction of
code refactored and based on these kind
of settings you can uh understand like
if you're making progress or not and how
you can uh scale systems better.
Um so like uh as we're closing I'll talk
about like quickly talk about some of
the work I did on like in the wilds. So
this work was done in collaboration with
LM arena folks and uh like I'll talk
about two settings here. First is
co-pilot arena. So this is like
evaluating in ID uh code completion
assistance. So what we will do here is
we'll have an ID plug-in where uh like
uh similar to GitHub copilot setting uh
we'll generate a completion for you but
instead of just a single completion
you'll have two completions appearing
like top and uh down and you can pick
either one of them via shortcuts like
tab or shift tab and based on the uh
like acceptance rates we can pair wise
compare what the code completion
assistants are doing.
uh uh we also did some work on repo chat
where uh like uh to evaluate uh like
code question answering capabilities of
models uh we uh built a system where you
can provide a github url uh and you can
ask a natural language query about the
codebase which could be something what
explain the codebase to as complex as
let's try to solve this issue let's give
me give me a model patch that could
solve this issue and uh we integrated a
very basic and simple uh like su agent
system that fetches the codebase
resolves user queries and like
multi-turn uh code assistant uh
conversations.
So uh one thing that stood out to me in
these kind of things uh is like like how
humanentric experiment design uh needs
to be. So uh like for code like copilot
you know in particular we realized that
like latency is a big concern for
acceptance rates. So if you look at
accept like latency below and the
acceptance rates like if it is like
anything more than 1 second uh like the
acceptance rates drop very starkly. So
people care a lot about latency. So you
have to so we had to design our
experiment so that it's robust to these
kind of like latency differences between
models balance latency across different
models. So like if you're doing like
anything in the wild having this human
centering component understanding human
behaviors is very important to do
anything meaningful.
So uh at the end I think uh just to
recap like I think I talked about a
bunch of works like what are some uh big
takeaways. So I think uh dynamic uh
dynamically updating evaluation sets to
like prevent contamination like modify
the problem distributions like in terms
of difficulty in terms of distribution
of tasks we care about as we like uh
improve uh as the language model
capabilities will improve over time the
types of tasks we'll start to do with
model change. You can even uh think of
this like uh we were doing like code
completion where you were generating
like few tokens, few lines and now we
are generating like uh tens of lines,
hundreds of lines and to some degree
this uh will continuously change and we
have to update our evaluation sets uh so
that it reflects the real world usage
and kinds of things people need. Um the
second very uh important thing is like
ensuring reliable grading in this domain
and like tests are very good for
ensuring correctness and uh provide a
lot of reliable feedback but uh once we
go to real world settings like models
can start doing like lot of non-edomatic
coding patterns they would add try
catches everywhere to just prevent any
kind of bug from occurring. So having
these kind of lm judges to detect
nonmatic coding patterns code quality
and just any like arbitrary hacks will
be very important. And finally like as I
talked about in the last work like
intermediate grading signals so that you
can measure like incremental progress uh
is uh like another key factor here. So I
think that's uh the end of my talk.
Thank you.
[applause]
[music]
>> [music]