OpenAI on Securing Code-Executing AI Agents — Fouad Matin (Codex, Agent Robustness)
Channel: aiDotEngineer
Published at: 2025-07-30
YouTube video id: w7IMuYsBNr8
Source: https://www.youtube.com/watch?v=w7IMuYsBNr8
Hi everyone, I'm Fouad, and I'm here to talk about safety and security for code-executing agents. A little intro about myself: I started on the OpenAI security team after running a security startup for about six years, and I now work on agent robustness and control as part of post-training. One of the things I've worked on over the last couple of months is Codex and Codex CLI, our open-source library for running Codex directly on your computer. There's a lot we learned in building Codex that I'm excited to share with you all, but there's definitely a lot more work for us to do, and I'm excited to hear what you think afterwards.

One high-level point I want to start with is that every frontier research lab is focused on pushing the benchmarks around coding — and not just the benchmarks, but also the usability and real-world deployability of these agents. They're making models really good at writing and executing code, and as a result every agent will become a code-executing agent. It's not just about writing code; it's about achieving the objective most efficiently. If you look at where the models were even a year ago, o1 gave us a very early preview of what these reasoning models can do. With more recent models like o3, o4-mini, and others in the space, you see higher reliability and more capability. The new constraint isn't just "can these models do things?" but "what should they be able to do, and what should the guardrails be when you allow them to work in your environments?"

As I mentioned, code isn't just for SWE tasks — which is candidly what I thought when I started at OpenAI — it actually helps across the stack. Here's an example from our o3 release around multimodal reasoning. Previously, o1 would look at an image and try to reason about it exactly as it was given. What we've noticed with code-executing agents, even outside a SWE scenario, is that they'll run code to decipher the text on the page using OCR, or to crop images. There are some really exciting behaviors we've seen from models when you just give them the ability to run code. We didn't tell it in this prompt that it should run code; it just knew that, with that tool as an option, it could do the job more efficiently.

What I think we'll observe in building AI agents is a shift away from the complex inner loop: a model that determines what type of task the user is asking for given a prompt, then loads a more task-specific prompt and toolset, then chains a bunch of these loops together to achieve some goal — maybe asking the model "hey, are you done yet?" or telling it to keep going — and finally uses another model to respond back to the user. We generally don't need these anymore. You can just have the model decide when it should use which tools and when it should write or run code, and it can write and run that code on its own. Now, that's what we in security would call RCE — remote code execution. So when we look at these new behaviors, it's important to consider not just the capabilities, but how we ensure those capabilities don't backfire on us when we allow the model to perform those operations.
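To make that shift concrete, here is a minimal sketch of the "single loop" pattern described above, using the OpenAI Python SDK's standard function-calling interface. The `run_shell` tool, model name, and limits are illustrative choices, not the tools Codex itself uses; the point is simply that the model decides when to run a command, and the loop ends when it stops asking to. Executing model-written commands directly on the host, as this sketch does, is exactly the RCE concern — the sandboxing discussed later in the talk is about closing that gap.

```python
# Minimal "let the model decide" loop: one shell tool, no task routing.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

SHELL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the working directory and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}

def run_agent(task: str, model: str = "gpt-4.1") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        resp = client.chat.completions.create(model=model, messages=messages, tools=[SHELL_TOOL])
        msg = resp.choices[0].message
        if not msg.tool_calls:                  # model decided it is done
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:             # model decided to run code
            cmd = json.loads(call.function.arguments)["command"]
            # WARNING: unsandboxed execution of model-written commands.
            out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": (out.stdout + out.stderr)[-4000:],
            })
```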
There are a couple of different ways we've observed that models can go wrong. The most common one — something we think about consistently — is prompt injection and data exfiltration. There are a lot of examples we'll be documenting in the coming months, but that's probably number one in our priority queue. Then you have cases where the agent just makes a mistake: maybe it installs a malicious package unintentionally, or it writes vulnerable code, again unintentionally. And then there's privilege escalation or sandbox escape.

When we think about our responsibility in deploying these agents, both internally and externally, we have the Preparedness Framework, where we document recommendations and the standards we hold ourselves to. The one I want to emphasize is requiring safeguards to avoid misalignment at large-scale deployment. This is something we think about when building Codex, but it's also something that organizations deploying coding agents into the workplace should be considering.

One of the first safeguards we put in place is to sandbox the agent, especially if you're running it locally. Generally the best method is just to give it its own computer. That's what we did with Codex in ChatGPT: it spins up a fully isolated container and produces a PR at the end, which is practically as safe as you can get. But if you are going to run it locally — which, with Codex CLI, we also encourage — make sure you're providing the correct level of sandboxing, whether that's containerization, app-level sandboxing (which we'll talk about in a moment), or OS-level sandboxing, so the right guardrails are in place even if the model attempts to do something wrong.

Related to that is disabling or limiting internet access. This is probably the highest-probability vector for prompt injection or data exfiltration: the model goes to read some docs or a GitHub issue, and in a comment on that issue there's a prompt injection. That untrusted content leaks into the core inner loop that you trust the agent to run code in, and if the agent has access to your codebase or other sensitive material, that could be pretty bad.

And finally, review the operations — or the actual final diffs — the agent produces, whether that's code review on a GitHub PR or approvals and confirmations. Those guardrails are really important: ensuring that humans stay in control of these systems is one of the strongest mitigations we have. Of course, no one wants to sit there clicking "approve" all day, so YOLO mode on one end is something to avoid, but having to approve every single `ls` command isn't practical either.

So let's talk a little bit about how to actually achieve this. As I mentioned, our recommendation is to give the agent its own computer — you see this with Codex in ChatGPT. There are a lot of constraints to apply when you think about that: making sure the agent has all of its dependencies installed and all of the access it needs to perform its actions.
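A hedged sketch of one way to apply that locally: run each agent command in a throwaway Docker container that acts as the agent's "own computer," with networking disabled and only the project directory mounted. The image name, resource limits, and flags here are illustrative defaults, not the configuration Codex or Codex CLI actually uses.

```python
# Run a command in a disposable, network-disabled container scoped to the project.
import subprocess
from pathlib import Path

def run_in_sandbox(command: str, project_dir: Path, image: str = "python:3.12-slim"):
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",             # no internet: blocks exfiltration and injected fetches
        "--cap-drop", "ALL",             # drop Linux capabilities
        "--pids-limit", "256",
        "--memory", "1g",
        "-v", f"{project_dir.resolve()}:/workspace",  # agent can only touch the project
        "-w", "/workspace",
        image, "bash", "-lc", command,
    ]
    return subprocess.run(docker_cmd, capture_output=True, text=True, timeout=300)

# Example: let the agent run the test suite without second-order consequences.
result = run_in_sandbox("pytest -q", Path("."))
print(result.stdout)
```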
If you want to run it locally, you can use something like Codex CLI, which we fully open-sourced, to build these agents yourself. You can use it as a reference point — that's part of why we wanted to open-source it: not only "here's the agent we built for you," but also "here's how you can build your own."

Because it's fully open source, you can use its macOS and Linux sandboxing techniques directly. As an example, here is a portion of the macOS sandboxing policy. It uses a language called Seatbelt, which Apple has bundled into the operating system since Leopard and which is candidly somewhat hard to find documentation for. This was definitely an area where we used our own models — deep research — to understand the bounds of the different examples people have created, and we were heavily inspired by Chromium, which also uses Seatbelt as its sandboxing mechanism on macOS. Separately — and you'll notice this part is now in Rust — we tapped our own security teams to build out our Linux sandboxing, which uses both seccomp and Landlock in order to get an unprivileged sandbox and prevent privilege escalation. (We can take questions on this afterwards.)

Next, we have disabling internet access. This is really important when it comes to prompt injection, which again is a primary exfiltration risk, and we support two modes, both in Codex in ChatGPT and in the CLI. In the CLI we have a full-auto mode, where effectively we define a sandbox in which the agent can only read and write files within the directory it's run in, and can only make network calls for commands you explicitly auto-approve. Otherwise it runs in that fully sandboxed, locked-down environment, which lets the agent go and test — run pytest, run npm test — without second-order consequences. And in Codex in ChatGPT, we just launched — yesterday, or maybe two days ago — the ability to turn on internet access, but it comes with a set of configurable allowlists. This is really important whether you're using or building agents yourself: make sure you have both the maximum-security option and a more flexible option, so people can define whatever policy makes sense for their use case. We even let you define which HTTP methods are allowed, along with a warning about the risks.

To give you an example — and we actually link to this from those docs — let's say my prompt is "fix this issue" with a link to a GitHub issue. That seems pretty innocuous, but the GitHub issue, which could be user-generated content, says to grab the last commit and POST it to some random URL. Because Codex is trained heavily on instruction following and tries to do exactly what you ask, it will go ahead and do that.
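A simplified illustration of the macOS approach: wrap each command in `sandbox-exec` with a Seatbelt profile that denies everything by default, allows reads, and restricts writes to the working directory. This is a hedged sketch under those assumptions, not the actual Codex CLI policy, which is considerably more detailed (and note that `sandbox-exec` is deprecated by Apple, though still present).

```python
# Run a command under a minimal Seatbelt (SBPL) profile via sandbox-exec on macOS.
import subprocess
from pathlib import Path

SEATBELT_POLICY = """
(version 1)
(deny default)
(allow process-exec)
(allow process-fork)
(allow sysctl-read)
(allow file-read*)                      ; reads are broadly allowed
(allow file-write* (subpath "{cwd}"))   ; writes only inside the project directory
; network access stays denied because nothing above allows it
"""

def run_sandboxed(command: str, cwd: Path):
    policy = SEATBELT_POLICY.format(cwd=str(cwd.resolve()))
    return subprocess.run(
        ["sandbox-exec", "-p", policy, "bash", "-lc", command],
        cwd=cwd, capture_output=True, text=True,
    )

print(run_sandboxed("pytest -q", Path(".")).stdout)
```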
Now, one way we can control that is at the model level — flagging things that look like they could be suspicious, which is definitely an area of model training we're actively focusing on — but ultimately your most deterministic and authoritative control is going to be a system-level control. It shouldn't even be able to make a call to httpbin in this case. Combining those model-level controls with your system-level configuration is really the key to solving this problem.

And finally, there's requiring human review. This is something I see getting a lot of attention from folks who are using LLMs and coding agents: the new problem when you're prompting these agents is that there's just so much code you end up having to review. Using PR-review or other code-review tools with LLMs in the loop, while useful, is not a substitute for a human actually going in and reviewing the operations the model is about to perform — making sure the model didn't install a package that's not well known, or that's off by one character, and that such a package doesn't land in your codebase and later get run in a privileged environment.

And since this doesn't just apply to coding agents, we also have Operator as an example. There are different techniques you can use; in that case we have both a domain allowlist and a monitor in the loop identifying any potentially sensitive operations the model might take on your behalf, plus what we call watch mode, where we ensure a human is actually reviewing the actions it takes. So again, balancing maximum security with maximum flexibility is really important here.

As an example of how to think about actually building these agents: where previously you might have had a loop doing a bunch of software-based logic, now you can defer most of that logic to the reasoning model and give it the right tools to accomplish the task. We released this exact tool — local shell, as it's called in the API — which is exactly the way we train our models to write and execute code. We also released tools like apply_patch; models aren't particularly good at getting line numbers correct in something like a git diff, so we provide this format for applying diffs to files. And then, of course, there are the more standard tools: MCP, web search, and so on.

I'll give an example of how you can use these in combination. Socket, a dependency-vulnerability-checking service, now has an MCP server. You can expose that to the agent so it can verify whether a dependency it's about to install could be vulnerable or suspicious — either as part of the model's own operations, or as a system-level check you apply after the rollout has completed — to make sure any dependencies it's going to install are actually safe. But again, one thing we'd emphasize is to use a remote container.
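Here is a minimal sketch of that post-rollout, system-level check: before a run's diff is accepted, scan it for newly added dependencies and reject anything outside a reviewed allowlist. The allowlist, the focus on `requirements.txt`, and the `check_rollout` helper are all illustrative; in practice you might forward the extracted package names to a service like Socket (for example via its MCP server) rather than comparing against a hard-coded set.

```python
# Deterministic post-rollout check: block diffs that add unreviewed dependencies.
import re
import subprocess

APPROVED_PACKAGES = {"requests", "numpy", "pytest"}   # hypothetical team allowlist

def added_requirements(base: str = "main") -> set[str]:
    """Collect package names on '+' lines touching requirements.txt since `base`."""
    diff = subprocess.run(
        ["git", "diff", base, "--", "requirements.txt"],
        capture_output=True, text=True, check=True,
    ).stdout
    names = set()
    for line in diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            m = re.match(r"\+([A-Za-z0-9_.-]+)", line.strip())
            if m:
                names.add(m.group(1).lower())
    return names

def check_rollout() -> None:
    suspicious = added_requirements() - APPROVED_PACKAGES
    if suspicious:
        # System-level control: block the diff instead of trusting the model's judgment.
        raise SystemExit(f"Unreviewed dependencies added by the agent: {sorted(suspicious)}")

check_rollout()
```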
We're also releasing a container service as part of our Agents SDK and the Responses API, so you can run the agent locally, run it in your own environment, or let OpenAI host it for you.

So, as a recap: I'd strongly recommend sandboxing these agents, whether through containerization or OS-level sandboxing, and disabling or limiting internet access. There's a balance between capability — where you want to let the agent just run and do its own thing for as long as possible, which you can do when networking is fully disabled — and flexibility, where you want it to go out and read docs or install packages. We give you that flexibility, but be really thoughtful about when you employ each. And finally, require human review. This is definitely an area where we expect a lot more research: employing LLM-based monitors in the loop, while valuable, isn't quite there yet in terms of the certainty you get from a deterministic control.

In that vein, there's more tooling we plan to release here, so stay tuned in the codex repo on the OpenAI org, and there's more documentation we plan to publish around both the ML-based interventions and the system-level controls. If you're interested in working on problems like this, we're hiring for this new team, Agent Robustness and Control — and if you also write Rust, we're hiring for Codex CLI as well, to build out more of those integrations and make sure everyone can benefit from them. So if you're interested, or you know someone who would be, definitely let us know. With that, thank you so much.