OpenAI on Securing Code-Executing AI Agents — Fouad Matin (Codex, Agent Robustness)
Channel: aiDotEngineer
Published at: 2025-07-30
YouTube video id: w7IMuYsBNr8
Source: https://www.youtube.com/watch?v=w7IMuYsBNr8
Hi everyone, I'm Fouad, and I'm here to talk about safety and security for code-executing agents. A little intro about myself: I started on the OpenAI security team after running a security startup for about six years, and I now work on agent robustness and control as part of post-training. One of the things I've worked on over the last couple of months is Codex and Codex CLI, our open-source library for running Codex directly on your computer. There's a lot we learned in building Codex that I'm excited to share with you all, but there's definitely a lot more work for us to do, and I'm excited to hear what you think afterwards.

One high-level point I want to start with is that every frontier research lab is focused on pushing the benchmarks around coding — and not just the benchmarks, but also the usability and real-world deployability of these agents. They're making models really good at writing and executing code, and as a result every agent will become a code-executing agent. It's not just about writing code; it's about achieving the objective most efficiently. If you look at where the models were even a year ago, o1 gave us a very early preview of what these reasoning models can do. With more recent models like o3, o4-mini, and others in the space, you see higher reliability and more capability. The new constraint isn't just "can these models do things?" but "what should they be able to do, and what should the guardrails be when you allow them to work in your environments?"

As I mentioned, code isn't just for SWE tasks — which is candidly what I thought when I started at OpenAI — it actually helps across the stack. Here's an example from our o3 release around multimodal reasoning. Previously, o1 would look at an image and try to reason about it exactly as it was given. What we've noticed with code-executing agents, even outside a SWE scenario, is that they'll run code to decipher the text on the page using OCR, or to crop images. There are some really exciting behaviors we've seen from models when you just give them the ability to run code. We didn't tell it in this prompt that it should run code; it just knew that, with that tool as an option, it could do the job more efficiently.

What I think we'll observe in building AI agents is a shift away from the complex inner loop: a model that determines what type of task the user is asking for given a prompt, then loads a more task-specific prompt and toolset, then chains a bunch of these loops together to achieve some goal — maybe asking the model "hey, are you done yet?" or telling it to keep going — and finally uses another model to respond back to the user. We generally don't need these anymore. You can just have the model decide when it should use which tools and when it should write or run code, and it can write and run that code on its own. Now, that's what we in security would call RCE — remote code execution. So when we look at these new behaviors, it's important to consider not just the capabilities, but how we ensure those capabilities don't backfire on us when we allow the model to perform those operations.
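To make that shift concrete, here is a minimal sketch of the "single loop" pattern described above, using the OpenAI Python SDK's standard function-calling interface. The `run_shell` tool, model name, and limits are illustrative choices, not the tools Codex itself uses; the point is simply that the model decides when to run a command, and the loop ends when it stops asking to. Executing model-written commands directly on the host, as this sketch does, is exactly the RCE concern — the sandboxing discussed later in the talk is about closing that gap.

```python
# Minimal "let the model decide" loop: one shell tool, no task routing.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

SHELL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the working directory and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}

def run_agent(task: str, model: str = "gpt-4.1") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        resp = client.chat.completions.create(model=model, messages=messages, tools=[SHELL_TOOL])
        msg = resp.choices[0].message
        if not msg.tool_calls:                  # model decided it is done
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:             # model decided to run code
            cmd = json.loads(call.function.arguments)["command"]
            # WARNING: unsandboxed execution of model-written commands.
            out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": (out.stdout + out.stderr)[-4000:],
            })
```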
There are a couple of different ways we've observed that models can go wrong. The most common one — something we think about consistently — is prompt injection and data exfiltration. There are a lot of examples we'll be documenting in the coming months, but that's probably number one in our priority queue. Then you have cases where the agent just makes a mistake: maybe it installs a malicious package unintentionally, or it writes vulnerable code, again unintentionally. And then there's privilege escalation or sandbox escape.

When we think about our responsibility in deploying these agents, both internally and externally, we have the Preparedness Framework, where we document recommendations and the standards we hold ourselves to. The one I want to emphasize is requiring safeguards to avoid misalignment at large-scale deployment. This is something we think about when building Codex, but it's also something that organizations deploying coding agents into the workplace should be considering.

One of the first safeguards we put in place is to sandbox the agent, especially if you're running it locally. Generally the best method is just to give it its own computer. That's what we did with Codex in ChatGPT: it spins up a fully isolated container and produces a PR at the end, which is practically as safe as you can get. But if you are going to run it locally — which, with Codex CLI, we also encourage — make sure you're providing the correct level of sandboxing, whether that's containerization, app-level sandboxing (which we'll talk about in a moment), or OS-level sandboxing, so the right guardrails are in place even if the model attempts to do something wrong.

Related to that is disabling or limiting internet access. This is probably the highest-probability vector for prompt injection or data exfiltration: the model goes to read some docs or a GitHub issue, and in a comment on that issue there's a prompt injection. That untrusted content leaks into the core inner loop that you trust the agent to run code in, and if the agent has access to your codebase or other sensitive material, that could be pretty bad.

And finally, review the operations — or the actual final diffs — the agent produces, whether that's code review on a GitHub PR or approvals and confirmations. Those guardrails are really important: ensuring that humans stay in control of these systems is one of the strongest mitigations we have. Of course, no one wants to sit there clicking "approve" all day, so YOLO mode on one end is something to avoid, but having to approve every single `ls` command isn't practical either.

So let's talk a little bit about how to actually achieve this. As I mentioned, our recommendation is to give the agent its own computer — you see this with Codex in ChatGPT. There are a lot of constraints to apply when you think about that: making sure the agent has all of its dependencies installed and all of the access it needs to perform its actions.
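A hedged sketch of one way to apply that locally: run each agent command in a throwaway Docker container that acts as the agent's "own computer," with networking disabled and only the project directory mounted. The image name, resource limits, and flags here are illustrative defaults, not the configuration Codex or Codex CLI actually uses.

```python
# Run a command in a disposable, network-disabled container scoped to the project.
import subprocess
from pathlib import Path

def run_in_sandbox(command: str, project_dir: Path, image: str = "python:3.12-slim"):
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",             # no internet: blocks exfiltration and injected fetches
        "--cap-drop", "ALL",             # drop Linux capabilities
        "--pids-limit", "256",
        "--memory", "1g",
        "-v", f"{project_dir.resolve()}:/workspace",  # agent can only touch the project
        "-w", "/workspace",
        image, "bash", "-lc", command,
    ]
    return subprocess.run(docker_cmd, capture_output=True, text=True, timeout=300)

# Example: let the agent run the test suite without second-order consequences.
result = run_in_sandbox("pytest -q", Path("."))
print(result.stdout)
```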
If you want to run it locally, you can use something like Codex CLI, which we fully open-sourced, to build these agents yourself. You can use it as a reference point — that's part of why we wanted to open-source it: not only "here's the agent we built for you," but also "here's how you can build your own."

Because it's fully open source, you can use its macOS and Linux sandboxing techniques directly. As an example, here is a portion of the macOS sandboxing policy. It uses a language called Seatbelt, which Apple has bundled into the operating system since Leopard and which is candidly somewhat hard to find documentation for. This was definitely an area where we used our own models — deep research — to understand the bounds of the different examples people have created, and we were heavily inspired by Chromium, which also uses Seatbelt as its sandboxing mechanism on macOS. Separately — and you'll notice this part is now in Rust — we tapped our own security teams to build out our Linux sandboxing, which uses both seccomp and Landlock in order to get an unprivileged sandbox and prevent privilege escalation. (We can take questions on this afterwards.)

Next, we have disabling internet access. This is really important when it comes to prompt injection, which again is a primary exfiltration risk, and we support two modes, both in Codex in ChatGPT and in the CLI. In the CLI we have a full-auto mode, where effectively we define a sandbox in which the agent can only read and write files within the directory it's run in, and can only make network calls for commands you explicitly auto-approve. Otherwise it runs in that fully sandboxed, locked-down environment, which lets the agent go and test — run pytest, run npm test — without second-order consequences. And in Codex in ChatGPT, we just launched — yesterday, or maybe two days ago — the ability to turn on internet access, but it comes with a set of configurable allowlists. This is really important whether you're using or building agents yourself: make sure you have both the maximum-security option and a more flexible option, so people can define whatever policy makes sense for their use case. We even let you define which HTTP methods are allowed, along with a warning about the risks.

To give you an example — and we actually link to this from those docs — let's say my prompt is "fix this issue" with a link to a GitHub issue. That seems pretty innocuous, but the GitHub issue, which could be user-generated content, says to grab the last commit and POST it to some random URL. Because Codex is trained heavily on instruction following and tries to do exactly what you ask, it will go ahead and do that.
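A simplified illustration of the macOS approach: wrap each command in `sandbox-exec` with a Seatbelt profile that denies everything by default, allows reads, and restricts writes to the working directory. This is a hedged sketch under those assumptions, not the actual Codex CLI policy, which is considerably more detailed (and note that `sandbox-exec` is deprecated by Apple, though still present).

```python
# Run a command under a minimal Seatbelt (SBPL) profile via sandbox-exec on macOS.
import subprocess
from pathlib import Path

SEATBELT_POLICY = """
(version 1)
(deny default)
(allow process-exec)
(allow process-fork)
(allow sysctl-read)
(allow file-read*)                      ; reads are broadly allowed
(allow file-write* (subpath "{cwd}"))   ; writes only inside the project directory
; network access stays denied because nothing above allows it
"""

def run_sandboxed(command: str, cwd: Path):
    policy = SEATBELT_POLICY.format(cwd=str(cwd.resolve()))
    return subprocess.run(
        ["sandbox-exec", "-p", policy, "bash", "-lc", command],
        cwd=cwd, capture_output=True, text=True,
    )

print(run_sandboxed("pytest -q", Path(".")).stdout)
```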
Now, one way we can control that is at the model level — flagging things that look like they could be suspicious, which is definitely an area of model training we're actively focusing on — but ultimately your most deterministic and authoritative control is going to be a system-level control. It shouldn't even be able to make a call to httpbin in this case. Combining those model-level controls with your system-level configuration is really the key to solving this problem.

And finally, there's requiring human review. This is something I see getting a lot of attention from folks who are using LLMs and coding agents: the new problem when you're prompting these agents is that there's just so much code you end up having to review. Using PR-review or other code-review tools with LLMs in the loop, while useful, is not a substitute for a human actually going in and reviewing the operations the model is about to perform — making sure the model didn't install a package that's not well known, or that's off by one character, and that such a package doesn't land in your codebase and later get run in a privileged environment.

And since this doesn't just apply to coding agents, we also have Operator as an example. There are different techniques you can use; in that case we have both a domain allowlist and a monitor in the loop identifying any potentially sensitive operations the model might take on your behalf, plus what we call watch mode, where we ensure a human is actually reviewing the actions it takes. So again, balancing maximum security with maximum flexibility is really important here.

As an example of how to think about actually building these agents: where previously you might have had a loop doing a bunch of software-based logic, now you can defer most of that logic to the reasoning model and give it the right tools to accomplish the task. We released this exact tool — local shell, as it's called in the API — which is exactly the way we train our models to write and execute code. We also released tools like apply_patch; models aren't particularly good at getting line numbers correct in something like a git diff, so we provide this format for applying diffs to files. And then, of course, there are the more standard tools: MCP, web search, and so on.

I'll give an example of how you can use these in combination. Socket, a dependency-vulnerability-checking service, now has an MCP server. You can expose that to the agent so it can verify whether a dependency it's about to install could be vulnerable or suspicious — either as part of the model's own operations, or as a system-level check you apply after the rollout has completed — to make sure any dependencies it's going to install are actually safe. But again, one thing we'd emphasize is to use a remote container.
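Here is a minimal sketch of that post-rollout, system-level check: before a run's diff is accepted, scan it for newly added dependencies and reject anything outside a reviewed allowlist. The allowlist, the focus on `requirements.txt`, and the `check_rollout` helper are all illustrative; in practice you might forward the extracted package names to a service like Socket (for example via its MCP server) rather than comparing against a hard-coded set.

```python
# Deterministic post-rollout check: block diffs that add unreviewed dependencies.
import re
import subprocess

APPROVED_PACKAGES = {"requests", "numpy", "pytest"}   # hypothetical team allowlist

def added_requirements(base: str = "main") -> set[str]:
    """Collect package names on '+' lines touching requirements.txt since `base`."""
    diff = subprocess.run(
        ["git", "diff", base, "--", "requirements.txt"],
        capture_output=True, text=True, check=True,
    ).stdout
    names = set()
    for line in diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            m = re.match(r"\+([A-Za-z0-9_.-]+)", line.strip())
            if m:
                names.add(m.group(1).lower())
    return names

def check_rollout() -> None:
    suspicious = added_requirements() - APPROVED_PACKAGES
    if suspicious:
        # System-level control: block the diff instead of trusting the model's judgment.
        raise SystemExit(f"Unreviewed dependencies added by the agent: {sorted(suspicious)}")

check_rollout()
```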
We're also releasing a container service as part of our Agents SDK and the Responses API, so you can run the agent locally, run it in your own environment, or let OpenAI host it for you.

So, as a recap: I'd strongly recommend sandboxing these agents, whether through containerization or OS-level sandboxing, and disabling or limiting internet access. There's a balance between capability — where you want to let the agent just run and do its own thing for as long as possible, which you can do when networking is fully disabled — and flexibility, where you want it to go out and read docs or install packages. We give you that flexibility, but be really thoughtful about when you employ each. And finally, require human review. This is definitely an area where we expect a lot more research: employing LLM-based monitors in the loop, while valuable, isn't quite there yet in terms of the certainty you get from a deterministic control.

In that vein, there's more tooling we plan to release here, so stay tuned in the codex repo on the OpenAI org, and there's more documentation we plan to publish around both the ML-based interventions and the system-level controls. If you're interested in working on problems like this, we're hiring for this new team, Agent Robustness and Control — and if you also write Rust, we're hiring for Codex CLI as well, to build out more of those integrations and make sure everyone can benefit from them. So if you're interested, or you know someone who would be, definitely let us know. With that, thank you so much.