Infra that fixes itself, thanks to coding agents — Mahmoud Abdelwahab, Railway
Channel: aiDotEngineer
Published at: 2025-11-24
YouTube video id: Q5IVm_CxN2w
Source: https://www.youtube.com/watch?v=Q5IVm_CxN2w
Your app's infrastructure should fix itself. Let me show you.

Right now I'm on the Railway dashboard, and I have a bunch of services deployed. All of these services have one thing in common: they all have bugs and problems. For example, this service has a memory leak. If I click on it and go to Metrics, we can see memory utilization keeps climbing, and quickly. That's a classic sign of a memory leak, and I'm pretty sure the service would eventually crash. If I look at the requests, we have a high number of 500s, so the server is failing to respond. We have a request error rate of 94%, and an extremely high response time of multiple seconds, which is not ideal. If this were a service running in production, everything would be on fire: you'd be getting paged, and you'd be scrambling to bring the service back up quickly.

But not all problems are this obvious. For example, this other service just queries a Postgres database. If we go to Metrics, CPU utilization seems fine. Memory usage is also fine; sure, it's a bit spiky, but okay, whatever. We have some failures, but nothing too alarming. The request error rate is somewhat high, which should make us want to investigate, but the response time is extremely high. That's because the service makes queries that are super slow. And if you're an end user trying to use this app, you would just suffer: you'd wait 30 seconds for a page to load, which would be a nightmare.

So when you deploy your app to production, bugs and issues make their way in; things happen. The typical way of dealing with this is to set up a bunch of thresholds, say for CPU or memory utilization, or a ceiling the request error rate shouldn't exceed. When a threshold is crossed, you get alerted and you're aware there's an issue, but you still have to do the investigation yourself. You have to dig through logs, metrics, and traces to piece a picture together in your head so you can ship a fix.

What I'm proposing is that you should have a coding agent that monitors the state of your project and your application's infrastructure. If any issue is detected, that is, any of the thresholds we define are met, a fix should just get shipped. Instead of getting the alert and investigating, you review a pull request, say "looks good to me," ship it, and crisis averted.

Today I'm going to show you a demo that paints a picture of how this could be achieved. At a high level, I want a series of workflows that take me from "issue detected on Railway" (my deployment provider) to "pull request open in my GitHub repo." Here's what I have in mind. The first workflow runs on a schedule, say every 10, 15, or 30 minutes. The first thing it does is fetch the application's architecture, so we have an understanding of what's deployed: which frontends, backends, crons, and queues are live in my project.
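A minimal sketch of that scheduled skeleton, assuming Inngest (the durable-workflow tool the demo adopts shortly). Railway does expose a GraphQL API, but the endpoint, query, and response shape below are assumptions for illustration, not the real schema:

```ts
import { Inngest } from "inngest";

const inngest = new Inngest({ id: "railway-autofix" });

// Illustrative only: the endpoint, query, and response shape are
// assumptions, not Railway's actual GraphQL schema.
async function fetchProjectArchitecture(projectId: string) {
  const res = await fetch("https://backboard.railway.app/graphql/v2", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.RAILWAY_API_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: `query { project(id: "${projectId}") {
        services { edges { node { id name } } }
      } }`,
    }),
  });
  return res.json();
}

// Workflow 1: run on a schedule and fetch what's deployed.
export const monitorProjectHealth = inngest.createFunction(
  { id: "monitor-project-health" },
  { cron: "*/15 * * * *" }, // e.g. every 15 minutes
  async ({ step }) => {
    const architecture = await step.run("get-project-architecture", () =>
      fetchProjectArchitecture(process.env.RAILWAY_PROJECT_ID!)
    );
    // Later steps fetch per-service resource and HTTP metrics and
    // compare them against thresholds (described next).
    return architecture;
  }
);
```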
From there, I want to fetch each service's resource metrics (CPU and memory utilization) and each service's HTTP metrics: the request error rate and the number of failed requests, the 400s and 500s. Once that's done, I want to see which services have exceeded which thresholds, and then return a list of the affected services. That's the goal of the first workflow.

Now, you might be wondering: why not make this an alert-based system? We could configure webhooks for alerts and have those kick off the workflow instead. I'd argue it's better to analyze a slice of time rather than react to a single threshold being met, because alerts can get pretty noisy. Imagine you have a spiky workload and you hit 80% CPU utilization, but things are still fine. On its own, that's not enough to warrant an investigation; when you look at the bigger picture and all the details, it might turn out there's no issue at all.

Once we have this list of impacted services, we want to pull in even more context for them. At a high level, we look at project health across all services: is everything operating as expected? If we're suspicious about something, we pull all the additional context for that service. Because imagine you have high resource utilization: maybe you're just successful and have high usage, and when you pull the logs, everything seems fine, there aren't any errors. Well, then you're good. And you can imagine pulling even more context: maybe we scan the code in the repo, infer the upstream providers the repo relies on, and automatically check the status pages of those services. Imagine a payment processor goes down; that's how you'd know, and the coding agent could then tell you, "hey, you should just wait out this issue."

Once we have all this information, we can write a detailed plan. Say we see a high number of 500s, very high memory utilization, and errors specifying that a specific endpoint is failing. That's enough information to write a detailed plan: this is my application's architecture, these are the affected services, here's what's wrong. We then give this plan to an agent, and the agent follows the process: clone the repo, create a to-do list based on the plan, implement the fixes, and open a pull request. And that's how we go from issue detected to an open pull request.

So let's actually see this in practice. Because we have the idea of workflows, what I want to use is what's known as durable execution. Durable workflows have been around for a while, and it's really one of my favorite abstractions because it can simplify complex logic while making it more reliable. For example, here we have a workflow; this one happens to use Inngest, but there are lots of solutions out there that do pretty much the same thing. We have a function called processVideoUpload.
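A minimal sketch of what that function looks like with Inngest's TypeScript SDK, reusing the `inngest` client from the earlier sketch; the three integrations are stand-ins, not specific products:

```ts
// Stand-ins for the real integrations (transcription API, LLM, database).
declare function transcribeVideo(url: string): Promise<string>;
declare function summarizeTranscript(transcript: string): Promise<string>;
declare function saveToDatabase(
  videoId: string,
  transcript: string,
  summary: string
): Promise<void>;

export const processVideoUpload = inngest.createFunction(
  { id: "process-video-upload" },
  { event: "video.uploaded" },
  async ({ event, step }) => {
    // Each step is retried automatically on failure, and a successful
    // step's result is cached, so a retried run resumes where it left
    // off instead of repeating completed work.
    const transcript = await step.run("generate-transcript", () =>
      transcribeVideo(event.data.videoUrl)
    );
    const summary = await step.run("generate-summary", () =>
      summarizeTranscript(transcript)
    );
    await step.run("store-results", () =>
      saveToDatabase(event.data.videoId, transcript, summary)
    );
    return { summary };
  }
);
```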
It listens for a video-uploaded event, and we want to do three things. First, generate a transcript by making an API call to a third-party API. Once we get that transcript, generate a summary by making a request to an LLM provider. And once we have the transcript and the summary, store them in the database. The thing is, none of these steps is 100% guaranteed to work; they're all prone to failure. What's neat about this pattern is that, by default, failed steps are automatically retried; you don't even have to think about it. And if you want to modify that behavior, say retry on a certain schedule like exponential backoff, or define something else that should happen on failure, you can. But what's really neat is that when a step succeeds, its result is cached. So if we transcribe the video correctly and summarize the transcript correctly but fail to write to the database, retrying the workflow just continues where we left off. We won't repeat any work, which is awesome: one, it's faster, and two, it's more cost-effective.

At a high level, this is the thing I'll be relying on in my code, because I'll be making API calls to the Railway API to fetch the project architecture, all the resource metrics, the HTTP metrics, and so on. So that's the first piece we needed to talk about.

The second piece is the coding agent, and for that I'll be using OpenCode. OpenCode is an AI agent built for the terminal. You can think of it as an alternative to something like Claude Code, but the main difference is that OpenCode is fully open source, and you can choose any LLM provider or model you like, which is pretty nice. You get a nice terminal UI, but honestly, what's so cool about the project is how it's architected. If you go to their docs, they have a server implementation: you can run a headless server that exposes an API for interacting with an agent. The way it works is that when you run the `opencode` command, which is what starts the agent in your terminal, it doesn't just run a single app. It starts a terminal UI and a server. And because the terminal UI is just a client, we can bring our own client and talk to the server directly, which is awesome. Now we can run OpenCode on a server, in this case on Railway, and give that server all the tools the agent needs: we install the necessary tooling, we configure git, and then the agent can open pull requests, traverse the file system, and do everything else. Let me show you how easy it is to have this deployed on Railway.
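Before the walkthrough, here's roughly what the server half looks like; a minimal sketch assuming OpenCode's TypeScript SDK (the exact option names may differ from the current SDK):

```ts
import { createOpencodeServer } from "@opencode-ai/sdk";

// Start a headless OpenCode server. The terminal UI is just one client
// of this API, so we can bring our own client and talk to it directly.
const server = await createOpencodeServer({
  hostname: "0.0.0.0", // listen on all interfaces when deployed
  port: 4096,
});

console.log(`OpenCode server listening at ${server.url}`);
```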
So if you go to the code, this is my project; it's called railway-autofix (I know, great name). I have two directories: one for my API, the other for OpenCode. The OpenCode piece is really just a single server running on Bun, and all we're doing is calling a function called createOpencodeServer. If I stop it here, you can see it runs on port 4096, and that's pretty much all we need. I also have a Dockerfile, and in it we define the agent's environment. We install a bunch of tools: you can see we're installing curl, jq, bash, all the other tools, even git. We install the GitHub CLI, which is what will allow us to open pull requests against a given repo. We then install OpenCode in the environment, configure git, and at the end we expose the port and authenticate the GitHub CLI, which is pretty neat. By the way, the code will be linked somewhere down below. That's really it for OpenCode.

As for the actual API, let me run it. So now the OpenCode server is running, and here my actual API is running on localhost:3000. I also have a UI provided by Inngest, which is very useful for debugging. If I go to Functions, each function here is a workflow, and it has a bunch of steps. Let's run it and see what happens. In production, the monitor-project-health workflow would run on a schedule; if an issue is detected, it calls pull-service-context, and pull-service-context calls the workflow that generates a fix. So if we kick things off, this is exactly how the flow happens: monitor-project-health ran, then pull-service-context, and now generate-fix is running because we detected an issue. We're also setting the Railway-specific variables as environment variables; all of these are available on Railway automatically, which is pretty neat.

If I go into monitor-project-health, you'll see a bunch of steps. The first one gets the project architecture, and we can see its output: all the databases in my project (I just have one), a list of all the services and their configuration, where each service's repo lives, and any volumes that exist, which is cool. We now have a high-level overview of our application's infrastructure. Then there's a series of steps running in parallel, so things stay efficient. We get the database resources: what's the max CPU? About 0.9 vCPU, okay. Same thing for memory. And we have a summary, which is just us formatting these results so we can pass them to the coding agent. You can see CPU usage: average 0.93 vCPU, plus the max, and memory usage as well. This one is actually high, and we'll be able to tell why: memory usage is 31.96 GB out of a 32 GB maximum. And then we pull even more resources.
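Incidentally, this is the kind of summarized data the threshold check from the first workflow operates on. A rough sketch of that check; the field names and limits are illustrative, not Railway's actual metrics schema:

```ts
// Illustrative summary shape; not Railway's actual response schema.
interface ServiceMetricsSummary {
  serviceId: string;
  avgCpuVcpu: number;
  maxMemoryGb: number;
  memoryLimitGb: number;
  errorRatePct: number; // share of requests returning 4xx/5xx
  p95LatencyMs: number;
}

// Flag services that exceed any threshold, e.g. 31.96 GB of 32 GB.
function findAffectedServices(summaries: ServiceMetricsSummary[]) {
  return summaries.filter(
    (s) =>
      s.maxMemoryGb / s.memoryLimitGb > 0.9 ||
      s.errorRatePct > 5 ||
      s.p95LatencyMs > 2000
  );
}
```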
Because we have multiple services, we run each step once per service; for example, we pull the HTTP metrics for each of the three services we have deployed. For HTTP metrics we can see the error rate percentage for 400s and 500s, the latency, and a status-code count. So again we produce a summary: these are the request error rates, these are the latencies. And this way, toward the end of this workflow (if I go back to Runs), we hand the pull-service-context function all of this information in a nicely formatted way.

If I go to that function's run, you'll see we're fetching the HTTP logs, build logs, and deployment logs for all the affected services. Here's the function payload, the data we passed from the previous function, and we have all this info, including an architecture summary. We can expand it: the architecture summary is just nicely formatted text saying, this is the project architecture, we have three services running in the production environment, we have one database, we have these volumes, and so on. It's just harder to read here because it's on one line, but the coding agent will get it as markdown.

Now that we have that, back to Runs: we make a call to another workflow, generate-fix. First, it analyzes everything with AI (the actual input is a bit large to render here). You can imagine we hand a large language model all of it: this is my project architecture, this is the data, this is how things are performing. It takes all of that information and comes up with a plan. You can see debugging steps here: reproduce locally with the same load, run it, and if the agent hits an error, fix it. Then there are recommendations. This is the plan we then pass to our coding agent. After that, there's a step to create a session; on the coding agent, you can imagine each session being its own chat. If you have multiple repos, each repo gets its own session. The coding agent does its work, and at the end, if everything goes as expected, it opens a pull request.

So that's pretty much it; this is how it works. If everything works as expected, we should see a pull request on the project. And here we go: we have an open pull request with all of our changes. If we go to the conversation, we can see a summary of all the changes, an analysis summary, the root causes, and what was fixed. So we just review it, and if everything looks good, we merge and we're good to go.
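For completeness, here's roughly what that create-a-session step could look like against the OpenCode server. The routes and payload shapes below are assumptions for illustration; check OpenCode's server docs for the real API:

```ts
const OPENCODE_URL = process.env.OPENCODE_URL ?? "http://localhost:4096";

// Hypothetical client for the OpenCode server's HTTP API.
async function runFixSession(plan: string) {
  // One session per repo: each session is its own chat with the agent.
  const session = await fetch(`${OPENCODE_URL}/session`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({}),
  }).then((r) => r.json() as Promise<{ id: string }>);

  // Hand the detailed plan to the agent. With git and the GitHub CLI
  // installed in its environment, it can clone, fix, and open a PR.
  await fetch(`${OPENCODE_URL}/session/${session.id}/message`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      parts: [{ type: "text", text: plan }],
    }),
  });
}
```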
And that's it. I hope you enjoyed this talk as much as I enjoyed making it. If you have any questions, feel free to reach out to me on X/Twitter; that's where I mostly hang out. Also, the repo for this project will be available somewhere down below, so make sure to check it out. And with that, thank you so much for watching, and I'll see you in the next one.