Infra that fixes itself, thanks to coding agents — Mahmoud Abdelwahab, Railway

Channel: aiDotEngineer

Published at: 2025-11-24

YouTube video id: Q5IVm_CxN2w

Source: https://www.youtube.com/watch?v=Q5IVm_CxN2w

Your app's infrastructure should fix itself. Let me show you. Right now I'm on the Railway dashboard and I have a bunch of services deployed, and all of these services have one thing in common: they all have bugs and problems. For example, this service has a memory leak. If I click on it and go to metrics, we can see memory utilization keeps climbing, and quickly. That's a classic sign of a memory leak, and the service would eventually crash. If I look at the requests, we have a high number of 500s, so the server is failing to respond. We have a request error rate of 94%, and an extremely high response time of multiple seconds, which is not ideal. If this were a service running in production, everything would be on fire. You'd be getting paged and scrambling to bring the service back up quickly.
But not all problems are this obvious. For example, all this service does is query a Postgres database. If we go to metrics, CPU utilization seems fine. Memory usage is also fine; sure, it's a bit spiky, but okay, whatever. We have some failures, but nothing too alarming. The request error rate is somewhat high, which should make us want to investigate, but the response time is extremely high. That's because the service makes queries that are super slow. And if you're an end user trying to use this app, you would just suffer: you'd wait something like 30 seconds for a page to load, which would be a nightmare.
So here's the thing: when you deploy your app to production, some bugs or issues make their way in; things happen. The typical way of dealing with this is to set up a bunch of thresholds, say for CPU or memory utilization, or a threshold the request error rate shouldn't exceed. When one of those thresholds is crossed, you get alerted and you're aware there's an issue, but you still have to do the investigation yourself. You have to dig through logs, metrics, and traces to piece things together into a picture so you can ship a fix.

What I'm proposing is that you should have a coding agent that monitors the state of your project and your application's infrastructure. If any issue is detected, meaning any of the thresholds we defined are crossed, a fix just gets shipped. Instead of getting the alert and investigating, you review a pull request, say "looks good to me," ship it, and crisis averted.

So today I'm going to show you a demo that paints a picture of how this could be achieved. At a high level, I want a series of workflows that take me from an issue being detected on Railway, my deployment provider, to a pull request being opened in my GitHub repo. Here's what I have in mind.
The first workflow runs on a schedule, say every 10, 15, or 30 minutes. It does a few things. First, it fetches the application's architecture, so we understand what's deployed: which frontends, backends, crons, and queues are live in my project. Then it fetches each service's resource metrics (CPU and memory utilization) and each service's HTTP metrics: the request error rate and the number of failed requests, the 400s and 500s. Once that's done, it checks which services have exceeded which thresholds and returns a list of the affected services. That's essentially the goal.
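As a sketch, here's roughly what that scheduled workflow could look like written with Inngest, the durable-workflow tool I'll introduce in a minute. The Railway API helpers (`fetchArchitecture`, `fetchResourceMetrics`, `fetchHttpMetrics`) and the threshold values are hypothetical stand-ins, not the real implementation:

```ts
import { Inngest } from "inngest";
// Hypothetical helpers wrapping Railway's API -- not shown here.
import { fetchArchitecture, fetchResourceMetrics, fetchHttpMetrics } from "./railway";

const inngest = new Inngest({ id: "railway-autofix" });

export const monitorProjectHealth = inngest.createFunction(
  { id: "monitor-project-health" },
  { cron: "*/15 * * * *" }, // run on a schedule: every 15 minutes
  async ({ step }) => {
    // 1. Fetch the application's architecture: which frontends, backends,
    //    crons, and queues are live in the project.
    const architecture = await step.run("get-project-architecture", () =>
      fetchArchitecture(process.env.RAILWAY_PROJECT_ID!)
    );

    // 2. Fetch each service's resource metrics and HTTP metrics in parallel.
    const metrics = await Promise.all(
      architecture.services.map((service: { id: string }) =>
        step.run(`get-metrics-${service.id}`, async () => ({
          serviceId: service.id,
          resources: await fetchResourceMetrics(service.id),
          http: await fetchHttpMetrics(service.id),
        }))
      )
    );

    // 3. Return only the services that exceeded a threshold
    //    (the values here are example numbers).
    return metrics.filter(
      (m) =>
        m.resources.memoryUtilization > 0.9 ||
        m.http.errorRate > 0.05 ||
        m.http.p95LatencyMs > 2_000
    );
  }
);
```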
Now you might be wondering: why not make this an alert-based system? We could configure something like webhooks for alerts and have those kick off the workflow instead. I'd argue it's better to analyze a slice of time rather than react to a single threshold being crossed, because that can get pretty noisy. Imagine you have a spiky workload and you hit 80% CPU utilization, but things are otherwise fine. In my mind, that alone isn't enough reason to investigate; it might just mean there's no issue once you look at the bigger picture and all the details.
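Here's a minimal sketch of that "slice of time" idea: only flag a service when most samples in the window breach the threshold, so a single spike doesn't page anyone. All of this is illustrative, not the actual implementation:

```ts
type Sample = { timestamp: number; value: number };

// Flag a breach only if at least `ratio` of the samples in the window exceed
// the threshold. A brief spike among otherwise healthy samples won't trip it.
function sustainedBreach(samples: Sample[], threshold: number, ratio = 0.8): boolean {
  if (samples.length === 0) return false;
  const breaching = samples.filter((s) => s.value > threshold).length;
  return breaching / samples.length >= ratio;
}

// A spiky workload: one sample at 95% CPU, the rest around 40%.
const cpuWindow: Sample[] = [0.4, 0.42, 0.95, 0.38, 0.41].map((value, i) => ({
  timestamp: i,
  value,
}));
console.log(sustainedBreach(cpuWindow, 0.8)); // false -- not worth investigating
```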
Now, once we have this list of impacted services, we want to pull in even more context for them. At a high level, we looked at project health across all the services and asked: is everything operating as expected? Now that we're suspicious about something, let's pull all the additional context for that service. Imagine again you have high resource utilization: maybe you're just successful and have high usage, and when you pull the logs everything seems fine and there aren't any errors. Well, then you're good. And you can imagine pulling even more context. Maybe we scan the code in the repo, infer the upstream providers the repo relies on, and automatically check the status pages of those services. Imagine a payment processor goes down: that's how you'd know, and the coding agent could tell you, "hey, you should just wait out this issue."
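As a sketch of that status-page idea: many providers expose a machine-readable status endpoint (the Atlassian Statuspage format serves /api/v2/status.json, for instance). The URL and the rest of the details here are assumptions:

```ts
// Check an upstream provider's Statuspage-style endpoint. The base URL is a
// placeholder -- each provider hosts its own status page, if it has one at all.
async function upstreamStatus(baseUrl: string): Promise<string> {
  const res = await fetch(`${baseUrl}/api/v2/status.json`);
  const body = (await res.json()) as {
    status: { indicator: string; description: string };
  };
  // Statuspage uses "none" for healthy, else "minor" | "major" | "critical".
  return body.status.indicator;
}

// If the payment processor reports an incident, the agent can recommend
// waiting it out instead of shipping a code fix.
const indicator = await upstreamStatus("https://status.example-payments.com");
if (indicator !== "none") console.log("Upstream incident -- wait it out.");
```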
And once we have all this information, we can write a detailed plan. We can look at it and say: we have a high number of 500s, very high memory utilization, and errors showing that a specific endpoint is failing. That's enough information to write a detailed plan: this is my application's architecture, these are the affected services. We then give this plan to an agent, and the agent follows the process: clone the repo, create a to-do list based on the plan, implement the fixes, and create a pull request. That's how we go from issue detected to an open pull request.
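The handoff to the agent can be as simple as a markdown document assembled from everything we gathered. A hypothetical sketch of what building that plan prompt might look like (all field names are made up):

```ts
// Hypothetical shapes standing in for the data the workflows gathered.
type AffectedService = { name: string; repo: string; findings: string[] };

function buildPlanPrompt(
  architectureSummary: string,
  services: AffectedService[]
): string {
  return [
    "# Incident fix plan",
    "## Application architecture",
    architectureSummary,
    "## Affected services",
    ...services.map(
      (s) =>
        `### ${s.name} (${s.repo})\n` +
        s.findings.map((f) => `- ${f}`).join("\n")
    ),
    "## Instructions",
    "Clone each affected repo, create a to-do list from the findings,",
    "implement the fixes, and open a pull request.",
  ].join("\n\n");
}
```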
So let's actually see this in practice. Because we have the idea of workflows, I want to use what's known as durable execution. The idea of durable workflows has been around for a while, and it's one of my favorite abstractions because it simplifies complex logic while making it more reliable. For example, here we have a workflow. This one happens to use Inngest, but there are lots of solutions out there that do pretty much the same thing. We have a function called process video upload. It listens for a "video uploaded" event, and we want to do three things. First, generate a transcript by calling a third-party API. Once we have the transcript, generate a summary by making a request to an LLM provider. And once we have the transcript and the summary, store them in the database.

The thing is, none of these steps is 100% guaranteed to work; they're all prone to failure. What's neat about this pattern is that by default, these steps are automatically retried. You don't even have to think about it. And if you want to modify that behavior, say you want retries on a certain schedule like exponential backoff, or you want to define something else that should happen on failure, you can. But what's really neat is that each step's result is cached when it succeeds. So if we transcribe the video correctly and summarize the transcript correctly but fail to write to the database, retrying the workflow just continues where we left off. We won't repeat any work, which is awesome: one, because it's faster, but also because it's more cost-effective.
So at a high level, that's the first thing I'll be relying on in my code, because I'll be making API calls to the Railway API to fetch the project architecture, the resource metrics, and the HTTP metrics. The second thing is the coding agent. For the coding agent, I'll be using opencode. opencode is an AI agent built for the terminal. You can think of it as an alternative to something like Claude Code, but the main difference is that opencode is fully open source and you can choose any LLM provider or model you like, which is pretty nice. You get a nice terminal UI, but honestly, what's so cool about the project is how it's architected. If you go to their docs, they have a server implementation: you can run a headless server that exposes an API for you to interact with an agent.
The way it works is that when you run the opencode command, which is what starts the agent in your terminal, it doesn't just run a single app. It actually starts a terminal UI and a server. And because the terminal UI is just a client, we can bring our own client and talk to the server, which is awesome, because now we can run opencode on a server, in this case on Railway, and give that server all the tools the agent would need. We install all the necessary tooling, configure git, and then the agent can open pull requests, go through the file system, and do everything else. Let me show you how easy it is to get this deployed on Railway. Let's look at the code.
This is my project. It's called railway-autofix, I know, great name. I have two directories: one for my API, the other for opencode. For opencode, we really just have a single server running on Bun, and all we're doing is calling a function called createOpencodeServer. If I stop it here, you can see it runs on port 4096, and this is pretty much all we need.
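For reference, here's roughly what that entrypoint looks like. `createOpencodeServer` comes from the opencode SDK; treat the exact option names here as assumptions and check the opencode docs:

```ts
import { createOpencodeServer } from "@opencode-ai/sdk";

// Bind to all interfaces so the server is reachable from outside the
// container when deployed on Railway. Option names are assumptions.
const server = await createOpencodeServer({
  hostname: "0.0.0.0",
  port: 4096,
});

console.log(`opencode server listening at ${server.url}`);
```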
I also have a Dockerfile, and in it we define that environment. We install a bunch of tools: you can see we're installing curl, jq, bash, and the rest, even git. We install the GitHub CLI, which is what allows us to open pull requests against a given repo. We then install opencode in the environment, configure git, and at the end we expose the port and authenticate the GitHub CLI, which is pretty neat. By the way, the code will be linked somewhere down below. But that's really it for opencode.
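A condensed sketch of what such a Dockerfile can look like, based on the steps described above. The base image, versions, and exact commands are assumptions; the real file is in the linked repo:

```dockerfile
# Assumed base image for a Bun app; the real Dockerfile is in the linked repo.
FROM oven/bun:1

# Tools the agent needs on its server: curl, jq, bash, git.
RUN apt-get update && apt-get install -y curl jq bash git

# Install the GitHub CLI (gh), which lets the agent open pull requests.
# (Install steps elided -- gh ships through GitHub's own apt repository.)

# Install opencode into the environment.
RUN curl -fsSL https://opencode.ai/install | bash

# Give the agent a git identity for its commits.
RUN git config --global user.name "autofix-agent" && \
    git config --global user.email "agent@example.com"

COPY . /app
WORKDIR /app

# Expose the server port and authenticate the GitHub CLI on boot.
EXPOSE 4096
CMD ["sh", "-c", "echo $GITHUB_TOKEN | gh auth login --with-token && bun run index.ts"]
```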
As for the actual API, let me run it. So now the opencode server is running, and over here my API is running on localhost:3000. I also have a UI provided by Inngest, which is very useful for debugging. If I go to functions, each function here is a workflow with a bunch of steps. Let's run it and see what happens. In production, the monitor-project-health workflow would run on a schedule; if an issue is detected, it calls pull-service-context, and pull-service-context calls the workflow that generates a fix. If we kick things off, that's how the flow will happen. So now I have this function run: we called monitor-project-health, then pull-service-context, and now we're calling generate-fix because we detected an issue. We're setting the Railway-specific variables as environment variables; all of these are actually available on Railway and set automatically, which is pretty neat.
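In Inngest terms, one way to chain these workflows is for each function to send the event the next one listens for. A sketch with made-up event names and a hypothetical fetchLogs helper:

```ts
import { Inngest } from "inngest";
import { fetchLogs } from "./railway"; // hypothetical Railway API helper

const inngest = new Inngest({ id: "railway-autofix" });

// pull-service-context wakes up when monitor-project-health sends its event,
// gathers logs for the affected services, then triggers generate-fix.
export const pullServiceContext = inngest.createFunction(
  { id: "pull-service-context" },
  { event: "app/service.context.pull" },
  async ({ event, step }) => {
    const logs = await step.run("fetch-logs", () =>
      fetchLogs(event.data.affectedServices)
    );

    await step.sendEvent("trigger-generate-fix", {
      name: "app/fix.generate",
      data: { ...event.data, logs },
    });
  }
);
```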
If I go to monitor-project-health, you'll see we have a bunch of steps. The first one gets the project architecture, and we can actually see its output: all of the databases in my project (I just have one), plus a list of all the services along with their configuration, like where each service's repo lives. We now have a high-level overview of our application's infrastructure, and we can also see any volumes that are there, which is cool. Then we have a series of steps that run in parallel, so things are efficient. We're getting the database resources: we can see the max CPU, which is about 0.9 vCPU, and the same for memory. And we have a summary, which is us formatting these results so we can pass them to the coding agent. You can see CPU usage, average 0.93 vCPU, along with the max, and memory usage as well. Memory is actually high, and we can understand why: memory usage here is 31.96 GB out of a 32 GB max.
Then we pull even more resources. Because we have multiple services, we call each step per service; for example, we pull the HTTP metrics for each of the three services we have deployed. For the HTTP metrics, we can see the error rate percentage for 400s and 500s, the latency, and a status count. We also have a summary that says: these are the request error rates, these are the latencies. That way, toward the end of this workflow (if I go back to the runs), we hand the pull-service-context function all of this information in a nicely formatted way. If I go to that function's run, we can see it fetching the HTTP logs, the build logs, and the deployment logs for all the affected services. We can see the function payload, which is the data we passed from the other function, and we have all this info plus an architecture summary. We can expand it: the architecture summary is just nicely formatted text saying this is the project architecture, we have three services running in the production environment, one database, these volumes, and so on. It's just harder to read here because it's all on one line, but we'll give it to the coding agent as markdown.
Now that we have that, go to runs again: we make a call to another workflow, generate-fix. What it does is, first, analyze with AI. This is the actual output; the input is a bit large to render here. You can imagine we hand a large language model all of it: this is my project architecture, this is the data, this is how things are performing. We take all of that information and come up with a plan. You can see the debugging steps here: reproduce locally with the same load, run it, and see what happens; if the agent runs into an error, it's going to fix it. Then we have recommendations. This is the plan we pass to our coding agent. Then we have a step to create a session. On the coding agent, you can imagine each session being its own chat, so if you have multiple repos, each repo gets its own session. The coding agent does its work, and at the end, if everything goes as expected, it should open a pull request.
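The create-session step is just a call to the opencode server's API. A rough sketch using the opencode SDK client; the method names and payload shapes here are assumptions based on the SDK docs, so verify them against the current version:

```ts
import { createOpencodeClient } from "@opencode-ai/sdk";

// Point the client at the opencode server deployed on Railway.
// OPENCODE_SERVER_URL is a hypothetical environment variable.
const client = createOpencodeClient({ baseUrl: process.env.OPENCODE_SERVER_URL! });

// Stand-in for the detailed plan produced by the generate-fix workflow.
const plan = "# Incident fix plan\n...";

// One session per repo: each session is its own chat with the agent.
const session = await client.session.create({
  body: { title: "autofix: high memory usage in api service" },
});

// Hand the agent the plan. With git and the GitHub CLI available in its
// environment, it can clone the repo, implement fixes, and open a PR.
await client.session.prompt({
  path: { id: session.id },
  body: { parts: [{ type: "text", text: plan }] },
});
```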
So yeah, that's pretty much it. This is how it works. Now, if everything worked as expected, we should see a pull request on the project. And here we go: we have an open pull request with all of our changes. If we go to the conversation, we're actually able to see a summary of all the changes, an analysis summary, the root causes, and what was fixed. We should be able to just review this; if everything looks good, we merge and we're good to go. And that's it. I hope you enjoyed this talk as much as I enjoyed making it. If you have any questions, feel free to reach out to me on X (Twitter); that's where I mostly hang out. Also, the repo for this project will be available somewhere down below, so make sure to check it out. And with that, thank you so much for watching, and I'll see you in the next one.