From Chaos to Choreography: Multi-Agent Orchestration Patterns That Actually Work — Sandipan Bhaumik
Channel: aiDotEngineer
Published at: 2026-04-08
YouTube video id: 2czYyrTzILg
Source: https://www.youtube.com/watch?v=2czYyrTzILg
Hi everyone, I'm Sandy. I have spent 18 years building data systems, a major part of it focused on building and scaling distributed data systems in the cloud. I've done it for multi-tenant systems at software and SaaS companies, and then for scaling data and AI platforms in regulated industries like financial services and healthcare. I've learned a great deal about production-grade distributed systems while working at AWS and now at Databricks. For the last 2 years, I've been deploying multi-agent AI systems in production, and I have watched brilliant engineers make the same mistakes over and over. They think adding more agents is just like adding more features. It's not. It's building a distributed system. And today, I'm going to show you the patterns that actually work when you make that transition. These are lessons I have learned working in the trenches, and today I'm here to share them with you.
Here's what we're covering today. First, the problem: I'll share a very basic production war story about race conditions and why complexity explodes when you go from one agent to five agents. Then I'll talk about the patterns, choreography and orchestration, for coordinating agents. I'll talk about state management, then failure recovery and how we can design for failure in production systems. And then I'll share what a production-grade architecture looks like, in as simple a way as possible, and I'll also show you an example of how we build this on Databricks. So, let's dive into it.
You see, one agent works beautifully. You have got your LLM, some prompts, maybe a retrieval-augmented generation pipeline, maybe some tool calls. It demos great. Leadership loves it. You feel happy and your team is happy. And then product comes back with a request that changes everything: they want five more agents. And here's what happens. You think, "Okay, I know how to build agents, and I will add five more." Except now you have coordination problems. Agent A produces data that Agent B needs. Agent C is waiting on both Agent A and Agent B. Agent D just updated the shared state that Agent B was reading, and Agent E just crashed and took down the entire workflow. This is no longer an AI problem. This is a distributed systems problem. And most of you didn't sign up to be distributed systems engineers.
Let me tell you about a production deployment where this went very wrong. We built a credit decisioning system for a financial services company. The first agent, credit score calculation, worked perfectly. It worked great in demos; 2 weeks in production, zero issues. Then we added four more agents: income verification, risk assessment, fraud detection, and final approval. We deployed all five. Within 3 days, we started seeing weird approvals. 20% of the decisions had incorrect risk ratings. Customers who should have been flagged were getting approved. The business team was panicking. It took us 2 days to find out what was happening. The credit score agent calculated a score of 750 and wrote it to the database. The risk assessment agent, on the other hand, read from the database 500 milliseconds later and got a score of 680 for the same customer. Why did it happen? Because we had a caching layer for customer records. The write to PostgreSQL succeeded, but the cache was not invalidated. The risk agent read from the cache and got stale data. It used the wrong score and made the wrong decision. This is a classic distributed systems problem.
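To make that failure mode concrete, here is a tiny illustrative sketch, not the actual production code: the store class and in-memory stand-ins for the database and cache are hypothetical, but they show how a write that skips cache invalidation produces exactly the stale read described above.

```python
# Illustrative sketch of the stale-cache failure mode (hypothetical names,
# not the real credit decisioning code).
import time


class CachedCustomerStore:
    """Read-through cache in front of the database, with a buggy write path."""

    def __init__(self):
        self.db = {}     # stand-in for PostgreSQL
        self.cache = {}  # stand-in for the caching layer

    def write_score(self, customer_id, score):
        # BUG: the write goes to the database, but the cache entry for this
        # customer is never invalidated.
        self.db[customer_id] = score
        # self.cache.pop(customer_id, None)  # the missing invalidation

    def read_score(self, customer_id):
        # Read-through: serve from the cache if present, otherwise hit the DB.
        if customer_id in self.cache:
            return self.cache[customer_id]
        score = self.db.get(customer_id)
        self.cache[customer_id] = score
        return score


store = CachedCustomerStore()
store.db["cust-1"] = 680
store.read_score("cust-1")         # warms the cache with 680

store.write_score("cust-1", 750)   # credit score agent writes 750 to the DB
time.sleep(0.5)
print(store.read_score("cust-1"))  # risk agent still reads 680 from the cache
```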
We had a caching layer between the agents and the database. Cache invalidation failed, and the agent was reading stale values. The race condition wasn't in the database, it was in the architecture: multiple agents, a shared cache, no coordination on cache invalidation. It took us quite a while to find the pattern. It created delays in delivery and led to wrong decisions. And here's the lesson we learned. The problem was, of course, not the model. The problem wasn't the prompts. The problem was that we built a distributed system without distributed systems thinking. And that's what kills multi-agent projects: not bad AI, but bad architecture.
Now, I will show you the architecture that works, and we will also look at a production-grade architecture. But first, let's understand why this complexity explodes so quickly. When you move from a one-agent system to a multi-agent system, let's say five agents, it doesn't just get five times harder. It gets 25 times more complex. Coordination complexity grows quadratically with the number of agents. One agent has zero coordination problems. Two agents have at least one connection. Five agents have at least 10 potential connections to coordinate (with n agents, that's n(n-1)/2 pairwise connections). Each connection is a failure point, a race condition, a state synchronization problem. You are not just building five agents, you are building a coordination problem across multiple relationships, with multiple possible failure modes. And that's why the complexity increases very, very quickly.
Now, I'm going to show you the critical patterns. The first is about how to coordinate multiple agents. Then we will talk about how you can manage state. And then we'll talk about how you can recover and design for failure. These patterns come from many years of distributed systems work, and they apply directly to multi-agent AI systems. Once you get the basics, it's really hard to miss these patterns when you build a multi-agent AI architecture.
The first decision you need to make is choreography or orchestration. These are the two fundamental patterns for distributed coordination. Choreography means agents coordinate through events; they are decentralized and autonomous. Orchestration means a central coordinator manages the workflow; this is centralized and controlled. Most teams pick one instinctively and regret it. Let me show you when to use each.
Let's start with choreography. Choreography is event-driven. The research agent finishes its research and publishes a research-completed event to a message bus. Each agent subscribes to that message bus and listens for the event types it is interested in. The analysis agent subscribes to that event type, picks it up, does its analysis, and publishes analysis-ready. Then the report agent picks up the analysis-ready event and generates the report. There is no central coordinator here. Each agent is autonomous, listening for events it cares about and publishing when it is done. This is the beauty of choreography. Agents are loosely coupled. It's easy to add new agents and have them subscribe to the events they're interested in. This drives high autonomy and scales really well. However, the nightmare of choreography is debugging. When something fails, you're playing detective with no real clue. Which agent failed to publish? Did the event get consumed? Did the event get consumed twice? You need bulletproof observability to make choreography work.
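To make the choreography pattern concrete, here is a minimal in-process sketch of the event flow described above. A real deployment would use a durable message bus; the EventBus class and agent functions here are simplified stand-ins, not any specific framework's API.

```python
# Minimal sketch of choreography: agents subscribe to event types on a bus
# and publish new events when they finish. No central coordinator exists.
from collections import defaultdict


class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.subscribers[event_type]:
            handler(payload)


bus = EventBus()

def research_agent(topic):
    findings = f"findings about {topic}"            # stand-in for real LLM work
    bus.publish("research_completed", {"findings": findings})

def analysis_agent(event):
    analysis = f"analysis of: {event['findings']}"
    bus.publish("analysis_ready", {"analysis": analysis})

def report_agent(event):
    print(f"report generated from {event['analysis']}")

# Each agent only knows the event types it cares about.
bus.subscribe("research_completed", analysis_agent)
bus.subscribe("analysis_ready", report_agent)

research_agent("multi-agent orchestration")
```

Even in this toy version you can see the trade-off: adding a new agent is just another subscribe call, but nothing in the code shows you the overall flow, which is exactly why strong observability is non-negotiable.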
Even with event propagation, you need strong delivery guarantees for these events. Without them, debugging is really hard. So, when should you use choreography? You use choreography when your workflow is naturally event-driven, when agents need to operate independently, and when you are adding agents frequently and don't want to keep updating a central coordinator. But it is important to understand that this works only if you have strong observability. If you can't trace events through your system, choreography will destroy you. I have seen teams choose choreography because it feels more agentic, more autonomous. Then they spend months firefighting because they can't debug distributed event flows. Don't make that mistake.
Now, let's look at the alternative: orchestration. Orchestration is centralized. You have a workflow orchestrator that calls each agent directly. Agent A runs first: the orchestrator calls Agent A, waits for the result, and gets the result back. Then the orchestrator calls Agents B and C in parallel, if they need to run in parallel. The orchestrator manages the parallelism, not the agents. B and C return their results to the orchestrator. Then the orchestrator calls Agent D with the combined results from B and C. Every call goes through the orchestrator. Agents never call each other. The orchestrator is the single source of truth. It knows the entire execution graph. It manages state. It handles retries. It logs every step. Agents are dumb: they take the input, do the work, and return the output. The orchestrator does all the smart coordination. In Databricks, one way to implement this pattern would be with LangGraph wired into the Mosaic AI Agent Framework as the orchestrator. But any workflow engine that gives you DAGs, directed acyclic graphs, and proper retry mechanisms would fit this kind of orchestrator pattern.
You use orchestration when you have complex dependencies that need central management, when you need to roll back and compensate for failures, when you want one dashboard showing the entire system state, and when your workflow is relatively stable. In financial services, for example, we use orchestration almost exclusively. Why? Because it provides easy debugging and the ability to roll back, and that matters more than autonomy in these kinds of industries. When something goes wrong with a credit decision, for example, we need to know exactly which agent made that call, in what order, and with what data. Orchestration gives us that. Choreography doesn't.
So, how do you choose? Here's your decision framework. Two axes: workflow complexity, from simple to complex, and autonomy requirements, from low to high. Simple workflow, high autonomy: go with choreography. Complex workflow, low autonomy tolerance: go with orchestration. The interesting quadrant is the top right, where you have a complex workflow but agents need autonomy. This is where you use hybrid patterns: choreography with saga patterns for compensation. I'll talk about that pattern later in this session as well. Tools like Agent Bricks on Databricks are starting to package these orchestration patterns for common multi-agent use cases, so you don't need to rebuild them every time. That makes building these patterns really easy in production environments. Now, I use this decision matrix every time I make decisions with customers based on their use cases. It's worth taking a screenshot; I'm sure you'll reference it.
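Before moving on, here is a minimal sketch of the orchestration pattern itself, independent of LangGraph or Databricks. The agent functions are illustrative stand-ins, and a real orchestrator would add retries, state versioning, and logging around each call.

```python
# Minimal sketch of centralized orchestration: the orchestrator owns the
# execution graph, calls agents directly, and fans B and C out in parallel.
from concurrent.futures import ThreadPoolExecutor


def agent_a(request):
    return {"a": f"A processed {request}"}

def agent_b(upstream):
    return {"b": f"B used {upstream['a']}"}

def agent_c(upstream):
    return {"c": f"C used {upstream['a']}"}

def agent_d(combined):
    return {"d": f"D combined {combined['b']} and {combined['c']}"}


def orchestrate(request):
    # Step 1: Agent A runs first; the orchestrator waits for its result.
    result_a = agent_a(request)

    # Step 2: B and C have no dependency on each other, so the orchestrator
    # (not the agents) runs them in parallel and gathers both results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_b = pool.submit(agent_b, result_a)
        future_c = pool.submit(agent_c, result_a)
        combined = {**future_b.result(), **future_c.result()}

    # Step 3: Agent D gets the combined results. Agents never call each other;
    # every hop goes through this function, which can log, retry, and roll back.
    return agent_d(combined)


print(orchestrate("loan application 42"))
```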
Let me show you what a production orchestration actually looks like toward the end of the session. All right. Now we have chosen a coordination pattern. Let's talk about the thing that actually breaks when you scale: state. How do agents share data without race conditions? Without stale reads? Without mystery bugs? Here's what most people do first, and it's wrong: shared mutable state. Multiple agents writing to the same database records at the same time. Agent A reads the credit score, calculates a value, and writes it back. Agent B does the same thing at the same time. Both read 680. Agent A writes 750. Agent B writes 720. Last write wins. Agent A's update disappears. Lost update. I understand, yes, modern databases have protections in place: row locks, isolation levels, and so on. But you have to use them correctly: explicit transactions, serializable isolation, SELECT FOR UPDATE. And many teams don't. They use default isolation, they don't use explicit locks, and they ship race conditions to production. We made that mistake, and it delayed value to the business. We just assumed the database would handle these conditions, but it doesn't. When things get really complex, you have to handle them explicitly in the code.
Now, here's what works: immutable state snapshots with versioning. Agent A produces a state version, let's say version one. It's sealed. It's immutable. Nobody can modify it. State is stored in the orchestrator database as an append-only log. These are insert operations, not updates. Agent A hands state version one to Agent B. Agent B validates the schema, checks that the data contract matches its expectations, processes it, and produces state version two, also immutable. Agent B inserts version two as a new row; it doesn't update version one. Then it hands it to Agent C. Same thing: schema validation, version tracking, and an immutability guarantee at each handoff. Now, if Agent C fails, you roll back to version two. If you need to debug, you replay the state evolution from version one through version N. You can see exactly what each agent received and produced. This eliminates race conditions. There is no concurrent modification of the same record; each agent appends a new version instead of updating shared state. And of course, if you want to save these state snapshots, they can be logged in any sort of append-only storage for audit and replay, but they are never used as shared mutable state.
Here's how it looks in code. There is an agent state class; frozen means immutable in Python. It has a version number, the data payload, and who created it. The handoff function does three things. First, it validates the schema. This is the contract enforcement: we are checking that Agent A's output matches Agent B's input contract. This is critical, and we will come back to it. Second, it increments the version, creating a new immutable state object with version N plus one. Third, it executes the next agent with that immutable state. The agent can't modify the input state; it can only produce a new state. This prevents an entire class of bugs. It prevents race conditions on shared state. No stale reads. It provides clear lineage: every state has a version, and you know who created it. When something goes wrong, you can trace back through the state evolution. Version seven produced bad output? Look at version six that went into the agent. Look at version five before that.
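Here is a minimal sketch of that state class and handoff function, reconstructed from the description above. The field names, the validate_schema helper, and the way the contract is represented are illustrative, not the exact production code.

```python
# Sketch of immutable, versioned state with a three-step handoff:
# validate the contract, run the next agent, append a new state version.
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)          # frozen = immutable: agents cannot mutate it
class AgentState:
    version: int                 # monotonically increasing version number
    data: Mapping[str, Any]      # payload produced by the previous agent
    created_by: str              # which agent produced this version


def validate_schema(state: AgentState, required_fields: set[str]) -> None:
    # Contract enforcement at the boundary: the producer's output must match
    # the consumer's declared input contract.
    missing = required_fields - set(state.data)
    if missing:
        raise ValueError(f"contract violation from {state.created_by}: missing {missing}")


def handoff(state: AgentState,
            next_agent: Callable[[AgentState], Mapping[str, Any]],
            next_agent_name: str,
            required_fields: set[str]) -> AgentState:
    # 1. Validate the contract before the next agent ever sees the data.
    validate_schema(state, required_fields)
    # 2. Run the next agent; it can only read the input state.
    output = next_agent(state)
    # 3. Produce a brand-new immutable state with version N+1
    #    (an append, never an update of the previous version).
    return AgentState(version=state.version + 1, data=output, created_by=next_agent_name)
```

Each AgentState returned by the handoff can then be appended as a new row in the orchestrator's state store, which is what gives you the version-by-version lineage.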
You can binary search through your state history to find where things went wrong, and this becomes really, really powerful. Now, state management is half the battle. Data contracts are the other half. Agent A can't just throw arbitrary data at Agent B and hope it works. It doesn't work that way. They need a contract in place. In this example, the research agent promises to output findings, a confidence score, sources, a timestamp, and so on. The analysis agent declares that it requires the research agent's output, with those fields and types, and it validates it: if the confidence is below 0.7, it rejects the handoff. This is the contract. If the research agent tries to hand off low-quality data, the contract catches it at the boundary. You find out immediately, not three agents downstream when it produces a garbage report. When we work with our customers on Databricks, one way of doing this is registering these input-output schemas in Unity Catalog, so every agent's contract is versioned and governed in one place.
All right. We talked about coordination patterns. We talked about state management. Now let's talk about another thing you need to keep in mind, and that's failure and recovery. The reason this is important is that agents will fail. That's inevitable. The LLM will time out. The API will rate limit you. The agent will crash mid-workflow. What happens then? What happens then is what you need to plan for and design into the system. Let's talk about a few patterns. The first is the circuit breaker pattern, and this comes straight from distributed systems. When Agent A calls Agent B, it wraps that call in a circuit breaker. If Agent B fails repeatedly, say five times in a row, the circuit breaker opens. Now, instead of waiting for a timeout every single time, you fail fast. Circuit open, Agent B is down, you just try again later. You are not bombarding Agent B with requests; you're protecting your system. After a timeout period, let's say 60 seconds, the circuit goes half-open. Then you test Agent B again with one request. If it succeeds, the circuit closes and normal operation resumes. If it fails, the circuit opens again and the timer resets. This prevents cascading failures in the system. One agent going down doesn't bring your entire workflow down. You degrade gracefully. Maybe you skip that agent and continue with reduced functionality. Maybe you use cached results. Maybe you alert a human. But you don't crash the entire workflow. Circuit breakers are the single most important failure recovery pattern for multi-agent systems. Every agent call should be wrapped with one. We enforce these circuit breaker policies at the serving layer on Databricks, through Model Serving or through AI Gateway.
Here's how it looks in code. You track the failure count, and you track the state. When you call an agent, you check the state first. If it is open, you fail fast; you don't even try. If it is closed, you make the call. If the call succeeds, you reset the failure count and stay closed. If it fails, you increment the failure count. If you hit the threshold, you open the circuit. After the timeout period, you transition to half-open and test one request. If it succeeds, you close the circuit. If it fails, you open it again. This is a simple pattern, but it has a massive impact.
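Here is a minimal sketch of that circuit breaker, following the description above. The thresholds, state names, and error handling are illustrative, and a production version would also need thread safety plus the logging and metrics mentioned in a moment.

```python
# Minimal circuit breaker: fail fast while open, probe once when half-open,
# reset to closed on success.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.state = "closed"          # closed | open | half_open
        self.opened_at = None

    def call(self, agent_fn, *args, **kwargs):
        if self.state == "open":
            # After the timeout period, allow a single probe request.
            if time.time() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"
            else:
                raise RuntimeError("circuit open: failing fast, agent is unavailable")

        try:
            result = agent_fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            # A failed probe, or hitting the threshold, opens the circuit.
            if self.state == "half_open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.time()
            raise

        # Success: reset the failure count and close the circuit.
        self.failure_count = 0
        self.state = "closed"
        return result
```

Wrapping every inter-agent call in an instance like this (for example, risk_breaker.call(risk_agent, application)) is what turns one flaky agent into a degraded feature instead of a dead workflow.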
And in Databricks, you can log every open-closed transition in MLflow, so you can see when an agent started flaking out. Now, let's talk about another pattern. We call it the compensation pattern, also called the saga pattern. Every agent has two methods: execute and compensate. Execute does the work. Compensate rolls it back, undoes it. The orchestrator tracks which agents have executed. If execution fails partway through, the orchestrator walks backward through the executed agents and calls compensate on each one. The analysis agent compensates by deleting the draft recommendation it originally wrote. Then the research agent compensates by clearing the cached research data it previously gathered. So you're back to the initial state. No partial transactions. No stuck workflows. This is a simple rollback pattern that you can implement in a multi-agent system. Compensation gives distributed agents a way to roll back. It is not sexy, but it's how production systems handle partial failures. Every orchestrated workflow needs this kind of compensation pattern, and you need to plan for it depending on what your workflows are doing.
Here's how compensation looks in code. Every agent, as I mentioned earlier, has two methods: execute and compensate. Execute does the work, compensate undoes it. That's the contract: every operation must be reversible. The orchestrator tracks which agents have run successfully and keeps a list. Agent A executes and gets added. Agent B executes and gets added. Agent C fails; now we walk backward through the list in reverse order. Agent B compensates first, undoing the work it has done. Agent A compensates next, undoing the work that Agent A has done, and we're back to the initial state. This is the saga pattern from distributed databases. Financial services requires this.
Now that we have covered these patterns, I want to show you what a production architecture looks like when you bring them together. You've got the orchestrator on the left-hand side. It's the brain of the workflow. It contains the workflow engine, it contains the state store holding versions zero through N, and it handles the observability data. Every call goes through the orchestrator. The orchestrator calls Agent A, and Agent A returns state version one to the orchestrator. The orchestrator then calls Agents B and C in parallel, if they need to run in parallel. Both receive state version one from the orchestrator. They return results, and the orchestrator stores them as versions two and three. Finally, the orchestrator calls Agent D with the combined results. Agents never call each other. All coordination happens through the orchestrator. And this is what gives us control, observability, and the capability to roll back. This runs 24/7 across billions of transactions because the orchestrator is the single source of truth.
All right, here's a production architecture that you could implement with the Databricks Data Intelligence Platform. In the orchestration layer, you can have LangGraph wired into the Mosaic AI Agent Framework. It handles multi-agent orchestration; it manages the workflow graph and knows which agents to call in what order. Each agent is implemented as a Unity Catalog function, written in SQL or Python, or as a model registered in Unity Catalog.
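Before getting into the Databricks specifics, here is a minimal sketch of the compensation (saga) logic described a moment ago. The Agent class and run_saga function are illustrative stand-ins, not the exact production orchestrator.

```python
# Saga sketch: every agent exposes execute() and compensate(), and the
# orchestrator rolls back the executed agents in reverse order on failure.
class Agent:
    def __init__(self, name):
        self.name = name

    def execute(self, state):
        # Do the work and return the updated state (stand-in for real agent logic).
        print(f"{self.name}: execute")
        return state

    def compensate(self, state):
        # Undo whatever execute() did (delete drafts, clear caches, and so on).
        print(f"{self.name}: compensate")


def run_saga(agents, state):
    executed = []                      # agents that completed successfully
    try:
        for agent in agents:
            state = agent.execute(state)
            executed.append(agent)
    except Exception:
        # Walk backward through the successful agents and undo their work,
        # so the system returns to its initial state.
        for agent in reversed(executed):
            agent.compensate(state)
        raise
    return state
```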
When you register these agents as functions or models in Unity Catalog, they become centrally discoverable within the organization, they can be governed in one place, and they can be versioned, which is really critical for operating these workflows in production. We expose these agents through Databricks Model Serving or Function Serving, and that's where we enforce the circuit-breaker-style policies, such as retries, timeouts, and rate limits, at the serving layer, typically via AI Gateway configuration. Now, for the data layer, Delta Lake stores everything. It not only stores the state versions from the agents, it also stores customer data and all the data your workflows need to work. As for the state snapshots, Delta tables are versioned, and for us those state versions are just rows in a Delta table; we never update them in place. Each agent run is tied to a state version via MLflow Traces, so we can step through the evolution when something breaks. I also want to touch on Unity Catalog: it governs everything, access control, lineage, and the audit trail for both data and agents. MLflow gives us per-agent tracing and evaluation capabilities, with out-of-the-box LLM-as-judge metrics on every call. And as I mentioned earlier, tools like Agent Bricks are the higher-level way Databricks packages these orchestration patterns for common multi-agent use cases, so you don't need to rebuild them every time.
So, just to wrap up this workflow: the LangGraph orchestrator calls Agent A, a Unity Catalog function or model. It gets a result and writes the version-one state to Delta. It then calls Agent B with state version one, writes version two, and so on. MLflow traces every call: latency, inputs, outputs, token usage. A circuit breaker at the serving layer guards each call. If Agent C fails, LangGraph triggers the compensation logic and walks backward, calling the compensate functions for the previous successful steps. These kinds of patterns run in production day in and day out.
So thank you for hearing me out. You can reach out to me over LinkedIn; you can scan this QR code, and it will take you directly to my LinkedIn profile. I would like to leave you with three final thoughts. First, agent chaos is inevitable. When you scale past one agent, you will hit coordination problems, race conditions, and cascading failures. That's guaranteed. The complexity curve doesn't lie. Second, your agent choreography is a choice. You can build systems with proper patterns: orchestration, choreography, immutable state, circuit breakers, compensation patterns, data contracts. Make sure you understand these patterns and bring them to your production architecture. Third, doing so will help you build systems, not demos. Demos are easy: you use an LLM to show something cool, and everyone can do it. Those things don't work in production. In production, you have to build systems, and systems are hard. Systems are what create value for businesses. Everything I showed you today, choreography versus orchestration, immutable state, circuit breakers, is unsexy infrastructure work. You won't get applause for implementing a circuit breaker, but you make your systems more reliable. They don't fail at 2:00 a.m. That is what people notice over time. Be a systems engineer. The patterns here work. Apply them in your production architecture. Thank you very much for watching. Bye.