From Chaos to Choreography: Multi-Agent Orchestration Patterns That Actually Work — Sandipan Bhaumik

Channel: aiDotEngineer

Published at: 2026-04-08

YouTube video id: 2czYyrTzILg

Source: https://www.youtube.com/watch?v=2czYyrTzILg

Hi everyone, I'm Sandy. I have spent 18 years building data systems, a major part of it focused on building and scaling distributed data systems in the cloud. I've done it for multi-tenant systems at software and SaaS companies, and then for scaling data and AI platforms in regulated industries like financial services and healthcare. I've learned a great deal about production-grade distributed systems while working at AWS and now at Databricks. For the last two years, I've been deploying multi-agent AI systems in production, and I have watched brilliant engineers make the same mistakes over and over. They think adding more agents is just like adding more features. It's not. It's building a distributed system. Today, I'm going to show you the patterns that actually work when you make that transition. These are lessons I have learned working in the trenches, and today I'm here to share them with you.

Here's what we're covering today. First, the problem. I'll share a production war story about race conditions and why complexity explodes when you go from one agent to five agents.
I'll talk about the patterns, choreography and orchestration, for coordinating agents. I'll talk about state management, and about failure recovery and how we can design for failure in production systems. Then I'll share what a production-grade architecture looks like, in as simple a way as possible, and I'll also show you an example of how we build this on Databricks. So, let's dive into it.

You see, one agent works beautifully. You have got your LLM, some prompts, maybe a retrieval-augmented generation pipeline, maybe some tool calls. It demos great. Leadership loves it. You feel happy and your team is happy. And then product comes back with a request that changes everything. They want five more agents. And here's what happens. You think, "Okay, I know how to build agents, and I will add five more." Except now you have coordination problems. Agent A produces data that Agent B needs. Agent C is waiting on both Agent A and Agent B. Agent D just updated the shared state that Agent B was reading, and Agent E just crashed and took down the entire workflow.
This is no longer an AI problem. This is a distributed systems problem. And most of you didn't sign up to be distributed systems engineers. Let me tell you about a production deployment where this went very wrong. We built a credit decisioning system for a financial services company. The first agent, credit score calculation, worked perfectly. It worked great in demos, two weeks in production, zero issues. Then we added four more agents: income verification, risk assessment, fraud detection, and final approval.
We deployed all five. Within three days, we started seeing weird approvals. Twenty percent of the decisions had incorrect risk ratings. Customers who should have been flagged were getting approved. The business team was panicking. It took us two days to find out what was happening. The credit score agent calculated a score of 750 and wrote it to the database. The risk assessment agent, on the other hand, read from the database 500 milliseconds later and got a score of 680 for the same customer. Why did it happen? Because we had a caching layer for customer records. The write to PostgreSQL succeeded, but the cache was not invalidated. The risk agent read from the cache and got stale data. It used the wrong score and made the wrong decision.

This is a classic distributed systems problem. We had a caching layer between the agents and the database. Cache invalidation failed, and the agent was reading stale values. The race condition wasn't in the database, it was in the architecture: multiple agents, shared cache, no coordination on cache invalidation. It took us quite a while to find the pattern. It created delays in delivery and led to wrong decisions.
And here's the lesson we learned. The problem was, of course, not with the model. The problem wasn't with the prompts. The problem was we built a distributed system without distributed systems thinking. And that's what kills multi-agent projects: not bad AI, but bad architecture. Now, I will show you the architecture that works. We will also look at a production-grade architecture. But first, let's understand why this complexity explodes so quickly.

When you move from a one-agent system to a multi-agent system, let's say five agents, it doesn't just get five times harder. It gets roughly 25 times more complex. Coordination complexity grows quadratically, not linearly. One agent has zero coordination problems. Two agents have at least one connection. Five agents have at least 10 potential connections to coordinate. Each connection is a failure point, a race condition, a state synchronization problem. You are not just building five agents, you are building a coordination problem across multiple relationships, each with its own possible failure modes. And that's why the complexity increases very, very quickly.
Now, I'm going to show you the critical patterns. The first is about how to coordinate multiple agents. Then we will talk about how you can manage state, and then about how you can recover and design for failure. These patterns come from many years of distributed systems work, and they apply directly to multi-agent AI systems. Once you get the basics, it's really hard to miss these patterns when you build a multi-agent AI architecture.

The first decision you need to make is choreography or orchestration. These are the two fundamental patterns for distributed coordination. Choreography means agents coordinate through events; they are decentralized and autonomous. Orchestration means a central coordinator manages the workflow; this is centralized and controlled. Most teams pick one instinctively and regret it. Let me show you when to use each.
Let's start with choreography. Choreography is event-driven. The research agent finishes its research and publishes a "research completed" event to a message bus. Other agents subscribe to that message bus and listen for the event types they are interested in. The analysis agent subscribes to that event type, picks it up, does its analysis, and publishes "analysis ready." Then the report agent picks up that "analysis ready" event and generates the report. There is no central coordinator here. Each agent is autonomous, listening for events it cares about and publishing when it is done.

This is the beauty of choreography. Agents are loosely coupled. It's easy to add new agents and have them subscribe to the events they're interested in. This drives high autonomy and scales really well. However, the nightmare of choreography is debugging. When something fails, you're playing detective with no real clue. Which agent failed to publish? Did the event get consumed? Did the event get consumed twice? You need bulletproof observability to make choreography work. Even with event propagation in place, you need strong delivery guarantees for these events. Without them, debugging is really hard.
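To make that flow concrete, here is a minimal sketch of choreography, not taken from the talk's slides: an in-process EventBus stands in for a real message broker, and the agent bodies are stubs.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process stand-in for a message bus (Kafka, SNS/SQS, etc.)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()

def research_agent(topic):
    findings = {"topic": topic, "findings": ["..."]}   # do the research
    bus.publish("research.completed", findings)        # announce completion

def analysis_agent(research):
    analysis = {"summary": f"analysis of {research['topic']}"}
    bus.publish("analysis.ready", analysis)            # hand off downstream

def report_agent(analysis):
    print(f"report generated from: {analysis['summary']}")

# Each agent only knows the events it cares about; nobody coordinates them.
bus.subscribe("research.completed", analysis_agent)
bus.subscribe("analysis.ready", report_agent)
research_agent("credit risk trends")
```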
So, when should you use choreography? You use choreography when your workflow is naturally event-driven, when agents need to operate independently, and when you are adding agents frequently and don't want to update a central coordinator. But it is important to understand that this only works if you have strong observability. If you can't trace events through your system, choreography will destroy you. I have seen teams choose choreography because it feels more agentic, more autonomous. Then they spend months firefighting because they can't debug distributed event flows. Don't make that mistake.
Now, let's look at the alternative: orchestration. Orchestration is centralized. You have a workflow orchestrator that calls each agent directly. Agent A runs first. The orchestrator calls Agent A, waits for the result, and gets the result back. Then the orchestrator calls Agents B and C in parallel, if they are agents that need to run in parallel. The orchestrator manages the parallelism, not the agents. B and C return their results to the orchestrator. Then the orchestrator calls Agent D with the combined results from B and C. Every call goes through the orchestrator. Agents never call each other. The orchestrator is the single source of truth. It knows the entire execution graph. It manages state. It handles retries. It logs every step. Agents are dumb: they just take the input, do the work, and return the output. The orchestrator does all the smart coordination. In Databricks, one way to implement this pattern would be with LangGraph wired into the Mosaic AI Agent Framework as the orchestrator. But any workflow engine that gives you DAGs, directed acyclic graphs, and proper retry mechanisms would fit this kind of orchestrator pattern.
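Here is a deliberately simple sketch of that control flow in plain Python. It is not from the talk; the agent functions are stubs, and a real orchestrator would add retries, logging, and state versioning around each call.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub agents: each takes a dict of inputs and returns a dict of outputs.
def agent_a(inp): return {"a": "research result"}
def agent_b(inp): return {"b": f"analysis of {inp['a']}"}
def agent_c(inp): return {"c": f"fraud check of {inp['a']}"}
def agent_d(inp): return {"report": f"{inp['b']} + {inp['c']}"}

def orchestrate(request):
    # Step 1: Agent A runs first; the orchestrator waits for its result.
    result_a = agent_a(request)

    # Step 2: B and C depend only on A, so the orchestrator runs them in
    # parallel. The agents themselves know nothing about parallelism.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_b = pool.submit(agent_b, result_a)
        future_c = pool.submit(agent_c, result_a)
        result_b, result_c = future_b.result(), future_c.result()

    # Step 3: D gets the combined results. Agents never call each other;
    # every hop goes through this function, which can log and retry.
    return agent_d({**result_a, **result_b, **result_c})

print(orchestrate({"customer_id": 42}))
```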
You use orchestration when you have complex dependencies that need central management, when you need to roll back and compensate for failures, when you want one dashboard showing the entire system state, and when your workflow is relatively stable. In financial services, for example, we use orchestration almost exclusively. Why? Because it provides easy debugging and the ability to roll back, and that matters more than autonomy in these kinds of industries. When something goes wrong with a credit decision, for example, we need to know exactly which agent made that call, in what order, and with what data. Orchestration gives us that. Choreography doesn't.
So, how do you choose? Here's your decision framework. Two axes: workflow complexity, simple to complex, and autonomy requirements, low to high. Simple workflow, high autonomy: you go with choreography. Complex workflow with low autonomy tolerance: you go with orchestration. The interesting quadrant is the top right, where you have a complex workflow but agents need autonomy. This is where you use hybrid patterns: choreography with saga patterns for compensation. I'll talk about that pattern later in this session as well. Tools like Agent Bricks on Databricks are starting to package these orchestration patterns for common multi-agent use cases, so you don't need to rebuild them every time. That makes building these patterns really easy in production environments. I use this decision matrix every time to make decisions with customers based on their use cases. It's worth taking a screenshot; I'm sure you'll reference it. I'll show you what a production orchestration actually looks like at the tail end of the session.
All right. Now that we have chosen a coordination pattern, let's talk about the thing that actually breaks when you scale: state. How do agents share data without race conditions? Without stale reads? Without mystery bugs? Here's what most people do first, and it's wrong: shared mutable state. Multiple agents writing to the same database records at the same time. Agent A reads the credit score, calculates the value, writes it back. Agent B does the same thing at the same time. Both read 680. Agent A writes 750. Agent B writes 720. Last write wins. Agent A's update disappears. Lost update. I understand, yes, modern databases have protections in place: row locks, isolation levels, and so on. But you have to use them correctly. Explicit transactions, serializable isolation, SELECT FOR UPDATE. And many teams don't. They use default isolation, they don't use explicit locks, and they ship race conditions to production. We made that mistake, and it resulted in delayed value to the business. We just assumed that the database would handle these conditions, but it doesn't. When it gets really complex, you have to handle them explicitly in the code.
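If you do stay on shared mutable state, the fix being hinted at looks roughly like this. This is my own illustrative sketch, not code from the talk: the table, column, and connection details are made up, and it shows an explicit transaction with a row lock in PostgreSQL via psycopg2.

```python
import psycopg2

# Hypothetical connection string and schema, purely for illustration.
conn = psycopg2.connect("dbname=credit user=agent")

def update_credit_score(customer_id, new_score):
    # One explicit transaction: the row lock taken by FOR UPDATE blocks a
    # concurrent agent from reading and rewriting the same record until we
    # commit, which prevents the lost update described above.
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT score FROM credit_scores WHERE customer_id = %s FOR UPDATE",
                (customer_id,),
            )
            previous = cur.fetchone()[0]   # stable for the rest of this transaction
            cur.execute(
                "UPDATE credit_scores SET score = %s WHERE customer_id = %s",
                (new_score, customer_id),
            )
    return previous
```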
Now, here's what works: immutable state snapshots with versioning. Agent A produces a state version, let's say version one. It's sealed. It's immutable. Nobody can modify it. State is stored in the orchestrator's database as an append-only log. These are insert operations, not updates. Agent A hands state version one to Agent B. Agent B validates the schema and checks that the data contract matches its expectations. It processes it and produces state version two, also immutable. Agent B inserts version two as a new row; it doesn't update version one. Then it hands it to Agent C. Same thing: schema validation, version tracking, an immutability guarantee at each handoff. Now, if Agent C fails, you roll back to version two. If you need to debug, you replay state evolution from version one through version N. You can see exactly what each agent received and produced. This eliminates race conditions. There is no concurrent modification of the same record; each agent appends a new version instead of updating shared state. And of course, these state snapshots can be logged in any sort of append-only storage for audit and replay, but they are never exposed as shared read/write state.
Now, here's how it looks in code. The agent state class is frozen, which means immutable in Python. It has a version number, the data payload, and who created it. The handoff function does three things. First, it validates the schema. This is the contract enforcement: we are checking that Agent A's output matches Agent B's input contract. This is critical, and we will come back to it. Second, it increments the version, creating a new immutable state object with version N plus one. Third, it executes the next agent with that immutable state. The agent can't modify the input state. It can only produce a new state.
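The slide code itself isn't in the transcript, so here is a plausible reconstruction of that idea. The field names, the expected_keys check standing in for a real schema validator, and the exact ordering of steps are my assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)              # frozen = immutable: fields can't be reassigned
class AgentState:
    version: int                     # monotonically increasing version number
    data: dict[str, Any]             # payload produced by the previous agent
    created_by: str                  # which agent produced this version

def handoff(state: AgentState,
            next_agent: Callable[[AgentState], dict],
            expected_keys: set[str],
            agent_name: str) -> AgentState:
    # 1. Validate the schema: contract enforcement at the boundary.
    missing = expected_keys - state.data.keys()
    if missing:
        raise ValueError(f"contract violation, missing fields: {missing}")

    # 2. Run the next agent; it can read the state but never mutate it.
    output = next_agent(state)

    # 3. Produce a brand-new immutable version N + 1 instead of updating in place.
    return AgentState(version=state.version + 1, data=output, created_by=agent_name)
```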
This prevents an entire class of bugs. It prevents race conditions on shared state. No stale reads. It provides clear lineage: every state has a version, and you know who created it. When something goes wrong, you can trace back through the state evolution. Version seven produced bad output? Look at version six, which went into that agent. Look at version five before that. You can binary search through your state history to find where things went wrong. And this becomes really, really powerful.
Now, state management is half the battle. Data contracts are the other half. Agent A can't just throw arbitrary data at Agent B and hope it works. It doesn't work that way. They need a contract in place. In this example, the research agent promises to output findings, a confidence score, sources, a timestamp, and so on. The analysis agent declares that it requires research agent output, with the types it expects, and it validates it: if confidence is below 0.7, it will reject the handoff. This is the contract. If the research agent tries to hand off low-quality data, the contract catches it at the boundary. You find out immediately, not three agents downstream when the report comes out as garbage.
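Here is an illustrative version of that contract, again my own sketch rather than the talk's slide code; the field names mirror the example above, and the 0.7 threshold is the one mentioned in the talk.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ResearchOutput:
    findings: list[str]
    confidence: float
    sources: list[str]
    timestamp: datetime

def validate_research_handoff(payload: object) -> ResearchOutput:
    # Type check: the analysis agent only accepts ResearchOutput.
    if not isinstance(payload, ResearchOutput):
        raise TypeError("analysis agent requires ResearchOutput")
    # Quality gate: reject low-confidence research at the boundary,
    # instead of discovering it three agents downstream.
    if payload.confidence < 0.7:
        raise ValueError(
            f"confidence {payload.confidence} below contract minimum 0.7"
        )
    return payload
```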
When we work with our customers using Databricks, one way of doing this is registering these input and output schemas in Unity Catalog, so every agent's contract is versioned and governed in one place.
All right. We talked about coordination patterns. We talked about state management. Now let's talk about another thing that you need to keep in mind, and that's failure and recovery. The reason this is important is that agents will fail. That's inevitable. The LLM will time out. The API will rate-limit you. The agent will crash mid-workflow. What happens then? What happens then is what you need to plan for and design into the system.

Let's talk about a few patterns. The first is the circuit breaker pattern, and this comes straight from distributed systems. When Agent A calls Agent B, it wraps that call in a circuit breaker. If Agent B fails repeatedly, say five times in a row, the circuit breaker opens. Now, instead of waiting for a timeout every single time, you fail fast. Circuit open, Agent B is down, you just try again later. You are not bombarding Agent B with requests; you're protecting your system. After a timeout period, let's say 60 seconds, the circuit goes half-open. Then you test Agent B again with one request. If it succeeds, the circuit closes and normal operation resumes. If it fails, the circuit opens again and the timer resets. This prevents cascading failures in the system. One agent going down doesn't bring your entire workflow down. You degrade gracefully. Maybe you skip that agent and continue with reduced functionality. Maybe you use cached results. Maybe you alert a human. But you don't crash the entire workflow. Circuit breakers are the single most important failure recovery pattern for multi-agent systems. Every agent call should be wrapped in one. We enforce these circuit breaker policies at the serving layer on Databricks, through Model Serving or through AI Gateway.
Here's how it looks in code. You track the failure count, and you track the state. When you call an agent, you check the state first. If it is open, you fail fast; you don't even try. If it is closed, you make the call. If the call succeeds, you reset the failure count and stay closed. If it fails, you increment the failure count. If you hit the threshold, you open the circuit. After the timeout period, you transition to half-open and test one request. If it succeeds, you close the circuit. If it fails, you open it again. This is a simple pattern, but it has a massive impact.
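The slide code isn't captured in the transcript, so here is a minimal sketch of that state machine, using the same five-failure threshold and 60-second timeout as defaults; the class and method names are my own.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.state = "closed"        # closed -> open -> half_open -> closed
        self.opened_at = 0.0

    def call(self, agent_fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"     # allow exactly one probe request
            else:
                raise RuntimeError("circuit open: failing fast, try again later")
        try:
            result = agent_fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.state == "half_open" or self.failure_count >= self.failure_threshold:
                self.state = "open"          # trip (or re-trip) the breaker
                self.opened_at = time.time()
            raise
        self.failure_count = 0               # success: reset and close
        self.state = "closed"
        return result
```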
And in Databricks, you can log every open/closed transition in MLflow, so you can see when an agent started flaking out.

Now, let's talk about another pattern. We call it the compensation pattern, also called the saga pattern. Every agent has two methods, execute and compensate. Execute does the work. Compensate rolls it back, undoes it. The orchestrator tracks which agents have executed. If an agent's execution fails, the orchestrator walks backward through the executed agents and calls compensate for each one. The analysis agent compensates by deleting the draft recommendation it originally wrote. Then the research agent compensates by clearing the cached research data it gathered earlier. So you're back to the initial state. No partial transactions. No stuck workflows. This is a simple rollback pattern that you can implement in a multi-agent system. Compensation gives distributed agents a way to roll back. It is not sexy, but it's how production systems handle partial failures. Every orchestrated workflow needs this kind of compensation pattern, and you need to plan for it depending on what you're doing with your workflows.
Here's how compensation looks in code. Every agent, as I mentioned earlier, has two methods, the execute method and the compensate method. Execute does the work, compensate undoes it. That's the contract: every operation must be reversible. The orchestrator tracks which agents have run successfully and keeps a list. Agent A executes and gets added. Agent B executes and gets added. Agent C fails; now we walk backward through the list in reverse order. Agent B compensates first and undoes the work it has done. Agent A compensates next and undoes its work, and you are back to the initial state. This is the saga pattern from distributed databases. Financial services requires this.
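As with the other snippets, the exact slide code isn't in the transcript; here is an illustrative sketch of the execute/compensate contract and the orchestrator's backward walk, with made-up agent classes.

```python
class Agent:
    def execute(self, state: dict) -> dict:
        raise NotImplementedError
    def compensate(self, state: dict) -> None:
        raise NotImplementedError    # must undo whatever execute did

def run_saga(agents: list[Agent], state: dict) -> dict:
    completed: list[Agent] = []      # agents that executed successfully, in order
    try:
        for agent in agents:
            state = agent.execute(state)
            completed.append(agent)
        return state
    except Exception:
        # Walk backward through the completed agents and undo their work,
        # so a mid-workflow failure leaves no partial transaction behind.
        for agent in reversed(completed):
            agent.compensate(state)
        raise
```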
Now that we have covered these different patterns, I want to show you what a production architecture looks like when you bring these things together. You've got the orchestrator on the left-hand side. It's the brain of the workflow. It contains the workflow engine, it contains the state store holding versions zero through N, and it handles the observability data. Every call goes through the orchestrator. The orchestrator calls Agent A; Agent A returns state version one to the orchestrator. The orchestrator then calls Agents B and C in parallel, if they need to run in parallel. Both receive state version one from the orchestrator. They return results, and the orchestrator stores them as versions two and three. Finally, the orchestrator calls Agent D with those combined results. Agents never call each other. All coordination happens through the orchestrator. This is what gives us control, observability, and the ability to roll back. This runs 24/7 across billions of transactions because the orchestrator is the single source of truth.
All right, here's a production architecture that you could implement with the Databricks Data Intelligence Platform. In the orchestration layer, you can have LangGraph wired into the Mosaic AI Agent Framework. It handles the multi-agent orchestration: it manages the workflow graph and knows which agents to call in what order. Each agent is implemented as a Unity Catalog function, written in SQL or Python, or as a model registered in Unity Catalog. When you register these assets in Unity Catalog, they are discoverable centrally within the organization, they can be governed in one place, and they can be versioned, which is really critical for operating these workflows in production. We expose these agents through Databricks Model Serving or Function Serving, and that's where we enforce the circuit-breaker-style policies like retries, timeouts, and rate limits at the serving layer, typically via AI Gateway configuration. Now, for the data layer, Delta Lake stores everything. It stores not only the state versions from the agents but also customer data and all the other data your workflows need to work.
Talking about the state snapshots: the Delta table is immutable and versioned. For us, those state versions are just rows in a Delta table; we never update them in place. Each agent run is tied to a state version via MLflow Traces, so we can step through the evolution when something breaks. I also just want to touch on Unity Catalog. It governs everything: access control, lineage, and audit trails for both data and agents. MLflow gives us per-agent tracing and evaluation capabilities, with out-of-the-box LLM-as-judge evaluators and metrics on every call. And as I mentioned earlier, tools like Agent Bricks are Databricks' higher-level way of packaging these orchestration patterns for common multi-agent use cases, so you don't need to rebuild them every time.
So, just to wrap up this workflow: the LangGraph orchestrator calls Agent A, a Unity Catalog function or model. It gets a result and writes the version one state to Delta. It then calls Agent B with state version one, writes version two, and so on. MLflow traces every call: latency, inputs, outputs, token usage. A circuit breaker at the serving layer guards each call. If Agent C fails, LangGraph triggers the compensation logic and walks backward, calling the compensate functions for the previous successful steps.
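As a rough, illustrative sketch of how that wiring might start, here is a minimal LangGraph graph with stub agents. The Databricks-specific pieces, Unity Catalog function calls, Delta writes, MLflow tracing, and the serving-layer circuit breakers, are intentionally omitted, so treat this as a skeleton rather than the production code.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class WorkflowState(TypedDict):
    version: int
    data: dict

def agent_a(state: WorkflowState) -> WorkflowState:
    # In the real system this would invoke a Unity Catalog function/model
    # and append the new state version as a row in a Delta table.
    return {"version": state["version"] + 1, "data": {"research": "..."}}

def agent_b(state: WorkflowState) -> WorkflowState:
    return {"version": state["version"] + 1, "data": {"analysis": "..."}}

graph = StateGraph(WorkflowState)
graph.add_node("agent_a", agent_a)
graph.add_node("agent_b", agent_b)
graph.set_entry_point("agent_a")
graph.add_edge("agent_a", "agent_b")
graph.add_edge("agent_b", END)

app = graph.compile()
final_state = app.invoke({"version": 0, "data": {}})
print(final_state)
```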
These kinds of patterns run in production day in and day out. So, thank you for hearing me out. You can reach out to me on LinkedIn; you can scan this QR code, which will take you directly to my LinkedIn profile.
I would like to leave you with three final thoughts. First, agent chaos is inevitable. When you scale past one agent, you will hit coordination problems, race conditions, cascading failures. That's guaranteed. The complexity curve doesn't lie. Second, your agent choreography is a choice. You can build systems with proper patterns: orchestration, choreography, immutable state, circuit breakers, compensation patterns, data contracts. Make sure you understand these patterns and bring them to your production architecture. Third, doing so will help you build systems, not demos. Demos are easy. You use an LLM to show something cool; everyone can do it. But those things don't work in production. In production, you have to build systems, and systems are hard. Systems are what create value for businesses. Everything I showed you today, choreography versus orchestration, immutable state, circuit breakers, is unsexy infrastructure work. You won't get applause for implementing a circuit breaker, but you make your systems more reliable. They don't fail at 2:00 a.m. That is what people notice over time. Be a systems engineer. The patterns here work. Apply them in your production architecture. Thank you very much for watching. Bye.