The Multi-Agent Architecture That Actually Ships — Luke Alvoeiro, Factory
Channel: aiDotEngineer
Published at: 2026-05-06
YouTube video id: ow1we5PzK-o
Source: https://www.youtube.com/watch?v=ow1we5PzK-o
>> Hi everyone. My name is Luke, and my goal is that 20 minutes from now you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today.

A little bit about me: I come from a background in dev tools. About two and a half years ago I started a project at Block, where I was working at the time, and that project evolved into Goose. Goose is now one of the leading coding agents; it's open source and was recently donated to the Agentic AI Foundation, which has been really cool to see. Nowadays I work at Factory, where I lead our core agent harness, and Factory's mission is to bring autonomy to the entire software development life cycle.

I want to start with a claim: the bottleneck in software engineering today is not intelligence, it's human attention. Even the best engineers can only complete a couple of tasks at a time. They may have a backlog of 50 features, but they can only drive a few forward per day, because every task requires their attention and every commit needs their review. Today's models are smart enough to figure out all 50 of those tasks, but there isn't enough bandwidth to supervise their implementation. So we kept asking ourselves: what if a human decides what to build, and a system figures out how to build it? An agent could just work for hours or days, and you come back to finished work. That's what I'm here to talk about.

When you start researching multi-agent frameworks and systems, you quickly realize the field is a bit of a mess. Everyone has their own framework, their own terminology, their own opinions about what works and what doesn't. So I want to propose a simple taxonomy of five frontier multi-agent strategies.

The first is delegation. This is where one agent spawns another agent; the parent agent might say "go figure out the database schema" and then gets a response back. It's the simplest form of multi-agent communication and what most people implement first; sub-agents in coding tools are the most common example.

The second is creator-verifier, where one agent builds something and another agent checks that work. The key here is separation of concerns: the agent that implemented the code is biased toward its own work and wants that code to pass, while a fresh agent with fresh context is far more likely to find issues. This is why we do code review as humans as well.

The third is direct communication, where agents talk to each other without a central coordinator, essentially DMing each other. It's hard to get right, though, because without that coordinator state fragments across conversations and there's no single source of truth.

The fourth is negotiation, where agents communicate over a shared resource: maybe they want to use the same API, or modify the same portion of the codebase. Negotiation doesn't need to be adversarial; in fact, the best use case is positive-sum trading, where agents have a potential win-win while interacting.

The last is broadcast, where one agent sends information to many: status updates, new context that applies to everyone, shared constraints. It's a bit less flashy than the others, but it's critical for maintaining coherence over long-running tasks.
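To make the simplest of these strategies, delegation, a bit more concrete, here is a minimal sketch. The `run_agent` helper and the field names are placeholder assumptions for illustration, not part of Goose or Factory's actual APIs.

```python
# Minimal delegation sketch: a parent agent hands a narrowly scoped task to a
# fresh sub-agent and consumes only the distilled answer, instead of polluting
# its own context with the exploration.
from dataclasses import dataclass


@dataclass
class SubTaskResult:
    task: str
    answer: str


def run_agent(system_prompt: str, task: str) -> str:
    # Placeholder for a real agent loop (LLM calls plus tool use). A real
    # implementation would run model turns and tool results here.
    return f"[stub answer for: {task}]"


def delegate(task: str) -> SubTaskResult:
    answer = run_agent(
        system_prompt="You are a research sub-agent. Answer concisely.",
        task=task,
    )
    return SubTaskResult(task=task, answer=answer)


# Parent agent: delegate a question, keep only the result.
schema_info = delegate("Figure out the database schema used by the billing service.")
print(schema_info.answer)
```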
So when you have all of these building blocks, how do you assemble them into a system that can run for many days? Missions is our answer. It's a system that combines four of those strategies, delegation, creator-verifier, broadcast, and negotiation, into a single workflow. You describe a goal, you scope it through a conversation, you approve a plan, and then the system handles execution for hours or days, which lets you focus on something else. Notably, a mission is not a single agent session; it's an ecosystem of agents that communicate through structured handoffs and shared state. It uses a three-role architecture: an orchestrator, workers, and validators.

The orchestrator handles planning. When you describe what you want, the orchestrator acts as your sounding board: it asks the right strategic questions, checks for unclear requirements in the problem space, and eventually produces a plan that includes features, milestones, and something called a validation contract. The validation contract defines what "done" means before any coding happens, and I'll come back to why that turns out to be really important to the system.

Workers handle implementation. When a feature is assigned to a worker, that worker has clean context: no accumulated baggage, no degraded attention. The worker reads its spec, implements the feature, and commits via Git, so the next worker inherits a clean slate and a working codebase.

Validators handle verification. Most systems validate by running lint, type checks, and tests, and maybe code review. Missions does all of that, but we also validate behavior. Instead of just asking "does the code look right?", we ask "does this work end to end?" That's the difference that lets missions run for many hours or days in a row without drifting, and making it work meant rethinking validation entirely.

If you've worked with coding agents before, you've probably seen this pattern: an agent builds a feature, writes some tests, the tests pass, there's full coverage. But the tests were shaped by the code, not by what the code was actually supposed to do. Tests written after implementation don't catch bugs; they confirm decisions. If you rely on validation like that, your system will eventually drift.

That's why the validation contract exists. It's written during planning, before any code, and it defines correctness independently of the implementation. For a complex project this can be hundreds of assertions, and each feature is assigned one or more assertions it must satisfy. Taken together, the features must cover every assertion.

After each milestone of features, two types of validators run: the scrutiny validator and the user-testing validator. The scrutiny validator is the more traditional one: it runs the test suite, type checking, and lints, and critically, it spawns dedicated code-review agents for each completed feature in the milestone. The user-testing validator is more interesting. It acts like a QA engineer: it spawns the application and interacts with it through computer use or something similar. It fills out forms, checks that pages render correctly, clicks buttons, and ensures that functional flows work holistically. This step takes significantly longer than the scrutiny validator, because the system is interacting with a live application, and what we've noticed is that most of a mission's wall-clock time is actually spent here, waiting for real-world execution rather than generating tokens. Critically, neither validator has seen the code before. They're not invested in the implementation, so validation is adversarial by design.
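To make the validation-contract idea concrete, here is a minimal sketch of what such a structure could look like. The field names and the coverage check are illustrative assumptions, not Factory's actual schema.

```python
# Minimal validation-contract sketch: written at planning time, it defines
# correctness independently of any implementation, and the plan isn't complete
# until every assertion is owned by some feature.
from dataclasses import dataclass, field


@dataclass
class Assertion:
    id: str
    description: str          # behavior that must hold, e.g. "user can log in"


@dataclass
class Feature:
    name: str
    assertion_ids: list[str]  # assertions this feature is responsible for


@dataclass
class ValidationContract:
    assertions: list[Assertion]
    features: list[Feature] = field(default_factory=list)

    def uncovered(self) -> set[str]:
        """Assertions not yet owned by any feature."""
        covered = {aid for f in self.features for aid in f.assertion_ids}
        return {a.id for a in self.assertions} - covered


contract = ValidationContract(
    assertions=[
        Assertion("A1", "A new user can create a workspace"),
        Assertion("A2", "Messages appear in the channel within one second"),
    ],
    features=[Feature("workspace-creation", ["A1"])],
)
assert contract.uncovered() == {"A2"}  # A2 still needs an owning feature
```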
Okay, so validation catches bugs. But for a system that runs for many days, you also need to make sure that context isn't lost between agents. When a worker finishes a feature, it doesn't just say "I'm done." It fills out a structured handoff detailing what was completed, what was left undone, what commands were run during that agent loop and what their exit codes were, what issues were discovered, and whether it followed the procedures the orchestrator defined for that worker. That's how we catch issues and how the system self-heals: errors get caught at milestone boundaries, corrective work gets scoped, and the mission pulls itself back on track, not by hoping agents remember what happened, but by forcing them to write it down and then actually address the issues. Our longest mission ran for 16 days, which is much longer than a full sprint, and we believe they can run for 30. That's only possible because of this structure.

Once we had this architecture, the next question was how to actually run it. The most obvious choice is parallelism: if you have 10 agents running at one point in time, you have 10 times the throughput. We tried that, and it doesn't really work for software development tasks, because agents conflict. They step on each other's changes, they duplicate work, they make inconsistent architectural decisions, and the coordination overhead ends up eating the speed gains, all while you're burning tokens. The difference with missions is that we run features serially: there's only one worker or validator running at any given point in time. Within a feature, we allow parallelization of read-only operations, so things like searching through the codebase or researching APIs get parallelized. Within validators, we also parallelize read-only operations such as code review. This is serial execution with targeted internal parallelization. It seems slower on paper, but the error rate drops dramatically, and when you have tasks that run for many days, that correctness compounds.

Now, your standard chat interface doesn't really work for something that lasts many days. At a quick glance, you need to see how much of the project has been completed and how much of the budget you originally set out with has been burned. So we built mission control, a dedicated view for this. You can see what the active worker is doing right now, read the handoff summaries in detail, and see what a worker or validator discovered and how it will alter its course going forward. This view lets you run missions asynchronously: you can stay plugged in as a project manager overseeing implementation, or you can just go hang out with your friends that night.
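Here is a minimal sketch of a structured handoff record and the kind of at-a-glance summary a view like mission control can derive from it. The field names mirror the items listed in the talk but are otherwise illustrative assumptions, not Factory's actual format.

```python
# Structured handoff sketch: what a worker writes down when it finishes a
# feature, so that nothing depends on agents "remembering" what happened.
from dataclasses import dataclass, field


@dataclass
class CommandRun:
    command: str
    exit_code: int


@dataclass
class Handoff:
    feature: str
    completed: list[str]                  # what was finished
    left_undone: list[str]                # what remains
    commands: list[CommandRun] = field(default_factory=list)
    issues_discovered: list[str] = field(default_factory=list)
    followed_procedures: bool = True      # did the worker follow the orchestrator's procedures?

    def blockers(self) -> list[str]:
        """Anything listed here blocks milestone progress until addressed."""
        blockers = list(self.issues_discovered)
        blockers += [c.command for c in self.commands if c.exit_code != 0]
        if not self.followed_procedures:
            blockers.append("orchestrator procedures not followed")
        return blockers


def mission_status(handoffs: list[Handoff], planned_features: int) -> str:
    """The at-a-glance numbers a mission-control-style view needs."""
    done = sum(1 for h in handoffs if not h.left_undone and not h.blockers())
    open_blockers = [b for h in handoffs for b in h.blockers()]
    return f"{done}/{planned_features} features done, {len(open_blockers)} open blocker(s)"


status = mission_status(
    [Handoff("message-threading", ["reply endpoint"], [],
             commands=[CommandRun("pytest -q", 0)])],
    planned_features=12,
)
print(status)  # "1/12 features done, 0 open blocker(s)"
```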
Okay, so: the right model in each role. Everything here assumes one thing, which is that you're putting the right model in each role. Planning benefits from slow, careful reasoning; implementation from fast code fluency and creativity; validation from precise instruction following. No single model, and no single model provider, is best at all three. Using systems like missions requires developing a new skill, which internally we've been calling droid whispering: you need to be able to mentally model how different LLMs interact, where they fail, and how those failures compound over a multi-day run, and then make a deliberate choice about which model sits in which seat. Theo, the engineer who built our missions prototype, came up with our model defaults, but we really encourage people to make these their own and customize them to the needs of their project. For example, validation might use a different model provider entirely, to make sure it isn't biased by the same training data.

This is a structural advantage of a model-agnostic architecture. You're only as strong as your weakest link, and if you're locked into one model provider, you're constrained by that family's weakest capability. As models continue to specialize, the ability to put the right model in the right seat becomes a compounding advantage. It works in the other direction, too: the structure of missions can compensate for models that aren't quite at frontier-level performance. The validation contracts and milestone checkpoints let you run missions very successfully even with open-weight models.

Now, this all sounds quite theoretical. What does it actually look like in production? I've got an example of building a clone of Slack right here. This slide has a ton of information, but I'll call out a few things. About 60% of our time is spent on implementation, and about 60% of our tokens as well. Notice how validation never succeeds on the first go (that's the chart on the bottom left); we almost always have to create follow-up features, which really demonstrates the value of a system that does this QA loop. You end up with 50% of your lines of code at the very end being tests (bottom right), and 90% of your code is covered by those tests. And lastly, we take advantage of prompt caching heavily to offset the cost of running such a long task.

People have really taken to missions, and it's been awesome to see what folks have been building with them. I've included some examples on this slide, but the ones I want to call out are in the enterprise setting, which is where Factory really shines: prototyping new ideas and features overnight, building internal tools at increasingly rapid rates, running huge refactors and migrations, ML research, and modernizing codebases so that agents are more productive in them.
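Going back to the "right model in each seat" idea for a moment, a minimal sketch of what such a configuration could look like is below. The role names come from the talk, but the model identifiers, providers, and the mapping itself are placeholder assumptions, not Factory's defaults.

```python
# Model-per-role sketch: the seat assignment is a deliberate, per-role choice
# rather than one model everywhere. Model strings and providers are placeholders.
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    model: str        # which model sits in this seat
    provider: str     # which provider serves it
    rationale: str    # why this seat gets this model


MISSION_MODELS = {
    "orchestrator": RoleConfig("slow-reasoning-model", "provider-a",
                               "planning benefits from slow, careful reasoning"),
    "worker":       RoleConfig("fast-coding-model", "provider-b",
                               "implementation benefits from fast code fluency"),
    "validator":    RoleConfig("strict-instruction-model", "provider-c",
                               "validation benefits from precise instruction following; "
                               "a different provider reduces shared-training-data bias"),
}


def model_for(role: str) -> RoleConfig:
    return MISSION_MODELS[role]


print(model_for("validator").model)
```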
One thing I also wanted to talk about is the bitter lesson, because everyone building multi-agent systems has this fear that the next model release will make their architecture obsolete overnight. So when we were building missions, we decided the system had to get better with every model improvement. That means almost all of the orchestration logic is defined in prompts and skills instead of a hard-coded state machine. How it decomposes features and handles failures lives in about 700 lines of text, and changing four sentences of it can alter the execution strategy pretty dramatically. Worker behavior is driven by skills that the orchestrator defines per mission, so you get very customized behavior. The only deterministic logic is very thin, and it's focused on letting models do what they do best while the system handles the bookkeeping: things like running validation and blocking progress when there are handoff issues that haven't been addressed. Missions provide the discipline, and the models provide the intelligence, using primitives they're already familiar with, like agents.md, skills, and so on.

So what does this unlock? Remember the bottleneck I started with: human attention. The economics are changing. Before, a team of five engineers might work on 10 work streams at any given point in time; with missions, maybe we can bring that up to 30. The team can focus on the interesting problems, like architecture and product decisions, instead of worrying about the execution itself. And importantly, the codebase ends up cleaner than when you started: the end-to-end tests, the unit tests, the skills, and the structure that missions provide mean that agents and humans are more productive in that environment going forward.

So, now that you understand how missions are structured and how they actually work, you can see that they're really a composition of those original strategies. Delegation shows up everywhere, in how the orchestrator spawns workers and how we spawn research sub-agents. Creator-verifier is fundamental, in that validation and implementation are always separate agents with separate context. Broadcast runs through the shared mission state that every agent references. And negotiation shows up at milestone boundaries, where the orchestrator decides whether a handoff summary looks correct and whether we need to create follow-up features, rescope, and so on.

But strategies aren't enough. You need the connective tissue: structured handoffs so that agents don't lose context, the right model in each role, and an architecture that improves with each model improvement. What I like to think is that the people in this room who are thinking in terms of agent ecosystems, who develop an intuition for how different models compose under pressure, are the ones who will ship the next generation of innovation. There are still a lot of open questions: how do we further parallelize the workload of missions so they run faster? How do we start orchestrating missions themselves into even more complex workflows? But the data from production missions is clear: this works on real projects at scale today. So, this is what I'll leave you with.
Open Droid, try running /missions, argue with the orchestrator about the scope, approve the plan, and then go do something else. I'm excited to see what you build, and I'll be around to answer any questions for the rest of the day. Thanks.

>> [applause]