UX Design Principles for Semi Autonomous Multi Agent Systems — Victor Dibia, Microsoft
Channel: aiDotEngineer
Published at: 2025-07-21
YouTube video id: fmZWvE7yDZo
Source: https://www.youtube.com/watch?v=fmZWvE7yDZo
Hi everyone. My name is Victor Dibia. I'm a principal research software engineer at Microsoft Research, and my background is mostly in human-AI experiences; that's where my interest is right now. Over the last few years I've been looking at scenarios where a human works in tandem with an AI agent to solve problems. One of the things that came out of work at Microsoft Research is GitHub Copilot. How many of you have used GitHub Copilot? Excellent. In my opinion it's the first example of an AI model working at scale in an IDE, helping the developer solve a problem. More recently I've spent my time working on an open source multi-agent framework called AutoGen. How many of you have heard of AutoGen? Okay, great, about half the room. As part of that I've also helped build AutoGen Studio, which is a local developer tool for building multi-agent workflows. Previously I worked at Cloudera as a machine learning engineer, and before that at IBM Research as a research staff member focusing on human-computer interaction.

Okay, so how did I get into agents? A really brief history. Sometime around August 2022, about four months before ChatGPT took off, I worked on a project called LIDA. Essentially, that tool let you drag some data, a CSV or a JSON file, into a web user interface, and it did a few things. First, it came up with a summarization of the data. Next, it generated a set of questions about the data, and for each of those questions it generated code, executed it, did some post-processing and error recovery, and then showed the user a set of visualizations. If you look at it, it actually is an agentic workflow, just quite early for its time.
So it had these main capabilities: summarization, goal exploration, and visualization generation, which is essentially a code interpreter built into the system. And once we had the data visualizations, we used a diffusion model to come up with more diverse representations of that data set. An interesting thing about the system is that the first version used the OpenAI DaVinci and Codex models. Does anybody remember the DaVinci models? That's a really long time ago. When I first showed it, the error rate was about 20%. About three months later the GPT-3.5 Turbo models came out, we tuned for them, and the error rate went down to about 1.5%. The key fun fact is that this showed these sorts of applications were possible, and today you see these capabilities across many Microsoft products, and products beyond Microsoft.

Fast forward: a few colleagues and I started to think about how, as opposed to building these hand-built workflows, we could build multi-agent applications where you define agents and they exchange messages and self-organize to explore a problem space. That's where AutoGen came about. I'd encourage you to look through it; it's a framework for building multi-agent applications and it's pretty widely used. The more interesting thing I did there is AutoGen Studio, which is a nifty low-code tool: you sign into a web interface and you can compose multiple agents.
For example, you create a team, you drag a set of agents into that team, and for each team you have primitives like models and tools that you can compose together to build multi-agent applications. When I started to prepare this talk, I wanted to walk you through all the capabilities of AutoGen Studio, how we built it, and the design philosophy behind it. But there are already a bunch of resources out there, and this is AI.Engineer, so I thought: how about we go ahead and build something from scratch and show that today? Maybe you shouldn't do that, but we're going to do it anyway. The tool I'm going to show today is called BlenderLM. It's a multi-agent system built from scratch, no frameworks, nothing. The idea is that it helps you accomplish 3D tasks: you can go to this tool and say something like "build a scene with a bowl on a table," and it does all of the plumbing underneath and gets you a Blender scene that accomplishes that. Anyone here familiar with Blender? Okay, great. Awesome.

So here's the plan. I have about 13 minutes left. I'll show you a demo, walk you through how I built it, and then we'll synthesize a set of design principles that underpin a good user experience for a tool like this, and finally settle on a few takeaways.

Okay, in terms of background, how did I settle on BlenderLM? About two years ago I wanted to learn how to use Blender, and if you've tried to use Blender, there's a really popular tutorial called the donut tutorial.
That's what you're looking at here. It's kind of deceptive, because the tutorial takes about four hours, but at the end of the day you need about 40 or 50 hours to get through the whole thing. You're trying to learn where things live, how to use the tool, and then all the concepts underneath. So one of the things I asked myself was: with all I know about agents, with all my experience building AutoGen, can I create an agentic workflow that takes me from natural language to something that looks like this? The prototype is not at this level of quality yet, but I think it can get there.

The next question is how you express this as a multi-agent workflow, and you have a couple of options. Do you build a fixed, deterministic workflow? If you've been at this conference you've seen people debate the pros and cons. With a deterministic workflow you know exactly what all the steps are, and this is great; we use a lot of that in production today. You can build reliable systems, take advantage of things like function calling and structured output, and build really valuable systems. However, it requires that you know the exact solution to the problem, because what you're doing is expressing that solution as a workflow. But there's a class of problems, like the one we want to address here, where you don't know the exact solution, because every time you take an action, say, click something in Blender, the entire space changes and you have to react to that in some way. On the other end of the spectrum, and what I'm going to focus on today, are more autonomous, exploratory systems: systems where an LLM drives the flow of control.
We have tools, we take actions, we observe the results, and then we make progress. So there are three characteristics to keep at the back of our minds. The system should have a bit of autonomy: it might not address just a single task, but many different tasks. It should be able to take actions, and an action here can have side effects; for example, you could call a tool and it could return a result you don't expect, and your system should be able to handle that. Finally, these systems are expected to explore complex tasks, break them down into steps, and run for extended periods of time.

Okay, let's switch to a quick demo. This is the BlenderLM interface. It's a web application, and it's connected over a WebSocket connection to an actual Blender instance. Blender is a software tool for building 3D assets. The first thing you'll notice is that we have a set of fixed tools the developer can use directly, and I'll tell you why we need that in a second. For example, I can click a button to clear the scene, and because we have a socket connection, we can stream exactly what's going on in the Blender interface: the scene is now cleared, and we can show that in the UI. Next, let's go to a list of predetermined examples. I might ask the system to create two balls with a shiny, glossy silver finish, and a bunch of activities start to occur, streamed to the UI in real time. First it says it's analyzing the task. If we scroll up a bit, we see it's come up with a plan; there's a planning agent underneath. The first step is to set up the scene environment by adding a ground plane.
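The link between the web backend and Blender described here can be sketched as a small JSON-over-socket client. This is a hypothetical sketch, not BlenderLM's actual protocol: the command names, port, and message shape are all assumptions, standing in for a socket server running inside a Blender add-on.

```python
import json
import socket

# Hypothetical client side of the web UI <-> Blender link: the backend
# sends one small JSON command (e.g. "clear_scene") to a socket server
# inside a Blender add-on and reads back a JSON result that can be
# streamed to the UI. Command names and the default port are assumptions.

class BlenderClient:
    def __init__(self, host="127.0.0.1", port=9876):
        self.host, self.port = host, port

    def send(self, command, params=None):
        """Send one command and return the decoded JSON reply."""
        msg = json.dumps({"command": command, "params": params or {}})
        with socket.create_connection((self.host, self.port)) as sock:
            sock.sendall(msg.encode("utf-8"))
            sock.shutdown(socket.SHUT_WR)  # signal end of request
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return json.loads(b"".join(chunks).decode("utf-8"))

# Usage (against a running add-on server):
#   client = BlenderClient()
#   client.send("clear_scene")
#   client.send("add_object", {"type": "sphere", "radius": 1.0})
```

One request per connection keeps the sketch simple; a real system would hold a persistent WebSocket so scene updates can be pushed to the UI as they happen.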
We're going to create two spheres with the correct spatial separation, assign a glossy silver finish, and so on. We can see in real time that the first thing it does is put down that plane; if we look at Blender, we see the horizontal plane here. All of this is running live. You probably shouldn't do live demos in a talk, but hey, we're trying to be brave here. Then it explores: each time it takes a step, it calls a bunch of tools and executes them, we stream the update to the user interface, and then we have a verification loop. There's a verifier agent that takes a snapshot of the scene, an actual log of what's in the scene plus a visual representation, and we use an LLM to judge: are we making progress? Are we stalled? We use that information to decide what to do next. And we can see that we have the result here, which is actually what we really wanted. We can look at it in Blender and tweak it around. Hey, look at that. It works. I think we deserve a little applause here. Come on. There we go.

I want to explore more, and I have about eight minutes left, so let's move on. How is all of this built? Let's walk through the process really fast. Most of the time when people think of a system like this, the first thing they'll probably say is: let's define the agent. That's really not what you should do first. First, define the goal. Pretty simple. Next, come up with a baseline; it probably has nothing to do with agents or AI, it just ensures everything works correctly. Third, build out your tools: what tools does this agent need? If a human were going to do this, what tools would they need to accomplish it? Still not the agent yet. Next, define a test bed: how do we evaluate how this thing works?
And then finally, when you have all of that, you go ahead and build the agent.

Step one is really simple: we want to translate natural language tasks to 3D artifacts. Next, we create a baseline. We want a script that can do the hello world of Blender: we run it, and it adds a single cube to the scene. What we need here is a Blender add-on and a client library that can handle the socket connection, and this is really valuable for rapid prototyping and testing. Next, we define a set of tools, and there are two types. There are task-specific tools, for example one that just creates a Blender object and does nothing else. And then you might have a general-purpose tool, something that executes arbitrary code: you get your LLM to generate code, you execute it, and that's what drives all the capabilities in Blender. One thing to note: your agent is only as good as the tools you give it, so spend a lot of time, about 50% of your time, on tools. And you can test all of this in code; this is what that looks like. Next, you want to build an eval test bed. In this case it's three steps. V1 is just a Jupyter notebook: we write all the code and test it there. Next, we create a full interactive web UI, which is the kind of thing I just showed. And third, we create an automated test suite, with metrics and a full evaluation harness. Then finally, to create your agent, the first thing you do is create a base agent loop. If you've been at this conference, you know that an agent is mostly an LLM in a tight loop with a bunch of function calls. You create that, you get your final result, and then you're fine.
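The "LLM in a tight loop with function calls" idea, plus the two tool types, can be sketched in a few lines. This is a minimal illustration under assumptions: the model is a stand-in function that returns action dicts, and the tool names (`add_sphere`, `run_python`) are hypothetical, not BlenderLM's real tool set.

```python
# Minimal sketch of a base agent loop: the model picks an action, we run
# the requested tool, feed the observation back, and repeat until the
# model emits a final answer or we exhaust the step budget. The model is
# a stand-in; a real system would call an LLM API with function calling.

def make_agent(model, tools, max_steps=10):
    """Build an agent that runs `model` in a tight loop over `tools`."""
    def run(task):
        history = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            action = model(history)            # model chooses the next step
            if action["type"] == "final":
                return action["content"]
            tool = tools[action["tool"]]       # look up the requested tool
            result = tool(**action.get("args", {}))
            history.append({"role": "tool",
                            "name": action["tool"],
                            "content": result})  # feed observation back
        return "stopped: step budget exhausted"
    return run

# Two tool flavors, as described in the talk (names hypothetical):
def add_sphere(radius=1.0):
    # task-specific: does exactly one thing
    return f"added sphere r={radius}"

def run_python(code):
    # general-purpose: would hand LLM-generated code to Blender's interpreter
    return "executed"

tools = {"add_sphere": add_sphere, "run_python": run_python}
```

The step budget matters: without it, a model that never emits a final answer loops forever, which is exactly the failure mode the verifier and interruptibility sections below are about.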
But for a problem like this, that's typically not enough, and you need to iterate a little more. In this case we have two other agents. One is the verifier agent: every time the main agent takes a step, we take the content of the scene, a list of all the objects there, and use an LLM to predict whether we're making progress and whether the task is completed, and then we decide how to move forward. The second agent you want is a planner. You saw earlier that when the task came in, the planner broke it down into atomic steps, and then each of those steps was addressed in that tight loop.

So what can we learn from all of this? The design principles I'm going to give you here are not exhaustive, and they're not perfect. In fact, if you meet someone who tells you they know the exact design principles for multi-agent system design, you probably shouldn't trust them, because the space is just too early for that. What I'm going to give you today is a set of four high-level ideas that you can take, and once you build this sort of system, apply them to see how you can improve it.

The first principle is capability discovery. Because you have an agent, it can do a whole bunch of things, but there are only a few things it can do with high reliability. You saw earlier that I had these little pills showing the things the agent can do: you want to itemize the kinds of things your agent can do with high reliability. The second thing you can do is make proactive suggestions based on user context. Say we have a scene open: you can parse the scene and suggest to the user some high-level things they could accomplish. This is an example of that.
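The verifier and planner described above can be sketched as follows. This is an illustrative shape only: in the real system the judge and the planner are LLM calls over a scene snapshot, while here they are plain functions with assumed return fields.

```python
# Sketch of the verify/plan pattern: after each agent step we snapshot
# the scene and ask a judge whether the task is complete or stalled.
# In the real system the judge is an LLM given the object list plus a
# render; here it is any callable returning the fields shown.

def verify_step(snapshot, goal, judge):
    """Return 'done', 'continue', or 'stalled' for the current state."""
    verdict = judge(snapshot, goal)  # LLM-as-judge in the real system
    if verdict["task_complete"]:
        return "done"
    return "continue" if verdict["making_progress"] else "stalled"

def plan(task):
    # A planner agent would decompose the task with an LLM; this fixed
    # decomposition just illustrates the atomic-steps idea.
    return ["set up scene environment", f"execute: {task}", "verify result"]
```

Each planned step then runs through the tight loop, with `verify_step` deciding after every iteration whether to continue, replan, or finish.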
The second principle is observability and provenance. Stream all of the activity logs to help the user make sense of what the agent is doing, and provide tools for debugging. All the little things, the number of tokens used, the amount of time taken for each step, are very valuable for the user in making sense of what the agent is doing. The third is interruptibility. At this point your agent is taking all kinds of actions, so you want to design your system such that you can pause it at any time: it might be going down the wrong route, about to make a mistake, or about to consume resources you didn't intend. You want a system that enables things like checkpointing, rollback, pause, and resume. And finally, cost-aware delegation. From the LLM's perspective, all actions are equal, all tool calls are equal, unless you do something about it. In the case of Blender, you might have it write some Python code that, say, adds something to the scene, but for whatever reason it tries to delete the entire operating system. You really don't want things like that to happen. So you want a module that actively inspects the action, tries to estimate its cost, and knows when to delegate to the user.

I'm getting to the end, so what are some key takeaways? The first one is: know when to use a multi-agent approach. A multi-agent approach is not always the thing to use. When you have multiple agents collaborating and you give them a bunch of autonomy, you also increase the surface for error, so like any other tool, you should inspect the problem space and verify that a multi-agent system is actually the right tool for the job.
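Cost-aware delegation can be sketched as a guard in front of the code-execution tool: score the generated code for risk, run it directly when cheap, and hand the decision to the user when it crosses a threshold. The pattern list and weights here are illustrative assumptions, not a real safety mechanism.

```python
# Sketch of cost-aware delegation: before executing model-generated
# code, estimate a rough risk score from known-dangerous patterns and
# delegate to the user above a threshold. Patterns and weights are
# illustrative only; a real guard would be far more thorough.

RISKY_PATTERNS = {
    "shutil.rmtree": 10, "os.remove": 10, "subprocess": 8,
    "open(": 3, "bpy.ops.wm.save": 2,
}

def risk_score(code):
    """Sum the weights of every risky pattern present in the code."""
    return sum(w for pat, w in RISKY_PATTERNS.items() if pat in code)

def execute_with_guard(code, run, ask_user, threshold=5):
    """Run low-risk code directly; delegate risky code to the user."""
    if risk_score(code) >= threshold and not ask_user(code):
        return "rejected by user"
    return run(code)
```

The useful property is asymmetry: the agent keeps full autonomy for cheap, reversible actions, and the human is only interrupted for actions whose estimated cost justifies it.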
I always show this little graph. If the big circle is the set of tasks that most engineering teams need to do, then this small circle at the bottom is the set of tasks that truly benefit from a multi-agent approach. It's really that small. Before you try to build an autonomous multi-agent system, think very carefully and ensure you have good ROI on this specific approach.

The next question I get is: how do I know if my task might benefit from a multi-agent approach? I typically offer a simple framework. The first criterion is planning: does the task benefit from planning? Can you take a high-level input from the user and meaningfully break it down into steps that lead you from an unsolved state to a solved state? Next, can you break the task into multiple perspectives or personas? You might have a persona that handles planning of the task, and personas that handle, say, code execution; with a multi-agent approach you can explore this sort of domain-driven design. Third, does the task require consuming or processing extensive context? Here we're constantly snapshotting the state of the app, screenshots and all of that, and it's useful to give individual agents each of these large pieces of context to process and then return to a final coordination agent. And finally, adaptive solutions: as you take actions in the environment your agent lives in, the environment might change, and you constantly need to react to that, so you might need an autonomous agent approach.

The second takeaway is eval-driven design. Most people want to start out by just building their agent. That's typically a mistake. Instead, you want to define your task.
Define evaluation metrics. Build a baseline that has nothing to do with agents. Then improve your agents iteratively. In the case of this app, we had a simple tight loop, then we improved it: we added a verification agent, and then a planning agent. Based on the interactive evaluation tool I built, I could see that each of these things actually had ROI in terms of improvement, and that's why it makes sense to explore a multi-agent approach in this space. The final thing is that academic benchmarks are great, but they're not your task; you really should build evals that are tuned to your task.

The last set of key takeaways are the design principles we walked through today. I think this is the money slide: four high-level things. First, always ensure that your users can discover the ideal tasks your multi-agent system is designed for. Provide user-facing observability traces. Ensure that your agents are interruptible, that you can checkpoint and restart them. And ensure that your agents can quantify the risk or cost of actions and delegate to users as needed. Then finally: don't build a whole multi-agent system from scratch just to give a talk. It's fun, but it's a lot of work, and if you ever want to do something like this, consider using a framework to save you a few keystrokes here and there.

On the last slide I have some further reading: a couple of papers we've written on AutoGen Studio, Magentic-One, Magentic-UI, and challenges in human-AI communication. These are all good references; I recommend you take a look. And that's the end of my slides. Thank you so much for listening. I have a book I'm writing; there's a lot more about this, and chapter 3 is really just about design. Take a look if it's helpful for you, and all the code for BlenderLM is also available. Thank you.