UX Design Principles for Semi Autonomous Multi Agent Systems — Victor Dibia, Microsoft
Channel: aiDotEngineer
Published at: 2025-07-21
YouTube video id: fmZWvE7yDZo
Source: https://www.youtube.com/watch?v=fmZWvE7yDZo
Hi everyone. My name is Victor Dibia. I'm a principal research software engineer at Microsoft Research, and my background is mostly in human-AI experiences; that's where my interest is right now. Over the last few years I've been looking at scenarios where a human works in tandem with an AI agent to solve problems. One of the things that came out of work at Microsoft Research is GitHub Copilot. How many of you have used GitHub Copilot? Excellent. In my opinion it's the first example of an AI model working at scale in an IDE, helping the developer solve a problem. More recently I've spent my time working on an open source multi-agent framework called AutoGen. How many of you have heard of AutoGen? Okay, great, about half the room. As part of that I've also helped build AutoGen Studio, which is a local developer tool for building multi-agent workflows. Previously I worked at Cloudera as a machine learning engineer, and before that at IBM Research as a research staff member focusing on human-computer interaction.

Okay, so how did I get into agents? A really brief history. Sometime around August 2022, about four months before ChatGPT took off, I worked on a project called LIDA. Essentially, that tool let you drag some data, a CSV or a JSON file, into a web user interface, and it did a few things. First, it came up with a summarization of the data. Next, it generated a set of questions about the data, and for each of those questions it generated code, executed it, did some post-processing and error recovery, and then showed the user a set of visualizations. If you look at it, it actually is an agentic workflow, just quite early for its time.
So it had these main capabilities: summarization, goal exploration, and visualization generation, which is essentially a code interpreter built into the system. And once we had the data visualizations, we used a diffusion model to come up with more diverse representations of that data set. An interesting thing about the system is that the first version used the OpenAI DaVinci and Codex models. Does anybody remember the DaVinci models? That's a really long time ago. When I first showed it, the error rate was about 20%. About three months later the GPT-3.5 Turbo models came out, we tuned for them, and the error rate went down to about 1.5%. The key fun fact is that this showed these sorts of applications were possible, and today you see these capabilities across many Microsoft products, and products beyond Microsoft.

Fast forward: a few colleagues and I started to think about how, as opposed to building these hand-built workflows, we could build multi-agent applications where you define agents and they exchange messages and self-organize to explore a problem space. That's where AutoGen came about. I'd encourage you to look through it; it's a framework for building multi-agent applications and it's pretty widely used. The more interesting thing I did there is AutoGen Studio, which is a nifty low-code tool: you sign into a web interface and you can compose multiple agents.
For example, you create a team, you drag a set of agents into that team, and for each team you have primitives like models and tools that you can compose together to build multi-agent applications. When I started to prepare this talk, I wanted to walk you through all the capabilities of AutoGen Studio, how we built it, and the design philosophy behind it. But there are already a bunch of resources out there, and this is AI.Engineer, so I thought: how about we go ahead and build something from scratch and show that today? Maybe you shouldn't do that, but we're going to do it anyway. The tool I'm going to show today is called BlenderLM. It's a multi-agent system built from scratch, no frameworks, nothing. The idea is that it helps you accomplish 3D tasks: you can go to this tool and say something like "build a scene with a bowl on a table," and it does all of the plumbing underneath and gets you a Blender scene that accomplishes that. Anyone here familiar with Blender? Okay, great. Awesome.

So here's the plan. I have about 13 minutes left. I'll show you a demo, walk you through how I built it, and then we'll synthesize a set of design principles that underpin a good user experience for a tool like this, and finally settle on a few takeaways.

Okay, in terms of background, how did I settle on BlenderLM? About two years ago I wanted to learn how to use Blender, and if you've tried to use Blender, there's a really popular tutorial called the donut tutorial.
That's what you're looking at here. It's kind of deceptive, because the tutorial takes about four hours, but at the end of the day you need about 40 or 50 hours to get through the whole thing. You're trying to learn where things live, how to use the tool, and then all the concepts underneath. So one of the things I asked myself was: with all I know about agents, with all my experience building AutoGen, can I create an agentic workflow that takes me from natural language to something that looks like this? The prototype is not at this level of quality yet, but I think it can get there.

The next question is how you express this as a multi-agent workflow, and you have a couple of options. Do you build a fixed, deterministic workflow? If you've been at this conference you've seen people debate the pros and cons. With a deterministic workflow you know exactly what all the steps are, and this is great; we use a lot of that in production today. You can build reliable systems, take advantage of things like function calling and structured output, and build really valuable systems. However, it requires that you know the exact solution to the problem, because what you're doing is expressing that solution as a workflow. But there's a class of problems, like the one we want to address here, where you don't know the exact solution, because every time you take an action, say, click something in Blender, the entire space changes and you have to react to that in some way. On the other end of the spectrum, and what I'm going to focus on today, are more autonomous, exploratory systems: systems where an LLM drives the flow of control.
We have tools, we take actions, we observe the results, and then we make progress. So there are three characteristics to keep at the back of our minds. The system should have a bit of autonomy: it might not address just a single task, but many different tasks. It should be able to take actions, and an action here can have side effects; for example, you could call a tool and it could return a result you don't expect, and your system should be able to handle that. Finally, these systems are expected to explore complex tasks, break them down into steps, and run for extended periods of time.

Okay, let's switch to a quick demo. This is the BlenderLM interface. It's a web application, and it's connected over a WebSocket connection to an actual Blender instance. Blender is a software tool for building 3D assets. The first thing you'll notice is that we have a set of fixed tools the developer can use directly, and I'll tell you why we need that in a second. For example, I can click a button to clear the scene, and because we have a socket connection, we can stream exactly what's going on in the Blender interface: the scene is now cleared, and we can show that in the UI. Next, let's go to a list of predetermined examples. I might ask the system to create two balls with a shiny, glossy silver finish, and a bunch of activities start to occur, streamed to the UI in real time. First it says it's analyzing the task. If we scroll up a bit, we see it's come up with a plan; there's a planning agent underneath. The first step is to set up the scene environment by adding a ground plane.
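The link between the web backend and Blender described here can be sketched as a small JSON-over-socket client. This is a hypothetical sketch, not BlenderLM's actual protocol: the command names, port, and message shape are all assumptions, standing in for a socket server running inside a Blender add-on.

```python
import json
import socket

# Hypothetical client side of the web UI <-> Blender link: the backend
# sends one small JSON command (e.g. "clear_scene") to a socket server
# inside a Blender add-on and reads back a JSON result that can be
# streamed to the UI. Command names and the default port are assumptions.

class BlenderClient:
    def __init__(self, host="127.0.0.1", port=9876):
        self.host, self.port = host, port

    def send(self, command, params=None):
        """Send one command and return the decoded JSON reply."""
        msg = json.dumps({"command": command, "params": params or {}})
        with socket.create_connection((self.host, self.port)) as sock:
            sock.sendall(msg.encode("utf-8"))
            sock.shutdown(socket.SHUT_WR)  # signal end of request
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return json.loads(b"".join(chunks).decode("utf-8"))

# Usage (against a running add-on server):
#   client = BlenderClient()
#   client.send("clear_scene")
#   client.send("add_object", {"type": "sphere", "radius": 1.0})
```

One request per connection keeps the sketch simple; a real system would hold a persistent WebSocket so scene updates can be pushed to the UI as they happen.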
We're going to create two spheres with the correct spatial separation, assign a glossy silver finish, and so on. We can see in real time that the first thing it does is put down that plane; if we look at Blender, we see the horizontal plane here. All of this is running live. You probably shouldn't do live demos in a talk, but hey, we're trying to be brave here. Then it explores: each time it takes a step, it calls a bunch of tools and executes them, we stream the update to the user interface, and then we have a verification loop. There's a verifier agent that takes a snapshot of the scene, an actual log of what's in the scene plus a visual representation, and we use an LLM to judge: are we making progress? Are we stalled? We use that information to decide what to do next. And we can see that we have the result here, which is actually what we really wanted. We can look at it in Blender and tweak it around. Hey, look at that. It works. I think we deserve a little applause here. Come on. There we go.

I want to explore more, and I have about eight minutes left, so let's move on. How is all of this built? Let's walk through the process really fast. Most of the time when people think of a system like this, the first thing they'll probably say is: let's define the agent. That's really not what you should do first. First, define the goal. Pretty simple. Next, come up with a baseline; it probably has nothing to do with agents or AI, it just ensures everything works correctly. Third, build out your tools: what tools does this agent need? If a human were going to do this, what tools would they need to accomplish it? Still not the agent yet. Next, define a test bed: how do we evaluate how this thing works?
And then finally, when you have all of that, you go ahead and build the agent.

Step one is really simple: we want to translate natural language tasks to 3D artifacts. Next, we create a baseline. We want a script that can do the hello world of Blender: we run it, and it adds a single cube to the scene. What we need here is a Blender add-on and a client library that can handle the socket connection, and this is really valuable for rapid prototyping and testing. Next, we define a set of tools, and there are two types. There are task-specific tools, for example one that just creates a Blender object and does nothing else. And then you might have a general-purpose tool, something that executes arbitrary code: you get your LLM to generate code, you execute it, and that's what drives all the capabilities in Blender. One thing to note: your agent is only as good as the tools you give it, so spend a lot of time, about 50% of your time, on tools. And you can test all of this in code; this is what that looks like. Next, you want to build an eval test bed. In this case it's three steps. V1 is just a Jupyter notebook: we write all the code and test it there. Next, we create a full interactive web UI, which is the kind of thing I just showed. And third, we create an automated test suite, with metrics and a full evaluation harness. Then finally, to create your agent, the first thing you do is create a base agent loop. If you've been at this conference, you know that an agent is mostly an LLM in a tight loop with a bunch of function calls. You create that, you get your final result, and then you're fine.
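The "LLM in a tight loop with function calls" idea, plus the two tool types, can be sketched in a few lines. This is a minimal illustration under assumptions: the model is a stand-in function that returns action dicts, and the tool names (`add_sphere`, `run_python`) are hypothetical, not BlenderLM's real tool set.

```python
# Minimal sketch of a base agent loop: the model picks an action, we run
# the requested tool, feed the observation back, and repeat until the
# model emits a final answer or we exhaust the step budget. The model is
# a stand-in; a real system would call an LLM API with function calling.

def make_agent(model, tools, max_steps=10):
    """Build an agent that runs `model` in a tight loop over `tools`."""
    def run(task):
        history = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            action = model(history)            # model chooses the next step
            if action["type"] == "final":
                return action["content"]
            tool = tools[action["tool"]]       # look up the requested tool
            result = tool(**action.get("args", {}))
            history.append({"role": "tool",
                            "name": action["tool"],
                            "content": result})  # feed observation back
        return "stopped: step budget exhausted"
    return run

# Two tool flavors, as described in the talk (names hypothetical):
def add_sphere(radius=1.0):
    # task-specific: does exactly one thing
    return f"added sphere r={radius}"

def run_python(code):
    # general-purpose: would hand LLM-generated code to Blender's interpreter
    return "executed"

tools = {"add_sphere": add_sphere, "run_python": run_python}
```

The step budget matters: without it, a model that never emits a final answer loops forever, which is exactly the failure mode the verifier and interruptibility sections below are about.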
But for a problem like this, that's typically not enough, and you need to iterate a little more. In this case we have two other agents. One is the verifier agent: every time the main agent takes a step, we take the content of the scene, a list of all the objects there, and use an LLM to predict whether we're making progress and whether the task is completed, and then we decide how to move forward. The second agent you want is a planner. You saw earlier that when the task came in, the planner broke it down into atomic steps, and then each of those steps was addressed in that tight loop.

So what can we learn from all of this? The design principles I'm going to give you here are not exhaustive, and they're not perfect. In fact, if you meet someone who tells you they know the exact design principles for multi-agent system design, you probably shouldn't trust them, because the space is just too early for that. What I'm going to give you today is a set of four high-level ideas that you can take, and once you build this sort of system, apply them to see how you can improve it.

The first principle is capability discovery. Because you have an agent, it can do a whole bunch of things, but there are only a few things it can do with high reliability. You saw earlier that I had these little pills showing the things the agent can do: you want to itemize the kinds of things your agent can do with high reliability. The second thing you can do is make proactive suggestions based on user context. Say we have a scene open: you can parse the scene and suggest to the user some high-level things they could accomplish. This is an example of that.
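The verifier and planner described above can be sketched as follows. This is an illustrative shape only: in the real system the judge and the planner are LLM calls over a scene snapshot, while here they are plain functions with assumed return fields.

```python
# Sketch of the verify/plan pattern: after each agent step we snapshot
# the scene and ask a judge whether the task is complete or stalled.
# In the real system the judge is an LLM given the object list plus a
# render; here it is any callable returning the fields shown.

def verify_step(snapshot, goal, judge):
    """Return 'done', 'continue', or 'stalled' for the current state."""
    verdict = judge(snapshot, goal)  # LLM-as-judge in the real system
    if verdict["task_complete"]:
        return "done"
    return "continue" if verdict["making_progress"] else "stalled"

def plan(task):
    # A planner agent would decompose the task with an LLM; this fixed
    # decomposition just illustrates the atomic-steps idea.
    return ["set up scene environment", f"execute: {task}", "verify result"]
```

Each planned step then runs through the tight loop, with `verify_step` deciding after every iteration whether to continue, replan, or finish.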
The second principle is observability and provenance. Stream all of the activity logs to help the user make sense of what the agent is doing, and provide tools for debugging. All the little things, the number of tokens used, the amount of time taken for each step, are very valuable for the user in making sense of what the agent is doing. The third is interruptibility. At this point your agent is taking all kinds of actions, so you want to design your system such that you can pause it at any time: it might be going down the wrong route, about to make a mistake, or about to consume resources you didn't intend. You want a system that enables things like checkpointing, rollback, pause, and resume. And finally, cost-aware delegation. From the LLM's perspective, all actions are equal, all tool calls are equal, unless you do something about it. In the case of Blender, you might have it write some Python code that, say, adds something to the scene, but for whatever reason it tries to delete the entire operating system. You really don't want things like that to happen. So you want a module that actively inspects the action, tries to estimate its cost, and knows when to delegate to the user.

I'm getting to the end, so what are some key takeaways? The first one is: know when to use a multi-agent approach. A multi-agent approach is not always the thing to use. When you have multiple agents collaborating and you give them a bunch of autonomy, you also increase the surface for error, so like any other tool, you should inspect the problem space and verify that a multi-agent system is actually the right tool for the job.
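Cost-aware delegation can be sketched as a guard in front of the code-execution tool: score the generated code for risk, run it directly when cheap, and hand the decision to the user when it crosses a threshold. The pattern list and weights here are illustrative assumptions, not a real safety mechanism.

```python
# Sketch of cost-aware delegation: before executing model-generated
# code, estimate a rough risk score from known-dangerous patterns and
# delegate to the user above a threshold. Patterns and weights are
# illustrative only; a real guard would be far more thorough.

RISKY_PATTERNS = {
    "shutil.rmtree": 10, "os.remove": 10, "subprocess": 8,
    "open(": 3, "bpy.ops.wm.save": 2,
}

def risk_score(code):
    """Sum the weights of every risky pattern present in the code."""
    return sum(w for pat, w in RISKY_PATTERNS.items() if pat in code)

def execute_with_guard(code, run, ask_user, threshold=5):
    """Run low-risk code directly; delegate risky code to the user."""
    if risk_score(code) >= threshold and not ask_user(code):
        return "rejected by user"
    return run(code)
```

The useful property is asymmetry: the agent keeps full autonomy for cheap, reversible actions, and the human is only interrupted for actions whose estimated cost justifies it.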
I always show this little graph. If the big circle is the set of tasks that most engineering teams need to do, then this small circle at the bottom is the set of tasks that truly benefit from a multi-agent approach. It's really that small. Before you try to build an autonomous multi-agent system, think very carefully and ensure you have good ROI on this specific approach.

The next question I get is: how do I know if my task might benefit from a multi-agent approach? I typically offer a simple framework. The first criterion is planning: does the task benefit from planning? Can you take a high-level input from the user and meaningfully break it down into steps that lead you from an unsolved state to a solved state? Next, can you break the task into multiple perspectives or personas? You might have a persona that handles planning of the task, and personas that handle, say, code execution; with a multi-agent approach you can explore this sort of domain-driven design. Third, does the task require consuming or processing extensive context? Here we're constantly snapshotting the state of the app, screenshots and all of that, and it's useful to give individual agents each of these large pieces of context to process and then return to a final coordination agent. And finally, adaptive solutions: as you take actions in the environment your agent lives in, the environment might change, and you constantly need to react to that, so you might need an autonomous agent approach.

The second takeaway is eval-driven design. Most people want to start out by just building their agent. That's typically a mistake. Instead, you want to define your task.
Define evaluation metrics. Build a baseline that has nothing to do with agents. Then improve your agents iteratively. In the case of this app, we had a simple tight loop, then we improved it: we added a verification agent, and then a planning agent. Based on the interactive evaluation tool I built, I could see that each of these things actually had ROI in terms of improvement, and that's why it makes sense to explore a multi-agent approach in this space. The final thing is that academic benchmarks are great, but they're not your task; you really should build evals that are tuned to your task.

The last set of key takeaways are the design principles we walked through today. I think this is the money slide: four high-level things. First, always ensure that your users can discover the ideal tasks your multi-agent system is designed for. Provide user-facing observability traces. Ensure that your agents are interruptible, that you can checkpoint and restart them. And ensure that your agents can quantify the risk or cost of actions and delegate to users as needed. Then finally: don't build a whole multi-agent system from scratch just to give a talk. It's fun, but it's a lot of work, and if you ever want to do something like this, consider using a framework to save you a few keystrokes here and there.

On the last slide I have some further reading: a couple of papers we've written on AutoGen Studio, Magentic-One, Magentic-UI, and challenges in human-AI communication. These are all good references; I recommend you take a look. And that's the end of my slides. Thank you so much for listening. I have a book I'm writing; there's a lot more about this, and chapter 3 is really just about design. Take a look if it's helpful for you, and all the code for BlenderLM is also available. Thank you.