Building Applications with AI Agents — Michael Albada, Microsoft

Channel: aiDotEngineer

Published at: 2025-07-24

YouTube video id: R30col3UPUg

Source: https://www.youtube.com/watch?v=R30col3UPUg

It's a pleasure to be with you today. My name is Michael Albada and I'm a principal applied scientist at Microsoft, and today I'm going to be presenting on building applications with AI agents. Just as a brief bio, I've been at Microsoft for about two years. While there, I've been one of the key contributors to Security Copilot and the recently announced Security Copilot agents, specifically working in the cybersecurity division. Before that I spent four years working on machine learning at Uber, on lots of big geospatial problems, and I was in startups before that. This talk is really a distillation of a 300-page book that I have coming out with O'Reilly. The first seven chapters are already up on early release on the platform, and it is going to print next month. And while I'll be focusing mostly on slides, and I won't get too deep into code for this particular talk, I just want to say that there are full code examples backing everything I'm describing here. So if you want to dive in a little bit deeper, take a look at what this looks like in actual code. So to give
a brief overview of what I'll be covering: I'll talk a little bit about the promise and obstacles that we're seeing so far in agentic development, I'll go through some of the core components that are required to build really effective AI systems that get to production, and then I'll talk through some of the common pitfalls and lessons. There's a tremendous amount of excitement happening here. To take just one data point: if we look at the companies that have been accepted to Y Combinator over the last three years, we've seen a 254% increase in companies that describe themselves as agentic or as building agents. Certainly we're seeing a lot
of increased investment. There's a lot
of excitement here. But I think we're also seeing that these are really hard problems that we're going after. If we look at some of the leading agentic benchmarks from academia, we see that these are hard tasks that require a sequence of multiple tool calls, multiple executions, and operating in really complex environments. If we were to go back five or ten years, we'd be in the single digits on some of these tasks. So the fact that we're getting up into the 50s, 60s, and 70s on some of these is really impressive. And just as Clay described, we want to move to where the puck is moving, and it's actually accelerating in that direction. But if you're going out and trying to build your own agentic system, do not expect perfection by any measure. We're seeing that it's quite easy to get to those initial prototype stages that might get you to 70% accuracy, but it's increasingly challenging to go after that long tail of increasingly complex
scenarios. So just to give a brief definition to start: I'm sure you've heard a few of these at the conference so far, but I'm defining an agent as an entity that can reason, act, communicate, and adapt to solve tasks. We treat the foundation model as the foundation, and then we can add these additional components to increase the performance and effectiveness in different scenarios. There's been a lot of discussion about what constitutes an agentic system, and I think Andrew Ng helped clarify this: it's not a binary distinction, it really is a continuum, a spectrum. What I would add on to that is that there's a second axis we should consider, which is the effectiveness of the system. So I wouldn't think about agency, or the agenticness of your system, as a goal or an end in and of itself. It is a tool to help you solve problems.
A classic example of something that has a very low degree of agency but a really high degree of efficacy is robotic process automation. This is a previous generation of automation that's been incredibly helpful and incredibly useful; it's used by many companies and delivers a lot of economic value. The problem is that those types of automations are fixed. They're brittle. Small changes to the input can cause the entire thing to break, so it requires a lot of manual intervention to keep maintaining and updating them. Part of the promise as we move to these more agentic systems is that they're flexible, they're adaptable, and they unlock this additional capability of adapting to and responding to those changing inputs. But we want to make sure that anytime we're adding agency to our system, as we're moving along to the right, we're staying at that very high level of effectiveness. We want to make sure that any incremental addition of agency maintains that high level of performance. What we don't want to do is end up compromising the degree of effectiveness. And I think there's no shortage of bad chatbots that we've seen shipped by many companies that are relatively low on efficacy and relatively low on agency. But what we also want to avoid is going out over our skis, building agentic systems that are low in efficacy. Those are what I would call the future news stories that all of us want to avoid, so that as we're designing things, we're delivering things that deliver value. So
I'll start with tool use. And I just want to say this is such a powerful idea, and it's really simple in principle. We're working with foundation models. These are autoregressive generative models that are predicting one token at a time. Typically they're predicting natural language, but they can also output function calls. And so, if you are exposing tools and functionality to this language model, in the entity we're calling an agent, all of a sudden that agent can now invoke functions, and we're now exposing the full range of tools that we can expose over APIs. So just think about the incredible functionality that you can expose to this, but also the risk and challenge that comes along with it. It requires a great degree of discernment and responsibility to think about which of those functionalities we're going to expose, and in what way, so that we can deliver value to our customers. And of course this also operates in a loop. We apply a parser to the output text, we invoke the tool, and we get some response back, some observation. That's a way of providing more information to the agent that it can use to solve problems. We then continue this in a loop until we take our final output and we generate that for the customer.
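As a rough illustration of that loop, here is a minimal Python sketch. The `call_model` callable, the JSON tool-call convention, and the `lookup_incident` tool are hypothetical stand-ins for illustration, not the book's actual code:

```python
import json

# Hypothetical tool registry: name -> callable.
TOOLS = {
    "lookup_incident": lambda incident_id: {"id": incident_id, "severity": "high"},
}

def parse_tool_call(text):
    # Assumed convention: the model emits a JSON object like
    # {"name": "lookup_incident", "args": {"incident_id": "4312"}};
    # anything else is treated as the final natural-language answer.
    try:
        call = json.loads(text)
        return call if isinstance(call, dict) and "name" in call else None
    except json.JSONDecodeError:
        return None

def run_agent(user_input, call_model, max_steps=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        output = call_model(messages)            # one completion call
        call = parse_tool_call(output)           # parser over the output text
        if call is None:
            return output                        # final output for the customer
        observation = TOOLS[call["name"]](**call["args"])  # invoke the tool
        # Feed the observation back so the agent can keep reasoning.
        messages.append({"role": "tool", "content": json.dumps(observation)})
    return "Step limit reached; handing off to a human."
```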
As you're thinking about designing and building tools, one really common fallacy I see is to think that there's a one-to-one mapping between your APIs and your tools. If you're working in an organization that has 300 APIs, please do not register 300 tools with your agent. It will get really confused. Something we've seen empirically is that the more tools you expose to an individual completion call, the less accuracy you see overall; there's more semantic collision between those different tools. So if at all possible, reduce the number of tools that you're exposing at a single time, and really try to group them together in logical ways. You want to keep the scope really specific and really clear, with specific names and descriptions. And each tool should really feel like a single human-facing action.
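For example, rather than registering every backend API, you might wrap several of them behind one clearly scoped, human-facing tool. This schema is purely illustrative; the name, description, and fields are hypothetical:

```python
# One human-facing action ("search incidents") that internally fans out to
# several backend APIs, instead of one registered tool per API.
search_incidents = {
    "name": "search_incidents",
    "description": (
        "Search security incidents by keyword, time range, and severity. "
        "Returns up to 20 matching incidents with their IDs and summaries."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text keywords."},
            "severity": {
                "type": "string",
                "enum": ["low", "medium", "high"],
                "description": "Optional severity filter.",
            },
        },
        "required": ["query"],
    },
}
```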
So now that you've exposed this rich functionality and these tools to your agent, you need to think about how you're going to invoke it and what the orchestration pattern is going to look like. I would recommend that you keep it simple. In particular, a huge amount of great work can be done with the standard workflow patterns. If your problem can fall into a single chain, please do that. It will make it easier to measure, it will keep your cost down, it'll keep your reliability up, and it'll allow you to deliver value for customers more easily. You can also apply different types of branching logic, relying on the LLM to choose which path through the tree you might want to go through. I work in cybersecurity, and applying these types of patterns works really nicely for deciding what the severity of an incident is and what additional information we need, and for going through that multi-hop enrichment and reasoning. It's incredibly valuable. Moving through to a full agentic pattern puts more power into the hands of the model: you're relying on it to choose which actions to invoke, and to do that repeatedly in order to solve some work, and it's just harder to measure and harder to get full performance out of. But remember the bitter lesson. If you are getting to a point where the chains and trees you're building out are becoming so complex and so convoluted that they're difficult to maintain, that's probably a sign that you want to move to a more agentic pattern that will be easier to maintain in the long term, and it might be something where you want to start considering some additional fine-tuning of your model.
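Here is a rough sketch of the first two patterns, with `call_model` as a placeholder client and the incident-triage steps invented for illustration:

```python
def triage_chain(alert, call_model):
    # Fixed chain: every alert takes the same three steps, which makes the
    # workflow cheap, reliable, and easy to measure.
    summary = call_model(f"Summarize this alert: {alert}")
    severity = call_model(f"Classify severity as low/medium/high: {summary}")
    return call_model(f"Recommend next steps for a {severity} incident: {summary}")

def triage_branch(alert, call_model):
    # Branching: the LLM chooses the path, but each path is still a fixed chain.
    route = call_model(f"Answer 'phishing' or 'malware' for this alert: {alert}")
    if "phishing" in route.lower():
        return call_model(f"Run phishing enrichment on: {alert}")
    return call_model(f"Run malware enrichment on: {alert}")
```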
The other pattern that we've found to be incredibly useful, and that I think we'll see scale in the future, addresses this: too many teams are relying on the large language model to apply the logic they want applied. If you have some fixed business logic, where you only want to take an action if A, B, and C are all true, what you can do is expose tools to your agent that update each of those states, and apply sanitization and validation on each. You keep your logic in deterministic, fixed business logic that you can maintain over time, and you maintain the state external to the model. That way you can ensure that the correct actions are only taken when those conditions are met.
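A minimal sketch of that pattern, assuming three hypothetical conditions and a single gated action; the names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ApprovalState:
    # The three conditions (A, B, C) live outside the model.
    identity_verified: bool = False
    budget_approved: bool = False
    policy_checked: bool = False

state = ApprovalState()

def set_identity_verified(value) -> str:
    # Tool exposed to the agent; validation happens here, not in the prompt.
    if not isinstance(value, bool):
        raise ValueError("identity_verified must be a boolean")
    state.identity_verified = value
    return "ok"

def take_action() -> str:
    # Deterministic business logic: act only when A, B, and C all hold,
    # no matter what the model generates.
    if state.identity_verified and state.budget_approved and state.policy_checked:
        return "action executed"
    return "preconditions not met; action blocked"
```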
There's also been a real interest in moving from single-agent to multi-agent systems. The best reason I know to break down a single-agent system into a multi-agent system is exactly the problem I was describing earlier with tool calls. If you just start dumping too many tools into a single prompt, it'll get overwhelmed. But if you can break those larger groups of tools down into semantically similar groups, register each group with an individual agent, and then rely on a coordinator to route each task to the appropriate agent, that's a great way to continue to scale as your number of tools grows, so you can handle a wider variety of scenarios.
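A toy sketch of that coordinator pattern; the two specialist agents and the routing prompt are made up for illustration:

```python
def enrichment_agent(task):
    # Stand-in for an agent registered with enrichment tools.
    return f"[enrichment agent] handled: {task}"

def response_agent(task):
    # Stand-in for an agent registered with containment/remediation tools.
    return f"[response agent] handled: {task}"

AGENTS = {"enrichment": enrichment_agent, "response": response_agent}

def coordinate(task, call_model):
    # The coordinator routes each task to the agent whose tool group fits.
    choice = call_model(
        f"Which specialist should handle this, 'enrichment' or 'response'? {task}"
    ).strip().lower()
    agent = AGENTS.get(choice, enrichment_agent)  # safe default on a bad route
    return agent(task)
```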
There's also been a lot of talk about the agent-to-agent protocol. It's a really exciting future direction. Most of the successful multi-agent systems that I've seen have been built by an individual team that is able to coordinate everything itself. What the agent-to-agent protocol is really reaching towards is a future where different teams building different agents are able to discover each other, coordinate, and work together. I think we will see more of that, but it's really early days, and there are plenty of additional technical and security questions to work through.
This brings us to evaluation.
My general recommendation, both for my team and I think for just about everyone, is to invest more in evaluation. It's gotten so easy to build, and it is very easy to get to your 70% or 80% accuracy. But there are so many hyperparameters to choose. How many agents? How many tools do you expose? Which model do you use? What type of memory do I want to use? All of those questions are almost impossible to answer without a high-quality, rigorous evaluation set. So I really encourage everyone to focus more time on this than you think you need. Really, it's moving us towards a type of test-driven development with agents. Your agent then becomes defined in terms of the inputs and the outputs that you're expecting, and there's a whole range of tools you can use to automatically improve relative to that. Labeling, I think, is a bit of a bad word from all of our time in machine learning; it's the thing that we outsourced to Mechanical Turk. I think that is no longer something we can do, and that it's really the AI architects and AI engineers who are building agents. You need to take more ownership and responsibility for exactly what you want your agent to do. Spending time defining those inputs and outputs can really help accelerate your team in answering all of those hard questions as you move forward. And the ground is changing under us in terms of models and frameworks.
And so you take those user inputs, you run them through your agent, and you get your outputs. There needs to be some amount of human review, and then you take those new additions and add them to your evaluation set. Once you have this evaluation set, you can run it through your agent and go through the evaluation loop: you analyze your failures, you probably want to do some type of clustering and summarization on those outputs, and you can suggest improvements. And if this looks like a lot of work, fortunately there are fantastic tools to help you with this, and it's not as hard as it looks. A couple of incredible open source libraries I recommend: IntellAgent, which is great at generating additional synthetic inputs. Let's say you don't necessarily have access to your raw user data for security or privacy reasons, or let's say you're building something that hasn't shipped yet; synthetic data can get you a long way. Microsoft has also open sourced PyRIT, which is great for red-teaming agents. It's a fantastic idea to run it before you ship your agents: it will try to jailbreak, red-team, and otherwise compromise your agent. A great strategy to take. Label Studio is a great framework to help you build up these evaluation sets.
And then there's this whole rich set of tools for automatic prompt optimization and automatic prompt engineering: Trace, TextGrad, DSPy. All of these allow you to set hyperparameters and calculate gradients using a foundation model as a judge, so the system can look at your failures and automatically suggest changes back through your flows to improve it. Instead of manually looking at examples, saying "well, maybe this thing will work," and running it through (there's a lot of development by anecdote happening right now), I encourage all of us to build up these evaluation sets, run them in batch, and start analyzing at the aggregate level. That allows us to make more intelligent steps. To take the parallel over to optimizing a neural network: slightly larger batches will help us take more accurate steps towards the global minimum.
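As one sketch of that aggregate analysis, here's a way to cluster failures before summarizing them; the `embed` function is assumed to be any sentence-embedding model you have on hand, and scikit-learn's KMeans is just one reasonable choice:

```python
from collections import Counter

from sklearn.cluster import KMeans

def cluster_failures(failure_texts, embed, k=5):
    # Embed each failing transcript, cluster, and report the biggest
    # failure categories first, so fixes target root causes, not anecdotes.
    vectors = [embed(text) for text in failure_texts]
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(vectors)
    return Counter(labels).most_common()
```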
This brings us to observability. I think the iceberg is a fantastic metaphor here. We're working with generative models; they're very good at generating content. That's a very good thing, but it's also a real challenge for us as we think about evaluating these systems at scale. As soon as you deploy these and get them into the hands of customers, it becomes really hard to understand what's actually happening out there, and to understand the full range of failures and use cases. So I encourage you to use tools like OpenLLMetry and OpenTelemetry integrations. You really want detailed logs and tracing, and probably some way of doing additional clustering and automated summarization to understand the main categories of failure modes, so that you can optimize and improve your system more easily.
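A minimal tracing sketch using the OpenTelemetry Python API, so that every tool call leaves a searchable span; the attribute names here are illustrative, not a standard:

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

def traced_tool_call(tool_name, args, tool):
    # Wrap every tool invocation in a span with enough attributes to
    # cluster and summarize failures later.
    with tracer.start_as_current_span("tool_call") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.args", str(args))
        result = tool(**args)
        span.set_attribute("tool.result_chars", len(str(result)))
        return result
```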
And now this brings us to just a few of the common pitfalls that we've seen, both internally and also speaking with folks outside of Microsoft. Insufficient evaluation is far and away the biggest limitation and challenge that I see. But also on the tool side: maybe you haven't built enough tools, maybe the descriptions are not sufficiently accurate or clear, or maybe there's too high a degree of semantic overlap between your tools, so individual completion calls are getting confused between them and leading to worse outcomes than you suspect. And then excessive complexity. There are so many bells and whistles these days, and it's very easy to go chasing them. I just encourage us all to stay really focused on the principles, really focused on what we're trying to achieve, and only add additional complexity if we've actually tested it and made sure it's actually providing a better experience for our users and customers. And then there's lack of learning. Tightening up the learning loop is really challenging; all of this content makes it hard to sift through. So really focus on getting down to those root causes and suggesting improvements that will result in a better system.
And then the final thing I'll add, coming from the cybersecurity division: this is such an exciting time for this technology, and I think it's going to help us in so many ways. But agentic systems are a new class of potential vulnerability. So I just encourage all of us to really design for safety at every layer. PyRIT can definitely help on many layers of this, but good software engineering and good principles are really critical. And make sure that you're building trip wires and detectors at different stages of your agentic stack, so that you can eject out and fall back to human review in all of the critical cases.
So this brings me to the end of my talk, and I'll just close with a quote that I love from Paul Krugman: "Productivity isn't everything, but in the long run it is almost everything. A country's ability to improve its standard of living over time depends almost entirely on its ability to raise its output per worker."
I really think we're at the beginning of
an upshift in the amount of work that
every single one of us can accomplish.
And I think this new design pattern
for agents is going to help each of us
accomplish more. And I'm really excited
about what we're going to be able to do
together. Thank you so much.