Full Workshop: Build Your Own Deep Research Agents - Louis-François Bouchard, Paul Iusztin, Samridhi

Channel: aiDotEngineer

Published at: 2026-04-20

YouTube video id: mYSRn6PC1mc

Source: https://www.youtube.com/watch?v=mYSRn6PC1mc

Hello everyone. Not too loud, I hope. Yeah, I hope it's fine. Perfect. So we will kick off the workshop and introduce ourselves shortly. But first we just wanted to present this slide, because that's basically LinkedIn this year. It's the type of content you get when you ask ChatGPT: a very generic response. And there are a few things wrong with this actual response, or at least things we don't really like seeing. The obvious ones are the slop words, the AI slop, all the "delve" and "intricacies" that we all know about. But there are more, and unfortunately there are even some em dashes in there. There are some other problems too. All right, the lines are a bit off, but the one that is more general, that most people miss, is that it does this often: these are some examples from LinkedIn posts where they all say "most teams", "most people", or here "most AI projects", and I'm actually guilty of that one. So we learn as we go. And there are some other problems with hallucinations, or here, outdated information. Obviously this was generated just last week, and GPT-4 is not state-of-the-art.
And likewise, there are some AI slop phrases that we see a lot: the obvious "rapidly evolving" one, but also the classic "it's not about something, but something else" construction that it generates a lot. And the worst of all is that typically this is really meaningless and shallow. It doesn't provide any value or anything useful.
And so what we need here is proper research and proper writing around these language models.
And to give more context on this workshop: at Towards AI we create courses and tons of videos, training and technical content. To do that, we need to create technical content around AI engineering, which means we need technical writers, AI engineers, writers and editors that iterate together, and a lot of time, to create good storytelling, a good article, a good lesson or a good video, which in the end costs a lot of money. So what we tried to do is automate this process, which we did, or at least a part of it, because another part is really a human thing: making a good story relate to others. We built a system to replace this whole process of doing very thorough research and then technical writing. What it does is: we give it a topic, like "what is harness engineering", and we have our own deep research agent that will search various websites and use tools to do lots of things, and then another system to write an actual technical article with code, images, and ideally no slop in there. And since we build courses, we actually used the system to build a course that teaches how to build this system. So it was a really fun project to do in general, and it allowed us to iterate and test a lot and get user feedback directly from our students. It's really nice, and that's what we try to share here in this two-hour workshop, or at least a much more compact version, with a smaller, simpler deep research system and a writing agent strictly for shorter content, LinkedIn-type posts.
And also we try to share what we learned building it. You can get the GitHub repository; it's public. In the first 30 minutes I will just talk with slides, so you won't be needing it, but after that my colleagues will jump in with the code, so it might be worth having it. We will show this QR code again later on, so you can get it in time. Basically, I will cover what we learned and some terminology, because we are an educational company and that's what we like to do. So I will cover some basics to bring everyone to the same level, or at least to our definitions of things, and share what we learned by building it, and then Sami will take over to talk about the actual deep research agent, and Paul with the rest. So who are we? On my end, I'm the CTO and co-founder of Towards AI. My name is Louis-François, and I was a PhD student in AI until ChatGPT came out, at which point I basically switched to being an educator full-time and created Towards AI to focus on that. I've been making videos explaining research papers for seven years now, and obviously right now I'm more focused on AI engineering; I've been developing in the space since 2020, at the first startup I was hired at. So I will leave Sami to introduce herself.
>> Hi everyone. I'm Samridhi. I'm a machine learning engineer. I'm also a technical writer and a consultant helping Towards AI grow. So nice to meet you all.
Hello everyone. I'm Paul. I've been building software for eight years and I've been in the AI teaching space for four-plus years. I'm also the author of the LLM Engineer's Handbook, a bestseller, and I'm excited to show this workshop to you guys today.
So I'll start by introducing the problem; we always start with a problem, and here it's pretty general: the AI engineering problem space. Basically, all our decisions as AI engineers are governed by constraints that typically apply less to software engineers: the cost per task, which can swing a lot based on the model and the architecture you use; latency requirements, with reasoning models and other things we can control; the quality we obviously have to respect; and, in this case, data privacy that we need to be careful of.
Fortunately, we have a stack to help us, whether it is prompt engineering, context engineering, using tools, orchestration, or building our own evaluation systems. And we like to see that as a sort of autonomy slider that goes from very simple, just prompting a model, to building agentic systems. The more complexity you add, from prompting to more advanced workflows to agentic systems, the more autonomy you add, but also the less control you have over your whole system, and obviously the higher the cost. And to us that's very important to choose, because we also build for clients; that's why we try to teach what we build, this whole loop. To us it's really important to build the right system, where we don't add too much autonomy that can introduce problems and uncertainty. And obviously most people are interested in building agents, but most of the agents that our clients want are actually somewhat super simple workflows, or at least workflows that we can come up with pretty easily.
And so I want to start by defining workflows up to multi-agent systems, finishing with this deep research example that we have made. So what is a workflow? A workflow is very simple. It's just a language model, but augmented, mostly because language models just take in tokens and text and generate tokens and text. You cannot do much with text other than reading and writing emails, and a lot of us do more than that. So ideally you want to add stuff to it: adding data that it may not have access to, whether it's at your company or some other data that it needs to best answer the user. You can give it access to tools so it can do other things than just generating tokens, and you can add a memory so that it can remember the interaction and be much more useful. But still, this is not an agent. It's just something we can build easily that can be reliable, and a workflow can be even more complex than that. You can chain prompts together to reduce the latency and complexity of doing many more things, based on strict conditions. You can even do that with a router that decides automatically, based on conditions, what you want to do, whether it is to use a smaller model or to do a different task and then come back. You can do that in parallel to be more efficient or to improve the results with majority voting and other techniques. And you can even add loops to automatically improve the results based on a judge's feedback.
But all this is still not agentic. It's just a workflow that we can build and add all these things to. We can even combine them. We can build really advanced workflows.
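As a rough illustration (this is not the workshop's code), a judge-feedback loop like the one just described could look something like this, assuming the google-genai SDK and a GEMINI_API_KEY environment variable; the model names are just examples:

```python
# Minimal sketch of a generate-then-judge workflow loop (illustrative only).
import os
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def generate(task: str, feedback: str = "") -> str:
    # Cheaper model drafts; previous judge feedback is appended when we retry.
    prompt = f"{task}\n\nPrevious feedback to address:\n{feedback}" if feedback else task
    return client.models.generate_content(model="gemini-2.5-flash", contents=prompt).text

def judge(task: str, draft: str) -> str:
    # The judge replies 'OK' or lists concrete fixes; a stricter setup would ask for JSON.
    prompt = (f"Task: {task}\n\nDraft:\n{draft}\n\n"
              "Reply 'OK' if the draft is acceptable, otherwise list concrete fixes.")
    return client.models.generate_content(model="gemini-2.5-pro", contents=prompt).text

def judged_workflow(task: str, max_rounds: int = 3) -> str:
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = judge(task, draft)
        if feedback.strip().upper().startswith("OK"):
            break
        draft = generate(task, feedback)
    return draft

print(judged_workflow("Write a 3-sentence summary of what context engineering is."))
```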
So when does that thing become an agent? Here again I'm using an illustration from Anthropic; it's a definition we really like. It's when it needs to take actions and, more importantly, when it can react to what happens in the environment. So very simply, it should be able to take autonomous action and react to what happens, as I said, and ideally be able to plan, to decide which tool to use and, more importantly, which tool not to use, how to respond, and so on. So when should we use this thing, an agentic system, compared to a workflow? Well, typically we always want to use the simplest solution. If just a prompt works, that's ideal. And when we build things, we always try to start with questions and use our sort of autonomy slider.
So, a list of simple questions we can ask. If the model already knows enough about the task to be done, whether it is the user asking questions or some other task, you can obviously just prompt it and ideally add some examples, a few few-shot examples, just because it really helps the model to somewhat adapt on the fly. That would be simple prompting.
Then, if you need external context and it's not too big, you can paste it in the prompt, somewhere under around 200,000 tokens. You can just paste it right away, ideally have it ready before your user asks, and use context caching so that it's more efficient, for example to answer several questions about the same report, or your privacy documents, or whatever internal documents you may have.
Now, if the context is not known until the user asks a question to your model, this is where you may try to inject knowledge on the fly based on the question, whether it is because your data is private and you need to retrieve it to answer the question, or it's something recent that the models haven't been trained on, or domain specific. And if you want to do all this in different ways, or depending on conditions, this is where you use a more advanced workflow with predetermined steps and sequences.
And so in practice, an example of a workflow we built for a client was support ticket handling, where basically you always have the same steps in the same order: the system receives a ticket, classifies it, routes it to the right team, drafts a response, can validate it against the team's policy, and then sends it. And here the key point is that these steps are always the same. The order never changes. It always needs to do these six steps in the same order. So building this as an agent would just add overhead without adding anything extra. You don't need to dynamically react to anything here; you just need to execute these steps. So a workflow definitely makes sense in this case.
For an agent, what you want to ask yourself is whether your system needs to take actions or to branch dynamically, whether that is calling different APIs depending on what the user needs, writing to the database or not, or doing other things. This is where an agent can be useful. An example we had was a CRM platform in Canada that wanted to create a chatbot for their CRM platform, for their users, their clients, to be able to generate marketing content automatically. Initially they reached out to us because they were applying for an AI grant, and they wanted a multi-agent system to do a lot of different things, with agents for everything, which did sound really good for the grant because it was AI-focused, but it seemed really overkill to build many agents to do something as simple as a marketing chatbot. So in the end, what we always do is try to really understand what the client needs and wants, which is oftentimes very different, and by talking with them we figured out that the actual workflow was always somewhat the same, always simple.
The agent needed to make a plan to decide what it should do, then retrieve the data specific to the client, generate the content, validate it, and fix it if needed. So the tasks were always sequential, the same few tasks that we wanted to do. They were tightly coupled, always the same thing about generating marketing content, just for different clients and for different formats, and the whole sequence is context dependent: whether it is the plan, the retrieval, the generation, or the fix, it all needs to have the content in mind to be able to generate the best content possible. So splitting decisions across different agents, as the client wanted, would have caused a lot of issues, whether it is information loss or just tons of errors in tool calls and other things that would reduce reliability. So instead we built just one agent, but used tools as our capabilities. And this is really great, because with tools, they can have their own system prompt, their own validation logic, even their own LLM calls. So you can do tons of things with tools, and in our example we had a validation tool, one tool specific to each format, an SMS tool, an email tool, and the result is that we use tools as specialists, but the global context stays within our only agent, the decision maker and the planner. But if we do that, it means that everything stays within our agent.
So we have a constraint, which is the context window of our agent, of the model we are using, and that context is made of many things: the system prompt, obviously, the instructions that we give it, the tool definitions, the different schemas and how to use these tools, few-shot examples of the task to be done (it's really great to teach by examples; as we've seen, models learn very quickly that way for most tasks), retrieved data depending on what you want to do, and even the entire conversation history if it's a chatbot or whatever type of agent that evolves over time. All this can take up a lot of space, which causes a problem, because as we go through each of these steps the context grows and the performance degrades, which we call context rot. And the problem is that this happens well before the actual context window limit of, say, 1 million tokens. It worsens quite fast after around 200,000 tokens (it keeps changing, but it's around 200,000 now). And this is mostly because of the lost-in-the-middle problem, where we basically teach these long-context models to handle long context by feeding them a large corpus, inserting a random fact in it, and retrieving that fact. So it's basically more retrieval of one specific thing. But it doesn't teach those models to leverage the whole book, the whole context, to answer the question. So it's definitely not ideal, but it would be too expensive to build that kind of large dataset. So this way of training them causes the problem of us having to manage this context budget, which means always keeping the context as lean and relevant as possible, both to reduce cost in terms of token count but also to improve the metrics that we are using. And we can use many techniques for that. We can trim the content. We can summarize it. We can retrieve based on criteria, and there are also many interesting techniques in the Claude Code leak, the compaction methods, that I recommend looking into; really powerful ones. But the one I'm most interested in for this talk is delegation, to either tools or sub-agents with their own context, which is what most harnesses that we use are doing. And so this leads us to multi-agent systems, where we basically want to use them when we have too much of this context, like over 20 tools, or the context just becomes way too large. We can delegate to tools or to agents.
And there can be other reasons, like if you need autonomous decision making, or tool variability, or just security compliance, if you need to have one agent running locally inside one hospital, for example, because I was in the healthcare field before and we used to have everything local. So that can be a reason for multi-agents. And why am I talking about all this? Because what we've seen is that AI products are never just "you build an agent" or "you build a multi-agent CrewAI thing". They basically combine all of that. They combine tools and workflows. Some parts are more specific and predefined; some parts have more flexibility. And we believe that AI engineers need to understand all of these to build complex systems, and deep research is actually a very good example of a complete system, since it integrates all of these techniques.
And before we show how we built our own and let you use it as well, let me just cover quickly what a deep research system is, just in case. It's a reasoning system that will plan what it needs to research. It has some autonomy to decide what to research. It has tools, internet or web access. It's reliable; it will cite its sources. It has feedback loops with itself but also with the human, for human feedback. So this is much more agentic as we see it. It evolves in an environment, which is typically the web or user-provided sources or just API access, and it's goal driven. It has one objective. We don't tell it how to do it exactly; we just tell it to do research about something. So it's typically much more than just a chatbot. It basically replaces someone that would do very thorough research about a specific topic. It plans, it searches, it inspects, it pivots, it synthesizes information, and it can iterate. So deep research systems are, to us, one of the best projects to learn how to build such a complex end-to-end system. So we decided to build our own, and in this case our own is basically an end-to-end technical article production pipeline from a topic, because we create course lessons in our case. So it came out of a real utility, which is the most important part. The goal here is to take a topic, research it deeply, and write a very good technical article with code integration that is actually runnable, with images that are useful, not just randomly generated images throughout.
And there are a few challenges when doing this. First, you need really high precision and recall when doing the research, because you need as many relevant sources as possible, but you don't want too many of them because of the limited context problem. You need to reduce hallucinations and AI slop, obviously, and incorporate a lot of human feedback, especially on the writing side.
And as we always do with projects, and I think you all should do as well, we started by asking ourselves questions to better understand how we should do it, and whether we should do it at all. The first question is: is this worthwhile? Is there a solution outside of what we could build that already exists? Our analysis is that high quality technical content is expensive to produce. We do that every day, and it needs a lot of people, a lot of time and review; it's just really time consuming and really expensive, especially because these people need to be somewhat senior AI engineers to be able to explain anything in AI engineering. So they need expertise, but to teach someone, not to build a successful product. So it's expensive and not that lucrative. So yeah, it would be ideal to automate most of this process.
As I said, the research needs strong recall and precision, and on the other side, the writing needs to be more constrained. It needs to avoid hallucinations, be of high quality, follow your tone, or the writing tone that you want, and the structure that you want. And in terms of deep research tools, we use all of them pretty much daily, and they are way too exhaustive. They gather way too much content and they have a lot of noise. It's not ideal for us in our case. We do use them, but we decided to build our own just to be able to do exactly what we wanted. So the decision is yes, this was worthwhile, but especially as writer augmentation, not to replace writers, because as we've seen, even if you build the best system possible these days, it's really hard to create a piece of content, whether it is a video or a written lesson, that connects with the reader or the viewer. You really need a human touch so that it's relatable to someone. You need to make jokes, good jokes, not the same ones all the time, and ones that are appropriate to the context. So it's really difficult to automate all this, and we want to keep a human in the loop there.
Now, the first actual question we asked ourselves is what the architecture should be: whether it should be an agent, multiple agents, or a workflow. And the analysis here is that, first, we have the research part, which is really exploratory. It needs to find lots of sources. It needs to explore the web to understand what it needs to find if it misses information. So it needs to be able to iterate, to pivot, to search again. It needs a lot of flexibility. And on the other side, the writing agent is much more deterministic. It needs to follow a specific tone, to follow a structure; it needs to be much more constrained than it needs flexibility. You don't want AI slop. You don't want certain words and certain sentences; you want specific sentences and specific formats. So it's much more constrained. So we have a conflict here, where the research needs flexibility but the writing needs constraints. And so the decision is to split into two systems: the research agent, which is more exploratory, dynamic, more agentic, and the writer system, which is more deterministic and consistent. And we have tons of these questions, and we show in our course exactly how to build these two systems. But here it's just two hours, so we will focus on the research agent a bit more, especially for the next few questions.
So the next question: how should the research agent communicate with the writer workflow, the writer step? Well, we pivoted a few times after testing it. And that's the most important part: actually use your product, or have a potential user use your product, to give you feedback as early as possible. On our end, we used it to build a course, so it was really easy to iterate and to teach what we learned and what we did. And what we saw is that our users, which were myself and the team, always used either both agents or didn't use the system at all. We weren't just alternating between the research and then the writer. If we were to alternate, it was just with the writer at the end, to add new suggestions, new topics, new information to the article, or to remove from it. But the research was already quite good and done beforehand, as you will see with Sam. And if a big change was needed, we would just rerun the whole process. So we didn't need to have proper orchestration in place.
And the two agents work on the same artifacts. The research agent produces a research MD file with all its research summarized and sends it to the writer. So they need to have a way to communicate. So the decision we made is to separate the two agents and run them sequentially without orchestration, just a script, a very basic script. And our two agents are in the same project so that it's easier to maintain, but they also share dedicated files, so it's easy to have them communicate with each other, but also with us, through these files. And having them together in the same project is useful for AI-assisted coding with Claude Code or whatever system you're using.
The next question for the research agent is how it should behave. We really wanted to understand what process we would want out of this research assistant before building it, and again we refined this after testing. In the end, we really needed a good human-written guideline, which would basically be just: what the topic is about, anything I'm interested in covering in this lesson, and useful links that we think are relevant. Sometimes I send the AI news newsletter on some specific topics; sometimes it's my own videos or other topics that we covered. And we saw, obviously with some creator bias here, that the more we defined the agent's goal in the guideline, so the less freedom it had, the better the results were. Then, if we give links in that guideline we send to the agent, the agent needs to scrape web pages. It needs to be able to digest YouTube videos, and GitHub repositories, even if they are private to the user, to myself, to gather all the necessary context. And after that it needs to do its actual deep research thing: search the web for anything that is missing and needs understanding, then revise all this context and write it into a very nice final artifact that will be sent to the writer agent.
So how did we do all that? How did we scrape web content from the links we provide and do web searches? Well, as always, we simplified the problem and didn't reinvent the wheel. We used existing systems and APIs, and divided the problem into simpler problems. So first, the scraping solution: we use Firecrawl and agentic scraping, and this is for many reasons; Sam will talk a bit more about the research agent, so I won't cover it too much, but it's really useful to have this kind of scraping for more dynamic websites and other types of websites that can block scraping. For web searches, we used to use Perplexity, and now we use Gemini with grounding to get precise answers with sources directly, which we can then scrape if we need to, but otherwise we can already prompt it well enough to get the information we want from these grounding queries.
Now, for YouTube videos, because there is a lot of great information on YouTube, we ideally want to pass just the text to the agent, because the video itself is quite heavy. We tried a lot of things, but typically you need to download the video and extract the transcript. Fortunately, Gemini now handles this directly with a YouTube URL: you just give it the YouTube URL and ask questions about it, or even ask it to provide the transcript. So it's very useful, and we obviously decided to use that.
And now for the GitHub content, we just used the Gitingest library to get GitHub-specific content into Markdown, and we can provide it with a token to access our private repositories. So it's really simple, and you will see that in the repository.
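As a rough sketch (not the workshop's exact code), the Gitingest call being described might look like this; the repository URL is hypothetical, and the private-repo token option may differ between versions, so check the library's docs:

```python
# Minimal sketch of turning a GitHub repository into Markdown-ish text with gitingest.
from gitingest import ingest

# ingest() returns a summary, a directory tree, and the concatenated file contents.
summary, tree, content = ingest("https://github.com/towardsai/example-repo")  # hypothetical URL

print(summary)
print(tree)
print(content[:500])  # first 500 characters of the repository content
```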
For the framework, we used MCP with FastMCP, so that we can have tools for the different processes and an MCP prompt to basically give the agent a recipe for what to do and how to use the tools, which is very useful. We will talk in depth about that in the next hour and a half. Then for the setup, we basically just use Python with uv for managing the dependencies, and GitHub obviously.
And for the models, we recently switched everything to Gemini, mostly because of the YouTube capability I mentioned, but also because we saw that it works really well and we can alternate between Pro and Flash, the more expensive one and the cheaper one, depending on the complexity of the task. So we use them alternately in the code, and one thing we really like is that it has a free tier for our students, so it's really convenient for us. But obviously we also compared with the other models we were using before, other Claude models and even OpenAI's, and we've seen that Gemini is much more interesting, especially compared to OpenAI these days. So we are using Gemini, but you need to compare different models to be sure you are using the right ones.
And finally, for the more theoretical part, how does the research agent communicate with the user? Inside the MCP prompt we have specific steps to stop and interact with the user, and all these interactions are handled through files, whether it is the guideline, the research file, or the other files that Paul will talk about in the writer part. And by the way, everything I mentioned, and much more, is in a sort of cheat sheet that we also linked in the README of the repository, so you can definitely access it, for free obviously, it's on GitHub. And now is the part where we talk about the deep research, so you can scan the QR code, if you haven't yet, to get the repository, and Sam will take over.
Sorry everyone, just give me one second. Awesome. So, I'll be talking about the deep research agent.
So what is a deep research agent? As Louis already explained, it can research any topic by searching the web, it can analyze YouTube videos, and finally it will give you a cited report. How we've built it: we've used MCP for this, which is an open standard for giving agents access to tools and data. The key idea, the design choice here, is that the agent is the brain that is doing all of the reasoning. As Louis highlighted, any gaps in the research are handled by the agent: it thinks about whether there's anything missing in the research, whether it needs to run multiple research passes, or whether it has to search for something else. That is the part being done by the agent, whereas the server, the MCP, is handling all of the capabilities: it is exposing all of the tools, the resources, and the prompts that are there. We've used something called FastMCP, which is a library that makes creating MCP servers really easy. It hides all of the complexity, and you do not have to write any protocols or any communication between agents; all of that is handled by FastMCP. And we're using Claude Code as our agent harness, but this code is independent of that. If you're more inclined towards using Cursor or GitHub Copilot, you can swap Claude Code for any other agent harness that you like.
So I'll be repeatedly coming back to these three things: an MCP server exposes tools, prompts, and resources. So what are tools? Tools are basically all of the actions an agent can take, like deep research, doing a Google search, analyzing a video and giving us a transcript, or compiling a report. All of these action-oriented things are done by tools. The second thing that the MCP server exposes is prompts, which are basically instructions. When you're talking to ChatGPT you give it instructions; here, these are prompts that the agent can follow, and in our case we have a detailed prompt that I'll show you in a minute with the code. And finally, the third thing is resources, which is the data the agent can read. This is static data: the model names we're using, the versions that we have, any feature flags; those are all part of the resources. These are things the agent can read; there is no action being done here.
In terms of tools, we have three tools. First is the deep research tool. The deep research tool basically does a Google search and gives us sources for everything. So it provides us with answers for any query we have, and also all of the sources for that; we're using the Gemini API for this. The second tool is the analyze YouTube video tool. For this as well we're using the Gemini API, so you can take any YouTube URL, paste it, and it's going to give you a transcript. And finally we have the compile research tool, which basically compiles the output of the previous two tools. We're going to talk more about the memory folder and the details of that when I show you the code.
So I just talked about the parts of the agent that we have, but this is the overall architecture. We are using Claude Code as the MCP client. Just to highlight here: Claude Code, for us, is both the MCP client and the LLM. It is the brain being used, and it is also the MCP client talking to the MCP server that we have. The MCP server has tools, resources, and prompts. These are the details about each of the tools that we have. The first two tools write to a folder called memory; we wanted to preserve all of this because it is easier for logging and it is easier for us to verify all of the results we're getting. So we are writing all of our results to a folder, and the third tool reads from this folder and creates a file for us, a markdown file called research.md. The first two tools make API calls to Gemini 3, and finally everything is read from and written to this particular folder.
So now I'm going to show you the code. Okay, let me just zoom in a bit. Sorry. Okay, perfect. So this is how we have structured our code. You don't have to be intimidated by this; we're going to simplify it a lot. These are the two main folders: the research folder and the writing folder. Paul is going to be talking about the writing part; I'll talk about the research part. So if you remember, when I started my presentation, the first thing I talked about was the MCP server. This is the MCP server file, and this is how you set up the MCP server. We have some settings that we are loading up here; these are basically the names of the models that we have. I can show that to you in a minute. And this is how you set up the server: you give it a name and a version, and then you register three things. This is something you should remember: we're registering tools, resources, and prompts. Then we're just setting up basic logging here, and we're using something called Opik for observability; Paul is going to talk in more detail about that.
And then we finally create and run the MCP server.
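Roughly, the skeleton being described looks like this (a minimal sketch, not the workshop's exact file, which splits registration, logging, and Opik setup across modules):

```python
# Minimal FastMCP server skeleton (illustrative only).
from fastmcp import FastMCP

mcp = FastMCP(name="deep-research")  # server name, as described in the talk

# In the workshop code, tools, resources and prompts are registered from
# separate modules (e.g. register_tools(mcp), register_resources(mcp),
# register_prompts(mcp)), plus logging and Opik observability.

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport, which is what Claude Code uses locally
```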
Now let's see how we have registered the tools. If we go to this file, the Python tools file, this is how we have registered the tools. For registering tools with FastMCP you need three things. The first is, of course, the name of the tool; here the name of the tool is deep research. The second is the arguments. We're passing it two arguments: the first one is the working directory, and the second is the question, or the query, that we're going to be asking it. Then you need a definition of what your tool is going to do; you have to be as precise as possible when you're writing this definition. And finally, we also have the definitions for the arguments. So this is the way you write the tool. Then we just have the code for the observability, and then finally we return the implementation of the tool. So this is it. This is how you define a tool; there are no complications here.
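In FastMCP terms, that registration is roughly the following sketch (the argument names follow the talk; the body is stubbed out and purely illustrative):

```python
# Sketch of registering the deep research tool with FastMCP (body stubbed out).
from fastmcp import FastMCP

mcp = FastMCP(name="deep-research")

@mcp.tool()
def deep_research(working_directory: str, question: str) -> dict:
    """Run a grounded web search for `question` and save the cited answer
    under `working_directory`/memory. Returns status, query, answer and sources."""
    # Real implementation: validate paths, call Gemini with Google Search grounding,
    # write the result to the memory folder, then return a structured summary.
    return {"status": "ok", "query": question, "answer": "...", "sources": []}
```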
But let's go into a little bit more detail about how the tool itself works. For the deep research tool, in the first step we're just validating all of our paths: we're checking whether we are in the current directory or not, and secondly we are ensuring that the memory folder that we're writing to, basically the folder where we write all our results, exists. We do all of this validation here, and then we run this grounded search on the query that we have. Once we get the result of this query, we return the status, whether it was successful or not, what the query was that we sent, and what the answer is that we got. So we get a detailed response, and this has been very helpful for our team to debug and to go through all of the data that we get, so that we can refine the research results we have.
Going into more detail about how we run the research: if you look at this run grounded search function, we have a prompt that we've already written. If you go here, you can see the research prompt. We're basically saying that for any question, you have to provide a detailed, comprehensive answer, focus on only official, authoritative references, make sure you're including all the relevant details possible, and make sure to cite your sources clearly. So this is the prompt that we're using for the deep research tool, and then we're just calling the Gemini API here and getting the answer text and the sources. Then, in this part of the code: every time you get sources from the Gemini API, they're not in a structured form, so here we're just trying to structure them in a better way. I'll show that to you when I run it; you'll see that we get the output in this structure where we get the URL, we get the title, and we get the snippets, and finally we return the research results. So these are all the details about our first tool.
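Roughly, that grounded search looks like the sketch below, assuming the google-genai SDK; the prompt is paraphrased from the talk, the model name is an example, and the grounding-metadata field names are my assumption about the current SDK, so double-check them against the docs:

```python
# Sketch of a grounded Gemini search that returns an answer plus structured sources.
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

RESEARCH_PROMPT = (
    "Provide a detailed, comprehensive answer to the question below. "
    "Focus on official, authoritative references, include all relevant details, "
    "and cite your sources clearly.\n\nQuestion: {question}"
)

def run_grounded_search(question: str) -> dict:
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=RESEARCH_PROMPT.format(question=question),
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())]  # Google Search grounding
        ),
    )
    # Restructure the grounding metadata into a simple list of sources.
    sources = []
    metadata = response.candidates[0].grounding_metadata
    for chunk in metadata.grounding_chunks or []:
        sources.append({"url": chunk.web.uri, "title": chunk.web.title})
    return {"question": question, "answer": response.text, "sources": sources}
```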
Moving on to our second tool, the YouTube video analysis: this has the same structure as the deep research tool that I showed you. Three things: the first is the name of the tool, the second is the arguments it uses (in this case the working directory and the YouTube URL), then the description of the tool, what it is doing, and the descriptions of the arguments, and then we return the execution of the tool. Going into more detail about what the analyze YouTube video tool looks like: here is the implementation. Similar to what we did for the deep research tool, we're validating all of the paths here. Then we're extracting the YouTube video ID using this get video ID function; if we don't get the video ID from the YouTube URL that was provided, we try to sanitize the input and get the video ID. Then this is the line where we're calling the Gemini API, and, similar to the deep research tool, we're returning all of these different details so that we can keep track of things and see what output we got. Now let's go into the analyze YouTube video implementation.
Similar to the deep research tool, we have a prompt for the YouTube transcription, and here is the prompt we have written. It basically guides the Gemini API to give us the transcript in a certain format, with all of the instructions and the details that we want in our output. For this part, we divide the request that we're sending to the Gemini API into two parts: in the first part we send the YouTube URL as a file URI, and in the second part we send the prompt that I showed you. And this is a very cool thing: Gemini is a multimodal model, and when you send a YouTube URL as a file URI it actually sees the video. It goes through it part by part; it doesn't access any sort of transcript. That's why, when we run the YouTube video analyzer tool, you'll see that it takes two to three minutes, because it is actually going through your entire video, so it is really multimodal. Then we send the request to Gemini, and once we get the output we try to clean it into a certain format, output it, and save it in the memory folder.
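As a hedged sketch (again, not the exact workshop code), asking Gemini to watch a YouTube video by passing the URL as file data looks roughly like this with the google-genai SDK; the model name is an example and the prompt is paraphrased:

```python
# Sketch of transcribing a YouTube video by passing its URL as file data to Gemini.
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

TRANSCRIPTION_PROMPT = (
    "Watch this video and produce a detailed transcript in Markdown, "
    "followed by a short summary of the key ideas."
)

def analyze_youtube_video(youtube_url: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=types.Content(parts=[
            types.Part(file_data=types.FileData(file_uri=youtube_url)),  # the video itself
            types.Part(text=TRANSCRIPTION_PROMPT),                        # the instructions
        ]),
    )
    return response.text

print(analyze_youtube_video("https://www.youtube.com/watch?v=mYSRn6PC1mc")[:1000])
```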
Finally, coming back to the final tool, the compile research tool. What it basically does is return and compile all of the results that we get from the previous two tools. If you go here and look at the implementation for the compile research tool, you'll see that what it does is write to the output research MD file, and then it gives us a success status and a return message.
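The compile step is essentially file aggregation; a minimal sketch, assuming the memory folder contains Markdown files written by the other two tools, might look like this:

```python
# Minimal sketch of the compile step: gather everything in memory/ into research.md.
from pathlib import Path

def compile_research(working_directory: str) -> dict:
    memory = Path(working_directory) / "memory"
    output = Path(working_directory) / "research.md"
    sections = [p.read_text() for p in sorted(memory.glob("*.md"))]
    output.write_text("\n\n---\n\n".join(sections))
    return {"status": "success", "message": f"Wrote {len(sections)} sections to {output}"}
```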
So we have seen how we registered the tools in the MCP server. Similarly, we've got the code for registering the resources. Here are all the things we are sharing with the agent as resources: basically, the name of the server, the version we're using, what Gemini model we are using, what our YouTube transcription model is, and all of these different details are provided to the agent through resources.
And then the MCP prompt. This is how we have registered the MCP prompt. We're basically returning something called workflow instructions, and this is the detailed prompt that we have written. In this prompt we are telling the agent: these are the tools available to you, you can use this workflow to use these tools, and here are some additional details like the working directory and where all of the results should be stored. So this is the detailed prompt that we're giving to the agent.
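For reference, registering a resource and a prompt with FastMCP is roughly the following sketch (the resource URI, model names, and prompt wording are illustrative, not copied from the repo):

```python
# Sketch of registering a resource (static data) and the workflow prompt with FastMCP.
from fastmcp import FastMCP

mcp = FastMCP(name="deep-research")

@mcp.resource("resource://server-info")
def server_info() -> dict:
    # Static data only, no actions: server name/version and the Gemini models in use.
    return {"name": "deep-research", "version": "0.1.0",
            "research_model": "gemini-2.5-pro", "transcription_model": "gemini-2.5-flash"}

@mcp.prompt()
def research_workflow() -> str:
    # The recipe the agent follows: which tools exist, in what order,
    # and where (the memory folder) intermediate results are stored.
    return (
        "You have three tools: deep_research, analyze_youtube_video and compile_research. "
        "Gather context from any provided links, fill remaining gaps with deep_research, "
        "write intermediate results to the memory folder, then call compile_research "
        "to produce research.md."
    )
```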
Awesome. So, going back, one more thing I wanted to highlight here is our env file. All of this is accessed as resources, but this is where you'd have to add your Google API key. You can add the Opik API key if you want any observability. So this is the .env file that we have. Additionally, apart from that, we have the scripts here, which help us with the code throughout; these are scripts we've written additionally to manage some of the utils.
So, sorry, going back. Now we have the entire architecture, so how do we actually connect it with our agent harness, in our case Claude Code? This is very straightforward for Claude: you just need to create an MCP JSON file in the project root and give it a command like this. In our case we are running the MCP server locally, not hosting it anywhere; it runs locally and uses standard in/standard out for communication. This works, but if your MCP server is hosted elsewhere, you would need a URL. Notion is a very popular example: Notion has its own hosted MCP server, and you can access it by getting the URL. So that part of the config would change slightly: instead of the uv command, you would have that URL there. And you can connect with any MCP server that is available, Snowflake, Notion, anything that you want. One more thing I would like to highlight here is that, as I said at the beginning of my talk, we're using Claude Code as the agent harness, but you can use anything you want. If you like Copilot, you can use it for this; any agent harness you want to use, you can use it with that. But this setup is for Claude Code.
Okay, so let's see where that file is. You have to create that MCP JSON file in the root directory, and I can explain it to you; it's a very simple thing. Here we have the name of the MCP server, so it's deep research, and this is the command that we have. We're using uv because it simplifies everything: any virtual environment you're creating, any packages you're downloading, it's all handled by uv, so you don't have to do anything. And then we have this command, which is basically running the server file that I showed you a couple of minutes ago. So Claude runs the server file as a subprocess locally; our server is running locally, and this entire command is only doing that. It says run fastmcp run and the path to the server file that we have. And this is the environment file that I told you you're supposed to create.
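For reference, a minimal .mcp.json along the lines being described might look like this; the path to the server file is hypothetical, and the server itself loads the .env file:

```json
{
  "mcpServers": {
    "deep-research": {
      "command": "uv",
      "args": ["run", "fastmcp", "run", "research/mcp_server.py"]
    }
  }
}
```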
Okay, so now it's time to see how the code works.
So I would go to Claude. Sorry. Okay, so if you go to Claude and you run /mcp, you can see that I have two servers here, because of the MCP JSON file that I have; it was automatically able to detect both the agents. We'll just be looking at the deep research agent right now. And these are all the details: you can see the command, you can see the arguments, you can see all of the capabilities that the server is exposing. You can just do view tools, and you can see all of the details about the tool: the tool name, the description that I showed you, all of the arguments, the parameters that we have. Similarly, for all of the other tools you can see all of the details here.
Now I'm going to try to run one of the tools and see how Claude does it. I've already written out this question; this was from a previous AI talk about AI agents. I'm going to tell it to analyze this video for me, and it should automatically be able to detect the tool being exposed by the MCP server. You can see here that it has automatically picked up the tool that we are exposing and it is trying to use it. And on the side, you can see that the memory folder I told you about, where we track all of our results, is being created, and it's going to write the transcript to this file. As I told you, Gemini is watching, or analyzing, this video in real time; that's why it's going to take a couple of minutes for it to produce the transcript.
So basically, every time you ask a question like this, Claude Code tries to see what it can do and what tools are available, and then it sends a request to the MCP server: can you please execute this tool? Once the MCP server executes the tool, it sends the results back to Claude Code, which then further analyzes that result.
So we can wait a minute for that. Sorry. Yeah. So it has created this markdown file. It says that this video is from the AI Engineer Code Summit, it has provided me all of the details, and it's a pretty good, detailed transcript that we can see here, and it also gives its input at the end. Plus, I really like this: we store the whole transcript, but it also gives you a really neat summary at the bottom, where it says what the key ideas shared in the video were, and it says that the transcript is here. So this is a short example of how we can run one of the tools and see whether Claude is able to run it or not. And now I will try to run the entire pipeline.
So, sorry. Okay, I've already written a question for that. I'm telling it to do an end-to-end research. So now I'm trying to use the deep research tool and the YouTube video tool that I have, and trying to see how it runs both of them together. It says that it is trying to do the analyze YouTube video part first.
When you have a big question like this, the way the entire workflow works is that Claude breaks down the question into different parts and then sees which tool it needs. It has an entire reasoning process going on: okay, if these are the tools I have available, I need the analyze YouTube video tool for analyzing the YouTube video, and I need the deep research tool for the first part of the question. So it's going to divide it into two parts and then try to call both tools. Okay, let's see. Sorry, it usually takes some time for it to run.
>> Um, sure.
>> Yeah, I'm actually going to talk about that; that's the next part. I was trying to remove the skill initially; I moved it to a temporary folder so that it doesn't pick up the skill, because it tends to do that as well.
>> Yeah.
>> Not about that. Just the prompt part though.
>> No.
>> Not exactly. I mean, for us, we're using the skill and replacing the prompt part with it, so trying to use that entire thing. I'll show it to you in just a minute; I think that might clarify your question. So here it has analyzed the YouTube video again, but it didn't give me the entire answer for the first part. Yeah, just give me one second. I have to clear Claude's context, because it tends to use whatever I asked it previously, and I also had to delete the memory folder.
So, while this is running and generating research for us, I can talk about the next slide, which is agent skills. I'm sure all of you have heard a lot about agent skills; they're very popular right now. This is basically a compact, concise way of telling an agent about capabilities and workflows: tools do the work, and skills provide the way to do it. We talked about the entire prompt that I showed you, the workflow prompt that we have. In that prompt, we were telling our agent: okay, these are the tools that are available and this is how you should execute them. But instead of writing it in the prompt, we can create a skill out of it. And what is the benefit of doing that? When one thing is already solving the problem, why should we create another one? The reason is that skills have something called progressive disclosure. When you load the skill, as I'm going to show you in a minute, Claude doesn't load the entire thing; it only loads the name and the description of the skill. Once you run a query, it loads the entire thing, and once that query is executed, it wipes it all off again, so it is not going to sit in the context. Whereas when you're using a prompt, everything is going to stay in the context; it's going to clutter the context, and you want to avoid that. So skills are a very clean way of doing that. They are also very sharable, because we internally write a lot of skills for our team and share them with each other. So it's a concise, sharable way, and it's more maintainable, because you can check it into Git; it's your go-to place and you can do the entire thing there.
Okay, so we have the results for this, but before that I would just like to show you how we have written the skills.
So we've written the research skill; this replaces the prompt that we have. Here you have to write the name of the skill that you want, and you need to write the description. Why is this description necessary? Because when your agent is trying to match a query to a skill, it is going to look at this part of the skill, so you need a good description here. This part is called the front matter, which is what gets loaded, and then this is the rest of the skill. In our case, we are reusing the research workflow prompt that we had instead of writing all of the details in here; you can do that, but this is a design choice for us. And this is the rest of the body for this skill.
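As an illustration (not the exact file from the repo), a skill file along the lines being described might look something like this; the name and wording are hypothetical:

```markdown
---
name: research-workflow
description: "Run the deep research workflow: gather context from provided links and YouTube videos, fill gaps with grounded web research, and compile everything into research.md."
---

Load the research workflow prompt exposed by the deep-research MCP server and follow it:
use the analyze YouTube video tool for any video links, the deep research tool for remaining
gaps, write intermediate results to the memory folder, and finish with the compile research tool.
```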
Okay, so how can you see all of those skills, and what do I mean when I say it only loads the front matter? If you go to Claude again and list the skills, you'll be able to see them. These are some of the skills I've already downloaded, and these are the project skills we have created, in particular the research skill I was talking about. Every time you do this, only the front matter is loaded, not the entire description. And if you want to use a skill, you can just invoke it like this, or Claude can automatically pick up the skill when you ask it a question. Now I'm going to try to ask it the same question I was trying to ask with the MCP prompt. Okay, so I'm going to ask it to research what AI agent skills are and then also analyze the YouTube video for me.
Yeah. So it is loading the research workflow prompt, and it is able to do that because our server is running locally; that's why it's able to read the prompt, because they're all in the same folder.
And now it has just started to do the deep research, as you can see here. Oh, sorry. So it is running the deep research here; first it ran the analyze YouTube video tool, and then it is running the deep research tool. And one amazing thing I would like to highlight here is that my initial query was: can you give me details about what agent skills are? And every time, our agent harness identifies the gaps in the output that it gets. So from my initial question it is going to identify all of the gaps that exist and then ask a different question each time to fill those gaps. It is doing all of that reasoning behind the scenes and making sure all of those gaps are filled in the research that we're trying to do.
So yeah, it usually takes around two to three minutes for me to run this entire output. Let's see.
Sure. Yeah, so we built it originally with the MCP portion of it, and now we are transitioning, but we want to keep our original code.
>> I mean, not the server part, but the prompt, essentially, we can replace with skills.
Yeah. But we have a lot of detail and different components within the servers; that's the reason we wanted to maintain that.
>> Complexity.
>> Yeah, yeah. It's more about the complexity in that case. That's why we want to maintain it. So here we can see we've got the output of the video, and now we're just going to wait for it to produce the output MD file.
Perfect. So here you can see my initial question, and it has given me an answer for that; this is the first run. Now it is doing more research, because it gave us an answer but identified some gaps in it. This is the second query it ran, and you can see it is a different question. So it ran two queries, decided that was good enough, and then it says all three research tasks are completed, let me run one more targeted query and then complete everything. That's why we wanted all of the reasoning to be done by the agent, because this is what a human would do: if you're reading an article or doing any sort of research, you try to identify the gaps and fill them. That's what it is trying to do, and that's why we lean on the model's reasoning here. This is the third query it is running, and finally we expect it to create a compiled report any minute now.
Sure.
>> I think to answer that, we would have to look at the observability for that run and identify why it is not seeing the latest information. But we also gave it a video from 2025, so maybe it's because of that; the video I asked it to transcribe is about that topic. To answer it properly, I would go into the observability part of the system and see what is going on there.
Finally it's asking me if it can create a markdown file for me. And here is the research MD file that our final tool creates. It is really detailed: it has all of the comparison tables and all of the details, so it's a really nice way to compile everything we have gotten so far. It also adds the YouTube video transcript at the bottom. So yeah, thank you so much everyone. Now I'll hand it over to Paul.
Awesome. Thank you, Louis and Sami, for those great slides. Just give me a second to set everything up over here. Okay, so now we are moving to part three of this whole system, which is the LinkedIn writing workflow. It basically transforms the research from the deep research agent into polished posts that pass slop detectors, so we can turn all of you into LinkedIn influencers.
As a high-level overview, this is our full system architecture. We already implemented the research agent, which outputs the research MD file. That file is one input to the writing workflow, and on top of that we have the guideline MD file, which is basically a fancy way of modeling the user input to guide the generation process. The final output is, well, the LinkedIn post.
Now let's zoom into the architecture, which is split into three big pieces. The first one is where we build up the context and ultimately end up with a big system prompt. Remember, this is not an agent, it's just a workflow, because in reality, to write a piece of content, regardless of whether it's a LinkedIn post, a video, or an article, you don't really need agents. As Louis said in the beginning, this process is very static; you always go through the same steps, so an agent would just make everything more complicated. Of course, you can put agents in some parts of it, but we won't show that in this workshop. So the first phase is to load the context, which is the guideline MD file (the user input), the research, and then some static files: the profiles and a few-shot examples, which I will dig into in more detail a bit later. We take all of these and build the system prompt. Phase two is to pass everything to the LLM. So far nothing fancy: we basically create this huge system prompt and call the LLM to create the first draft of the post.
The last phase is to apply the evaluator-optimizer loop, which is probably the most interesting part, where we use the LLM in a loop to review and edit the post a couple of times. This is super important, because you can apply it not only to LinkedIn posts but to any type of content: video transcripts, reports (financial, medical), even articles, book chapters, or very detailed lessons as we did in our course, where we needed to handle text snippets, images, code snippets, references and all those little details that LLMs sometimes get right but most often don't. And even if they get them right 80% of the time, it's annoying to manually go and fix the rest; that's the whole point, so all of this is automated end to end. We'll dig into this process in more depth later on. This is one of our examples: the beginning of a LinkedIn post that we asked it to generate. And just for fun, I want to copy this post, the exact same post, and put it into this pretty popular slop score detector, so you can trust that this actually works. Hit analyze, and yay: not slop.
It actually scored lower, less sloppy than a human, as you can see at the bottom. Of course, not all the posts score this well, but most of them are really good and they sound human. But remember, this is LinkedIn language, so we do need to sound a bit like LinkedIn. Ultimately, as you can see, it doesn't have any em dashes or weird adverbs and verbs, it reads very nicely, and the structure is skimmable, as we would like for LinkedIn. You can check the full post at this link in the GitHub repository and also try the slop test yourself using that link.
Okay, so now let's dig deeper into the first phase of this workflow, where our core focus is understanding how we can actually control the generation. Because, as Louis said in the beginning, we don't want to just write, hey, I want a LinkedIn post on topic X. We actually want to dump our ideas, values, and thoughts into a post that actually sounds like us.
So the first trick is to structure this guideline, which is the user input, beyond just "write me a post on X". This guides what we want to write. We created a template for it where we fill in things like the topic we want to address, the angle, some key points, the narrative flow, and so on. This is the only piece that changes from post to post; from the whole system, this is the part we have to write ourselves as humans to get a different post. As I said, it's dynamic and changes every time. You can see an example of it in the repo, along with many others, but I will show them a bit later.
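As a rough illustration of what such a guideline template could contain, here is a minimal sketch; the headings and field names mirror the description above but are not copied from the repo's actual guideline file.

```python
# Hypothetical guideline template: topic, angle, key points, narrative flow,
# plus optional constraints that can override the static profiles.
GUIDELINE_TEMPLATE = """\
# Post guideline

## Topic
{topic}

## Angle
{angle}

## Key points
{key_points}

## Narrative flow
{narrative_flow}

## Constraints (optional, overrides the static profiles)
{constraints}
"""

guideline = GUIDELINE_TEMPLATE.format(
    topic="AI agent skills vs. MCP servers",
    angle="practical trade-offs for distributing business logic",
    key_points="- skills are local and cheap\n- MCP servers are easier to distribute",
    narrative_flow="hook -> personal lesson -> takeaway",
    constraints="max 1,300 characters",
)
```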
The second trick is to add these writing profiles, which basically tell the LLM how to write the post, because in reality you don't want the LLM to just do its own thing; you want to guide it. These are static: with the guideline you define what you want to write, and the profiles define how you want to write it. They are basically the styling layer on top.

We mainly created three profiles, which are themselves markdown files. The structure profile defines things such as how many characters a LinkedIn post has on average and the core structure of a post: the hook, the body, the call to action. The terminology profile defines things such as using the active voice and the AI slop words and expressions that we want to ban. And unfortunately, this is all you can do: you just need to keep track of a huge list of delve, tapestry, vibrant, and all those words we love, and kindly ask the LLM not to use them, until you start using them as a human as well. The last one is the character profile, which is also static but configurable; in our writing workflow we configured it with my biography, so it knows a bit about me, how I like to write, my style, and things like this. That is how you add a bit of personality to it. Again, we put all of these into the system prompt together with the guideline we defined before. These profiles are static, so we define them just once; you can access them under the writing profiles directory and look around, they are markdown files of a few hundred lines each.
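A minimal sketch of how the static profiles could be pulled into the styling layer; the `writing_profiles/` directory and file names are assumptions based on the description above, not necessarily the repo's exact paths.

```python
from pathlib import Path

# Assumed layout: one markdown file per profile, written once and reused for every post.
PROFILE_DIR = Path("writing_profiles")
PROFILE_FILES = ["structure.md", "terminology.md", "character.md"]


def load_profiles(profile_dir: Path = PROFILE_DIR) -> str:
    """Concatenate the static profile files into one 'how to write' block for the system prompt."""
    sections = []
    for name in PROFILE_FILES:
        text = (profile_dir / name).read_text()
        sections.append(f"## {name.removesuffix('.md').title()} profile\n{text}")
    return "\n\n".join(sections)
```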
The last trick is to add few-shot examples. This is nothing fancy, but in reality the hard part is getting them right. It falls under the data collection part of your system; in our particular use case we added three LinkedIn posts from my own writing. The key idea is to use high-quality, representative few-shot examples and make them as varied as possible in things such as topic, length, and structure. One question I think many people are curious about is: why three LinkedIn posts? I know three is a magic number, but in reality I just guessed it. The thing with few-shot examples is that because you always pass them in your system prompt, you want the lowest number possible that gets the job done. People usually suggest three, five, ten, or twenty few-shot examples, but you want to start with a bigger number and trim it down as much as possible until it stops working, and then put the last one back in. The idea is to keep your system prompt and its few-shot examples as small as possible, because you always pass them to the LLM, which translates to more cost, more latency, and even degraded performance as the context grows.
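Putting the three tricks together, here is a sketch of how the first phase could assemble the system prompt, with the few-shot examples kept deliberately small; names like `build_system_prompt` and the prompt wording are illustrative, not the repo's API.

```python
# Illustrative only: build the writer's system prompt from the static profiles,
# the dynamic guideline + research, and a small number of few-shot example posts.
MAX_FEW_SHOT = 3  # start bigger, trim down until quality drops, then add one back


def build_system_prompt(
    profiles: str,
    guideline: str,
    research: str,
    few_shot_posts: list[str],
) -> str:
    examples = "\n\n---\n\n".join(few_shot_posts[:MAX_FEW_SHOT])
    return (
        "You write LinkedIn posts in the author's voice.\n\n"
        f"# Writing profiles (how to write)\n{profiles}\n\n"
        f"# Guideline (what to write)\n{guideline}\n\n"
        f"# Research (facts you may use)\n{research}\n\n"
        f"# Example posts by the author\n{examples}\n"
    )
```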
For that, I made a dataset available in the GitHub repository, where I extracted 20 random LinkedIn posts from my profile that got more traction. I used that as the dataset for this writing workflow and, further down the line, to configure other things. One thing to highlight is that probably the only way to remove this few-shot load from the system prompt is through fine-tuning, but that is often overkill and adds a lot of friction to your tech stack.
The last part of the system is the evaluator-optimizer pattern, which is probably the fanciest piece of all of this. It contains two LLM calls, and this is really important: two different context windows. We have the writer and the reviewer, where the writer first writes the draft and the reviewer, with a completely different context window, takes the draft and reviews it. This is super important to avoid bias, because LLMs tend to be biased toward liking what they have already written, so putting the review into a completely new context window removes that bias. How it works: the reviewer checks the draft's adherence to the guideline (the user input), to the research (basically to remove hallucinations), and to the profiles (structure, terminology, and character). Then the editor, which is often the writer itself, applies the reviews, and we run this loop, in our examples three or four times, to write the post, review it, edit it, optimize it, and so on, similar to a fine-tuning optimization loop. A few tricks we observed over time: keeping all those intermediate versions in the working directory helps, because writing is subjective, and if the reviewer is applied too aggressively you might just not like the result, so you want to look over the last versions and pick the one you like the most. And again, because writing is subjective, the evaluator-optimizer pattern usually works with a score: the reviewer scores the draft and you loop until you reach a specific threshold. But because creative work is subjective, that threshold is very hard to quantify, and the loop becomes very noisy and unreliable. We realized it's just easier to set a fixed number of iterations and let the user run more iterations manually if they want to.
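In code, the fixed-iteration evaluator-optimizer loop could look roughly like this; `write_draft`, `review`, and `edit` are stand-ins for the actual LLM calls, and the points to notice are that the reviewer starts from a fresh context and every version is saved to disk.

```python
from pathlib import Path

NUM_ITERATIONS = 4  # fixed number of passes instead of a noisy score threshold


# Stand-ins for the real LLM calls -- replace with actual model requests.
def write_draft(system_prompt: str) -> str:
    return "draft v0"


def review(post: str) -> list[str]:
    # In the real workflow this is a second LLM call with a *fresh* context window.
    return ["tighten the hook", "remove the banned word 'leverage'"]


def edit(system_prompt: str, post: str, reviews: list[str]) -> str:
    # In the real workflow the editor is the writer itself, applying the reviews.
    return post + " (edited)"


def run_writing_workflow(system_prompt: str, workdir: Path) -> str:
    workdir.mkdir(parents=True, exist_ok=True)
    post = write_draft(system_prompt)
    (workdir / "post_v0.md").write_text(post)
    for i in range(1, NUM_ITERATIONS + 1):
        reviews = review(post)                        # reviewer: separate context, avoids self-preference bias
        post = edit(system_prompt, post, reviews)     # editor applies the reviews
        (workdir / f"post_v{i}.md").write_text(post)  # keep every version; writing is subjective
    return post
```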
Now let's dig a bit deeper into how the reviewer works. As input it gets the current state of the post. As context it gets the guideline, the research, and the profiles, which are what the writer had to look at and the rules the writer needs to follow. The output is a set of Pydantic objects, and I think this is the most interesting part here: you constrain the LLM to output a list of structured objects. This is powerful because Pydantic objects have this property where you can attach a field description to each attribute and actually explain what that field means. It's basically a prompt engineering technique that guides the LLM much better into understanding what each attribute of the returned object needs to contain. For example, our review model has profile, location, and comment attributes, and an example output is: the draft violated the terminology profile in paragraph two, where the LLM used the banned term "leverage". Like this, when the editor gets these reviews, it understands exactly what was violated. From our tests we realized it's a lot more performant to use these structured outputs than to just let the LLM spit out whatever it wants.
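As an illustration of that structured output (field names and allowed values are guessed from the description above, not copied from the repo), a Pydantic sketch:

```python
from typing import Literal

from pydantic import BaseModel, Field


class Review(BaseModel):
    """One issue found by the reviewer; the Field descriptions double as prompt guidance."""

    profile: Literal["guideline", "research", "structure", "terminology", "character"] = Field(
        description="Which rule set the draft violated."
    )
    location: str = Field(description="Where in the post the violation occurs, e.g. 'paragraph 2'.")
    comment: str = Field(description="What is wrong and how to fix it, e.g. used the banned term 'leverage'.")


class ReviewList(BaseModel):
    reviews: list[Review] = Field(description="All issues found in the current draft.")
```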
Again, you can check this code in the corresponding Python file. The code itself is pretty simple, but unfortunately I won't have time to dig too deep into it.
Now on the editor side: it gets this list of Pydantic objects, the current state of the post, and also all the context the writer initially had, because the editor is actually the writer itself. This is similar to how a team of writers and editors works: I write something, I give it to a team of editors, I get the reviews, and I apply them. Another important thing is that reviews are not created equal. People like to give reviews on everything, and this is also true for LLMs. Because we loop a couple of times, we realized it's super important to put a priority on them. We always want to prioritize the guideline first (the user input), then the research, and then the profiles. This is super powerful when reviews clash on the same paragraph or sentence, because it tells the editor what to pick up and what to apply.
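A tiny sketch of that prioritization; the priority map is an assumption based on the ordering described above, and the reviews are shown as plain dicts for brevity.

```python
# Guideline reviews outrank research reviews, which outrank profile reviews, so when
# comments clash on the same paragraph the editor knows which one to apply first.
PRIORITY = {"guideline": 0, "research": 1, "structure": 2, "terminology": 2, "character": 2}


def prioritize(reviews: list[dict]) -> list[dict]:
    """Sort reviewer output (dicts with a 'profile' key) so higher-priority reviews come first."""
    return sorted(reviews, key=lambda r: PRIORITY.get(r["profile"], len(PRIORITY)))


print(prioritize([
    {"profile": "terminology", "comment": "banned word in paragraph 2"},
    {"profile": "guideline", "comment": "missing key point"},
]))
```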
Again, you can check the system prompt in this Python file within the repo.
Here is a concrete example where we can see the post v0, which is what the LLM spits out before the evaluator-optimizer loop, and how it looks after four reviewing iterations. We can see that the text is nicely formatted for LinkedIn, and the first sentence, which is the most important one, is a lot punchier. The same holds for the first part of the post and for the second part: it modified the structure, the wording, everything, to look more like something you would expect on LinkedIn. Again, you can check the whole post and all the versions from zero to four in the examples directory.
And ultimately, let me show you how this directory looks. To test this code you have three levels. I created a simple make command just to check that everything works; nothing crazy here. If this works end to end, it means your code works locally and you can hand it to Claude Code. Then, similar to how the research agent works, we serve everything as an MCP server prompt, and we connected this MCP server to a write-post skill.
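For reference, serving a workflow prompt over MCP with the official Python SDK's FastMCP helper looks roughly like this; the prompt name and wording are illustrative, not the repo's exact server code.

```python
# Hedged sketch using the MCP Python SDK; verify against the SDK version pinned in the repo.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("writing-workflow")


@mcp.prompt()
def write_post(guideline_path: str) -> str:
    """Prompt that kicks off the LinkedIn writing workflow for a given guideline file."""
    return f"Write a LinkedIn post using the guideline at {guideline_path} and its associated research."


if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```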
The thing is that this takes three to four minutes to run, so I already ran it to avoid wasting your time, but here is how you would run it. You take this write-post skill and, just as before, take the same example, copy the relative path, and ask it to use the guideline from this directory. It will then know to pick up the guideline and the research associated with it and write a post, and the output is here.
As you can see, we have the four, actually five, versions of the post, and also the guideline, which I think is the most interesting one. Instead of just kindly asking the LLM to write about something, you need to be very explicit about what you want from it: an angle, the target audience, the key points you actually want to cover, the tone, and even a character-count constraint that overrides the static profiles. We encoded in the system prompt that the guideline can override the default profiles if you want something different. You can go wild here and input anything you want, but I wanted to highlight that you actually need to put in some effort and think it through. You cannot just fully automate everything if you don't want to sound like slop.
Okay.
And on top of this, we actually have a small image generator. We haven't put a lot of effort into it, but this prompt generates the post using the writing workflow and then asks another tool to automatically generate an image for that post.
So now, to move on. Just to wrap up what we built in the writing workflow: we built this pipeline that generates, reviews, and edits the post, which again you can apply to any type of content. We tested it with almost every type of content and it works really well; you just need to adapt the profiles, the examples, and so on. We also learned how to control the generation through the guideline, the profiles, and the few-shot examples (again, for different content types you will need to adapt these), and how to serve everything as an MCP server and skills. And yes, for this local example you could do everything through skills; you don't really need MCP servers to make it work. But from my experience, MCP servers help you distribute logic a lot more easily, because skills are like your local setup, your local hacked setup that works great for you. If you want to distribute business logic, you don't really want to ask someone: hey, download those skills, install these uv dependencies, plug in those CLI tools, oh, and you also need those credentials. It quickly becomes a mess to distribute at scale. MCP servers solve that problem, and skills can help you personalize how you use them. Or, if you want a skill to be more than a prompt and actually run code, that works too, but it can quickly become a mess to set up if it's too complicated.
So now let's move on to part four of this system, which is observability, more exactly monitoring and evals. We will use the writing workflow as the example for evals, and we did monitoring for both. For monitoring we'll go really quickly, because theory-wise there's not much to say. The core problem behind monitoring, which in my opinion exploded when we started to use agents and workflows, is that debugging workflows and agents purely through the logs is hard. I personally can't really understand what's going on inside the terminal, especially with the agent's thinking, which hides a lot of stuff, so it quickly becomes painful to debug what's going on just from the logs. So you need some tool that monitors this nicely and captures all your traces: all the LLM and tool calls, your inputs and outputs, your metadata, everything about your run, plus latency and cost tracking. And as a bonus, it also stores your traces so you can build an AI evals layer on top of everything, which I'll dig into a bit later.
Now let's actually look into our monitoring setup. We used Opik to monitor both our agents, the research one and the writing workflow, and it's a lot easier to understand by looking at it. For monitoring you usually have three big concepts. You have threads, which in our use case is the whole workflow of writing a post end to end, the post itself plus the image. For example, as you can see, this workflow has 44 messages, which capture all the bouncing around between the user and the LLM; you can intuitively see it as a conversation thread, which is why it's called a thread. Then you have traces, where you can dig deeper into what's going on. A thread contains multiple traces, and within a trace you can dig into the details. For example, for a generate-post trace we have the high-level generate-post call, where we can see how long it took to run, how many tokens it consumed, and how much it cost, and then all the tool calls and LLM calls that happened under the hood: the models used, the cost per model, the latency per model, and so on. On the right we can also see the high-level input and output of the system, some metadata, the token usage, and basically everything that happened within that run, which helps you quickly understand what's happening. We have something similar for the research agent as well. Here the thread is easier to understand because, for example, we have this latest thread which captures all the deep research tool calls, the tool call that creates the final report, and all the bouncing around. Then the traces capture just a single tool call each, and as you can see we have just one LLM call per deep research tool call.
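As a rough sketch of how this tracing gets wired in: Opik's Python SDK exposes a tracing decorator, and decorating the tool and LLM functions is what produces the nested traces seen in the dashboard. The function names below are illustrative, and the exact import and signature should be checked against the Opik docs for the version pinned in the repo.

```python
# Hedged sketch: each decorated call becomes a trace/span with inputs, outputs,
# latency, and token usage captured by Opik.
from opik import track


@track
def deep_research(query: str) -> str:
    return f"research results for: {query}"


@track
def generate_post(guideline: str, research: str) -> str:
    return "generated LinkedIn post"
```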
Now let's move on to AI evals. I want to start with why evals matter, because it's not necessarily a cool topic, but it can do a lot for your system. Let's take our writing workflow as the example and say we want to check how well it works. We all start with vibe checking: we generate just one post, we read it, and we say hey, it's cool, it works. Then we have 10 posts; we read them, maybe, maybe not, maybe we read two or three, call it a day, and say it works for the others as well. And then when we want to check how well it works on 100 posts, it quickly becomes impossible. And just imagine you do this at the beginning, but as you evolve your system you start to plug in more features, and every feature can break everything. This is standard practice in software engineering, but here you work with prompts, and just one word somewhere can randomly break all your features if you're not careful, so you need this layer on top of it.
Evals fit into three big layers. The first is optimization, which is very similar to training a model: the AI evals layer lets us quantify how well our writing workflow writes LinkedIn posts, and when we want to improve it, we have a score that tells us whether we're moving in the right direction. On every change we run this evals layer and we know if we did better or worse. The next layer is regression testing, which is similar to tests in classic software engineering: whenever we start working on a new feature, we ensure we don't break current functionality. And layer three is to actually run this in production on live traces, to get warnings, errors, and alarms when users actually use your system.
How this differs from normal unit, integration, and regression tests is that everything starts from the eval dataset. This is a data problem; we're actually building a model here. Here is how we built it for our LinkedIn posts, where we were lucky enough to already have the data. I extracted 20 real posts from my LinkedIn. Why 20? I decided it's a big enough number for a workshop, but in reality you probably need to go to at least 100. Then I reverse engineered the guideline and the research: I took the guideline from the post and ran the deep research agent on top of it to find whatever was needed to support that post. Then I generated the output: I put the guideline and the research in as inputs and generated new posts. This is super important, because you want to see results from your real system. When you generate synthetic data of some sort, never ask the LLM to directly generate the output; always ask it to help you generate the input, never the output. You want the output from your real system, because that's what you're ultimately testing. Then I labeled each output with a binary pass or fail label and a two-to-three-sentence critique on why exactly I gave it that label. Usually I stop at the first error; it's easier for me while labeling, and also easier for the LLM to understand: whenever I see the first failure, I stop there, write why it failed in a few sentences, and move on. Ultimately I split all of this into train, dev, and test splits, because remember, we're building a machine learning model here, so we need to treat it as such and use a classic split. Only when we have this dataset and these splits can we actually build the evaluator and evaluate the evaluator, a last step that people building LLM judges most often think they can just skip, even though it's probably the most important one. Again, the most important and interesting part here is the dataset we created, because creating an LLM judge out of it, once you have the dataset, is very easy.
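To make the dataset shape concrete, here is a minimal sketch of what one labeled sample and the train/dev/test split could look like; the field names and the 10/5/5 split are illustrative, not the repo's exact schema.

```python
import random
from dataclasses import dataclass


@dataclass
class EvalSample:
    guideline: str        # reverse-engineered user input
    research: str         # produced by the deep research agent for that guideline
    generated_post: str   # output of the real writing workflow, never invented directly by an LLM
    label: str            # "pass" or "fail"
    critique: str         # 2-3 sentences explaining the first failure found


def split_dataset(samples: list[EvalSample], seed: int = 42):
    """Classic train/dev/test split, e.g. 10/5/5 for the 20 workshop samples."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return shuffled[:10], shuffled[10:15], shuffled[15:]
```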
Now let's see how this LLM judge works for our scenario. As input it takes the generated post it needs to label, but also the profiles, the research, and the guideline. This is super important because the LLM judge needs to understand the context used to generate the post. It outputs the pass/fail label plus the critique, exactly the same labels as our dataset. Then we pass our train split from the dataset as few-shot examples, and this is the most important part, because the system prompt of the LLM judge can be extremely simple if it has the right few-shot examples in place telling the judge what decisions to take. The few-shot examples look like this: they have the writer input, which is the guideline and the research (the input to our writing workflow), then the writer output, which is the generated post of the workflow, and the labels, the pass or fail label plus the two-to-three-sentence critique. And that's it; you can check out our system prompt in this file, but there's nothing fancy there. The most important part is building the dataset.
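As a sketch of the judge itself (prompt wording and helper names are mine, not the repo's), the important parts are the structured pass/fail output and the few-shot block built from the train split:

```python
from typing import Literal

from pydantic import BaseModel


class JudgeVerdict(BaseModel):
    label: Literal["pass", "fail"]
    critique: str  # 2-3 sentences, same style as the human labels


def build_judge_prompt(train_split: list[dict], post: str, guideline: str, research: str, profiles: str) -> str:
    # Few-shot block: writer input, writer output, and the human label + critique.
    shots = "\n\n".join(
        f"INPUT (guideline + research):\n{ex['guideline']}\n{ex['research']}\n"
        f"OUTPUT (generated post):\n{ex['generated_post']}\n"
        f"LABEL: {ex['label']}\nCRITIQUE: {ex['critique']}"
        for ex in train_split
    )
    return (
        "You judge whether a generated LinkedIn post is acceptable. "
        "Answer with a pass/fail label and a short critique.\n\n"
        f"# Examples\n{shots}\n\n"
        f"# Profiles\n{profiles}\n\n"
        f"# Guideline\n{guideline}\n\n# Research\n{research}\n\n# Post to judge\n{post}"
    )
```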
Our last step is to measure the judge's reliability, because ultimately the LLM judge is just a binary classifier: it outputs a pass or fail label. We used an LLM judge because it's easy, but we could have used any other model that gets the job done and outputs these pass/fail labels. When we measure judge reliability, the final goal is to align the LLM judge with the domain expert, and we do that by testing it against the dev and test dataset splits. This process is very similar to training any other binary classifier; there's nothing new here, just the word "judge" is fancier. What we do is test the judge against the dev and test splits and use the F1 score, which combines precision and recall, to measure how well the LLM judge performs on those splits. The process looks like this: we first run the LLM judge on the dev split, compute the F1 score, and adjust the judge's prompt and examples to maximize that F1 score, because most probably when we first start this process the score will be low and we want to get it as high as possible. We repeat until it converges; usually you need to do this a couple of times. When you think you're done and the LLM judge is ready to go, you run it on the test split as the final validation step. It's basically like training a machine learning model, and then it's ready to run on new data. After we go through these steps, we have an LLM judge that is ready to run on new samples and will output the binary pass or fail plus the critique.
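The calibration check itself is plain binary-classifier evaluation; here is a sketch with scikit-learn (assuming it is available) on hypothetical dev-split labels:

```python
from sklearn.metrics import f1_score

# Human labels from the dev split vs. the LLM judge's predictions on the same samples.
human_labels = ["pass", "fail", "pass", "pass", "fail"]
judge_labels = ["pass", "fail", "pass", "fail", "fail"]

score = f1_score(human_labels, judge_labels, pos_label="fail")
print(f"dev F1: {score:.2f}")
# Adjust the judge's prompt / few-shot examples, rerun, and repeat until the score converges,
# then run once on the held-out test split as the final validation.
```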
To wire together everything from this section: we first build the dataset, then based on that we build the LLM judge, then we calibrate the judge, and only after that do we run it on real data. Usually all these steps are managed by some observability platform. We used Opik here, but you can use any other platform or build something in-house; it doesn't really matter, the idea is that everything should be managed by one cohesive platform.
Now, as a quick demo, let's emulate this calibration step as much as possible, because the LLM judge is already calibrated; I will just show you how the judge performs on the eval splits. Under the hood, with Opik, we built a very simple evaluation harness that lets us run these calibration steps and then run the judge on production traces. The code for our use case is pretty straightforward, so let me just show you around a bit while it runs.
What it does now is take all the generated posts and run the LLM judge on top of them. What I wanted to show you is the F1 score, which we compute from the LLM judge's outputs against our labels from the dataset. It's also interesting to look quickly at how our dataset is structured to get a sense of it. Here we have a single sample within our dataset, and what's interesting is that we have a link to the media of the post, to the guideline, to the generated post, basically links to everything we need as input and output for both the deep research agent and the writing workflow. Depending on the scope, we plug these in as few-shot examples either for the post generator or for the evaluator, or into the dev split, and so on. This is how we make sure we do the splits the right way between the judge and the generator. For simplicity, we have all those files linked together in this YAML file, locally within the GitHub repo, so you can check everything, play with it, and run everything very easily with minimal configuration.
If we go back to the terminal, we saw that it got a perfect score. Now let's run this on the test split; the goal here is to see that our LLM judge did not overfit. We ran it on the dev split and got a perfect score, which is usually a strong signal that it might have overfitted. So unless the F1 score on the test split is also one or around one, it means our LLM judge is overfitted. Again, this is the exact same process as before, but now run on the test split. Let's see the final F1 score, which is again one, which is good. The scores are this good because we do this on only five samples per split; if you expand your splits, and you should have at least 20 to 30 samples, the score probably will not be perfect.
And let me show how these experiments look inside Opik.
>> Yeah, sure.
I think the most important thing is the dynamic between the F1 score on your dev split and on your test split, because whether the model has overfitted comes down to how the two compare. As I said, if it has a very high score on the dev split and a low score on the test split, it means it overfitted. But the score itself is usually very correlated with your own data.
>> Yeah, exactly. The absolute value is usually very correlated with your data.
>> Usually on the dev split you say, okay, this F1 score is good enough for me, and then you run on the test split to check that you get the same F1 score there as well.
Again, with these observability platforms you usually have this experiments tab where you can run experiments, similar to old-school fine-tuning experiment trackers. For example, here we have the dev and test experiments. Let's open the test experiment we just ran. Here we can see the output from the binary LLM judge and our label, and we can see there's a perfect match between the two. If we hover over this, we can also see the critique from the judge. We compute the score only between the labels; the critique is more useful for us, because ultimately the goal is for us as humans to look over these results and understand what's going wrong with the system.
As a final step, I also prepared an online simulation where I simulated some live traces, and the LLM judge runs on top of them and gives you the binary labels and the critique. The thing is, this takes five minutes or even more to run, because it needs to actually generate all the posts and run the LLM judge on top of them, so I'll just show you a result over here that I bundled together as an experiment. This is the online evaluation suite, where you can see, for example, the generated post here on the right and then the results from the binary judge: the label and the critique itself. And of course, this system is not yet perfect; I probably need to refine it more, because here on live traces, as you can see, it kind of failed. To be honest, it just gave us a pass on all the traces, while in reality, since I also have the labels for these simulated scenarios, we can see that it failed on some. I probably need to go back and expand the dev split with new samples; for example, I could take these samples, put them in the dev split, and keep refining it again and again until it actually works.
I'm kind of running out of time, but you also have a skill to run a demo end to end, research plus writing. You can, for example, write a guideline about something you want to post about and give that as input to the skill, which is just a file, and it will know to pick it up, do the research, and then continue to writing the post. You can also observe all of that in Opik, with the threads, traces, and everything else.
And I guess the final step, if you haven't done so yet, is to open up the GitHub repository, run everything yourself, and read the code, because without reading the code you most probably won't really understand what's going on inside. You can do that by accessing this link or scanning the QR code. And whenever you're ready to go deeper into building this type of multi-agent system, we have the agent engineering course that was the inspiration for this workshop, but where the goal is to design and build production-ready AI agents, not just small local systems, with 34 lessons, three end-to-end portfolio projects, a certificate, and a Discord community with access to us. So far it's rated 5/5 by 300-plus students, and if you don't believe us, the first six lessons are free to try out. You can access it using this link or by scanning the QR code. And that's it for the workshop today.