Full Workshop: Build Your Own Deep Research Agents - Louis-François Bouchard, Paul Iusztin, Samridhi
Channel: aiDotEngineer
Published at: 2026-04-20
YouTube video id: mYSRn6PC1mc
Source: https://www.youtube.com/watch?v=mYSRn6PC1mc
Hello everyone. Not too loud. I hope it's fine. Perfect. So yeah, we will kick off the workshop and introduce ourselves shortly. But first we just wanted to present this slide, because that's basically LinkedIn this year. It's the type of content you get when you ask ChatGPT: a very generic response, and there are a few things wrong with it, or at least things we don't really like seeing. The obvious ones are the slop words, the AI slop, the "delve" and "intricacies" that we all know about. But there are more, and fortunately there aren't even any em dashes in there. There are other problems too. The lines are a bit off, but the more general one is the "most companies miss" or "most people miss" pattern; it does that often. These are some examples from LinkedIn posts where they all say "most teams", "most people", or here "most AI projects", and I'm actually guilty of that one. So we learn as we do. There are also problems with hallucinations, or here, outdated information. Obviously this was generated just last week, and GPT-4 is not state-of-the-art. Likewise, there are AI slop phrases that we see a lot: the obvious "rapidly evolving" one, but also the classic "it's not about something, but something else" that it generates a lot. And the worst of all is that typically this is really meaningless and shallow; it doesn't provide any value or anything useful. So what we need here is proper research and proper writing around these language models. To give more context for this workshop: at Towards AI we create courses and tons of videos, training and technical content. To do that, we need to create technical content around AI engineering, which means we need technical writers, AI engineers and editors that iterate together, and a lot of time, to create good storytelling, a good article, a good lesson or a good video in general, which in the end costs a lot of money. So what we tried to do is automate this process, which we did, or at least part of it, because another part is really a human thing: making a good story that relates to others. We built a system to replace this whole process of doing a very thorough research and then the technical writing. We give it a topic, like "what is harness engineering", and we have our own deep research agent that will search various websites and use tools to do lots of things, and then another system to write an actual technical article with code, images, and ideally non-slop writing in there. And since we build courses, we actually used this system to build a course that teaches how to build this system. So it was a really fun project to do in general, and it allowed us to iterate, test a lot, and get user feedback directly from our students. That's what we try to share here in this two-hour workshop, or at least a much more compact version of it, with a smaller, simpler deep research system and a writing agent strictly for shorter content, LinkedIn-type posts. We also try to share what we learned building it. And you can get the GitHub repository; it's public. In the first 30 minutes I will just talk with slides, so you won't need it, but after that my colleagues will jump in with the code, so it might be worth having it.
And we will show this QR code again later on, so you can get it in time. But basically I will cover what we learned and some terminology, because we are an educational company and that's what we like to do, so I will cover some basics to bring everyone to the same level, or at least to our definitions of things, and share what we learned by building it. Then Sami will take over to talk about the actual deep research agent, and Paul will cover the rest. So, who are we? On my end, I'm the CTO and co-founder of Towards AI. My name is Louis-François, and I was a PhD student in AI until ChatGPT came out, when I basically switched to being an educator full-time and creating Towards AI to focus on that. I've been making videos explaining research papers for seven years now, and obviously right now I'm more around AI engineering; I've been developing in this space since 2020, at the first startup I was hired at. So I will leave Sami to introduce herself. >> Hi everyone. I'm Samridhi. I am a machine learning engineer, I'm also a technical writer, and a consultant helping Towards AI grow. So nice to meet you all. >> Hello everyone. I'm Paul. I've been building software for eight years, I've been in the AI teaching space for four-plus years, and I'm also the author of the LLM Engineer's Handbook, a bestseller, and I'm excited to give this workshop to you today. >> So I'll start with the problem; we always start with a problem, and here it's pretty general: the AI engineering problem space. Basically, all our decisions as AI engineers are governed by constraints that typically apply less to software engineers: the cost per task, which can swing a lot based on the model and the architecture you use; latency requirements, with reasoning models and other things we can control; obviously some quality bar to respect; and data privacy, which we need to be careful about. Fortunately, we have a stack to help us, whether it is prompt engineering, context engineering, using tools, orchestration, or building our own evaluation systems. And we like to see that as a sort of autonomy slider that goes from very simple, just prompting a model, to building agentic systems. The more complexity you add, from prompting to more advanced workflows to agentic systems, the more autonomy you add, but also the less control you have over your whole system, and obviously the higher the cost. To us that choice is very important, because we also build for clients, and that's why we try to teach what we build, this whole loop. It's really important to build the right system, where we don't add too much autonomy that can introduce problems and uncertainty. Obviously most people are interested in building agents, but most of the agents our clients want are actually somewhat super simple workflows, or at least workflows that we can come up with pretty easily. So I want to start by defining workflows up to multi-agent systems, finishing with this deep research example that we have made. So what is a workflow? A workflow is very simple. It's just a language model, but augmented, because language models only take in tokens and text and generate tokens and text. You cannot do much with text other than reading and writing emails, which is what a lot of us do, though some of us do more than that.
So ideally you want to add stuff to it, like adding data it may not have access to, whether it's data at your company or some other data it needs to best answer the user. You can give it access to tools to do other things than just generating tokens, and you can add memory so that it can remember the interaction and be much more useful. But still, this is not an agent. It's just a thing we can build easily that can be reliable, and a workflow can be even more complex than that. You can chain prompts together to reduce latency and complexity and do many more things based on strict conditions. You can even do that with a router that decides automatically, based on conditions, what you want to do, whether it is to use a smaller model or just do a different task and then come back. You can do that in parallel to be more efficient, or to improve the results with majority voting and other techniques. And you can even add loops to automatically improve the results based on a judge's feedback. But all this is still not agentic. It's just a workflow where we can add all these things, and we can even combine them; we can build really advanced workflows. So when does that thing become an agent? Here again I'm using an illustration from Anthropic; it's a definition we really like. It becomes an agent when it needs to take actions and, more importantly, when it can react to what happens in the environment. So very simply, it should be able to take autonomous action, react to what happens, as I said, and ideally be able to plan: to decide which tool to use, and more importantly which tool not to use, how to respond, and so on. So when should we use an agentic system compared to a workflow? Typically we always want to use the simplest solution. If just a prompt works, that's ideal. And when we build things, we always try to start with questions and use our autonomy slider. A list of simple questions we can ask would be: does the model already know enough about the task to be done, whether it is the user asking a question or some other task? If so, you can just prompt it and do it, and ideally add some examples, a few few-shot examples, just because it really helps the model adapt somewhat on the fly; that would be simple prompting. Then, if you need external context, and it's not too big and you can paste it in the prompt, somewhere under around 200,000 tokens, you can just paste it right away, ideally have it ready before your user asks, and use context caching so that it's more efficient, for example to answer several questions about the same report, your privacy documents, or whatever internal documents you may have. Now, if the context is not known until the user asks your model a question, this is where you may try to inject knowledge on the fly based on the question, whether because your data is private and you need to retrieve it to answer, or because it's something recent that the models haven't been trained on, or domain specific. And if you want to do all this in different ways, or depending on conditions, this is where you use a more advanced workflow with predetermined steps and sequences. In practice, an example of a workflow we built for a client was support ticket handling, where basically you always have the same steps in the same order.
The system receives a ticket, it classifies it, routes it to the right team, drafts a response, validates it against the team's policy, and then sends it. The key point here is that these steps are always the same. The order never changes. It always needs to do these six steps in the same order. So building this as an agent would just add overhead without adding anything extra. You don't need to dynamically react to anything here; you just need to execute these steps. So a workflow definitely makes sense in this case. For an agent, what you want to ask yourself is whether your system needs to take actions or to branch dynamically, whether it is to call different APIs depending on what the user needs, to write to the database or not, and other things like that. This is where an agent can be useful. An example we had was a CRM platform in Canada that wanted to create a chatbot for their CRM platform, so that their users, their clients, could generate marketing content automatically. Initially they reached out to us because they were applying for an AI grant and they wanted a multi-agent system to do a lot of different things, with agents for everything, which sounded really good for the grant because it was very AI-focused, but it seemed really overkill to make many agents for something as simple as a marketing chatbot. So in the end, what we always do is try to really understand what the client needs and wants, which are often very different things, and by talking with them we figured out that the actual workflow was always somewhat the same, always simple. The agent needed to make a plan to decide what it should do, then retrieve the data specific to the client, generate the content, validate it, and fix it if needed. So the tasks were always sequential, the same few tasks that we wanted to do. They were tightly coupled, always the same thing about generating marketing content, just for different clients and different formats, and the whole sequence is context dependent: whether it is the plan, the retrieval, the generation, or the fix, it all needs to have the content in mind to generate the best content possible. So splitting decisions across different agents, as the client wanted, would have caused a lot of issues, whether it is information loss or just errors with tool calls and other things that would reduce reliability. So instead we built just one agent, but used tools as our capabilities, and this is really great, because tools can have their own system prompt, their own validation logic, even their own LLM or LLM calls. You can do tons of things with tools, and in our example we had a validation tool and one tool specific to each format, an SMS tool or an email tool. The result is that we use tools as specialists, but the global context stays within our only agent, the decision maker and the planner. But if we do that, it means that everything stays within our agent, so we have a constraint, which is the context window of our agent, of the model that we are using, and that window is made of many things: the system prompt, obviously, the instructions we give it, the tool definitions, the different schemas and how to use these tools, and few-shot examples of the task to be done.
It's really great, obviously, to teach by examples; as we've seen with models, they learn very quickly that way for most tasks. You can have retrieved data depending on what you want to do, and even the entire conversation history if it's a chatbot or any type of agent that evolves over time. All of this can take up a lot of space, which causes a problem: as we go through each of these steps, the context grows and the performance degrades, which we call context rot. And the problem is that this happens well before the actual context window limit of, say, 1 million tokens. It worsens quite fast after around 200,000 tokens; it keeps changing, but it's around 200,000 now. This is mostly because of the lost-in-the-middle problem, where we basically teach these long-context models to handle long context by feeding them a large corpus, inserting a random fact in it, and retrieving that fact. So it's basically retrieval of one specific thing, but it doesn't teach those models to leverage the whole book, the whole context, to answer the question. So it's definitely not ideal, but it would be too expensive to build the kind of large dataset that would. This way of training them causes the problem of us having to manage this context budget, which means always keeping the context as lean and relevant as possible, both to reduce cost in terms of token count and to improve the metrics we are using. We can use many techniques for that. We can trim the content, summarize it, retrieve based on criteria, and there are also many interesting techniques in the Claude Code leaks, the compaction methods, that I recommend looking into; really powerful ones. But the one I'm most interested in for this talk is delegation, to either tools or sub-agents with their own context, which is what most harnesses we use are doing. And this leads us to multi-agent systems, which we basically want to use when we have too much of this context, say over 20 tools, or the context just becomes way too large; we can delegate to tools or to agents. There can be other reasons too, like if you need autonomous decision making, or tool variability, or just security compliance, if you need one agent running locally inside one hospital, for example; I was in the healthcare field before and we used to have everything local. So that can be a reason for multi-agents. And why am I talking about all this? Because what we've seen is that AI products are never just "you build an agent" or "you build a multi-agent CrewAI thing". They basically combine all of that. They combine tools and workflows; some parts are more specific and predefined, some parts have more flexibility. And we believe that AI engineers need to understand all of these to build complex systems, and deep research actually implements that. It's a very good example of a complete system, since it integrates all of these techniques. And before we show how we built our own and let you use it as well, let me cover quickly what a deep research system is, just in case. It's a reasoning system that will plan what it needs to research. It has some autonomy to decide what to research. It has tools, internet or web access. It's reliable; it will cite its sources. It has feedback loops with itself, but also with the human, for human feedback.
So this is much more agentic as we see it. It evolves in an environment, which is typically the web, user-provided sources, or just API access, and it's goal driven. It has one objective; we don't tell it how to do it exactly, we just tell it to do research about something. So it's typically much more than just a chatbot. It basically replaces someone who would do a very thorough research on a specific topic. It plans, it searches, it inspects, it pivots, it synthesizes information, and it can iterate. So deep research systems are, to us, one of the best projects to learn how to build such a complex end-to-end system. So we decided to build our own, and in this case our own is basically end-to-end technical article production from a topic, because we create course lessons. So it came out of a real need, which is the most important part. The goal here is to take a topic, research it deeply, and write a very good technical article with code integration that is actually runnable, and with images that are useful, not just randomly generated images throughout. And there are a few challenges when doing this. First, you need really high precision and recall when doing the research, because you need as many relevant sources as possible, but not too many of them, because of the limited context problem. You need to reduce hallucinations and AI slop, obviously, and incorporate a lot of human feedback, especially on the writing side. And as we always do with projects, and I think you all should do as well, we started by asking ourselves questions to better understand how we should do it, and whether we should do it at all. The first question is: is this worthwhile? Is there a solution outside of what we could build that already exists? Our analysis is that high quality technical content is expensive to produce. We do that every day, and it needs a lot of people, a lot of time and review; it's really time consuming and really expensive, especially because these people need to be somewhat senior AI engineers to be able to explain anything in AI engineering. So they need expertise, but to teach someone, not to build a successful product. So it's expensive and not that lucrative, and it would be ideal to automate most of this process. As I said, the research needs strong recall and precision, and on the other side the writing needs to be more constrained: it needs to avoid hallucinations, be of high quality, and follow the tone and structure that you want. In terms of existing deep research tools, we use all of them pretty much daily, and they are way too exhaustive. They gather way too much content and they have a lot of noise. It's not ideal for our case. We do use them, but we decided to build our own to be able to do exactly what we wanted. So the decision was yes, this was worthwhile, but especially as writer augmentation, not as a replacement, because as we've seen, even if you build the best system possible these days, it's really hard to create a piece of content, whether it is a video or a written lesson, that connects with the reader or the viewer. You really need a human touch for it to be relatable to someone. You need to make jokes, good jokes, not the same ones all the time, and ones that are appropriate to the context. So it's really difficult to automate all this, and we want to keep a human in the loop there.
Now, the first actual question we asked ourselves is: what should the architecture be? Should it be an agent, multiple agents, a workflow? The analysis here is that, first, the research part is really exploratory. It needs to find lots of sources. It needs to explore the web, understand what it needs to find, and notice if it misses information. So it needs to be able to iterate, to pivot, to search again. It needs a lot of flexibility. On the other side, the writing part is much more deterministic. It needs to follow a specific tone and a specific structure; it needs constraints much more than it needs flexibility. You don't want this AI slop; you want to ban specific words and specific sentences, and you want specific phrasing and specific formats. So it's much more constrained. So we have a conflict here, where the research needs flexibility but the writing needs constraints. And so the decision was to split it into two systems: the research agent, which is more exploratory, dynamic, more agentic, and the writer system, which is more deterministic and consistent. We have tons of these questions, and we show in our course exactly how to build these two systems, but here it's just two hours, so we focus on the research agent a bit more, especially for the next few questions. So, how should the research agent communicate with the writer workflow, the writer step? Well, we pivoted a few times after testing it. And that's the most important part: to actually use your product, or have a potential user use your product, to get feedback as early as possible. On our end, we used it to build a course about it, so it was really easy to iterate and to teach what we learned and what we did. And what we saw is that our users, which were myself and the team, always used either both agents or neither; we weren't alternating between the research and then the writer. If we did alternate, it was just with the writer at the end, to add new suggestions, new topics, new information to the article, or to remove something from it. But the research was already pretty much perfect and done beforehand, as you will see with Sami. And if a big change was needed, we would just rerun the process. So we didn't need proper orchestration in place. And the two agents work on the same artifacts: the research agent produces a research MD file with all its research summarized and hands it to the writer, so they need ways to communicate together. So the decision we made was to separate the two agents and run them sequentially without orchestration, just a script, a very basic script. Our two agents live in the same project so that it's easier to maintain, but they also share dedicated files, so it's easy to have them communicate with each other, and with us, through these files. Having them together in the same project is also useful for AI-assisted coding with Claude Code or whatever system you're using. The next question for the research agent is: how should it behave?
So we really wanted to understand what process we would want out of this research assistant before building it, and again we refined this after testing. In the end, we really needed a good human-written guideline, which is basically: what is the topic about, anything I'm interested in covering in this lesson, and useful links that we think are relevant. Sometimes I send the AI news newsletter on some specific topics, sometimes it's my own videos or other topics that we covered, so there is obviously a creator bias here, but we saw that the more we define the agent's goal in the guideline, so the less freedom it has, the better the results were. Then, if we give links in the guideline we send to the agent, the agent needs to scrape web pages, it needs to be able to digest YouTube videos and GitHub repositories, even if they are private to the user, to myself, to gather all the necessary context. After that, it needs to do its actual deep research thing: search the web for anything that is missing and needs understanding, then revise all this context and write it into a very nice final artifact that gets sent to the writer agent. So how did we do all that? How did we scrape the web content we give it links to, and do web searches? Well, as always, we simplify the problem and don't reinvent the wheel. We use existing systems and APIs and divide the problem into simpler problems. First, for the scraping solution, we use Firecrawl and agentic scraping, and this is for many reasons; Sami will talk a bit more about the research agent, so I won't cover it too much, but it's really useful to have this kind of scraping for more dynamic websites and other types of websites that can block scraping. For web searches we used to use Perplexity, and now Gemini with grounding, to get precise answers with sources directly, which we can then scrape if we need to, but otherwise we can already prompt it well enough to get the information we want from these grounding queries. Now for YouTube videos, because there is a lot of great information on YouTube, we ideally want to pass just the text to the agent, because the video itself is quite heavy. We tried a lot of things, but typically you need to download the video and extract the transcript. Fortunately, Gemini now handles this directly with a YouTube URL: you just give it the URL and ask questions about it, or even ask it to provide the transcript. It's very useful, so we decided to use that, obviously. And for the GitHub content, we just used the Gitingest library to get GitHub-specific content into Markdown, and we can provide it with a token to access our private repositories, so it's really simple, as you will see in the repository. For the framework, we used MCP with FastMCP, so that we can have tools for the different processes and an MCP prompt that basically gives the agent a recipe for what to do and how to use the tools, which is very useful. We will talk in depth about that in the next hour and a half. Then for the setup, we basically just use Python with UV for managing the dependencies, and GitHub obviously.
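Picking up the Gitingest point from a moment ago, here is a minimal sketch of turning a repository into plain text for the agent. The repository URL is a placeholder, and the token argument for private repositories depends on your gitingest version, so treat the exact signature as an assumption and check the library's docs.

```python
# Minimal sketch: pull a GitHub repo into plain text with gitingest (pip install gitingest).
# The URL below is a placeholder, and passing a token for private repos is assumed to be
# supported by your gitingest version; verify against its documentation.
import os

from gitingest import ingest

summary, tree, content = ingest(
    "https://github.com/your-org/your-repo",   # placeholder repo URL
    token=os.getenv("GITHUB_TOKEN"),            # only needed for private repos (assumed kwarg)
)

print(summary)        # high-level stats about the repo
print(tree[:500])     # directory tree as text
print(content[:500])  # concatenated file contents, ready to hand to the agent
```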
And for the models, we recently changed everything to Gemini, mostly because of the YouTube thing I mentioned, but also because we saw that it works really well and we can alternate between Pro and Flash, the cheaper one and the more expensive one, depending on the complexity of the task. So we use them alternately in the code. One thing we also really like is that it has a free tier for our students, so it's really convenient for us. But obviously we also compared it to the other models we were using before, other Claude models and even OpenAI's, and we've seen that Gemini is much more interesting these days, especially compared to OpenAI. So we are using Gemini, but you need to compare different models to be sure you are using the right ones. Finally, for the more theoretical part, how does the research agent communicate with the user: inside the MCP prompt we have specific steps where it stops and interacts with the user, and all these interactions are handled through files, whether it is the guideline, the research file, or the other files that Paul will talk about in the writer part. And by the way, everything I mentioned and much more is in a cheat sheet that we linked in the README of the repository, so you can definitely access it, for free obviously, it's on GitHub. And now is the part where we talk about the deep research agent, so you can scan the QR code if you haven't, to get the repository, and Sami will take over. >> Sorry everyone, just give me one second. Awesome. So, I'll be talking about the deep research agent. And what is a deep research agent? As Louis already explained, it can research any topic by searching the web, it can analyze YouTube videos, and finally it will give you a cited report. How we've built it: we've used MCP for this, which is an open standard for giving agents access to tools and data. The key idea, the design choice here, is that the agent is the brain that does all of the reasoning. As Louis highlighted, any gaps in the research are handled by the agent: it thinks about whether anything is missing in the research, whether it needs to run multiple researches, or whether it has to search for something else. That part is done by the agent, whereas the MCP server handles all of the capabilities, so it exposes all of the tools, the resources, and the prompts. We've used something called FastMCP, which is a library that makes the creation of these servers very easy. It hides all of the complexity, and you do not have to write any of the protocol or the communication between agents; all of that is handled by FastMCP. And we're using Claude Code as our agent harness, but this code is independent of that; if you're more inclined towards using Cursor or GitHub Copilot, you can swap Claude Code for any other agent harness that you like. So I'll be repeatedly talking about these three things: an MCP server exposes tools, prompts, and resources. What are tools? Tools are basically all of the actions an agent can take, like deep research, or doing a Google search, or analyzing a video and giving us a transcript, or compiling a report. All of these action-oriented things are done by tools.
Whereas the second thing the MCP server exposes is prompts, which are basically instructions. When you're talking to ChatGPT you give it instructions; here it's prompts that the agent can follow, and in our case we have a detailed prompt that I'll show you in a minute with the code. And finally, the third thing is resources, which is the data the agent can read. This is static data, like the model names we're using, the versions we have, and any feature flags; those are all part of the resources. These are things the agent can read; there is no action being done here. In terms of tools, we have three. First is the deep research tool, which basically does a Google search and gives us sources for everything. It provides us with answers for any query we have, and it also provides all of the sources for that; we're using the Gemini API for this. The second tool is the analyze YouTube video tool. For this as well we're using the Gemini API, so you can take any YouTube URL, paste it in, and it will give you a transcript. And finally we have the compile research tool, which basically compiles the output of the previous two tools. We'll talk more about the memory folder and the details of that when I show you the code. So I just talked about the individual parts, the tools of the agent, but this is the overall architecture: we are using Claude Code as the MCP client. Just to highlight here, Claude Code for us is both the MCP client and the LLM, so it is the brain being used, and it is also the MCP client talking to the MCP server that we have. The MCP server has tools, resources, and prompts; these are the details about each of the tools. The first two tools write to a folder called memory; basically, we wanted to preserve all of this because it is easier for logging and easier for us to verify all of the results we're getting, so we write all of our results to a folder. The third tool reads from this folder and creates a file for us, a markdown file called research.md. The first two tools make API calls to Gemini 3, and finally everything is read from and written to this particular folder. So now I'm going to show you the code. Okay, this is how the code is structured; let me just zoom in a bit. Okay, perfect. You don't have to be intimidated by this; we're going to simplify it a lot. These are the main two folders, the research folder and the writing folder. Paul will talk about the writing part; I'll just talk about the research part. If you remember, when I started my presentation, the first thing I talked about was the MCP server. This is the MCP server file, and this is how you set up the MCP server. We have some settings that we load here; these are basically the names of the models we use, and I can show that to you in a minute. And this is how you set up the server: you give it a name and a version.
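As a rough skeleton of what such a server file can look like with FastMCP, covering the three kinds of registrations walked through next, here is a hedged sketch; the names, the model settings, the resource URI, and the prompt text are illustrative stand-ins, not the workshop repo's exact code, which splits these pieces across files.

```python
# Illustrative skeleton of an MCP server built with FastMCP (pip install fastmcp).
# Names, models and prompt text are placeholders; the real repo organizes this differently.
import os

from fastmcp import FastMCP

# "Settings": the model names the tools will use, loaded from the environment / .env.
GEMINI_RESEARCH_MODEL = os.getenv("GEMINI_RESEARCH_MODEL", "gemini-2.5-pro")
GEMINI_YOUTUBE_MODEL = os.getenv("GEMINI_YOUTUBE_MODEL", "gemini-2.5-flash")

# Create the server; tools, resources and prompts get registered on this object.
mcp = FastMCP(name="deep-research")


@mcp.resource("config://server-info")
def server_info() -> dict:
    """Static data the agent can read: server name, version, model names."""
    return {
        "server": "deep-research",
        "version": "0.1.0",
        "research_model": GEMINI_RESEARCH_MODEL,
        "youtube_model": GEMINI_YOUTUBE_MODEL,
    }


@mcp.prompt()
def research_workflow(working_directory: str) -> str:
    """Recipe telling the agent which tools exist and how to use them."""
    return (
        "You are a deep research assistant. Use deep_research, analyze_youtube_video and "
        f"compile_research. Write intermediate results into {working_directory}/memory, "
        "then compile everything into research.md."
    )


# Tool registrations (deep_research, analyze_youtube_video, compile_research) come next.

if __name__ == "__main__":
    # Defaults to the stdio transport, which is what Claude Code connects to locally.
    mcp.run()
```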
And then you register three things; this is something you should remember: we register tools, resources, and prompts. Then we set up basic logging, and we're using something called Opik for observability; Paul will talk more in detail about that. And then we finally create the MCP server. Now let's see how we have registered the tools. If we go to this file, the Python tools file, this is how we registered them. For registering a tool with FastMCP you need three things. The first is, of course, the name of the tool; here the name is deep research. The second is the arguments: we pass it two arguments, the working directory and the question, or query, that we're going to ask. Then you need a definition of what your tool is going to do, and you have to be as precise as possible when writing this definition. Finally, we also have definitions for the arguments. So this is how you write the tool. Then we have the code for the observability, and finally we return the implementation of the tool. So that's it; this is how you define a tool, and there are no complications here. But let's go into a little more detail about the tool itself, about how it works. For the deep research tool, in the first step we just validate all of our paths, checking that the working directory is valid, and secondly we ensure that the memory folder we write to, basically the folder where we write all our results, exists; we do all of this validation here. Then we run the grounded search on the query that we have. Once we get the result of this query, we return the status, whether it was successful or not, the query that we sent, and the answer we got. So we get a detailed response, and this has been very helpful for our team to debug and go through all of the data we get, so that we can refine the research results. Going into more detail about how we run the research: if you look at the run grounded search function, we have a prompt that we've already written. If you go here you can see the research prompt: we're basically saying that you have to provide a detailed, comprehensive answer to the question, focus only on official, authoritative references, make sure you include all the relevant details possible, and make sure to cite your sources clearly. So this is the prompt we use for the deep research tool, and then we just call the Gemini API here and get the answer text and the sources. Then, in this part of the code, because the sources you get from the Gemini API are not in a structured form, we just try to structure them in a better way.
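A condensed sketch of that shape, a FastMCP tool wrapping a Gemini grounded search, could look roughly like the following. The prompt text, model name, file names, and return fields are stand-ins, and the grounding-metadata field names follow the google-genai docs as I understand them, so double-check them against the SDK version you install.

```python
# Illustrative sketch only: a deep_research-style tool backed by Gemini grounded search
# (pip install fastmcp google-genai). Not the repo's actual implementation.
from pathlib import Path

from fastmcp import FastMCP
from google import genai
from google.genai import types

mcp = FastMCP(name="deep-research")  # in the real layout this lives in the server module
client = genai.Client()              # reads the Google API key from the environment

RESEARCH_PROMPT = (
    "Provide a detailed, comprehensive answer to the question below. Focus on official, "
    "authoritative references and cite your sources clearly.\n\nQuestion: {question}"
)


def run_grounded_search(question: str) -> dict:
    """Ask Gemini with Google Search grounding and return the answer plus its sources."""
    response = client.models.generate_content(
        model="gemini-2.5-pro",  # model name is illustrative
        contents=RESEARCH_PROMPT.format(question=question),
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())],
        ),
    )
    # Grounding metadata holds the web sources; field names per the google-genai docs.
    meta = response.candidates[0].grounding_metadata
    chunks = (meta.grounding_chunks or []) if meta else []
    sources = [{"title": c.web.title, "url": c.web.uri} for c in chunks if c.web]
    return {"answer": response.text, "sources": sources}


@mcp.tool()
def deep_research(working_directory: str, question: str) -> dict:
    """Run a grounded web search for `question` and persist the result under memory/."""
    memory_dir = Path(working_directory) / "memory"
    memory_dir.mkdir(parents=True, exist_ok=True)  # make sure the results folder exists

    result = run_grounded_search(question)
    out_file = memory_dir / "deep_research.md"
    out_file.write_text(f"# {question}\n\n{result['answer']}\n")

    return {"status": "success", "query": question, "file": str(out_file), **result}
```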
I'll show that to you when I run it; you'll see that we get the output in this structure, where we get the URL, the title, and the snippets, and finally we return the research results. So those are all the details about our first tool. Moving on to our second tool, the YouTube video analysis: this has the same structure as the deep research tool I showed you. So, three things: first the name of the tool; second the arguments it uses, in this case the working directory and the YouTube URL; then the description of the tool, what it is doing, and the definitions of the arguments; and then we just return the execution of the tool. Going into more detail about what the analyze YouTube video tool looks like, here is the implementation. Similar to what we did for the deep research tool, we validate all of the paths, and then we extract the YouTube video ID using this get video ID function. If we do not get a video ID from the YouTube URL that was provided, we try to sanitize the input and extract the video ID. And then this is the line where we call the Gemini API, and similar to the deep research tool, we return all of these different details so that we can keep track of things and see what output we got. Let's go into the analyze YouTube video implementation. Similar to the deep research tool, we have a prompt for the YouTube transcription, and here is the prompt we wrote. It basically guides the Gemini API to give us the transcript in a certain format, with all of the instructions and details we want in the output. For this part, we divide our request to the Gemini API into two parts: in the first part we send the YouTube URL that we have as a file URI, and in the second part we send the prompt that I showed you. And this is a very cool thing: Gemini is a multimodal model, and when you send a YouTube URL as a file URI, it actually sees the video. It goes through it part by part; it doesn't access any pre-existing transcript, and that's why, when we run the YouTube video analyzer tool, you'll see that it takes two to three minutes, because it is actually going through your entire video. It is really multimodal. So here we send the request to Gemini, and once we get the output we try to clean it into a certain format, and then we output it and save it in the memory folder.
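A minimal sketch of that Gemini call, passing the YouTube URL as a file URI alongside the transcription prompt, could look like this; the URL, the prompt text, and the model name are placeholders rather than the repo's actual values.

```python
# Illustrative sketch: asking Gemini to watch a YouTube video directly via its URL
# (pip install google-genai). Prompt, model name and URL are placeholders.
from google import genai
from google.genai import types

client = genai.Client()  # API key taken from the environment

TRANSCRIPTION_PROMPT = (
    "Watch this video and produce a detailed, well-structured transcript in Markdown, "
    "followed by a short summary of the key ideas."
)


def analyze_youtube_video(youtube_url: str) -> str:
    """Return a transcript-plus-summary for the given YouTube URL."""
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=types.Content(
            parts=[
                # Part 1: the video itself, referenced by URL as a file URI.
                types.Part(file_data=types.FileData(file_uri=youtube_url)),
                # Part 2: the instructions for how to transcribe and summarize it.
                types.Part(text=TRANSCRIPTION_PROMPT),
            ]
        ),
    )
    return response.text


# Example usage (processing a full video can take a couple of minutes):
# print(analyze_youtube_video("https://www.youtube.com/watch?v=VIDEO_ID"))
```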
Finally, coming back to the final tool, the compile research tool: what it basically does is compile all of the results that we get from the previous two tools. If you go here and look at its implementation, you'll see that it writes to the output research MD file, and then it gives us a success status and a return message. So we have seen how we registered the tools in the MCP server. Similarly, we've got the code for registering the resources. Here are all of the things we share with the agent as resources: basically, the name of the server, the version we're using, what Gemini model we are using, what our YouTube transcription model is; all of these different details are provided to the agent through resources. Then there is the MCP prompt, and this is how we registered it. We return something called workflow instructions, which is the detailed prompt that we have written. In this prompt we tell the agent: these are the tools available to you, this is the workflow you can use to call these tools, plus some additional details like the working directory and where all of the results should be stored. So this is the detailed prompt we give to the agent. One more thing I wanted to highlight is the .env file. All of this is accessed as resources, but this is where you have to add your Google API key; you can also add the Opik API key if you want observability. Additionally, we have the scripts here, which help us with the code throughout; these are additional scripts we've written to manage some of the utilities. So, going back: now we have the entire architecture, so how do we connect it with our agent harness, in our case Claude Code? This is very straightforward for Claude: you just need to create an .mcp.json file in the project root and give it a command like this. In our case we run the MCP server locally, not hosting it anywhere; it runs locally and uses standard in and standard out for communication. This works, but if your MCP server is hosted elsewhere, you would need a URL. Notion is a very popular example: Notion has its own hosted MCP server, and you get access to it through its URL. So this part of the config would change slightly: instead of the UV command, you would have that URL there, and you can connect to any MCP server that is available, Snowflake, Notion, anything you want. One more thing I would like to highlight is that, as I said at the beginning of my talk, we're using Claude Code as the agent harness, but you can use anything you want; if you like Copilot you can use it, any agent harness works, but this setup is for Claude Code. Okay, so let's see where that file is. You have to create that .mcp.json file in the root directory, and I can explain it; it's a very simple thing. Here we have the name of the MCP server we have.
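A minimal .mcp.json along the lines being described might look roughly like the following; the server file path and the environment variable names are assumptions for illustration, not necessarily the repo's exact values.

```json
{
  "mcpServers": {
    "deep-research": {
      "command": "uv",
      "args": ["run", "fastmcp", "run", "research/mcp_server.py"],
      "env": {
        "GOOGLE_API_KEY": "your-key-here"
      }
    }
  }
}
```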
So the server name is deep-research, and this is the command that we have. We're using UV because it simplifies everything: any virtual environment you're creating, any packages you're downloading, it's all handled by UV, so you don't have to do anything. And then we have this command, which basically runs the server file that I showed you a couple of minutes ago; that's all it does. So Claude runs the server file as a subprocess locally. Our server is running locally, and this entire command only does that: it runs fastmcp run with the path to the server file that we have. And this is the environment file that I told you you're supposed to create. Okay, so now it's time to see how the code works. If you go to Claude Code and run /mcp, you can see that I have two servers here; because of the .mcp.json file, it was automatically able to detect both agents. We'll just be looking at the deep research agent right now. These are all the details: you can see the command, the arguments we have, and all of the capabilities the server is exposing. You can do view tools and see all of the details about a tool: the tool name, the description I showed you, all of the arguments and parameters we have, and similarly all the details for the other tools. Now I'm going to try to run one of the tools and see how Claude does it. I've already written out this question; it was from a previous AI talk about AI agents. I'm going to tell it to analyze this video for me, and it should automatically be able to detect the tool that's being exposed by the MCP server. You can see here that it has automatically picked up the tool we are exposing and is trying to use it, and on the side you can see that the memory folder I told you about, where we track all of our results, is being created, and it's going to write the transcript to this file. As I told you, Gemini is watching, or analyzing, this video in real time; that's why it's going to take a couple of minutes to produce the transcript. So basically, every time you ask a question like this, Claude Code tries to see what it can do and what tools are available, then it sends a request to the MCP server: can you please execute this tool? Once the MCP server executes the tool, it sends the results back to Claude Code, which then further analyzes the result. So we can wait a minute for that. Yes, so it has created this markdown file. It says the video is from the AI Engineer Code Summit, it has provided all of the details, and it's a pretty good, detailed transcript that we can see here, and it also gives its own input at the end.
Plus, I really like that we store the entire transcript, but it also gives you a really neat summary at the bottom listing the key ideas shared in the video, and it notes where the transcript is. So this was a short example of how we can run one of the tools and check whether Claude is able to run it. Now I will try to run the entire pipeline. I've already written a question for that, telling it to do an end-to-end research. So now I'm trying to use the deep research tool and the analyze YouTube video tool together, and to see how it runs both of them. It says that it is trying to do the analyze YouTube video part first. When you have a big question like this, the way the whole workflow works is that Claude breaks the question down into different parts and then sees which tool it needs. So it has an entire reasoning process going on: okay, given the tools I have available, I need the analyze YouTube video tool for the YouTube video and the deep research tool for the first part of the question. So it's going to divide the question into two parts and then try to call both tools. Okay, let's see; it usually takes some time. >> [Audience question.] >> Sure. Yeah, I'm actually going to talk about that; that's the next part. I was trying to remove the skill initially, which is why I moved it to a temporary folder, so that it doesn't pick up the skill, because it tends to do that as well. >> [Follow-up.] >> Not the whole server, just the prompt part: for us, we're using the skill to replace the prompt part, that entire thing. I'll show it to you in just a minute; I think that might clarify your question. So here it analyzed the YouTube video again, but it didn't give me the entire answer for the first part. Just give me one second; I have to clear Claude's context, because it tends to reuse whatever I asked it previously, and I also had to delete the memory folder. So while this is running and generating research for us, I can talk about the next slide, which is agent skills. I'm sure all of you have heard a lot about agent skills; they're very popular right now. A skill is basically a compact, concise way of telling an agent about capabilities and workflows: tools do the work, and skills provide the way to do it. We talked about the workflow prompt I showed you, where we were telling our agent, okay, these are the tools that are available and this is how you should execute them. Instead of writing that in the prompt, we can create a skill out of it. And what is the benefit of doing that? When one thing already solves the problem, why create another one? The reason is that skills have something called progressive disclosure.
So when you load a skill, as I'll show you in a minute, Claude doesn't load the entire thing; it only loads the name and the description of the skill. Once you run a query, it loads the entire skill, and once that query is executed, it wipes it all away, so it doesn't sit in the context. Whereas when you're using a prompt, everything stays in the context the whole time, and you'd want to avoid that. So skills are a very clean way of doing this. They're also very shareable, because we internally write a lot of skills for our team and share them with each other; it's a concise, shareable format. And it's more maintainable, because you can check it into Git; it's your go-to place and you can manage the whole thing there. Okay, so we have the results for this, but before that I would just like to show you how we have written the skill. We've written the research skill, and this replaces the prompt that we had. Here you have to write the name of the skill, and you need to write the description. Why is this description necessary? Because when your agent is trying to match a query to a skill, it looks at this part of the skill, so you need a good description here. This part is called the front matter, which is what gets loaded, and then this is the rest of the skill. In our case, we are reusing the research workflow prompt that we had instead of writing all of the details in here; you can do that, but this is a design choice for us, and this is the rest of the body of the skill. Okay. So how can you see all of these skills, and what do I mean when I say it only loads the front matter? If you go to Claude again and list the skills, you'll be able to see them: these are some of the skills I've already downloaded, and these are the project skills that we have created, in particular the research skill I was talking about. Every time you look at them like this, only the front matter is loaded, not the entire description, and if you want to use a skill, you can invoke it explicitly like this, or Claude can automatically pick up the skill when you ask it a question. Now I'm going to ask it the same question I was trying to ask through the MCP prompt. So I'm asking it to research what AI agent skills are and then also analyze the YouTube video for me. Yeah, so it is loading the research workflow prompt, and it is able to do that because our server is running locally; that's why it can read the prompt, because they're all in the same folder. And now it has just started to do a deep research, as you can see here. So it is running the deep research here: first it ran the analyze YouTube video tool, and now it is running the deep research tool.
And I think one amazing thing to highlight here is that my initial query was simply "can you give me details about what agent skills are", and every time our agent harness identifies a gap in the output it gets, it asks a different question to fill that gap. So it is doing all of the reasoning behind the scenes and making sure all of those gaps are filled in the research we're trying to do. It usually takes around two to three minutes for this entire run. >> [Audience question.] >> Sure. Yeah, so we built it originally with the MCP portion of it, and now we are transitioning, but we want to keep our original code. >> [Follow-up.] >> I mean, not the server part, but essentially the prompt that we have, which we can replace with skills. But we have a lot of detail and different components within the server; that's the reason we wanted to maintain it. >> Complexity? >> Yeah, it's more about the complexity in that case; that's why we want to maintain it. So here we can see we've got the output of the video, and now we're just going to wait for it to produce the output from the MCP tools. Perfect. So here you can see: this was my initial question, it has given me the answer for that, and this was the first run. Now it is doing more research on it, because it gave us an answer but identified some gaps in it. So this is the second query it ran, and it's a different question, as you can see here. It ran two queries, decided whether that was good enough, and then it says all three research tasks are completed, let me run one more targeted query and then finish everything. That's why we wanted all of the reasoning to be done by the agent, because this is something a human would do: if you're reading an article or doing any sort of research, you try to identify the gaps and fill them. That's what it is trying to do, and that's why we needed the brain and are trying to utilize it. So this is the third query it is running for me, and then finally we expect it to create a compiled report, which should come any minute. >> [Audience question.] >> I think we would have to look at the observability for that and check why it is not seeing the latest code. But we also gave it a video from 2025, so maybe it's because of that; the video I asked it to transcribe relates to that. But to answer that properly, I would go into the observability part of the code and see what is going on there. So finally, it's asking me whether it can create a markdown file for me, and here is the research MD file that our final tool creates. It is really detailed: it has all of the comparison tables and all of the details. I think this is a really nice way to compile everything that we have gotten so far.
It also compiles and adds the YouTube video transcript at the bottom. So yeah, thank you so much everyone. Now I'll hand it over to Paul.

Awesome. Thank you, Louis and Sami, for those great slides, and give me a second to set everything up over here. Okay, so now we're moving to part three of this whole system, which is the LinkedIn writing workflow. It transforms the research from the deep research agent into polished posts that pass slop detectors, so we can turn all of you into LinkedIn influencers.

Just as a high-level overview, this is our full system architecture. We already implemented the research agent, which outputs the research.md file. That file is the input to the writing workflow, which also takes a guideline.md file as input. The guideline is a fancy way of modeling the user input, and it's how we guide the generation process. The final output is, well, the LinkedIn post.

Now let's zoom into the architecture, which is split into three big pieces. The first one is where we build up the context and ultimately end up with one big system prompt. Remember, this is not an agent; this is just a workflow, because in reality, to write a piece of content, whether it's a LinkedIn post, a video, or an article, you don't really need agents. As Louis said in the beginning, this process is very static: you always go through the same steps, so an agent would just make everything a lot more complicated. Of course, you can put agents in some parts of it, but we won't show that in this workshop. So the first part is to load the context: the guideline.md file (the user input), the research, and some static files, the profiles and a few few-shot examples, which I'll dig into in more detail a bit later. We take all of these and build the system prompt. Phase two is to pass everything to the LLM; nothing fancy so far, we just create this huge system prompt and call the LLM to create the first draft of the post. The last phase is the evaluator-optimizer loop, which is probably the most interesting part, where we add an LLM in the loop that reviews and edits the post a couple of times. This is important because you can apply it not only to LinkedIn posts but to any type of content: video transcripts, financial or medical reports, articles, book chapters, or very detailed lessons, as we did in our course, where we needed to follow text snippets, images, code snippets, references and all those little details that LLMs sometimes get right but often don't. And even if they get them right 80% of the time, it's annoying to go and fix the rest manually; that's the whole point, so all of this is automated end to end. We'll dig into this process in more depth later on.

So this is one of our examples: the beginning of a LinkedIn post we asked it to generate. And just for fun, I want to copy this post, the exact same post, into this pretty popular slop-score detector, so you can trust me that this actually works, and hit analyze. And yay.
Not slop. It actually scored lower, less sloppy, than a human, as you can see at the bottom. Of course, not all the posts come out this clean, but most of them are really good and they sound human. Remember, this is LinkedIn language, so we do need to sound a bit like LinkedIn. But ultimately, as you can see, it doesn't have any em dashes, any weird adverbs or verbs, or things like that. It reads very nicely, and the structure is skimmable, as we'd want for LinkedIn. You can check the full post at this link in the GitHub repository and also try the slop test yourself using that link.

Okay, so now let's dig deeper into the first phase of this workflow, where our core focus is understanding how we can actually control the generation. Because, as Louis said in the beginning, we don't just want to write "hey, I want a LinkedIn post on topic X"; we want to dump our ideas, values, and thoughts into that post so it actually sounds like us. The first trick is to structure this guideline, the user input, as more than just "write me a post on X". This guides what we want to write. We created a template for it that we fill in: what topic we want to address, what angle, some key points, the narrative flow, and so on. This is the only piece that changes from post to post; out of the whole system, this is what we need to write ourselves as humans to get a different post, so yes, it's dynamic. You can see an example in the repo, along with many others, which I'll show a bit later; there's also a sketch of such a guideline below.

The second trick is to add writing profiles, which tell the LLM how to write the post, because in reality you don't want the LLM to just do its own thing; you want to guide it. These are static: the guideline defines what you want to write, and the profiles define how you want to write it; it's basically the styling layer on top. We created three profiles, which are themselves markdown files. The structure profile defines things such as how many characters a LinkedIn post has on average and the core structure of a post: the hook, the body, the call to action. The terminology profile defines things such as active voice and the AI slop words and expressions we want to ban. Unfortunately, that's about all you can do: keep track of a huge list of delve, tapestry, vibrant and all those words we love, and kindly ask the LLM not to use them, until you start using them as a human as well. The last one is the character profile, which is also static but configurable; in our writing workflow it's configured with my own biography, so the model knows a bit about me, how I like to write, my style, and so on. That's how you add a bit of personality to it. Again, we put all of these into the system prompt together with the guideline we defined before, and these profiles stay static.
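Going back to the guideline for a second, here is a hypothetical guideline.md to make the idea concrete. The section names and content below are illustrative, not the exact template from the repo:

```markdown
# Post guideline

**Topic:** Why we use an evaluator-optimizer loop instead of a single LLM call
for LinkedIn posts

**Angle:** Practical lessons from automating our own content pipeline

**Target audience:** AI engineers shipping LLM-powered writing tools

**Key points:**
- A reviewer in a separate context window catches what the writer misses
- Profiles and few-shot examples control style better than longer instructions

**Narrative flow:** open with a concrete failure, walk through the fix, end
with one takeaway

**Tone / constraints:** first person, no banned slop words, roughly 1,200 characters
```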
As I said, we define the profiles just once, and you can access them under the writing profiles directory; they're markdown files of a few hundred lines each, so look around.

The last trick is to add few-shot examples. This is nothing fancy, but in reality the hard part is getting them right. This is basically the data-collection part of your system, and in our particular use case we added three LinkedIn posts from my own writing. The key idea is to use high-quality, representative few-shot examples and make them as varied as possible in things such as topic, length, and structure. One question many people ask is: why three LinkedIn posts? Three is a magic number, but in reality I just guessed it. The thing with few-shot examples is that, because you always pass them in your system prompt, you want the lowest number that gets the job done. People usually suggest three, five, ten, or twenty examples, but you want to start with a bigger number and trim it down as far as possible until it stops working, and then put some back in. The idea is to keep the system prompt with its few-shot examples as small as possible, because you always pass it to the LLM, which translates into more cost, more latency, and even degraded performance as the context grows. For that, I made a dataset available in the GitHub repository where I extracted 20 random LinkedIn posts from my profile that got more traction. I used that as a dataset for this writing workflow and, further down the line, to configure other things. One thing to highlight is that probably the only way to reduce this few-shot load on the system prompt is fine-tuning, but that's often overkill and adds a lot of friction to your tech stack.

The last part of the system is the evaluator-optimizer pattern, which is probably the fanciest piece of all of this. It contains two LLM calls and, importantly, two different context windows: the writer and the reviewer. The writer first writes the draft, and the reviewer, with a completely different context window, takes the draft and reviews it. This is super important to avoid bias, because LLMs tend to like what they have already written, and putting the draft into a completely new context window removes that bias. So this is how it works: the reviewer checks the draft's adherence to the guideline (the user input), to the research (basically to remove hallucinations), and to the profiles (structure, terminology, and character), so we ensure it sticks to them. Then the editor, which is often the writer itself, applies the reviews, and we run this loop, in our examples, three or four times: write the post, review it, edit it, optimize it, and so on, similar to a fine-tuning optimization loop. One trick we observed over time is that it helps to keep all those versions in the working directory, because writing is subjective, and if the reviewer is applied too aggressively, you might just not like the result.
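Stripped down, the loop looks something like this. It's a minimal sketch: the `llm` helper, the prompt wording, and the file layout are assumptions for illustration, not the repo's actual code.

```python
from pathlib import Path

def llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a single chat-completion call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def write_post(guideline: str, research: str, profiles: str, examples: str,
               iterations: int = 4, out_dir: Path = Path("output")) -> str:
    out_dir.mkdir(parents=True, exist_ok=True)
    writer_system = f"{profiles}\n\nFew-shot examples:\n{examples}"
    context = f"Guideline:\n{guideline}\n\nResearch:\n{research}"

    # Phase 1 + 2: build the big system prompt and draft v0.
    post = llm(writer_system, f"{context}\n\nWrite the LinkedIn post.")
    (out_dir / "post_v0.md").write_text(post)

    for i in range(1, iterations + 1):
        # The reviewer runs in a *fresh* context window to avoid self-preference bias.
        reviews = llm(
            "Review the draft against the guideline, research and profiles.",
            f"{context}\n\nProfiles:\n{profiles}\n\nDraft:\n{post}\n\nList every violation.",
        )
        # The editor is the writer itself, now with the reviews appended.
        post = llm(writer_system,
                   f"{context}\n\nDraft:\n{post}\n\nReviews:\n{reviews}\n\nApply the reviews.")
        # Keep every version on disk: writing is subjective and v2 may read better than v4.
        (out_dir / f"post_v{i}.md").write_text(post)
    return post
```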
So you look through those versions and pick the one you like most. And again, because writing is subjective: the evaluator-optimizer pattern usually works with a score, where the reviewer scores the draft and you loop until you pass a specific threshold. But for creative work that threshold is very hard to quantify, and the loop becomes noisy and unreliable. We realized it's just easier to use a fixed number of iterations and let the user re-run more iterations manually if they want to.

Now let's dig a bit deeper into how the reviewer works. As input it gets the current state of the post. As context it gets the guideline, the research, and the profiles, which are basically the rules the writer needs to follow. And the output is a set of Pydantic objects. I think this is the most interesting part: you constrain the LLM to output a list of structured objects. This is powerful because Pydantic objects have the property that you can attach a Field description to each attribute and explain what that field means. It's basically a prompt-engineering technique that guides the LLM into understanding what each attribute of the returned object needs to contain. For example, our review model has profile, location, and comment attributes, and an example output is: the terminology profile was violated in paragraph two, where the LLM used the banned term "leverage". That way, when the editor gets these reviews, it understands exactly what was violated. From our tests, these structured outputs are a lot more performant than just letting the LLM spit out whatever it wants. You can check this code in this Python file; it's pretty simple, and unfortunately I won't have time to dig too deep into it.

On the editor side, it gets this list of Pydantic objects, the current state of the post, and also all the context the writer initially had, because the editor is actually the writer itself. This is similar to how a team of writers and editors works: I write something, I give it to the editors, I get the reviews back, and I apply them. Another important thing is that reviews are not created equal; people like to give reviews on everything, and the same is true for LLMs. Because we loop a couple of times, we realized it's super important to put a priority on them: we always prioritize the guideline first (the user input), then the research, then the profiles. This is really useful when reviews clash on the same paragraph or the same sentence, because it tells us which one to pick up and apply. Again, you can check the prompt within this Python file in the repo.
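Concretely, the review objects look roughly like this. This is a sketch: the class and field names mirror what was just described, but the exact definitions in the repo may differ.

```python
from typing import Literal
from pydantic import BaseModel, Field

class Review(BaseModel):
    """One violation found by the reviewer; the reviewer returns a list of these."""

    profile: Literal["guideline", "research", "structure", "terminology", "character"] = Field(
        description="Which input or profile the draft violates."
    )
    location: str = Field(
        description="Where the violation occurs, e.g. 'paragraph 2' or 'the hook'."
    )
    comment: str = Field(
        description="What is wrong and how to fix it, e.g. "
                    "'used the banned term leverage; replace it with a plain verb'."
    )
```

Most SDKs let you request `list[Review]` as a structured output, so the editor receives machine-readable violations instead of free-form prose.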
And here is a concrete example: the post v0, which is what the LLM spits out before the evaluator-optimizer loop, and how it looks after four reviewing iterations. We can see that the text is nicely formatted for LinkedIn, and the first sentence, which is the most important one, is a lot punchier. The same goes for the first part of the post and for the second part: it modified the structure, the wording, everything, to look more like something you'd expect on LinkedIn. You can check the whole post and all the versions from zero to four in the examples directory.

And now let me actually show you how this directory looks. To test this code you have three levels. I created a simple make command just to check that everything works; nothing crazy here, but if this runs end to end it means your code works locally, and then you can hand it to Claude Code. Similar to the research agent, we serve everything as an MCP server prompt, and then we connected this MCP server to a write-post skill. The thing is that this takes three to four minutes to run, so I already ran it to avoid wasting your time, but here is how you run it: you take the write-post skill and, just like before, we take the same example, copy its relative path, and ask it to use the guideline from that directory. It then knows to pick up the guideline and the research associated with it and write the post, and the output is here. As you can see, we have the four versions of the post, actually five versions, and also the guideline, which I think is the most interesting one. Instead of just kindly asking the LLM to write about something, you need to be very explicit about what you want from it: you put in an angle, the target audience, the key points you actually want to cover, the tone, and even a character-count constraint that overrides the static profiles; we encoded in the system prompt that the guideline can override the default profiles if you want something different. You can go wild here and input anything you want, but I wanted to highlight that you actually need to put in some effort and think it through; you cannot fully automate everything if you don't want to sound like slop. On top of this, we also have a small image generator. We haven't put a lot of effort into it, but this prompt generates the post using the writing workflow and then asks another tool to automatically generate an image for that post.

Okay, so to wrap up what we built in the writing workflow: we built a pipeline that generates, reviews, and edits the post, and you can apply it to any type of content. We tested it with almost every type of content and it works really well; you just need to adapt the profiles, the examples, and so on. We learned how to control the generation with the guideline, the profiles, and the few-shot examples, which again you'll need to adapt for different content types, and then how to serve everything as an MCP server and as skills. And yes, for this local example, you could do everything through skills.
You don't really need MCP servers to make it work locally. But from my experience, MCP servers help you distribute logic a lot more easily, because skills are your local, hacked-together setup that works great for you. If you want to distribute business logic, you don't really want to tell someone: hey, download those skills, install these uv dependencies, plug in those CLI tools, oh, and you also need these credentials. That quickly becomes a mess to distribute at scale, and MCP servers solve that problem; skills can then help you personalize how you use them. And if you want a skill to be more than a prompt and actually run code, it works, but it can quickly become a mess to set up if it's too complicated.

So now let's move on to part four of this system, which is observability, more exactly monitoring and evals. We'll use the writing workflow as the example for evals, and we did monitoring for both. For monitoring we'll go really quickly, because theory-wise there isn't much to say. The core problem behind monitoring, which in my opinion exploded when we started using agents and workflows, is that debugging them purely through logs is hard. I personally can't really follow what's going on in the terminal, especially with the agent's thinking hiding a lot of stuff, so it quickly becomes painful to debug from logs alone. You need a tool that monitors this nicely and captures all your traces: all the LLM and tool calls, your inputs and outputs, your metadata, everything about your run, plus latency and cost tracking. As a bonus, it also stores your traces so you can build an AI evals layer on top of everything, which I'll dig into a bit later.

Now let's actually look at our monitoring. We used Opik to monitor both the research agent and the writing workflow, and it's a lot easier to understand by looking at it. For monitoring you usually have three big concepts. First, threads: in our use case, a thread is the whole workflow of writing a post end to end, the post itself plus the image. For example, this workflow has 44 messages, which capture all the bouncing around between the user and the LLM; you can intuitively see it as a conversation thread, which is why it's called a thread. Then you have traces, where you can dig deeper into what's going on: a thread contains multiple traces, and within a trace you can see what actually happened. For example, for a generate-post trace we have the high-level generate-post call, where we can see how long it took to run, how many tokens it consumed, and how much it cost, and underneath we can see all the tool calls and LLM calls that happened under the hood: the models used, the cost per model, the latency per model, and so on. On the right we can also see the high-level input and output of the system, some metadata, the token usage, and basically everything that happened within that run, which helps you quickly understand what's happening. And we have something similar for the research agent as well.
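Before looking at the research agent's traces, here is how this kind of instrumentation typically looks in code. This is a minimal sketch assuming Opik's `track` decorator (check the Opik docs for the exact API); the function names are hypothetical, not the repo's.

```python
from opik import track

@track  # each top-level call becomes a trace with inputs, outputs, latency and tokens
def generate_post(guideline: str, research: str) -> str:
    draft = write_draft(guideline, research)   # nested call shows up as a child span
    return review_and_edit(draft, guideline)   # same here

@track
def write_draft(guideline: str, research: str) -> str:
    return "draft..."  # placeholder for the real LLM call

@track
def review_and_edit(draft: str, guideline: str) -> str:
    return draft  # placeholder for the real reviewer/editor loop
```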
For the research agent the thread is a lot easier to understand. For example, this latest thread captures all the deep-research tool calls, the tool call that creates the final report, and all the bouncing around in between, while the traces each capture a single tool call; as you can see, we have one LLM call per deep-research tool call.

Okay, so now let's move on to AI evals. I want to start with why evals matter, because it's not necessarily a cool topic, but it can do a lot for your system. Take our writing workflow and say we want to check how well it works. We all start with vibe checking: we generate one post, we read it, we say hey, it's cool, it works. Then we have ten posts; maybe we read two or three, call it a day, and assume it works for the rest. And when we want to check how well it works on a hundred posts, it quickly becomes impossible. Now imagine you only do this at the beginning, but as you evolve your system you keep plugging in more features, and every feature can break everything. This is standard practice in software engineering, but here you work with prompts, and one word somewhere can randomly break all your features if you're not careful, so you need this layer on top.

Evals fit into three big layers. The first is optimization, which is very similar to training a model: the AI evals layer lets us quantify how well our writing workflow writes LinkedIn posts, so when we want to improve it we have a score that tells us whether we're moving in the right direction; on every change we run the evals and know whether we did better or worse. The next layer is regression testing, similar to tests in classic software engineering: whenever we start working on a new feature, we make sure we don't break current functionality. And layer three is running this in production on live traces, to get warnings, errors, and alarms when users actually use your system. What makes this different from normal unit, integration, and regression tests is that everything starts from the eval dataset; it's a data problem, and we're actually building a model here.

Here is how we built the dataset for our LinkedIn posts, where we were lucky enough to already have the data. I extracted 20 real posts from my LinkedIn. Why 20? I decided it's a big enough number for a workshop, but in reality you probably need at least 100. Then I reverse-engineered the guideline and the research: I took the guideline from the post and ran the deep research agent on top of it to find whatever was needed to support that post. Then I generated the output: I put the guideline and the research in as inputs and generated new posts. This is super important, because you want to see results from your real system. When you generate synthetic data of some sort, never ask the LLM to directly generate the output; ask it to help you generate the inputs, but the outputs must come from your real system, because that's what you're ultimately testing.
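Schematically, building the eval set looks something like this. It's a sketch where the callables stand in for the manual work and for the repo's actual research agent and writing workflow.

```python
from typing import Callable

def build_eval_dataset(
    real_posts: list[str],
    reverse_engineer_guideline: Callable[[str], str],  # done by hand, or with LLM help
    run_research: Callable[[str], str],                # the real deep research agent
    run_writer: Callable[[str, str], str],             # the real writing workflow
) -> list[dict]:
    """Inputs may be LLM-assisted, but every output comes from the real system."""
    records = []
    for post in real_posts:                            # e.g. 20 real, high-traction posts
        guideline = reverse_engineer_guideline(post)
        research = run_research(guideline)
        generated = run_writer(guideline, research)
        records.append({
            "guideline": guideline,
            "research": research,
            "generated_post": generated,
            # "label" and "critique" get added by hand in the next step
        })
    return records
```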
Then I went through and labeled each output with a binary pass or fail label, plus a two-to-three-sentence critique on why exactly I gave it that label. I usually stop at the first error: it's easier for me while labeling, and it's also easier for the LLM to understand that whenever it sees the first failure, it stops there, writes why it failed in a few sentences, and moves on. Finally I split all of this into train, dev, and test splits, because remember, we're building a machine learning model here, so we need to treat it as such and use a classic split. Only when we have this dataset and these splits can we actually build the evaluator and evaluate the evaluator; for some reason, when people build LLM judges they often think they can skip that last step, which is probably the most important one. And again, the most important and interesting piece here is the dataset, because if you have the dataset, building an LLM judge out of it is easy.

So let's see how this LLM judge works for our scenario. As input it takes the generated post that it needs to label, but also the profiles, the research, and the guideline; this is super important, because the judge needs to understand the context used to generate the post. As output it gives the pass/fail label plus the critique, exactly the same labels as in our dataset. Then we pass the train split of the dataset as few-shot examples, and this is the most important part, because the system prompt of the LLM judge can be extremely simple if it has the right few-shot examples in place to show it what decisions to take. The few-shot examples look like this: they contain the writer input, which is the guideline and the research (the input to our writing workflow), the writer output, which is the generated post, and the labels: the pass or fail label plus the two-to-three-sentence critique. And that's it; you can check our system prompt in this file, but there's nothing fancy there. The most important part is building the dataset.

Our last step is to measure the judge's reliability, because ultimately what we built here, the LLM judge, is just a binary classifier: it outputs a pass or fail label. We used an LLM judge because it's easy, but we could have used any other model that gets the job done and can output these pass and fail labels. When we measure judge reliability, the final goal is to align the LLM judge with the domain expert, and we do that by testing it against the dev and test dataset splits. This is a process very similar to training any other binary classifier; there's nothing new here, the word "judge" is just fancier. So we test the judge against the dev and test splits and use the F1 score, a combination of precision and recall, to measure how well the LLM judge performs against those splits.
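Put together, the judge can be as simple as this sketch. The `llm` helper and its `response_model` argument are assumptions (an instructor-style structured-output call), not the repo's actual code; what matters is the shape: context plus train-split few-shot examples in, a pass/fail label and critique out.

```python
from typing import Callable, Literal
from pydantic import BaseModel

class Verdict(BaseModel):
    label: Literal["pass", "fail"]
    critique: str  # two-three sentences, stopping at the first error found

def judge_post(post: str, guideline: str, research: str, profiles: str,
               train_examples: list[dict], llm: Callable) -> Verdict:
    # The few-shot examples from the train split carry most of the signal;
    # the instructions themselves can stay very short.
    shots = "\n\n".join(
        f"Writer input:\n{ex['guideline']}\n{ex['research']}\n"
        f"Writer output:\n{ex['generated_post']}\n"
        f"Label: {ex['label']}\nCritique: {ex['critique']}"
        for ex in train_examples
    )
    system = ("Judge whether the generated LinkedIn post passes or fails, given the "
              "guideline, research and profiles it was written from.\n\n"
              f"Examples:\n{shots}")
    user = (f"Guideline:\n{guideline}\n\nResearch:\n{research}\n\n"
            f"Profiles:\n{profiles}\n\nPost:\n{post}")
    return llm(system, user, response_model=Verdict)  # hypothetical structured-output call
```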
The calibration process looks like this: we first run the LLM judge on the dev split, compute the F1 score, and adjust the judge's prompt and few-shot examples to maximize that score, because most probably, when we first start this process, the score will be low, and we want to get it as high as possible. We repeat until we converge; usually you need to do this a couple of times. When you think you're done and the LLM judge is ready to go, you run it on the test split as the final validation step. It's basically training a machine learning model, and afterwards it's ready to run on new data. After we go through these steps, we have an LLM judge that's ready to run on new samples and output the binary pass or fail plus the critique.

To wire everything in this section together: we first build the dataset, then based on that we build the LLM judge, then we calibrate the judge, and only after that do we run it on real data. Usually all of these steps are managed by some observability platform; we used Opik, but you can use any other platform or build something in-house. It doesn't really matter, the idea is that everything should be managed by one cohesive platform.

Okay, so as a quick demo, let's emulate the calibration step as much as possible, since the LLM judge is already calibrated; I'll just show you how the judge performs on the dev split. Under the hood, with Opik, we built a very simple evaluation harness that lets us run these calibration steps and then run the judge on production traces. For our use case the code is pretty straightforward, so let me show you around a bit while it runs. What it does is take all the generated posts, run the LLM judge on top of them, and then compute an F1 score from the judge's outputs against our labels from the dataset. While that runs, it's more interesting to take a quick look at how our dataset is organized. Here we have a single sample from the dataset, and what's interesting is that it has a link to the media of the post, to the guideline, and to the generated post, basically links to everything we need as input and output for both the deep research agent and the writing workflow. Then, using this scope, we plug samples in as few-shot examples either for the post generator or for the evaluator, or assign them, for example, to the dev split, and so on. That's how we're careful to do the splits the right way between the judge and the generator. For simplicity, we have all those files linked together in this YAML file, locally in GitHub, so you can check everything and run it very easily with minimal configuration.

Okay, so if we go back to the terminal, we see that it got a perfect score. Now let's actually run this on the test split, and the goal here is to see that our LLM judge did not overfit. We ran it on the dev split and got a perfect score, which is usually a strong signal of overfitting, so if the F1 score on the test split is not also at or around one, it means our LLM judge is overfitted. Again, this is the exact same process as before, just run on the test split.
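The scoring step itself is just binary-classification metrics. A minimal sketch, assuming the judge returns objects with a `label` field and using scikit-learn's `f1_score`:

```python
from sklearn.metrics import f1_score

def score_judge(judge, split: list[dict]) -> float:
    """Compare the judge's labels against the human labels on a dataset split."""
    gold = [sample["label"] for sample in split]
    pred = [judge(sample).label for sample in split]
    return f1_score(gold, pred, pos_label="pass")

# Calibration loop: compute F1 on the dev split, tweak the judge's prompt and
# few-shot examples, repeat until it stops improving, then run once on the
# held-out test split as the final check.
```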
Back in the terminal, the final F1 score on the test split is again one, which is good. The scores are this good because we only use five samples per split; if you expand your splits, and you should have at least 20 to 30 samples each, the score probably won't be perfect. Now let me show you how these experiments look inside Opik.

Sure. (Audience question.) I think what matters most is the dynamic between the F1 score on your dev split and on your test split, because overfitting shows up in how the two compare: if the judge has a very high score on the dev split and a low score on the test split, it overfitted. The absolute value itself is usually very correlated with your own data. Yeah, exactly. Usually you decide on the dev split that a given F1 score is good enough for you, and then you run on the test split to check that you get roughly the same F1 there as well.

So, with these observability platforms you usually have an experiments tab where you can run experiments, similar to the fine-tuning experiment trackers of the old school. For example, here we have the dev and test experiments. Let's open the test experiment we just ran. Here we can see the output from the binary LLM judge next to our label, and there's a perfect match between the two. If we hover over this, we can also see the judge's critique; the score here is computed only between the labels, and the critique is more useful for us, because ultimately the goal is for us humans to look over these results and understand what's going wrong with the system.

As a final step, I also prepared an online simulation, where I simulated some online traces that the LLM judge runs on top of, giving you the binary labels and the critique. The thing is, this takes five minutes or more, because it actually needs to generate all the posts for real and run the LLM judge on top of them, so I'll just show you a result I bundled together as an experiment: this online test evaluation suite. If we dig into it, you can see the generated post on the right and then the results from the judge: the score and the critique itself. Of course, this system is not perfect yet, and I probably need to refine it more, because here, on live traces, it kind of failed: it just gave a pass on all the traces, while the labels I have for these simulated scenarios show that some should have failed. I probably need to go back to the dev split, expand it with new samples, for example take these samples and put them in the dev split, and keep refining it again and again until it actually works. I'm kind of running out of time, but you also have a skill to run a demo end to end: research and writing.
For example, you can write a guideline about something you want to post about and give it as input to the skill, which is just a file; it will know to pick it up, do the research, and then continue on to writing the post, and you can observe all of that in Opik with the threads, the traces, and everything else. And I guess the final step, if you haven't done so yet, is to actually open up the GitHub repository, run everything yourself, and read the code, because without reading the code you probably won't really understand what's going on inside. You can do that by accessing this link or scanning the QR code. And whenever you're ready to go deeper into building this type of multi-agent system, we have our agent engineering course, which was the inspiration for this workshop, except there the goal is to design and build production-ready AI agents, not just small local systems, across 34 lessons, three end-to-end portfolio projects, a certificate, and a Discord community with access to us. So far it's rated 5/5 by 300-plus students, and if you don't believe us, the first six lessons are free to try out. You can access it using this link or by scanning the QR code. And that's it for the workshop today.