Shipping complex AI applications — Braintrust & Trainline
Channel: aiDotEngineer
Published at: 2026-05-01
YouTube video id: ZdheJTfLu-s
Source: https://www.youtube.com/watch?v=ZdheJTfLu-s
Good afternoon, everybody. Welcome to sunny London. Is it everyone's first time here? I think it must be, because this is the first conference. Amazing. Well, thank you very much for joining today's session. Hopefully you're in the right session, but for those who need to double-click: this is a hands-on workshop on delivering quality AI applications with Braintrust, and we're partnering with our colleagues at Trainline, who will introduce themselves shortly. You're probably wondering: who's this guy, does he even go here, what's his name? By way of introduction, you can call me Jirean — like the band Duran Duran, but with a G — so hopefully there are some Simon Le Bon fans here. I've spent a good part of my career helping organizations and enterprises scale adoption of mission-critical systems, and that's now moving into the age of AI. My background is in formal mathematics, so I've watched all the rage go from machine learning to data science and now AI engineering, and this comes at a very topical time. Feel free to connect with me on LinkedIn if you want to talk, or just to have a general chat around this area. I'm also joined by two friends from Trainline — do you want to come over and introduce yourselves?
>> Of course, if the mic is working. All good. Yes. Hello everyone. Thanks for coming to this workshop, especially after that lunch break and with the sunny weather outside. My name is Usama. I'm a senior AI/ML engineer at Trainline. For those who know me — yes, one person — I was a staff platform engineer before, I have a background in computer vision, and I was doing mobile apps on the side. So at one point, nothing to do with AI — but now we're doing AI together with Mayan at Trainline.
>> Hi everyone, my name is Mayan. I'm a senior AI engineer at Trainline. I have a research background in LLMs, but my research was in the pre-big-LLM era — so if you've heard of BERT, the older language models. We're now building state-of-the-art agentic products at Trainline. If you've come here from abroad, I'm sure you must have used Trainline; if not, please do, because it's the state-of-the-art app for buying train tickets and more. So, welcome — we're very excited to host you at this workshop and to guide you through the hands-on experience.
>> Fantastic. I'm also joined by some of my colleagues from Braintrust, who've come over the pond — Phil, Eric, and Rose, if you could just raise your hands. So don't worry, folks, you're in safe hands if you're ever stuck; just holler and they can help. Just a bit of housekeeping as well: could everybody please join the AI Engineer Slack — there's an AI Engineer organization, and there's a specific Slack channel we'll be using today to run the workshop. If you're stuck, we can use the channel to help each other out, which will really matter once we get to the hands-on elements. We're also providing a cheat sheet, so if you hit any particular hurdle, there are step-by-step instructions to get you where you need to go without feeling like you're falling behind.
Again, a lot of the assets we share today are publicly available, and we can do any follow-ups as needed. We'll give you a few minutes to get onto Slack and join the AI Engineer Europe 2026 Braintrust workshop channel — it's a public-facing channel. All righty, team, I guess we can start. Just for the people coming in at the back: we'll need you to join the Slack channel. Thank you, everybody — let's proceed. Okay, to help orient today's workshop, we'll break it into three main sections. We'll spend a little time establishing the background — why we're here and why this workshop is relevant — which will hopefully set the context for when we go into the workshop itself: building the system and talking about how we ship AI quality. Then we'll wrap up with the key takeaways, and we'll be around to answer any open questions from the field. Hopefully there are people here from very different backgrounds. We intended this workshop to cater for — well, probably everyone here knows what an LLM is, I don't want to insult anybody — but hopefully you're starting to explore your journey in maturing your operations for building AI systems. So whether you're an AI product engineer, probably on an applied team, or you come from traditional machine learning, or from platform, operations, or infrastructure — this is really appropriate for you. Just a quick show of hands: who here comes from, let's say, a traditional data science background? Okay, interesting. And who here comes from software engineering and is pivoting into AI? Okay — so this is definitely the right room for you, and hopefully we can accelerate things as we progress. Now, to the context. I think this is not an uncommon story; we're seeing it broadly across the industry. Again, a show of hands: who here has done a machine learning or AI PoC — let's say, more specifically, a generative AI PoC — but then failed to take it into production? Okay, a few hands. I'd be worried if nobody had. And this is the key thing: speaking to many of my customers — executives, top-level folks — this technology is all the rage. It's not necessarily new, but it's newer for a lot of folks, especially in enterprise and regulated industries, and they're trying to take it all the way to delivering value to their customers. Unfortunately, there's a big hurdle between what you might develop locally on your machine and industrializing it — making sure it works in anger. And the key thing we've seen from all the research out there: it's not that the models aren't smart.
We've got very sophisticated models, whether you're building something in-house or using the top off-the-shelf commercial LLM providers. What we see more broadly is that the operational rigor for delivering these systems at scale has not kept up. Traditional software engineering is very deterministic: one plus one equals two, great. LLM systems — as my ten-year-old, sorry, five-year-old would say, "two plus two equals ten, daddy." So it's about adjusting for that and making sure we deliver at scale. Looking at it holistically, what we see when shipping these things: first, people think a demo state is suitable. But doing two, three, five demos is great — put it in production and everything goes awry. Second, treating logs as observability. This is really critical to how we work at Braintrust: logs will tell you what has happened, but sometimes you need to go deep into the system and understand its behavior, and that's where observability comes into play. Another one: works on my machine, fails in production, so I try to patch the prompt — and then it's operational until the next issue or the next failure mode. But how do we keep track of that, especially without a system in place? Irrespective of the tooling you use, this can have a categorical effect. A lot of what we see isn't about the tooling or the technology; it's about operational workflows, and that's really what we're aiming to help with in today's workshop. So again, it's not the prototype. It's getting to a state where we know exactly what's changed in the system, how we interact with that, and how we systematically apply a set of rigor so we get better and better. Remember, our target is not 100% coverage; it's getting as close as possible while fixing the gaps that exist. And something we see time and time again: one prompt might work, but as you industrialize, you probably want to break things down into individual sectors of responsibility. If you come from a software engineering background, you know this pattern — breaking down the monolith into microservices. We'll outline a very similar approach when we talk about building these systems: understanding the changes, putting systems in place. That's really what we want to do in today's workshop. In terms of today: as I mentioned, this is hands-on. We're going into the terminal, into the UI, step by step, guiding you along the way. We'll build a staged AI system with multi-stage tool calling, which lets us see a more agentic flow. We'll then use Braintrust to instrument it and see how the application is performing. We'll look at identifying failure modes using a golden set, and push that through. Then we'll talk about industrializing it — moving from "it works on my machine" to something you can use in production and manage going forward, with a system in place.
And the key thing is identifying those edge cases, because you can create a test dataset, but ultimately there's no substitute for real-world data. So we'll show you how we take those real signals in, evaluate, and complete the loop. A bit of introduction as well: who here has heard of Braintrust, or played around with it? Can I get a show of hands? Okay, great — love it. So, a brief introduction to Braintrust. We're a company just shy of three years old, now a Series B company — a few months ago we announced that we raised $80 million at an $800 million valuation, with investors such as ICONIQ, a16z, and Greylock — to help organizations ship quality AI at scale. We're the platform for AI observability. We've got a heavy user base globally, and we're expanding heavily in Europe — I'm one of the first engineers to join and help build out our go-to-market function here, and we're very excited to do that with our customers and our friends at Trainline. Some of our other local customers include Lovable and Doctolib, who are really pushing the forefront of these AI systems. Where we really distinguish ourselves is being able to do this at scale. Our founder, Ankur Goyal — this is actually his third time building a company — is an expert in database systems. He previously built a company called Impira, which did document extraction and was acquired by Figma, and he led the machine learning team there. He realized that building these evaluations is hard, and understanding production traces is hard — and if he was having that issue, other organizations surely were too. So he founded Braintrust to help do this at scale. And as these traces come in — highly semi-structured data that keeps changing — he realized traditional analytical systems weren't fit for purpose. So we created a new category of database called Brainstore, which helps identify and accelerate this at scale. We're also tool- and platform-agnostic: irrespective of the agent framework or LLM provider you're using, we intend to help you deliver value. One concept I'll keep returning to in this workshop is the flywheel. If you've ever done agile development, you know perfection is the enemy of good — we want to start somewhere. Even if it's a new application and you don't know how it will behave in production, we can start with an evaluation set. If you have an existing application, great — we can instrument it, pull that information in, and identify the failure modes. The key thing is: get information into the system, identify the failure modes, remediate, ship it out, monitor — and complete the cycle again and again until you get where you need to be. All right then — you've heard a lot from me.
So one thing I'll do now is hand over to my colleagues at Trainline to share their experiences — what things looked like before Braintrust, and how it's helping. Over to you.
>> Thank you very much, and hello again to people who are joining us just now. As my colleague introduced, Trainline is a company that actually helps people get on trains. Trains are different from planes — if you don't know, there's a worldwide central system for all the planes in the world; that's not the case for trains. In Europe and the UK it's very hard: if you wanted to install an app for each carrier, it would take all the space on your mobile phone, for sure. Trainline is the platform that solves that: it helps you book tickets — mobile-app agnostic, platform agnostic, carrier agnostic. You do it in one app, and you can book a train from Paris to London, or from Lyon to Milano, whenever you like — basically every carrier in the EU. We sell almost £6.3 billion of train tickets, with 27 million active users and counting. And the other interesting number for this conference is how many AI conversations we have with our travel assistant. We have a travel assistant that's exposed to real users, and it's not just a chatbot — it's a multi-agent system that can actually handle refunds for you and handle changing trains for you. It's a very proactive, agentic system, not just a chat. Mayan would probably like to add more about that.
>> So one of the benefits of having 27 million active customers is that you've got a huge space in which to serve agentic applications live to customers. One example is the travel assistant Usama talked about, which you can get to from a ticket window in the application. It's an agentic system, which is something we want to talk about a little later in the slides.
>> Awesome. Which brings us to the next point — we'll keep it short. If we're a train-tickets company, how come we're doing machine learning? Of course we are — there are so many things to do to help people with their journeys: getting their tickets, getting their trains, getting back home. We do two things. There's the classic ML part — we build ML models inside Trainline from scratch, from data to model. And we also build the multi-agentic, generative AI systems we're now all familiar with, on top of the LLMs we love and cherish, with all the tooling and context engineering. So we do both sides of the story at Trainline, and here are two examples of what users are actually using on top of those systems. On the left, you can think of it as your weather application, but for train disruptions. You have a ticket for a train, and we know if that train is likely to be disrupted or late. We know that from the huge amount of data we have, with a machine learning model trained on top of it that can actually predict train disruptions and delays. That's the classic ML part. The other one is the travel assistant I told you about.
As I said, it's a very advanced multi-agent system. It can show you alternative trains if your train is cancelled or something's wrong with it — good luck doing that yourself, even with ChatGPT. The other one is handling refunds: you can actually get a refund on your ticket if your train is late, and it can do all of that without a handover — or it can hand over to an actual human in our lovely customer support team. Which means, if you're doing this in production at Trainline scale, you can ask the question: are we breaking things at Trainline? Of course we don't want to — and we are moving fast, because the technology is definitely moving fast. That's why we're here: to show you what we're doing to move fast without breaking things, in terms of AI, on top of handling the complex software systems we love and cherish — APIs at scale, serving millions of users. The way we like to think about it is as a scale. Our software systems are the deterministic side of the story; on the other side, the ML models are the non-deterministic side. And we know for sure that agentic systems sit in between: parts of them are deterministic, and parts of them are definitely not. That's the framework of thinking we have around quality. How are we handling it? For the ML models, we care about the quality of the data we use for training, but on top of that we run machine learning evaluations, offline and online. Offline means before going to production, you do your evaluations; online means you take data from production and evaluate whether your model is doing well. In our case — for that disruption "weather forecast," for instance — is the model correct in predicting whether the train is disrupted or not? That's one example. On the other hand, for people familiar with software engineering, we do all the usual quality checks, with the tooling we have for handling quality at very large scale at Trainline. And for the agentic systems, you've already guessed it: it's a combination of both. It's not one without the other. We do everything we do quality-wise for deterministic systems, and we also use the evaluation side from the non-deterministic systems. So that's the framework of thinking we have at Trainline for these systems, and Braintrust is definitely helping with that. And by the way, a disclaimer: we're here only because we're convinced Braintrust works for us — we're not being paid, that's 100% true. We're happy customers, we've been with Braintrust for a long time, and a number of Braintrust features we'll actually use here.
For instance, for the ML and AI evaluations, we do exactly that — we follow the scoring of the travel assistant on many levels, from tone of voice to actual helpfulness when it comes to tickets. And tickets are really complex in terms of reasoning — what you should and shouldn't get depending on whether the train is late, whether the ticket is a return or an Advance — so many complex cases. But yes, we follow that evaluation side. Mayan would probably like to add more.
>> Yeah. Can I get a raise of hands from people who have struggled with LLM costs, token counts, switching models — problems like this? Yeah, nice to see. It was a problem at Trainline as well, because we do this at scale, and the amounts we pay OpenAI and Anthropic are just bizarre. So we keep switching models — the best model for our use case, cheaper models, models with more efficient token usage. Now, any time you switch models, you want to be sure the new one performs at least at the level of your current model. Before Braintrust we had no specific way of doing that, because we didn't have scores set up — we didn't have the evaluations. What Braintrust has enabled us to do is simulate what the performance of the cheaper model would look like. We've used Braintrust extensively to run offline evaluations and see what the effect will be, and then online evaluations to confirm the effect we observed offline holds as expected. That's one use case. The other is more generic: it would have taken us a lot longer to evaluate a new feature shipping into the travel assistant, but with Braintrust we've been able to assure ourselves that the new experience for the user is going to be good. So Braintrust has helped us a lot in shipping fast.
>> Cool. And the other part is observability, for sure — we use Braintrust for observability, if this one is working. Yes. This is a true example from Braintrust — we're basically spoiling the workshop, whatever you're going to see there — but we can track everything about tool calling and the agents, in terms of numbers and in terms of quality. Those are the kinds of insights you'll need to get from proof of concept to an actual system in production in your company — something that's genuinely used out there, that users find to be a true working product and not just some AI-generated slop. We definitely need that for those cases. Would you like to add something here?
>> I was just going to say that Braintrust lets you look inside complex agentic workflows, down to the tool-call level and token level, which is very insightful and helps you debug a lot of things in production, and before you deploy.
>> Yeah. And one last thing is the cross-functional friendliness. That's something we discovered along the way. Because we're building these systems — from the travel assistant to the models — and we're a big company, we have people from product, people with non-technical backgrounds, and we need to communicate and share many things, and they also need to self-serve some of those things.
We can't babysit people and say, hey, let me get you the logs, download them, and send them over — that does not work at scale. We need a way to work cross-functionally and let people self-serve with data and insights. That's something else we discovered along the way, and Braintrust helps us with many of the requests we had — really appreciated, for sure. And of course, we're using just part of the system; we have our own things on top, and I think you're building even more tooling, so definitely more to come. The workshop will help you discover all of that. Have fun building — I'd say laptops out — and if you're interested, Trainline is hiring on the AI engineering side. Over to you.
>> Perfect, thank you so much, team. And just to say, an impartial thank-you on behalf of my wife: thank you so much for the SplitSave functionality — kudos for that. All right then, team, let's proceed with the setup. As mentioned, this is a hands-on workshop, so better dust off your bash skills — jokes. Hopefully everybody, or at least most folks, have joined the AI Engineer Slack organization and the channel. I've put a link up, and there's also a QR code for the repository we'll be using today, so I'll give it a few minutes. The really key things: sign up for a free Braintrust account — if you use Gmail, you can use the little plus-sign trick to create a new account if you're an existing Braintrust user, so today's work doesn't pollute your existing organization. You'll also need access to an OpenAI API key; that should be easy for you to generate, but if for some reason you can't, let my colleagues know and we'll DM you one on Slack to use for this session. And since this is an engineering conference, especially in AI: if you use AI coding assistants, feel free to use them in your IDE or terminal to guide you along the way — ask questions about the codebase and so forth. I've tried to simplify a lot of the scaffolding. I'm using mise to manage node and pnpm on the machine, but alternatively you can download node v22 — the specific version pinned there — and it should be fine; I've fixed that runtime specifically for this workshop. I'm using make as a bit of syntactic sugar to wrap the commands; if you don't have make installed — say you're on a Windows machine — you can just run the pnpm commands, which the make targets simply wrap from the package.json file. So we'll give it a few minutes for folks to scan and clone. Worth pointing out as well — I can see in the sheet that lots of folks have already gone through it — we provide a step-by-step guide.
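For reference, the setup flow just described boils down to roughly the following commands. The target names are as mentioned in the session, and the `.env` variable names are assumptions — check the repository README and cheat sheet for the exact ones:

```bash
# Clone the workshop repository (URL is pinned in the Slack channel).
git clone <workshop-repo-url>
cd <workshop-repo>

mise install   # provisions the pinned node v22 + pnpm (or install node v22 yourself)
make install   # wraps `pnpm install`; on Windows without make, run the pnpm scripts directly
make setup     # sanity-checks the environment

# Credentials expected in your .env file (names assumed):
#   OPENAI_API_KEY=...      # generated from your OpenAI account
#   BRAINTRUST_API_KEY=...  # generated under your Braintrust profile
```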
So, as mentioned, as we proceed with the workshop today, if you're stuck, please ask questions in the Slack channel, but also refer to the cheat sheet. It's a public-facing asset and will stay up, so even outside this workshop you can go through it at your own leisure. Maybe a quick show of hands: who here is not able to get an OpenAI API key? Hopefully folks can provision one — it's really critical for this.
>> Yeah, some people are still joining.
>> Wait, we don't have the QR code for the Slack.
>> For the Slack. Yeah, I can do that. Yep.
>> Okay, folks, I think we had a few late starters. If you joined late, please scan that QR code, which gives you access to the AI Engineer Slack organization, then join the AI Engineer Europe 2026 Braintrust workshop channel — it's public-facing. Through that channel you'll get the pinned content: the repository, and the cheat sheet we'll be following. Just a sense check, folks: hopefully most of you have joined the Slack channel and been able to clone the repository.
>> All right then. As I mentioned in the agenda, what we'll be building is an example support triage agent. Hopefully this exemplifies a lot of the systems folks have already built or are trying to build. It is a fictitious application designed specifically for this workshop — please do not use it in production. It's really there to teach us how to build these AI patterns for scale. The idea: given a ticket raised in a particular system, we have an agent that goes through a staged process with tool calling, producing a set of information that can be emitted to downstream systems. To help visualize it: we take the ticket input; the first step collects context, a fairly deterministic way of extracting information; then we proceed to the agentic portion, which we've broken down into three stages. We have an LLM with tool calls to triage the particular issue. Then a policy reviewer makes sure the output is correct. Then we create and draft a customer-facing reply with a reply writer. Finally, we package this together, and depending on how severe the ticket is, we can invoke another tool call to decide whether to escalate to, say, a human in the loop, and then draft the final result. Not too complex, but it gives you an idea of the system we'll be building and operationalizing in this session. One thing to point out as we begin to use Braintrust: where does it fit in helping you deploy and manage this at scale? I've isolated that here. Everything we do, especially as we progress, we'll trace end to end. And as my colleagues talked about, we'll be able to use managed mode for prompts.
So we're offloading what you'd do locally into a secure environment; tool calls will be managed as well — that's about reaching external systems — and lastly evaluations and scores, which will run through our Braintrust infrastructure. Worth pointing out: if you go into the GitHub repository, there's a GitHub Pages version of this too, so the slides are readily available for you to use and consume. A key thing as well: I've tried to checkpoint each phase. If you're stuck, the idea is to just `git checkout` the specific tag or branch — each tag or branch is fully runnable at that stage. So if you're ever stuck, check out that branch, do your `make install` and `make setup`, run the commands, and you should get near-identical output every time. This is all documented in the README and included in the cheat sheet. So, the sequence of events: we'll do the scaffold and setup, and I'll talk about building a basic agent. This workshop isn't really designed to teach building agents — it's about what happens once you have one: how do you operationalize it? But for completeness we'll do it step by step. Then we get to the core: adding the tracing, talking about the evaluations and the golden set, identifying where we can improve, then tying it all together using managed infrastructure within Braintrust to operationalize — the collaboration my colleagues at Trainline talked about. And the key final piece: once you identify a production failure, how do we apply a fix, complete the flywheel, and give you a finalized asset? Okay. Step one: building the agent. In this checkpoint, we need to start somewhere, right? So what do we do? We make an initial call to an LLM: set up a prompt, one shot, one in, one out, and get an output. If you're doing a proof of concept, this might be great as part of that initial spike, but we know there's work to be done — just because it works in the demo doesn't mean it works in production. I've put some pseudo-code on the slide to articulate this: as with all these language models, we provide a system prompt, we provide the user text, and we pass the output back into our application. One function, one model call, and a structured output we want to receive. When we do the scaffolding and check out the first branch, we can run a set of pre-built tickets — cases already done in code — or use the command line, and you'll get something similar in JSON format. Okay, hopefully everyone can see my screen — I have the application here, and I'm going to check out the basic-agent tag. For reference, the one-shot call we're about to look at is roughly the sketch below.
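This is a minimal sketch in the spirit of the workshop code, not the repo's exact source — the prompt text, the `TriageResult` shape, and the env-var names are illustrative assumptions (the session pins gpt-5-mini via an environment variable):

```typescript
import OpenAI from "openai";

// Illustrative output shape -- the real repo's schema may differ.
interface TriageResult {
  category: string;                       // e.g. "billing", "account_access"
  severity: "low" | "medium" | "high";
  escalate: boolean;
  reply: string;                          // draft customer-facing reply
}

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function triageTicket(ticketText: string): Promise<TriageResult> {
  // One function, one model call, one structured output.
  const response = await client.chat.completions.create({
    model: process.env.MODEL ?? "gpt-5-mini",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You are a support triage agent. Classify the ticket and respond " +
          "with JSON: {category, severity, escalate, reply}.",
      },
      { role: "user", content: ticketText },
    ],
  });

  return JSON.parse(response.choices[0].message.content ?? "{}") as TriageResult;
}
```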
So what's this? If I take a look here, I'm now checked out at that first tag — building a basic agent. I've got the application here, which is very similar to the pseudo-code: we're creating the client and calling the OpenAI SDK. For brevity I didn't use any agent SDKs, but we fully support those too as we proceed with the workshop. And as for the make file — if you don't have make installed, you can just execute the pnpm scripts; or if you use npm or yarn, that should also be possible. I haven't tested it, but theoretically it should work, because it's all just package.json. So let's say I want to do `make ticket` and give it something like "my password needs to be reset." I've provided some defaults, so I just hit enter, enter — and now it's making a call to OpenAI. As you probably saw from the environment variable file, I'm using GPT-5 mini; you can switch it if you want, but for this we'll keep it simple. So you can see the ticket, and it's produced some output here. But again, that's a single shot. Based on this output, it looks fairly plausible, but it's not going to account for a lot of the edge cases we want, especially with organizational nuance we're trying to build into the logic. I can also do `make demo` — same thing, calling the same function for a scripted set of tickets, codified as JSON, so the JSON fields are visible. Okay, that's fairly straightforward; I don't want to dwell on it. The next thing is adding local tools. We want to say: even though a prompt might be well structured, let's bring in different ways for it to operate. In this case I'm calling three different tools: looking at relevant help-desk articles, which could be internal or external, and looking at things that have happened on the account — say a customer did a migration that impacted backend systems, which is probably a reason to create an escalation. For the purposes of this workshop I've made the tools deterministic and checked them in as code, but in reality you'll probably be interfacing with external systems — vector search, MCP, CLIs, and other interfaces — to build out this capability. And the key thing: the more you add, the more ways it can fail, which is why tracing becomes more important as we go. Right, back to the README; the next step is checking out the add-local-tools tag. The tools I created are available here, just checked in as code for simplicity's sake. And I can do the same thing: `make ticket` — "password needs resetting, account locked." Okay — now you can see it's provided a bit more information. Wiring tools into the call looks roughly like the sketch after this.
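A minimal sketch of that tool-calling loop — the tool names, return values, and schemas here are illustrative stand-ins for the repo's deterministic tools (help-desk article search, account events, and so on):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical local tools, standing in for the repo's checked-in ones.
const localTools: Record<string, (args: any) => unknown> = {
  search_helpdesk_articles: ({ query }) => [{ title: "Reset your password", id: "kb-101" }],
  get_account_events: ({ accountId }) => [{ type: "migration", date: "2026-04-12" }],
};

const toolSchemas: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "search_helpdesk_articles",
      description: "Find relevant internal/external help-desk articles",
      parameters: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
      },
    },
  },
  // ...one schema per tool
];

export async function triageWithTools(ticketText: string) {
  const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
    { role: "system", content: "Triage the ticket. Use tools to gather context." },
    { role: "user", content: ticketText },
  ];

  // Loop until the model stops requesting tools.
  while (true) {
    const res = await client.chat.completions.create({
      model: process.env.MODEL ?? "gpt-5-mini",
      messages,
      tools: toolSchemas,
    });
    const msg = res.choices[0].message;
    if (!msg.tool_calls?.length) return msg.content; // final answer

    messages.push(msg);
    for (const call of msg.tool_calls) {
      // Execute the requested local tool and feed the result back.
      const result = localTools[call.function.name](JSON.parse(call.function.arguments));
      messages.push({
        role: "tool",
        tool_call_id: call.id,
        content: JSON.stringify(result),
      });
    }
  }
}
```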
You can see it's a little more verbose, because we've introduced tool calls, which give more context to the LLM. Also worth pointing out if you're feeling stuck, folks: the workshop tags and branches build sequentially. If you go to, say, tag number six, it includes everything up to that point — so don't feel you have to go through each one; if you're stuck and want to skip ahead, you can. Okay, on to the tools slide — I already showed the code, but the pseudo-code gives you the idea. Next: stages. This is where we break that monolithic LLM call down further. We've introduced tools; now we drill down into specialist stages of how the LLM should behave. As you saw from the sequence diagram, we've effectively got five stages: collecting the context; triaging; determining whether it meets brand policies; providing a customer-friendly reply plus something internal for our systems; and finalizing the result for downstream systems. It comes straight from traditional software engineering: break down your problem so you can see exactly where something's going wrong in the stack and remediate. Where possible, be explicit and break it into chunks you can work on. There's a bit of pseudo-code on the slide for what that looks like — you could put a debugger in your IDE and step through it. So, I've been doing this in parts: we've covered the starting point, and now the specialist stages. Following the README, `git checkout` the specialist-stages tag. If I look now in the source folder, it has the individual functions pieced out, with the prompts associated with each — here's the triage one we're using — and down in the application you can see it's done with sequential async functions. Similarly, let me `make ticket` with something a bit different — what's another classic problem? "I need to upgrade my plan from pro to enterprise but the website is not working, getting 500 errors." In this case I'm customer tier two, we're talking about billing, and my account is account number three — just to show you this is live, not hardcoded. Worth pointing out, this stage is slightly slower, because we've broken the one-shot LLM call into sequential calls; it's expected to take a bit longer, but that's part of building out the agentic flow. And now the output is more verbose, but there's a lot more thought in this agent. We talked about a billing issue; it believes severity is quite high, because the customer wants to upgrade tiers and there's a revenue impact, so in this case we should escalate to the appropriate people on our side. Roughly, the staged orchestration looks like the sketch below.
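A sketch of that five-stage orchestration — the function names are paraphrased from the stages described above, not copied from the repo, and the bodies are placeholders:

```typescript
// Each stage is a small, focused unit -- the "microservices" analogy from earlier.
// collectContext is deterministic; the middle three are LLM calls (with tools);
// finalize may invoke an escalation tool depending on severity.

interface TicketContext { ticket: string; accountEvents: unknown[]; articles: unknown[] }
interface Triage { category: string; severity: "low" | "medium" | "high" }

async function collectContext(ticket: string): Promise<TicketContext> {
  return { ticket, accountEvents: [], articles: [] }; // deterministic extraction (placeholder)
}
async function triageSpecialist(ctx: TicketContext): Promise<Triage> {
  return { category: "billing", severity: "high" }; // LLM + tool calls (placeholder)
}
async function policyReviewer(ctx: TicketContext, t: Triage): Promise<Triage> {
  return t; // LLM checks brand/SLA policy (placeholder)
}
async function replyWriter(ctx: TicketContext, t: Triage): Promise<string> {
  return "..."; // LLM drafts customer-facing + internal replies (placeholder)
}
async function finalize(t: Triage, reply: string) {
  // Escalate via a tool call when severity warrants a human in the loop.
  const escalated = t.severity === "high";
  return { ...t, reply, escalated };
}

export async function runPipeline(ticket: string) {
  const ctx = await collectContext(ticket);
  const triage = await triageSpecialist(ctx);
  const reviewed = await policyReviewer(ctx, triage);
  const reply = await replyWriter(ctx, reviewed);
  return finalize(reviewed, reply);
}
```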
Back in the output, you can see what the escalation reason is, along with both the internal note and the customer-facing reply. We've also included a confidence score to indicate whether this is a true issue — and as mentioned, tool calls can pull in information from across different systems to provide a greater level of confidence. Okay, perfect. With that in mind: we've now shown how to take an agent, break it down, and build something multi-stage with tool calling. The next step is to capture that information and start tracing it, so we can see what's happening down to the individual details — and this is where the observability piece comes into play. What we want in this section is to break down the full execution path. We know the structure is deeply nested: there are tool calls, with additional function calls behind them. We want to track the key things — and Mayan pointed out the early struggles around latency, cost, and token counts; time to first token in particular is a metric many of our customers care about. Also: what were the inputs, what were the outputs, what metadata is associated with them — plus additional fields so that, when we get to monitoring and observability, we can query them in the UI on the fly and set up alerting if needed. Having an output is not enough; we need to understand the full execution path, and that's what tracing gives us. To give you some context — and I know this can be a contentious topic, since I come from a full-stack background and have worked on both sides of that coin, Python and Go as well as TypeScript — our SDKs are multilingual: Ruby, Go, I think even .NET has some folks using it. I've kept to TypeScript for simplicity today, but our SDKs cover a range of languages you can start tracing your application with. A real key thing — and one of the challenges I see with customers broadly — is that they'll log each individual interaction as a singular parent span, even with multi-turn, conversational agents. What we want is to trace into a nested structure, so you can see everything for one interaction — say, one conversation — under one parent span. It's critical, as we instrument the application, to get the right structures in place; otherwise you won't see the full picture of where things go wrong. This comes down to reading the trace. For folks using the traditional observability tools out there, this won't look too dissimilar; for those new to this screen, a trace lets us see not just what has happened, but what's currently happening with your application, in real time.
Tracking every single call, bringing in that metadata — then, depending on where the failure mode is, we can identify the course of remediation to take. So in the next stage we'll add tracing, run it, then go into Braintrust and hopefully see it happening. Per the README: check out the add-tracing tag. I've introduced a helper script here called tracing. What we do quite well in our SDK is avoid introducing complexity where it isn't needed: if you're using a standard LLM provider's SDK, you can simply wrap the client — we provide that out of the box. Then for the individual calls, we set up small helpers to wrap functions as needed; in Python there's a decorator that does the same. In this case I've got a nice little function that handles parent and child spans, and in the application tracing you can see a child span executed throughout. I don't want to go through too much code at this point — the code is expanding pretty heavily — but I want to walk through the key concepts. Before I run the application, let's go over to the Braintrust UI. If you signed up for a free trial account earlier today, it may have created a test project; that's totally fine — we'll create a new project as we execute this. A really key thing, if you haven't done it already: top left-hand corner, go to your profile, then API keys. Enter a name for the key, generate it, and use it in your environment — your .env file — or secure it in a key vault if you have one. There's also the OpenAI key, which we're hoping you generated; that goes under AI providers. I've set this at the organization level, but you can also do it per project, depending on whether you want it segregated by team or environment — fully supported. We'll need this key later for managed mode and online scoring. So, a tidbit: make sure the key you generated goes into your Braintrust organization here, to be used later. Right, with that in mind, I'm going to run the demo. Point of reference: if you're ever stuck with the steps, just run `make setup` to make sure everything's in place. Then `make demo` to run and execute those tickets — I could also use the `make ticket` command. While it's running in the background, if you go back to your application, you should see a project called helper-workshop — that's the one created as part of this workshop. While that runs, the tracing wiring at this checkpoint looks roughly like the sketch below.
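A minimal sketch of that instrumentation with the Braintrust TypeScript SDK — `initLogger`, `wrapOpenAI`, `wrapTraced`, and `currentSpan` are real SDK entry points, but the span names and metadata fields here are assumptions for illustration:

```typescript
import { initLogger, wrapOpenAI, wrapTraced, currentSpan } from "braintrust";
import OpenAI from "openai";

// Sends logs to Braintrust; reads BRAINTRUST_API_KEY from the environment.
initLogger({ projectName: "helper-workshop" });

// Wrapping the client auto-traces every LLM call: tokens, cost, latency,
// time to first token, inputs and outputs.
const client = wrapOpenAI(new OpenAI());

// wrapTraced gives each stage its own child span, nested under the caller,
// so one ticket = one parent span with the full execution path beneath it.
const triageSpecialist = wrapTraced(async function triageSpecialist(ticket: string) {
  const res = await client.chat.completions.create({
    model: process.env.MODEL ?? "gpt-5-mini",
    messages: [{ role: "user", content: ticket }],
  });
  return res.choices[0].message.content;
});

export const handleTicket = wrapTraced(async function handleTicket(ticket: string) {
  // Metadata logged on the span is queryable/filterable in the logs UI later.
  currentSpan().log({ metadata: { customerTier: "tier-2" } }); // hypothetical field
  return triageSpecialist(ticket);
});
```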
If you navigate to the Logs tab, you'll start to see this come through in real time. Worth pointing out: thanks to the capabilities of Brainstore, we have near-instantaneous writes into the system, with reads available shortly after, and it's a non-blocking function. A lot of our customers — especially the more sophisticated ones — use Braintrust at scale and really push the envelope. It's not "hey, I've got a thousand traces"; they're pushing tens of millions of traces across very short periods, and they want to aggregate over them. As you build more sophistication into your application and roll it out to more users, that volume grows quickly, and you need a system that can handle it at scale. Once the logs start coming in — they arrive in reverse order — I'll take a look at the first one, and we can see the instrumented application fully traced. There's a button here that lets you view it in full screen, which I find quite helpful. Because of the nested structure, we can walk the tree: the interaction at the top is the demo ticket, and at the very top level you see how long the invocation took, the prompt tokens in and out, and the associated cost and latency. I've also included some metadata, because I want to extract and filter on it as needed — I'll show how that works — and all of it is viewable here. For a different angle, I can look at individual steps: say, what happened in the triage specialist, down to the actual invocation of the LLM. The SDK provides a lot of flexibility here, so you can see what went in, the reasoning behind it, the output produced — right down to the final "should we escalate or not." Another view is the timeline: a waterfall, so you can see whether a particular step is taking longer and whether you need to remediate. All right — back to the logs page, refresh. I think four tickets were pushed; they should come through right now. Okay, two tickets at the moment. So the same information we replayed is also available in the console. Yep — we've covered tracing; let's talk about evaluation. This is quite interesting. Depending on where you are in your journey, a lot of customers have already built an application, have it monitored, and pull that data in. But what happens when you have, effectively, a cold-start problem, where you don't know what you're building? What does good look like — and what does "good, shipped" mean? For the support application we're developing today, there are some non-negotiables. Have we categorized the support case correctly?
Have we made sure that blocking issues never come out with a low-severity outcome? Does the escalation stay within policy and SLAs? Does the output structure look sound? And if we make changes, do they actually improve things without breaking anything or causing a regression? We can check all of this with evaluations. Hopefully many folks know what an evaluation is, but for those who don't: think of it as a dataset (your inputs), a task, and an outcome you want to score against — a scoring function. This is a bit different from traditional software development, because of the non-deterministic nature of these systems. In this portion, I'll create what we call a golden dataset. For this support application I've been testing anecdotally, but I want a set of edge cases that gives the business an initial level of confidence that what we release to production is fit for purpose. There's always room for improvement, but I want it to be functional — and I want to be able to say: look, I'm not releasing this application based on vibes; I have a concrete way of showing how it performs over time. To do this we use two main types of scoring functions. The first is deterministic. A lot of folks have probably started here — coming from traditional software engineering, unit tests are a fair analogy. They're easy to run and cost-effective; you're not invoking a model at all. The second type, a little more sophisticated, is LLM-as-a-judge: using another AI system. This is really helpful for nuance that can't be determined deterministically — branding and style, customer satisfaction, and so forth. The most important rule: if you can write it deterministically, do; otherwise, use LLM-as-a-judge. And why does this matter? It's about making sure any change we make is safe to ship as we progress. Okay, pivoting into the IDE again — back to the README, check out the evaluations tag, run `make setup`, and we should be fine. One thing I'd like you to do as well is run the seed-dataset command: `make seed-dataset`. That uploads our evaluation test cases into the Braintrust UI. If I pivot into my datasets now, you'll see one called helper-seed-dataset. For simplicity I've created ten inputs, categorized with metadata: the input, what we expect, and the metadata associated with it — the core structure for an evaluation. As part of this, across the deterministic and non-deterministic types, I've created some scoring functions we'll be using; they're available here, and they look roughly like the sketch below.
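A sketch of the two kinds of scorers — plain deterministic checks, and an LLM-as-a-judge rubric built with `LLMClassifierFromTemplate` from the autoevals library (a real helper, though the rubric text, names, and field shapes here are illustrative):

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// Deterministic scorer: did the agent pick the expected category?
// Cheap to run -- no model call -- so run it on everything.
export function categoryMatch({ output, expected }: { output: any; expected: any }) {
  return { name: "category_match", score: output.category === expected.category ? 1 : 0 };
}

// Deterministic scorer: blocking issues must never come out low severity.
export function severityFloor({ output, expected }: { output: any; expected: any }) {
  const violated = expected.blocking && output.severity === "low";
  return { name: "severity_floor", score: violated ? 0 : 1 };
}

// LLM-as-a-judge: nuance you can't capture deterministically,
// e.g. tone and policy adherence of the drafted reply.
export const replyRubric = LLMClassifierFromTemplate({
  name: "reply_rubric",
  promptTemplate: `Evaluate this support reply for tone and policy adherence.
Ticket: {{input}}
Reply: {{output.reply}}
Answer Y if the reply is on-brand, empathetic, and policy-compliant, otherwise N.`,
  choiceScores: { Y: 1, N: 0 },
  useCoT: true,
});
```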
So: checking the category, checking the schema is in place, checking there's an escalation reason when one is needed — all very easy to run and codify. Then we get a bit more sophistication with the custom rubric: that's the LLM-as-a-judge use case. All right — back to the README. We don't need the demo, we already pushed that through, so let's run `make eval`. This runs an evaluation. On the seed data: you don't necessarily have to put it in code — it could come from a database — but here I'm using a flat JSON file for context. So we run the evaluation, and in the UI we should have a place for experiments. Double-checking: yes, the experiment ran against all the JSON cases, and you should get an output like this in the terminal. Back in the UI, we're starting to track our application against the inputs and outputs of this dataset. Quite similar to the tracing view you saw with the online traces — you get a very similar view here, showing how it works across your experimentation. As we progress, we'll see how to improve it using the set of scoring functions and track that across the UI. (If you need more screen real estate, collapsing the menu helps.) Perfect. Right, let's proceed to the next checkpoint: deploying and managing this. As I mentioned — and I've seen this anecdotally with customers — things work really well on my machine, I check the code in, and now I want to get to a place where I can collaborate a little better: a single point of reference, with version history, where I can identify who made what change. You need a way to bring these users together — the collaboration Usama talked about. What's interesting: if you change the prompt on your machine and ship it to a repository, then a non-technical teammate — a product manager, perhaps — who wants to update the prompt can't do it. They have to tap you on the shoulder. Maybe that's happened to some of you; I know it's happened to me a few times, and it can get really frustrating. So we want a way to pull that together. A really key thing, especially in regulated industries, is reproducibility. I've worked the better part of a decade in banking and capital markets, and with the regulation out there — things like the right to be forgotten — understanding who made a change, especially in a stressed exit scenario, and how we can port this into a new system, is key to unlocking that. And it matters for understanding changes before you make them: we don't want to just push changes to production and ask afterwards what happened; we need to keep that in check.
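Before switching into managed mode: here's roughly the eval entry point a command like `make eval` would run, using the Braintrust `Eval` and `initDataset` APIs (real SDK functions; the project and dataset names mirror the ones shown in the session, and `runPipeline` is the staged agent sketched earlier):

```typescript
import { Eval, initDataset } from "braintrust";
import { categoryMatch, severityFloor, replyRubric } from "./scorers";
import { runPipeline } from "./pipeline";

// Runs every case in the seeded golden dataset through the agent,
// applies the scorers, and records the run as an experiment in the UI.
Eval("helper-workshop", {
  data: initDataset("helper-workshop", { dataset: "helper-seed-dataset" }),
  task: async (input: string) => runPipeline(input),
  scores: [categoryMatch, severityFloor, replyRubric],
});
```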
And so for us, we're now introducing some new capabilities. Up to this point you've been running the tools, the prompts, everything, on your local machine. What we want to do now is offload that into Braintrust, so that when your application runs in a secure environment it can refer to Braintrust to pull that information and drive the path of execution, which still follows the tracing mechanisms we've set up. By default, the make commands run with the runtime mode set to local; if you want managed mode, use the managed prefix, and the remaining make commands work the same way.

So let me pivot into the IDE and go back to the README. Okay, git checkout. Taking a look here, we'll do make setup. Right, and the key thing I want to emphasize at this point, because we're managing this in Braintrust, is to use the setup-braintrust command. What that does is package up those scoring functions, tools, and prompts and push them onto the secured infrastructure. You should get an output like this, and here's what it looks like in the UI; I'll go back to the overview. On the left-hand side, if we look at Prompts, you'll start to see the three prompts we created as part of that workflow. To give you an idea, let's open the triage specialist. You can define a slug, essentially an immutable ID, which you then refer to from code; it can also be generated if needed. The prompt itself is treated as code, and we use interpolation if you want to parameterize it; I'll show what that looks like shortly. For the scoring functions, let's click on Scores; it'll come up in a second. And take a look at Parameters: I've created a parameter specifically to simplify changing the baseline model.

Maybe a quick show of hands: these models get released so quickly, has anybody had a PM, or a non-engineering teammate, say, hey, can I change the model or the prompt and see what that looks like? I think we've got a few hands here. Great. So the good thing with this managed parameter is that those non-technical teammates can come into Braintrust and change it here: write a comment saying we're testing a new model, pick something like 4o-mini, and save that as a new version. And what I'll do as well, just to give you the end state, is run make setup-braintrust again. Every time something changes you can run this command; I'm just doing it here for brevity, mostly to keep the model in sync. So if I go to Prompts, that's in sync, and the parameters should be there as well. And this is a runtime toggle: by running in managed mode, I'm changing the course of execution, not resolving the model locally but following the path that Braintrust has set for it. Okay.
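Here is a sketch of what managed mode changes at runtime: instead of a prompt hard-coded locally, the app pulls the prompt, along with the model parameter a PM may have changed in the UI, from Braintrust by its slug. The project name is an assumption; the slug mirrors the one shown in the demo:

```typescript
// Sketch: resolving a managed prompt by slug instead of using local code.
import { loadPrompt, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

const client = wrapOpenAI(new OpenAI()); // wrapped so the call is traced

async function runTriage(ticket: string) {
  const prompt = await loadPrompt({
    projectName: "support-helper", // assumed project name
    slug: "triage-specialist",     // the immutable slug defined in the UI
  });

  // build() interpolates variables into the stored prompt and returns the
  // model + messages, so switching the model in the UI needs no code change.
  return client.chat.completions.create(prompt.build({ ticket }));
}
```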
And if everything goes well, you can see that I've changed the model here. I didn't have to make any code changes; all I did was go into the UI, change the model, or any other parameter, run that, and have it establish a new baseline for the evaluations. If you want, you can also run the demo script to push in the demo tickets; I'm skipping that for today's workshop. One thing I do see with many of our customers, and it's not necessarily a cause for concern, is the reaction: hey, so Braintrust now controls access to these parameters? What I would say is that Braintrust is not intended to replace that rigor. You probably still want version control systems tracking everything anyway. What we're saying is that when it comes to operationalizing this and letting other users work on a shared system, this is the recommended path. You'd still keep your prompts, your tooling, and your parameters in a centralized place, but add automation to synchronize them, and that's how we've seen customers take best advantage of this.

Okay, this next portion is about online scoring. Now that we have those evaluations in place, we want to apply that scoring to actual live production logs coming through the application. It's great that we've run our test cases and have some confidence things work, but there's no substitute for production data; we all know this. So we're going to move that logic into Braintrust and set up what we call automations, which track and evaluate logs as they come in, in real time. Worth pointing out when you start your journey, especially if you're using an LLM as a judge: begin with a higher sampling rate, so that as logs come in you can establish a baseline. But there's a trade-off, because these calls can be quite expensive, especially with more sophisticated models that need more reasoning. Once you're happy with the output, reduce the sampling rate down to something like 5 to 10%, so you're managing your cost effectively. Deterministic scores, again, are cheap; I recommend running them all the time.

>> Sorry, what was that question?
>> Yes, I'll bring that up in the UI and show you.

Okay, so back in the IDE, into the README here. Ah, that's why the managed tools got mixed up; let's just do a git checkout. And as I mentioned, the checkpoints all build on each other, so skipping ahead is totally fine. Git checkout, then make setup, everything should be fine, then make setup-braintrust. In this case I'm also pushing the tools for production; again, this is just to help accelerate things where possible.
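This is an illustrative sketch, not the Braintrust automation itself, of the sampling trade-off just described: deterministic checks run on every log, while the expensive LLM judge only runs on a fraction of traffic. `runRubricJudge` is an assumed helper:

```typescript
// Sketch of the cost trade-off between deterministic and judge scoring.
const LLM_JUDGE_SAMPLE_RATE = 0.1; // start near 1.0, then taper to ~5-10%

// Placeholder for the LLM judge call; the real one would hit a model.
async function runRubricJudge(log: { input: string; output: any }) {
  return 1;
}

async function scoreIncomingLog(log: { input: string; output: any }) {
  const scores: Record<string, number> = {};

  // Deterministic scores are cheap: always run them.
  scores["escalation-reason-present"] =
    log.output.escalate && !log.output.reason ? 0 : 1;

  // LLM-as-judge is costly: sample rather than score every log.
  if (Math.random() < LLM_JUDGE_SAMPLE_RATE) {
    scores["triage-rubric"] = await runRubricJudge(log);
  }
  return scores;
}
```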
Coming down, if I hit refresh: the scoring functions you saw earlier in code are now managed in Braintrust, so you can see them available here. The one I want to call out is the triage scorer, an LLM as a judge, which has been applied to the output, and there's an automation rule in place, quality online. I've automated this setup as code to bootstrap the project, but what we're able to do here is choose whether the automation runs against an individual span or the entire trace. And this is why metadata is so important: we may only want to score specific failures that happen within the code, depending on the use case. My sampling rate, as I mentioned, is set to 100%, but for more expensive calls you'd want to taper that. So this is how we'd set up that automation in Braintrust today.

>> Is that like packaging?
>> No, automations are more the execution against incoming logs. More holistically, I've also applied automation to setting up this environment and scaffolding, so I just want to delineate those two things.
>> Sorry, a question: what kind of things are you scoring?
>> Could you expand on that, please?
>> There's no ground truth, so how can you score online without ground truth?
>> Yeah, in that case you'd probably want to take it as an edge case, push it into a dataset, label it, and feed it back. That's the approach I'd take if you don't have ground truth already. This is why I said, depending on where you start, it's better to have some kind of data and build your flywheel from there.
>> And once you have that data, how do you apply it to the judge?
>> Oh, in that case, we'd put it into a dataset and replay it through the playground. That's the way we'd do it.

All right, so we've checked off online scoring. Now to the remediation portion, which is hopefully where you see the delta, and why we're here today. Here's something that might plausibly come into our agentic system. A user says, "Hey, this isn't urgent, but our CFO can't export the invoices before the board meeting." The model reads "not urgent" and triages it as low priority, but the business reality is very different: this probably does need immediate attention, since I'm sure that CFO has an end-of-quarter report to prepare. This is the difference we're addressing today: identifying a genuine failure mode and then remediating it where possible. In this case I've got this scenario in the dataset, so the plan is to replay the failure, run a specific evaluation against it, tighten the prompt, run it again, and see what that looks like. And we want to do this not just against the one test case but across the entire test set, to confirm the fix works as intended and we haven't regressed on something else. Okay.
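One way to act on the ground-truth question above is to copy the misrouted production log into a dataset with the corrected label as its ground truth, turning it into a permanent regression case. The names here are assumptions carried over from the earlier sketches:

```typescript
// Sketch: capturing a production failure as a replayable dataset row.
import { initDataset } from "braintrust";

async function captureFailure() {
  const failures = initDataset({
    project: "support-helper",
    dataset: "replay-failures",
  });

  failures.insert({
    input:
      "This isn't urgent, but our CFO can't export the invoices before the board meeting.",
    // The model triaged this as low priority; the corrected expectation
    // becomes the ground truth for future evals.
    expected: { category: "billing", priority: "high", escalate: true },
    metadata: { source: "production-log", failureMode: "urgency-misread" },
  });

  await failures.flush();
}

captureFailure();
```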
So we have two separate branches for this. One has the failure baked in; we'll go through the UI and view that, and then we'll cover the remediation part. Okay, next step: set the runtime mode and run make replay-failure. I've got the set of five cases representing the regression failure modes, and the first one, the example ticket you just saw, is stored as a JSON file here. If I take a look at the failure mode, you can drill into it. And you'll notice, now that we've set up the managed tools as well as online scoring, the trace becomes even more sophisticated, because we're executing against the secure Braintrust environment; we've moved from local to managed. I'm going to go, sorry, not the Makefile, package.json, to the impact scenario, and run an evaluation against that specific scenario.

>> Do you also want to show the monitoring, latency and things like that?
>> Okay, yep. So, as we progress with the application, the experiments viewer lets us see the progress of our changes. You can see we've run the latest set and noticed some degradation here, which lets us track it a little better. The failures in the managed dataset are being captured, and I have the ability to compare against existing experiments to track the progress and remediate where possible.

So, forgive me, next I'm going to go to the README and proceed with the remediation. Git checkout, make setup. In this remediation I've changed the prompt, and one way to view that is to look at the prompt and its diff, so you'll see the differences. Coming down to the README here, one second, let's run the Braintrust push command.

>> There's a specific flag, I think. Do you want to double-check that?
>> Yeah, that's it. If you want, post "if exists: replace" in the chat.

So, folks, in the remediation script, if you set the Braintrust if-exists environment variable to replace, it's going to push the updated changes into the environment. And you'll see now that we've updated the prompt via code and pushed it up, it's available here as well. In the UI, as part of the operationalization, we can see not just who changed what, but what exactly changed in that particular prompt; take a look, in this case we're including tools. Okay, coming back here: let's run this evaluation again. We made the change to the prompt, and now we're running the remediated version to see how it performs. Perfect. Okay, so that ran the experiment with the new changes, and I'll pivot into the UI.
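Here is a sketch of the remediation re-run: same dataset and same scorers, new prompt version, written to its own named experiment so the UI can diff it against the failing baseline. The module import and all names are assumptions tying back to the earlier sketches:

```typescript
// Sketch: re-running the eval after the prompt fix, as a named experiment.
import { Eval, initDataset } from "braintrust";
// Assumed module re-exporting the pieces from the earlier sketches.
import { triageTicket, categoryMatch, triageRubric } from "./support-helper";

Eval("support-helper", {
  experimentName: "remediation-tightened-prompt", // named so the UI can diff it
  data: initDataset({ project: "support-helper", dataset: "replay-failures" }),
  task: triageTicket,                    // now resolves the updated prompt version
  scores: [categoryMatch, triageRubric], // unchanged scorers keep deltas comparable
});
```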
I'm going to go to Experiments here, and it's barely there, but you can see the score tick back up; any improvement going up is an improvement. And to give you an idea, now that we've run the evaluation, what I can actually do is a diff and compare the delta. So yes, the score has improved, in the intended direction. Okay. I think we're approaching the end of the content. I know it's been a number of steps, so I really want to thank you for your time and attention walking through it. As mentioned, the artifacts are public, we've got the cheat sheet, and we've got the Slack channel if you have any questions.

To summarize what you've accomplished today, in this order: you went from a single-shot prompt to a five-stage agentic AI workflow using tool calls. We were then able to inspect how each stage works independently by adding Braintrust tracing, making sure everything is recorded. We then evaluated the system offline, creating what is effectively a golden set of test cases to execute against. We deployed those managed prompts, tools, and parameters into the Braintrust secure architecture, and we added online scoring to evaluate the system as it runs. Then we picked a particular production failure, looked at the trace, modified the prompt in our case, saw the delta, and, running the evaluation again, saw the score recover to where it needed to be, completing the full lifecycle of building, observing, deploying, and improving.

So, to call it out and bring it home: hopefully this isn't an uncommon story, but what works in a prototype is not necessarily going to work in production. We really need to break the workflow down, identify the failure modes, and move forward, and that's where decomposing into stages becomes really important. Yes, more stages introduce more places where things could go wrong, but it's easier to debug when they do. There's no substitute for diving into the code and tracing everything; observability is table stakes at this point. If you've got a production application and you're not tracing it, you need to go back to the drawing board and get that in place, and hopefully we've shown you how to do that using Braintrust. There's no substitute for production logs either, but it's better to start somewhere: if you have an idea of where an issue might arise, those failure modes are the perfect seeds for an evaluation, so use them as your test cases. And as I mentioned, this is a continuous process; nothing's ever done. If you've ever worked in agile development, you know constant feedback is important, and we're bringing that operating model to a newer surface. Hopefully you can bring this home to your teams today.
So my encouragement to you, if this is of interest, is to pick something that's already operational today. It doesn't have to be the entire suite; start with something critical that you really want to improve. Add the tracing, collect your edge cases from that failure mode, build those deterministic scores, and route everything back where possible. The faster your feedback loop, the more insight you have, and the more you can improve the overall delivery and operation of your system. And just as a call to action: I know we've thrown a lot of content at you, and we'll obviously try to get a bit of feedback; I appreciate everyone juggling everything to do this. But as mentioned, you have to start somewhere, and we're just trying to accelerate you. We have a list of documentation provided, and you can use the AI agent on the docs to search if you have any questions. We also have a cookbook available; I tend to throw that cookbook directly into Cursor, Codex, or whatever and say, based on this, take the SDK and start tracing my application, and it does that pretty effectively. We've also actually announced a CLI for Braintrust applications that even lets you do things like auto-instrumentation. I'll just plug my colleague Eric, who's doing some fantastic work; please check out his booth, because it's amazing. And if this is something that interests you and you want to explore more, please reach out to your account team at Braintrust; we're happy to support where possible. And if you're on Discord, feel free to join; I'm happy to answer questions there. And again, I really want to thank you for your time, attention, and energy. I know it's a really sunny day and I don't want to keep people in here; I want you to get some fresh air. But on behalf of Braintrust, Trainline, and myself, thank you so much for your time and attention. It's been an honor, and I look forward to seeing you out there tracing and gaining value from delivering AI in production. Thank you.