Build a Prompt Learning Loop - SallyAnn DeLucia & Fuad Ali, Arize
Channel: aiDotEngineer
Published at: 2026-01-06
YouTube video id: SbcQYbrvAfI
Source: https://www.youtube.com/watch?v=SbcQYbrvAfI
[music] Hey everyone, gonna get started here. Thanks so much for joining us today. Um, I'm Sally. I'm the director of RISE. I'm going to be walking you through some of crowd prompt learning. Uh we're actually going to be building a driven optimization loop for the part of the workshop. Um I come from a technical background and started off in data science before I made my way over to product. Uh I do like to still be touching code today. I think one of my favorite projects that I work on is building our own agent Alex into our platform. So I'm very familiar with all of the pain points um and how important it is to optimize your prompt. So I'm going to spend a little bit time on slides. I like to like just set the scene, make sure everybody here has context on what we're going to be doing and then we'll jump into the code with me. So, I'll let you do a little bit of an intro. >> Yeah, thank you so much, Ellen. Great to meet all of you. Excited to be walking through prompt learning with you all. I don't know if you got a chance to see a harness talk yesterday, but hopefully that gave you some good background on how powerful prompting and prompt learning can be. Uh, so my name is I'm a product manager here at Arise as well. And like Sally said, we like to stay in code. We'll be doing a few slides, then we'll walk through the code and we'll be floating around helping you guys debug and things like that. My background is also technical. So, I was a backend distributed systems engineer for a long time. So, no stranger to how important observability infrastructure really is. Um, and I think it's an appropriate setting in AWS for that. So, yeah, excited to dive deep into front loading with you all. Thank you. >> Awesome. All right, so we're gonna get started. Just give you a little bit of an agenda of the things I'm going to be covering. Uh, so we're gonna talk about why agents fail today. what is evening prom learning? I want to go through a case study kind of show youall why this actually works. Uh and we'll talk about learning versus GA. I think everybody I had a few people come up to me over the conference about like what about GEA? Uh we have some benchmarking against that and then we'll hop into our workshop. Um but with this I want to ask a question. How many people here are building agents today? >> Okay, that's what I expected. Um and how many people actually feel like the agents they're building are reliable? >> Yeah, that's what I also thought. So let's talk a little bit about why agents fail today. So why do they fail? Well, there's a few things that we're seeing with a lot of our folks and we're seeing even internally as we build with Alex for why agents are b breaking. So um I think that a lot of times it's not because the models are weak. It's a lot of times the environment um and the instructions are weak. So uh having no instructions um from their learned environment uh no planning or very static planning. I feel like a lot of agents right now don't have planning. We do have some good examples of planning like we have cloud code cursor. Those are really great examples but I'm not seeing it make its way into every agent that I come across. Uh missing tools big one. Sometimes you just don't have the tool sets that you need. Uh and then missing kind of tool guidance on like which of the tools we should be picking and then context engineering continues to be a big struggle for folks. If I were to distill this out, I think it's like these three core issues. So adaptability and selfarning. Um so no system instructions learned from the environment touched on determinism versus non-determinism balance. So having the planning um or no planning versus doing like a very static planning. You want to kind of have some flexibility there. And then context engineering I think is a term that just kind of emerged in the last like you know six to eight months but it's something that's really really important that we're finding you know missing tools tool guidance just not having context or confirming your data and not giving the LM enough context. So these are um kind of the core issues to still. But I think there's one other pretty important thing. Um and that is kind of this distribution of who's responsible for what. So um there's these technical users, your AI engineers, your data scientists, developers, and they're really responsible for the code automation pipelines actually, you know, managing the performance and costs. But then we have our domain experts, subject matter experts, AI product managers. These are the ones that actually knew what the user experience would be. they probably are super familiar with um the principles that we're actually building to our AI applications. They're tracking our evals and they're really trying to ensure that the product success. So there's this split between responsibilities but everybody is contributing but then there's this difference um in terms of like maybe technical abilities. And so with prompt learning it's going to be a combination of all these things. So everybody's going to really need to be involved and we can talk about that uh a little bit more. So [clears throat] what even is prompting? I'm going to first kind of go through some of the um approaches that we kind of borrowed when we came up with prompt learning. So this is something that Arise has been really really uh dedicated to doing some research. And so one of the first things we borrow from uh which is reinforcement learning. How many folks here are familiar with how reinforcement learning works? All right, cool. Um so if I were to give like a really like silly kind of analogy, we have a reinforcement model. Uh pretend it's like a a student brain that we're trying to kind of, you know, boost up. And so they're going to take an action uh which might be something like you're just going to take a test an exam and there's going to be a score. A teacher is going to come through and actually you know score the exam here um that's going to produce this kind of like scaler reward um and you know pretend the student has an algorithm in their brain that can just kind of take those scores and update the weights in their brain and kind of like the learning behavior there and then we kind of reprocess. So you know in this kind of reinforcement one we're updating weights based off of some scalers. Um, but it's really actually difficult to update the weights directly, especially in like the LLM world. So, reinforcement learning isn't going to quite work that well uh when we're we're doing things like prompting. So, then there's metaprompting, which is very close to what we do with uh prompt learning, but still not quite right. So, here with metal prompting, we're asking LM to improve the prompt. Uh, so again, we use that kind of like student example. We have an agent which is our student. Um, and it's going to produce some kind of output like that's a user asking a question getting an output. That's our test in this example. And then we're going to score. Eval is pretty much what you can think of there. Uh, where it's going to output a score and from there we have like the metapromp thing. So now the teacher is kind of like the metapar prompt. It's going to take the result uh from our scorer and update the prompts based off of that. Um, but it's still not quite what we want to do. And that's where we kind of introduce this idea of prompt learning. So prompt learning is going to take the the exam going to produce an output. Um we're going to have our enlumm evals on there. But there's also this really important piece which is the English feedback. So which answers were wrong? Why were the answers wrong? Where the student needs to actually study? Really pinpointing those issues. And then we still aren't using metapro. We still are asking an LLM uh to improve the prompt. It's just the information that we are giving that LLM uh is quite different. And so we're going to update uh the prompt there with all of this kind of feedback. So from our evals from a subject matter expert going in and labeling and use that uh to kind of boost our prompt with better instructions and sometimes exams. So this is kind of like the traditional prompt optimization where it's like we have we're kind of treating it like an ML where we have our data and we have the prompt. We're saying optimize this prompt and maximize our like prediction impulse. Um but that doesn't quite work uh for Allens were missing a lot of context. So what we really found um is that the human instructions of why it failed. So imagine you have your application data, your traces, a data set, whatever it is. Your subject matter expert goes in and they're not only annotating correct or incorrect. They're saying this is why this is wrong. It failed to adhere to this key instruction. It didn't adhere to the context. It's missing out whatever it is. Um, and then you also have your ego explanations from Ellen as a judge, which is same kind of principle where instead of just the label, it provides the reasoning behind the label. And then we're pointing it at the exact instructions um to change. We're changing the system prompt to help it improve so that we then get, you know, prediction labels, but we also get those evals um and explanations of it. So, we're just kind of optimizing more than just um our outlet here. And I think a really key learning that we've had is the explanations in human instructions or through your own as a judge. That text is really really valuable. I think that's what we see not being utilized in a lot of other broad optimization approaches. Um they're either kind of optimizing for a score uh or they're just paying attention to the output. But you can think of it this way. It's like these elements are operating in the text domain. So we have all this rich text that tells us exactly what it needs to do to improve. why wouldn't we use that to actually improve our so um that's kind of the basics of prompt learning but everybody always comes up to me and like sounds great s but does it actually work um it does and we have some examples of when we do this so we did a little bit of a case study um I think coding agents everybody is pretty much using them at this point there's a quite a few that have been really really successful I think cloud code is a great example cursor but there's also client uh which is more of a um an open version of this and so we decided to take a look and compare to see if we could you know do anything to improve. So these are kind of the the baseline of where we started here. Um you can see the difference between the different models. U obviously using two and throttle kind of the state-of-the-art there but we also had this opportunity where CL was using you know 45 and it was working decently well at 30% versus 40. Um and then there was kind of the conversation around. So this is where we started um and we took a pass optimizing the system prompt here. So you can see this is what the old one was looking like. It has like no rules section. So it was just very like you are a cloud agent. You're built on this model. You're you're here to do coding. Um but there was no rules and so we took a pass at updating the system. So there were all of these different uh rules associated. So when dealing with errors or exceptions, handle them in a specific way. make sure that the changes align with, you know, the systems design. Um any changes to be accompanied by appropriate test. So really just kind of building in like the rules that like a good engineer would have uh which was completely missing before. Um and so we found that plan performs better with updated system problem. Pretty kind of simple. It's kind of the whole concept here. It's like you can see these different problems and we're seeing you know things that were incorrect now being correctly done just by simply adding more instructions. [clears throat] So it really demonstrates pretty well here um how those system prompts can improve and we benchmarked again with a s bench light to get another just like kind of coding uh benchmark for these coding agents and we were able to improve by 15% just through the addition of rules. Uh so I think that that's pretty powerful. So no fine-tuning, no tool changes, no architecture changes. I think those are the big things folks like reach for when they're trying to improve their agents. Uh but sometimes it's just about your system prompt and just adding rules. I think we've really seen that and that's why we're really passionate about prompt learning and prompt optimization in general is it feels like the lowest lift way to get massive improvement gains in your agent. Uh 4.1 achieved performance near 4.5 which is pretty much considered right now state-of-the-art when it comes to coding questions and it's twothirds of the cost which is always uh really beneficial. So uh these are some of kind of the tables here. will definitely distribute this so you can kind of take a closer look. But I think the main point I want y'all to come away with is the fact that like, you know, 15% is pretty, you know, powerful uh improvement in our performance. Now, a question we get all the time is we're taking these examples of perform learning. So, how this is really important is we're going to take a data set. A lot of time that data set is going to be a set of examples that didn't perform well. either a human went through and uh labeled them and found that they you know were incorrect or you have your emails that are labeling them incorrect and so you've gathered all these examples and that's what we're going to use to optimize our prompt. So I get a question all the time like well aren't we going to overfitit uh based off of these bad examples but there's this rule of generalization where mending properly enforces high level reusable coding coding standards rather than repo specific fixes and we are doing this train test split uh to ensure that the rules are generalized beyond just like local quirks and whatever our uh training data set is. But if you kind of think of this as like you hire an engineer, right, to to be an engineer at your company, you do kind of want them to overfit to the database that they're working on. So, uh we kind of feel that overfitting is maybe a better term for it is expertise. Uh we are again not kind of training in the traditional world. We are trying to build expertise and as we'll talk about this is not something we feel that you do once. You're actually going to kind of continuously be running this. So, um more problems are going to come up. we're going to kind of optimize our prompt for what the application is seeing now. Um, and then we'll kind of So, we don't actually think it's a flaw. We feel like it's expertise instead. Um, we can kind of adapt as needed and kind of mirroring what humans would do if they were taking on a task themselves. Um this is just another set of benchmarking again kind of proving here um that this diverse evaluation suite that focuses on the task for those difficult or tasks that are difficult for relish language models um and we're seeing again success with our improvements. Now Ga just kind of came out recently and I think that's something everybody's really excited about. I think the previous uh DSPI optimizers were a little bit more focused on optimizing a metric and as we talked about like we really want to be using uh the text modality that these applications are working in um that have a lot of the the reasons or how we need to improve and so we definitely wanted to do some benchmarking here. So how many people are familiar with Gered about it? All right, cool. Well, I'll just give like sort of high level. I just kind of noted that the main difference between their other like new pro optimizers is that they are actually um using this positive reflection and evaluation while they are are doing the optimization. So it's this evolutionary optimization um where there's this parentto-based candidate selection and probabilistic merging of prompts. What this really does under the hood is we take candidate cross uh we evaluate them. Then there's this reflection LM that's reviewing the evaluations and then kind of making some mutations some changes um and kind of repeating until it feels like it has the right set of prompts. So I think something that is important to notice about GABA is it doesn't really choose kind of just one. It does try to keep the top candidates um and then you know do the merging from there. But we benchmarked it and proper learning actually does do a little bit of a better job. And I think something that's really key is it does it in a lower number of loops. And I think something that we'll we'll talk about in just a second here is that it does actually matter what your emails look like and how reliable those are. I think that's something we really feel strongly about at Arise is uh you definitely want to be optimizing your agent prompts, but I think a lot of people forget about the fact that you should also be optimizing your email prompts because if you're using emails as a signal, um you can't really rely on them if you don't feel confident in them. So, it's just as important to invest there, making sure you're kind of applying the same principles that you are to your agent prompt as your email prompts so you have a really reliable signal that you can trust and then feed that into your prompt optimization. But, um in both of these graphs, the pink line is prompt learning. Uh we did also benchmark it against me pro their older optimization technique that I was mentioning kind of functions off like um optimizing around score and eval make the difference. So it kind of I I highlighted on this slide here like the with eval engineering we were able to do this. So we did have to make sure that the eval part of prompt learning uh were really high quality because again it's this only works um if the eval itself is working. So, yep, emails make all the difference. Kind of spend some time optimizing a prompt here. Um, again, it's all about making sure you have proper instruction. The same kind of rules apply. So, I want to kind of walk through. I know there's a lot of content. I think it's really important to have context. But before we jump into any of the workshops, any questions I could answer about what I discussed so far? >> Uh, I have a question comment. So I I think you know coding is the greatest example in terms of having the structure and evals. Uh one thing I'm sort of curious about is if you have other examples sort of general prompts forational interactions with systems that are not as easily quantifiable. I'm just curious about any experience you guys have there. >> Yeah. Is that for like eval general? >> Well I think it's just clear how you would set up what the eval would look like and I'm just wondering how you would do that for other types of so the question is like is there any kind of instruction for how you should set up your evals? coding seems like a very straightforward example. You kind of want to make sure the code's correct, right? But where some of these other agent tasks um it's a little bit harder. I think the advice that I usually give folks is we do have a set of like out of box. You can always start with things like QA correctness or focus on the task. But what I always suggest is like getting all the stakeholders kind of in the room. So getting those you know subject matter experts and security you know leadership and really defining what success would look like and then start kind of converting that to different evaluations. So um I think an example is Sterling and Alex. Um I have some task level evaluation. So like I really care did it find the right data uh that it should have. Um should it did it create a filter using semantic search or structured like making the right tool call? Um and then I care did it call things in the right order? Was the plan correct? So kind of thinking about like what each step was and then like even security will be like well we care how often people are trying to jailbreak Alex. So, it's just taking each of those success criteria, converting it to eval. Um, and we do have different tools that can help you, but that's usually the framework I give folks is like start with just success and then worry about converting into an email after. >> Yeah. Just to add to that, maybe like more of like a subjective use case is like for example like Booking.com is one of our clients and so when they do like what is a good posting for a property like what is a good picture? [clears throat] Defining that is really hard, right? Like to you, you might think something is a very attractive posting for like a hotel or something, right? But to someone else, it might look really different. And sometimes, as kind of Sil was alluding to, it's sufficient to just gate it as a good, bad, and then kind of iterate from there. So like, is this a good picture or bad picture? Let decide and then gate from there into specific background like, oh, this was dimly lit, the layout of the room was different, etc., etc. Yeah. >> Yeah. That's that you're actually building on the question I was going to ask which is that they end up with that binary outcome which doesn't necessarily give you a gradient to advance upon are you then effectively using those questions like digitally lit not to like get like a more continuous space is that >> exactly right and then from there as you get more signal you can refine your evaluator further and further and then use those states and you can actually put a lot of that in your prompting itself right so yeah >> I have two questions and I'm not sure if I should ask both of them or maybe your workshop will answer it. One is about rules and the rule section or like operating procedures. I'm curious how you uh do you just continuously refine that in the English language and uh maybe reduce the friction of any contradictory rules. That's the first question. And then the other was I would love to see the slide on eval. if you could just say a little bit more on how you approach that because my issue [clears throat] in doing this work is um whether or not to have like an a simulator of the product and then the simulator is evaluating or to do what I'd like to do which is like an end toend evaluation that I build but I would love to see you talk about that if you could. >> Yeah, absolutely. So from the first one about like how the instructions it's definitely something I think that like you iterate over time on them. So a lot of times I think we take our best bet like we write them by hand, right? And I think what we're trying to do with proper optimization is like leverage the data uh to dynamically change them. Uh and is I think great at like removing redundant instructions, things like that. But the goal is is we want to move away from static instructions. We feel very confidently that like that is not going to really scale. It's not going to lead to like sustainable um performance. So the idea exactly with pump learning is something that you can kind of run over time. We see this even like a long running task eventually uh where you're building up examples of incorrect things uh maybe having a human annotate them and then the task is kind of always running producing optimized prompts that you can then pull in production and it it kind of is like a cycle that repeats over time. >> Sorry just to intervene. So, are you saying that when you're doing this over a long period of time and then you have examples, you're just running the shots back into your rules section? >> Kind of. It's going to pass it like when we get to the optimization actual like loop we're going to build, you'll kind of see it as like you are feeding the data in that's going to build a new set of instructions that you would then, you know, push to production to use. >> Okay. >> Uh I think your second question was around evals and like how to where to start, how to like write them and like how to optimize those. Is that right? >> Yes. >> Yeah. So, it's a very similar approach. I think it's like the data that you're reviewing is almost a little bit different. So, uh I should have pulled up the the loops. I don't know if you can find it. Let me just try something really quick to kind of show this. There we go. So, this is kind of like how we we see it is you have two co-evolving loops. I've been talking about the one on the left, the blue one a lot about we're improving agent, we're collecting failures, kind of setting that to do kind of fine-tuning or prompt learning, but you basically want to do the same thing with your evals where uh we're collecting the data set of failures, but instead of thinking about the failures being the output of your agent, we're actually talking about the eval output. So having somebody go through and you know evaluate the evaluators or using things like log props as confidence scores or jury as a judge to determine where things are not confident. We're kind of doing the same thing. So figuring out where your eval is low confidence and then you're collecting that annotating maybe having somebody go through and say okay this is where the eval went wrong. And so it's the same pretty much process of optimizing your eval prompt. It's just you know I think folks think they can just grab something off the shelf or write something once and then they can just forget about it. But this loop, I've said it a few times, but the the left loop only works as well as your eval. >> Sorry, I think my question is actually way more static and basic. It's like do you are you talking about this orange circle as like are you building a system or simulator for the eval or are you just talking about like system prompt, user prompt, eval? >> Yeah, I think it's more right now what we're talking about is just like kind of the different prompts. You could definitely do simulation, but I think that's a whole different workshop. >> Thank you. Any [clears throat] more maybe questions before we get to the bridge club? Any switch back? All right. Um, so here is going to be a QR code uh for our prompt learning repo. Um, so I'll give everyone a few minutes to get such with that. Get it on your laptops. I know it's a little bit clunky to add this QR and like airdrop it. was not sure a better way. Um I can just show you also here if you want to find it. Um it is going to be in our Rise AI uh repo here and under prompt learning and you just want to kind of clone that. We are going to kind of be running it uh locally here. >> You go back to the page with the URL. >> Yes. Sorry about that. Oh >> no, the page with the URL. Oh, >> we'll give folks just a few minutes to get >> What do you What's your process when you're building a new agent or work for anything that could be evaluated? Do you guys start by just like, oh, try something prototype and then see where it's bad and then do eval? [clears throat] >> Yeah, I think there's different perspectives on this. Our perspective is EOS should never block you. Like you need to get started and you need to just build something really scrappy. We don't think like you should, you know, waste time doing eval. I think it's helpful to pull something out of the box sometimes in those situations just because it's hard to comb through your data. like that's something we've experienced with Alex of like when you're getting started just running a test manually reviewing like it it's kind of painful. Um so I think that having eval is helpful but shouldn't be a blocker. Pull something off the shelf maybe start with that then as you're iterating you're understanding where your issues are then you're starting to refine your evals as you're refining your agent. >> Yeah. One last question. Yeah. >> So it makes sense to like optimize the system like sub aents or commands or how are you thinking about this like multi- aent? >> Yeah. So the question is is like are you just doing one single prompt or how do you think about this in a multi- aent? I think we're kind of thinking that this right now is kind of independent tasks that can optimize your prompts kind of independently and then running tests um to get into like the agent simulation of running them all together. But right now, our approach is a little bit isolated, but I definitely see a future where we're going to kind of meet the the standard of like sub agents and everything else that's going on right now. >> No, I think that's pretty accurate. And also like I mean even in a single agent use case versus like a multi- aent use case like ultimately like each of those agents may be specialized. They may have their own prompts that they need to learn from. So I think doing this in isolation still has benefits for the multi- aent system as a whole that can pass on over time in scenarios like hand off etc and making something like really really specialized. So I guess like what we're talking about with like the overfitting as well which is again like question we get all the time but really you want to be over fit on your code base as an engineer. Um you don't want to be so generalized that you're no longer good at picking up specific works in your code base. Yeah. >> All right. Everybody kind of getting to read the mode. Okay. Anybody need any help? >> All right. So, we are going to be using OpenAI for this. So, I think the next thing that I'll have everyone do is probably spend some time just grabbing your API key. We'll get to it and then I'll just kind of start walking through our notebook here. So, we are going to be doing a JSON webpage prompt example. So, you're going to find that under notebooks here. Um, and so we'll give everybody a second to pull it out. There's going to be just some slight adjustments we're going to add to this example uh just to make it run a little faster and work a little better. The first is um what this is even doing this is going to be a very simple example uh for just a JSON web page prompts. If anybody has like a prompt or use case that they want to kind of like code along, Van and I are absolutely help like glad to help kind of adapt what you're working on to the use case here. It's something very simple just to kind of demonstrate um the the principles and we are going to be using we can definitely experiment. If you want to swap out any other providers that you want to use, we can also definitely help you do that. Um but the the goal of this is essentially going to be to iterate through different versions of a prompt using a data set. Um and we will optimize. So the first thing is obviously we need to do some installs. Um I am just going to have you all update it. It says like greater than 2.00. Uh but we're going to actually just use I think 22 today. And then the next thing is just to make this run a little faster. So we're going to run things in async which is missing. So you can go ahead and add these lines in the cell as well. All right, everyone kind of follow along and I never know want to move too fast. Seems to head nuts. Cool. Let's talk about configuration. So um I kind of talked about it a little bit when I was going through the slides. So we are going to be doing some looping. So the general idea is is we start out with the data set uh with some feedback in it and we'll we'll look through the data set once we get it. Um, but you're going to want to have either human evaluation. Um, so like annotations, either free text, labels, um, or you're going to want to have some evaluation data. But the feedback is really important. That's what makes this kind of work. Um, we're going to then, you know, pass that to Allen to do the optimization and then it's going to basically have eval. So as it's optimizing, it's using that kind of data set to then run and assess whether or not it should, you know, kind of keep optimizing. Um, and then it also provides you data that you can kind of like use to gauge which of the prompts that it outputs um, in you know a production setting. So we're going to do some configuration. Um, so I've kind of wrote out here kind of what each of these means. So we have the number of samples. So this controls how many rows of the sample data set. Um, you can, you know, set to zero to use all data or you can, you know, use a positive number to limit for, you know, faster experimentation. So I think that sometimes folks use different um approaches here. Sometimes you want to just move really quick so you set a low sample. Sometimes you want to be a little bit more representative so you up it. Um I have it here set as 100. Feel free to adjust. Um and then the next thing is train split. Um so I think folks are probably pretty familiar with the concept here of like a train test split, but it's just how much of the data do we want to use into our training? Again, that's what we're using to actually optimize. Then how much of it do we want to use when we're testing when we're running the eval um on the new prompt? Um and there's number of rules. Uh basically the specific number of rules to use for evaluation. This just determines which prompts to use. Um and so this is like as we're running these loops, we're outputting, you know, a bunch of different prompts. So this is just saying how many um we should use for evaluation. And then key one here, number of optimization loops. So this sets how many optimization iterations to run per experiment. Um and each loop basically generates those outputs, evaluates them and refineses the prompt. And so these just control the experiment scope the data splitting um just went through the whole prompt learning loop and and how much data we want to use. So you can kind of just run these as you are or if you want to adjust them feel free. Uh and then the next step pretty simple. We're just going to uh grab that open AI key if you haven't already uh set that up. So, get passage is going to like pop up. Um I'll show you here quick. It's going to pop up there. You can just paste in your API key there before we start looking at the the data a little bit. Just if anybody runs into any issues, you just give this away. All right. I think this particular we get through this >> I'm doing good but if you have a free one you want to give me that >> I wish >> all right let's talk about the data so we provided data with you with queries um you can see here that we're doing the 8020 split based off of kind of configuration we set above I'm just going to pull this um train set here and let's just >> Yeah, I run because in the minus 50 >> Oh, yep. You're right. That's a mistake on my part. Yeah, it is the 50. Um let's take a look at what this data set looks like. No. Uh just so folks can kind of understand. Um so kind of starting here with some just basic input and output. Um transcept we don't have any of the the feedback in these rows that I printed out here but you can imagine you can have different uh correctness labels here explanations any real validation data can be whatever it is that um you'd like it to be. Some folks use multiple eval feedback sometimes it's a combination but you really want to have you know the input and output that will use that way. Should my output of train set be the same as you? >> Not necessarily. Depends on >> I didn't know if head was sort or not. >> It all depends on kind of what the the same but we could look at like you know if I did this this should be the same for you maybe just to make sure. >> Yeah. >> That's what you're saying. Okay. Yeah. Quick question. [clears throat] Um, is it possible for the input to be like a chat history and not just >> Great question. So, I think it depends on like what it is you're trying to do. If you're doing just like a simple kind of uh system of the input, you kind of want it to be one to one. You don't want to give it a ton of um like conversation data that's not relevant to the prompt that you're optimizing. um we we generally just use like the single input but I think that there are applications that you could do like conversation level um inputs. >> Yeah. Because because quite often the failure is somewhere middleation right. So so if you put just the original task in uh then the probability of you hitting you know a failure in the middle of the >> totally. So in that case, what we generally see is like different rows of like having each of like the back and forths be like kind of independent rows because you're probably going to evaluate each of them and um honestly probably like get the human feedback on each of them. So we usually separate them out in that way. >> But it's a good point. If you just always are focusing on the first turn, there's probably a lot of redundancy there. uh you definitely will have to like say over parts of the conversation >> and and how we can biferate like instructions and we have some context also. So >> should not touch the context. It should only uh whatever the manipulate the system instruction or the prompting context it should be the static it should not be like based on the answer it will change my context. >> Yes. What you're saying is like looking at the input there might be like a tool volume context you're kind of passing that in. You can absolutely include that in your data set um so that the application kind of understands what other or not the application [clears throat] but the prompt learning um LM can understand all of the data that's kind of like available. So you can just have that passed in as extra column if you want. Most people start with just kind of input and the feedback. Um but you can absolutely add what other data you think is relevant and if when for the rerunning when we're doing the experiment of testing you'll definitely always want to have the data that would be required to answer >> any even very simple some call some call or some context it is pulling some API call whatever the prompt engineering it should be based on the out getting the output right and whatever the context front my plus whatever the tool call I have done API call all the uh contact engineering and then last finalize >> totally yeah so again at this point we're we're testing just like one prompt and not that kind of end to end but you definitely want to have everything that like is flowing into the prompt that you're optimizing so uh if your system prompt takes in the user input for example some data from an external API you would definitely want to provide all of that data does that make sense Because because you're saying that like the the like trajectories [clears throat] the like tool calls and what the agent's going to do depending on what the tool call was is what you're trying to proper to. >> Yeah, exactly. We want to just like because we're kind of trying to replay and optimize one step of it. We definitely don't want to do it completely in isolation. So if there's like data that flows into that prompt um that's context that's using that's producing the output, right? So we want to be sure that we're including that. We don't want to exclude anything. But if it's data that comes like at a different step probably not then you don't want to do that that way. It's just like think about what's relevant for the the step that we're trying to optimize in this. All right. Any other questions coming? All right. Cool. So we're going to set up our initial system prompt. You can see this is something very very basic. Uh we'll definitely I think we can do a whole lot better than this, but I just kind of want to illustrate something uh that we're going to test and optimize. So we're just saying you are an expert in JSON web page creation. Your task is input. And then so all these inputs that we're seeing are going to be what we're actually generating outputs for and trying to optimize. Now I already kind of touched on this. Um evaluators are extremely important to make all of this work, right? Um so we're going to uh initialize two evaluators that use elements as a judge to assess the quality of generated outputs. So we are using elements a judge. If you have any other like codebased evaluations, whatever you need to do to evaluate, you can definitely swap those out. Uh but we're going to do evaluate output. This is going to be a comprehensive evaluator that assesses the JSON webpage correctness against the input query and the evaluation rules. It's going to provide an output label of correct or incorrect. So pretty simple binary. Again, you can use multilel. And then it's going to have the detailed explanations as well. Um, and then we have a rule checker. This is a more specialized evaluator that performs a granular rule by rule analysis. Um, and it examines if each rule um was compliant. And then both of these are going to generate feedback that goes into our optimization loop uh to iteratively improve the system prompt. Um, explanation role violations guide. Um and we'll get to this the prompt learning optimizer and creating the more effective prompts. So I have some imports here. Let's take a look at what the actual eval output has. Um so we do have some rules that are in um in here wait um they're going to be in a repo. Um so we're going to open that as a file. We have this llm provider and we're using open AI here. And then we're going to do our classification evaluator. So, uh we're just calling it uh evaluate output. It's al we have an evaluation template that we're reading from the bottom here. Um then we just have choices correct and correct. Now we're mapping a label to a score. Sometimes it's helpful to be able to like add or score. Sometimes a number is easier than just looking at a bunch of labels. Uh it is optional you want to map these if you have like a multiclass use case. You can set the scores u accordingly. But these are just going to be our choices like the rails that we want our elements as a judge to adhere to. And then all we're doing here is getting our results. I have it doing some printing so you can kind of take a look. So this is going to be slightly different than what you're seeing in the notebook. So I'm just going to pause here. Uh if you want to make the code changes from what you're seeing in probably your version, this is a a good time for that. Does kind of the setup of the evaluator make sense to all kind of the key. It's going to be the rails. It's going to be the output. Uh and of course our template. [clears throat] >> Yeah, you will want to grab your own uh OpenAI key here uh to set [clears throat] >> and we can help you if you want to use different provider. We can help you swap this out like that is helpful to anybody. >> Okay, I'm going to start walking you through the output generation. So uh this is just kind of you know you can imagine this as your own agent logic or the the part that you're kind of testing. Uh this is just going to function that actually generates the JSON u outputs. We're using for one here with JSON response format zero temperature for consistent outputs. Um it's taking a data set a system prompt generates outputs for all rows returns the results for evaluation. Um and it's called during each iteration to produce output. So this is like our experimentation function that we're writing. So as we're passing in data, it's producing new uh prompts. We need a way to test it, evaluate, understand uh how we are kind of moving the needle here. So that's all this is. So it's pretty straightforward function just called generate output. We have that output model. Again, we're using OpenAI. If anybody wants help switching things around, happy to help. Uh we are using response format because we are dealing with JSON here. So uh we know that what you just prompted. I mean some of the the newer models are decent at it, but using response format is really helpful. And then we're also setting temperature to zero. Um, and here is just kind of where we're passing all the data in. So the data set because again we want to run this on all of the the testing data, the system prompt that will be input. So as we get to the optimization loop, we're going to be passing in a new prompt to this with the data set and then evaluating. Um, we have our output model that we've already passed, concurrency, all that good stuff. And it's just returning all of the outputs there. Would you for the uh the current generation of models since this one's basically like in in AI terms ancient uh would you like still recommend setting the temperature to zero or would you actually want to try to encourage some of the creativity to like >> I think it depends on the use case a little bit and what you're you're trying to do. You can definitely experiment that and kind of take it through the lens of how how important is consistency to use something like I feel like JSON web page I feel like consistency probably like temperature zero makes sense but I definitely think not for every agent every use case do you want to use zero any other questions get moving all right additional metric so we kind of talked about before that we are kind of using some score mapping uh this part is optional you want to use the metrics that make sense [clears throat] to you we're not directly using this um as like we are kind of like using it to know whether or not we optimize but it's not like we're you know using this as our sole kind of indicator for the success. Uh here we are just going to calculate some very basic metrics. Um it's just you can you know choose something like accuracy, F1 precision, recall just some basic kind of classification metrics for us to understand and because we are using binary mapping scores we can do that. Um and so that's what you're seeing happen here. We're mapping to binary and then just based off the score we calculate the metric. So very simple uh helper function here. All right, the good stuff the optimization loop. We made it. Um okay, so this cell implements the core prompt optimization algorithm. It's a three-part process. Uh so we want to generate and evaluate. So generate outputs using the current pump on the test data set and evaluate their correctness. Uh we want to train and optimize. If results are unsatisfactory, generate uh outputs on the training set, evaluate them, use the feedback to improve the prompt, and then iterate. So we kind of want to repeat until either the threshold is met or all the loops are u kind of completed. So if you remember above, um we're kind of setting that to just like five loops. Um and then you know we can kind of repeat um based off of that or the thresh met um it's going to track metrics across all the iterations. So turn to detailed results including a train test accuracy scores the optimized prompts and the raw value. So as I kind of mentioned at the beginning as we're running these different loops on the experiments we're going to be producing a lot of different prompts. Um and so we're kind of getting that information back that you can use. Um and these are our key parameters. I'll kind of go through them, you know, as we get to the code, but just to give you a heads up. Uh, this is the target accuracy score to stop optimizations. Um, it could also be whatever other metric you'll see, we have a score so you can kind of determine the number of loops of the optimization iterations. We've set that score and then the number of rules. Again, these are some configurations we've already set. Um, cool. So, optimization loop. This is um going to take in all of those um you know parameters that I've mentioned there. Um it just kind of kicks off saying hey we're starting um it's going to do the initial evaluation so we understand uh how things are starting off. Again you can kind of pass in data too. You can kind of skip this initial evaluation. We're kind of running it um at the start here. But if you were running production setting, you might already have evalu. Um, and then it's going to assess the threshold against kind of our initial valuation. Again, this could kind of be skipped when we're coming from a production setting, but wanted to kind of start us off from scratch so that we can get a real feel for this. Um, and then it starts the loop. So, we're generating output. Um, it's setting that as the train output. So, when I printed train, you kind of saw the outputs. I kind of skipped ahead there. Um and then it also will set um you know correctness, explanation, any rule violations. Um and then we'll actually use our prompt learning optimizer. So this comes with like the SDK uh the prop learning SDK that you can use um with the rise. Uh so we're sending in that prompt optimization the b choice um and then that API. So under the hood as we talked about in the slides taking in that feedback um taking in the original prompt and trying to optimize to get better results and then spinning out prompt um and then can also add an evaluator. So again those three um kind of feedback columns we're looking to get back as correctness explanation for that if there are any rule violations and then from there we just kind of kicked off the optimizer and optimize with our train set output those feedback columns again and then you know any context size limitations you want to add um next step so the optimizer again is going to take our data produce a prompt we want to evaluate so we understand how we're doing what this code block doing is doing here so trying to get that new prompting again with all those details getting our result and then we do that with our test set as well and then we're getting back like our score and our metric value and then doing the checks and then we repeat it all again till we either get above our threshold or we've hit the max number of loops and then returning our results. So that's kind of what's going to be happening other here. Any questions on that? uh just some result saving function more helper functions here. So we do want to obviously save all these results. We don't want them just be ephemeral that we can't ever access again. So just saving them all. Um you can also save all the single experimentation so you have all of that data towards the end. We'll be able to kind of pull this um and determine what the best prompt is. But these are just very basic helper functions. I don't spend too much time just saving them to CSV at the end of the day. Now we execution it. Um, so this cell runs the prop optimization experiment, saves the results. We're getting the JSON format, the CSV format. Um, it includes calls for the iteration number, the number of rules, test, train, accuracy scores, all the data that we're actually going to need to evaluate uh whether or not this thing is successful, and then we're going to start getting uh results here. So, um, this does take quite a while to run. So, we'll run and I think this will be a great point for discussion, but as you kind of are running it, you're going to start seeing the different loops. um kind of outputs coming out as well. Um and yeah, we'll just kind of like work through it as it it runs. It's probably going to take like 20 30 minutes for things to run, but um happy to take any questions and help anybody out as they run into issues. >> One thing, can you scroll back to the part of code that we needed to change? >> Change. It's gonna be >> So, one reminder are running into this. I don't think I was >> for this line here, like when you're doing install, you do want to be equals 2.2. Um, because I think there's a a little bit of a package issue. Um, so just make sure that's you're hitting errors with the eval. If not, let me down and try to fix it. This is the reason why [clears throat] >> uses like a generic evaluation. >> Yes. And you can kind of see the evaluation problem if you go to the We've kind of just taken that part out of this, but we can definitely go through that. Um so if you look here um on this line here, we're reading in um under prompts here, you can find the evaluation if you're curious. [snorts] And this is the reason why everyone hates on docker. This is why we use all. >> Yes, absolutely. >> The notebook. >> So I would also recommend uh patching your code with nestio if you haven't already. Helps it run a lot faster. Also for the purpose of the workshop um I switched our loops to one. uh that took me six minutes to run. So would recommend also doing that instead of having five. Obviously wouldn't recommend doing that when you're actually optimizing your prompt, but for now it'll help you get through the workshop. >> All right, I just want to kind of call out the the last little bit here. Um >> the last step before folks, let's see. Okay. Um so the the last little bit of code here um is just to extract the prompt that achieves the best test accuracy. So I mentioned how we're kind of like saving up all the results to use. Uh we just have a function that essentially gets the last or the best uh version of that kind of showing you the original and then the best optimized version uh which you can then use to kind of [clears throat] pull and put into your um code. I did want to kind of just give one kind of call out as you kind of saw today can be a little bit um difficult to to manage and so I want to call out for those of you who are kind of maybe looking for more of like an enterprise solution to this in Arise uh you do have these prompt optimization tasks. Uh you can have your prompts living in our prompt hub um data sets with all of your human annotations or ebal that you can either create from traces or just by ingesting it into Arise. Um, and then from there, all you really need to do is like give it a task name, choose what you want your training data set to be, where the output lives, where all your feedback columns are. Uh, you can adjust all of the parameters uh that you'd like. And then from there, you can just like kick it off and it will produce an optimized prompt in the hub for you. Um, so if I go over here, I think I have some. No, maybe not. it will basically just create a new version here that says it's optimized prompts with all the results and we are building on this so you can add all your ebots to it have that all running in the loop but just wanted to call out that if you're not interested in maybe maintaining code loops and having to build uh like a task infrastructure yourself it is something that we do offer in Arise um but yeah hopefully I know some folks are hanging out we'll be sticking around here for a little while as we um can help you kind of work through issues But uh thanks so much for joining us. Um hopefully you learned something useful. [music] [music]