Build a Prompt Learning Loop - SallyAnn DeLucia & Fuad Ali, Arize

Channel: aiDotEngineer
Published at: 2026-01-06
YouTube video id: SbcQYbrvAfI
Source: https://www.youtube.com/watch?v=SbcQYbrvAfI
[music]
Hey everyone, gonna get started here.
Thanks so much for joining us today. Um,
I'm Sally. I'm the director of RISE. I'm
going to be walking you through some of
crowd prompt learning. Uh we're actually
going to be building a driven
optimization loop for the part of the
workshop. Um I come from a technical
background and started off in data
science before I made my way over to
product. Uh I do like to still be
touching code today. I think one of my
favorite projects that I work on is
building our own agent Alex into our
platform. So I'm very familiar with all
of the pain points um and how important
it is to optimize your prompt. So I'm
going to spend a little bit time on
slides. I like to like just set the
scene, make sure everybody here has
context on what we're going to be doing
and then we'll jump into the code with
me. So, I'll let you do a little bit of
an intro.
>> Yeah, thank you so much, Ellen. Great to
meet all of you. Excited to be walking
through prompt learning with you all. I
don't know if you got a chance to see a
harness talk yesterday, but hopefully
that gave you some good background on
how powerful prompting and prompt
learning can be. Uh, so my name is I'm a
product manager here at Arise as well.
And like Sally said, we like to stay in
code. We'll be doing a few slides, then
we'll walk through the code and we'll be
floating around helping you guys debug
and things like that. My background is
also technical. So, I was a backend
distributed systems engineer for a long
time. So, no stranger to how important
observability infrastructure really is.
Um, and I think it's an appropriate
setting in AWS for that. So, yeah,
excited to dive deep into front loading
with you all. Thank you.
>> Awesome. All right, so we're gonna get
started. Just give you a little bit of
an agenda of the things I'm going to be
covering. Uh, so we're gonna talk about
why agents fail today. what is evening
prom learning? I want to go through a
case study kind of show youall why this
actually works. Uh and we'll talk about
learning versus GA. I think everybody I
had a few people come up to me over the
conference about like what about GEA? Uh
we have some benchmarking against that
and then we'll hop into our workshop. Um
but with this I want to ask a question.
How many people here are building agents
today?
>> Okay, that's what I expected. Um and how
many people actually feel like the
agents they're building are reliable?
>> Yeah, that's what I also thought. So
let's talk a little bit about why agents
fail today. So why do they fail? Well,
there's a few things that we're seeing
with a lot of our folks and we're seeing
even internally as we build with Alex
for why agents are b breaking. So um I
think that a lot of times it's not
because the models are weak. It's a lot
of times the environment um and the
instructions are weak. So uh having no
instructions um from their learned
environment uh no planning or very
static planning. I feel like a lot of
agents right now don't have planning. We
do have some good examples of planning
like we have cloud code cursor. Those
are really great examples but I'm not
seeing it make its way into every agent
that I come across. Uh missing tools big
one. Sometimes you just don't have the
tool sets that you need. Uh and then
missing kind of tool guidance on like
which of the tools we should be picking
and then context engineering continues
to be a big struggle for folks. If I
were to distill this out, I think it's
like these three core issues. So
adaptability and selfarning. Um so no
system instructions learned from the
environment touched on determinism
versus non-determinism balance. So
having the planning um or no planning
versus doing like a very static
planning. You want to kind of have some
flexibility there. And then context
engineering I think is a term that just
kind of emerged in the last like you
know six to eight months but it's
something that's really really important
that we're finding you know missing
tools tool guidance just not having
context or confirming your data and not
giving the LM enough context. So these
are um kind of the core issues to still.
But I think there's one other pretty
important thing. Um and that is kind of
this distribution of who's responsible
for what. So um there's these technical
users, your AI engineers, your data
scientists, developers, and they're
really responsible for the code
automation pipelines actually, you know,
managing the performance and costs. But
then we have our domain experts, subject
matter experts, AI product managers.
These are the ones that actually knew
what the user experience would be. they
probably are super familiar with um the
principles that we're actually building
to our AI applications. They're tracking
our evals and they're really trying to
ensure that the product success. So
there's this split between
responsibilities but everybody is
contributing but then there's this
difference um in terms of like maybe
technical abilities. And so with prompt
learning it's going to be a combination
of all these things. So everybody's
going to really need to be involved and
we can talk about that uh a little bit
more. So [clears throat] what even is
prompting? I'm going to first kind of go
through some of the um approaches that
we kind of borrowed when we came up with
prompt learning. So this is something
that Arise has been really really uh
dedicated to doing some research. And so
one of the first things we borrow from
uh which is reinforcement learning. How
many folks here are familiar with how
reinforcement learning works? All right,
cool. Um so if I were to give like a
really like silly kind of analogy, we
have a reinforcement model. Uh pretend
it's like a a student brain that we're
trying to kind of, you know, boost up.
And so they're going to take an action
uh which might be something like you're
just going to take a test an exam and
there's going to be a score. A teacher
is going to come through and actually
you know score the exam here um that's
going to produce this kind of like
scaler reward um and you know pretend
the student has an algorithm in their
brain that can just kind of take those
scores and update the weights in their
brain and kind of like the learning
behavior there and then we kind of
reprocess. So you know in this kind of
reinforcement one we're updating weights
based off of some scalers. Um, but it's
really actually difficult to update the
weights directly, especially in like the
LLM world. So, reinforcement learning
isn't going to quite work that well uh
when we're we're doing things like
prompting. So, then there's
metaprompting, which is very close to
what we do with uh prompt learning, but
still not quite right. So, here with
metal prompting, we're asking LM to
improve the prompt. Uh, so again, we use
that kind of like student example. We
have an agent which is our student. Um,
and it's going to produce some kind of
output like that's a user asking a
question getting an output. That's our
test in this example. And then we're
going to score. Eval is pretty much what
you can think of there. Uh, where it's
going to output a score and from there
we have like the metapromp thing. So now
the teacher is kind of like the metapar
prompt. It's going to take the result uh
from our scorer and update the prompts
based off of that. Um, but it's still
not quite what we want to do. And that's
where we kind of introduce this idea of
prompt learning. So prompt learning is
going to take the the exam going to
produce an output. Um we're going to
have our enlumm evals on there. But
there's also this really important piece
which is the English feedback. So which
answers were wrong? Why were the answers
wrong? Where the student needs to
actually study? Really pinpointing those
issues. And then we still aren't using
metapro. We still are asking an LLM uh
to improve the prompt. It's just the
information that we are giving that LLM
uh is quite different. And so we're
going to update uh the prompt there with
all of this kind of feedback. So from
our evals from a subject matter expert
going in and labeling and use that uh to
kind of boost our prompt with better
instructions and sometimes exams.
So this is kind of like the traditional
prompt optimization where it's like we
have we're kind of treating it like an
ML where we have our data and we have
the prompt. We're saying optimize this
prompt and maximize our like prediction
impulse. Um but that doesn't quite work
uh for Allens were missing a lot of
context. So what we really found um is
that the human instructions of why it
failed. So imagine you have your
application data, your traces, a data
set, whatever it is. Your subject matter
expert goes in and they're not only
annotating correct or incorrect. They're
saying this is why this is wrong. It
failed to adhere to this key
instruction. It didn't adhere to the
context. It's missing out whatever it
is. Um, and then you also have your ego
explanations from Ellen as a judge,
which is same kind of principle where
instead of just the label, it provides
the reasoning behind the label. And then
we're pointing it at the exact
instructions um to change. We're
changing the system prompt to help it
improve so that we then get, you know,
prediction labels, but we also get those
evals um and explanations of it. So,
we're just kind of optimizing more than
just um our outlet here. And I think a
really key learning that we've had is
the explanations in human instructions
or through your own as a judge. That
text is really really valuable. I think
that's what we see not being utilized in
a lot of other broad optimization
approaches. Um they're either kind of
optimizing for a score uh or they're
just paying attention to the output. But
you can think of it this way. It's like
these elements are operating in the text
domain. So we have all this rich text
that tells us exactly what it needs to
do to improve. why wouldn't we use that
to actually improve our so um that's
kind of the basics of prompt learning
but everybody always comes up to me and
like sounds great s but does it actually
work um it does and we have some
examples of when we do this so we did a
little bit of a case study um I think
coding agents everybody is pretty much
using them at this point there's a quite
a few that have been really really
successful I think cloud code is a great
example cursor but there's also client
uh which is more of a um an open version
of this and so we decided to take a look
and compare to see if we could you know
do anything to improve. So these are
kind of the the baseline of where we
started here. Um you can see the
difference between the different models.
U obviously using two and throttle kind
of the state-of-the-art there but we
also had this opportunity where CL was
using you know 45 and it was working
decently well at 30% versus 40. Um and
then there was kind of the conversation
around. So this is where we started um
and we took a pass optimizing the system
prompt here. So you can see this is what
the old one was looking like. It has
like no rules section. So it was just
very like you are a cloud agent. You're
built on this model. You're you're here
to do coding. Um but there was no rules
and so we took a pass at updating the
system. So there were all of these
different uh rules associated. So when
dealing with errors or exceptions,
handle them in a specific way. make sure
that the changes align with, you know,
the systems design. Um any changes to be
accompanied by appropriate test. So
really just kind of building in like the
rules that like a good engineer would
have uh which was completely missing
before. Um and so we found that plan
performs better with updated system
problem. Pretty kind of simple. It's
kind of the whole concept here. It's
like you can see these different
problems and we're seeing you know
things that were incorrect now being
correctly done just by simply adding
more instructions. [clears throat]
So it really demonstrates pretty well
here um how those system prompts can
improve and we benchmarked again with a
s bench light to get another just like
kind of coding uh benchmark for these
coding agents and we were able to
improve by 15% just through the addition
of rules. Uh so I think that that's
pretty powerful. So no fine-tuning, no
tool changes, no architecture changes. I
think those are the big things folks
like reach for when they're trying to
improve their agents. Uh but sometimes
it's just about your system prompt and
just adding rules. I think we've really
seen that and that's why we're really
passionate about prompt learning and
prompt optimization in general is it
feels like the lowest lift way to get
massive improvement gains in your agent.
Uh 4.1 achieved performance near 4.5
which is pretty much considered right
now state-of-the-art when it comes to
coding questions and it's twothirds of
the cost which is always uh really
beneficial. So uh these are some of kind
of the tables here. will definitely
distribute this so you can kind of take
a closer look. But I think the main
point I want y'all to come away with is
the fact that like, you know, 15% is
pretty, you know, powerful uh
improvement in our performance.
Now, a question we get all the time is
we're taking these examples of perform
learning. So, how this is really
important is we're going to take a data
set. A lot of time that data set is
going to be a set of examples that
didn't perform well. either a human went
through and uh labeled them and found
that they you know were incorrect or you
have your emails that are labeling them
incorrect and so you've gathered all
these examples and that's what we're
going to use to optimize our prompt. So
I get a question all the time like well
aren't we going to overfitit uh based
off of these bad examples but there's
this rule of generalization where
mending properly enforces high level
reusable coding coding standards rather
than repo specific fixes and we are
doing this train test split uh to ensure
that the rules are generalized beyond
just like local quirks and whatever our
uh training data set is. But if you kind
of think of this as like you hire an
engineer, right, to to be an engineer at
your company, you do kind of want them
to overfit to the database that they're
working on. So, uh we kind of feel that
overfitting is maybe a better term for
it is expertise. Uh we are again not
kind of training in the traditional
world. We are trying to build expertise
and as we'll talk about this is not
something we feel that you do once.
You're actually going to kind of
continuously be running this. So, um
more problems are going to come up.
we're going to kind of optimize our
prompt for what the application is
seeing now. Um, and then we'll kind of
So, we don't actually think it's a flaw.
We feel like it's expertise instead. Um,
we can kind of adapt as needed and kind
of mirroring what humans would do if
they were taking on a task themselves.
Um this is just another set of
benchmarking again kind of proving here
um that this diverse evaluation suite
that focuses on the task for those
difficult or tasks that are difficult
for relish language models um and we're
seeing again success with our
improvements.
Now Ga just kind of came out recently
and I think that's something everybody's
really excited about. I think the
previous uh DSPI optimizers were a
little bit more focused on optimizing a
metric and as we talked about like we
really want to be using uh the text
modality that these applications are
working in um that have a lot of the the
reasons or how we need to improve and so
we definitely wanted to do some
benchmarking here. So how many people
are familiar with Gered about it? All
right, cool. Well, I'll just give like
sort of high level. I just kind of noted
that the main difference between their
other like new pro optimizers is that
they are actually um using this positive
reflection and evaluation while they are
are doing the optimization. So it's this
evolutionary optimization um where
there's this parentto-based candidate
selection and probabilistic merging of
prompts. What this really does under the
hood is we take candidate cross uh we
evaluate them. Then there's this
reflection LM that's reviewing the
evaluations and then kind of making some
mutations some changes um and kind of
repeating until it feels like it has the
right set of prompts. So I think
something that is important to notice
about GABA is it doesn't really choose
kind of just one. It does try to keep
the top candidates um and then you know
do the merging from there. But we
benchmarked it and proper learning
actually does do a little bit of a
better job. And I think something that's
really key is it does it in a lower
number of loops. And I think something
that we'll we'll talk about in just a
second here is that it does actually
matter what your emails look like and
how reliable those are. I think that's
something we really feel strongly about
at Arise is uh you definitely want to be
optimizing your agent prompts, but I
think a lot of people forget about the
fact that you should also be optimizing
your email prompts because if you're
using emails as a signal, um you can't
really rely on them if you don't feel
confident in them. So, it's just as
important to invest there, making sure
you're kind of applying the same
principles that you are to your agent
prompt as your email prompts so you have
a really reliable signal that you can
trust and then feed that into your
prompt optimization. But, um in both of
these graphs, the pink line is prompt
learning. Uh we did also benchmark it
against me pro their older optimization
technique that I was mentioning kind of
functions off like um optimizing around
score
and eval make the difference. So it kind
of I I highlighted on this slide here
like the with eval engineering we were
able to do this. So we did have to make
sure that the eval part of prompt
learning uh were really high quality
because again it's this only works um if
the eval itself is working.
So, yep, emails make all the difference.
Kind of spend some time optimizing a
prompt here. Um, again, it's all about
making sure you have proper instruction.
The same kind of rules apply.
So, I want to kind of walk through. I
know there's a lot of content. I think
it's really important to have context.
But before we jump into any of the
workshops, any questions I could answer
about what I discussed so far?
>> Uh, I have a question comment. So I I
think you know coding is the greatest
example in terms of having the structure
and evals. Uh one thing I'm sort of
curious about is if you have other
examples sort of general prompts
forational
interactions with systems that are not
as easily quantifiable. I'm just curious
about any experience you guys have
there.
>> Yeah. Is that for like eval
general?
>> Well I think it's just clear how you
would set up what the eval would look
like and I'm just wondering how you
would do that for other types of so the
question is like is there any kind of
instruction for how you should set up
your evals? coding seems like a very
straightforward example. You kind of
want to make sure the code's correct,
right? But where some of these other
agent tasks um it's a little bit harder.
I think the advice that I usually give
folks is we do have a set of like out of
box. You can always start with things
like QA correctness or focus on the
task. But what I always suggest is like
getting all the stakeholders kind of in
the room. So getting those you know
subject matter experts and security you
know leadership and really defining what
success would look like and then start
kind of converting that to different
evaluations. So um I think an example is
Sterling and Alex. Um I have some task
level evaluation. So like I really care
did it find the right data uh that it
should have. Um should it did it create
a filter using semantic search or
structured like making the right tool
call? Um and then I care did it call
things in the right order? Was the plan
correct? So kind of thinking about like
what each step was and then like even
security will be like well we care how
often people are trying to jailbreak
Alex. So, it's just taking each of those
success criteria, converting it to eval.
Um, and we do have different tools that
can help you, but that's usually the
framework I give folks is like start
with just success and then worry about
converting into an email after.
>> Yeah. Just to add to that, maybe like
more of like a subjective use case is
like for example like Booking.com is one
of our clients and so when they do like
what is a good posting for a property
like what is a good picture?
[clears throat] Defining that is really
hard, right? Like to you, you might
think something is a very attractive
posting for like a hotel or something,
right? But to someone else, it might
look really different. And sometimes, as
kind of Sil was alluding to, it's
sufficient to just gate it as a good,
bad, and then kind of iterate from
there. So like, is this a good picture
or bad picture? Let decide and then gate
from there into specific background
like, oh, this was dimly lit, the layout
of the room was different, etc., etc.
Yeah.
>> Yeah. That's that you're actually
building on the question I was going to
ask which is that they end up with that
binary outcome which doesn't necessarily
give you a gradient to advance upon are
you then effectively using those
questions like digitally lit not to like
get like a more continuous space is that
>> exactly right and then from there as you
get more signal you can refine your
evaluator further and further and then
use those states and you can actually
put a lot of that in your prompting
itself right so yeah
>> I have two questions and I'm not sure if
I should ask both of them or maybe your
workshop will answer it. One is about
rules and the rule section or like
operating procedures. I'm curious how
you uh do you just continuously refine
that in the English language and uh
maybe reduce the friction of any
contradictory rules. That's the first
question. And then the other was I would
love to see the slide on eval. if you
could just say a little bit more on how
you approach that because my issue
[clears throat] in doing this work is um
whether or not to have like an a
simulator of the product and then the
simulator is evaluating or to do what
I'd like to do which is like an end
toend evaluation that I build but I
would love to see you talk about that if
you could.
>> Yeah, absolutely. So from the first one
about like how the instructions it's
definitely something I think that like
you iterate over time on them. So a lot
of times I think we take our best bet
like we write them by hand, right? And I
think what we're trying to do with
proper optimization is like leverage the
data uh to dynamically change them. Uh
and is I think great at like removing
redundant instructions, things like
that. But the goal is is we want to move
away from static instructions. We feel
very confidently that like that is not
going to really scale. It's not going to
lead to like sustainable um performance.
So the idea exactly with pump learning
is something that you can kind of run
over time. We see this even like a long
running task eventually uh where you're
building up examples of incorrect things
uh maybe having a human annotate them
and then the task is kind of always
running producing optimized prompts that
you can then pull in production and it
it kind of is like a cycle that repeats
over time.
>> Sorry just to intervene. So, are you
saying that when you're doing this over
a long period of time and then you have
examples, you're just running the shots
back into your rules section?
>> Kind of. It's going to pass it like when
we get to the optimization actual like
loop we're going to build, you'll kind
of see it as like you are feeding the
data in that's going to build a new set
of instructions that you would then, you
know, push to production to use.
>> Okay.
>> Uh I think your second question was
around evals and like how to where to
start, how to like write them and like
how to optimize those. Is that right?
>> Yes.
>> Yeah. So, it's a very similar approach.
I think it's like the data that you're
reviewing is almost a little bit
different. So, uh I should have pulled
up the the loops. I don't know if you
can find it.
Let me just try something really quick
to kind of show this.
There we go.
So, this is kind of like how we we see
it is you have two co-evolving loops.
I've been talking about the one on the
left, the blue one a lot about we're
improving agent, we're collecting
failures, kind of setting that to do
kind of fine-tuning or prompt learning,
but you basically want to do the same
thing with your evals where uh we're
collecting the data set of failures, but
instead of thinking about the failures
being the output of your agent, we're
actually talking about the eval output.
So having somebody go through and you
know evaluate the evaluators or using
things like log props as confidence
scores or jury as a judge to determine
where things are not confident. We're
kind of doing the same thing. So
figuring out where your eval is low
confidence and then you're collecting
that annotating maybe having somebody go
through and say okay this is where the
eval went wrong. And so it's the same
pretty much process of optimizing your
eval prompt. It's just you know I think
folks think they can just grab something
off the shelf or write something once
and then they can just forget about it.
But this loop, I've said it a few times,
but the the left loop only works as well
as your eval.
>> Sorry, I think my question is actually
way more static and basic. It's like do
you are you talking about this orange
circle as like are you building a system
or simulator for the eval or are you
just talking about like system prompt,
user prompt, eval?
>> Yeah, I think it's more right now what
we're talking about is just like kind of
the different prompts. You could
definitely do simulation, but I think
that's a whole different workshop.
>> Thank you. Any [clears throat]
more maybe questions before we get to
the bridge club? Any
switch back?
All right. Um, so here is going to be a
QR code uh for our prompt learning repo.
Um, so I'll give everyone a few minutes
to get such with that. Get it on your
laptops. I know it's a little bit clunky
to add this QR and like airdrop it. was
not sure a better way. Um I can just
show you also here if you want to find
it. Um it is going to be in our Rise AI
uh repo here and under prompt learning
and you just want to kind of clone that.
We are going to kind of be running it uh
locally here.
>> You go back to the page with the URL.
>> Yes. Sorry about that.
Oh
>> no, the page with the URL. Oh,
>> we'll give folks just a few minutes to
get
>> What do you What's your process when
you're building a new agent or work for
anything that could be evaluated? Do you
guys start by just like, oh, try
something prototype and then see where
it's bad and then do eval?
[clears throat]
>> Yeah, I think there's different
perspectives on this. Our perspective is
EOS should never block you. Like you
need to get started and you need to just
build something really scrappy. We don't
think like you should, you know, waste
time doing eval. I think it's helpful to
pull something out of the box sometimes
in those situations just because it's
hard to comb through your data. like
that's something we've experienced with
Alex of like when you're getting started
just running a test manually reviewing
like it it's kind of painful. Um so I
think that having eval is helpful but
shouldn't be a blocker. Pull something
off the shelf maybe start with that then
as you're iterating you're understanding
where your issues are then you're
starting to refine your evals as you're
refining your agent.
>> Yeah.
One last question. Yeah.
>> So it makes sense to like optimize the
system
like sub aents or commands or how are
you thinking about this like multi-
aent?
>> Yeah. So the question is is like are you
just doing one single prompt or how do
you think about this in a multi- aent? I
think we're kind of thinking that this
right now is kind of independent tasks
that can optimize your prompts kind of
independently and then running tests um
to get into like the agent simulation of
running them all together. But right
now, our approach is a little bit
isolated, but I definitely see a future
where we're going to kind of meet the
the standard of like sub agents and
everything else that's going on right
now.
>> No, I think that's pretty accurate. And
also like I mean even in a single agent
use case versus like a multi- aent use
case like ultimately like each of those
agents may be specialized. They may have
their own prompts that they need to
learn from. So I think doing this in
isolation still has benefits for the
multi- aent system as a whole that can
pass on over time in scenarios like hand
off etc and making something like really
really specialized. So I guess like what
we're talking about with like the
overfitting as well which is again like
question we get all the time but really
you want to be over fit on your code
base as an engineer. Um you don't want
to be so generalized that you're no
longer good at picking up specific works
in your code base.
Yeah.
>> All right. Everybody kind of getting to
read the mode. Okay. Anybody need any
help?
>> All right. So, we are going to be using
OpenAI for this. So, I think the next
thing that I'll have everyone do is
probably spend some time just grabbing
your API key. We'll get to it and then
I'll just kind of start walking through
our notebook here.
So, we are going to be doing a JSON
webpage prompt example. So, you're going
to find that under notebooks here. Um,
and so we'll give everybody a second to
pull it out. There's going to be just
some slight adjustments we're going to
add to this example uh just to make it
run a little faster and work a little
better. The first is um what this is
even doing this is going to be a very
simple example uh for just a JSON web
page prompts. If anybody has like a
prompt or use case that they want to
kind of like code along, Van and I are
absolutely help like glad to help kind
of adapt what you're working on to the
use case here. It's something very
simple just to kind of demonstrate um
the the principles and we are going to
be using we can definitely experiment.
If you want to swap out any other
providers that you want to use, we can
also definitely help you do that.
Um but the the goal of this is
essentially going to be to iterate
through different versions of a prompt
using a data set. Um
and we will optimize. So the first thing
is obviously we need to do some
installs. Um I am just going to have you
all update it. It says like greater than
2.00.
Uh but we're going to actually just use
I think 22 today.
And then the next thing is just to make
this run a little faster. So we're going
to run things in async which is missing.
So you can go ahead and add these lines
in the cell as well. All right, everyone
kind of follow along and I never know
want to move too fast. Seems to head
nuts. Cool. Let's talk about
configuration. So um I kind of talked
about it a little bit when I was going
through the slides. So we are going to
be doing some looping. So the general
idea is is we start out with the data
set uh with some feedback in it and
we'll we'll look through the data set
once we get it. Um, but you're going to
want to have either human evaluation.
Um, so like annotations, either free
text, labels, um, or you're going to
want to have some evaluation data. But
the feedback is really important. That's
what makes this kind of work. Um, we're
going to then, you know, pass that to
Allen to do the optimization and then
it's going to basically have eval. So as
it's optimizing, it's using that kind of
data set to then run and assess whether
or not it should, you know, kind of keep
optimizing. Um, and then it also
provides you data that you can kind of
like use to gauge which of the prompts
that it outputs um, in you know a
production setting. So we're going to do
some configuration. Um, so I've kind of
wrote out here kind of what each of
these means. So we have the number of
samples. So this controls how many rows
of the sample data set. Um, you can, you
know, set to zero to use all data or you
can, you know, use a positive number to
limit for, you know, faster
experimentation. So I think that
sometimes folks use different um
approaches here. Sometimes you want to
just move really quick so you set a low
sample. Sometimes you want to be a
little bit more representative so you up
it. Um I have it here set as 100. Feel
free to adjust. Um and then the next
thing is train split. Um so I think
folks are probably pretty familiar with
the concept here of like a train test
split, but it's just how much of the
data do we want to use into our
training? Again, that's what we're using
to actually optimize. Then how much of
it do we want to use when we're testing
when we're running the eval um on the
new prompt? Um
and there's number of rules. Uh
basically the specific number of rules
to use for evaluation. This just
determines which prompts to use. Um and
so this is like as we're running these
loops, we're outputting, you know, a
bunch of different prompts. So this is
just saying how many um we should use
for evaluation. And then key one here,
number of optimization loops. So this
sets how many optimization iterations to
run per experiment. Um and each loop
basically generates those outputs,
evaluates them and refineses the prompt.
And so these just control the experiment
scope the data splitting um just went
through the whole prompt learning loop
and and how much data we want to use. So
you can kind of just run these as you
are or if you want to adjust them feel
free. Uh and then the next step pretty
simple. We're just going to uh grab that
open AI key if you haven't already uh
set that up. So, get passage is going to
like pop up. Um I'll show you here
quick. It's going to pop up there. You
can just paste in your API key there
before we start looking at the the data
a little bit.
Just if anybody runs into any issues,
you just give this away. All right. I
think this particular
we get through this
>> I'm doing good but if you have a free
one you want to give me
that
>> I wish
>> all right let's talk about the data so
we provided data with you with queries
um you can see here that we're doing the
8020 split based off of kind of
configuration we set above I'm just
going to pull this um train set here and
let's just
>> Yeah, I run because in the minus
50
>> Oh, yep. You're right. That's a mistake
on my part.
Yeah, it is the 50. Um
let's take a look at what this data set
looks like. No. Uh just so folks can
kind of understand. Um so kind of
starting here with some just basic input
and output. Um
transcept we don't have any of the the
feedback in these rows that I printed
out here but you can imagine you can
have different uh correctness labels
here explanations any real validation
data can be whatever it is that um you'd
like it to be. Some folks use multiple
eval
feedback sometimes it's a combination
but you really want to have you know the
input and output that will use that way.
Should my output of train set be the
same as you?
>> Not necessarily. Depends on
>> I didn't know if head was sort or not.
>> It all depends on kind of what the the
same but we could look at like you know
if I did this this should be the same
for you maybe just to make sure.
>> Yeah.
>> That's what you're saying. Okay. Yeah.
Quick question. [clears throat]
Um, is it possible for the input to be
like a chat history and not just
>> Great question. So, I think it depends
on like what it is you're trying to do.
If you're doing just like a simple kind
of uh system of the input, you kind of
want it to be one to one. You don't want
to give it a ton of um like conversation
data that's not relevant to the prompt
that you're optimizing. um we we
generally just use like the single input
but I think that there are applications
that you could do like conversation
level um inputs.
>> Yeah. Because because quite often the
failure is somewhere middleation
right. So so if you put just the
original task in uh then the probability
of you hitting you know a failure in the
middle of the
>> totally. So in that case, what we
generally see is like different rows of
like having each of like the back and
forths be like kind of independent rows
because you're probably going to
evaluate each of them and um honestly
probably like get the human feedback on
each of them. So we usually separate
them out in that way.
>> But it's a good point. If you just
always are focusing on the first turn,
there's probably a lot of redundancy
there. uh you definitely will have to
like say over parts of the conversation
>> and and how we can biferate like
instructions and we have some context
also. So
>> should not touch the context. It should
only uh whatever the manipulate the
system instruction or the prompting
context it should be the static it
should not be like
based on the answer it will change my
context.
>> Yes. What you're saying is like looking
at the input there might be like a tool
volume context you're kind of passing
that in. You can absolutely include that
in your data set um so that the
application kind of understands what
other or not the application
[clears throat] but the prompt learning
um LM can understand all of the data
that's kind of like available. So you
can just have that passed in as extra
column if you want. Most people start
with just kind of input and the
feedback. Um but you can absolutely add
what other data you think is relevant
and if when for the rerunning when we're
doing the experiment of testing you'll
definitely always want to have the data
that would be required to answer
>> any even very simple some call some call
or some context it is pulling some API
call
whatever the prompt engineering it
should be based on the out
getting the output right and whatever
the context front my plus whatever the
tool call I have done API call all the
uh contact engineering and then last
finalize
>> totally yeah so again at this point
we're we're testing just like one prompt
and not that kind of end to end but you
definitely want to have everything that
like is flowing into the prompt that
you're optimizing so uh if your system
prompt takes in the user input for
example some data from an external API
you would definitely want to provide all
of that data does that make sense
Because because you're saying that like
the the like trajectories
[clears throat] the like tool calls and
what the agent's going to do depending
on what the tool call was is what you're
trying to proper to.
>> Yeah, exactly. We want to just like
because we're kind of trying to replay
and optimize one step of it. We
definitely don't want to do it
completely in isolation. So if there's
like data that flows into that prompt um
that's context that's using that's
producing the output, right? So we want
to be sure that we're including that. We
don't want to exclude anything. But if
it's data that comes like at a different
step probably not then you don't want to
do that that way. It's just like think
about what's relevant for the the step
that we're trying to optimize in this.
All right. Any other questions coming?
All right. Cool. So we're going to set
up our initial system prompt. You can
see this is something very very basic.
Uh we'll definitely I think we can do a
whole lot better than this, but I just
kind of want to illustrate something uh
that we're going to test and optimize.
So we're just saying you are an expert
in JSON web page creation. Your task is
input. And then so all these inputs that
we're seeing are going to be what we're
actually generating outputs for and
trying to optimize. Now I already kind
of touched on this. Um evaluators are
extremely important to make all of this
work, right? Um so we're going to uh
initialize two evaluators that use
elements as a judge to assess the
quality of generated outputs. So we are
using elements a judge. If you have any
other like codebased evaluations,
whatever you need to do to evaluate, you
can definitely swap those out. Uh but
we're going to do evaluate output. This
is going to be a comprehensive evaluator
that assesses the JSON webpage
correctness against the input query and
the evaluation rules. It's going to
provide an output label of correct or
incorrect. So pretty simple binary.
Again, you can use multilel. And then
it's going to have the detailed
explanations as well. Um, and then we
have a rule checker. This is a more
specialized evaluator that performs a
granular rule by rule analysis. Um, and
it examines if each rule um was
compliant.
And then both of these are going to
generate feedback that goes into our
optimization loop uh to iteratively
improve the system prompt. Um,
explanation role violations guide. Um
and we'll get to this the prompt
learning optimizer and creating the more
effective prompts. So I have some
imports here. Let's take a look at what
the actual eval output has. Um so we do
have some rules that are in um in here
wait
um they're going to be in a repo. Um so
we're going to open that as a file. We
have this llm provider and we're using
open AI here. And then we're going to do
our classification evaluator. So, uh
we're just calling it uh evaluate
output. It's al we have an evaluation
template that we're reading from the
bottom here. Um then we just have
choices correct and correct. Now we're
mapping a label to a score. Sometimes
it's helpful to be able to like add or
score. Sometimes a number is easier than
just looking at a bunch of labels. Uh it
is optional you want to map these if you
have like a multiclass use case. You can
set the scores u accordingly. But these
are just going to be our choices like
the rails that we want our elements as a
judge to adhere to. And then all we're
doing here is getting our results. I
have it doing some printing so you can
kind of take a look. So this is going to
be slightly different than what you're
seeing in the notebook. So I'm just
going to pause here. Uh if you want to
make the code changes from what you're
seeing in probably your version, this is
a a good time for that.
Does kind of the setup of the evaluator
make sense to all kind of the key. It's
going to be the rails. It's going to be
the output. Uh and of course our
template. [clears throat]
>> Yeah, you will want to grab your own uh
OpenAI key here uh to set
[clears throat]
>> and we can help you if you want to use
different provider. We can help you swap
this out like that is helpful to
anybody.
>> Okay, I'm going to start walking you
through the output generation. So uh
this is just kind of you know you can
imagine this as your own agent logic or
the the part that you're kind of
testing. Uh this is just going to
function that actually generates the
JSON u outputs. We're using for one here
with JSON response format zero
temperature for consistent outputs. Um
it's taking a data set a system prompt
generates outputs for all rows returns
the results for evaluation. Um and it's
called during each iteration to produce
output. So this is like our
experimentation function that we're
writing. So as we're passing in data,
it's producing new uh prompts. We need a
way to test it, evaluate, understand uh
how we are kind of moving the needle
here. So that's all this is. So it's
pretty straightforward function just
called generate output. We have that
output model. Again, we're using OpenAI.
If anybody wants help switching things
around, happy to help. Uh we are using
response format because we are dealing
with JSON here. So uh we know that what
you just prompted. I mean some of the
the newer models are decent at it, but
using response format is really helpful.
And then we're also setting temperature
to zero. Um, and here is just kind of
where we're passing all the data in. So
the data set because again we want to
run this on all of the the testing data,
the system prompt that will be input. So
as we get to the optimization loop,
we're going to be passing in a new
prompt to this with the data set and
then evaluating. Um, we have our output
model that we've already passed,
concurrency, all that good stuff. And
it's just returning all of the outputs
there. Would you for the uh the current
generation of models since this one's
basically like in in AI terms ancient uh
would you like still recommend setting
the temperature to zero or would you
actually want to try to encourage some
of the creativity to like
>> I think it depends on the use case a
little bit and what you're you're trying
to do. You can definitely experiment
that and kind of take it through the
lens of how how important is consistency
to use something like I feel like JSON
web page I feel like consistency
probably like temperature zero makes
sense but I definitely think not for
every agent every use case do you want
to use zero
any other questions get moving all right
additional metric so we kind of talked
about before that we are kind of using
some score mapping uh this part is
optional you want to use the metrics
that make sense [clears throat] to you
we're not directly using this um as like
we are kind of like using it to know
whether or not we optimize but it's not
like we're you know using this as our
sole kind of indicator for the success.
Uh here we are just going to calculate
some very basic metrics. Um it's just
you can you know choose something like
accuracy, F1 precision, recall just some
basic kind of classification metrics for
us to understand and because we are
using binary mapping scores we can do
that. Um and so that's what you're
seeing happen here. We're mapping to
binary and then just based off the score
we calculate the metric. So very simple
uh helper function here.
All right, the good stuff the
optimization loop. We made it. Um okay,
so this cell implements the core prompt
optimization algorithm. It's a
three-part process. Uh so we want to
generate and evaluate. So generate
outputs using the current pump on the
test data set and evaluate their
correctness. Uh we want to train and
optimize. If results are unsatisfactory,
generate uh outputs on the training set,
evaluate them, use the feedback to
improve the prompt, and then iterate. So
we kind of want to repeat until either
the threshold is met or all the loops
are u kind of completed. So if you
remember above, um we're kind of setting
that to just like five loops. Um and
then you know we can kind of repeat um
based off of that or the thresh met um
it's going to track metrics across all
the iterations. So turn to detailed
results including a train test accuracy
scores the optimized prompts and the raw
value. So as I kind of mentioned at the
beginning as we're running these
different loops on the experiments we're
going to be producing a lot of different
prompts. Um and so we're kind of getting
that information back that you can use.
Um and these are our key parameters.
I'll kind of go through them, you know,
as we get to the code, but just to give
you a heads up. Uh, this is the target
accuracy score to stop optimizations.
Um, it could also be whatever other
metric you'll see, we have a score so
you can kind of determine the number of
loops of the optimization iterations.
We've set that score and then the number
of rules. Again, these are some
configurations we've already set.
Um, cool. So, optimization loop. This is
um going to take in all of those um you
know parameters that I've mentioned
there. Um it just kind of kicks off
saying hey we're starting um it's going
to do the initial evaluation so we
understand uh how things are starting
off. Again you can kind of pass in data
too. You can kind of skip this initial
evaluation. We're kind of running it um
at the start here. But if you were
running production setting, you might
already have evalu.
Um, and then it's going to assess the
threshold against kind of our initial
valuation. Again, this could kind of be
skipped when we're coming from a
production setting, but wanted to kind
of start us off from scratch so that we
can get a real feel for this. Um, and
then it starts the loop. So, we're
generating output. Um, it's setting that
as the train output. So, when I printed
train, you kind of saw the outputs. I
kind of skipped ahead there. Um and then
it also will set um you know
correctness, explanation, any rule
violations. Um and then we'll actually
use our prompt learning optimizer. So
this comes with like the SDK uh the prop
learning SDK that you can use um with
the rise. Uh so we're sending in that
prompt optimization the b choice um and
then that API. So under the hood as we
talked about in the slides taking in
that feedback um taking in the original
prompt and trying to optimize to get
better results and then spinning out
prompt um and then can also add an
evaluator. So again those three um kind
of feedback columns we're looking to get
back as correctness explanation for that
if there are any rule violations and
then from there we just kind of kicked
off the optimizer and optimize with our
train set output those feedback columns
again and then you know any context size
limitations you want to add um next step
so the optimizer again is going to take
our data produce a prompt we want to
evaluate so we understand how we're
doing what this code block doing is
doing here so trying to get that new
prompting again with all those details
getting our result and then we do that
with our test set as well and then we're
getting back like our score and our
metric value and then doing the checks
and then we repeat it all again till we
either get above our threshold or we've
hit the max number of loops and then
returning our results. So that's kind of
what's going to be happening other here.
Any questions on that?
uh just some result saving function more
helper functions here. So we do want to
obviously save all these results. We
don't want them just be ephemeral that
we can't ever access again. So just
saving them all. Um you can also save
all the single experimentation so you
have all of that data towards the end.
We'll be able to kind of pull this um
and determine what the best prompt is.
But these are just very basic helper
functions. I don't spend too much time
just saving them to CSV at the end of
the day. Now we execution it. Um, so
this cell runs the prop optimization
experiment, saves the results. We're
getting the JSON format, the CSV format.
Um, it includes calls for the iteration
number, the number of rules, test,
train, accuracy scores, all the data
that we're actually going to need to
evaluate uh whether or not this thing is
successful, and then we're going to
start getting uh results here. So, um,
this does take quite a while to run. So,
we'll run and I think this will be a
great point for discussion, but as you
kind of are running it, you're going to
start seeing the different loops. um
kind of outputs coming out as well. Um
and yeah, we'll just kind of like work
through it as it it runs. It's probably
going to take like 20 30 minutes for
things to run, but um happy to take any
questions and help anybody out as they
run into issues.
>> One thing, can you scroll back to the
part of code that we needed to change?
>> Change. It's gonna be
>> So, one reminder
are running into this. I don't think I
was
>> for this line here, like when you're
doing install, you do want to be equals
2.2. Um, because I think there's a a
little bit of a package issue. Um, so
just make sure that's you're hitting
errors with the eval. If not, let me
down and try to fix it. This is the
reason why [clears throat]
>> uses like a generic evaluation.
>> Yes. And you can kind of see the
evaluation problem if you go to the
We've kind of just taken that part out
of this, but we can definitely go
through that. Um so if you look here um
on this line here, we're reading in
um under prompts here,
you can find the evaluation if you're
curious. [snorts]
And this is the reason why everyone
hates on docker. This is why we use
all.
>> Yes, absolutely.
>> The notebook.
>> So I would also recommend uh patching
your code with nestio if you haven't
already. Helps it run a lot faster. Also
for the purpose of the workshop um I
switched our loops to one. uh that took
me six minutes to run. So would
recommend also doing that instead of
having five. Obviously wouldn't
recommend doing that when you're
actually optimizing your prompt, but for
now it'll help you get through the
workshop.
>> All right, I just want to kind of call
out the the last little bit here. Um
>> the last step
before folks, let's see. Okay. Um so the
the last little bit of code here um is
just to extract the prompt that achieves
the best test accuracy. So I mentioned
how we're kind of like saving up all the
results to use. Uh we just have a
function that essentially gets the last
or the best uh version of that kind of
showing you the original and then the
best optimized version uh which you can
then use to kind of [clears throat] pull
and put into your um code. I did want to
kind of just give one kind of call out
as you kind of saw today can be a little
bit um difficult to to manage and so I
want to call out for those of you who
are kind of maybe looking for more of
like an enterprise solution to this in
Arise uh you do have these prompt
optimization tasks. Uh you can have your
prompts living in our prompt hub um data
sets with all of your human annotations
or ebal that you can either create from
traces or just by ingesting it into
Arise. Um, and then from there, all you
really need to do is like give it a task
name, choose what you want your training
data set to be, where the output lives,
where all your feedback columns are. Uh,
you can adjust all of the parameters uh
that you'd like. And then from there,
you can just like kick it off and it
will produce an optimized prompt in the
hub for you. Um, so if I go over here, I
think I have some. No, maybe not.
it will basically just create a new
version here that says it's optimized
prompts with all the results and we are
building on this so you can add all your
ebots to it have that all running in the
loop but just wanted to call out that if
you're not interested in maybe
maintaining code loops and having to
build uh like a task infrastructure
yourself it is something that we do
offer in Arise um but yeah hopefully I
know some folks are hanging out we'll be
sticking around here for a little while
as we um can help you kind of work
through issues But uh thanks so much for
joining us. Um hopefully you learned
something useful.
[music]
[music]