Shipping complex AI applications — Braintrust & Trainline

Channel: aiDotEngineer

Published at: 2026-05-01

YouTube video id: ZdheJTfLu-s

Source: https://www.youtube.com/watch?v=ZdheJTfLu-s

Good afternoon, everybody. Welcome to sunny London. Is it everyone's first time here? I think it is, because this is the first conference. Amazing. Well, thank you very much for joining today's session. Hopefully you're in the right session, but for those who need to double-check: this is a hands-on workshop on delivering quality AI applications with Braintrust, and we'll also be partnering with our colleagues at Trainline, who'll introduce themselves shortly. You're probably wondering: who's this guy, does he even go here, what's his name? By way of introduction, you can call me Giran, like the band Duran Duran but with a G. So hopefully folks here are Simon Le Bon fans. I've spent a good part of my career helping organizations and enterprises scale adoption of mission-critical systems, now moving into the age of AI, and I have a background in formal mathematics. So I know all the rage of going from machine learning to data science and now AI engineering, and this comes at a very topical time. Feel free to connect with me on LinkedIn if you want to talk, or just have a general chat around this area. I'm also joined by my two friends over from Trainline. If you want to come over and introduce yourselves.
>> Of course, if the mic is working. All good. Yes, hello everyone. Thanks for coming to this workshop, especially after that lunch break and the sunny weather outside. Thank you very much for coming here. My name is Usama; I'm a senior AI/ML engineer at Trainline. For those who know me (yes, one person), I was a staff platform engineer before, I have a background in computer vision, and I was also doing mobile apps on the side. So nothing to do with AI at some point, but now we are doing AI together with Mayan at Trainline. Over to you.
>> Hi everyone, my name is Mayan. I'm a senior AI engineer at Trainline. I have a research background in LLMs, but my research is from the pre-big-LLM era, so think BERT or the older language models. We're continuing to build state-of-the-art agentic products at Trainline. If you have come here from abroad, I'm sure you must have used Trainline; if not, please do, because it's the state-of-the-art app for buying tickets and other things. So, welcome. We're very excited to host you at this workshop, and we'll guide you through the hands-on experience.
>> Fantastic. I'm also joined by some of my colleagues from Braintrust, coming over the pond: Phil, Eric, and Rose, if you could just raise your hands. So don't worry, folks, you're in safe hands if you're ever stuck. Just holler and they can help.
Fantastic. I just want to do a little bit of housekeeping as well. If everybody could join the AI Engineer Slack: there's an AI Engineer organization, but there's also a specific Slack channel which we'll be using today to help progress the workshop. So if you are stuck, we can use the Slack channel to help each other out. We're already on there, but as we progress and get to the hands-on elements, if you are stuck, this will really help. We're also providing a cheat sheet, so if you encounter any particular hurdle, there are step-by-step instructions which help you get to where we need to go in the workshop without feeling like you're falling behind. Again, a lot of the assets we share today are publicly available, and we can do any particular follow-ups if needed. But we'll give you a few minutes to make sure you get onto that, and then join the channel: AI Engineer Europe 2026 Braintrust workshop. It's a public-facing channel.
All righty, team. I guess we can start. Just for the people coming in at the back, we'll need you to join the Slack channel. Thank you, everybody. Let's proceed. Okay, so just to help orient today's workshop: we'll be breaking it into three main sections. We'll spend a little bit of time establishing the background, why we're here and why this workshop is relevant; hopefully that sets the context for when we go into the workshop itself, building the system and talking about how we ship AI quality. Then we'll wrap up with the key takeaways, and we'll be around to answer any open questions from the field.
Okay. So today there are going to be a lot of people from very different backgrounds. We intended this workshop to be catered broadly. Probably everyone here knows what an LLM is, and I don't want to insult anybody, but hopefully you're starting to explore your journey in terms of maturing your operations for building AI systems. So whether you're an AI product engineer, probably on an applied team, or you come from traditional machine learning, or maybe you're in platform, operations, or infrastructure, this is really appropriate for you. Just a show of hands here: who here comes from, let's say, a traditional data science background, perhaps? Okay, interesting. Who here perhaps comes from software engineering and is now pivoting into AI? Okay, so this is definitely the right room for you, and hopefully we'll be able to accelerate things as we progress. Okay, on to the context here.
I think this is not an uncommon experience, which we are seeing more broadly across the industry. Again, a quick show of hands: who here has done a machine learning or AI PoC, let's say more specifically a generative AI PoC, but then failed to take it into production? Okay, quite a few hands. Yeah, I'd be worried if no one had. This is the key thing: speaking to many of my customers, executives, top-level folks, it's all the rage. They're thinking about this new technology. It's not necessarily new, but it's newer for a lot of folks, especially in more enterprise and regulated industries, and they're trying to use it to deliver value to their customers. But unfortunately, there's a big hurdle between what you might develop locally on your machine and industrializing it, making sure it works in anger. The key thing we've seen from all of the research out there is that it's not that the models aren't smart; we've got very sophisticated models, whether you're building something in-house or using a top off-the-shelf commercial LLM provider. What we do see, broadly speaking, is that the operational rigor for delivering these systems at scale has not kept up. Traditional software engineering is very deterministic: one plus one equals two, great. LLM systems, as my ten-year-old, sorry, five-year-old, would say: two plus two equals ten, daddy. So it's about adjusting for that and making sure we're delivering at scale.
And again, more holistically, what we're seeing when shipping these things is that people think a demo state is sufficient. But clearly, doing two, three, five demos is great; put it in production and everything goes awry. Or treating a few logs as observability. This is really critical to how we work at Braintrust: logs will tell you what has happened, but sometimes you need to go deep into the system and understand its behavior, and that's where observability comes into play. Something else as well: it works on my machine, fails in production, I try to patch the prompt, and then it's operational until the next issue or the next failure mode. But how do we keep track of that, especially if you don't have a system in place? Irrespective of what tooling you use, this has a categorical effect. So a lot of what we see is not to do with the tooling or technology; it's down to operational workflows, and this is really what we're aiming to help you with in today's workshop. Okay, so as I mentioned, it's not about the prototype. It's getting to a state where we know exactly what's changed in the system, how we interact with that, and how we systematically put a set of rigor in place so that we can get better and better.
Remember, our target is not 100% coverage. It's getting as close as possible while fixing the gaps that might have existed.
And something that we see time and time again: one prompt might work, but as you move on and industrialize it, you'll probably want to do things like breaking it down into individual sectors of responsibility. If you do come from a software engineering background, you know about this: you're breaking down the monolith into microservices. We'll outline a very similar approach here when we talk about building these systems. Again, making sure that we understand these changes and putting a set of systems in place: this is really what we want to do in today's workshop.
Okay. In terms of today, as I mentioned, this is a hands-on workshop. So we're going to be going into the terminal, going into the UI, and going through it step by step, guiding you along the way. We'll build a staged AI system with multi-stage tool calling, which really lets us see this more agentic flow. We will then use Braintrust to instrument it and see how the application is performing. We then also want to take a look at identifying failure modes using a golden set, so we'll push that through. We also want to talk about how we industrialize it: moving from "hey, it just works on my machine" to something you can use in production and manage going forward with a system in place. And a key thing is identifying those edge cases, because you can create a test dataset, but ultimately there's no substitute for real-world data. So we'll show you how we take those real signals in, evaluate, and complete the loop.
Just a bit of introduction here as well. Who here has heard of Braintrust or played around with Braintrust? Can I get a show of hands? Okay, great, fantastic. Love this. So, a bit of introduction to Braintrust. We're a company that's just shy of three years old, and we're a Series B company: we announced a few months ago that we raised $80 million at an $800 million valuation, with investors such as ICONIQ and a16z, as well as Greylock, to really help organizations ship quality AI at scale. So we're the platform for AI observability. We've got a heavy user base globally, but we're also expanding our presence heavily in Europe. I'm one of the first engineers to join and help build out our go-to-market function here, and we're very excited to do that with our customers and our friends over at Trainline. Some of our other local customers include Lovable and Doctolib, which are really pushing the forefront of these AI systems.
When it comes to using Braintrust, I want to get you hands-on with it, but where we really distinguish ourselves is being able to do this at scale. Our founder, Ankur Goyal: this is actually his third time building a company, so he's an expert when it comes to database systems. He previously built a company called Impira, focused on document extraction, which was acquired by Figma, where he led the machine learning team. And he realized: building these evaluations is hard. Understanding production traces is hard. So if we're having this issue, there must be other organizations out there facing the same. And so he founded Braintrust to really help do this at scale. And as these traces come in, being highly semi-structured data that changes over time, he realized that traditional analytical systems weren't fit for purpose. So we've created a new category of database called Brainstore, which really helps identify and accelerate this at scale. With us, we're tool- and platform-agnostic: irrespective of the agent framework you're using or the LLM providers out there, we intend to help you deliver value. Okay, one thing I will talk about as we progress in this workshop is the concept of a flywheel. If you've ever come from agile development, you know perfection is the enemy of good; we want to start somewhere. So even if it's a new application and you don't know how it's going to behave in production, we can start with an evaluation set. If you have an existing application, instrument it; great, we can pull that information in and identify the failure modes there. So the key thing is: get information into the system, identify those failure modes, remediate, ship it out, then monitor, and complete the cycle again and again until you get to where you need to be. All right then, you've heard a lot from me. So one thing I'll do now is hand over to my colleagues at Trainline to share their experiences from before Braintrust and how it's helping. Over to you.
>> Thank you very much. And hello again to people who are joining us just now. As my mate Mayan introduced, at Trainline we are a company that's actually helping people get on trains. Trains are different from planes. If you don't know, there is a worldwide central system for all planes around the world; that's not the case for trains. And in Europe and the UK, I would say it's very hard: if you were to install an app for each carrier, it would take up the whole space on your mobile phone, for sure. So Trainline is basically that platform, to help you book tickets: mobile-app agnostic, platform agnostic, carrier agnostic. You do it in one app, and you can book a train from Paris to London, or from Lyon to Milano, whenever you like, with every carrier in the EU. We sell almost £6.3 billion in train tickets, with 27 million active users and counting. And the other interesting number, probably, for this conference is how many AI conversations we have with our travel assistant. We do have a travel assistant that is exposed to customers, and it's not just a chatbot: it's actually a multi-agent system that can handle refunds for you and can handle changing trains for you. So it is a very proactive agentic system, I would say, and not just a chatbot. Mayan would probably like to add more about that.
>> So one of the benefits of having 27 million active customers is that you've got a huge space in which to serve agentic applications live to customers. One example of that, which Usama talked about, is the travel assistant, which you can get to from a ticket window in the application. It's an agentic system, which is something we want to talk about a little bit later in the slides.
>> Awesome. Which brings us, I would say, to the next point; we will keep it short. Selling train tickets and being a train tickets company, how come we are doing machine learning? Of course we can: there are so many things to do to help people with their journeys, getting their tickets, getting their trains, getting back home. Basically, we do two things. There's the classic ML part, which is actually building models: we build ML models inside Trainline from scratch, from data to model. And we also do the multi-agentic, generative AI systems that we are now all familiar with, on top of the LLMs we love and cherish, with all the tooling and context engineering and all of that. So we do both sides of the story at Trainline, and these are two examples of what users are actually using on top of those systems. On the left side: you can think of it as your weather application, but for train disruptions. So basically, you have a ticket for a train, and we know if this train will be disrupted or not, if it will probably be late or not. We know that based on the huge amount of data that we have; on top of it, a machine learning model was trained that can actually predict train disruptions, lateness, and all of that. So this is the classic ML part. The other one is the travel assistant that I told you about. As I said, it's a very advanced multi-agent system. It can show you alternative trains if your train is canceled or something is wrong with it; good luck doing that yourself, even with ChatGPT. And the other one is handling refunds. So you can actually get a refund on your ticket if your train is late, and all of that. It can give you all of that without a handover, and you can also do the handover to actual human customer support in our lovely customer support team.
Which means that if you are doing this in production, at Trainline scale, you can ask the question: are we breaking things at Trainline? Of course, we don't want to do that, and we are moving fast, because the technology is definitely moving fast. And this is why we are here: to show you what we are doing to move fast without breaking things, in terms of AI of course, and we do that on top of handling the complex software systems that we love and cherish, from APIs at scale to serving millions of users. So how we like to think about it is as a spectrum. We know for sure that whatever we have as software systems is the deterministic side of the story, and on the other side, building the ML models is the non-deterministic side of the story. And we know for sure that agentic systems are in between: there are parts of them that are deterministic and parts that are definitely not. And this is the framework of thinking that we have in terms of quality. How are we handling that?
For the ML models, we care about the quality of the data that we use for training, but on top of that we also have machine learning evaluations, whether offline or online. Offline means that before going to production, you need to do your evaluations; online means you take the data from production and evaluate whether your model is doing well or not. In our case, for instance, for that disruption forecast: is it predicting the state of the train, whether it's disrupted or not? Is the model correct in its prediction? So that's one example. On the other hand, for people who are familiar with software engineering, we do all the usual quality checks, with the diagrams and the tooling that we have for handling quality at a very large scale at Trainline. And for the agentic systems, I think you already guessed it: it's basically a combination of both. It's not one without the other, for sure. We do everything we do quality-wise for deterministic systems, but we also use the evaluation side from the non-deterministic systems. So that's the framework of thinking that we have at Trainline for these systems, and Braintrust is definitely helping with that. By the way, a disclaimer: we are here just because we are convinced that Braintrust works for us. We are not paid, that's 100% certain; we are happy customers. We have been with Braintrust for a long time, and we actually use some of its features here. For instance, for the AI evaluations, we follow the scoring of the travel assistant on many levels, from tone of voice to actual helpfulness when it comes to tickets. And tickets are really complex in terms of reasoning: what you should get and what you shouldn't get depends on whether the train is late or not, whether the ticket is a return or an advance, so many complex cases. But yes, we do follow that evaluation side. Mayan would probably like to add more about that.
>> Yeah. So can I get a raise of hands from people who have struggled with LLM costs, number of tokens, switching models, problems like this? Right, so yeah, it's nice to see. It was a problem at Trainline as well, because we do this at scale, and the amount that we pay OpenAI and Anthropic is just bizarre. So we have to keep switching models: which is the best model for our use case, cheaper models, models with efficient token usage. Now, any time you want to switch models, you want to make sure the new one is performing at least at the same level as your current model. Before Braintrust, we had no specific way of doing that, because we didn't have scores set up; we didn't have the evaluations. With Braintrust, what it has enabled us to do is simulate what the performance of the cheaper model would look like. We've used Braintrust extensively to run offline evaluations and see what the effect is going to be, and also online evaluations to confirm that the effect we observed offline holds as expected. So that's one of the use cases. The other one is more generic: it would have taken us a lot longer to evaluate a new feature shipping into the travel assistant, but with Braintrust we have been able to make sure that the new experience for the user is going to be good. So Braintrust has helped us a lot in shipping fast.
>> Cool. And the other part is observability, for sure. We do use Braintrust for observability, if this slide is working. Yes. So this is a real example from Braintrust; we're spoiling a bit of what you're going to see in the workshop. But we can track everything regarding tool calling and the other agents, in terms of numbers and in terms of quality. Those kinds of insights are what you'll need to get from proof of concept to an actual system in production at your company, something that's actually used out there, where users find it a genuinely working product and not just more AI-generated slop. We definitely need that for those cases. Would you like to add something here?
>> I was just going to say that Braintrust enables you to look inside complex agentic workflows, down to the tool-call level, the token level, which is very insightful and helps you debug a lot of things in production, and before you deploy.
>> Yeah. And one last thing, probably, is the cross-functional-friendly point. That's something we discovered along the way. Because we are building those systems, from the travel assistant to the models, and we are a big company: we have people from product, people with non-technical backgrounds, and we need to communicate and share many things, and they also need to self-serve some of those things. We cannot babysit people and say, hey, you should do this, let me get you the logs, download them and send them over; that does not work at scale. We need a way to work cross-functionally and let people be free to do what they need and self-serve with data and insights. So this is what we also discovered along the way, and Braintrust helps us with many requests that we had; we really appreciate it, for sure. And of course, we are using just a part of the system, we have our own things, and Braintrust, I think, is building even more tooling, so definitely more to come. I think the workshop will definitely help you discover all those things. And of course, have fun building during the workshop. I'd say laptops out, and if you are interested, Trainline is hiring on the AI engineering side, of course. Back to you.
>> Perfect, thank you so much for that, team. And just to say, as an impartial thank-you on behalf of my wife: thank you so much for the SplitSave functionality, so kudos for that. All right then, team, let's proceed with the setup. Again, this is going to be a hands-on workshop, so you'd better dust off your bash skills. Jokes aside, I've got a nice set of wrapper commands to help out with that.
So hopefully everybody, or at least most folks, have joined the Slack organization for AI Engineer and also joined the channel. I've put a link to this, but there's also a QR code to the repository which we'll be using today, so I'll give it a few minutes. The really key things are: signing up for a free Braintrust account (if you use Gmail, you can use the little plus-sign trick to create a new account if you're an existing Braintrust user, so hopefully it doesn't pollute your existing one). You will also need access to an OpenAI API key; this should be easy for you to generate. If for some reason you're not able to generate a key, please let my colleagues know, and we'll DM you one on Slack to use for this specific session today. And obviously, as an engineering conference, especially in AI: if you are using AI coding assistants, feel free to use them within your IDE or in the terminal to guide you along the way, ask questions about the code base, and so forth. I've tried to simplify a lot of the scaffolding in place. I'm using mise to manage Node and pnpm on the machine, but alternatively you can download Node v22, the specific version there; it should be fine. Again, just to cater especially for this workshop, I've pinned that specific runtime. I'm also using make as a bit of syntactic sugar to wrap the commands. If for some reason you don't have make installed, say you're on a Windows machine, you can just run the pnpm commands, which are wrappers for the package.json scripts.
>> So yeah, we'll give it a few minutes for folks to scan and clone. Just worth pointing out as well: in the cheat sheet today (I can see lots of folks have already gone through it), we provide a step-by-step guide. So as we proceed with the workshop today, if you are stuck, you can ask questions on the Slack channel, but also refer to this. As mentioned, this is a public-facing asset, so even if you're stuck outside of this workshop, you can go through it at your own leisure.
Just maybe a show of hands: who here is not able to get an OpenAI API key? Hopefully folks are able to provision that; it's really critical to this.
>> Yeah, some people are still joining.
>> Wait, we don't have the QR code
>> for the Slack.
>> For the Slack. Yeah,
>> I can do that. Yep.
>> Okay, folks. I think we did have a few late starters. So, yep, I don't want to spend too long on this, but if you did join late, please scan that QR code, which will give you access to the AI Engineer Slack organization, and then join the channel: AI Engineer Europe 2026 Braintrust workshop. It's a public-facing channel. Through that channel, you'll be able to get access to the pinned content: the repository, as well as the cheat sheet which we'll be following. Just a bit of a sense check, folks: hopefully most of you have been able to join the Slack channel, and are able to clone the repository and have access to that.
>> All right then. As I mentioned at the start, in the agenda, what we'll be doing is creating an example support triage agent. Hopefully this exemplifies a lot of systems that folks may have already built or are trying to build. It is a fictitious application designed specifically for this workshop; please do not use it in production. It's really just here to teach us how we build these AI patterns for scale. The idea is that, given a ticket raised in a particular system, we have a set of agents that go through (I'd say not quite a pipeline, but a staged process with tool calling) and produce a set of information that can be emitted into downstream systems. Okay, just to help visualize what the system does: we take the ticket input, and we have a first step which collects the context; it's quite a deterministic way of extracting information. We then proceed with the agentic portion, which we've broken down into three stages. We have an LLM with tool calls to triage that particular issue. We then have a policy reviewer to make sure the output is correct. We then create and draft a customer-facing reply with the reply writer. And finally we package this together, and depending on how severe the particular ticket is, we can invoke another tool call to decide whether we need to escalate to, let's say, a human in the loop, and then draft the final result. So, not too complex, but it gives you an idea of the system that we'll be trying to build and operationalize in this session.
One thing to point out as we begin to use Braintrust: where does it fit into helping you deploy and manage this at scale? I've isolated that here. Everything we're going to be doing, especially towards the end as we progress, we'll trace end to end. As my colleagues talked about, we'll be able to use managed mode for prompts, offloading what you'd do locally into a secure environment. Tool calls will be managed as well, so again, that's how you reach external systems. And lastly, evaluations and scores, which will run through our Braintrust infrastructure. It's worth pointing out as well that if you go into the GitHub repository, there's a GitHub Pages version of this too, so the slides will be readily available for you to use and consume.
Okay, a key thing as well: I've tried to help you do each phase and checkpoint it. So if you are stuck, the idea is to just use git checkout to a specific tag or branch; each individual tag or branch is fully runnable at that stage. So if you're ever stuck, git checkout that branch, run make install and make setup, run the commands, and you should get near-identical output every time. Again, this is all documented in the README, but it's also included in the cheat sheet for you to progress.
Okay. So, just to give you the sequence of events: we'll do the scaffold and setup, then I'll talk about building a basic agent. This workshop isn't really designed to teach building agents; it's about, okay, once you do have an agent, how do you operationalize it? But for completeness, we wanted to do it step by step. Then we'll get into the core of it, where we'll add the tracing, talk about the evaluations, talk about the golden set, and identify where we can improve. Then we'll tie it all together, using managed infrastructure within Braintrust to help operationalize this, and talk about the collaboration that my colleagues at Trainline mentioned. And also the key thing: once you do identify a production failure, how do we apply a fix, complete that flywheel, and give you a finalized asset?
Okay. So step one: we're going to be building the agent. In this checkpoint, again, we need to start somewhere, right? So what do we do? We make an initial call to an LLM: we set up a prompt, one shot, one in, one out, and get an output. If you're doing a proof of concept, this might be great as part of that initial spike, but we know there's work to be done. For context, we're just going to step through this; as I mentioned, just because it works in the demo doesn't mean it's necessarily going to work in production. I've put some pseudocode here to articulate it. As with all these language models, we provide a system prompt, we provide the user message (specifically the ticket text), and we pass an output back into our application. So: one function, one model call, and the structured output which we want to receive.
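To make that concrete, here's a minimal TypeScript sketch of the one-shot call, assuming the OpenAI Node SDK; the TriageResult shape, prompt wording, and function name are illustrative, not the workshop repo's exact code:

```typescript
// Minimal one-shot triage call: system prompt + user text in, structured JSON out.
// Assumes OPENAI_API_KEY is set in the environment.
import OpenAI from "openai";

const client = new OpenAI();

interface TriageResult {
  category: string;
  severity: "low" | "medium" | "high";
  reply: string;
}

async function triageTicket(ticketText: string): Promise<TriageResult> {
  const response = await client.chat.completions.create({
    model: "gpt-5-mini",
    messages: [
      {
        role: "system",
        content: "You are a support triage assistant. Respond with JSON: category, severity, reply.",
      },
      { role: "user", content: ticketText },
    ],
    response_format: { type: "json_object" }, // structured output back to the app
  });
  // One function, one model call, one structured result.
  return JSON.parse(response.choices[0].message.content ?? "{}") as TriageResult;
}
```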
Okay. What we'll do as well, once we do the scaffolding and check out the first branch, is run a set of pre-built tickets, already done in code, or you can use the command line. I'll demonstrate shortly what that looks like with the outputs; you'll get something similar in JSON format. Okay.
So, hopefully everyone can see this at the moment. I have the application here. Okay, I'm just going to check out the basic agent tag. If I take a look here, I'm now checked out at that first tag: building a basic agent. I've got the application here, which is again very similar to the prompt pseudocode. We're creating the client, calling the OpenAI SDK. For brevity, I didn't use any agent SDKs, but we fully support those as well, as we'll see as we proceed with the workshop. This is all available, and you'll see again there's a Makefile; if you don't have make installed, you can just execute the pnpm scripts, or if you use npm or yarn, that's also possible. I haven't tested it, but theoretically it should work, because it's just using package.json.
So in this case, let's say I want to do something like make ticket. I'll give it something like: "my password needs to be reset." I've provided some defaults, so I just hit enter, enter, and now it's making a call to OpenAI. As you've probably seen from the environment variable file, I'm using GPT-5 mini. You can switch it if you want, but for the purposes of this, we'll keep it simple. So you can see here I've got the ticket, and it's produced some output as well. But again, that's a single shot. Okay, so based on this output, it looks fairly plausible, but it's not going to account for a lot of the edge cases we want, especially if there's a lot of organizational nuance we're trying to build into the logic. I can even do things like make demo, which runs a script. It's the same thing; I'm just calling the same function, but the output is codified as JSON here, so the JSON fields are available to see.
Okay, so that's fairly straightforward; I don't want to dwell on it too much. The next thing I want to do is talk about adding local tools. We probably want to say: look, let's try to make this a bit more deterministic. Even though a prompt might be very well structured, I may want to bring in different ways for it to operate. In this case, I'm calling three different tools: to look at relevant help desk articles (which could be both internal and external), and to look at certain things that have happened on the account. Let's say a customer has done a certain migration, and that might have an impact on backend systems; that's probably a reason why I'd want to create an escalation here. In this case, I've made it a bit more deterministic: for the purposes of this workshop, the tools are treated as code, but in reality you're probably going to be interfacing with external systems like vector search, MCP, CLIs, and other types of interfaces to build out this capability. And a key thing here: the more things you add, the more ways it can fail. So again, this is why tracing, as we go, will become more and more important.
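Continuing the client from the earlier sketch, tool calling with the OpenAI SDK looks roughly like this; the tool names and schemas below are assumptions based on the description, not the repo's exact definitions:

```typescript
// Tool definitions exposed to the model. In the workshop these resolve to
// plain local functions; in production they'd front vector search, MCP, CLIs, etc.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "search_help_articles", // hypothetical name
      description: "Look up relevant internal/external help desk articles",
      parameters: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
      },
    },
  },
  {
    type: "function" as const,
    function: {
      name: "get_account_events", // hypothetical name
      description: "Fetch recent account events (e.g. migrations) that may explain the issue",
      parameters: {
        type: "object",
        properties: { accountId: { type: "string" } },
        required: ["accountId"],
      },
    },
  },
];

// Passed alongside the messages; the model may now respond with tool calls
// that the application executes before asking for a final answer.
const response = await client.chat.completions.create({
  model: "gpt-5-mini",
  messages: [{ role: "user", content: "My password needs resetting, account locked" }],
  tools,
});
```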
Right. So, back to the README: the next thing we'll do is check out the add-local-tools step. In this case, the tools that I created are available here, but again, they're just checked in as code for simplicity's sake. And I can do the same thing: make ticket, say "my password needs resetting, account locked." Okay, now you can see it's provided a little bit more information. It's a little more verbose, because we've introduced tool calls and given more context to the LLM. Also worth pointing out, if you are feeling stuck, folks: a lot of these workshop tags or branches are built sequentially. So if you go to, let's say, tag number six, it's going to include everything up to that point. So don't feel like you have to go through each one; if you're feeling stuck and want to skip ahead, you can do that as well.
Okay, on to the tools. I already showed the code anyway, but this gives you an idea, through pseudocode, of what it looks like.
Stages. I think this is the next step, where again we're breaking down that monolithic LLM call. We've already introduced tools, and the next thing is to drill down even further into specialist stages of how the LLM should behave. You'll have seen from that sequence diagram that we've got effectively five stages, where I'm setting things up to collect the context, triage, determine if it's meeting brand policies, provide a customer-friendly reply plus something internal for our systems, and then finalize the result for downstream systems. Again, this comes from traditional software engineering: you break down your problem so you can see exactly where something's going wrong in the stack, and then you can remediate. So where possible, try to be more explicit and break it down into challenges you can work on.
So yeah, just a bit of pseudocode; this is roughly what it looks like, and a sketch follows below. You could probably put a debugger on it within your IDE and take a look.
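As a rough shape, the five stages reduce to sequential async calls; the stage functions here are hypothetical stand-ins for the per-stage prompts in the workshop repo, not its exact identifiers:

```typescript
// The five stages as plain sequential async calls.
type Stage<I, O> = (input: I) => Promise<O>;

declare const collectContext: Stage<string, object>;   // deterministic extraction
declare const triageSpecialist: Stage<object, object>; // LLM + tool calls
declare const policyReviewer: Stage<object, object>;   // checks output against policy
declare const replyWriter: Stage<object, object>;      // drafts the customer-facing reply
declare const finalizeResult: Stage<object, object>;   // packages + decides on escalation

async function handleTicket(ticketText: string) {
  const context = await collectContext(ticketText);
  const triage = await triageSpecialist(context);
  const review = await policyReviewer(triage);
  const reply = await replyWriter({ triage, review });
  return finalizeResult({ context, triage, review, reply });
}
```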
Okay. So, I've been doing this in part already: we started from the starting point, and now I'm going to do the specialist stages. Let's take a look at the next part of the README: git checkout the specialist-stages step. If I take a look now, the source folder has the individual functions, which are pieced out, and the prompts associated with each; they've been loosened up, so here, let's say, is the triage prompt we're using. And if I go down to the application, you can see it's done here: I'm using asynchronous functions to execute each stage.
Yeah. And similarly, if I do something like make ticket, maybe I'll do something a bit different. What's another classic problem? Oh: "I need to upgrade my plan from pro to enterprise, but the website is not working, getting 500 errors." So in this case, I'm in customer tier two, we're talking about billing here, and my account is account number three. So again, just to show you this is live; it's not doing something hardcoded in the system. Worth pointing out as well: this stage is slightly slower, because we've broken down the individual one-shot LLM call into sequential calls. So it is expected to take a little longer, but that's part of building out the agentic flow. And now you can see it's a bit more verbose, but there's a lot more thought in this agent. You can see we talked about a billing issue, and it believes the severity is quite high, because a customer wants to upgrade from a different tier and there's obviously a revenue impact. So in this case, we should escalate to the appropriate people on our side. You can see what the escalation reason should be, and both the internal and the customer-facing replies as well. We've also included a confidence score to say whether this is a true issue; and again, as mentioned, tool calls are able to pull information from across different systems to provide a greater level of confidence. Okay.
Perfect.
All right, with that in mind. So, we've now hopefully shown how we can take an agent, break it down, and build something multi-stage, introducing tool calling, to get us where we need. The next thing is to start tracing it, so we can actually see what's happening, down to the individual details. And this is where the observability piece comes into play.
What we want to do in this section is break down the full execution path. We know this stuff is very nested in structure: there are tool calls, and additional function calls behind them. We do want to track the key things; I think Mayan pointed out the early struggles they had around latency, cost, and token counts. Time to first token, especially, is a very important metric that we see many of our customers trying to track. Also: what were the inputs, what were the outputs, the metadata associated with them, plus additional fields so that, when we talk about monitoring and observability, we can query it within the UI on the fly and set up alerting if needed. So again: having an output is not enough. We need to understand the full execution path, and that's what tracing allows us to do.
Yeah, just to give you an idea of the context, and I know language choice is a contentious topic: I come from a background in full-stack development, so I sit on both sides of the coin, with Python and Go as well as TypeScript. Our SDKs are multilingual: Ruby, Go, I think even .NET; we've got some folks using it, and that's all good. I've just kept to TypeScript for simplicity today, but yes, our SDKs do cover a range of different languages that you can start tracing your application with.
So yeah, a real key thing here, and one of the challenges I do see with our customers broadly, is that they log each individual interaction as its own singular parent span, and that might even seem to work with, say, multi-turn conversational agents. What we want to do instead is trace everything into a nested structure, so you can see everything from one interaction, let's say one conversation, under one parent span. So it's really critical, as we start instrumenting our application, to make sure we're getting the right structures in place; otherwise, you're not going to be able to see the full picture of where things might go wrong in your application.
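A minimal sketch of that nesting with the Braintrust TypeScript SDK, assuming the project name used later in the demo; calls to `traced` made inside a traced callback attach as child spans automatically:

```typescript
// One ticket = one parent span; each stage is a child span nested inside it.
import { initLogger, traced } from "braintrust";

initLogger({ projectName: "helper-workshop" }); // reads BRAINTRUST_API_KEY from env

declare function runTriage(ticketText: string): Promise<unknown>; // stand-in stage

async function handleTicketTraced(ticketText: string) {
  return traced(
    async (span) => {
      // Nested traced() calls become children of this parent span.
      const triage = await traced(() => runTriage(ticketText), {
        name: "triage-specialist",
      });
      span.log({ input: ticketText, output: triage, metadata: { source: "demo" } });
      return triage;
    },
    { name: "handle-ticket" }
  );
}
```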
Okay. Again, this just comes down to reading the trace. For folks in the room, especially the software engineers using traditional observability tools, this is not too dissimilar. But for folks who may be new to this screen: a trace allows us to find out not just what has happened, but what's currently happening with your application, in real time. So again, we're tracking every single call, bringing in that metadata, and depending on where the failure mode is, we can identify what course of remediation we need to take.
Okay, so in the next stage, what we're going to do is add tracing to this, then actually run it, go into Braintrust, and hopefully see it happening. But there's one thing I want you to bear in mind before I do that. So, let's go to our README; it's telling us to add tracing. I think that's it; okay, we're going to keep it there.
Okay. So I've introduced a helper script here called tracing. What we do quite well within our SDK is avoid introducing complexity where it isn't needed. If you are using a standard LLM provider's SDK, you can simply wrap the client; we provide that out of the box. But when it comes down to the individual calls, we can set up some helper scripts to wrap things up as needed. If you are using something like Python, there's also a decorator function which helps. In this case, I've got a nice little function which handles parent and child spans. And when it comes down to the actual application tracing, you can see here I've got a child span which is executed throughout. I don't want to do too much code at this point, because the code is expanding pretty heavily; I just want to walk through the key concepts.
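The out-of-the-box wrapper mentioned here looks roughly like this; wrapping the client means every completion call is logged as a span without further code changes:

```typescript
// Wrapping the provider SDK client: every chat.completions.create call is then
// logged as a span automatically, with tokens, latency, and cost captured.
// (The Python SDK offers a @traced decorator for the same job.)
import OpenAI from "openai";
import { wrapOpenAI } from "braintrust";

const client = wrapOpenAI(new OpenAI());
// ...then use client.chat.completions.create(...) exactly as before.
```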
Okay. Before I run the application, I wanted to go over into the Braintrust UI. If you've already signed up for a free trial account (hopefully you did that earlier today), it may have created a test project; that's totally fine, we'll create a new project as we execute this. A really key thing, if you haven't done it already: in the top left-hand corner, go to your profile (okay, let me just create one; let's create that for now), then go to where it says API keys. Enter a name for the key, generate it, and then use it within your environment, your .env file, or secure it in a key vault if you have one. There's also the OpenAI key, which we're hoping you generated; that should also be entered under AI providers. I've set this at the organization level; you can also do it per project, depending on whether you want it segregated by a particular team or environment; that's fully supported. In this case, we'll need this key later, when it comes to managed mode and online scoring. So just a tidbit: that key you generated, make sure you put it into your Braintrust organization here, to be used later.
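If you're following along, the two keys end up in your .env; the variable names below are the SDKs' conventional defaults, but check the workshop repo's .env.example for the exact names:

```bash
# Conventional variable names; the repo's .env.example is authoritative.
BRAINTRUST_API_KEY=sk-...   # from your Braintrust profile > API keys
OPENAI_API_KEY=sk-...       # also add this under AI providers in the Braintrust UI
```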
Okay.
Right. With that in mind, I'm going to run the demo. And just as a point of reference: if you are ever stuck with the steps, just use make setup to make sure everything's in place. In this case, we want to do make demo, and we're going to run and execute those tickets. I could even do it using the make ticket command as well. Okay. So while it's running in the background: if you go back to your application, you should see there's a project called helper workshop; that's the one that will be created as part of this workshop today. If you navigate to the logs tab, you'll start to see the traces coming through in real time. It's worth pointing out, thanks to the capabilities of Brainstore, that we have near-instantaneous writes into our system, with reads available shortly after, and it's a non-blocking function. A lot of our customers, especially the more sophisticated ones, are really using Braintrust at scale and pushing the envelope. It's not a case of "hey, I've got a thousand traces"; they're pushing tens of millions of traces at a time across a very short period, and they want to be able to aggregate all of it. And as you build more sophistication into your application and send it out to more users, that's going to grow pretty quickly, and you need a system that can handle it at scale.
Once you start to see the logs coming in (they come in reverse order), I'm going to take a look at the first one here, and we'll see the instrumented application fully traced. There's a particular button here that lets you view this in full screen, which I think is quite helpful. As I mentioned, because we've got this nested structure in place, where we're going through each step in a sort of tree, the interaction at the top is the demo ticket. At the very top level, you can see how long the invocation took to run, the prompt tokens in and out, and the cost and latency associated with it. I've also included some metadata, because I want to be able to extract and filter on it as needed, and I'll show you how that works. Everything's available here to view; the metadata's here. If I want to take a different look at this, we can also look at individual steps. In this case, I want to look at what happened with the triage specialist, down to the actual invocation of the LLM. The SDK provides a lot of flexibility around this, so you can see what was put in, the reasoning behind it, the information we set as output, and then coming down to the last step: should we escalate or not? Another view we'll see as part of this is the timeline, which gives you a waterfall view so you can see whether a particular step is taking longer and whether we need to remediate; it's all possible.
All right. So, if I go back to the logs page and refresh: I think it was four tickets that were pushed, and they should come through right now. Okay, yeah, it's two tickets at the moment. So again, that same information we replayed is also available in the console. Okay. Yep, so we've covered tracing.
Let's talk about the evaluation portion. Okay, it's quite interesting, depending on where you are in your journey of building this application. A lot of customers have already built an application, have it monitored, and can pull that data in. But what happens when you have effectively a cold-start problem, where you don't yet know what you're building? So: what does good look like? And, effectively, what does "good, shipped" mean? In the case of the support application we're developing today, there are some non-negotiables: have we categorized the support case? Have we made sure that blocking issues don't end up with a missing or low severity? Does the escalation stay in policy with SLAs? Does the output structure look sound? And if we make any particular change, does it actually improve things without breaking anything or introducing a regression? We can check all of this using evaluations. An evaluation (hopefully many folks know what they are, but for those in the room who don't): think of it as having your dataset (your inputs), a task, and then an outcome you want to evaluate against via a scoring function. This is a little different from traditional software development when working with AI systems, because of their non-deterministic nature.
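In Braintrust's TypeScript SDK, that triad (data, task, scorers) maps directly onto the `Eval` entry point; this is a sketch with illustrative data and a toy deterministic scorer, not the workshop's actual eval file:

```typescript
// Dataset + task + scorers: the three pieces of an eval.
import { Eval } from "braintrust";

declare function handleTicket(input: string): Promise<{ category: string }>; // agent under test

Eval("helper-workshop", {
  data: () => [
    {
      input: "My password needs to be reset",
      expected: { category: "account_access" },
    },
  ],
  task: async (input) => handleTicket(input),
  scores: [
    // Deterministic scorer: did the agent pick the expected category?
    ({ output, expected }) => ({
      name: "category_match",
      score: output.category === expected?.category ? 1 : 0,
    }),
  ],
});
```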
Right, so in this particular portion, what I'm going to talk about is creating what we call a golden dataset. In this support application I've been testing anecdotally, but I want to create a set of edge cases that will give the business at least an initial level of confidence that what we're releasing into production is fit for purpose. There's always room for improvement, but I want it to be functional. I want to be able to say: look, it's not just me releasing this application based on vibes; I have a concrete way of showing how it has performed over time.
To do this, we use two main types of scoring functions. The first is deterministic. A lot of folks may have already started using these, coming from traditional software engineering; unit tests would be the analogy, and they are quite similar in nature: very easy to run, cost-effective, and you're not actually calling a model at this point. The second type, which is a little more sophisticated, uses an LLM as a judge, so another AI system. This is really helpful for systems where there's nuance that can't be determined by deterministic checks alone: branding style, whether it meets customer satisfaction, and so forth. The most important thing is: if you cannot write it in a deterministic way, you want to be using LLM-as-a-judge where possible. And why this matters, again, is simply making sure that any change we make is safe to ship as we progress. Okay.
So, I'm going to pivot into the IDE again. Let's take a look at the README file: git checkout the golden-dataset step. There we go. Okay. Then, based on this, make demo and make setup; we should be fine. Okay. One thing I'd like you to do as well is run the seed-dataset command: make seed-dataset. What this does is upload our evaluation test cases into the Braintrust UI. If I pivot into my datasets now, you'll see it's called the helper seed dataset. For simplicity, I've created 10 inputs, and I've also categorized them: the input, what we expect, and some metadata associated with each. That's the core structure for creating an evaluation.
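Roughly what a seed command like this does under the hood, sketched with the Braintrust SDK; the project and dataset names are taken from the demo and may differ in the repo:

```typescript
// Upload golden test cases (input, expected, metadata) to a Braintrust dataset.
import { initDataset } from "braintrust";

const dataset = initDataset({
  project: "helper-workshop",
  dataset: "helper-seed-dataset", // name as seen in the demo UI
});

dataset.insert({
  input: "Upgrade from pro to enterprise fails with 500 errors",
  expected: { category: "billing", escalate: true },
  metadata: { edgeCase: "revenue-impacting" },
});

await dataset.flush(); // ensure the rows are written before the script exits
```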
As part of this, for both the deterministic and non-deterministic sides, I've created some scoring functions which we'll be using; they're available here. So: checking the category; is the schema in place; is there an escalation reason when needed. Again, very easy to run and codify. And we have a bit more sophistication with the custom rubric: that's more of the LLM-as-a-judge use case here.
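Side by side, the two scorer styles might look like this; the deterministic check is a plain function, while the rubric uses the `autoevals` LLM-classifier helper (the prompt template, field names, and scorer names are illustrative):

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// Deterministic scorer: cheap, no model call.
const escalationReasonPresent = ({ output }: { output: any }) => ({
  name: "escalation_reason_present",
  // If we escalate, a reason must be attached; otherwise the check passes.
  score: output.escalate ? (output.escalationReason ? 1 : 0) : 1,
});

// LLM-as-a-judge scorer: for nuance (tone, brand style) rules can't capture.
const toneJudge = LLMClassifierFromTemplate({
  name: "customer_tone",
  promptTemplate:
    "Does this support reply sound helpful and on-brand?\n\nReply:\n{{output}}\n\nAnswer Y or N.",
  choiceScores: { Y: 1, N: 0 },
  useCoT: true, // ask the judge model to reason before answering
});
```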
All right. Then what we can do is go back to the README. We don't need to run the demo; we already pushed that through. Let's go ahead and do make eval. Okay, so this is going to run an evaluation here, against the seed data. Again, your data doesn't necessarily have to live in code; it could come from a database, but in this case I'll just use a flat JSON file, just to give you some context. So we're running this evaluation, and in the UI we should have a place for experiments. Just double-checking here. Yeah, so that experiment ran against all the test cases in the dataset. You should get an output like this, at least in the terminal. But if we go back to the UI, you can see we're starting to track our application against the inputs and outputs for this particular dataset.
It's quite similar to the tracing view you would have seen from the online traces — you get a very similar view here and can see how it works across your experimentation as well. And worth pointing out: as we progress with the workshop, we'll start to see how we improve the system using the different scoring functions and track that across the UI. If you need a bit more real estate, you can collapse the menu, which does help.
Okay — time check. Perfect. Okay.
Right, let's proceed with the next checkpoint, around deploying and managing this. As I mentioned — at least anecdotally, when I've spoken to customers — things work really well on my machine; I check that into code, and now I want to take it somewhere I can collaborate a little better: there's a point of reference, I've got versioning history, I can identify who's made what change, and there's a way to bring those users together. I think Usama talked about collaboration earlier. What's interesting as well: if changing the prompt means editing code on your machine and shipping it to a repository, then somebody non-technical — a product manager, perhaps — who wants to update the prompt can't do it. They have to tap you on the shoulder. Maybe that's happened to some of the folks in the room today. I know it's happened to me a few times before, and I understand it can get really frustrating. So we actually want a way to pull that together. A really key thing, especially for those who work in heavily regulated industries, is reproducibility. I've worked the better part of a decade in both banking and capital markets, and with the regulation out there — things like the right to be forgotten, understanding who made a change, or, in a stressed-exit scenario, how we'd move this into a new system — this is really key to unlocking that, and to identifying changes before you ship them. We don't want to just make changes, push them to production, and ask afterwards what happened. We need to be able to gate that properly.
And so for us, we're introducing some capabilities now. What you've been running so far — the tools, the prompts — has all been on your local machine. What we want to do is offload that capability into Braintrust, so that when your application is running in a secure environment, it can refer to Braintrust to pull that information and drive the path of execution, which again follows the tracing mechanisms we've set up. By default, when you're running the make commands, the runtime mode is set to local. If you want to use managed mode, just use the managed prefix, and the remaining make commands work the same way.
So what I want to do now is pivot into the IDE. Going back to my README — okay, tricky — we'll do a git checkout of the next branch. Just taking a look here, we'll do make setup. Right, and the key thing I want to emphasize at this point, because we're managing this in Braintrust, is to use the make setup-braintrust command here. What that's going to do is package up those scoring functions, those tools, those prompts, and push them onto the secured infrastructure.
You should get an output like this, and here's what that looks like in the UI — I'll just go back to the overview. On the left-hand side, if we take a look at prompts, you'll start to see the three prompts we created as part of that workflow. To give you an idea, if we go to the triage specialist: you can define a slug — essentially an immutable ID — which you can then refer to in code. This can also be generated if needed. The prompt is again treated as code, and we also use some interpolation if you want to parameterize it. I'll show what that looks like shortly.
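As a sketch of how code refers back to that slug and fills in the interpolated variables — the project name, slug, and the {{ticket}} variable here are illustrative:

```typescript
import { loadPrompt } from "braintrust";
import OpenAI from "openai";

const client = new OpenAI();

// Resolve the managed prompt by its immutable slug.
const prompt = await loadPrompt({
  projectName: "helpdesk-demo",
  slug: "triage-specialist",
});

// build() substitutes {{ticket}} and returns chat-completion parameters,
// including whatever model is currently configured in Braintrust.
const params = prompt.build({
  ticket: "CFO can't export invoices before the board meeting",
});
const completion = await client.chat.completions.create(params);
```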
If I go to the scoring functions — let's click on scores; it'll come up in a second — and take a look at parameters: I've created a parameter specifically to simplify changing the baseline model. Maybe a quick show of hands — these models get released so quickly — has anybody had a PM, or a non-engineering teammate, say, "hey, by the way, can I change the model, or the prompt, and see what that would look like?" Has that happened to you folks? I think we've got a few hands here. Great. So the good thing now, using the managed parameter, is that those non-technical teammates can come into Braintrust and change the prompt here, and write a comment: look, let's use a different model, something like 4o-mini, just to test a new model.
I'm going to save this version here and write the comment. And just to close the loop, I'm going to run the setup command again for this branch — every time you change something you can run it, but I'm just doing it here for brevity. That's more to keep the model in sync. So if I go to prompts, that's in sync, and if I take a look at the parameters in place — yeah, it should be there. Now what I'm going to do is run in managed runtime mode. So in this case, by running in managed mode, I'm changing the course of the execution: not running the model configuration locally, but following the path that Braintrust has set for the model. Okay.
And if everything goes well — you can see it now — I've changed the model here. Again, I didn't have to make any code changes. All I had to do was go into the UI, change the model I wanted (or any other parameter), run it, and have that establish a new baseline for the evaluations. Again, if you want, you can also run the demo script to push in the demo tickets; I'm just skipping that for the workshop today.
I do see this with many of our different customers — it's not necessarily a cause for concern — they say, "hey, by the way, Braintrust now has access control over these parameters." What I would say is that Braintrust is not really intended to replace that rigor. You probably still want to use things like version control systems to track those changes. What we're saying is that when it comes to operationalizing this, and making sure other users can work on a shared system, this is the recommended path we would take. So you would probably still keep your prompts, your tooling, your parameters in a centralized place, but then put automation in place to synchronize them. That's the best way we've seen customers take advantage of this.
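One pattern for that — a sketch rather than the workshop repo's exact setup — is defining prompts as code in version control with the Braintrust SDK and syncing them from CI; the names and slugs below are illustrative:

```typescript
import * as braintrust from "braintrust";

const project = braintrust.projects.create({ name: "helpdesk-demo" });

// The prompt lives in git; same slug and {{ticket}} interpolation as before.
project.prompts.create({
  name: "Triage Specialist",
  slug: "triage-specialist",
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: "You triage support tickets for urgency and category." },
    { role: "user", content: "{{ticket}}" },
  ],
});

// Then sync from CI with something like: npx braintrust push prompts.ts
```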
Okay. In this next portion we're going to talk about online scoring. Now that we have those evaluations in place, what we want to do is apply that scoring to actual live production logs coming through the application. It's great that we've run our test cases and have some level of confidence that it's working, but there's no substitute for production data, right? We all know this. So what we're going to do is move that logic into Braintrust and set up what we call automations, which track and evaluate the logs as they come in, in real time. Worth pointing out: when you start your journey, especially if you're using LLM-as-a-judge, you probably want to start with a higher sampling rate. As logs come in, you want to make sure you establish a baseline. But there's a trade-off: these calls can be quite expensive, especially if you're using more sophisticated models with a higher rate of reasoning. So once you're happy with the output, you really want to reduce that sampling rate down to 5 to 10%, so you're managing your cost effectively. Deterministic scores, again, are cheap — I recommend running them all the time.
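For context, the logs those automations score come from instrumenting production with the SDK's logger — a minimal sketch, with project name and ticket text assumed:

```typescript
import { initLogger, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

// Initializing the logger routes traces from the wrapped client to the
// project; online-scoring automations (and their sampling rates) then
// evaluate these logs as they arrive.
const logger = initLogger({ projectName: "helpdesk-demo" });
const client = wrapOpenAI(new OpenAI());

// Every production call through the wrapped client lands as a log.
const reply = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Customer ticket text here" }],
});
```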
>> Sorry, what was that question?
>> Yes — I'll bring that up in the UI and show you. Okay, if I go back into the IDE, I'll go into the README here. Oh — so that's why I messed up the managed tools. Let's just do a git checkout. Okay, there. And as I mentioned, the branches all build upon each other, so skipping one is totally fine. So: git checkout.
All right. Now if I do make setup, everything should be fine — then make setup-braintrust. In this case I'm actually pushing the tools as well, for production; again, this is just to help accelerate things where possible. Coming down — if I hit refresh — right, so scores: the scoring functions that you saw earlier in code are now managed in Braintrust, so you can see they're available here. The one I want to call out is the triage one: I've got an LLM-as-a-judge which has been applied to the output, and there's an automation rule in place called "quality online."
So what I've done here, again, is automate the setup as code, to bootstrap the project. But what we're able to do is say, look, depending on the rule, we can run the scoring against an individual span or the entire trace. And this is why metadata is so important: we may only want to score specific failures that happen within the code, depending on the use case. My sampling rate, as I mentioned, is set to 100, but for more expensive calls you'd want to taper that down. So this is what the automation looks like — how we would do it in Braintrust today.
>> Is it like packaging?
>> No — automations are more about execution against incoming logs. Though more holistically, yes, I've also applied automation to setting up this environment and scaffolding. Just want to delineate the two. Yeah.
>> Sorry, a question. What kind of things are you scoring?
>> Could you expand on that, please?
>> Yes — for these properties in production there's no ground truth. How can you score online when you don't have ground truth?
>> Yeah. Well, you'd probably want to take that as an edge case, push it into a dataset, identify the expected behavior, and then feed it back. That's the approach I would take if you don't have any ground truth already. This is why I said it depends where you start: it's better to have some kind of data and then begin your flywheel from that.
>> So once you have some data, how do you apply it to the judge?
>> Oh, in that case we can put it into a dataset and then replay it through the playground. That's the way we would do it. Yeah.
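Sketching that flywheel in code — the captured fields and the human-supplied label here are illustrative:

```typescript
import { initDataset } from "braintrust";

const edgeCases = initDataset("helpdesk-demo", { dataset: "edge-cases" });

// Take the flagged production log (no ground truth yet), attach a
// human-reviewed expected output, and it becomes a regression case
// you can replay in the playground or an eval.
edgeCases.insert({
  input: "Not urgent, but our CFO can't export invoices before the board meeting",
  expected: { category: "billing", escalationReason: "executive blocker" },
  metadata: { source: "production", reviewed: true },
});
await edgeCases.flush();
```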
All right. So we've checked off online scoring — conscious of time. Now to the remediation portion; this is hopefully where you see the delta, and again why we're here today, and hopefully it helps with that earlier question. Here's a plausible input to our agentic system: a user might say, "hey, this isn't urgent, but our CFO can't export the invoices before the board meeting." The model says, look, this looks okay to me — the user said it's not urgent. But the business reality is very different, right? This probably does need immediate attention — it's your CFO, and I'm sure there's an end-of-quarter report that needs to be done. And this is the difference in what we're doing here today: trying to identify what a proper failure mode is and then remediate it where possible. In this case I can run this particular failure mode — I've got it in this dataset. It's essentially a replay scenario: we replay the failure, run a specific evaluation against it, tighten the prompt, run it again, and see what that looks like. And we probably want to do this not just against the one failing test case but across our entire test suite, just to see that the fix works as intended and we haven't regressed on something else.
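A sketch of that regression run, pulling the managed dataset so the whole suite executes rather than the single failing ticket — the names, as before, are assumptions:

```typescript
import { Eval, initDataset } from "braintrust";
import { categoryMatch, toneJudge } from "./scorers"; // earlier sketch

// Hypothetical agent entrypoint, now running the remediated prompt.
declare function runTriageWorkflow(input: string): Promise<unknown>;

Eval("helpdesk-demo", {
  // Managed dataset: the captured failure cases plus the seed set.
  data: initDataset("helpdesk-demo", { dataset: "edge-cases" }),
  task: async (input) => runTriageWorkflow(input as string),
  scores: [categoryMatch, toneJudge],
});
```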
Okay. We have two separate branches for this, so I'll split them out: the first has the failure mode baked in, and we'll go through the UI and view that; then we'll talk about the remediation part as well. Okay, next step. We can run this in managed runtime mode and do make replay-failure. So I've got the set of five cases, which represent the regression of failure modes for this case, and that first one you saw in the example ticket is captured as a JSON file here.
So if I take a look at the failure mode here, you can drill into it. And you'll notice as well — now that we've set up the managed tools and the online scoring — that the trace becomes even more sophisticated, because we're executing against the secure Braintrust environment. So again, moving from local to managed. I'm just going to go here to the Makefile — sorry, package.json — and find the scenario here.
Let's run an evaluation here — I'm just running an evaluation against a specific scenario. All right.
>> Do you also want to show the monitoring — latency and things like that?
>> Okay.
Yep. As we progress with the application, the experiment viewer lets us see the progress of our changes. You can see we've run the latest set and noticed some degradation here, which lets us track it a little better. So with the managed dataset I had in mind, the failure rates are being captured, and I have the ability to compare against existing experiments, track the progress, and remediate where possible.
So, forgive me — next I'm going to go to the README and proceed with the remediation. Okay: git checkout, then make setup.
Okay. In this remediation I've changed the prompt. One way to view that is to take a look at the prompt and the change — you'll see the differences there.
Okay, come down to the README file here. One second — let's run the setup-braintrust command.
>> Yeah, there's a specific flag, I think. Don't know if you want to double-check that.
Yeah, that's it. If you want, just put the if-exists replace env variable in the chat.
>> Yeah. No, no, it's fine.
So yeah, folks: as part of the remediation script, if you set the BRAINTRUST_IF_EXISTS environment variable to replace, it's going to push the updated changes into the environment.
Okay. And you'll see now that we've updated the prompt via code and pushed it up, it's available here as well. In the UI, as part of the operationalization, we can see not just who changed what, but what exactly was changed in that particular prompt — in this case, take a look here: we're including tools. Let me clean that out.
Okay, coming back here, let's run this evaluation again. We made the change to the prompt, and now we're running the remediated version to see how it performs.
Perfect. Okay, so that ran the experiment with the new changes, hopefully. Pivoting into the UI, I'm going to go to experiments here — it's barely there, but you can see the score tick back up; any improvement going up is an improvement. Just to give you an idea: now that we've done the evaluation, what I can actually do is a diff, and we can compare the delta. So yeah — the score has come back up, improved over time, in the intended direction. Okay.
So, I think we're approaching the end of the content. I know it's been a number of steps, so I really want to thank you for your time and attention in walking through it. As mentioned, the artifacts are public; we've got the cheat sheet there, and we've got the Slack channel to help if you have any questions. To give you a summary of what you've accomplished today, in order: you went from a single-shot prompt to building a five-stage agentic AI workflow using tool calls. We were then able to inspect how it works by adding Braintrust tracing, making sure everything is recorded. We then looked at evaluating the system offline, creating effectively a golden set of test cases to execute against. We deployed those managed prompts, tools, and parameters into the Braintrust secure architecture, added online scoring to evaluate the system as it runs, and then picked a particular production failure, looked at the trace, modified the prompt, and saw the delta; running the evaluation again, we saw the score climb back up to where it needed to be — completing the full cycle of building, observing, deploying, and iterating going forward.
So, to bring it home: hopefully this is not an uncommon story, but what works in a prototype is not necessarily going to work in production. We really need to break the system down, identify the failure modes, and move forward. And that's where breaking things into explicit stages becomes really, really important, right? Yes, this introduces more places where things could go wrong, but it's easier to debug when they do. There's no substitute for diving into the code and tracing everything — I would say observability is table stakes at this point. If you've got a production application and you're not tracing it, you need to go back to the drawing board and get that in place operationally. And hopefully we've shown you how to do that using Braintrust. Again, there's no substitute for production logs, but it's better to start somewhere: if you have an idea of what issues might happen, those failure modes are the perfect seeds for an evaluation, used as your test cases. And as I mentioned, this is a continuous process — nothing's ever done. If you've ever worked in agile development, constant feedback is important, and we're bringing that operating model to a newer surface of operation.
And to hopefully bring it home to your teams today: my encouragement, if this is of interest, is to pick something that's already operational today. It doesn't have to be the entire suite — maybe start with something a bit more critical that you really want to improve. Add the tracing, collect your edge cases from those failure modes, build those deterministic scores, and then route everything back where possible. The faster your feedback loop, the more insight you have, and the more you can improve the overall delivery and operation of your system.
And just a call to action. I know we've thrown a lot of content at you, and we'll obviously try to get a bit of feedback — we're putting some tables in place, so I appreciate everyone juggling everything to do this. But as mentioned, you have to start somewhere; let's just try to accelerate you. We have a list of documentation provided, and you can also use our AI agent to search the docs if you have any questions. We also have a cookbook available — I tend to throw that cookbook directly into Cursor, Codex, or whatever and say, "based on this, take the SDK and start tracing my application," and it does it pretty effectively. We've actually announced a CLI for Braintrust applications, which even lets you do things like auto-instrumentation. I'm just going to plug my colleague Eric, who's doing some fantastic work — please check out his booth, because it's amazing. And if this is something that interests you and you want to explore more, please reach out to your account team at Braintrust; we're happy to support where possible. And if you're on Discord, feel free to join — happy to answer questions there.
And yeah — again, I really want to thank you for your time, attention, and energy. I know it's a really sunny day and I don't want to keep people in here; I want you to get some fresh air. But on behalf of Braintrust, Trainline, and myself: thank you so much for your time and attention. It's been an honor, and I look forward to seeing you out there tracing and gaining value from delivering AI in production. Thank you.