From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval

Channel: aiDotEngineer

Published at: 2025-07-31

YouTube video id: kDczF4wBh8s

Source: https://www.youtube.com/watch?v=kDczF4wBh8s

Hey everyone, I'm Brooke. I'm the founder of Coval, and we're building evals for voice agents. Today I'm going to talk about what we can learn from self-driving when building evals for voice agents. My background is at Waymo, where I led the eval job infrastructure team responsible for the developer tools for launching and running simulations. Now we're taking a lot of the learnings from self-driving and robotics and applying them to voice agents.
But first, why are voice agents not everywhere? They hold this massive promise of autonomously automating really critical, hard workflows, and a lot of you building with voice agents know how amazing they can be. I think the biggest blocker to launching voice agents is trust. We are, paradoxically, overestimating voice agents, trying to say "I'm going to automate all of my call volume or all of my workflows with voice, all at once," while at the same time underestimating what they're capable of today, if you scope a smaller problem well, and what they could possibly do in the next six months. There's so much more room for magical voice experiences.
Conversational agents are incredibly capable, but scaling them to production is really hard. It's easy to nail 10 conversations; doing it for 10,000 or 100,000 becomes really difficult. So voice agents often get stuck in POC hell, where enterprises are scared to deploy them to customer-facing or other non-internal workflows.
There are two common approaches to dealing with that today. The first is conservative but deterministic: you force the agent down a specific path and try to get it to do something exactly as you want. This is basically an expensive IVR tree; you're using an LLM, but you're essentially forcing it into fixed pathways. The second is to make agents much more autonomous and flexible to scenarios they've never seen before. But that makes it really hard to scale to production, because the agents are so unpredictable, and having them take actions on a user's behalf, or interface with users at all, becomes risky.
I think this is a false choice: you can have both reliability and autonomy. How many of you have taken a Waymo? If you haven't and you're from out of town, you definitely should; it is so magical. So how did it become so magical? It's reliable and smooth, yet it can navigate interactions it has never seen before, go down streets it has never seen before, and Waymo is launching in new cities very quickly.
I'm biased, but I think large-scale simulation has been the huge unlock for self-driving and robotics. We started with more manual evals: driving the car through the streets, noting where things didn't go well, and bringing that back to the engineers. That's obviously very hard to scale. Next came specific tests that say, for this exact scenario, we expect these things to happen. But those are very brittle: scenarios tend to stop being useful after a short time, and they're expensive to maintain, because you have to build up complicated scenarios and then spell out exactly what should happen in them. So the industry as a whole has moved to large-scale evaluation: how often does a certain type of event happen across many, many simulations? Instead of saying "for this specific instance I want this to happen," you run large-scale simulation to reliably show how the agent is performing.

So I'm going to talk through some of the things I learned from self-driving and how they apply to voice. Hopefully that's useful; it's definitely been useful for us as we interact with hundreds of voice systems. What's the similarity between the two? Self-driving evals and conversational evals are very alike because both are systems interacting with the real world, where at every step you have to respond to the environment and go back and forth.
Simulation is really important for this. For every step I take in a Waymo, a self-driving car, a smaller robotics device, or a conversation, the next state depends on the response: when I say "Hello, what's your name?" you'll respond differently than when I say "Hello, what's your email?" Being able to simulate all of these possible scenarios matters, because otherwise you have to create static tests or evaluate manually, which are respectively brittle and expensive. Simulation also makes for durable tests: if you have to specifically outline every single step along the way, those tests break immediately and are very expensive to maintain. And lastly, coverage: you really want to simulate all the possible scenarios across a very large space. The non-determinism of LLMs is actually useful here, because you can generate all the possible things someone might say back, simulate the scenario over and over, and look at the probability of your agent succeeding.
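To make that concrete, here's a minimal sketch in Python of re-running one scenario many times against an LLM-simulated user and estimating the success probability. `call_agent`, `call_simulated_user`, and `succeeded` are hypothetical hooks, not Coval's API.

```python
# A minimal sketch, assuming hypothetical hooks into your agent and simulator.
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)

def run_conversation(call_agent: Callable[[List[Turn]], str],
                     call_simulated_user: Callable[[List[Turn]], str],
                     max_turns: int = 10) -> List[Turn]:
    # Alternate turns: the simulated user speaks, the agent under test replies.
    history: List[Turn] = []
    for _ in range(max_turns):
        history.append(("user", call_simulated_user(history)))
        history.append(("agent", call_agent(history)))
    return history

def success_probability(n_runs: int,
                        call_agent: Callable[[List[Turn]], str],
                        call_simulated_user: Callable[[List[Turn]], str],
                        succeeded: Callable[[List[Turn]], bool]) -> float:
    # The simulated user's sampling non-determinism varies each run, so the
    # same scenario explores many plausible user behaviors.
    wins = sum(succeeded(run_conversation(call_agent, call_simulated_user))
               for _ in range(n_runs))
    return wins / n_runs
```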
Another thing I touched on briefly is input/output evals versus probabilistic evals. With LLMs to date, the pattern has been: you run a set of inputs through your prompt, look at the outputs, and evaluate whether the output for each input was correct against some criteria. You might have a golden dataset you iterate against. With conversational evals, it becomes much more important to have reference-free evaluation, where you don't have to enumerate everything you expect for each exact input. Instead, you define metrics over the conversation as a whole: how often is my agent resolving the user's inquiry? How often is my agent repeating itself? How often is my agent saying things it shouldn't? Rather than saying "for this specific scenario, these six things should happen." This is what will really let you scale your evals, and it's also what we did at Waymo: coming up with metrics that apply to lots of scenarios.
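As a toy illustration of a reference-free metric, here's a sketch that scores how often an agent repeats itself across a conversation, with no golden transcript required; the similarity threshold is an assumption you'd tune against real data.

```python
# A toy reference-free metric: how often does the agent repeat itself?
# No golden answer per input is needed; it scores any conversation as a whole.
from difflib import SequenceMatcher

def repetition_rate(agent_turns: list[str], threshold: float = 0.9) -> float:
    """Fraction of agent turns that are near-duplicates of an earlier turn.

    The 0.9 similarity threshold is an assumption you would tune by eye
    against labeled conversations.
    """
    if not agent_turns:
        return 0.0
    repeats = sum(
        any(SequenceMatcher(None, turn, prev).ratio() >= threshold
            for prev in agent_turns[:i])
        for i, turn in enumerate(agent_turns)
    )
    return repeats / len(agent_turns)
```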
Another lesson: constant eval loops are what made autonomous vehicles scalable, and they're what's going to make voice agents scalable. We're seeing today that voice agents are very expensive to maintain once you deploy to an enterprise; it often turns into a professional service if you don't set up your processes right, and you end up constantly making tweaks for specific enterprises, which can take up 80% of your time even after the initial agent is set up.

Here's what the autonomous vehicle industry does. Say you find a bug. As an engineer, I run a couple of evals to reproduce it, I fix the issue, and then I run more: the car wasn't stopping at a stop sign, I iterate, and now it is stopping at the stop sign. But then I run a larger regression set, because maybe I just made the car stop every 10 seconds and broke everything else. Then we have a set of pre-submit and post-submit CI/CD workflows, so that before you ship code, and after you push code to production, everything is continuously verified. Then there's large-scale release evaluation, making sure everything is up to par before launching a new release, which might be manual evals, automated evals, or some combination of the two. And finally, live monitoring and detection, which you feed back into the whole system.
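A minimal sketch of what such a pre-submit gate could look like; the 95% pass-rate bar and the `run_scenario` hook are illustrative assumptions, not a description of Waymo's or Coval's tooling.

```python
# A sketch of a pre-submit gate: run a fixed regression suite of simulated
# scenarios and block the change if the pass rate drops below an agreed bar.
import sys
from typing import Callable

def regression_gate(suite: list[dict],
                    run_scenario: Callable[[dict], bool],
                    min_pass_rate: float = 0.95) -> None:
    # run_scenario plays one scenario against the candidate build and
    # returns True if all of its metrics pass.
    passed = sum(run_scenario(s) for s in suite)
    rate = passed / len(suite)
    print(f"regression suite: {passed}/{len(suite)} passed ({rate:.1%})")
    if rate < min_pass_rate:
        sys.exit(1)  # block the submit: the fix may have broken something else
```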
We're emulating a lot of this with voice, because we think it's the right approach there too. Notably, manual evals are still involved. The goal is not to automate all evals, but to leverage automated evals for speed and scale, and then spend your manual time on the judgment calls that genuinely need a human touch.

The process we've seen works like this. You start with simulated conversations and run some happy paths: I know I should be able to book an appointment, so book an appointment for tomorrow. I run a bunch of simulations of that and I run evals. I come up with metrics by looking through all of those conversations; looking at your data is super important. I look at the conversations and say, these are the ways they're failing, so I set up some automated metrics, and I iterate through this loop several times. Now I think it's ready for production, so I ship it and run those evals again. This virtuous cycle, iterating through simulation, flagging things for human review, and feeding all of that back into your simulations, is what makes voice agents scalable.
So what level of realism is actually needed? A question we get a lot is, "Do your simulated voices sound exactly like my customers?" It's a good question, because the level of realism you need depends on what you're trying to test. Like any scientific method, you control variables and test for the things you care about. Something we saw in self-driving is that there's a hierarchy here: you might not need to simulate everything in order to get representative feedback on how your system is doing. For example, hyperrealistic simulations come out all the time that look like real video, and people say that simulation system must be amazing. That's not necessarily true, because what you want from a simulation system is control over which parts of the system you're simulating and which inputs are needed. You might just need to know "this is a dog, this is a cat, this is a person walking across the street," and then ask what the car should do next given those inputs.

The same holds for voice. For workflows, tool calls, and instruction following, you don't need to simulate voice at all: you may eventually want end-to-end tests with voice, but while you're iterating, doing everything in text is probably the fastest and cheapest approach. For interruptions, latency, or instructed pauses, simple, basic voices are sufficient, because you're doing voice-to-voice testing but accents or background noise don't affect those results much. Where you do need hyperrealistic voices, with different accents, background noises, audio quality, and so on, is when you're testing exactly those factors in production and trying to recreate those issues. Thinking through the base level of components you really need is super important for building a good eval strategy.
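One way to encode that hierarchy is an explicit map from the thing under test to the cheapest simulation fidelity that can answer it. This sketch is just our reading of the tiers above, not a prescribed taxonomy.

```python
# Illustrative only: pick the cheapest fidelity that answers the question.
from enum import Enum

class Fidelity(Enum):
    TEXT_ONLY = "text"            # fastest and cheapest; no audio at all
    BASIC_VOICE = "basic_voice"   # simple TTS voices over a real audio path
    HYPERREAL = "hyperreal"       # accents, background noise, audio quality

FIDELITY_FOR_TEST = {
    "workflows": Fidelity.TEXT_ONLY,
    "tool_calls": Fidelity.TEXT_ONLY,
    "instruction_following": Fidelity.TEXT_ONLY,
    "interruptions": Fidelity.BASIC_VOICE,
    "latency": Fidelity.BASIC_VOICE,
    "instructed_pauses": Fidelity.BASIC_VOICE,
    "accent_robustness": Fidelity.HYPERREAL,
    "background_noise": Fidelity.HYPERREAL,
    "audio_quality": Fidelity.HYPERREAL,
}
```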
Another great tactic we've learned is denoising. You run a bunch of evals and find one that failed. Something really important about agents is that a single failure isn't necessarily the end of the world; what you want to know is the overall probability of that failure. So you take the failing scenario and re-simulate it, maybe a hundred times. Is it failing 50 out of 100 times, a coin flip? Is it failing 99 out of 100 times, meaning it essentially always fails? Or does it fail once in 100 times, which might be totally acceptable for your application? In the same way that cloud infrastructure asks whether you're shooting for six nines of reliability, it's important in voice AI to decide what reliability you're aiming for in different parts of your product.
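Here's a minimal sketch of that re-simulation loop, with a rough confidence interval so a single lucky or unlucky run doesn't mislead you; `resimulate` is a hypothetical hook into your simulator.

```python
# A sketch of denoising: re-simulate one failing scenario many times and
# estimate the true failure rate rather than trusting a one-off outcome.
import math
from typing import Callable

def failure_rate(scenario: dict,
                 resimulate: Callable[[dict], bool],  # True = this run failed
                 n: int = 100) -> tuple[float, float]:
    failures = sum(resimulate(scenario) for _ in range(n))
    p = failures / n
    # Normal-approximation 95% interval; rough, but fine for a read at n=100.
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, half_width

# p ~ 0.50 -> a coin flip; p ~ 0.99 -> essentially always failing;
# p ~ 0.01 may be acceptable, depending on the reliability bar you set.
```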
Now I want to talk a bit about how to build an eval strategy, because we believe evals are a key part of your product development process, not just an engineering best practice. They're a core part of thinking through what your product does.

Choosing which metrics to use is really thinking through what your product does and what you want it to be good at. You can build a general voice model that's kind of good at everything, but that already exists: you can use the OpenAI APIs and all of the end-to-end voice systems that are generally useful. You're more likely building a vertical agent that you're trying to make excellent at something in particular. So deciding what you want it to do well, and what you don't care about, is a really important part of the process.

And it's not just about latency; it's about interruptions, workflows, instruction following. Your voice application might not actually be latency sensitive, because someone who really wants a refund will wait; but if you're doing outbound sales, latency is super important, because that person is about to hang up the phone. How strictly you adhere to instructions matters enormously for some applications: if you're booking an appointment and don't capture all the details, the call is useless. But if you're an interviewer or a therapist, you might tune more for conversational workflows. So really think through what you're trying to measure; these are the five things we see the most.
These are the five things that we see
the most. But, um, LLM as a judge is a
really powerful way of being really
flexible and you can build out evalu.
But something we get a lot is how do you
trust LLM as a judge? It's this magical
thing that can be so flexible to so many
cases and also it is very um you know
can be very noisy but I think the common
patterns that we see with LLM as a judge
is that you say was this conversation
successful that's a pretty that's going
to be a really noisy metric um you might
run that 10 times for the same
conversation will come back um with lots
of different responses so in Koval we
have this metric studio that we think is
really like um pretty different from
anything out there because it allows
allows you to iterate on this human
incorporate human feedback into really
correlate calibrating your metrics with
human feedback. So you can iterate over
and over until your automated metrics
are aligning with human feedback. And
now you have the confidence to go deploy
those in production and run them over
10,000 conversations instead of the
hundred that you labeled or the 10 that
you labeled um to to get that
confidence. So really, I think putting
in the time to saying this is the level
of reliability that we're looking for
and being thoughtful of that. Maybe just
labeling 10 conversations is important
to you. Um or maybe you really want to
dial this in. Um but using this workflow
can be really powerful.
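A sketch of what that calibration loop might look like, assuming a hypothetical `llm_judge` wrapper around your judge prompt (this is not Coval's API); the point is to measure agreement with human labels before trusting the judge at scale.

```python
# A sketch of judge calibration: measure how often the judge agrees with
# your human labels, tighten the judge prompt, and re-measure.
from typing import Callable

def judge_agreement(conversations: list[str],
                    human_labels: list[bool],
                    llm_judge: Callable[[str], bool]) -> float:
    verdicts = [llm_judge(c) for c in conversations]
    agree = sum(v == h for v, h in zip(verdicts, human_labels))
    return agree / len(conversations)

# Iterate: score agreement on the 10-100 conversations you hand-labeled,
# split a vague "was this successful?" prompt into narrower questions, and
# re-measure. Deploy to 10,000 conversations only once agreement is high.
```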
Our other advice for approaching voice AI evals is to work through a progression. Start by reviewing public benchmarks, which serve as a rough dial for the direction you want to go. Then benchmark with your own specific data: if you're a medical company, use the medical terms you'll actually encounter in production to test out different transcription methods and so on. Then run task-based evals, which might be text-only or exercise very specific, smaller modules of your system; again, as I said about self-driving, you don't need to enable every module on the car to test the one thing you're trying to test. And then run end-to-end evals, where you exercise everything at scale, the way it would run in production.
We've done a lot of benchmarking; you should check out the benchmarks on our website, where we try to continuously benchmark the latest models. But doing your own custom benchmarking is also really important. Through Coval you can run on your specific data, and you can also do this yourself; I just happen to have a tool that does it. You might prefer different voices for the kinds of conversations you're having, or different LLMs for your specific tasks. Benchmarking each part of your voice stack is really helpful for choosing models, especially because voice involves so many models. Then build out your task evals to get a sense of baseline performance: where are the problem areas in your voice application? Where are things working? Where could they be better?
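As one concrete example of custom benchmarking, here's a sketch of comparing transcription models by word error rate over in-domain phrases; `transcribe` and the model names are hypothetical placeholders.

```python
# A sketch of custom transcription benchmarking on in-domain phrases
# (e.g. medical vocabulary), using word error rate.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    row = list(range(len(hyp) + 1))  # one rolling row of the DP table
    for i, r in enumerate(ref, 1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            diag, row[j] = row[j], min(row[j] + 1,       # deletion
                                       row[j - 1] + 1,   # insertion
                                       diag + (r != h))  # substitution
    return row[-1] / max(len(ref), 1)

# Hypothetical usage: compare models on your own domain test set.
# for model in ("stt_a", "stt_b"):
#     scores = [word_error_rate(ref, transcribe(model, audio))
#               for ref, audio in domain_test_set]
#     print(model, sum(scores) / len(scores))
```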
Then create an eval process. What kinds of continuous monitoring are we doing? When we find a bug in production from a customer, who takes it, and which test sets does it go into so we can make sure it doesn't happen again? How do we set up our hierarchy of test sets? Do we have test sets for specific customers we care a lot about, for the types of customers we have, or for specific workflows and features of our voice agent? Then create dashboards and processes so you can check in on those things continuously. I think this is an underestimated piece of the process: defining a continuous eval process, rather than just asking "does the voice agent work when I deploy it to this one customer during their pilot period?"
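One possible shape for that hierarchy, sketched as a small registry; the customer, workflow, and feature names here are illustrative, not prescribed.

```python
# A sketch of a test-set hierarchy, so a production bug has an obvious home
# and gets re-checked on every release.
from dataclasses import dataclass, field

@dataclass
class TestSet:
    name: str
    tags: set[str] = field(default_factory=set)   # customer/workflow/feature
    scenarios: list[dict] = field(default_factory=list)

REGISTRY = [
    TestSet("acme_pilot", tags={"customer:acme"}),
    TestSet("appointment_booking", tags={"workflow:booking"}),
    TestSet("interruption_handling", tags={"feature:interruptions"}),
]

def file_production_bug(scenario: dict, tags: set[str]) -> None:
    # Route a reproduced production failure into every matching test set so
    # the continuous eval process guards against that regression from then on.
    for ts in REGISTRY:
        if ts.tags & tags:
            ts.scenarios.append(scenario)
```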
I'm always happy to talk more about what we've seen across the many voice systems we've worked with. But one of the reasons we're so excited about the future of voice (and I think Quinn stole a little bit of this, but Quinn, I really do think voice is the next platform): we had web, we had mobile, and both were huge platform shifts in what companies let you do and where in your daily life they meet you as a user. Voice unlocks all of these new, really natural experiences. That doesn't mean everything should be done via voice, but there's really exciting potential there. In the next three years, we think every enterprise is going to launch a voice experience. It will be like mobile apps: an airline without a good voice experience will be like an airline without a good mobile app; it will just be a baseline expectation. And users' expectations of what a really amazing, magical voice AI experience looks like are only going to increase over the next few years.
We really want to enable this future, and we think the next generation of scalable voice AI will be built with integrated evals, using Coval. And we're hiring; we're always looking for people to join us. I think this is one of the most technically interesting fields I've ever worked in, because you get to work with every model across the stack, and there are so many different types of models, so many different types of problems, scalability challenges, and new frontiers of infrastructure, and no one knows any of the answers yet. So it's a really exciting space. Thanks so much, everyone.