Shipping an Enterprise Voice AI Agent in 100 Days - Peter Bar, Intercom Fin

Channel: aiDotEngineer

Published at: 2025-07-18

YouTube video id: HOYLZ7IVgJo

Source: https://www.youtube.com/watch?v=HOYLZ7IVgJo

Today I'm going to be talking about Fin Voice. Fin Voice is a voice agent for phone support, and we designed it to be a frontline teammate for inbound calls. So it picks up the phone, answers customers' questions, and then escalates to a human agent when needed. We built this experience in about 100 days. In this talk, I'll share what it took to get there, and I'll also talk about why I believe voice is the next big frontier in AI for customer service.
So first, a little bit of context on my company, Intercom. We're a customer service platform and also an AI agent company. You might be familiar with us because of our Messenger product; you might have seen it in a mobile app or on a website. It's been our foundation for years, but we've evolved over time. We became a complete customer service platform a few years ago, adding robust tooling for other channels like email, WhatsApp, and phone.
And then two years ago, right after the launch of GPT-4, we launched Fin, which is an AI agent for text, on chat. Fin's growth has been incredible: we have over 5,000 customers, and in terms of performance it's reaching an average resolution rate of 56%, and for some customers 70-80%. This is defined as the percentage of interactions handled by Fin that are resolved without human intervention.
And Fin is also a full system for continuous optimization. So it's not just the agent, but also tooling for analyzing conversations, training the behavior of the agent, and testing and deploying changes.
But up until now we didn't have the voice channel, and that's what we're changing with Fin Voice: it's the same system, but now it can answer phone calls.
A few thoughts on why voice, and why we're investing in this channel. For a lot of users, voice is simply the preferred way to get help. When an issue is urgent or sensitive, they don't necessarily want to type; they just want to talk. And if you look at some of the top-level data, over 80% of support teams still use phone support, and if you think about all customer service interactions globally, over one-third of them are happening over the phone. So it's not a legacy channel that's going away; it's still widely used, and it's also quite costly. The average cost of handling a phone call in the US with human support is between about $7 and $12. With voice AI agents, it can be at least five times cheaper.
And a few more benefits of voice AI in customer service. First, availability: 24/7 support, so you can call your bank on the weekend. No wait time, so it's instantly available; there's no need to stay on hold in a queue. No IVR menus, no need to press one for support or press two for payments, because everything happens via natural speech. It's also multilingual: AI agents can support 30, 40, or more languages, which is obviously better for users. And on the business side, there are major cost savings and also scalability: as the business grows, or when you need to handle peak times, AI agents are much better suited for that.
So, how we built Fin Voice. Over the next few minutes I want to cover seven main areas that had the biggest impact on how Fin Voice came about. I'll try to be practical, focusing on some of the product decisions we made and the challenges we faced. The first one is the use case, the starting point for voice; then the scope of our MVP, the tech stack behind it, how we approached conversation design, how we integrated it with support teams, and how we thought about evaluation and pricing.
Starting with the use case: if you look at a lot of the voice AI startups in the space, they typically start with a narrow problem space, like scheduling a dentist appointment or booking a table at a restaurant. We looked at some of those options, but eventually decided to go for a more flexible, knowledge-based agent: an agent that can answer help article questions like "what are your pricing plans" or "what's your returns policy". Why did we decide to go this way? First, we had strong evidence from chat. Fin over chat has been handling those kinds of conversations for years, and our customers have consistently told us that they see the same types of issues over the phone as they see on chat. We also validated this through extra analysis of call transcripts, which confirmed that a very large percentage of all queries could be solved with the knowledge base, with the help article content, rather than, say, with API integrations.
We were also thinking about the initial wedge use case: what's the lowest-risk way for companies to integrate voice agents? We looked at in-office-hours and out-of-office-hours use cases, and we pitched out-of-office hours as the initial wedge, because it essentially allows the team to try out the technology without affecting their main workflows, build up confidence over time, and later deploy it during their main office hours. Out of office hours, it just replaces their voicemail experience.
There were also a few other use cases we looked at: authentication, so verifying user identity on another channel; info gathering, so the agent collecting things like an order ID or account ID; and smart routing to the right team. These use cases are still very high leverage, because they can save a lot of time for support agents, but they don't necessarily solve the issue end to end. So we're still focusing on those, but they weren't the primary use case for this initial version of the product.
Now moving on to what we shipped first. When we started, the biggest challenge was to ship something meaningful as soon as possible. We had access to a lot of customers, because we already had thousands of customers using our native phone support product, so it was mostly about how we could test as quickly as possible. We focused on three main experiences: testing, deploying, and monitoring the agent's behavior. First, testing: this is what we call the Fin Voice playground. It was a lightweight test environment for customer service managers to go in, simulate a few sessions, ask questions based on the knowledge base, get the answers, and get an idea of how the product actually works. We shipped this within roughly the first four weeks of the project, so it was the fastest possible way to get feedback from customer service managers on how it was actually performing, and we could optimize based not just on our internal views but also on customer feedback.
Then the deploy experience: this allowed customer service managers to actually deploy it on their phone lines, and it included some basic configuration of the agent's behavior and of how it should interact with the customer service team's workflows. And lastly, observability, or monitoring: we really wanted to provide some visibility into what's actually happening on those calls with an AI agent. So we built experiences that showed the transcripts, the recordings, the transcript summaries, and the call outcomes to customer service agents.
Cool. Now moving to the tech stack. I'm not going to go through the technical details of everything; I'm sure there will be a few more talks on this that you might attend as well. But I'll mention some core components of Fin Voice. There's the main chained loop for the voice agent: STT, LLM, TTS. Speech-to-text converts speech into text, the LLM generates the response, and text-to-speech converts the text back into audio. There's also another approach with voice-to-voice models, where everything is processed directly as audio, skipping the text layer entirely. The voice-to-voice approach has the benefit of potentially faster or more natural-sounding speech, but it also gives you less control over the output. In our approach, we started with the Realtime API by itself from the get-go, which allowed us to test very quickly. Eventually we evolved our stack, but we're still using the Realtime API as part of the core architecture.
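To make the chained approach concrete, here's a minimal sketch of one conversational turn through that loop. The three provider calls are hypothetical stubs, not Fin's actual implementation; in practice each would wrap your speech-to-text, LLM, and text-to-speech vendors.

```python
from dataclasses import dataclass

# A minimal sketch of one turn through the chained voice-agent loop
# (STT -> LLM -> TTS). transcribe(), generate_reply(), and synthesize()
# are hypothetical stubs standing in for real providers.

@dataclass
class Turn:
    user_text: str
    agent_text: str

def transcribe(audio: bytes) -> str:
    """Speech-to-text: convert the caller's audio into text (stub)."""
    return "what are your pricing plans?"

def generate_reply(user_text: str, history: list[Turn]) -> str:
    """LLM: generate the agent's answer, e.g. grounded in help articles (stub)."""
    return "We have three plans. Would you like to hear about them?"

def synthesize(text: str) -> bytes:
    """Text-to-speech: convert the answer back into audio for the call (stub)."""
    return text.encode()

def handle_turn(audio_in: bytes, history: list[Turn]) -> bytes:
    user_text = transcribe(audio_in)                 # STT
    agent_text = generate_reply(user_text, history)  # LLM
    history.append(Turn(user_text, agent_text))
    return synthesize(agent_text)                    # TTS
```

A voice-to-voice model would collapse all three calls into a single audio-in, audio-out step, which is where the speed versus control trade-off comes from.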
There are two other components I want to mention: RAG and telephony. RAG is obviously super critical for a lot of agent experiences, and here it's important for the agent answering questions based on the knowledge base. And then telephony: actually being able to put the agent on phone lines. We had a bit of a head start, because our agent on chat already had the RAG set up and we already had a native phone support product, so we got some of those things for free.
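As an illustration of the knowledge-base retrieval step, here's a toy RAG sketch. The character-histogram embedding exists only so the example runs; in practice you'd use a real embedding model over your help articles.

```python
import math

# Toy sketch of retrieval over help articles (RAG). The embedding below is a
# stand-in so the example runs; swap in a real embedding model.

ARTICLES = [
    ("Pricing plans", "We offer Starter, Pro, and Enterprise plans ..."),
    ("Returns policy", "Items can be returned within 30 days of delivery ..."),
]

def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(ARTICLES, key=lambda a: cosine(q, embed(a[1])), reverse=True)
    return [body for _, body in ranked[:k]]

# The retrieved article text is then passed to the LLM as grounding context.
print(retrieve("what's your returns policy?"))
```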
Now, once we had the technical foundation in place, the question was how we actually design the conversations for voice. Intercom has a background in chat, but we knew from the get-go that the approach for voice would have to be a bit different, and that voice is not just chat with sound. There are three key differences I want to mention; there are obviously many more, but these are a few I thought were worth calling out.
First, latency. On chat, I think it's actually okay to wait a few seconds for a response; at least from the user's perspective there's a lot of tolerance. But that obviously doesn't work on voice: if the agent goes silent for a second or two, or maybe longer, the user might assume something has gone wrong. In terms of our approach, for simple queries we got latency to about one second, so we didn't need to do anything extra. But for more complex queries, when they run a bit longer, three or four seconds, we injected filler words, like "let me look into this for you" or "let me look it up", to maintain the conversation flow while we generate the answer in the background.
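Here's one way filler-word injection could look, as a minimal asyncio sketch; speak() and generate_answer() are hypothetical stubs, and the 1.5-second threshold is an illustrative assumption.

```python
import asyncio

# Sketch of filler-word injection: if the answer takes longer than a
# threshold, speak a filler phrase while generation continues in the
# background. speak() and generate_answer() are hypothetical stubs.

async def generate_answer(query: str) -> str:
    await asyncio.sleep(3.5)  # simulate a slow, complex query
    return "Your card was declined because ..."

async def speak(text: str) -> None:
    print(f"[agent] {text}")  # stand-in for TTS playback on the call

async def answer_with_filler(query: str, threshold_s: float = 1.5) -> None:
    task = asyncio.create_task(generate_answer(query))
    try:
        # Wait up to the threshold; shield() keeps the task alive on timeout.
        answer = await asyncio.wait_for(asyncio.shield(task), timeout=threshold_s)
    except asyncio.TimeoutError:
        await speak("Let me look into this for you.")  # keep the flow going
        answer = await task                            # then the real answer
    await speak(answer)

asyncio.run(answer_with_filler("Why was my card declined?"))
```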
The second one is answer length. On chat, it's actually probably desirable for customer service agents to give a somewhat longer response, to provide as much context as possible to the user, and it's easy to skim through the answer to find the right information. But again, that wouldn't work on voice: you don't want to sit there for a minute or two listening to the agent. So for more complex responses, or responses with multiple steps, we break the answer into multiple chunks and deliver it chunk by chunk, and after each chunk we ask the user to confirm whether they'd like to hear the next step. This works really well for something like troubleshooting, where you have a few steps to follow.
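A sketch of that chunk-by-chunk delivery, assuming hypothetical speak() and listen() helpers wired to the phone channel:

```python
# Sketch of chunk-by-chunk answer delivery for multi-step responses; the
# speak()/listen() helpers and the steps themselves are illustrative.

STEPS = [
    "First, unplug the router and wait ten seconds.",
    "Next, plug it back in and wait for the light to turn green.",
    "Finally, reconnect your device to the Wi-Fi network.",
]

def speak(text: str) -> None:
    print(f"[agent] {text}")   # stand-in for TTS playback

def listen() -> str:
    return input("[caller] ")  # stand-in for STT on the caller's reply

def deliver_steps(steps: list[str]) -> None:
    for i, step in enumerate(steps):
        speak(step)
        if i < len(steps) - 1:
            speak("Would you like the next step?")
            if not listen().strip().lower().startswith("y"):
                speak("No problem. Is there anything else I can help with?")
                return
    speak("That's all the steps. Did that resolve the issue?")

deliver_steps(STEPS)
```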
And lastly, the user mindset. Something interesting we saw during our early testing on real phone calls is that some customers would interact with Fin Voice like with an old-school IVR, just using single words like "support", "password reset", "yes", "no". But then throughout the conversation (I've listened to a lot of those calls) they change their behavior and actually start using full sentences once they hear the agent using full sentences. One of my colleagues summed it up nicely: it's crazy how the human speaks more like a bot and the bot speaks more like a human. I think this will change over time; to some extent people have been trained by older systems, and as voice AI gets better and becomes more common as a technology, people are going to get used to it. But for now, it's on us to make those conversations sound as natural as possible, so we help with this transition.
Now, thinking about how Fin integrates into support workflows. This was super important, and definitely surprising for me: when we got to this point, the majority of the feedback wasn't about the voice, the model, or the latency; it was actually about how it works with the support team's workflows. And don't get me wrong, I do think all those core model experiences are super important, but this actually became the bigger blocker for those teams. So we put a lot of focus on making the integration points as smooth as possible. We did a bunch of things, but I want to mention two here. One is escalation paths: configuration for how calls get escalated to the human support team. The other is context handoff: after every AI agent call, we generate a transcript summary that gives the human agent who takes the call a bit more context on what happened. These are not super flashy features, but they were absolutely essential to get from the demo stage to the deployment stage for larger customers.
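As an illustration, a context handoff like the one described might carry a payload along these lines; the field names and summarize() stub are hypothetical, not Fin's actual schema.

```python
from dataclasses import dataclass, field

# Sketch of a context-handoff payload passed to the human agent on
# escalation. Field names and summarize() are hypothetical assumptions.

def summarize(transcript: list[str]) -> str:
    # In practice this would be an LLM call over the full transcript.
    return "Caller asked about a refund; agent could not verify the order ID."

@dataclass
class Handoff:
    caller_id: str
    summary: str    # transcript summary shown to the human agent
    outcome: str    # e.g. "escalated", "unresolved"
    transcript: list[str] = field(default_factory=list)

def build_handoff(caller_id: str, transcript: list[str]) -> Handoff:
    return Handoff(caller_id, summarize(transcript), "escalated", transcript)
```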
And then, how do we know it's working? There are a few topics I want to touch on. One is manual and automated evals. We had a set of test conversations that we would run through on every major code change. Initially it was mostly manual, just in a spreadsheet, but over time we added some automation.
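A minimal sketch of what such an automated eval harness could look like; the cases, expected substrings, and run_agent() stub are all illustrative assumptions.

```python
# Minimal sketch of an eval harness over test conversations, run on every
# major code change. All names and cases here are illustrative.

TEST_CASES = [
    ("What are your pricing plans?", "Starter"),    # expected substring
    ("What is your returns policy?", "30 days"),
]

def run_agent(question: str) -> str:
    # Stand-in for a full playground session against the voice agent.
    return "We offer Starter, Pro, and Enterprise plans."

def run_evals() -> None:
    for question, expected in TEST_CASES:
        answer = run_agent(question)
        verdict = "PASS" if expected.lower() in answer.lower() else "FAIL"
        print(f"{verdict}: {question!r} -> {answer!r}")

run_evals()
```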
Number two is internal tooling, and this was super critical for troubleshooting. Essentially, we built some internal Streamlit web apps to review the logs, the transcripts, and the recordings, so any time one of our customers has an issue, we can review in detail what happened in the conversation and actually troubleshoot it with the logs.
Number three is resolution rate. This is our north-star metric, which actually tells us whether we're delivering value for our customers. We define it as either the user confirming on the call that the issue was resolved, or the user disconnecting after hearing at least one answer and then not calling back within 24 hours. This is the main metric we track; obviously there's more to customer service, but this is the main success metric we have.
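That definition translates fairly directly into code; here's a sketch under the assumption of a simple call log, with hypothetical field names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Sketch of the resolution-rate definition above: resolved means the user
# confirmed resolution, or heard at least one answer and didn't call back
# within 24 hours. The Call fields are hypothetical.

@dataclass
class Call:
    caller: str
    ended_at: datetime
    confirmed_resolved: bool
    answers_heard: int

def resolution_rate(calls: list[Call]) -> float:
    def called_back(call: Call) -> bool:
        cutoff = call.ended_at + timedelta(hours=24)
        return any(c.caller == call.caller and call.ended_at < c.ended_at <= cutoff
                   for c in calls)

    resolved = sum(
        1 for c in calls
        if c.confirmed_resolved or (c.answers_heard >= 1 and not called_back(c))
    )
    return resolved / len(calls) if calls else 0.0
```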
And lastly, LLM-as-a-judge. This is more experimental, but we're using another model to analyze the call transcripts to help us identify issues or opportunities for improvement.
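A sketch of what that judging pass could look like; the rubric prompt and judge_llm() stub are illustrative, not Fin's actual setup.

```python
# Sketch of an LLM-as-a-judge pass over call transcripts. The rubric and
# judge_llm() stub are illustrative assumptions.

RUBRIC = (
    "You are reviewing a customer support phone call transcript. "
    "Rate answer accuracy, conversational flow, and escalation handling "
    "from 1 to 5, and list any issues or opportunities for improvement. "
    "Reply as JSON."
)

def judge_llm(prompt: str) -> str:
    # Replace with a real LLM API call.
    return '{"accuracy": 4, "flow": 3, "escalation": 5, "issues": ["long pause"]}'

def review_transcript(transcript: str) -> str:
    return judge_llm(f"{RUBRIC}\n\nTranscript:\n{transcript}")
```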
Finally, how to price it. I just want to touch on the cost and some of the pricing models. The typical cost ranges between 3 and 20 cents per minute, and it will depend on the complexity of your queries, but also on the providers you choose. In terms of pricing models, the two dominant ones on the market are usage-based pricing and outcome-based pricing, and usage-based is probably still the most dominant right now. Usage-based pricing is very simple, per minute or per call, and very predictable, but it doesn't necessarily capture the quality of the agent, so the incentives are not very well aligned between the provider and the customer. This changes with outcome-based pricing, because you only charge if you actually resolve something for the customer. It has a lot of benefits, and it also reduces risk for the customer, because for a very long call, or a call that goes unresolved, the provider actually takes the cost. So there is risk for the provider, but over time I do expect the market to converge toward outcome-based pricing, because those incentives are much better aligned.
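As a toy illustration of how the two models differ for an unresolved call: the $0.10/minute rate and $0.99 per-resolution price below are made-up assumptions, chosen within the 3-20 cents/minute cost range mentioned above.

```python
# Toy comparison of the two pricing models for a single call. The rates are
# made-up assumptions, not Fin's actual pricing.

def usage_based(minutes: float, per_minute: float = 0.10) -> float:
    return minutes * per_minute                  # charged regardless of outcome

def outcome_based(resolved: bool, per_resolution: float = 0.99) -> float:
    return per_resolution if resolved else 0.0   # provider absorbs failed calls

# A 5-minute unresolved call: $0.50 usage-based vs $0.00 outcome-based.
print(usage_based(5.0), outcome_based(False))
```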
Cool. A few final thoughts. To recap: we built a voice AI agent and shipped it in about 100 days, and we got several enterprise customers to use it on their main phone lines. When I think about some of the main takeaways from this experience, getting to the right performance, the latency and the resolution outcomes, is obviously super important, but it's not just a model problem. It's also a product problem. It's about finding the right use case, designing for the realities of phone conversations, building the tools, both internal and external, integrating with the support workflows, and actually building trust with the support teams, because they're ultimately going to be the decision makers on whether to release it. And it's about making it feel effortless, even if there's a lot of complexity behind the scenes. That's everything from me. Thank you very much. And if you're building in this space, I'd love to chat with you.