Shipping an Enterprise Voice AI Agent in 100 Days - Peter Bar, Intercom Fin
Channel: aiDotEngineer
Published at: 2025-07-18
YouTube video id: HOYLZ7IVgJo
Source: https://www.youtube.com/watch?v=HOYLZ7IVgJo
Today I'm going to be talking about Fin Voice. Fin Voice is a voice agent for phone support, and we designed it to be a frontline teammate for inbound calls. It picks up the phone, answers customers' questions, and escalates to a human agent when needed. We built this experience in about 100 days, so in this talk I'll share what it took to get there. I'll also talk about why I believe voice is the next big frontier in AI for customer service.

First, a little context on my company, Intercom. We're a customer service platform and also an AI agent company. You might be familiar with us because of our Messenger product; you might have seen it in a mobile app or on a website. It's been our foundation for years, but we evolved over time: we became a complete customer service platform a few years ago, adding robust tooling for other channels like email, WhatsApp, and phone. Then two years ago, right after the launch of GPT-4, we launched Fin, an AI agent for text chat. Fin's growth has been incredible: we have over 5,000 customers, and it's reaching an average resolution rate of 56%, with some customers at 70-80%. We define that as the percentage of interactions handled by Fin that are resolved without human intervention. Fin is also a full system for continuous optimization, so it's not just the agent but also tooling for analyzing conversations, training the agent's behavior, and testing and deploying changes. But up until now we didn't have the voice channel, and that's what we're changing with Fin Voice: the same system, but now it can answer phone calls.

A few thoughts on why voice, on why we're investing in this channel. For a lot of users, voice is simply the preferred way to get help. When an issue is urgent or sensitive, they don't necessarily want to type; they just want to talk. And if you look at the top-level data, over 80% of support teams still use phone support, and of all customer service interactions globally, over one third happen over the phone. So it's not a legacy channel that's going away; it's still widely used, and it's also quite costly. The average cost of handling a phone call in the US with human support is between seven and twelve dollars, and with voice AI agents it can be at least five times cheaper.

A few more benefits of voice AI in customer service. First, availability: 24/7 support, so you can call your bank on the weekend. No wait time: the agent is instantly available, so there's no need to stay on hold in a queue. No IVR menus: no "press one for support, press two for payments," because everything happens via natural speech. And multilingual: AI agents can support 30-40+ languages, which is obviously better for users. On the business side, there are major cost savings, plus scalability as the business grows or when you need to handle peak times; AI agents are much better suited for that.

So, how we built Fin Voice. Over the next few minutes I want to cover seven main areas that had the biggest impact on how Fin Voice came about. I'll try to be practical, focusing on some of the product decisions we made and the challenges we faced.
The first one is the use case, the starting point for voice; then the scope of our MVP, the tech stack behind it, how we approached conversation design, how we integrated with support teams, and how we thought about evaluation and pricing.

Starting with the use case: if you look at a lot of the voice AI startups in the space, they typically start with a narrow problem space, like scheduling a dentist appointment or booking a table at a restaurant. We looked at some of those options, but eventually decided to go for a more flexible, knowledge-based agent: an agent that can answer help-article questions like "what are your pricing plans?" or "what's your returns policy?"

Why did we decide to go this way? First, we had strong evidence from chat. Fin over chat has been handling those kinds of conversations for years, and our customers have consistently told us that they see the same types of issues over the phone as they see on chat. We also validated this through analysis of call transcripts, which confirmed that a very large percentage of all queries could be solved with the knowledge base, the help-article content, rather than, say, with API integrations.

We were also thinking about the initial wedge use case: what's the lowest-risk way for companies to integrate voice agents? We looked at in-office-hours and outside-of-office-hours use cases, and we pitched out-of-office hours as the initial wedge, because it lets the team try out the technology without affecting their main workflows, build up confidence over time, and later deploy it during main office hours. Outside of office hours, it simply replaces their voicemail experience.

There were also a few other use cases we looked at: authentication, verifying the user's identity on another channel; info gathering, the agent collecting things like order IDs and account IDs; and smart routing to the right team. These use cases are still very high leverage because they can save a lot of time for support agents, but they don't necessarily solve the issue end to end. So we're still focused on them, but they weren't the primary use case for this initial version of the product.

Now, moving on to what we shipped first. When we started, the biggest challenge was to ship something meaningful as soon as possible. We had access to a lot of customers, because we already had thousands of customers using our native phone support product, so it was mostly about how we could test as quickly as possible. We focused on three main experiences: testing, deploying, and monitoring the agent's behavior.

First, testing. This is what we call the Fin Voice playground: a lightweight test environment where customer service managers can go in, simulate a few sessions, ask questions based on the knowledge base, get answers, and get an idea of how the product actually works. We shipped this within roughly the first four weeks of the project; it was the fastest possible way to get feedback from customer service managers on how it was actually performing, so we could optimize based not just on our internal views but also on customer feedback.
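The playground idea is easy to picture in code. Here's a minimal sketch of a playground-style harness that runs scripted questions through the agent and records a transcript for review; answer_question is a hypothetical stand-in for the real agent backend, not Fin's actual API:

```python
# Minimal sketch of a playground-style test harness: run scripted questions
# through the agent and capture a transcript for review. answer_question is
# a hypothetical stand-in for the real knowledge-base agent.
from datetime import datetime, timezone

def answer_question(question: str) -> str:
    # Placeholder: a real system would call the knowledge-base agent here.
    return f"(agent answer to: {question})"

def simulate_session(questions: list[str]) -> list[dict]:
    transcript = []
    for q in questions:
        transcript.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "question": q,
            "answer": answer_question(q),
        })
    return transcript

if __name__ == "__main__":
    for turn in simulate_session([
        "What are your pricing plans?",
        "What's your returns policy?",
    ]):
        print(turn["question"], "->", turn["answer"])
```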
Then the deploy experience: this allowed customer service managers to actually deploy the agent on their phone lines, and included some basic configuration of the agent's behavior and of how it should interact with the support team's workflows. And lastly, observability, or monitoring: we really wanted to provide visibility into what's actually happening on those calls with an AI agent. So we built experiences that show customer service agents the transcripts, the recordings, the transcript summaries, and the call outcomes.

Now, moving to the tech stack. I'm not going to go through the technical detail of everything; I'm sure there will be a few more talks on this that you might attend as well. But I'll mention some core components of Fin Voice. There's the main chained loop for the voice agent: STT, LLM, TTS. Speech-to-text converts speech into text, the LLM generates the response, and text-to-speech converts the text back into audio. There's also another approach, voice-to-voice models, where everything is processed directly as audio, skipping the text layer entirely. The voice-to-voice approach has the benefit of potentially faster and more natural-sounding speech, but it gives you less control over the output. In our case, we started with the Realtime API by itself from the get-go, which allowed us to test very quickly. Eventually we evolved our stack, but we're still using the Realtime API as part of the core architecture.

There are two other components I want to mention: RAG and telephony. RAG is obviously super critical for a lot of agent experiences, and here it's important for the agent answering questions based on the knowledge base. And then telephony: actually being able to put the agent on phone lines. We had a bit of a head start, because our chat agent already had RAG set up and we already had a native phone support product, so we got some of those things for free.
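To make the chained approach concrete, here's a minimal sketch of one turn of an STT → LLM → TTS loop, assuming OpenAI's hosted transcription, chat, and speech models; it's an illustration of the pattern, not Fin Voice's actual implementation:

```python
# One turn of a chained voice pipeline (STT -> LLM -> TTS), sketched with
# OpenAI's hosted models. Illustrative only; not Fin Voice's implementation.
from openai import OpenAI

client = OpenAI()

def answer_call_turn(audio_path: str, history: list[dict]) -> bytes:
    # 1. Speech-to-text: transcribe the caller's utterance.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # 2. LLM: generate the agent's reply given the conversation so far.
    history.append({"role": "user", "content": transcript})
    reply = client.chat.completions.create(
        model="gpt-4o", messages=history
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    # 3. Text-to-speech: synthesize the reply for the telephony layer to play.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply
    )
    return speech.read()  # raw audio bytes
```

In practice the history would start with a system prompt describing the agent's role, and each stage would stream rather than run sequentially, which is exactly the latency pressure that motivates the voice-to-voice approach.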
Once we had the technical foundation in place, the question was how to actually design conversations for voice. Intercom has a background in chat, but we knew from the get-go that the approach for voice would have to be a bit different: voice is not just chat with sound. There are three key differences I want to mention; there are obviously many more, but these are worth calling out.

First, latency. On chat, it's actually okay to wait a few seconds for a response; from the user's perspective there's a lot of tolerance. That doesn't work on voice: if the agent goes silent for a second or two, or longer, the user might assume something has gone wrong. In our approach, for simple queries we got latency down to about one second, so we didn't need to do anything extra. But for more complex queries running longer, three to four seconds, we injected filler phrases like "let me look into this for you" or "let me look it up" to maintain the conversation flow while we generate the answer in the background.

Second, answer length. On chat, it's probably desirable for a customer service agent to give a somewhat longer response, providing as much context as possible, since it's easy to skim the answer for the right information. Again, that wouldn't work on voice: you don't want to sit there for a minute or two listening to the agent. So for more complex responses, or responses with multiple steps, we break the answer into chunks and deliver it chunk by chunk, and after each one we ask the user to confirm whether they'd like to hear the next step. This works really well for something like troubleshooting, where there are a few steps to follow.

And lastly, the user mindset. Something interesting we saw during early testing on real phone calls is that some customers would interact with Fin Voice like an old-school IVR, using single words: "support," "password reset," "yes," "no." But I've listened to a lot of those calls, and partway through the conversation they change their behavior and start using full sentences, once they hear the agent using full sentences. One of my colleagues summed it up nicely: it's crazy how the human speaks more like a bot and the bot speaks more like a human. I think this will change over time, as voice agents get better and people get used to the technology, but for now it's on us to make these conversations sound as natural as possible and help with that transition.

Now, on how Fin Voice integrates into support workflows. This was super important, and definitely surprising for me: when we got to this point, the majority of the feedback wasn't about the voice, the model, or the latency; it was about how the agent works with the support team's workflows. Don't get me wrong, I do think the core model experience is super important, but this became the bigger blocker for those teams, so we put a lot of focus on making the integration points as smooth as possible. We did a bunch of things here, but I'll mention two. One is escalation paths: configuration for how calls get escalated to the human support team. The other is context handoff: after every AI agent call, we generate a transcript summary that gives the human agent who takes the call more context on what happened. These are not flashy features, but they were absolutely essential to get from the demo stage to the deployment stage with larger customers.
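A context handoff like this can be as simple as one summarization call over the transcript. Here's a minimal sketch, again assuming an OpenAI chat model; the prompt and names are illustrative, not Fin's actual ones:

```python
# Sketch of a context-handoff summary: condense an AI-handled call transcript
# into a short briefing for the human agent taking over. Illustrative only.
from openai import OpenAI

client = OpenAI()

HANDOFF_PROMPT = (
    "Summarize this support call for the human agent taking over. "
    "Include: the customer's issue, what the AI agent already tried, "
    "and anything still unresolved. Keep it under 100 words."
)

def handoff_summary(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": HANDOFF_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content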
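And going back to the two conversation-design tactics above, filler injection and chunked delivery: both are simple control-flow ideas. A minimal sketch, where speak, listen_yes_no, and generate_answer are hypothetical stand-ins for the real audio layer:

```python
# Sketches of two voice conversation tactics: injecting a filler phrase when
# the answer is slow, and delivering multi-step answers chunk by chunk.
# speak / listen_yes_no / generate_answer are hypothetical stand-ins.
import asyncio
import random

FILLERS = ["Let me look into this for you.", "One moment, let me look it up."]

async def respond_with_filler(generate_answer, speak, threshold_s: float = 1.5):
    # Start generating the answer in the background.
    task = asyncio.ensure_future(generate_answer())
    try:
        # If the answer arrives quickly, just speak it.
        answer = await asyncio.wait_for(asyncio.shield(task), timeout=threshold_s)
    except asyncio.TimeoutError:
        # Otherwise, fill the silence, then wait for the real answer.
        await speak(random.choice(FILLERS))
        answer = await task
    await speak(answer)

async def deliver_in_chunks(steps, speak, listen_yes_no):
    # Read a multi-step answer one chunk at a time, confirming before each
    # next step (useful for troubleshooting flows).
    for i, step in enumerate(steps):
        await speak(step)
        if i < len(steps) - 1:
            await speak("Would you like to hear the next step?")
            if not await listen_yes_no():
                break
```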
Then, how do we know it's working? There are a few things I want to touch on. One is manual and automated evals: we had a set of test conversations that we ran on every major code change. Initially this was mostly manual, just in a spreadsheet, but over time we added some automation.

Number two is internal tooling, which was super critical for troubleshooting. We built some internal Streamlit web apps to review the logs, the transcripts, and the recordings, so any time one of our customers hit an issue, we could review in detail what happened in the conversation and actually troubleshoot it with the logs.

Number three is resolution rate. This is our north-star metric, the one that actually tells us whether we're delivering value for our customers. We define a call as resolved when either the user confirms on the call that the issue was resolved, or the user disconnects after hearing at least one answer and doesn't call back within 24 hours. There's obviously more to measure in customer service, but this is the main success metric we track.

And lastly, LLM-as-a-judge. This is more experimental, but we're using another LLM to analyze call transcripts and help us identify issues and opportunities for improvement.

Finally, how to price it. I just want to touch on the cost and some of the pricing models. The typical cost ranges between 3 and 20 cents per minute, depending on the complexity of your queries but also on the providers you choose. As for pricing models, the two dominant ones on the market are usage-based pricing and outcome-based pricing, with usage-based probably still the most common. Usage-based pricing is very simple, per minute or per call, and very predictable, but it doesn't capture the quality of the agent, so the incentives aren't well aligned between provider and customer. That changes with outcome-based pricing, where you only charge if you actually resolve something for the customer. It has a lot of benefits, but it also shifts risk to the provider: for a very long call, or a call that goes unresolved, the provider has to absorb the cost. So there is risk there, but over time I expect the market to converge toward outcome-based pricing, because the incentives are much better aligned.

A few final thoughts. To recap: we built a voice AI agent and shipped it in about 100 days, and we got several enterprise customers to use it on their main phone lines. When I think about the main takeaways from this experience, getting to the right performance on latency and resolution outcomes is obviously super important, but it's not just a model problem; it's also a product problem. It's about picking the right use case, designing for the realities of phone conversations, building the tools, both internal and external, integrating with support workflows, and building trust with support teams, because they're ultimately the decision makers on whether to release it. And it's about making it feel effortless, even if there's a lot of complexity behind the scenes.

That's everything from me. Thank you very much. And if you're building in this space, I'd love to chat with you.