[Full Workshop] Building Conversational AI Agents - Thor Schaeff, ElevenLabs
Channel: aiDotEngineer
Published at: 2025-07-31
YouTube video id: MPtCBaZn84A
Source: https://www.youtube.com/watch?v=MPtCBaZn84A
[Music] All right, hello everyone. I hope you have enough space. Thanks so much for joining. We're building multilingual conversational AI agents. I know it's a bit of a mouthful, but hopefully we'll get something going. And yes, different than on the poster: we're not from "Eivelyn Labs," we're from 11 Labs. Can I get a quick show of hands, have you heard of 11 Labs? Okay, everyone has heard of it. That's great. So we can do some nice things today, maybe look at some new stuff you haven't played with yet.

This is me. I'm Thor, here on the right side of the screen. I work on developer experience at 11 Labs, so a lot of that is the conversational AI agents platform, and I also have my colleague Paul here with me. If you have any questions throughout the workshop, we'll be floating around later and happy to answer them, so feel free to just put your hand up. Paul works with me on developer experience, so if you have any feedback on documentation, examples, developer experience, or the product in general, do let us know.

You can scan the QR code to get the slides. There are a couple of resources linked in the slides, as well as a form where you can fill in your email address if you want some credits; we can give you credits for the next three months to play around with 11 Labs and try this out. So feel free to scan this and save it for later; the resources are linked in there as well.

Also, if you're building with 11 Labs, I recommend you follow our 11 Labs Devs Twitter account. It's specifically for updates on API versions and client libraries, so if you're a developer building with 11 Labs, that's a good place to stay in the loop with what's happening. And as I mentioned, if you scanned the QR code earlier, there's also a link to the form: fill in your email address and we'll send you the credits after this workshop. In the workshop itself you can get started with the free account, which should be plenty, but we'll send you a coupon code via email for the next three months to try it out. Cool. And as I mentioned, the resources are in the slides, so we'll tap into those later on and get started building.

We can also see what folks are building. Since we're looking at multilingual conversational AI agents, do you want to shout out the languages you're specifically looking to unlock with conversational AI? Anything other than English? Portuguese. Very nice. Are you looking for a Brazilian Portuguese accent? Cool, we can look at that. So we've got Portuguese. Any other languages? Spanish, yes, we've got some Spanish in there as well. That's good. So, Portuguese, Spanish, what else? Hungarian. Very good. Do we actually have Hungarian right now?
We don't yet, but hopefully soon. We're working on version three of the multilingual models, and we'll need to double-check, I don't want to promise something wrong, but maybe not today, maybe in a couple of weeks, we can give you Hungarian. That's great. Okay, any others? Mandarin? Hindi. Yes, a huge population speaks Hindi, and then again India has 50-plus languages, I believe. Currently we have Hindi and Tamil, and we're working on adding some additional languages there. Okay, so we have a good mix we can play around with. If there are any other languages, we can explore those later as well.

Maybe one language that probably doesn't get spoken about often enough is barking. If you want to build applications for the 900 million dogs that are out there (I actually looked this up), you can use our model. This was our most recent launch, about two months ago now, I believe, so maybe we can have a little listen. The dachshund is my favorite, actually. The golden retriever a bit more, and yeah, that one's a bit feisty. But if you're building applications for dogs, that might be a great use case for you. Did anyone see this launch two months ago? The unfortunate thing is we launched it on the 1st of April, so everyone thought it was an April Fools; the timing was a bit unfortunate. But as you can hear, text to bark... no, in fact, it was an April Fools. So be careful when you play this to your dog, they might get offended, because we can't guarantee the content translates. The actual sounds you're hearing, though, are generated by our sound effects model.

So we do have a sound effects model: if you go to the app, there's the sound effects model here, and this sound effect, "truck reversing," was also created with it. You can click generate, and basically we generate four different samples for you that you can use. Not so interesting for your conversations per se, but if you're creating video games, for example... We might be having some internet issues, I hope we have internet. Okay, the internet is looking okay, not sure what's happening here. For example, this was a drum cowbell I generated recently. Actually, if you Google "11 Labs soundboard," we built this recently, which is pretty cool: it's basically a drum machine where you can loop sound effects. The drums are pretty nice as well, and they're mapped to the keys on the keyboard, so you can play them. And then you can add new sound effects and so on, just to give you an idea of some of the things we do.

Now, obviously, here we're talking about conversational AI agents. If we look at the different components involved in building conversational AI agents, we have the user, who is speaking some language, and we need to transcribe that speech into text.
We then feed that into a large language model, which acts as the brain of our agent. We currently don't build an intelligence model ourselves; we partner with the existing large language model providers, so your GPT-4o, your Google Gemini, what have you. That large language model, acting as the brain of your agent, generates a text output, and we then stream that text output back into speech. That is roughly the pipeline we have for building conversational AI agents.

With that, there's a bunch of system tools that are built in. For example, we have a language detection system tool, which facilitates the language switching we're going to look at in a bit. We also have function calling, or tool calling, so if you need to give your agent access to specific functionality, you can do that here as well.

There are different approaches. If you've seen OpenAI Realtime, for example, it doesn't actually go through text; it goes sound token to sound token, which has some benefits. But for deploying conversational AI agents at scale, and really understanding what is happening, going sound token to sound token means you're flying a bit blind: you're trusting that the model actually replies intelligently. Whereas if you're going through text, you get better monitoring and a better understanding of what's going on in your conversation. So while we're also exploring sound-to-sound for conversational AI agents, for now what we've found works best is this pipeline, and we deploy all of these models very close to each other to bring the latency down as much as possible.

Now we can look at the individual components within that. For speech to text: this is actually the most recent model we launched, and this one wasn't an April Fools. It's our speech-to-text model, our automatic speech recognition (ASR) model, and it's benchmark-leading across 99 different languages at the moment. You can see the functionality that's built in: word-level timestamps, speaker diarization, and audio event tagging, so if you want coughing, laughing, and similar audio tags in your transcript, you can enable that as well, all within structured API responses, which is really nice.

So what you can see here, for example, is a conference call: "Quick check-in. Maple Street is a mess. Time to fix it. Totally. Some of those potholes could swallow a small car or a very brave skateboarder. We start next week." As we play this audio, you see that the model recognizes the different speakers and tags them as speaker one and speaker two, and we have the word-level timestamps, so as I play it back we can highlight the individual words at the word level. That's really useful.
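For reference, here's a minimal sketch of what that call could look like from the Python SDK, assuming the current SDK shape (a `speech_to_text.convert` call with diarization and audio-event tagging switched on). Exact parameter and field names may differ between SDK versions, so treat this as a starting point rather than the definitive call.

```python
# Minimal sketch: transcribe an audio file with speaker diarization and audio-event tags.
# Assumes the ElevenLabs Python SDK (`pip install elevenlabs`) and an API key in
# ELEVENLABS_API_KEY; parameter and field names may vary slightly between SDK versions.
import os

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

with open("conference_call.mp3", "rb") as audio:
    transcript = client.speech_to_text.convert(
        file=audio,
        model_id="scribe_v1",     # the speech-to-text (ASR) model
        diarize=True,             # label speaker_0, speaker_1, ...
        tag_audio_events=True,    # include events such as (laughter) in the transcript
    )

# Full transcript text, plus structured word-level detail.
print(transcript.text)
for word in transcript.words:
    print(word.text, word.start, word.end, getattr(word, "speaker_id", None))
# The response also reports the detected language and a confidence score,
# which is what the language-detection system tool builds on later.
```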
As I said, this is all available through the API. One thing I built, which you can try out yourself (it's free to demo), is a little Telegram bot: if you're using Telegram, you can forward voice messages or videos to the bot, it automatically identifies what language it is, and it gives you back the transcript of that message. You know this all too well: you're sitting in a meeting, your grandmother sends you a voice message, and you don't know, is this urgent? Is this important? So you forward it to the bot, and very quickly you get the transcript back. You can see it here; I hope you recognize this if you were paying attention. Anyone recognize it? Yes, it's what I was just saying. I was just recording a voice message on my phone, and I get the transcript back. Thank you, someone was paying attention, that's great.

Now, the cool thing as well: I actually live in Singapore, and if you spend time in Singapore you might have heard Singlish, the Singaporean English. "Go to hell, this person. Bastard. You know, I asked for a plastic bag. You must put the thing inside the plastic bag for me, right? She never. She just throw the bag on the table." Anyone understand what's going on? Do we have any Singaporeans in the house? No? It's not that easy. Even after six years in Singapore, I still sometimes struggle with it. So we can forward it to our transcription bot, we can see it was received, it's transcribing it now, and yes: "Go to hell, bastard. You know, I asked for a plastic bag. You must put the thing inside the plastic bag for me." So you can see these are the kinds of problems we have in Singapore.

Now, another accent that might be a bit challenging: Scottish. "Some things I need to know about Scotland. Yeah, they're trying to split up the country and all that, and we don't want that. We want a decent council and all, and that's basically that." Anyone understand what's going on in Scotland? No? Okay, no Scottish people here. Again, we can forward it, and this is really cool: even without fine-tuning the model, it can actually understand specific accents quite well. "So you need to know they're trying to split up the country and all that, and we don't want that. We want a decent council and all, and that's basically that." There you go. I've actually listened to this so often now that I can hear it, but you might not be familiar with it.

Now, one example that's maybe a bit closer to home if you're here in the US: "Mr. President, what would you like to say about the Bangladesh issue? Because we saw, and it is evident, how the deep state of the United States was involved in regime change during the Biden administration, and then Muhammad Yunus met junior Soros also. So what is your point of view about Bangladesh, and what is the role that the deep state played in the situation?" What you're seeing here is that the president has a translator for English-to-English translation.
So we thought maybe he can just use this transcription bot. If we forward that, even if the audio quality isn't great and there's a lot of background noise, the model is actually very good at identifying it: "Bangladesh issue... United States..." There you go.

So that's one component: we need to understand what the user is saying, and then feed that into our LLM to have a meaningful conversation. For the other part, we don't provide the intelligence layer ourselves; that's where we partner with the leading model providers. You can also fine-tune your own model: if you, for example, fine-tune and deploy it on, say, Google Vertex AI, you just need an OpenAI-compatible API endpoint, and then you can plug your custom LLM into this pipeline as well, which we can look at in a little bit.

And then we have the other component: once the LLM starts streaming out the response, we want to start streaming the speech as soon as possible. This pipeline is streaming throughout, so we get the fastest, snappiest response possible. For the actual text to speech, what's great with 11 Labs, and we heard we want a Brazilian Portuguese accent, is that we have a huge library of voices available on the platform, more than 5,000 different voices to choose from. If you go into your 11 Labs account, you can go to Voices and explore different voices.

Now, if you were sitting here in this workshop and thought, "Oh, I really like this voice," you're in luck. You can go to the voice library and type in "German engineer" (I'm originally from Germany), and you'll find me: "True success is doing what you are born to do and doing it well." Does that sound like me? It's trained on some of my YouTube videos, so I think maybe I talk a bit differently in the YouTube videos, but this is great. The thing is, I basically cloned my voice and published it on the voice library, and any time you use that voice, I get royalties. This is a marketplace; we actually recently surpassed the $5 million US milestone paid out to the voice actors publishing their voices on the platform. So in this workshop, if you use this voice, I'll be very grateful, because then I can have a coffee later. No, but obviously you can use my voice if you want to; you don't have to.

The great thing is you can set really narrow filters to find the voice you want. For example, if you want Portuguese, you can set the language filter to Portuguese and then choose the accent, so here, for example, Brazilian Portuguese. You can narrow it down further by gender and age; there are certain metadata tags you can apply. And then we can have a listen. Okay, I don't know, I haven't spent much time in Brazil. Does that sound Brazilian? A little bit? There are a lot of voices available on the platform, so you can see if you find one that fits the local accent you're looking for.
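To make the text-to-speech side concrete, here's a minimal sketch with the Python SDK: pick a voice ID from the voice library (the ID and the Portuguese sample text below are placeholders) and generate audio with the multilingual model. Method and parameter names reflect the current SDK and may differ in older versions.

```python
# Minimal sketch: generate speech with a voice picked from the voice library.
# The voice_id is a placeholder; copy the real ID from the voice's page in the app.
import os

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

audio = client.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",            # e.g. a Brazilian Portuguese voice from the library
    text="O verdadeiro sucesso é fazer bem aquilo que você nasceu para fazer.",
    model_id="eleven_multilingual_v2",   # multilingual text-to-speech model
    output_format="mp3_44100_128",
)

# The SDK streams the audio back as chunks of bytes; write them to a file.
with open("sample.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
```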
And then what you can do is put all of these different pieces together into your conversational AI agent. Here in the dashboard we can go to Conversational AI, and we can configure a lot of our agent right there within the dashboard, and then bring it into our application with the JavaScript SDKs or the Python SDKs, depending on what applications you're building.

Let's do a quick demo of an agent I built for a conference in Singapore. If you're familiar with Singapore, there are four official government languages, so if you're building applications for Singapore you actually need to provide English, Mandarin Chinese, Malay, and Tamil. Those are the official government languages, and obviously other languages are spoken as well, Hindi and Japanese, for example. Within conversational AI we currently have, I think, 31 different languages that the agent can speak, switch between, and identify. We're working on the next iteration, the V3 of our multilingual models, which will bring, I believe, up to 99 different languages, so no matter where you are in the world, you can get pretty good coverage.

You can see here that we can configure the LLM: we can choose any of the big LLM providers out there, which we have directly integrated and colocated with our platform, or you can put in your custom LLM, which is just an OpenAI-compatible API endpoint. You can add a knowledge base: upload documents or reference your website. There's RAG built in as well; depending on the model you're using, you might not want to put the entire knowledge base into the context, so you can enable RAG there. And then the tools: this is the standard function calling you might be familiar with from working with large language models. Depending on the model you choose, there might be some nuances in how reliably the function calling works, but the larger ones, GPT-4 and up, Gemini 2.5, are generally very reliable with function calling.

Then we have the system tools that are built in. You can see the language detection system tool here; we can just enable it. It's configured by default to identify different languages and switch between them, but you can also override the LLM prompt to customize it for your use case. Maybe we just give it a little test and see how it works.

"Hi there. This is the API Days Singapore conversational AI agent. How can I help you today?" Can you tell me the name of that 11 Labs speaker? "The speaker from 11 Labs at API Days Singapore is Thorsten Schaeff, who is listed as a developer advocate. Would you like to know more about his session or background?" Yeah, but what languages can you speak? "I can communicate in English, Chinese, Hindi, Malay, Japanese, and Tamil. If you prefer to continue in any of these languages, just let me know which one you'd like to use. Would you like to switch to another language now?" Yes, can we switch to Hindi, please? [The agent replies in Hindi.]
Okay, unfortunately I don't speak any Hindi. Can you repeat that in English? "Of course, we can continue in English. Would you like more information about Thorsten Schaeff or 11 Labs?" So I guess you get the gist. With the language detection tool, there are two different modes. Either, as in the first scenario with my broken Mandarin, I was basically just saying, "Oh, sorry, I don't speak any English, can we speak Mandarin?", and even with my terrible Mandarin it recognized that I was trying to speak Mandarin and replied in Mandarin. Or we can specifically ask, "What languages can you speak? Can we switch to Hindi, please?" So this is the built-in language detection system tool we can use to facilitate these multilingual conversations, which is really nice.

Cool, so this is roughly what I wanted to show you as a start. Now we have the next 30 minutes for you to play around with this yourself, and we're in the room, so if you have any questions we can answer them. There are various ways to configure your agents. By default, you can get started in the dashboard and configure a lot of the behavior and functionality there, and then, if you go back to the resources, we have the documentation and different examples around conversational AI. For example, we have examples for Next.js, to build it into your Next.js applications, and examples for Python; if you want to build on hardware devices somewhere, you might want to use Python on, say, a Raspberry Pi. So the examples are there, and once you've configured the agent you can bring it into your application.

Alternatively, you can configure all the nuances of your agents via the API. If you're building a marketplace where you configure agents on behalf of someone else, you would generally do that through the API. We also have an MCP server, so if you're using, say, Claude Desktop, you can bring in the MCP server and just tell it in natural language, "Please set up an 11 Labs conversational AI agent with this voice," and it knows what API calls to make to set up your agent.

But yes, thanks so much for joining. If you have any questions, we'll be floating around; you can also ask them now if you want to ask in front of the audience. Otherwise, please just go to elevenlabs.io and create your account if you don't have one already. Then go to the app, go to Conversational AI, and in the agents interface you can create a new agent. Maybe just start off with the support agent template, and then go through and configure that agent with your voices and your languages. We'd love to hear a lot of different agents speak at the end of the next 30 minutes. Awesome, thanks so much.
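If you'd rather drive the hands-on part from code than from the dashboard, here's a minimal sketch of running a dashboard-configured agent from Python, assuming the SDK's conversational AI helpers (Conversation plus DefaultAudioInterface, which uses your local microphone and speakers and needs pyaudio installed). The agent ID placeholder comes from your agent's page, and class paths may shift between SDK versions.

```python
# Minimal sketch: talk to a dashboard-configured agent from Python over the local mic/speakers.
# Assumes the ElevenLabs Python SDK's conversational AI helpers; DefaultAudioInterface
# needs pyaudio. The agent ID is a placeholder copied from the agent's page.
import os

from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

conversation = Conversation(
    client,
    "YOUR_AGENT_ID",                           # from the agent's page in the dashboard
    requires_auth=True,                        # False if the agent is public
    audio_interface=DefaultAudioInterface(),   # local microphone + speakers
    callback_agent_response=lambda text: print(f"Agent: {text}"),
    callback_user_transcript=lambda text: print(f"You:   {text}"),
)

conversation.start_session()         # opens the realtime session and starts listening
conversation.wait_for_session_end()  # blocks until the conversation ends
```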
Do let us know your questions, and we'll be here for the next 30 minutes to help you set up your agent yourself. Thank you. Did anyone have questions they wanted to ask in the room? There are microphones over there; do you want to go up to the microphone and ask?

Hello everyone. First of all, great presentation. The question is about how it can switch to different languages without fine-tuning: what is the background process? Can you explain a bit more how it shifts so well that it can interact in the regional languages, with the accent matched as well?

Yeah. What you can see here, in my agent configuration, is that within the Voices tab I can assign different voices to the different languages. For example, in Singapore the most commonly spoken Tamil accent is the Chennai accent, so I went to the voice library, found a voice that is Chennai-accent Tamil, added it to my voices, and assigned it here. So in the Voices tab you configure the different voices for the different languages. The actual language detection part is the automatic speech recognition model: the ASR model identifies which language is being spoken and assigns a likelihood score, a confidence score, along with the transcript. Based on the confidence score for the language being spoken, the system tool then automatically switches to that language in the background and replies with the voice you configured for it. Does that roughly answer the question? Cool.

Do you mind coming forward? Just because it's also being recorded, so then we have it. Great presentation, by the way. Quick question: other than changing the language and language detection, does it have the ability to take other actions throughout the call? For example, the appointment-setting use case: does it have actions to call a webhook, check whether there are any appointments available, either in Make or n8n, and then report back? Can we set that up in the prompts?

Yeah, correct. The configuration of that is a combination of your system prompt together with the tools. You can configure custom tools, and these can be server-side tools, which would be a webhook call to your CRM or your system. This is the standard tool calling, or function calling, that the LLM supports. GPT-4o and the more modern models generally support function calling and structured outputs, so as long as the large language model powering your agent supports function calling, you can add your tools, and this can be a combination of server-side tools.
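To make the server-side tool idea concrete, here's a hypothetical sketch of the kind of webhook such a tool could point at: a small Flask endpoint on your side that checks appointment availability and returns JSON for the agent to read back. The route, the payload fields, and the `find_open_slots` helper are all made up for illustration; the tool itself (name, description, parameters, URL) is configured on the agent in the dashboard or via the API.

```python
# Hypothetical server-side tool target: the agent's tool is configured with this URL,
# the LLM decides when to call it, and the JSON response flows back into the conversation.
# Route, payload fields, and find_open_slots() are illustrative, not an ElevenLabs API.
from datetime import date

from flask import Flask, jsonify, request

app = Flask(__name__)

def find_open_slots(day: str) -> list[str]:
    # Placeholder for a lookup against your real calendar, CRM, or scheduling system.
    return ["10:00", "14:30"] if day else []

@app.post("/tools/check-availability")
def check_availability():
    payload = request.get_json(force=True) or {}
    day = payload.get("date", date.today().isoformat())
    return jsonify({"date": day, "available_slots": find_open_slots(day)})

if __name__ == "__main__":
    app.run(port=8080)
```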
So, for example, as you mentioned, for scheduling, we also have an example with cal.com: you put in the API endpoints for cal.com, and then the agent can actually look up whether there's availability in the calendar and schedule it. It can ask for the email address and then schedule the meeting for all the parties through the conversational AI agent.

Okay, one more question. What would you suggest as a good structure for a conversational agent with super low latency? Obviously the model plays a big role, so price and latency are the two biggest factors when we want to do an outbound or inbound dialing agent. What would you suggest for, say, an outbound dialing agent, or even an inbound one, to decrease the latency and make it feel more conversational? Because if you use a really good, really large model, the latency is just super high, and that's not a meaningful conversation to have.

Gotcha. That's a good question. If you go to the documentation, within conversational AI we have some best practices. I'm not sure if we have specific... I do that by testing, but I want to skip the testing part. Do we have specific guidance on the best model? It kind of depends on your use case, right? Although I think he's asking about the best LLM to use. Yeah, I think it depends on your use case: depending on the function calling and how much of it you have, you can probably go down to a Gemini Flash or Flash-Lite to reduce the latency on that side. On our end, in terms of the voice models, we use the Flash models by default for the speech generation, depending on the languages you want to support. So yeah, it really is a matter of testing. I think I'll be able to figure it out with proper testing with different models, Flash on and off, all that good stuff. Thank you so much, man. Cool, thanks.

I have a couple of questions. Number one: what does the total cost per minute come down to? You mean by default? It depends on what pricing tier you're on. If you go to the pricing page, for conversational AI, depending on the tier you're on, there's a certain number of minutes included. Currently the pricing is based on call minutes, and depending on which tier you're on, additional minutes are charged at a specific price. So it does depend somewhat on the pricing tier you're on.

If you want an application where you need really long interaction times, say a companion that talks to you while you cook, is there something you can do to mitigate that per-minute cost? Yeah, that's a good question. I think for those use cases it might not be great at the moment, because if it's just running 24/7 to be able to talk to someone, the cost is pretty significant. That's a good question.
I mean, potentially, if you reach out to the sales team, there is some custom pricing we can do based on the use case. But for now, this is charged based on minutes used, for the time the session is live.

Okay, and a final question. I tried to build an agent in your dashboard, and it had several tasks: one onboarding task and then a follow-up task, and it kind of got confused; it wasn't identifying when to do one task and when to do the other. Is there a way to mitigate this, or maybe have a multi-agent configuration?

Yeah, so we have what we call agent-to-agent transfers. This is a system tool as well: one way to do this is to set up different agents for different use cases. For certain use cases you might also want to use a different LLM to power them, depending on which LLM is best for the task. So you configure different agents for different tasks, and then you configure an orchestration setup that routes to a different agent in the background. If you keep the voice the same, this happens somewhat silently, without the user actually knowing that they're being transferred; it doesn't feel like an abrupt hand-off. It also means you can develop these agents across teams, where one team owns a specific agent, and in the background you just switch between the agents for the different tasks. Great, thank you. Thanks. Cool. Any more questions?

Yeah. Latency. In general, simple examples are great for demos, but when you have something heavier, more enterprise-grade, with more data, and your RAG is taking time to come back, how do you not have a shitty experience? Think of it from the end user's perspective: they ask something, and then there's a blank for the next minute or two. Do you suggest fillers? Does the conversational AI say, "I'm thinking, let me think about it"? How do you make it more natural? Because in an enterprise setup, if I have to go look up a patient's claim, that might hit a database, and once that comes back it goes to other systems and does a whole bunch of things, and that takes time. For latency that can't be avoided, how do you set it up so the conversational experience is still good?

Yeah, there are certain things you can do. If you have a big knowledge base, for example, using RAG is one way of mitigating it taking too much time. And then, in terms of your tools, when you define your tools you can configure the response time: there's a timeout configuration for how long you want to wait for the tool to come back, and whether you want to wait for the response at all. I think the maximum timeout we allow is 120 seconds.
And then the agent will actually say, "Oh, I'm currently looking that up in the system. Sorry, we're still waiting on the response." So that is built into the tooling here. Now, depending on your use case, you probably want to set that timeout fairly low, because you're waiting on the call. I wonder if you can do something where it says, "I'll call you back," and then takes the action. For now, depending on your timeout, it basically waits for the response, and it stays conversational and talks the user through it: "We're still waiting on the tool response."

And are all conversations linear, or can they branch off and come back? While it's looking something up, can it come back five minutes later, "Oh, I found it," and in the meantime get other details from the customer or patient?

That's a good question. I don't think so. Currently, I think in that case you would orchestrate it in a way where you put the task into a queue. The only thing then is how you update the agent. I would need to look this up, but there might be a way with the websockets where, as information comes back, you can inject that additional information into the conversation through a websocket notification. I'd need to look up that specific use case, because then what you could do is put these tasks into a queue, work through them in the background, and as the responses come back, inject them back into the context through websocket events, potentially, but I would need to double-check whether that's feasible. Okay, that's all I have. Thanks, man. Cool, thanks.

Sweet. While we're going through the questions, is everyone setting up their agents? No? Okay, good, I see someone nodding there. Fantastic. Okay, cheers.

Hello. Could you explain or demonstrate what happens if you have multiple languages within an input and output, and how that might work if you have different accents for each language? So you mean if the person is mixing different languages? Yeah, for instance, if they say, "Explain the difference between Schadenfreude and..." and then the Chinese term. That's a good question; I haven't tested that specifically. Maybe you can try right now? We can. So what was it? "Explain the difference between Schadenfreude and xìngzāilèhuò." Okay, great.

"Hi there. This is the API Days Singapore conversational AI agent. How can I help you today?" Can you please explain the difference between Schadenfreude and xìngzāilèhuò? "Schadenfreude is a German term that means the pleasure or satisfaction one feels from witnessing another person's misfortune or failure. It is a well-known concept and is often used in English as well. However, what I heard as 'zinc hyohawk' does not appear to be a recognized term in German or in psychology, and there is no widely accepted meaning for it. It's possible there may be a spelling error or confusion with another term. Could you clarify what you mean or provide more context? That way I can give you a more accurate comparison." It might be my pronunciation, but it's meant to be a Chinese term.
"Thank you for clarifying that you meant a Chinese term. Unfortunately, that does not sound like a recognizable phrase in Mandarin or other major Chinese dialects, at least based on that pronunciation. If you can provide the Chinese characters, a different spelling, or more context about what the term describes or when it is used, I may be able to help you identify it." Yeah, it might be my pronunciation. But the problem is that if you mix languages... well, English and German are at least both Germanic languages, right? But Mandarin is so different, so I think there might be challenges there, and the more languages you mix, the more challenging it gets.

Do you have a recommendation, then, if you want to build, say, a language learning application? I wonder if there are certain things you can do with the system prompt to improve how it's being picked up. But because we're going through text here, the language learning use case is a bit more challenging, especially if you're mixing Germanic languages with, say, Mandarin. It's a good question; I don't have an immediate answer for you there. You might actually want to try something sound-token-to-sound-token, like OpenAI Realtime; I wonder if it does better in that case because it doesn't go through text. It does, but sometimes it switches the accents too, which is kind of annoying, because it will try to pronounce Chinese, for instance, in an English accent. Ah, interesting. So yes, there are challenges with that, but that's a good one; I'll take it back and see how we handle it. I know we have a customer in India, Supernova, that does this, but it's specifically English learning for the Indian market, so I think it's a bit of a different use case. Do you know if it produces English in a particular accent, or is it using Indian phonetic sounds? I think there's a case study, "11 Labs Supernova"; there's a video, so maybe take a look at that, and if you connect with us we can also follow up with some guidance on that use case specifically. Perfect, thank you. Cool, thanks.

All right. Hey, I was curious if you're worried about scammers or fraudsters using these tools. Yeah, there's definitely a worry with that, as with all of this technology. One thing that is very important for us: if you go to elevenlabs.io/safety, you can see the safety tools we're developing in parallel to our features. There's a bunch of things we do. For example, we do live moderation for certain things: when you publish your voice to the voice library, you can specify terms that you don't want your voice to say, and in that case we have live moderation to make sure your voice isn't used to generate those specific terms or sentences. We also monitor in general what's being generated on the platform.
Beyond moderation, the other tooling is that any speech generated on our platform is watermarked, to the extent that we can trace back which account generated a specific piece of speech. So if we identify fraudulent activity, we can trace back which account generated it, ban them, or provide information to the authorities as needed. The other thing is the voice captcha we developed for when you're creating a professional voice clone: we generate a random sentence that you need to read out, to verify that you have permission to clone that voice. So with all the technology we develop, we do put quite a large amount of focus and effort into safety tooling. There is obviously always a concern that your technology is being used for fraudulent activity, but so far we've been trying to mitigate that with the safety tooling. Definitely. Yeah, it looks like a lot of good guardrails are in place. Thanks. Thanks.

Yay, you're back. You want to go? Just building on her question: in our setting, if a patient is asked, "Hey, how are you feeling about this?", they might try to speak in English and hold part of the conversation in English, but then they might jump to Spanish or Portuguese and come back to English. When they have to describe something they can't in English, they jump back to the language they're most comfortable with, and sometimes they jump around between languages. The way you explained it, you have some kind of router that checks what language it is and then routes it. But within a conversation they jump between languages: they'll explain a few things in English and then use a few Portuguese words, so it's English, Spanish, and Portuguese all mixed together. Sometimes, if you just ask, "How are you feeling?", they come back with, "I took this medication, it hurts," and then if you ask, "Okay, where is it hurting?", they suddenly regress to whatever language is most comfortable for them to explain it.

Gotcha. Yeah. I mean, you can see here that the transcript actually correctly identified "Schadenfreude," because technically it's also an English word, right?
But on the Chinese word it just completely missed it. Well, you can partly blame my pronunciation, probably blame it a lot; I wonder how a native speaker would do. I don't have exact benchmarks on how much the transcript gets worse the more languages you introduce in the same conversation. I think generally, if you have two languages intermixed, it tends to perform okay, but if you have three different languages, it progressively tends to get worse. But I don't have exact benchmarks on how many languages it can handle. Okay, cool, that's good to know; that's what I came up with too. It would be worthwhile, if you have recordings of some of that, to put them through our transcription model and see how it performs at identifying it. That would be interesting. Yeah. Cool, thank you.

I saved this question toward the end because it's somewhat unrelated. I worked on a project where we used 11 Labs for the voice track of our avatar, and 11 Labs functioned well, but we had a lot of downstream issues in terms of lip sync and, I think someone mentioned, lags and timing and other things. So is there any plan for 11 Labs to come further down the stack in terms of avatars? Is that even something you're thinking about? Interesting. So did you build the lip syncing model and the avatar stack yourselves? We built it within the NVIDIA Tokkio stack: they have their Riva voice model, and we swapped that out for 11 Labs. So they have the full stack for the avatar, and 11 Labs is just the voice portion of it. Ah, okay. And sorry, which stack was that? NVIDIA Tokkio; we did everything on the GPU, but they have the visualization, and you can just plug in your voice model. It's spelled T-O-K-K-I-O. Ah, you see, I was thinking about Japan. Is it this one? Yeah. Oh, interesting. Okay. I personally don't have experience with that one; I know we're mostly working with partners like Hedra and HeyGen for the avatar side of things. Paul, do you know anything? No. So this is something I would need to come back to you on and look into. It's interesting. So you're saying out of the box it uses an NVIDIA model for speech generation? Yeah, but our client, which I assume is one of your partners... we can't talk about that, but our client couldn't use the NVIDIA model; they had a contract to use the 11 Labs model. Okay, that's kind of interesting. Sorry, I don't have a good answer for you there right now, but this is interesting; we can go back to the team and see if there are any resources we can give you in terms of improving that. No worries, thanks. But interesting use case. Thank you.

All right, last question. Hi, thanks for the presentation. I have a question regarding the transcription model, around adding custom vocabulary. At the company I work for, we use a lot of three-letter acronyms, and let's say I want the model to read out SAP as "S-A-P" and not "sap." Is there a way to tell it to do that?
And is there a way to tell it to read words a certain way, and to nudge the interpretation of what I say toward certain words that we use in our vocabulary?

Interesting. So you have this use case on both sides: on the speech-to-text side you need to correctly identify the acronyms, and then you also need the agent to reply back with the correct pronunciation. For the reply side, have you seen pronunciation dictionaries? We have a way for you to provide pronunciation dictionaries using phonetic alphabets, to specify, basically, "tomato, tomahto," so you can use those to make sure the text to speech pronounces the acronyms and words the way you want. The other side, speech to text, is an interesting case: I don't think we have a way to fine-tune that specifically for different acronyms. That's a good question; do you know anything there? No? Right. Oh, interesting. To a certain extent you can do it through the system prompt, where you put in a normalization layer to identify things in the transcript that are acronyms and then have the LLM massage them into what you want. I think that's what you were saying, right? Yeah, so it could be interesting to see if that works well. Have you tried it out already? Where I was coming from: the company uses its own custom chat called Joule, like the unit of work, but whenever I read transcripts it often comes out as "jewel," as in the diamond. Gotcha. And so that's the struggle I'm facing. Okay. Is this actually at SAP? Nice, I'm an SAP child myself; my father was there early on as well. Okay, anyway, too much information. Cool. Thanks for that; we can chat some more and see if that's something we can get going.

Sweet. And with that, we're at time. Thanks again, and thanks so much for joining. Please do connect, find the resources, and fill in the form for the credits. I'll leave this up in case you haven't had a chance to scan it. Thanks so much for joining, and enjoy the conference. We will also have a booth at the expo, so if you come up with more questions, you can come find us there. Thank you. Thank you, Shane. [Music]