[Full Workshop] Building Conversational AI Agents - Thor Schaeff, ElevenLabs

Channel: aiDotEngineer

Published at: 2025-07-31

YouTube video id: MPtCBaZn84A

Source: https://www.youtube.com/watch?v=MPtCBaZn84A

All right. Hello everyone, I hope you have enough space. Thanks so much for joining. We're building multilingual conversational AI agents today. I know it's a bit of a mouthful, but hopefully we'll get something going.

So, slightly different than on the poster: we're not from "Eivelyn Labs", we're from ElevenLabs. Can I get a quick show of hands, have you heard of ElevenLabs? Okay, everyone has heard of it. That's great, so we can do some nice things today and maybe look at some new stuff you haven't played with yet.

This is me: I'm Thor, here on the right side of the screen. I work on developer experience at ElevenLabs, so a lot of that is the conversational AI agents platform. I also have my colleague Paul here with me, so if you have any questions throughout the workshop, we'll be floating around later and happy to answer them - feel free to just put your hand up. Paul works with me on developer experience, so if you have any feedback on documentation, examples, developer experience, or just the product in general, do let us know.
Also, you can scan the QR code to get the slides. There are a couple of resources linked in the slides, as well as a form where you can fill in your email address if you want some credits - we can give you a couple of credits for the next three months to play around with ElevenLabs and try this out. So feel free to scan this and save it for later; the resources are linked in there as well.
Cool. Also, if you're building with ElevenLabs, I recommend you follow our ElevenLabs Devs Twitter account. That's specifically for updates on API versions and client libraries, so if you're a developer building with ElevenLabs it's a good place to stay in the loop with what's happening. And as I mentioned, if you scanned the QR code earlier, there's also a link to open the form; fill in your email address and we'll email you a coupon code after this workshop. For the workshop itself you can get started with the free account - that should be plenty - and the coupon code covers the next three months so you can keep trying it out.
Cool.
And as I mentioned, the resources are in the slides, so we'll tap into those later on and get started building. We can also see what folks are building. Since we're looking at multilingual conversational AI agents, do you want to shout out the languages you're specifically looking to unlock with conversational AI? Anything other than English?
Any other languages? Portuguese. Portuguese, very nice - are you looking for a Brazilian Portuguese accent? Cool, we can look at that. So we've got Portuguese. Any other languages? Spanish. Yes, we've got some Spanish in there as well, that's good. So, Portuguese, Spanish. Hungarian? Hungarian, very good. Do we have Hungarian right now? We don't yet, but hopefully soon - we're working on version three of the multilingual models. We'll need to double-check, and I don't want to promise something wrong, but maybe not today, maybe in a couple of weeks we can give you Hungarian. Okay, any others? Mandarin? Hindi. Yeah, a huge population speaks Hindi, and then again India has fifty-plus languages, I believe. We currently have Hindi and Tamil, and we're working on additional languages there. Okay, so we have a good mix that we can play around with.
Cool. If there are any other languages later, we can explore those as well. Maybe one language that doesn't get spoken about often enough is barking. If you want to build applications for the 900 million dogs out there - I actually looked that number up - you can use our model. This was our most recent launch, about two months ago now. Maybe we can have a little listen. The dachshund is my favourite, actually. The golden retriever is a bit more... and yeah, that one's a bit feisty. But if you're building applications for dogs, that might be a great use case for you. Did anyone see this launch a couple of months ago? The unfortunate thing is we launched it on the 1st of April, so everyone thought it was an April Fools' joke - the timing was a bit unfortunate. And as you can see, text to bark... no, in fact, it was an April Fools' joke. So be careful when you play this to your dog - they might get offended, because we can't guarantee the context translates. But the actual sounds you're hearing are generated by our sound effects model.

We do have a sound effects model: if you go to the app, there's a sound effects tool, and this sound effect was also created with it. "Truck reversing" - maybe that's a good one. You click generate, and basically we generate four different samples for you that you can use. Not so interesting for your conversations per se, but if you're creating video games, for example - I don't know if anyone is doing that... we might be having some internet trouble. I hope we have internet. No? Okay, the internet is looking okay, not sure what's happening here. So, for example, this was a drum cowbell I generated recently. Actually, if you Google "ElevenLabs soundboard", we built this recently, which is pretty cool - it's basically a drum machine, so you can loop sound effects, and the drums are pretty nice as well; they're mapped to the keys on the keyboard, so you can play. So that's pretty cool, and you can add new sound effects and so on - just to give you an idea of some of the things we do.
Now, obviously, here we're talking about conversational AI agents. If we look at the different components involved in building them: we have the user, who is speaking some language, and we need to transcribe that speech into text. We then feed that into a large language model, which acts as the brain of the agent. In our case we currently don't build any intelligence model ourselves; we partner with the existing large language model providers - your GPT-4o, your Google Gemini, what have you. That large language model, acting as the brain of your agent, generates a text output, and we then stream that text output back into speech. So that is roughly the pipeline we have for building conversational AI agents.

With that, there's a bunch of system tools built in. For example, we have a language detection system tool, which facilitates the language switching we're going to look at in a bit. We also have function calling / tool calling, so if you need to give your agent access to specific functionality, you can do that there as well.

There are different approaches. If you've seen OpenAI Realtime, for example, it doesn't actually go through text: it goes sound token to sound token, which has some benefits. But what we've seen for deploying conversational AI agents at scale is that, going sound token to sound token, you're flying a little blind - you're trusting the model to reply intelligently - whereas going through text gives you better monitoring and a clearer understanding of what's happening in your conversation. So while we're also exploring sound-to-sound for conversational AI agents, for now what we've found works best is this pipeline, and we deploy all these models very close to each other to bring the latency down as much as possible.
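To make that pipeline concrete, here is a minimal, non-streaming sketch of a single conversational turn using the ElevenLabs Python SDK together with an OpenAI chat model. The real agents platform streams every stage and colocates the models; the voice ID, file names, and exact SDK parameters below are assumptions for illustration only.

```python
# Minimal sketch of one conversational turn: speech -> text -> LLM -> speech.
# Assumes the ElevenLabs and OpenAI Python SDKs; IDs and file names are placeholders.
from elevenlabs.client import ElevenLabs
from openai import OpenAI

eleven = ElevenLabs(api_key="YOUR_ELEVENLABS_API_KEY")
llm = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# 1) Speech to text: transcribe the user's audio.
with open("user_turn.wav", "rb") as audio:
    transcript = eleven.speech_to_text.convert(file=audio, model_id="scribe_v1")

# 2) LLM "brain": generate the agent's reply from the transcript.
reply = llm.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful multilingual assistant."},
        {"role": "user", "content": transcript.text},
    ],
)
reply_text = reply.choices[0].message.content

# 3) Text to speech: speak the reply with a chosen voice.
audio_out = eleven.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",
    model_id="eleven_flash_v2_5",
    text=reply_text,
)
with open("agent_turn.mp3", "wb") as f:
    for chunk in audio_out:  # the SDK returns audio as a stream of byte chunks
        f.write(chunk)
```

In the actual agents platform each of these stages streams into the next rather than waiting for the previous one to finish, which is where most of the latency win comes from.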
So now we can look at the individual components within that. For speech to text, this is actually the most recent model we launched - the one that wasn't an April Fools' joke. It's our speech-to-text model, our automatic speech recognition (ASR) model, and it's benchmark-leading across 99 different languages at the moment. You can see the functionality that's built in: word-level timestamps, speaker diarization, and audio event tagging - so if you want coughing, laughing, and those kinds of audio tags in your transcript, you can enable that as well - and it all comes back in structured API responses, which is really nice.

What you can see here, for example, is a conference call: "Quick check-in. Maple Street is a mess. Time to fix it." "Totally. Some of those potholes could swallow a small car or a very brave skateboarder." "We start next week." As we play this audio, the model recognizes the different speakers and tags them as speaker one and speaker two, and we get the word-level timestamps, so as I play it we can highlight each word as it's spoken. That's really useful. And of course all of this is available through the API.
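As a rough sketch, here is how you might request diarization and audio-event tagging from the speech-to-text API and then walk the word-level timestamps. The parameter and response field names (diarize, tag_audio_events, words, start, end, speaker_id) are my reading of the Scribe docs, so treat them as assumptions and check the API reference.

```python
# Sketch: request diarization and audio-event tagging, then walk the
# word-level timestamps. Parameter and field names are assumptions based on
# the Scribe speech-to-text docs; verify against the API reference.
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_ELEVENLABS_API_KEY")

with open("conference_call.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        file=audio,
        model_id="scribe_v1",
        diarize=True,            # label speaker_0, speaker_1, ...
        tag_audio_events=True,   # mark laughter, coughing, etc.
    )

# Overall detected language and full transcript.
print(result.language_code, result.text)

# Each word carries its own start/end time and speaker label, which is what
# drives the word-by-word highlighting in the demo.
for word in result.words:
    print(word.start, "-", word.end, word.speaker_id, ":", word.text)
```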
In fact, one thing I built - and you can try this out yourself, it's available for free as a demo - is a little Telegram bot: if you use Telegram, you can forward voice messages or videos to the bot, and it automatically identifies what language it is and gives you back the transcript of that message. You know this situation all too well: you're sitting in a meeting, your grandmother sends you a voice message, and you don't know - is this urgent? Is this important? So you forward it to the bot, and very quickly you get the transcript back. If you want to try this out, you can.
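The bot itself isn't open sourced here, but a rough sketch of how you could wire up something similar with python-telegram-bot and the speech-to-text endpoint looks like this; the bot token handling and the language_code field are assumptions for illustration, not the actual code of the demo bot.

```python
# Rough sketch of a "forward me a voice note, get a transcript back" bot,
# using python-telegram-bot v20+. Illustration only, not the demo bot's code.
import os
from elevenlabs.client import ElevenLabs
from telegram import Update
from telegram.ext import Application, ContextTypes, MessageHandler, filters

eleven = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

async def transcribe_voice(update: Update, context: ContextTypes.DEFAULT_TYPE):
    # Download the forwarded voice note (or audio file) to disk.
    voice = update.message.voice or update.message.audio
    tg_file = await voice.get_file()
    path = await tg_file.download_to_drive()

    # Transcribe it; the model also detects which language is being spoken.
    with open(path, "rb") as audio:
        result = eleven.speech_to_text.convert(file=audio, model_id="scribe_v1")

    await update.message.reply_text(
        f"Detected language: {result.language_code}\n\n{result.text}"
    )

app = Application.builder().token(os.environ["TELEGRAM_BOT_TOKEN"]).build()
app.add_handler(MessageHandler(filters.VOICE | filters.AUDIO, transcribe_voice))
app.run_polling()
```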
You can see it here - and I hope you recognize this, if you were paying attention. Anyone recognize it? Yes, exactly: it's what I was just saying. I was just recording a voice message on my phone - thank you, someone was paying attention, that's great - and then I get the transcript back.

Now, the cool thing as well: I actually live in Singapore, and if you spend time in Singapore you might have heard Singlish, the Singaporean English. "Go to hell, this person. Bastard. You know, I asked for a plastic bag. You must put the thing inside the plastic bag for me, right? She never - she just throw the bag on the table." Anyone understand what's going on? Do we have any Singaporeans in the house? No? It's not that easy - even after six years in Singapore, I still sometimes struggle with it. So we can forward it to our transcription bot. We can see it was received, it's transcribing it now, and - yeah: "Go to hell, bastard. You know, I asked for a plastic bag. You must put the thing inside the plastic bag for me." So you can see these are the kinds of problems we have in Singapore. Now, another accent that might be a bit challenging: Scottish.
"Some things I need to know about Scotland." "Yeah, they're trying to split up the country and all that, and we don't want that. We want a decent council and all, and that's basically that." Anyone understand what's going on in Scotland? No? Okay, no Scottish people here. Again, we can forward it - and this is really cool, actually: even without fine-tuning, the model can understand specific accents quite well. "You need to know they're trying to split up the country and all that, and we don't want that. We want a decent council and all, and that's basically that." There you go. I've listened to this so often now that I can actually hear it, but you might not be familiar with it.
Now, one example that's maybe a bit closer to home if you're here in the US: "And Mr. President, what would you like to say about the Bangladesh issue? Because we saw, and it is evident, how the deep state of the United States was involved in regime change during the Biden administration, and then Muhammad Yunus met junior Soros also. So what is your point of view about Bangladesh, and what is the role that the deep state played in the situation?" What you're seeing here is that the president has a translator for English-to-English translation, so we thought maybe he could just use this transcription bot. If we forward that - even if the audio quality isn't great and there's a lot of background noise, the model is actually very good at identifying the speech. "Yeah... Bangladesh issue... United States."
There you go. So that's one component: we need to understand what the user is saying and feed it into our LLM to have a meaningful conversation. For the other part, we don't specifically provide the intelligence layer; that's where we partner with the leading model providers. You can also fine-tune your own model - if you fine-tune and deploy it on, say, Google Vertex AI, you just need an OpenAI-API-compatible endpoint, and then you can plug your custom LLM into this pipeline as well, which we'll look at in a little bit.
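To make "OpenAI-API-compatible endpoint" concrete, here is a minimal FastAPI stub exposing the /v1/chat/completions shape such an endpoint would serve. A production custom LLM for the agents platform would also need to support streaming responses; this sketch simply echoes the user's message where your fine-tuned model would go.

```python
# Minimal shape of an "OpenAI-compatible" chat completions endpoint that a
# fine-tuned model could sit behind. A real custom LLM would also need
# streaming; this sketch only shows the response shape.
import time
import uuid
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    user_text = req.messages[-1].get("content", "")
    # Call your own fine-tuned model here; we just echo for illustration.
    answer = f"You said: {user_text}"
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": answer},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```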
Now we have the other component: once the LLM starts streaming out its response, we want to start streaming the speech as soon as possible. This pipeline streams throughout, so we get the fastest, snappiest response possible. For the actual text to speech, what's great with ElevenLabs - we heard we want a Brazilian Portuguese accent, right? - is that we have a huge library of voices available on the platform: more than 5,000 different voices you can choose from. If you go into your ElevenLabs account, you can go to Voices and explore the different voices.
Now, if you were sitting here in this workshop and thought, "Oh, I really like this voice," you're in luck: you can go to the voice library and type in "German engineer" - I'm originally from Germany - and you'll find me. "True success is doing what you are born to do and doing it well." Does that sound like me? Okay, it's trained on some of my YouTube videos, so maybe I talk a bit differently in those. But this is great: I basically cloned my voice and published it to the voice library, and any time you use that voice, I get royalties. It's a marketplace - we actually recently surpassed the $5 million milestone in payouts to the voice actors publishing their voices on the platform. So in this workshop, if you use this voice, I'll be very grateful, because then I can have a coffee later. But obviously you can use my voice if you want to; you don't have to.
The great thing is you can set really narrow filters to find the voice you want. For example, if you want Portuguese, you can set the language filter to Portuguese and then choose the accent - here, for example, we want Brazilian Portuguese. We can then narrow it down further by gender and age; there are certain metadata tags we can apply. And then we can listen... okay, I don't know, I haven't spent much time in Brazil. Does that sound Brazilian? A little bit. There are a lot of voices available on the platform, so you can see if you find one that fits the local accent you're looking for.
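You can also run that same kind of filtered search programmatically. Here's a sketch against the shared voices endpoint; the endpoint path and query parameter names are assumptions based on the public API, so verify them against the API reference before using them.

```python
# Sketch: searching the shared Voice Library programmatically instead of in
# the dashboard. Endpoint and query parameters are assumptions; check the
# "shared voices" section of the API reference.
import requests

resp = requests.get(
    "https://api.elevenlabs.io/v1/shared-voices",
    headers={"xi-api-key": "YOUR_ELEVENLABS_API_KEY"},
    params={
        "language": "pt",        # Portuguese
        "accent": "brazilian",   # narrow down to a Brazilian accent
        "gender": "female",
        "page_size": 10,
    },
    timeout=30,
)
resp.raise_for_status()

for voice in resp.json().get("voices", []):
    print(voice.get("name"), "-", voice.get("voice_id"))
```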
And then what you can do is put all these different pieces together into your conversational AI agent. Here in the dashboard we can go to Conversational AI, configure a lot of the agent right there within the dashboard, and then bring it into our application with the JavaScript SDK or the Python SDK, depending on what application you're building.
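For the Python side, a minimal sketch of starting a voice session with a dashboard-configured agent might look like the following. The import paths, constructor arguments, and callback names follow my recollection of the Python SDK quickstart, so double-check them against the docs.

```python
# Sketch: starting a voice session with an agent configured in the dashboard,
# using the ElevenLabs Python SDK. Import paths and callback names follow the
# conversational AI quickstart as I recall it; verify against the docs.
import os
from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

conversation = Conversation(
    client,
    agent_id=os.environ["ELEVENLABS_AGENT_ID"],  # the agent built in the dashboard
    requires_auth=True,                           # private agents need an API key
    audio_interface=DefaultAudioInterface(),      # microphone in, speakers out
    callback_agent_response=lambda text: print(f"Agent: {text}"),
    callback_user_transcript=lambda text: print(f"User:  {text}"),
)

conversation.start_session()          # speak with the agent...
conversation.wait_for_session_end()   # ...until the call ends
```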
So let's do a quick demo of an agent I built for a conference in Singapore. If you're familiar with Singapore, there are four official languages, so if you're building applications for Singapore you actually need to provide English, Mandarin Chinese, Malay, and Tamil. Obviously other languages are spoken as well - Hindi and Japanese, for example. Within Conversational AI we currently have, I think, 31 different languages that the agent can speak, switch between, and identify, and we're working on the next iteration, the v3 of our multilingual models, which will bring that up to, I believe, 99 languages. So no matter where you are in the world, you can get pretty good coverage.
You can see here that we can configure the LLM: we can choose any of the big LLM providers out there that we've directly integrated and colocated with our platform, or you can put in your custom LLM, which is just an OpenAI-API-compatible endpoint. You can add a knowledge base - upload documents or reference your website - and there's RAG built in as well; depending on the model you're using, you might not want to put the entire knowledge base into the context, so you can enable RAG there. And then the tools: this is your standard function calling that you might know from working with large language models. Depending on the model you choose, there can be some nuances in how reliably function calling works, but the larger ones - GPT-4-class models, Gemini 2.5 - are generally very reliable with function calling. And then we have the system tools that are built in.
You can see here the language detection system tool - we can just enable it. By default it's configured to identify different languages and switch between them, but you can also override the LLM prompt to customize that for your use case. Then maybe we can give it a little test and see how it works. "Hi there. This is the API Days Singapore conversational AI agent. How can I help you today?"
"Can you tell me the name of the ElevenLabs speaker?" "The speaker from ElevenLabs at API Days Singapore is Thorsten Schaeff, who is listed as a developer advocate. Would you like to know more about his session or background?" "Yeah, but what languages can you speak?" "I can communicate in English, Chinese, Hindi, Malay, Japanese, and Tamil. If you prefer to continue in any of these languages, just let me know which one you'd like to use. Would you like to switch to another language now?" "Yes, can we switch to Hindi, please?" [The agent replies in Hindi.] "Okay, unfortunately I don't speak any Hindi. Can you repeat that in English?" "Of course, we can continue in English. Would you like more information about Thorsten Schaeff or ElevenLabs?"
So I guess you get the gist. With the language detection tool there are two different modes. Either, as in the first scenario with my broken Mandarin, I was basically saying, "Oh, sorry, I don't speak any English, can we speak Mandarin?", and even with my terrible Mandarin it recognized that I was trying to speak Mandarin and replied in Mandarin. Or you can specifically ask: "What languages can you speak? Can we switch to Hindi, please?" So that's the built-in language detection system tool we can use to facilitate these multilingual conversations, which is really nice.
Cool. So this is roughly what I wanted to show you as a start. Now we have the next 30 minutes for you to play around with this yourselves; we're in the room, and if you have any questions we can answer them. There are various ways to configure your agents. By default you can get started in the dashboard and configure a lot of the behaviour and functionality there, and then, if you go back to the resources, we have documentation and different examples around conversational AI. For example, there are examples for Next.js, to build it into your Next.js applications, and examples for Python - if you want to build it into hardware devices somewhere, you might want to use Python on a Raspberry Pi, for example. Once you've configured the agent, you can bring it into your application. Alternatively, you can also configure all the nuances of your agents via the API - if you're building a marketplace where you're configuring agents on behalf of someone else, you would generally do that through the API.
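As a sketch of what that API-driven configuration could look like, here is an illustrative agent-creation call. The endpoint and especially the conversation_config fields are assumptions - the real schema is much richer - so consult the Conversational AI API reference before relying on it.

```python
# Sketch: creating an agent programmatically (e.g. for a marketplace that
# provisions agents on behalf of users). The endpoint and the shape of
# conversation_config are assumptions; the real schema has many more fields.
import requests

payload = {
    "name": "Support agent (pt-BR)",
    "conversation_config": {
        "agent": {
            "first_message": "Olá! Como posso ajudar?",
            "language": "pt",
            "prompt": {
                "prompt": "You are a friendly support agent for Acme.",
                "llm": "gemini-2.5-flash",
            },
        },
        "tts": {"voice_id": "YOUR_VOICE_ID"},
    },
}

resp = requests.post(
    "https://api.elevenlabs.io/v1/convai/agents/create",
    headers={"xi-api-key": "YOUR_ELEVENLABS_API_KEY"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print("Created agent:", resp.json().get("agent_id"))
```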
We also have an MCP server, so if you're using Claude Desktop, for example, you can bring in the MCP server and just tell it in natural language, "Please set up an ElevenLabs conversational AI agent with this voice," and it knows which API calls to make to set up your agent.

So, thanks so much for joining. If you have any questions, we'll be floating around - you can also ask them now in the room if you like - but otherwise, please go to elevenlabs.io and create your account if you don't have one already. Then go to the app, go to Conversational AI, and create a new agent in the agents interface. Maybe start off with the support agent template, and then go through and configure it with your voices and your languages. We'd love to hear a lot of different agents speaking at the end of the next 30 minutes. Awesome, thanks so much. Do let us know your questions, and we'll be here for the next 30 minutes to help you set up your agent. Thank you.
Did anyone have questions they wanted to ask in the room? There are microphones - do you want to go up to the microphone and ask?

[Audience] Hello everyone. First of all, great presentation. My question: you said it can switch to different languages without fine-tuning - what is the background process for that? Can you explain a bit more how it shifts so well that it can interact in regional languages, with a matching accent?

Yeah. So what you can see here, in my agent configuration: within the Voices tab I can assign different voices to the different languages. For example, in Singapore the most commonly spoken Tamil accent is Chennai-accented Tamil, so I went to the voice library, found a voice with a Chennai Tamil accent, added it to my voices, and assigned it here. So in the Voices tab you configure the different voices for the different languages. The actual language detection is done by the automatic speech recognition model: the ASR model identifies which language is being spoken - it assigns a likelihood score for the language alongside the transcript - and, based on that confidence score, the system tool automatically switches to that language in the background and replies with the voice you configured for it. Does that roughly answer the question?
Cool.
Do you mind coming forward, just because it's being recorded, so we have it on the mic?

[Audience] Great presentation, by the way. Quick question: other than changing the language or language detection, does it have the ability to take other actions during the call? For example, for an appointment-setting use case, does it have actions to call a webhook - to check whether there are any appointments available, in Make or n8n, say - and then report back? Can we have that in the prompts?

Yeah, correct. The configuration of that is a combination of your system prompt together with the tools. You can configure custom tools, and these can be server-side tools, which would be a webhook call to your CRM or your system. This is the standard tool calling / function calling that the LLM supports - GPT-4o and the more modern models generally support function calling and structured outputs. As long as the large language model powering your agent supports function calling, you can add your tools, and these can be server-side tools. For example, for scheduling, we also have an example with cal.com: you put in the API endpoints for cal.com, and the agent can look up whether there's availability in the calendar and schedule it - it can ask for the email address and then schedule the meeting for all parties, all through the conversational AI agent.
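On your side, a server-side tool ultimately just calls a webhook you host: the LLM fills in the parameters, the platform posts them to your endpoint, and whatever JSON you return flows back into the conversation. Here's a hypothetical availability-check endpoint; the route and fields are made up for illustration and are not tied to cal.com's actual API.

```python
# Hypothetical webhook that a server-side tool could point at: the agent's
# LLM fills in the parameters (date, email), the platform calls this
# endpoint, and the returned JSON flows back into the conversation.
from datetime import date
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AvailabilityRequest(BaseModel):
    date: date
    email: str | None = None

@app.post("/tools/check-availability")
def check_availability(req: AvailabilityRequest):
    # In a real system you would query your calendar / CRM here
    # (e.g. cal.com or an internal scheduling service).
    slots = ["09:00", "11:30", "15:00"]
    return {"date": req.date.isoformat(), "available_slots": slots}
```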
[Audience] Okay, one more question. What would you suggest as a good structure for a conversational agent with super low latency? Obviously the model plays a big role - price and latency are the two biggest factors for an outbound or inbound dialing agent. What would you suggest to decrease latency and make it feel more conversational? Because if you use a really good, really large model, the latency is just super high, and that's not a meaningful conversation to have.

Gotcha, that's a good question. We have some best practices in the conversational AI documentation. I'm not sure we have guidance on one specific "best" model - it depends on your use case, in particular on how much function calling you have. You can probably go down to a Gemini Flash or Flash-Lite to reduce latency on the LLM side. On our end, for the voice models, we use the Flash models by default for speech generation, so it also depends on the languages you want to support.

[Audience] Yeah, I think I'll be able to figure it out with proper testing - different models, Flash on and off, all that good stuff. Thank you so much. Cool, thanks.
[Audience] I have a couple of questions. Number one: what does the total cost per minute come down to?

By default? It depends on which pricing tier you're on. If you go to the pricing page, for conversational AI there's a certain number of call minutes included depending on the tier - pricing is currently based on call minutes - and additional minutes are charged at a specific price per tier. So it does somewhat depend on the pricing tier you're on.

[Audience] If you have an application that needs really long interaction times - say a companion that talks to you while you cook - is there something you can do to mitigate that cost?

That's a good question, and for those use cases it might not be great at the moment, because if it's running 24/7 the cost is pretty significant. If you reach out to the sales team there is some custom pricing we can do based on the use case, but for now it's charged based on the minutes the session is live.
[Audience] Okay, and a final question. I tried to build an agent in your dashboard and it had several tasks - one onboarding task and another follow-up task - and it got confused about when to do which task. Is there a way to mitigate that, maybe with a multi-agent configuration?

Yeah. We have what we call agent-to-agent transfers. One way to do this is to set that up - it's a system tool as well - where you configure different agents for different use cases. For certain use cases you might also want to use a different LLM, depending on which LLM is best for the task. You configure different agents for the different tasks, and then an orchestration setup that routes to a different agent in the background. If you keep the voice the same, this happens somewhat silently, without the user knowing they're being transferred - it doesn't feel like an explicit transfer. It also means you can develop these agents across teams, where one team owns a specific agent, and in the background you just switch between the agents for the different tasks.

[Audience] Great, thank you. Thanks.
Cool. Any more questions?
[Audience] Yeah - latency. In general, simple examples are great for demos, but when you have something heavier, more enterprise-grade, with more data, your RAG takes time to come back. How do you avoid a bad experience? Think of it from the end user's perspective: they ask something and then there's silence for the next minute or two. Do you suggest fillers - does the conversational AI say "I'm thinking, let me look into it"? How do you make it more natural? Because in an enterprise setup, if I have to look up a patient's claim, that might hit a database, then go to other systems and do a whole bunch of things, and that takes time. For latency that can't be avoided, how do you set it up so the conversational experience is still good?

Yeah, there are certain things you can do. If you have a big knowledge base, for example, using RAG is one way of keeping that from taking too much time. And then for your tools: when you define a tool, you can configure the response timeout - basically how long you want to wait for the tool to come back, and whether you want to wait for the response at all. I think the maximum timeout we allow is 120 seconds, and the agent will actually say, "I'm currently looking that up in the system; sorry, we're still waiting on the response." So that is built into the tooling. Depending on your use case, you probably want to set that fairly low, because the caller is waiting on the line. I wonder if you could do something like "I'll call you back" and take the action asynchronously, but for now, depending on your timeout, it will wait for the response and stay conversational, talking the user through the fact that it's still waiting on the tool response.

[Audience] And are all conversations linear, or can they branch off and come back? While it's looking something up, can it come back five minutes later with "I found it," and in the meantime gather other details from the customer or patient?

That's a good question - I don't think so. In that case you would probably orchestrate it by putting those tasks into a queue; the only question then is how you update the agent. I would need to look this up, but there might be a way with the websockets, where, as information comes back, you inject that additional information into the conversation through a websocket notification. So you could put these tasks into a queue, work through them in the background, and as the responses come back, inject them back into the context through websocket events - but I would need to double-check that that's feasible.

[Audience] Okay, that's all I have. Thanks, man. Cool, yeah, thanks.
Sweet. While we're going through the questions, is everyone setting up their agents? Okay, good - I see someone nodding there. Fantastic. Okay, cheers.
[Audience] Hello. Could you explain or demonstrate what happens if you have multiple languages within an input and output, and how that works if you have different accents for each language?

So you mean if the person is mixing different languages?

[Audience] Yeah. For instance, if they say "explain the difference between Schadenfreude and..." and then finish in Chinese.

That's a good question - I haven't tested that specifically. Maybe we can try it right now. So what was it? Explain the difference between Schadenfreude and... okay, great.

"Hi there. This is the API Days Singapore conversational AI agent. How can I help you today?" "Can you please explain the difference between Schadenfreude and xingzailehuo?" "Schadenfreude is a German term that means the pleasure or satisfaction one feels from witnessing another person's misfortune or failure. It is a well-known concept and is often used in English as well. However, 'xingzailehuo' does not appear to be a recognized term in German or in psychology, and there is no widely accepted meaning for it. It's possible there may be a spelling error or confusion with another term. Could you clarify what you mean, or provide more context? That way I can give you a more accurate comparison." "It might be my pronunciation, but it's meant to be a Chinese term." "Thank you for clarifying that you meant a Chinese term. Unfortunately, that does not sound like a recognizable phrase in Mandarin or other major Chinese dialects, at least based on that pronunciation. If you can provide the Chinese characters, a different spelling, or more context about what the term describes or when it is used, I may be able to help you identify it."

Yeah, it might be my pronunciation. But the problem is that if you mix languages - well, English and German at least are both Germanic languages, but Mandarin isn't - there can be challenges there, and the more languages you mix, the more challenging it gets.
[Audience] Do you have a recommendation, then, if you want to build something like a language-learning application?

I wonder if there are certain things you can do with the system prompt to improve how that's picked up. But because we're going through text here, the language-learning use case is a bit more challenging, especially if you're mixing, say, Germanic languages with Mandarin. I don't have an immediate answer for you there. Actually, you might want to try a sound-token-to-sound-token approach like OpenAI Realtime - I wonder if it does better in that case, because it doesn't go through text.

[Audience] Yeah, it does. Although sometimes it switches the accents too, which is kind of annoying, because it will try to pronounce Chinese in an English accent, for instance.

Ah, interesting. So there are challenges with that too. That's a good one - I'll take it back and see. I know we have a customer in India, SuperNova, that does this, but it's specifically English learning for the Indian market, so I think it's a bit of a different use case.

[Audience] Do you know if it produces English in a particular accent, or does it use Indian phonetic sounds?

I think there's a case study - search for "ElevenLabs SuperNova" - and there's a video, so maybe take a look at that, and if you connect with us we can also follow up with some guidance on that use case specifically.

[Audience] Perfect, thank you. Cool, thanks.
[Audience] All right, hey - I was curious whether you're worried about scammers or fraudsters using these tools.

Yeah, there's definitely a worry with that, as with all of this technology. One thing that's very important to us - if you go to elevenlabs.io/safety you can see the safety tools we're developing in parallel to our features - is that there's a bunch of things we do. We do live moderation for certain things: when you publish your voice to the voice library, you can specify terms you don't want your voice to say, and in that case we have live moderation that makes sure your voice isn't used to generate those specific terms or sentences. We also monitor, in general, what's being generated on the platform. On top of the moderation, any speech generated on our platform is watermarked, to the extent that we can trace back which account generated a specific piece of speech - so if we identify fraudulent activity, we can trace back the account that generated it, ban it, and provide information to the authorities as needed. The other thing is the voice captcha we developed for creating a professional voice clone: we generate a random sentence that you need to read out, to verify that you have permission to clone the voice. So with all the technology we develop, we put quite a lot of focus and effort into safety tooling. There's obviously always a concern that the technology gets used for fraudulent activity, but so far we've been trying to mitigate that with the safety tooling.

[Audience] Definitely - looks like a lot of good guardrails in place. Thanks. Thanks.
[Audience] Just building on her question: in our setting, if a patient is asked, "Hey, how are you feeling about this?", they might try to speak in English, hold part of the conversation in English, then jump to Spanish or Portuguese and come back to English. When they have to describe something they can't say in English, they jump back to the language they're most comfortable in, and sometimes they jump around between several languages. The way you explained it, you have some kind of router that checks what language it is and then routes it, but within a single conversation they jump between languages - they'll explain a few things, then drop in a few Portuguese words, so it's English, Spanish, and Portuguese all mixed together. If you ask, "Okay, where is it hurting?", they suddenly regress to whatever language is most comfortable to explain it.

Gotcha. You can see here in the transcript that it correctly identified "Schadenfreude," because technically it's also used as an English loanword, but on the Chinese term it completely missed - you can partly blame my pronunciation, probably blame it a lot. I don't have exact benchmarks on how much the transcript degrades as you introduce more languages into the same conversation. Generally, with two languages intermixed it tends to perform okay, but with three different languages it progressively tends to get worse - I just don't have exact numbers. It would be worthwhile, if you have recordings of some of that, to put them through our transcription model and see how it performs at identifying it; that would be interesting.

[Audience] Okay, cool, that's good to know - that's about what I'd concluded as well. Thank you.
Cool, thank you.

[Audience] I saved this question for the end because it's somewhat unrelated. I worked on a project where we used ElevenLabs for the voice track of our avatar. ElevenLabs functioned well, but we had a lot of downstream issues in terms of lip sync, timing, and other things. Is there any plan for ElevenLabs to come further down the stack, toward avatars? Is that something you're thinking about?

Interesting. Did you build the lip-syncing model and the avatar stack yourselves?

[Audience] We were within the NVIDIA Tokkio stack - they have their Riva voice model and we swapped that out for ElevenLabs. So they have the full avatar stack, and ElevenLabs is just the voice portion of it.

Ah, okay. And sorry, which stack was that? NVIDIA Tokkio - T-O-K-K-I-O? You see, I was thinking of Tokyo in Japan. Interesting. I personally don't have experience with that one - I know we're mostly working with partners like Hedra and HeyGen for the avatar side of things. Paul, do you know? No. So this is something I'd need to come back to you on and look into. It's interesting - so out of the box it uses an NVIDIA model for speech generation?

[Audience] Yeah, but our client - which I assume is one of your partners, though we can't talk about that - couldn't use the NVIDIA model; they had a contract to use the ElevenLabs model. So it's kind of interesting.

Yeah. Sorry, I don't have a good answer for you there right now, but it's interesting - we can go back to the team and see if there are any resources we can give you for improving that.

[Audience] No worries, thanks.

But an interesting use case - thank you.
All right, last question.

[Audience] Hi, thanks for the presentation. I have a question about the transcription model and adding custom vocabulary. At the company I work for, we use a lot of three-letter acronyms. Say I want the model to read out "SAP" as "S-A-P" and not "sap" - is there a way to tell it to do that? And is there a way to tell it to read words a certain way, and to nudge the interpretation of what I say toward certain words in our vocabulary?

Interesting. So you have this use case on both sides: on the speech-to-text side you need it to correctly identify the acronyms, and then you also need the agent to reply back correctly. For the reply back, have you seen pronunciation dictionaries? We have a way for you to provide pronunciation dictionaries using phonetic alphabets - basically "tomayto" versus "tomahto" - to pin down specific pronunciations, so you can provide a pronunciation dictionary to make sure the text to speech pronounces the acronyms and words the way you want them to.
Now, the other side, the speech-to-text side - that's an interesting case. I don't think we have a way to fine-tune that specifically for different acronyms. That's a good question. Paul, anything there? No. So, to a certain extent, you can do it through the system prompt, where you put in a normalization layer: you have the LLM identify things in the transcript that are acronyms and massage them into what you want. I think that's what you were suggesting, right? It could be interesting to see if that works well. Have you tried it already?

[Audience] Where I was coming from: the company is using its own custom chat assistant called Joule - like the unit of work - but whenever I read transcripts, it often comes out as "jewel," as in the diamond. That's the struggle I'm facing.

Gotcha. Is this actually at SAP? Nice - I'm an SAP child myself, my father was there early on. Anyway, too much information. Cool, thanks for that. We can chat some more and see if that's something we can get going.
Sweet.
And with that, we're at time. Thanks again - thanks so much for joining. Please do connect, find the resources, and fill in the form for the credits; I'll leave this up in case you haven't had a chance to scan it. Thanks so much for joining, enjoy the conference, and we'll also have a booth at the expo, so if you come up with more questions, you can come find us there. Thank you. Thank you, Shane.