Building Conversational Agents — Thor Schaeff and Philipp Schmid, Google DeepMind

Channel: aiDotEngineer

Published at: 2026-04-30

YouTube video id: cVzf49yg0D8

Source: https://www.youtube.com/watch?v=cVzf49yg0D8

Hello everyone.
>> Hello.
>> Perfect.
>> It's just that Philipp and I are both
German and we thought it was funny.
Maybe we we can do it in German.
Actually, looks like there's a there's a
German crew there, which is nice. Uh,
no, no worries. We'll we'll we'll do it
in English. We'll do it in a couple
different languages. Maybe we'll find
out. Do we have other languages in the
room?
>> Other nationalities? Yeah. What do we
have?
>> Shout it out.
>> So, Spanish.
>> Any Icelandic?
>> No.
>> Danish. Okay. Close. Close.
Romanian.
>> Romanian. Nice.
>> Dutch. Pña. Netherlands.
>> Okay.
>> Which?
>> Hindi.
>> Yeah. Hindi. Okay. Canada. No. Okay.
Bangalore. No.
>> Uh,
>> Farsi. All right. Nice.
>> Czech.
>> Czech. Yeah. Brilliant. Okay. This is uh
fantastic. Well, thanks everyone for
making your way over here with with all
your languages. Really appreciate it. we
can we can really put the model to the
test today, which is great. Um, yeah, hi,
I'm Thor, or "Tor" for the German speakers
amongst you. Um,
>> Hi, I'm Philipp, um, just Philipp, so
>> it works for German and English speakers. Yeah,
it's nice. We so we work on the
developer experience at Google DeepMind
broadly covering kind of Gemini API and
also uh working on Google AI studio as a
tool for you know developers to try out
the models quickly uh and then also the
the API um interfaces to use the API.
You do need an API key. Uh actually who
has an API key already? Gemini API.
Okay, couple folks. Uh, who has used AI
Studio before? Google AI Studio. Okay,
couple folks. Great. Um, maybe we can
quickly Sorry,
>> anti-gravity.
>> Yeah, that's good. Uh, you actually
don't need an API key for anti-gravity,
but um, so if you don't have an API key
yet, uh, if you have your machine on
you, I hope you do. Uh, it's it's it's a
hands-on workshop. I do apologize. They
took away your tables. Um, so it's a
very laptop situation. Uh, like
literally, uh, laptop. Um, I hope you
didn't bring your Mac Mini or whatever.
Uh, but so yes, if you just go to
ai.dev.
Um, and I mean there's you can also go
to ai.studio. You can go to
aistudio.com, but we paid a lot of money
for AI.dev, so please use that. Uh it
just it just redirects to AI studio. But
um
>> let me
>> so AI.dev
API key.
>> Uh, yes, or on the left side
there's a "Get API key" and then "API keys"
is where you
>> German.
>> Uh yeah it's it's my personal account
and we cannot change language. So uh you
need only a Google account. So no credit
card, no nothing. All of the things we
are going to do is part of the free
tier. So if you go to AI studio or
AI.dev and there's some sign-in form,
you can just use your Gmail. No
worries we will not charge you. Uh and
then on API
keys uh you find at the top right corner
normally something called API key create
or create API key uh and when you do it
for the first time you uh might need so
similar for me uh I can import my
projects or I can create my projects. I
can just click create give it any name.
I mean we can call it uh I I
workshop
uh and then create project
I can translate it to English.
Okay, takes a few seconds and then you
should be able to create key and this
key will be used for the demos or the
hands-on things we are going to do use
later. Wait, it shows the API key.
>> Yeah, I will delete it. So don't don't
copy it. But you can like
>> copy very fast.
>> You can use that key. You can put
it into your .bashrc or .zshrc file, or
later you can directly inline it.
Whatever you prefer. If you have one,
you can use this one. And since we have
a few more minutes time, we really want
to make sure everyone who wants to
follow along, take your time, create
your API key. If there's an error
appearing or something else is not
working, feel free to raise your hand; Thor
will come to you and we'll help you. And
>> and yeah, just a reminder, it is a
secret API key.
>> So don't do what Philipp is doing there.
>> This is why I delete my key again.
>> Yes. So we do get that a lot, that people,
um, leak their API keys. Mostly it's
Claude Code that is pushing their
API keys to GitHub. So I recommend
you don't do that. No, just just
kidding. Just remember it is a secret
API key. Um, so treat it like a secret.
Uh, don't share it with your neighbor,
you know.
>> Um, but yeah, do create that API key
now. And we'll give you a couple minutes
to do that. And while you do that or,
you know, once you finished creating
your API key, I'd love for you, you
know, to just briefly introduce yourself
if you want to and just sort of, you
know, let us know what you would like to
get out of the workshop. Maybe there's a
specific use case you're working on. Um,
yeah, we'd love to kind of get to know
you all a little bit as well. So while
you create your API keys uh if you want
to feel free to just you know shout out
uh your name or nickname uh and sort of
what you're working on what you would
like to get out of the workshop. Yeah,
I'm Raoul from the Thursday podcast,
working for Koreas,
and a big fan of Gemma. Yeah, Gemma 4 and
Gemini. I'm one of the few who are using
it for coding and stuff as well, and,
um, I'm also running it on the glasses,
actually.
>> Oh, nice.
>> And I want to see more of this, you
know.
>> Okay.
>> And I want Google glasses.
>> Okay. Yeah. We'll we'll see if we can
get to the glasses by the end of the
workshop. Sorry. Yeah. If you have a
seat to your right, can you jump in so
we can fill up on the side here? Makes
it easier to be left.
>> And while we are waiting, uh, we
launched something cool on Chrome, which
I haven't activated it, but when you go
on your tap bar, click right, you can
now move tabs to the side. And we have
vertical tabs now. So,
>> yay.
So, more screen. That's good.
>> Nice. Okay, cool. So
glasses
are you are you are you running
what are you running on the glasses well
technically on the phone right software
where you use which calls open
clawization
for
>> two
>> and then that's just using Gemini Live,
kind of, through the websocket on the
phone, right? Yeah, nice. Yeah, that's cool.
Yeah, vision claw, if you haven't heard of
that, uh, pretty, pretty fun open-source
project. Um,
and I think you can run it, uh, sort of on
the... well, now that Meta has opened up the
SDK for the Meta Ray-Ban, uh, you can
actually hook in something like the Gemini
Live API, uh, into the glasses. Well, the
glasses connect to your phone and then
the phone actually connects to Gemini
Live, which is cool,
>> but uh it shows what will be possible.
>> Yeah. Nice. Any problems creating API
keys?
Everyone has an API key. Should we
check?
>> Okay. Uh, anyone else? Anything specific
that Yeah,
>> Michael work for Get Your Guide in
Zurich. Um, we're building like our
first kind of AI customer support agent.
So,
>> cool.
>> Just interested to think in the general
how you guys are doing it, comparing.
>> Okay, cool.
>> Nice.
>> Last chance. Anyone else?
>> Okay.
>> All right. So
we can start. Before we go into the
hands-on session, I have like 10, 15
minutes of slides, um, to give a bit of
background on what we are going to use to
build. The first session, which I'm going
to do, is more on, like, building an
agent without any live audio input;
that's where Thor will take over later, and then
we are going to build some very nice
conversational agents. So, um, who of you
has used the Gemini API to make an API
call to Gemini before? That's a few.
Has anyone of you used the Interactions
API?
One. Okay. Two. Okay. At least some
people. So the Interactions API is a new
API we launched in December, in beta,
which hopefully will succeed generateContent
soon. It's a unified API to use
with models, uh, and with agents, and is much
more aligned with, I would say, the
industry. So much closer to what you are
familiar with from open models with the
chat completions API, or from Anthropic, or
from OpenAI. Um, and what we are going
to do after the slides is we build a coding
agent, like a small little Claude Code, with
reading files, writing
files, running bash commands, and then we
have some time at the end if you have
questions around it, and then we do a
short break, toilet, drinks, and then Thor
will continue. We all have an API key.
So as I said, we want to build an API
which works for both models and agents.
And when we launched the Interactions API,
we also launched Deep Research. So maybe
you have used deep research in ChatGPT
or in the Gemini app, where you basically
start off a query and then you get back
a plan and then the model goes on and
like does deep research for 10 15
minutes visiting hundreds of sites and
um the API um supports both models and
agents and it's very simple to switch
between those. You basically either
define a model, which could be Gemini
3 Flash, or you define your
agent, which could be Deep Research, and
we are working on it to for you to bring
your own agent or to define your own
agent that you can customize all of
these behaviors and it's the same
surface. So we'll see later, but when
you send a request to Nano Banana to
generate an image, you can, like,
basically chain those interactions to a
Flash model to do something else, to Lyria
to have it generate audio, and, uh,
or even, hopefully soon, to Veo to
generate you, uh, video, and the interface
is very similar to what you see from
OpenAI. So we now have, like, those, uh,
content blocks basically. So every input
you provide and output you provide is
the same type. It has a type field which
could be a function call, a thought
signature, a text, audio, video, image
which hopefully makes it for you easier
to build with the Gemini API, and it is
less, I would say, Google-branded, less
proto-specific, less gRPC, to make it
easier for developers to build.
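(As a rough illustration of those typed blocks, a request body might look something like this; the exact field names are assumptions, since the Interactions API is still in beta.)

```python
# Sketch only: every block shares a "type" field plus its payload (assumed names).
input_blocks = [
    {"type": "text", "text": "Describe this image in one sentence."},
    {"type": "image", "data": image_bytes, "mime_type": "image/png"},  # image_bytes: your raw bytes
]
# Outputs come back in the same shape: text, function_call, thought_signature, audio, ...
```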
Um, the core primitives of the
interactions API in addition to making
it easier, we also introduced state on
the server which will which we will use
for building our agents. So we don't
need to manage our loop and always send
back the whole history. Very similar to
uh responses API, you now have a
previous interaction ID you can provide
which basically attaches to the existing
history. So you can like just send a new
input. Um as mentioned we have an agent
with deep research and background true.
So you can start your research can pull
it or um soon use web hooks to get
notified when your research is done. So
you don't need to keep the connection
open. We have the type blocks and then
also we have the same like streaming
pattern very typical for web development
using SSE which also makes it hopefully
easier for you to build. We support all
the built-in tools, but we also now have
support for remote MCP, and I think two
weeks ago we launched tool combination
so you can now combine Google search
with your own custom function which was
one of the big features people were
asking for years I think and now you can
do this. Um, and to summarize again the
differences: so generateContent
is what we have today, the Interactions API
is what we will have tomorrow. We will
have, um, server-side state management, but
you don't need to use it. So if you say,
hey, I want to manage my turns, I want to
manage my context, I don't trust you, or I
need to do, like, context engineering, I
need to remove certain parts, you can do
this. Um, it's way easier to
send new input, basically. We have the
built-in agents, we also have background
support. So you will get asynchronous
execution. We also see with agents, when
you send a prompt, that it might take 1, 2, 3,
4 minutes to complete, and keeping HTTP
requests or connections open for, I would
say, more than like 10 seconds is not a
very good practice. So you want to, like,
use asynchronous calls to either get
notified or like do polling uh when it
is done and it is less proto oriented
which I really prefer. um it's much
closer to what people know from like the
the developer ecosystem and also a side
effect of the state management is that
the implicit caching for the API is much
better. So for those of you who don't
know what implicit caching is: when
you send a request to the model, the
model needs to encode all of your input
tokens. And you can cache those encodings
for a follow-up request to save cost.
And cache requests I think are 90%
cheaper for the input tokens. And when
you have to manage your state or context
yourself, you maybe strip out line
breaks, remove um certain parts, this
breaks the cache. And using the
serverside state, the server keeps the
context. So the chances for your cache
hit rate is much higher. And we see like
two to three times better cache rates
from like the the startups using
interactions API today.
Quick code example on what we mean with
making it simpler. So on the left side
you have, like, the very proto-specific
oneof input parts with inline data or
text, where the field basically
describes the type and on the right side
almost looks similar to other APIs. I
would say so should be much easier if
you decide after the workshop to give it
a try. Um then I guess all of you know
roughly what an agent is. We have a
brain or our model which decides what it
wants to do. Calling a tool generating
text doing something else. We have tools
which basically gives our brain hands
and eyes to interact with the
environment where it is in. We have the
context. It's basically all of the all
the model knows uh what it has to do,
what it can do. if there are certain
preferences, if there are certain
constraints and then we have the loop
which basically combines our model with
hands and tools and runs it until the
model no longer calls the tools and
generates a text.
So
some quick examples on how to use the
API and then we go into like the the
nice hands-on part. So basic chat usage
with serverside state becomes very easy
because we have our interactions create
call. So we define our model, we define
our input. What's the capital of France,
we get back the output and then we can
just continue providing the previous ID
and the model behind the scenes or like
the server behind the scenes basically
has our user input, our model output and
then appends the new user input that you
don't need to have like a client side
history object where you append the user
turns, the model turns and then the user
turn again. This becomes very helpful
when you build agents where you have a
loop and always need to append new user
input and um as mentioned before that
also works for agents and models. So we
really want to build this unified
interface where you can continue your
conversations no matter what model you
use. And in this example, we basically
run a deep research request on research
AI agents in 2026. And then we take the
research output and just continue with
the nano banana model to generate a
visual for it. And it's like four lines
of code basically without for you to
what's the context, how do I provide the
input and hopefully makes it a lot
easier. But as said, you don't have to
do it. So, like, the input field also
accepts the same array with role user, role
model, and all of the inputs.
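(A minimal sketch of that flow in Python, assuming the beta google-genai SDK exposes the Interactions API roughly as described in the talk; the method and field names here are assumptions.)

```python
from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

# First turn: model, input, done.
first = client.interactions.create(
    model="gemini-3-flash",                      # assumed model id
    input="What is the capital of France?",
)
print(first.text)

# Follow-up turn: only the new input plus the previous interaction id.
# The server-side state already holds the earlier user and model turns.
second = client.interactions.create(
    model="gemini-3-flash",
    input="And what is its population?",
    previous_interaction_id=first.id,
)
print(second.text)
```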
Um, for tool use it's also hopefully now
much easier. We have a type "function," a
name for our function, which we want to
use, a description, and then we have the
parameters the model needs to generate.
And then here's roughly what we are
going to build. So we make our API call
with the tools and then we check what
the the output of the interaction is.
Requires action basically means you as a
client you need to do something. So the
output generated some function call or
some object which you need to react to.
We iterate over our um output types.
Check for the function call. Execute the
function. Append the result. send a new
interaction until the model no longer
decides it wants to call a function and
generates a text or something else.
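(Sketched out, the loop on that slide looks roughly like this; the status value, block types, and result shape are assumptions based on the description above.)

```python
# Hypothetical weather tool, declared the way the talk describes tool schemas.
tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

interaction = client.interactions.create(
    model="gemini-3-flash", input="What's the weather in Paris?", tools=tools
)

# "requires_action" means the client has to do something: run the tool, send the result back.
while interaction.status == "requires_action":
    results = []
    for block in interaction.outputs:
        if block.type == "function_call":
            value = get_weather(**block.arguments)       # your local implementation
            results.append({"type": "function_result", "name": block.name, "result": value})
    interaction = client.interactions.create(
        model="gemini-3-flash",
        input=results,
        tools=tools,
        previous_interaction_id=interaction.id,
    )

print(interaction.text)  # the model stopped calling tools and answered in text
```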
Okay.
Um
the last time I did a workshop, we were
all still coding manually. So that's my
first time doing a hands-on workshop
where at least I don't code much
manually anymore. I'm not sure how you
do it. So what we are going to do today
is we don't code manually. We are going
to use your preferred IDE agent CLI of
choice. I'm not sure how many of you are
following, but um to make it easier, we
created agent skills for our agents to
use. So if you go to the Gemini, you can
search for Gemini API Docs coding
agents.
>> You zoom in.
>> Ah yes, sorry. Of course. Um
I can also Can you see it?
>> Yeah, I think you can also just Google
>> I mean I can
skills
>> Gemini
agent skills.
Okay.
Um
Yes, this one. It should be "Set up
your coding agent with Gemini MCP and
skills." I think that should bring us
here
>> and very different results
>> really.
>> Yes. Or if not you can go to the
documentation and then on the left side
in the getting started section there's a
coding agent setup.
Are you are we successful?
Okay. Um
>> so one question should we install this
skill globally or just for the project?
>> You can you don't need to install it
globally. And that's a good question. So
how do you install? So we have multiple
skills. We have the Gemini API dev
skill. We not going to use this one.
That's for the generate content API. If
you want to use that API or if you are
familiar with it, go for it. But we have
the the Gemini live API. That's what
tour is going to use later. So we also
don't install this. But then we have the
Gemini interactions API. And here you
can either pick the the first command or
the second command depending on what you
want. And then like just copy it uh open
your workspace um where you're working
in. And then I already did it, but you
can like add the command npx install and
then you should get a wizard asking you
to install it. There are many
pre-selected.
Don't get confused by it. This just
means that all of those agents are
compatible with the agents folder. So
you might have agents/skills/gemini-
interactions-api, and in there we have
our skill.
Yep. I let me also
uh wait. So I tested with cursor and
anti-gravity and also Gemini CLI. So if
you are using one of those three, you're
good. If you're using cloud code, I
think it should work as well. I'm not
sure if they follow agents best
practice. If not, I mean I'm sure you
can.
>> They invented it, right? Like the
skills.
>> Yeah, the skills, but I think they only
look at claude. Uh I think
>> look at agent
>> but I think it installs into skills.
>> I don't know. Okay. So
>> yeah.
>> Okay. Yeah. Question.
>> Uh you show two different commands. One
skills and the other one context 7 or
>> Yeah, they both both work the same.
>> What but what's what is context 7?
>> Uh, so Context7 is, I think, a product
from Upstash. So Context7
has an MCP server and now a skills CLI
which you can use to get access to, like,
skills. It's like a public repository.
And skills.sh is the same thing from the
Vercel team.
>> Okay.
>> So it's like... and it works both, both work
on GitHub. So google-gemini/gemini-
skills is a GitHub repository and the
GitHub repository includes all of the
skills we documented. So you can also
go there and find it there. There are
also the install commands,
and it makes it much easier than cloning
it and making sure you have it in the
right directory. And then what we can do
is um to make sure our skill works we
can just like ask our agent uh what
skills can you
use and then it should if it if you have
installed it correctly you should see
yes we have our specialized interactions
API skill. So anti-gravity here picked
up our skill in the agents/skill folder.
Okay. How are we doing?
>> Can you zoom in just a bit?
>> Zoom in more.
>> Okay.
>> Better.
>> Yeah.
>> Um I was using telegram with my
result of this being run on the system.
So
>> would I have to have it run?
>> No, you can like you should be able to
do it on like your open claw kind of
stuff if you tell it install the
interactions API and then
>> just the final product of the workshop.
Will it run?
>> Yeah, it should run. It's a Python
script. We are going to execute it
later.
>> So because I'm only running should work.
>> Okay. Any any difficulties installing
the skill? Any questions what a skill
is? Everyone familiar with skills? Okay.
What we can do maybe uh to give you some
uh insights on what skill we created or
what it contains. Right? The importance
when creating skills is it should be
either something the model cannot do
reliably or if you have some personal
preferences on like how to do a certain
workflow or I don't know you always need
to run tests using bun or something like
this and what we did with our skill is
we made sure that the agent is aware of
which Gemini models are available. A
common issue we saw before that is like
Gemini always used Gemini 1.5 which is
no longer the latest model. Um we also
included the agents here. uh we have
some like very high level information on
how it works but we did not include like
all of the documentation. What we did
instead is you should see yes uh a link
to our documentation which is available
as markdown. So instead of the need to
always update our skill with like for
example we added a new feature to
interactions API to combine tools we
would have needed to update our skill
and then every one of you also need to
update your skill to be able to use it
and then we are not making a lot of
progress in terms of like knowledge
cutoff. Instead we provide the the
information as part of the skill. So all
of the agents now have like web fetch
tools. So they can query the information
based on the skill and then like we only
need to maintain like the documentation
which is mostly up to date. Yeah.
>> How do you find that work like
efficiency?
>> Sorry.
>> How do you find that works with like
tooling efficiency and all that sort of
thing? Do you find it like it having to
go and fetch the page to
>> I mean it it normally you would provide
it as a reference on your local file,
right? So it either needs to do a read
file call or a web fetch call, which is the
same, I would say, in terms of, like, cost,
and it works very well. Okay. So what we
are going to do as a first example: since
we are not vibe coding, uh, we want to
build something more substantial. We're not
just going to say "build an agent"; we want to be
more specific. So we want to build, um:
create
an agent class with a
constructor
and a run method.
The constructor
creates
uh Gen AI client. So the Gen AI client
is what we are going to use to call our
model. We also need uh defines model
and we also need uh global
previous
interaction
ID and then add the main method to run
an example.
uh, do all of that in,
uh, workshop.py.
Okay.
So, and as we can see the model in this
case or the agent as a first step read
our skill,
analyzed our main file is implementing
the skill. Okay, it checks. It should
probably fail. Yes, because I'm using
UV, so I stop and I tell it use UV uh
from the
workspace
and
and then
we let it generate. We Okay, it checks
if we have installed the library.
We have. So maybe in your case, the
agent still tries to install google-
genai. If not, you can do it yourself
with, like, pip install or uv pip install,
um, google-genai, and then
make sure to okay so
we have our starting agent class. We
have our genai client. We have our model
ID. It defaults to Gemini 3 Flash. It
would never have defaulted to Gemini
3 Flash if it hadn't read
the skill, right? Because then we would be
stuck with Gemini 1.5. We have our run
method which calls uh makes our
interactions.create uh call. We have the
input text. We have our um previous
interaction ID. We set our new previous
interaction ID and then we return the
text and then nice as a main example
um which we will run in a bit. We create
our agent. We have turn one, "So my name
is Phil," and then the agent, uh, runs it,
and then "What is my name?" so that we can
check if our
Gemini-created Gemini agent, uh, works
with, uh, multi-turn and our Interactions
API. You might see a warning similar to
the one I got here. Uh interaction usage
is experimental. That's as we are still
in beta. We really work hard to get the
API out of beta to make sure you can use
it in production. And our call was
successful. Hi Phil, nice to meet you.
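(Condensed, the generated class looks roughly like this; same caveat as before, the Interactions API field names are assumptions.)

```python
from google import genai

class Agent:
    def __init__(self, model_id: str = "gemini-3-flash"):
        self.client = genai.Client()              # GenAI client, API key from the environment
        self.model_id = model_id
        self.previous_interaction_id = None       # server-side conversation state

    def run(self, user_input: str) -> str:
        interaction = self.client.interactions.create(
            model=self.model_id,
            input=user_input,
            previous_interaction_id=self.previous_interaction_id,
        )
        self.previous_interaction_id = interaction.id
        return interaction.text

if __name__ == "__main__":
    agent = Agent()
    print(agent.run("My name is Phil."))
    print(agent.run("What is my name?"))          # should answer from the stored history
```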
Okay, maybe we weren't successful. I
don't actually have a name. So what is
my name?
Maybe we should check. Did we do it
correctly?
I mean,
"My name is..." Maybe... Okay, that's still
the call. Sorry, I asked it "What's your
name?" I mean, it's a language model
trained by Google. It makes sense. And
"What is my name?" "Your name is Philipp.
How can I help you? So yeah
>> Gemini 3 Flash as a coding model?
>> Yes.
>> I mean it it it works really well if you
are providing good instructions with
good skills and good context and don't
expect it to I don't know like cure
cancer. Like if you have a very good
understanding of what you are trying to
build, Gemini 3 Flash is, like, very
fast. I mean, it didn't take that long.
It doesn't consume many credits, right? Every
one of us is somewhat token-constrained
at the moment, and it works really
well. Okay.
>> You use it for agentic use, yes, but for
planning and coding?
>> Yeah. No, it works. I mean, we are going
to use 3 as well. I mean, like, it also
asks if it wants to run. So we are
closing the loop even, like, with Gemini
3 Flash. It tries to run our script
to make sure it works. And then we will
continue in a bit. How are we doing?
Anyone making successful calls?
Yes.
>> Okay. Perfect.
>> Background.
>> Okay. Any any issues? Any errors? Any
questions?
>> Yeah.
>> I'm using TypeScript.
>> Okay.
>> It it works.
>> Yeah.
>> Nice. Great.
>> Works for me too.
>> Mhm. Awesome. Even coding on a phone.
So, okay. So, normally the next step for
our agent, right? We now have our like
very basic run. We can chat with the
model. Now we need to add tools to it
and we want to build some kind of a
coding agent. So first tools uh we are
going to add is a read and write file
tool and we just continue in our uh main
agent thread. It's like okay
uh next we need to add a
read file and write file tool. create
the
create a basic Python implementation
and also the
JSON
schema definition. So when you use
function calling or tool use right we
need to create a JSON schema which we
provide to the model so the model
understands what it needs to generate.
Once the model generates a call matching that schema, we also
need to have some kind of code
implementation which we can then run on
the client. So we ask it to create the
Python implementation and also the JSON
schema, uh, and a map with the tool
name as the key and the
function and schema as the value.
Okay, let's see what it will come up
with.
Okay,
it's cheating a little bit. So, I have,
like, a solution folder and it looked up
the implementation. Yeah, I
mean, that's why it's still important to,
like, check your... wait, I can... it
found the solution, and then it was
like, "Oh, I got this solid Python example
in the solution agent file to guide me. The
task is: implement read file and write
file."
>> Yes. So what we got back is we have two
new like very basic very very basic file
implementation. So we have a read file
tool with a file path which uses um
Python syntax to open it to read it and
we have a write file tool with file path
and the content and writes it. And then
we have our read file schema. Um, reads
a file and returns the content. Write
file, writes the file and returns the
content. And it made some updates to our
agent. So what did it change? Okay, our
input is now a text, uh, string, or a
list. Makes sense, since we now need to
send back a function result. We check what do
we do? Okay, we create a tool definition
for our model, which is the
tool schema. So that's our tools-map
schema.
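(The file tools and their JSON schemas, roughly as generated; the schema wrapper shape is an assumption.)

```python
# Very basic file tools; no path sandboxing, fine for a workshop, not for production.
def read_file(file_path: str) -> str:
    with open(file_path, "r") as f:
        return f.read()

def write_file(file_path: str, content: str) -> str:
    with open(file_path, "w") as f:
        f.write(content)
    return content

READ_FILE_SCHEMA = {
    "type": "function",
    "name": "read_file",
    "description": "Reads a file and returns its content.",
    "parameters": {
        "type": "object",
        "properties": {"file_path": {"type": "string"}},
        "required": ["file_path"],
    },
}

WRITE_FILE_SCHEMA = {
    "type": "function",
    "name": "write_file",
    "description": "Writes content to a file and returns the written content.",
    "parameters": {
        "type": "object",
        "properties": {
            "file_path": {"type": "string"},
            "content": {"type": "string"},
        },
        "required": ["file_path", "content"],
    },
}

# Tools map: name -> (implementation, schema), used by the agent loop.
TOOLS = {
    "read_file": (read_file, READ_FILE_SCHEMA),
    "write_file": (write_file, WRITE_FILE_SCHEMA),
}
```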
And then we have our loop which you
might be familiar with from the slide I
showed. So after we run our request, we
check the interactions output. So the
interactions outputs include all of the
events generated by the model. And since
Gemini is a reasoning model, it also
includes for example the thoughts and
the thought signatures, which we would need to
return; since we are using the previous
interaction ID and the server-side state,
that's done for us, and we only need to
check: okay, do we have a function call? We
have a nice debug print so we can check it, and
for our outputs, for a function call, we
check our tools: do we have our tool or
not? I mean, maybe the model wants to edit
a file, but we don't have an edit-file
tool; we would catch it here. It calls our
function and then it creates the tool
results. So for a function call we have a
function result; that's also part of the
change. We really want to make it easy.
Um we will see later when we use Google
search there will be a Google search
call and a Google search result to have
very much the same schema, and then we use
recursion. So if we have tool results, we
basically call ourselves, the self.run
method, again. If we don't have tool
results, we return the interaction
text. And it also updated our example:
Write "hello from the agent" to a file
named hello.txt and read it back. So
we... I mean, the changes roughly look good
to me. We can try and run it. So, uv run
python workshop.py.
What do we get? We get our tool call
with write file. We get our tool result.
We get a read file call again, and then
the final agent response. The file hello
agent text was successfully created with
the content.
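(Put together with the tools map from the earlier sketch, the recursive run method looks roughly like this; the output block fields are assumptions.)

```python
def run(self, user_input) -> str:
    interaction = self.client.interactions.create(
        model=self.model_id,
        input=user_input,                              # a string, or a list of function_result blocks
        tools=[schema for _, schema in TOOLS.values()],
        previous_interaction_id=self.previous_interaction_id,
    )
    self.previous_interaction_id = interaction.id

    tool_results = []
    for block in interaction.outputs:                  # thoughts, text, function calls, ...
        if block.type == "function_call":
            print(f"[debug] {block.name}({block.arguments})")
            if block.name in TOOLS:
                result = TOOLS[block.name][0](**block.arguments)
            else:
                result = f"Unknown tool: {block.name}"  # e.g. the model asked for edit_file
            tool_results.append(
                {"type": "function_result", "name": block.name, "result": result}
            )

    if tool_results:
        return self.run(tool_results)                  # recurse until no more tool calls
    return interaction.text
```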
Agents are not only, like, singletons,
right? Maybe we want it to respond. So
next step is basically very ambitious
telling it I want to have a
continuous
I don't know, stdin
implementation
to test.
Let's see
it.
So let's see where it will continue.
No, I hope at least it should update our
main function where we have an input a
while loop basically waiting for the
input using our agent. So, yep. So, we
have our user input uh and then we have
we always continue with our agent.r run
method which then inside the agent runs
it in a loops until there are no tool
calls anymore. And if there's a result,
we basically get back our response.
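(The stdin loop itself is plain Python, something like:)

```python
if __name__ == "__main__":
    agent = Agent()
    while True:
        user_input = input("> ")
        if user_input.strip().lower() in {"exit", "quit"}:
            break
        print(agent.run(user_input))   # the agent loops internally until no tool calls remain
```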
Okay, let's quickly wait. Is it too fast
or are we still on track? Roughly
>> it's fast.
>> Okay. I mean we will share the code
later. Um also very happy to answer
questions. Uh, we have many, like, blog
posts and examples of that online. So you
can, if you are interested, like,
rebuild it later, less fast. Um,
but I try to be a little bit slower,
but I also want to give Thor enough time
for his part. So what we can do now,
since we, or like our agent, implemented
our, like, while loop to, um, provide
input, we can say something like "hello."
Right, normally we should now not do any
tool calls, because "hello" is, like... yeah,
our agent correctly kind of understood,
or Gemini in this case understood: hey,
"hello" is nothing I need to solve with a
read- or write-file call. So it
says "Hello, how can I help you?" Can you
write... or maybe, can you create, uh, an SVG
with a thumbs up?
might take a while a bit. So, it does
the thinking and it does the function
call and
okay, um, "Certainly, here's an SVG file blah
blah blah." Um, okay, can you write it to
disk?
Uh,
"Yes, I can, if you..." Wait, maybe
our model doesn't have our tools. Let's
check.
There's the tools map. It does.
Why is it not using our tools? What
tools can you use?
>> Okay. It has a read file.
Yes. Maybe we are we're not explicit
enough.
And what we can do to improve this in
one second is we can add a system
instruction to tell the model hey you
are a coding agent. You can use tools to
write and interact with the...
Okay, there we got our tool call: write
file with our SVG. Cool. Since we saw
our mistake, we can now tell the model,
hey, add system instructions
for the interactions API call and add an
example
prompt for a coding
agent.
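(What that change amounts to, roughly; whether the parameter is literally called system_instruction on the create call is an assumption.)

```python
SYSTEM_INSTRUCTION = (
    "You are an expert software engineer and helpful coding assistant. "
    "You have access to the local file system through your tools; use them "
    "whenever the user asks for files or artifacts on disk."
)

interaction = self.client.interactions.create(
    model=self.model_id,
    input=user_input,
    system_instruction=SYSTEM_INSTRUCTION,             # assumed parameter name
    tools=[schema for _, schema in TOOLS.values()],
    previous_interaction_id=self.previous_interaction_id,
)
```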
Okay. And what's really nice now since
we uh loaded the interactions API skill
in the beginning, the model still has
the awareness of: okay, how do I add
system instructions to the Interactions
API? And what I can guarantee you is that
Gemini 3 Flash has never seen any
code of the Interactions API, because the
model was trained before we even
released the API. So all of the work we
were doing so far is based on like the
skills and like the the coding
infrastructure and hasn't been part of
the the training. Okay. So what did we
get?
Um okay we can provide it on the run
command or we have one when we create
it. That's good. And okay coding persona
you are an expert software engineer and
helpful coding assistant. You have
access to the local file system. Okay
let's accept it. Let's start our agent
again. Let's say hello.
Okay. Hello. How can I help you with
your software engineering and coding
tasks? I mean, definitely better than
what we had so far. And what did we send
before? We said, can you create an SVG
with a thumbs up? So: can you create an
SVG with a thumbs up? And now let's see
if it calls. Yes. And now at this time
we got our write file tool call. And
then also we got a hey, I have created a
thumbs up SVG file with a simple line
art thumbs icon. So
um can I
>> I think it should. Yes, there we go.
>> We got a thumbs up icon. Cool.
Um, and of course, what's missing for a
coding agent right? We need to get some
bash tools. That's now not part of the
solutions folder. So, let's see how we
will get our bash tool. Now add a
similar run command
tool that allows the model to execute
bash commands.
Okay.
It creates an implementation plan.
You're asking
>> Phil I think there's a question can you
>> I say like I was just testing it does a
little bit like try to do like
>> even with
>> Okay, so we have our run command, which
uses a subprocess, which in this case I
guess is okay; we don't care too much
about security for this example, and then
our output is stdout. Um,
works.
Edit. We have our run command tool.
Yeah, it even updated our system prompt.
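(The bash tool, roughly as generated; subprocess with shell=True and no sandboxing is acceptable for the workshop, not for anything exposed to real users.)

```python
import subprocess

def run_command(command: str) -> str:
    completed = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )
    return completed.stdout if completed.returncode == 0 else completed.stderr

RUN_COMMAND_SCHEMA = {
    "type": "function",
    "name": "run_command",
    "description": "Executes a bash command and returns its output.",
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

TOOLS["run_command"] = (run_command, RUN_COMMAND_SCHEMA)   # register it in the tools map
```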
And now let's stop that. Let's clear
that. Let's run it again. Um, any
suggestion on what we should test?
>> Use Nano Banana to create an image. I'm not
sure that will work because we don't
have any skills or any information for
it.
>> Time.
>> Oh,
>> get the time.
Tool call. Run command date Wednesday,
April 8th.
>> It looks good.
>> Um, yeah. Cool. That's our small little
coding agent.
>> Delete all files. No, I mean let's not
do this.
>> Any more questions? Any more ideas? We
have like roughly five to seven minutes.
Yes.
>> General question. So because the state
is kept on the server... Yeah. Is it
possible to, like, retrieve the history?
>> Yes.
>> Yes. So that
>> um
what you always can do so what we are
doing here is right we always use the
previous interaction ID from the
previous turn. So we basically
stack it, and you can always go back to
any index in the stack and branch from
there. So if you would keep the
interaction ids on your client side, you
can always use those to I don't know
branch out and like I don't know have
like a first um prompt on like do basic
web search and then like use this as
like a base for like five parallel
requests doing some other work and you
can always get the context. So we have
an interactions.get method
which you can use to retrieve the
interaction and then also get the
previous interaction ID. So you can
basically go back until the beginning
and get all of your state if you want to
save it for later. And the default for
those interactions being stored on the
server for free tier is one day for paid
usage is 55 days at the moment.
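(Sketched, with assumed method and field names; latest_id, base_search_id, and save_turn are placeholders for whatever your application keeps around.)

```python
# Walk the chain back from the newest interaction to reconstruct the history client-side.
interaction = client.interactions.get(latest_id)
while interaction is not None:
    save_turn(interaction)                             # hypothetical persistence helper
    prev_id = interaction.previous_interaction_id
    interaction = client.interactions.get(prev_id) if prev_id else None

# Branching: reuse one earlier interaction id as the base for several parallel follow-ups.
for prompt in ["Summarize the findings", "List open questions", "Draft an email"]:
    client.interactions.create(
        model="gemini-3-flash",
        input=prompt,
        previous_interaction_id=base_search_id,
    )
```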
>> Yeah, you had the question.
>> Sorry,
>> you answer.
>> Okay, perfect. more questions. Yeah.
>> Does it mean then it can have infinitely
long context window?
>> Uh, no. So, Gemini models have a
million-token context window. What would happen
now if you reach that: you will get an
error. But we are working on context
compaction techniques; it's easier
said than done, and, um, it's still something
you currently need to maintain on
your client side. So when you say that
it will retain the server will retain
for one day. So that means every day it
resets to zero and then
>> no no so when you send a request you get
an ID the interaction ID uh and the
interaction ID stores your input and the
output of the model in the free tier the
input and output and the ID is stored
for one day. So meaning if you send a
request now and you continue 8 hour
later from that point the state or like
the context is still available. If you
would send a request tomorrow, it would
basically say cannot find um requests
with the old interaction ID because it
basically is pruned after a day. But if
you use the paid API key, it's stored
for 55 days. And, um, the Interactions API
is also coming to Vertex AI. And I think
there might be a little bit more
flexibility in terms of, um, how long you
want to store it, or customizing it. Yes. Is
there a specific API to rewind context,
if you want to do context compaction on the client
side?
>> We don't have one yet, but hopefully
soon.
>> Yes,
>> I'll give an idea: you could have the API
joke being spoken out with a speech
>> a speech
tool, to actually speak the joke about
APIs mentioned in it.
>> And we can also use Gemini, which has a
TTS model, which can speak too. But
speaking and listening, I mean, Thor will
show many, many cool things. Any
questions regarding the interactions API
and the small little agent?
No. Okay. Then you get 8 minutes.
>> I do five minutes.
>> Okay. Five minutes.
>> Break and then we'll be back here
>> with
>> to make your agent talk.
>> Yes. Cool.
>> Cool. Thanks.
>> That worked really well.
>> Nice. Yeah.
I used the old one and am super happy that
now there's a new one; the version was 2.5
before. Yes. Yeah, it's been a
while.
>> How much better?
>> Yeah, much better.
>> Much
we didn't even pay him for it.
>> Big upgrades.
>> Big upgrades. Yes. Um
>> yeah, maybe uh I know there's a couple
more minutes, but
>> can I ask a question while
Caching.
>> Caching.
>> Yeah.
>> Yes. Input tokens.
>> Yeah.
>> Is that those
conting?
>> That's a good question. We probably need
to find Philipp to answer that. I
actually don't know.
>> Philipp, caching question. Uh, the input
tokens...
>> Yeah.
>> What was it?
>> It is not on an interaction level. It's
more on an like object level. So,
>> for example, when you provide an input
like no interaction ID, first input PDF
4,000 tokens and the text input with 10
tokens and you do an follow-up
interaction call, maybe only the PDF
will be cached and not the other like
the short text and then if you do
another one, maybe the PDF and like the
follow-up turns will be cached. So, it's
like more on like an object level. But
since you... I mean, how is it? It's very
easy to make a mistake in caching: if you,
like, make even the slightest change in your
prompt, removing whitespace or line breaks
will break it. So, like, having this rely
on the server to keep it, it's more
guaranteed that it's safe. And it could
be as easy as: hey, my user says there's
an empty line break at the end, okay, I
remove it, and then I use that history
again, and then it falls apart.
>> If you do it right, ideally, every
conversation, the previous part of the
conversation is cached.
>> Yes. Uh, optimally. But I mean, it
depends on where your request kind of
hits it and like how fast you follow up,
but the cache rate should be pretty
high.
>> Cool. Maybe uh to kick things off, we
can sort of look at one of the examples
that we can build, uh, with the
Live API, the new model. So this is now,
uh, Gemini 3.1 Flash Live, which came out
two weeks ago, I think. Um, very exciting.
It's been a while. I think the previous
2.5 native audio model was December. So
it's been a while. The reason being we
did kind of major underlying
architecture work, you know,
to ideally
lower latency, you know, better
scalability in the future as this
ramps up. So, uh, that's very exciting.
And then, you know, in parallel, we're
working on bringing, uh, the Live API to
the Interactions API as well. Um, so
that's something that Philipp and I are
working on at the moment. Not there yet.
So, we still have to use the, uh, Live
API for now, but let's do a little demo.
And so this demo uses the Live API in
combination with Lyria 3. Uh, so on the
Gemini API, have people played with
Lyria? So the new Lyria model is actually a
music generation model. And so Lyria 3 can
now generate songs, uh, you know, including
songs with lyrics. Uh, so this is the
clip model, which is kind of a 30-second,
uh, clip generation. Uh, but then also we
have a full-song model now, where you can
generate kind of full songs. So the
idea is that we have our conversational
agent which is kind of the DJ that
you're interacting with and then that DJ
has a tool call to generate music,
right? Um and so you know I so I'm
originally German but you know growing
up I learned kind of English through
Harry Potter and BBC Radio 1. Uh,
and if you remember back in the day, you
could like phone in and like wish for a
song. And so that's sort of the the the
idea of the experience here,
right? Then you're on live jukebox. What
absolute banger are we cooking up today?
Tell me, what genre are we leaning into?
And what sort of vibe are you going for?
That's
>> Can we get some sort of high energy
German, um, techno-Schlager about the AI
scene in the UK?
>> German techno-Schlager about UK AI. That's
absolutely brilliant, mate. Proper Radio
One style. Any specific lyrics or themes
you want crammed in there?
>> Just surprise me. Let's go.
>> Right, you beauty. Let's get those
speakers rattling. Prepare for a burst
of machine mayhem. Here comes German
techno-Schlager about the UK AI scene.
Let's go.
>> In London town, the future's bright and
keen. The cleverest robots that you've
ever seen. The computers wear a proper
bowler hat. And the androids all enjoy a
lovely chat. They're drinking builder's
tea with digital grace. A happy British
smile on every metal face. OH, WUNDERBAR,
THE AI IS SUPER GORGEOUS. YEAH, SUPER
GORGEOUS.
YEAH. Um
the the TPUs enjoy when you clap for
them. So, um appreciate that. Uh any
anyone have any idea for another one? We
can we can do one more.
Just shout it out. Any languages,
any genres?
>> We have a nursing
>> A nursing song
>> in Swahili.
>> Swahili. Okay.
>> Right then. Welcome to the live jukebox.
You're on air with me.
>> Can we get a nursing song in Swahili,
please?
A nursing song in Swahili. Brilliant.
That's something we don't hear every
day. To make it an absolute banger, what
sort of vibe are we going for? Upbeat
and celebratory or something more
relaxing and gentle? And what specific
bits about nursing should we squeeze in?
Let me know and I'll get this track
buzzing.
>> Yeah, techno hardcore
about
>> Techno hardcore in Swahili about nursing.
Wicked. Prepare to blast those speakers.
I'm dialing that in now. Get ready for a
total banger, mate.
I never know. Well, we can feed it in
Gemini to uh you know get the audio out.
Okay,
That is the live jukebox DJ. Uh, DJ, you...
uh, if you want to, it's on... so this
was built fully with Google AI Studio, uh,
so it's kind of a little vibe-coded, um,
demo. I've published it in AI Studio as
well so you can try it out uh you will
need a paid API key for this one because
the music generation actually requires a
billing account um so if you have you
know an account with credits you can use
that uh for example
uh cool yeah Gemini
uh, 3.1 Flash Live. You know, benchmarks,
obviously, benchmarks don't really tell
you the truth uh as much. They are great
for benchmarking
things. Uh the real world, especially in
kind of live audio, does often look a
bit different. So, um you know, ideally,
we just try it out ourselves. So, Gemini
3.1 Flash Live, it's the model that is
now in Gemini Live on your phone. So if
you're using the Gemini app, uh, on your phone,
um, you're talking to that model,
as well as, I think, Search Live has it
now in there as well. So if you're
talking to Google Search, uh, I think
that's the same model. And then you can
build applications using this model on
the Live API. So the Live API is a
stateful kind of websocket API. um you
are able to send real time text, audio,
video feeds uh to the model. So audio
you're sending in kind of you know
buffer chunks so of the the real-time
audio you're streaming that in uh video
you can stream in uh at a maximum frame
rate of one frame per second. So this
can be you know a camera feed, this can
be uh a canvas. So uh it could be like
your your your screen share, right? So
you could share the screen with the
model. So for example, uh Shopify is
using this for uh Shopify site kick um
where it's actually kind of like a tech
support walking you through, you know,
if you're like, "Oh, how do I set up a
custom domain for like my Shopify
store?" It would basically like talk you
through how to do that and it can see
kind of where you are on the screen by
sort of ingesting the frames of the
screen. And then in return uh the
websocket gives you kind of real-time
events back. And so these are basically
streaming back audio buffers. Uh and
then also you can get the audio
transcription. So that's kind of the
text um of it. And then we have um tool
calling built in. So Google search
grounding is built in by default. So if
you need kind of real-time weather
information, you can access that as
well. Um yeah, some key features. So
what's really cool about this model
again it's it's kind of native audio
model. So what that means is we're not
going through text. So it's kind of not
a cascading pipeline where um you're
transcribing the text running the text
through an LLM and then generating
speech. Uh but rather the model itself
um is you know going sound token to
sound token and the intelligence is kind
of baked into this audio model. Um so
it's based on you know Gemini 3.1 so
decently intelligent. Uh you have
different kind of thinking levels that
you can enable. Um and so the great
thing with that is kind of the
multilingual support. So it's I think 97
languages that are kind of supported in
preview uh at the moment which uh and
the great thing is because it is kind of
a you know native audio model it can
actually it has sort of the audio
understanding of of Gemini built into
it. So um it can understand a mix of
different languages. It can, you know,
do sort of Denglish, for example, which is
like a mix of German (Deutsch) and
English, right? So it would be able to
sort of naturally switch between kind of
different languages as well which is
really great. Um, yeah, barge-in: you know,
obviously there's kind of automatic, um,
voice activity detection sort of built
into the model, so you can interrupt it.
you saw it earlier the DJ you kind of
trying to have a conversation but we're
trying to get it to to you know use the
tool tool use is the other big thing um
so major improvements in kind of tool
use and instruction following here with
that model um and so you can build some
some really cool things with that um
So obviously we currently only give you,
uh, a websocket API, so that is kind of a
downside if you used something like GPT
Realtime before, you know, where you get a
direct WebRTC, um, kind of infrastructure,
which can be helpful. So we have
partnered with, you know, a lot of, uh, sort
of integration partners like LiveKit,
Pipecat, uh, Software Mansion, uh, in
Poland, they built a great service
called Fishjam, uh, Vision Agents, Voximplant.
So these partners have
integrated kind of the Live API directly
and then give you sort of easy WebRTC
integrations if you, um, you know,
want that or need that kind of for your
system.
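(A minimal server-side session with the google-genai Python SDK might look like the sketch below; the model id is a placeholder and the config keys should be checked against the current Live API docs.)

```python
import asyncio
from google import genai

client = genai.Client()  # GEMINI_API_KEY from the environment

CONFIG = {
    "response_modalities": ["AUDIO"],
    "system_instruction": "You are an upbeat radio DJ. Speak in a friendly Irish accent.",
}

async def main():
    # Placeholder model id; use the current live/native-audio model name from the docs.
    async with client.aio.live.connect(model="gemini-live-placeholder", config=CONFIG) as session:
        # Text in for simplicity; real apps stream microphone and camera chunks instead.
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Hey, can you hear me?"}]}
        )
        async for message in session.receive():
            if message.data:                 # raw audio bytes streamed back
                play_audio(message.data)     # hypothetical playback helper

asyncio.run(main())
```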
Um yeah let's let's try it out. So you
can try it out yourself. Uh, and it's
going to be interesting if we all try it
out in this room at the same time. So,
uh, we'll see how that works. But, uh,
again, ai.studio or ai.dev and then
/live.
Uh, you can try out the model and so you
can, you know, ingest your webcam here
as well. Uh, so we we can give it our
webcam feed. Um, allow this time. Uh,
and then, you know, we can send text as
well. So, we could send like, uh, how is
my outfit?
Uh, so in this case, you know, I'm not
ingesting any.
>> You're wearing a green jacket over a
blue t-shirt paired with a black cap.
The combination looks casual and
comfortable. Is there a specific
occasion you have in mind?
>> Yeah. Okay. So, um, you know, that
obviously is a bit further away from our
upbeat sort of British, Australish,
Australian, uh, live DJ. So, what we can
do is we can, um, you kind of adjust our
our voice. So through the system
instructions. Um, now in terms of the
base voices, we don't have that many.
There are kind of 30 different base
voices.
>> Got a project in mind
>> and they are you know fairly generic
sort of but because you know Gemini has
very deep audio understanding. What we
can do is actually we can modify the
voice through prompts through system
instructions. So what we can do is
actually we can give it um kind of a
system instruction here for example and
we can just say okay speak in a friendly
Irish accent, right? Um and so now we
have that base voice, Puck, and, uh, "speak
in a friendly Irish accent."
Hey, can you hear me?
Well, hello there. Loud and clear so I
can. What can I do for you on this fine
day? What do you think of my outfit?
Well, now you're looking very smart, so
you are. That green jacket suits you
well, I must say. A grand's casual look.
Were you thinking of heading out
somewhere?
Uh, no. I'm just here at the AI
engineer. Okay. Anyway, so I I realized
I didn't put on So, what I wanted to
show you as well is um there's grounding
with Google search. So what we can do is
um you know obviously the the model
itself is trained to a certain knowledge
cutoff date and so um you know if we
need kind of real-time information like
this gorgeous weather that you know no
one would actually believe us the model
wouldn't believe us you know if it if it
didn't have um Google search grounding
but now you know we can we can have
Google search grounding we can have
different thinking levels as well so um
no thinking is actually very minimal
thinking. Um but it means you know the
model will respond faster. It's kind of
lower latency because the more thinking
we have obviously the more thinking
steps are kind of involved and um it'll
you know increase the latency. So you
can see in the benchmarks as well
there's kind of thinking high uh
thinking low uh sort of different
thinking levels. Um, but so now if we
turn on kind of the the Google search
here.
Oh, hello there. What range of
shenanigans can I help you with today?
Anything at all? Just give a shout.
>> Uh, yeah. What's the weather right now
in London, actually?
Well, now it looks like it's about 9°
and mostly cloudy in London right now.
Might be a bit chilly, so you'd be
wanting a jacket for sure. Anything else
you need to know about the goings on
over there?
>> What? What are you talking about? Can
you check again? What date is today?
>> Ah, apologies there. I seem to have
gotten ahead of myself. Today is
Wednesday, April 8th, 2026. Is there
anything specific you needed to know
about today?
>> Yeah. And now look up today's actual
weather.
>> Right. for Wednesday, April 8th, 2026 in
London. You're looking at temperatures
between 5 and 13 degrees with a decent
chance of rain, so don't forget that
umbrella. Anything else on your mind?
All right. Uh here we are. It looks like
I didn't pray to the the demo gods.
Okay, there seems to be something going
on. I wonder if we messed up the UI
there. Uh it should it should work a lot
better than that. Um I think the Google
search grounding for some reason isn't
isn't working. So but what we can do now
is we can obviously try it out ourselves
in an application. So the easiest way we
could do that is, you know, like the live
jukebox DJ, um, we could use
Google AI Studio to kind of vibe-code our
integration. Um, so we have this pill
here which is called uh add you know
voice conversation real time voice with
you know Gemini life API uh and then we
could say um build kind of a
multilingual interview assistant that
allows me to you know train for
interviews in different languages uh
like German and English and Spanish and
you know what have you
uh And so we can now kind of fire this
off. So this uses Gemini 3 flash
preview. Um it is limited to uh kind of
JavaScript full stack um environments at
the moment. So I think you can choose
between kind of next.js Angular. Um
there's like XR building blocks as well
if you're building for for kind of
glasses um sort of web VR experiences.
Um, so you know, feel free to kind of
fire one of these off uh right now or
you can also clone uh the live jukebox
DJ that I shared with you earlier and
you can try that out. Um, it'll it'll
take a little time uh and you'll hear
like a little chime once it is uh ready.
So in the meantime, what we can do is uh
so if you go to the Gemini Live API
docs, Gemini Live API,
uh there we are.
So, you know, we've done this. We tried
out kind of the Live API in Google AI
Studio.
Um, we also can use the coding agent
skills. So, uh, Phil showed you this
earlier. Um, we have dedicated coding
agent skills also for the Gemini
Live API. So, you can install that.
It'll help, you know, your coding agent
integrate the Live API, um, more easily,
more quickly. Uh, but then also, you know,
we have good old example apps on on
GitHub which can be very very helpful.
So what you can do is um you can clone
these example apps. Uh so in GitHub you
know you can uh like you do in GitHub
right uh you can clone this. So, we'll
open kind of new terminal.
And yes, feel free to follow along. Do
that a bit bigger.
Uh we'll make a new directory.
We'll call it AIG
uh Europe.
I like that they call it Europe, right?
But it's uh I mean, I guess, yeah, the
UK is still part of Europe, just not the
EU. Fair enough.
Um, and so we'll we'll go in there. Uh,
and then we'll just do a git clone
um of our app. Uh, and so now we have
our app in here. Um, so there's actually
a couple different examples that we can
use in here. Uh, so if you're using
anti-gravity, there's a handy agy
command to uh open your examples in
anti-gravity.
Uh and so we can you know look at our
different examples here and how we kind
of need to set that up. So we have you
know two different scenarios. So the
Gemini Live GenAI Python example uses
the Gemini Live API on the server. So it
creates a websocket connection from your
server to, uh, you know, the Gemini Live API,
and then on your front end you basically
set up a proxy
um to proxy the websocket connection to
your client side right because your your
browser window is kind of your client
and so that is what is capturing um your
your uh audio feed your video feed and
so in this example we're just using fast
API Okay. Uh, and we're basically, uh,
just setting up kind of a web socket
here, uh, that our client can connect
to, uh, and then we're basically just
receiving sort of our, um, you know,
audio cues, our video cues from the
client site. Uh, or, you know, our text
input queue as well. Uh, and so we're
receiving that. We then setting up um, a
live session. So um we're using our
Gemini client. So we kind of abstracted
sort of all the Live API stuff into this
Gemini live file here. Uh, and you can
see, like, starting the session, we're
basically setting up kind of our live
connect config. Maybe I close this for
now. Uh so you can see that um we're
setting up some system instructions. So
that was you know earlier kind of said a
helpful assistant. We also said kind of
speak in a friendly Irish accent for
example. Um, you know, this is where we
put our uh sort of system instructions,
our guardrails. We can, you know, make
that pretty long in terms of covering
sort of what we what we want. Uh, and
then we can define kind of our tools
here. Um, and then we're basically
setting up our um, you know, session and
uh, our session sort of is, you know,
the websocket session. Uh, and then
we're just receiving our audio and video
cues from the client side. uh and
proxying that through. So that is kind
of one approach. That's sort of the
server to server approach. Um and you
know we're just using kind of UV here.
Um so if we're setting this up for the
first time, uh we can go into So this is
our Gemini live um Genai Python uh
example here. So we can set up uh our
virtual environment. We can then um
activate our source here.
Uh we can install our dependencies.
Uh okay, I might have this is a a fun
part of uh uh a Google laptop
security.
Come on. All right. Just look away.
Don't look. Don't look at that. Um and
then yeah, install our requirements.
And so we'll need uh an API key. Uh so
the API key, you can see kind of here uh
how the config configuration is. So we
basically just need our Gemini API key.
Uh and we need to set up uh an
environment variable for that. So you
can see we have an example file here.
Not a lot in there because it's
basically just that. So we can copy our
.env.example into our .env,
uh, file here. And then, do you remember
how to get your API key? Where do you
get your API key?
>> Yes. AI.dev. There we are. Fantastic.
Love it. Um, okay. AI.dev.
Uh, so that is where you get your API
key. Uh, I think I have a couple API
keys. So, it takes a little while to um
load them here. Uh which one is maybe
we'll we'll use this one. So once you've
created your API key uh you can copy the
API key from here. Well actually maybe I
should create a new one because later
I'll need to delete it. Uh so we'll say
AIG Europe.
Uh we'll we have a couple projects here.
We'll just use
we'll just uh which one which one we
using? Too many projects. Okay. We'll
just use this one.
Uh, and so now I really don't like us
actually returning the API key in the clear
here. I think we learned something
today. Um, okay, we save that. Uh, and
so now we have sort of our API key set
up. And now what we can do is, uh, we
can run our demo. And that was just the
main.py.
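(For reference, the .env we just filled in is literally one line, and the google-genai client will also pick the key up from the environment; the variable name here is an assumption — the repo's .env.example is the source of truth:)

GEMINI_API_KEY=your-api-key-here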
And so now our demo will be up and
running.
on localhost 8000 here. And so we can
see um it's just kind of a basic sort of
demo. Uh and when we connect
>> Top of the morning to you, I'm Gemini
Live. A little demo of what this API can
do. Why not try out some fun features
like hearing me speak in different
accents?
>> Can you cancel me saying?
>> Ah, I can see you. All right. You're
sitting there with your handsome face
looking straight at me.
Oh, that was that was
this uh I was asking.
>> Sorry.
>> Oh, yeah. Yeah. Yeah. but it didn't
transcribe it.
>> Um,
all right. Uh, we're finding out a lot
of things here, uh, to improve, which is
nice. Um, but yeah, so what we can see
is, um, so this is, you could actually
notice that the latency is a bit worse
because we're actually having that jump
from our client to our server. Um I
would love to blame the Wi-Fi but uh so
with the server-to-server setup you
just have that additional latency of
sort of going through um your server. So
what we can do as well is we can go
directly from our client to the Live API.
Uh and so this is the other example that
we have um which is this one here using
ephemeral tokens. So
uh for ephemeral tokens we can just kind
of look into uh the setup here. Uh very
similar. We'll just go uh back uh
actually let me get a new terminal up
here. So we'll say uh ephemeral tokens.
So our ephemeral tokens are basically
short-lived tokens uh that we generate
with our API key on the server side and
then we send that ephemeral token to our
client. So you know our phone, our
browser uh to then initiate the
websocket connection directly from the
client to the Live API. Uh again similar
setup here. We want uh a virtual
environment. Uh we then uh activate our
virtual environment
and we install our dependencies.
Uh we also need our uh API key uh again.
So I think what we can do is just use
the same one. So we'll copy maybe just
copy this one here and then paste
uh paste that in. Is it big enough? I
don't know. Can people see that? Maybe
we'll zoom in a little bit more.
So now we have our key here um as well.
Uh and so
now that we have our dependencies
installed, uh we can run our server. Uh
and so our server here,
um
we can look at the server real quick.
So this server is basically just uh a
very you know slim sort of back end that
just has our Gemini API key
uh and then it generates an ephemeral
token for us. So sort of um ephemeral
token support is currently on the v1alpha API.
So you'll need to use actually a
different API uh at the moment for this.
And then you would pass in kind of this
expiration time. Uh because ideally, you
know, the token should be shortlived. Uh
so should the token ever leak, you know,
uh it shouldn't be too costly because
it'll expire um pretty soon.
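(Roughly, that token-minting endpoint boils down to something like the sketch below; the SDK call and field names follow my reading of the google-genai auth tokens API, so double-check them against the example repo:)

import datetime
import os

from google import genai
from google.genai import types

# Ephemeral tokens currently sit on the v1alpha API surface.
client = genai.Client(
    api_key=os.environ["GEMINI_API_KEY"],
    http_options=types.HttpOptions(api_version="v1alpha"),
)

def mint_ephemeral_token() -> str:
    now = datetime.datetime.now(tz=datetime.timezone.utc)
    token = client.auth_tokens.create(
        config=types.CreateAuthTokenConfig(
            # Keep it short-lived, so a leaked token isn't too costly.
            expire_time=now + datetime.timedelta(minutes=30),
            # Window in which the client must start its Live session.
            new_session_expire_time=now + datetime.timedelta(minutes=1),
            uses=1,  # one Live session per token
        )
    )
    return token.name  # this is what the endpoint returns to the browser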
So there we are. That's our token. And
then we return our token back to our
client. So on our front end, we then
have our Gemini Live kind of integration
here. And so this is an example of just
a pure websocket integration without
sort of any SDK. Um so you know if you
you can use kind any sort of websocket
framework um here and you can see sort
of you know how all the the different
sort of raw websocket events um are
handled.
Uh and so you can see here we have um
you know this is kind of our websocket
uh API. So here we need to use kind of
the v1alpha. Um and then this is the bidi,
so bidirectional. Um you know we're
streaming in both directions. Uh and
here we we pass our access token which
is our short-lived token.
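(If you want to see the shape of that raw, SDK-less integration in one place, here is a rough Python equivalent using the third-party websockets package; the endpoint path and the field names in the setup message are my best reading of the v1alpha bidirectional websocket described here, so verify them against the example's client code:)

import asyncio
import json

import websockets  # pip install websockets

# Assumed v1alpha websocket endpoint for the Live API.
LIVE_WS_URL = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent"
)

async def connect_with_token(ephemeral_token: str) -> None:
    # The short-lived token goes on the connection instead of an API key.
    async with websockets.connect(f"{LIVE_WS_URL}?access_token={ephemeral_token}") as ws:
        # The first message on the socket is the setup payload: model,
        # response modality, and any tools such as Google Search grounding.
        await ws.send(json.dumps({
            "setup": {
                "model": "models/gemini-live-placeholder-model",  # placeholder id
                "generationConfig": {"responseModalities": ["AUDIO"]},
                "tools": [{"googleSearch": {}}],
            }
        }))
        print(await ws.recv())  # expect a setupComplete message back
        # After that, audio/video/text go out as realtime input messages and
        # the model's audio comes back as serverContent messages.

asyncio.run(connect_with_token("token-from-your-backend"))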
Uh great. So what we can do now is now
our server is up and running. Uh so we
can see um this beautiful interface here
which was uh handcrafted. um you know,
no agent involved in the creation of
this one um back in the day. And so we
can just see kind of all the different
um knobs here that we have. So we'll try
this kind of Google grounding as well.
And hopefully it was uh just a thing we
need to fix in the UI. Um here's our
Flash Live uh preview model. So that's
3.1 Flash Live. Uh and then we can uh
connect here.
And so we see kind of the the server
events. Uh actually let me maybe uh do
that again. So we say enable Google
search grounding connect. So what we can
see is kind of our websocket
uh setup. So if we look into the
network,
we now see we have kind of this um
websocket connection here. So we see we
have our access token. So our token
request to the back end. Actually, let
me see that.
Okay, so you can see we're just getting
our token. So, um the response was just
our auth token and when it expires,
uh what we sent in terms of the payload,
where's our payload?
Okay, actually that was no payload. Uh
yeah, we're we're just getting a token
back. So once we have our token, we then
set up our um websocket and so we can
see our messages here. So basically the
first message in our websocket um we're
sending our setup uh configuration. So
you see we're you know sending our model
as like Gemini 3.1 flash preview. We
have some real time uh input there. We
have the tools. So hopefully now the
Google search uh should work. And so now
what we can do is we can um start our
audio.
Hey, can you hear me?
>> Yep, I can hear you loud and clear. How
can I help you today?
>> Um I'm in London right now and I'm
wondering, do I need an umbrella? What's
the weather today?
>> I'm not getting the current weather for
London right now. Sorry about that. Your
best bet is probably to quickly check a
weather app on your phone or search
online. That'll give you the latest on
rain.
>> Oh, you're killing me. Just use Google
search.
>> My apologies. I seem to be having some
trouble pulling that up right now. I
know that's frustrating. You can get the
forecast easily by just asking your
phone or searching online. I'll try to
get this sorted on my end.
>> Gemini, why are you doing this to me?
That's just sad.
I understand your frustration. Sometimes
technology gives us hiccups and I'm
sorry I couldn't get that weather for
you right now. I'm always learning. So
hopefully next time will be smoother.
Was there anything else I could help you
with?
>> All right, that was painful. Uh I
do apologize for that. Okay, so
something is definitely going on. That
is um not good. Not not nice. Um, let's
see if we have more uh luck with the
custom tools. So, we're also passing in
Do we pass in some custom tools or is
there actually some issue with our
uh
Hey, can you show me a hello world
alert?
There you go. A hello world alert has
been displayed. Anything else I can help
you with?
>> Yeah. Can you change the background uh
of the page to green, please?
>> You got it. The background color should
now be green. Let me know if you need
anything else.
>> Yeah. Can you tell me what's the weather
right now in London?
I understand you're asking about the
weather in London, but unfortunately the
>> The Google grounding, but for some reason
it's Yeah. So, with the Google
grounding, it should work. So, I I
wonder if I messed up something.
>> No worries at all. The weather
information isn't something I can grab
right now, but if you have any other
questions or need help with different
functions, feel free to ask. Let's try
this again.
Hey, can you hear me?
Ah, yeah. We need some audio. Hey, can
you change the the background back to
white, please?
>> I wish I could, but I don't actually
have control over your display settings.
You might need to look in your devices
settings menu to change that.
>> Anything else I can help with instead?
Oh, no. Sorry. Off the page. The the
background of the page. Back to white.
>> Oh, I see. Sadly, I still can't change
that for you. Those appearance settings
are part of the website or app you're
using.
>> Oh, yeah. You're right. Um, no, but I
thought Okay.
Well, disabled custom tools. That
shouldn't be the case anymore. All
right. Um lots of work to do
but uh so yeah this is sort of how you
can um get up and running. Uh did anyone
manage to get it running on their
machine?
Yeah.
Is it is Google search grounding
working?
>> No.
>> Okay. That's uh
that's a shame.
All right. Um,
what else did we have? So, we need to
fix uh lots of takes away from this. Um,
yeah, if you didn't get the link, so
this is the link to the live API
examples. So, um,
they're also linked from the docs. So, if you
go to kind of the Gemini Live, uh, API
docs, you can find them there.
Um
is that the end of my slides? I do think
we have yeah the agent skills as well.
So, I mean, if you just go to the Gemini
Live, uh, docs,
uh, Gemini Live API,
uh, and on the main docs, uh,
we've actually linked it from, uh, the
top there, uh, try the Live API in Google
AI Studio, uh, the GitHub examples, or use
the agent skills. Um,
so that is, yeah, that is how you can
get started.
Cool. That was, uh, not how I was hoping
this would go. Um, yeah, we'll we'll
figure out what what went wrong there,
but um, yeah, I appreciate you all
joining and yeah, we we'll take
questions.
>> Yeah.
>> Yeah. So, you can look at kind of the
session management. So um there's
actually sort of
so the live API has kind of this context
window compression. Um so without
compression kind of audio only sessions
are limited to 15 minutes. Um audio
video sessions are limited to 2 minutes.
Um so what that means is that sort of
you will the session will kind of
terminate. It will give you sort of this
GoAway message. Um, and what you can do is
sort of enabling context window
compression. So context window
compression is kind of like a you know
sliding window where you basically say
you know like I want to keep that much
context sort of in my window and sort of
as the conversation progresses it will
then actually like forget kind of the
previous context before that window. Um
so yeah so context window compression is
something that you can uh enable kind of
you know to have that sliding window to
uh make the sessions longer but then you
know there's only so much context
depending on um the frame rate that
you're feeding in in terms of images
depending um yeah mostly mostly it's
images audio is sort of uh audio only
sessions there's more kind of context
you can keep in the window but And as
you're adding kind of video frames to
it, uh it does
yeah compress it.
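(In the Python SDK that option sits on the same LiveConnectConfig; the token numbers below are just illustrative values, so tune them to your own use case:)

from google.genai import types

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # Sliding-window compression lets sessions run past the default
    # 15 minute (audio only) / 2 minute (audio + video) limits.
    context_window_compression=types.ContextWindowCompressionConfig(
        # Once the context passes the trigger, older turns are compressed
        # away down to roughly the sliding-window target.
        trigger_tokens=25600,
        sliding_window=types.SlidingWindow(target_tokens=12800),
    ),
)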
>> Uh yeah. Oh yeah. Yeah.
>> Um are there any real-life use cases that
this is being used for that's very
creative, from like a business context?
>> See, what you've done today is a lot of
fun, but like applying it to business,
are there any businesses that are
actually doing this really well right
now?
>> Yeah. Um I mean so Shopify has it in
production with Shopify Sidekick. Um
there is uh a bunch of so like uh
actually one that I really enjoy, if you
search Gemini Live API blog,
uh we had kind of a case study which
is um this startup. Yeah. So I mean
stitch is using it as well. You can like
vibe code vibe design with your voice
and stitch but then you know stitch is
built by Google. So um you know you
probably got to discount that. Um
Waymo is also integrating it. So um you
know first we got rid of the drivers
in the cars, and then you know if you do
want to talk to someone in the car, you
can then uh talk to Gemini uh in the
future. So they they are working on
that. Um I like this one. So this is uh
Hey Ado. Um it's a it's a great startup
from Argentina. Uh so they are building
these um voice sort of companions for
the elderly uh in combination with uh
kind of an app uh for sort of the
caretakers or you know the
children of the elderly where you know
they can get notifications,
you know, should the elderly mumble about
something. But so it's it's a
really nice interaction where like the
multilinguality
um really shines, you know, because like
in Argentina, uh you know, a lot of it
would be Spanish. Um but there's a nice
example you can kind of look at. Um
which is really sweet.
>> I came to visit my grandma Maria. I want
you to say hi to
>> you you can you can look at it in your
in your own time. Uh and then I think
there was another one. So, um, yes, it
is somewhat
more of a forward-looking use case, just
kind of looking at, you know, some of
the rough edges and limitations. And I
think for now, if you're,
you know, like in in a real business use
case, you might be, you know, better off
with kind of the cascading pipeline
because it gives you sort of the
observability at each step of the
pipeline.
um which kind of with the real-time
you know native audio um you don't
really have that fine grain control and
observability in terms of you know
plugging into say like rewriting the
response before it's being said. Um you
know obviously there are certain
benefits with that in terms of the
natural flow of the conversation but for
certain business use cases it might
might just not be there yet. So it is
somewhat you know a bet in the future of
like what uh kind of real-time
conversational interactions will look
like in the future. Um but yeah
depending on your business use case
today it might not be the best fit just
yet.
>> Are the transcripts available?
>> Uh no. So in the session you can't
uh retrieve them. So you would have to
store them on your on your end. Um
again so that is sort of where the um
integration partners come in. So if uh
yeah so LiveKit, Pipecat, they all have
like really good offerings to you know
store the entire audio as well as the
entire transcript and sort of give you
additional observability tooling on top
of that as well. Um, so it's not
something that is kind of all available
sort of, you know, from just the Google
side. Uh, so that's something where
currently we're relying on kind of the
partner integrations to to sort of give
you that additional functionality.
>> Yeah.
>> Oh, sorry. Uh, behind you. Yeah.
>> Do you want to go first?
>> No.
>> Okay. Um so you said that this is not
you know like super business ready for
more more complex use cases but what are
your thoughts for example in um
replacing interview interviews for
recruitment using this?
Yeah, I mean I'm not sure I would, you
know, replace your entire interview
pipeline with that, but I think it is a
great scenario, you know, for certain
steps in the interview process or, you
know, being able to screen more
candidates, for example. Um,
yeah.
>> Do you think is that ready?
>> Do I think that's ready? um
>> because there's a lot of like um context
that you also want to give it, right?
Especially even if it's like a first
screen or the quick screens that you're
mentioning, you still want to put some
criteria and guardrails that would guide the
conversation. And then the second part
to this is also
can it work for example with if my
company is in a a Google environment
Google meets right now we have auto
transcript. So can you put these tools
together?
>> Um yeah so it is uh a preview. So it is
something you can use in production. Um
depending on your use case I think you
need to evaluate like you know do you
need like SOC 2, sort of there's
potentially a bunch of things that you
need to build around it for it to
actually fit your business use case. Um,
so
yeah, it is ready for experimentation.
>> Does that help?
>> I mean, I think that's what we're doing
a lot in.
>> Yeah. Uh,
oh yes.
>> Um, an issue I always have is when I'm
demonstrating voice agents, a lot of
people are talking and it's it can't
differentiate the speakers.
between each other. So is there a
solution for this already?
>> Is it based on a voice sample or something, in
terms of like identifying the different
speakers?
>> Yeah. And so it only listens to you if
it's your agent for instance.
>> Oh, interesting.
Um
it only listens to you.
>> Imagine you have a coding agent and
somebody makes a prank and says delete
the file or stuff like that. You don't
want that.
>> Yeah. No, that's that's interesting. No,
I don't think there's any specific
ability to sort of like say just listen
to me. Um, so there is sort of
kind of this proactive audio where you
can tell it to kind of only respond in
certain,
you know, to certain context. So like
ignore things that aren't relevant to
the conversation. Um, so to some extent
that works. Um, but I don't think it's
super reliable at the moment where you'd
say like only listen to me, ignore
anyone else.
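(The knob being referred to there is the proactivity setting on the live config; roughly, per my reading of the SDK types, it looks like this:)

from google.genai import types

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # Proactive audio lets the model decide not to answer audio that isn't
    # addressed to it, rather than responding to everything it hears.
    proactivity=types.ProactivityConfig(proactive_audio=True),
)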
>> I've done that with Parakeet from Nvidia
where you can train on a little 10-second sample
and it can differentiate the speaker
that way. But it would be nice to have
something like you talk to it and then
it recognizes your voice and ignores the
other voices for the rest of the
>> That's cool. That's a great great idea.
Yeah. Thanks.
>> Uh yeah, I think in in the Yeah.
>> What does thinking look like? Uh is it
does it think in text or is also
thinking in speech?
>> Yeah. So you get the thinking uh only in
in text. So there's uh text events uh on
the websocket channel that um so you can
you can opt into getting the thinking um
as text. Yeah. it wouldn't speak out the
thinking. Yeah.
>> Uh thanks for the demo and uh for being
brave to go up against the demo gods. Um
I had a sort of question around the
multimodality side of things. Uh one of
the areas that I really want uh good
text-to, no, speech-to-text models is having
them be grounded in what I'm looking at.
like when I'm coding I want to just you
know word vomit into cursor sorry
anti-gravity uh and have it understand
the context and you know when I say
something that is a specific class name
it should just actually use that um how
does that work inside of this framework
like would would this be a way to
actually do do that grounding or would
you recommend some other API for that
>> Um, so
general Gemini models are really good at
audio understanding. So, um if your use
case doesn't require like fully real
time transcription, I would actually
recommend using like Gemini 3 Flash um
to basically transcribe but also get,
you know, like ingest context and sort
of basically get contextually aware
transcription.
>> Um
at the same time.
And provide
multimodal
awareness.
Yeah, there's uh so I mean depending on
your use case if you need it to like be
fully real-time conversational
um then yeah so you can kind of use text
to sort of ingest additional input or
you know imagery if it makes sense but
then again that you know reduces the
context um window size um or you know if
you don't need kind of the fully real
time um then like using Gemini
you know, just flash to, you know,
transcribe is actually pretty good.
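(A minimal sketch of that non-real-time, contextually aware transcription pattern with the regular generateContent API; the model id and the code identifiers in the prompt are made-up placeholders:)

from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

with open("dictation.wav", "rb") as f:
    audio_bytes = f.read()

# Send the audio together with the context the transcript should be grounded
# in, e.g. the symbol names from the file currently open in your editor.
response = client.models.generate_content(
    model="gemini-flash-placeholder",  # placeholder model id
    contents=[
        "Transcribe this dictation. When the speaker refers to code symbols, "
        "use these exact identifiers: UserRepository, fetch_active_users, "
        "session_factory.",
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/wav"),
    ],
)
print(response.text)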
>> Frames per second means, if I'm talking
and your visual doesn't
change, you can send one image
input, then speak for 5 seconds and not
send an additional image. Then you are
not using so much context. I think like
one image is around 1,200
tokens. So not too much. Um so if you
are like in your editor you want real
time you basically can use the API.
First input is send real time image and
then send real-time audio and then you
stop basically and the model has the
image input and the audio input and can
respond to it. So you there's no need to
stream the image consistently if you
don't if it doesn't change or if you
don't need like it to react to it. Yeah.
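(In SDK terms that "send the frame once, then just talk" pattern looks roughly like the sketch below; the parameter names follow my reading of send_realtime_input, so check them against the docs:)

from google.genai import types

async def ask_about_current_screen(session, frame_jpeg: bytes, audio_pcm: bytes) -> None:
    # Send a single frame of what is currently on screen...
    await session.send_realtime_input(
        video=types.Blob(data=frame_jpeg, mime_type="image/jpeg")
    )
    # ...then keep streaming only audio; there's no need to re-send the image
    # unless the screen actually changes.
    await session.send_realtime_input(
        audio=types.Blob(data=audio_pcm, mime_type="audio/pcm;rate=16000")
    )
    async for message in session.receive():
        if message.text:
            print(message.text)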
>> Cool. Yeah. Uh I think we have a bit
more time uh there in the back. Yeah.
How do we get you the microphone or do
you just want to shout it out? Oh yeah,
we'll do it.
Thank you for your interesting
presentation. Uh I have a question about
the personalization or adaptation. Can
it uh recognize the speaker's level or
the knowledge during the interaction and
then based on the speaker's knowledge
produce the results or not?
>> Sorry, can you repeat? Can it
>> can it recognize the speaker's knowledge
or the background during the interaction
to produce the response based on the
speaker's knowledge or something like
that to personalize itself? So you you
want to ingest kind of initial context.
Is that what you're saying?
>> Not context. For example, suppose I'm
talking to it about civil
engineering. Can it recognize I'm
a civil engineer and, based on my
knowledge, produce the result using
advanced civil engineering keywords
or not?
>> Based on my knowledge. So you would you
would have to, if it's like
specialized knowledge, you would have to
give it a way to access that.
>> Oh, that means it doesn't have any memory to
recognize the human's background or to
find the main context of the information
of the interaction and based on that
produce the next result?
>> So you have to somehow feed in that
knowledge. So you could either do that
sort of before you you know like as you
set up the session you can ingest kind
of the the knowledge as initial context
for example and then it has that in
context to talk about or you would give
it kind of function calls to access
knowledge sort of during the session as
you as you converse.
>> Yeah, I see. Thank you.
>> Yeah, you know, what I found is that with
GPT for example, during some
turns it can find the speaker's or the user's
knowledge or the main concept of the
information, and based on that GPT
can produce some results. That means
step by step it gradually personalizes to
the main context of the conversation. So
my question is, can it step by step
personalize itself to the main context
of the uh of the interaction and then
produce a result to point to the uh
>> so like as long as the the context stays
within the context window during that
conversation.
>> Yeah. It it would it can
like it can identify different speakers
and sort of remember what was said in
the conversation and like if they
introduce themselves with their name as
well it can remember kind of that
person's name and so like yeah that's
kind of the the the audio understanding
sort of within Gemini.
>> Thank you.
>> Okay cool.
Uh yeah do you just want to pass it
forward there?
Yeah.
>> Uh thanks for the nice presentation. Uh
could you share maybe some um of your
experience on how to evaluate these um
live uh voice apps because I can imagine
that this becomes a lot more complicated
than typical apps.
Um
yeah I've
so it definitely depends on your use
case in terms of like
uh you know like what are your
requirements in terms of you know do you
have HIPAA, do you have SOC 2, like what
is the amount of function calls the
amount of guardrails so there's
definitely a lot you know
>> these demos are nice and and fun but
like to bring that in a b business
context there are definitely a lot more
um steps involved and so that is kind of
where the partner integrations come in.
So uh you know LiveKit has built kind of
their entire business around sort of
giving you all the batteries around sort
of voice agents and so I would recommend
if you know sort of looking at the
partner integrations for the sort of
real business use cases potentially.
Thanks.
>> Cool.
>> Yeah.
>> I hope you don't mind if I just ask a
simple question about the interactions
API. Going back to the previous talk.
>> Yes, please.
>> When's that going to be available on
Vertex?
>> Um, hopefully soon. I mean, if you speak
to some Google Cloud person, some Vertex
person, the more you tell them, I need
it on Google Cloud, the the easier it
gets.
We'll do.
>> Yeah, it's certainly not in our control.
>> Okay. Yeah, that's bad. That's bad.
That's cool. I mean it the API will be
the same. So you can start today on
Gemini API uh start testing. If you need
higher rate limits, anything else you
can like always reach out. If you need
vertex enterprise specific features,
then you might need to wait a little
bit.
>> Can I ask, like in terms of PII
in any like conversation history, do you
know how that's like stored in terms of
like data sovereignty? Can you
specify your own data sources or is that
all handled like back end um with with
like within the API?
>> So you can always disable storing
anything. So we have a store equals
false flag. So we would not store
anything but not storing means no
server-side state. So if you would like
to use this, that's a bit difficult. For
other Vertex features in terms of data
sovereignty, where you call the model, I would
expect them to be like similar to
generateContent. So if they have it
today they will have it there in the
future as well.
>> Okay, cool. That's
great. Thanks. Appreciate it.
>> Cool. Uh, one last one.
>> No. Okay. Ah,
>> sorry.
>> No, please.
>> Thank you. So, I have a question about u
hallucinations. So, we
>> Sorry, the what?
>> Hallucinations. Yes. So you've have
shown some examples uh with the weather
that
didn't work so well. Uh but uh how do
your clients actually deal with that
stuff on production? Because I can
imagine that for like some examples that
we've seen here this is fine but uh in
real life this is a different story. So
can you give some best practices or how
to deal with that?
>> Yeah definitely. Um, so I mean on for
the demos the
there's definitely a lack of best
practices in terms of like system
instructions and you know there's a lot
that you can do sort of with uh the you
know better system instructions to uh
have the agent actually follow um the
system instructions and not you know go
off and like hallucinate the weather for
example. Um, so yeah, I think we we have
like there's some best practices do
talks. Um, so I'd recommend kind of
going through through those. We have an
example as well sort of, you know, how
to sort of structure um your your system
prompt and sort of put you know your
guardrails in there, guidelines and um
kind of the tool definitions as well.
And so once you build that up, uh, the
agent gets a lot better, you know, at
following the system instructions and
kind of staying within those those
parameters.
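(As a rough illustration of that kind of structured system prompt — the sections and the tool names below are made up for this example, not the ones from the docs:)

SYSTEM_INSTRUCTION = """
You are the voice assistant for this Gemini Live demo.

Guidelines:
- Keep spoken answers short; this is a voice conversation.
- For current facts (weather, news), always use the Google Search grounding
  tool before answering; never guess.

Guardrails:
- Stay on the demo's topics and politely decline anything else.
- Never read out API keys, tokens, or other secrets.

Tools:
- show_alert(message): display a browser alert on the page.
- change_background(color): change the page background color.
"""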
>> Cool. Cool. Um, yes, thanks so much
everyone. I I do apologize for the
hiccups, but we we learned something and
we'll we'll improve upon it. Uh, but
yeah, would love for you all to test it
out and um, you know, let me know over
the next couple days. I'll be around
what you find and let me know your
feedback. Thank you. Cheers.