AI Engineer Paris 2025 (Day 2)

Channel: aiDotEngineer

Published at: 2025-09-24

YouTube video id: wyUdpmj9-64

Source: https://www.youtube.com/watch?v=wyUdpmj9-64

[Music]
[Applause]
Ladies and gentlemen, please join me in welcoming to the stage your MC for AI Engineer Paris, developer experience engineer, Ralph Jabri.
[Music]
Good morning, Paris. Does this work? Can
you hear me? All right. Nice. How's it
going, guys?
Did you guys enjoy yesterday? If you
liked yesterday, today is going to be
incredible. You're going to just love
it. So, I'm very happy to have you here
today for day two. Uh yeah, yesterday
was a fantastic day. We had an amazing
opening ceremony with swyx and Ben, co-founders of AI Engineer. And we had a great talk also from Mistral AI, who talked to us about the problems that they face in the enterprise world, which I found fascinating. Um and we
also had a welcome party which was
incredible in my opinion. I've got to admit, these are the best moments in these types of conferences, where you get to meet everybody: attendees, sponsors, all the engineers and the founders. And you can hear it a little bit in my voice, I had a few chats here and there. Um but these are
my favorite moments. So, why don't we
get started with something? I want you
to look at the person on your right and
the person on your left and just take a
moment and take 10 seconds uh and
introduce yourself. Let's do that. All
right.
All right.
That's the spirit. That's how I like it. It's super cool. All right.
So, we have an amazing lineup of
speakers for you today. And we're going
to cover topics from agents, MCP, open
models, generative media, and more. And
I've seen some of the talks, and I have
to say they're incredible. So, you got
to be here. Uh, but before we get
started, I would like to invite the CEO of Koyeb, Yann Léger.
[Applause]
Uh, the mic is not working for Yann.
>> Oops.
>> So, maybe I'll take and now I'm back.
>> Yay.
>> All right. So false starts but here we
are for the second day of this amazing
event this conference that we assembled.
Um I want to speak a bit about content.
Um so you might have noticed uh we went
from one track to five tracks which is
quite incredible in terms of
coordination. Uh we have over 35 speakers coming today from all parts of the world. Some flew over from the US, some landed yesterday from Sweden, like Emil, who is coming up with us just after. Um and I hope you'll enjoy it. One key thing: the discovery tracks are also amazing. We had over 500 submissions on the CFP, so you will find incredible content. And if you cannot attend all of the sessions, they will all be recorded and available on our YouTube channel. So don't fear missing out. You'll be able to see them all.
Um,
one last thing I wanted to say is to thank you all again for joining us. One thing you need to be aware of: Koyeb is our serverless platform. We're providing high-performance serverless infrastructure to simplify application deployments. You will find our entire team on site today. We have two booths there, and you're welcome to stop by and chat with them. And now I'm going to hand it back over for the next stage of this morning.
>> Thank you, Yann. Let's give it up for Yann.
Something tells me that this is not the last time we're seeing him this morning. But yeah, so if you want to have a look at the full schedule, please download the app, where you can find everything that you've got to know about all the tracks. So this is the main stage, but we have three other stages. We have Discovery 1, Discovery 2, and we have a lot of workshops and we
also have coffee at the expo uh offered
by our friends from Tinfoil. So please
uh go and check it out. And speaking of the expo, I highly encourage you to also go and check it out, go and meet the engineers and founders that came here to meet you guys, make new connections, and find your next partnership or job opportunity. All right, so without further ado, I would like to also thank the sponsors. I would like to thank the gold sponsors: Sentry, Arize AI, DeepMind, and Algolia. And I would also like to thank our platinum sponsors, Docker and Neo4j.
And speaking of Neo4j, our next speaker is the co-founder and CEO of Neo4j. His work helped investigative journalists crack the Panama Papers, enabled NASA to reach Mars two years ahead of schedule, and drove breakthroughs in cancer research, fraud detection, and so many more areas. Please join me in welcoming to the stage the CEO of Neo4j, Emil Eifrem.
[Applause]
[Music]
>> All right, let's Perfect. Thank you.
In theory, the mic is working. I feel
like I hear my own voice. So, yes. All
good. Perfect.
Bonjour.
That's all the French that I know. My
apologies. Uh, I have a French chief of
staff. I have French investors. My
daughter is learning French. So, I'll
pick it up for next year. I'll be able
to say a little bit more, I promise.
So the slide here says the state of AI engineering. Um I'm not going to talk about the state of AI engineering. I don't think that's my talk to give. I think that's swyx or Ben. I feel like actually swyx did some version of that last night. Um but I spent two decades
of my life in databases and knowledge
representation. And so instead I'm going
to talk to you all about managing state
in AI engineering.
And more specifically over the last
couple of years we've observed hundreds
of projects built inside of big
companies and small companies building
AI applications. And I'm going to share
some of the observations from that.
Last year in San Francisco, I did a very
hands-on practical talk about GraphRAG and the benefits of GraphRAG and how you can get started with that. This talk is
a little bit more high level. It's an
opinionated view of where I believe the
data layer for AI should go. Or in other
words, we've identified four properties
of a kick-ass data layer for AI
applications.
All right, but before that, let's start
with a little bit of context
engineering. Who here knows what context
engineering is? Raise your hand. The
morning gymnastics exercise is going
well. Fantastic. Karpathy, I think, had a good one-liner for it: context engineering is the delicate art and science of filling the context window with just the right information for the next step. It was coined by this gentleman, Dex Horthy from HumanLayer, who wrote a really phenomenal blog series called 12-Factor Agents, which I really recommend for those of you who haven't read it. It's a really great treatment, and in factor three he talks about context engineering. He has a really, really simple intro for it.
Everything is context engineering. LLMs
are stateless functions that turn inputs
into outputs.
Therefore, to get the best outputs, you
need to give them the best inputs. So
simple, three sentences, a really good
kind of summary.
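To make that concrete, here is a minimal sketch of the idea that every LLM call re-assembles its context from scratch. The helper functions and their contents are hypothetical placeholders, not part of any particular framework; the point is only that the three sources of state named in this talk all flow into one stateless call.

```python
# Hypothetical stand-ins for the three sources of state; a real application
# would query a vector store, a memory store, and its own database here.
def fetch_rag_chunks(query: str) -> str:
    return "Assembly guide, step 8: attach the legs to the tabletop."

def load_agent_memory(user_id: str) -> str:
    return "This user prefers short answers and has asked about assembly before."

def load_application_state(user_id: str) -> str:
    return "Order #42: 1x dining table, delivered yesterday."

def build_context(user_message: str) -> list[dict]:
    """Assemble the full context for a single, stateless LLM call."""
    return [
        {"role": "system", "content": "You are a furniture-assembly assistant."},
        {"role": "system", "content": "RAG corpus: " + fetch_rag_chunks(user_message)},
        {"role": "system", "content": "Agent memory: " + load_agent_memory("user-1")},
        {"role": "system", "content": "App state: " + load_application_state("user-1")},
        {"role": "user", "content": user_message},
    ]

# The LLM keeps no state between calls; the quality of its output is bounded
# by the quality of the context assembled above.
print(build_context("How do I assemble the table?"))
```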
And he went on and showed a Venn diagram of what context engineering is. And today in this talk, I'm going to focus on the three main sources of state in AI engineering: the state out of your RAG corpus, the state out of your agentic memory, and then the state in your application. I call this the state in AI engineering.
Okay. But before that, we're going to
talk a little bit about a brief history
of application architecture. So I'm
going to start by bringing you back to
ancient times, many many years ago, before 2022, when we were all building some version of simple CRUD apps. There were lots of architectures around, but at some high level this, I think, was the canonical architecture. We have some UI. We have a
backend. We have a database, right? We
store data in the database and in object
storage. More specifically, we put our
structured data in the database. We put
our unstructured data into object
storage. And then between kind of the UI
and the back end, we speak JSON, which
is semistructured data.
Very simple. But then something
happened, right? Chat GPT was launched
end of 2022. That's why we're all here.
swyx coined the AI engineer term six to nine months later, midsummer 2023, right? And we all started building chat
bots, right? We started building
relatively simple chat applications,
right? That sit on top of some kind of
orchestration layer. We stuck a bunch of
our data in unstructured form into
what? Into vector databases, right? And
so these very simple but pretty powerful
applications ran on top of honestly a
reduced version of our unstructured data
like the embeddings. Right?
So that's kind of what we did in 2023
and then, in a very gross annual simplification, right? It's only been two years that we've been doing this, right? 2024 is when everyone started talking about agents, at least talking, right? There's debate about how much anyone was actually doing. Towards the end of 2024, Anthropic launched MCP, which really helped us think through what tools look like and simplified tool access in agents. And
so as we sit here in 2025, we can see a
stack is starting to form, right? At the
top, it's not just simple chat bots
anymore. It's a real application, right?
With some embedded AI features on the side or at the core of it. The app, the
former orchestration layer now consists
of one or multiple agents. And of
course, there's a big debate between
single agent versus multiple agents that
wrap prompts. They use tools, they use
memory, right? They use one or multiple
LLMs as part of that, right? And then we
have the data layer, right? So what's
going on then in the data layer? The
stack is starting to form, but that
really is the upper half of the stack.
So let's double click and spend a little
bit of time on the lower part of that
stack. So we started out with vector
databases, right? But after a while we realized that, you know what, just querying on unstructured data in a semantic, approximate-nearest-neighbor (ANN) sort of way, right, it's powerful, but it's actually not enough. And so the vector databases realized that, you know what, we need support for structured data as well. They tend to call it metadata, and you use it in metadata filtering, for example, right? But then at the same time there's obviously a bunch of other databases around that people started using for retrieval: the relational databases, the document databases, and the graph databases, for example. And we all said, you know, that whole semantic search thing, that's kind of cool. We're going to add vectors as a feature, right?
So here what we have is a data store centered around unstructured data which is adding structured data, and then we have, I don't know, let's call it Postgres, which is centered around structured data that is adding support for unstructured data, and a database centered around JSON, semi-structured data, that is adding support for unstructured data. So it's a
real party down there and there's a lot
of different models. Everyone's kind of
adding what the other folks are having
and honestly it's a little bit of a
mess. And the big question then is where
is this going?
And we spent a lot of time thinking
about this. What's really required in
order to have a kick-ass data layer for
that makes it really easy to write AI
applications. And we've identified four
properties for a kick-ass data layer for
AI applications. Four properties. and
I'm going to walk you through them one
by one. Okay. So the first property you
can hear how kind of I talked about it
in the state of the union of the data
layer, right? Where we're going with
this. The first property: I believe that in order to make it really easy to write AI applications, you need a data layer that, in a very easy way, manages unstructured and structured and semi-structured data, all three types of information, in a single data layer, and does that well, right? So store and retrieve, of course, but you also want to index them, and you want to be able to handle transactional scope across them and all that kind of stuff. So that is the first
property. The ability to handle
structured, unstructured, and
semistructured data.
Okay. So let's talk a little bit more
about unstructured. Let's double and
triple click on that. Right. Okay. For
property number two, what do we need to
do? What do we want to do with
unstructured data? So we just talked
about how important it is to be able to handle all three types of data. But of course the fundamental
currency of an AI application is the
unstructured data. That's one of the L's
in LLM, large language model, right? But
there's another observation that's been
happening as well over the last year or
so. There's a lot of talk and you heard
it yesterday from Mistral, which is AI
applications. They are well
applications. There's a lot of software
engineering involved in this, right? And
if you want to write an application, you
probably want to do that with types. You
want to have structures and objects and
classes. And if a lot of your information is unstructured, the question then is: okay, so we have an application, that application is handling people, you probably want to have a Person class, but then you have a lot of unstructured data in your data layer. And so you need to be able to bridge that somehow: take the data out of your unstructured data layer and reflect it in persons and objects and types up in your application, so that you have a convenient and good developer experience. Right? So how do you do
that? So let's talk a little bit about
that. Let's say that we have an
unstructured source. I'm not sure if you
can read it in the back here, but we
have some kind of text maybe out of our
rag corpus which has three sentences.
Andreas is here today. He is a
carpenter. Shout out to ABK. Right?
Three simple things. So step one here is
to identify the entities, the concepts,
the things out of that data. That's
called named entity recognition. It's a
technique that has actually been around for quite a while. And
we identify Andreas and he and ABK. And
then you can look at this and you can
see that the people here the human
beings we can read this and
instinctively we feel like this is
probably the same individual Andreas and
he and ABK. So how can we resolve that?
How can we figure out a way of combining these things? That is called entity resolution, and that's the next step: resolving these into a single entity. Right? Entity resolution. If you look at this data right here, it's actually structured data. This piece we can easily reflect up in our objects in application state.
This process right here, the named
entity recognition, the entity
resolution is going to be a really
important part of writing applications
in a convenient way for developers on
top of unstructured data. So the second
property of a kick-ass data layer for AI
application I believe is the ability to
consistently and reliably extract
entities out of unstructured data. So
that's the second one.
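As a rough illustration of those two steps, here is a deliberately naive sketch on the example sentences above. It uses a hand-written alias table instead of a real NER model or LLM extraction, and it is not how Neo4j implements this; it only shows the shape of the output, mentions resolved into one structured entity.

```python
text = "Andreas is here today. He is a carpenter. Shout out to ABK."

# Step 1: named entity recognition -- find the mentions in the text.
# A real pipeline would use an NER model or an LLM extraction prompt here.
mentions = [m for m in ["Andreas", "He", "ABK"] if m in text]

# Step 2: entity resolution -- map different mentions to one canonical entity.
# The alias table is hand-written; real systems use heuristics, embeddings,
# or graph context to decide that these mentions refer to the same person.
aliases = {"Andreas": "andreas_abk", "He": "andreas_abk", "ABK": "andreas_abk"}

entities: dict[str, dict] = {}
for mention in mentions:
    entity_id = aliases[mention]
    entity = entities.setdefault(
        entity_id, {"id": entity_id, "labels": ["Person"], "mentions": []}
    )
    entity["mentions"].append(mention)

# The resolved entity is structured data that can be reflected into a typed
# Person object in application code.
print(entities)
# {'andreas_abk': {'id': 'andreas_abk', 'labels': ['Person'],
#                  'mentions': ['Andreas', 'He', 'ABK']}}
```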
Okay, the third one. So let's say that
we're building an application and that
application is for an online furniture
shop retailer, right? And you can see
the application. We have some kind of a
central part where we have a product
gallery of some sort. It has some weird
Nordic name for the table, the Kiruna table, right? And then we have an AI-powered community forum on the side. We probably have some kind of an
AI bot helping out with assembly
instructions, right? Um the data layer
of this might look something like this
where we have a product database, right?
We have the purple boxes in here. We
have some kind of bill of material
table. Uh we have kind of a product
issue tracker table and they're kind of
joined up in some way. But that assembly
bot that helps out with assembly
instructions in the community forum.
Well, it probably needs to read data out of some of our product assembly guides, right? And so we put that in there as well, and we've extracted the
central entities out of that
unstructured data like we just talked
about. So that's the first piece. But
then we have this community forum where
there's a lot of chatter, right? And
people are talking, they're logging on,
they're maybe asking for help, they're
commenting um and there's individual
chat history memory, but we've also
extracted entities. And in particular,
we're focused on the entities that are
globally interesting. In this case, we
have a user called ABK who's a
carpenter, and he wrote that, you know
what, for the Kiruna table, the assembly instructions weren't really great, but when I applied two drops of wood glue to the table legs before step eight, then all of a sudden, I could easily assemble it.
So of course when we look at this, we would like our AI assembly bot, right, to be able to answer questions not just based on the data over here but also out of some of the stuff out of our agentic memory. But of course the problem we end up with here is that we have multiple Kiruna tables in here. So in order to really be able to answer not just with the data out of the assembly guide but out of these instructions, we need to be able to link them together. There are two ways you can link them together. One is to link during retrieval time, right? So you query across all your data sources, you want to find the Kiruna table, right, and you do that in retrieval, and there you go, right? That's one way. The second way is you do it in some kind of a unified layer, where you link it together in the data layer of the stack.
I think the second is going to be by far
the most powerful for most applications
out there for three reasons. The first
one is performance. If you do multiple retrievals across multiple tools, you will frequently end up with dependencies between those retrievals, which adds up the latency. You also have to manage intermediary results, which frequently end up being relatively big, actually. You reimplement joins in that application layer, and it adds a lot of memory requirements. So that ends up being tricky when you have real production-type deployments. For trivial small stuff, it's always doable that way.
The second reason, and we just talked about this actually, is high complexity. You know, we spent 15 years trying to implement joins well in the Neo4j database and we're still working on it. It actually is pretty non-trivial to make that work well. Um, and then the third,
there's actually a reliability piece.
Imagine that there was like a trademark
dispute or something like that for the
Kiruna table, right? And then we have to
change it, right? So if we have to change it, then all of a sudden, in this world, right, we have to do some kind of a text search across all of our data sources and update them, which is entirely doable, but we all know that that kind of stuff is a little bit messy, especially in production, especially at scale. If you have it all linked up in the data layer, it's a trivial change that automatically propagates across the entire state in your AI application.
So I believe that the third property of a kick-ass data layer, to make it really easy to write AI applications, is the ability to link entities across persistent agentic memory and into your application data, which tends to come out of your RAG corpus.
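A minimal sketch of that second approach, linking in the data layer, might look like the following with the Neo4j Python driver. The labels, relationship types, and ids here are hypothetical, chosen just to mirror the Kiruna example: the RAG-derived assembly guide and the forum tip from agentic memory both point at one canonical product node, so a single query reaches both.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

LINK = """
MERGE (p:Product {sku: $sku})                               // one canonical entity
SET p.name = $name
MERGE (d:Document {id: $doc_id, source: 'rag_corpus'})
MERGE (d)-[:ABOUT]->(p)                                     // assembly guide -> product
MERGE (t:Tip {id: $tip_id, source: 'agentic_memory', author: 'ABK'})
MERGE (t)-[:ABOUT]->(p)                                     // forum tip -> same product
"""

ASK = """
MATCH (p:Product {sku: $sku})<-[:ABOUT]-(ctx)
RETURN p.name AS product, ctx.source AS source, ctx.id AS id
"""

with driver.session() as session:
    session.run(LINK, sku="KIRUNA-001", name="Kiruna table",
                doc_id="assembly-guide-12", tip_id="forum-post-345")
    # One hop in the data layer reaches both the RAG-derived guide and the memory tip.
    for record in session.run(ASK, sku="KIRUNA-001"):
        print(record["product"], record["source"], record["id"])

driver.close()
```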
So that's the third property. And then
the fourth property, speaking of the RAG corpus.
If we go back to prehistoric times again
we think about the CRUD app. One way of thinking about this is that it has structured data and unstructured data in the data layer, which I talked about. Another way of thinking about this one is that all the data in here is first-party data. Right?
So what do I mean by that? What is first-party data? Well, in this context, in the context of application development, first-party data is the information that your application directly collects from its own users. It's data that you own. A couple of examples: profile data, right? The names that people type in, their preferences, that kind of stuff. But also activity data, how long you spent on various things, right? Transaction data, those kinds of things. And if you think back to that CRUD era, basically all the information for most of those apps was first-party data. I'm sure there were some integrations and stuff like that, but by and large most of it was first-party data.
But then if you look at our more modern AI app, we see that the first-party data really sits here in the purple boxes, but the yellow data there at the bottom is actually derived data. It's coming out of the RAG corpus, the assembly guides for the Kiruna table.
And what this presents is an interesting tension, because on one hand this is fantastic. We want to co-mingle first-party and derived data, so that when we reflect that state up into your application, the Kiruna table object in your TypeScript application has all the relevant information for it. That's the best developer experience. It also leads to better queries and retrieval. So we want that. The flip side, though, is we have to have the ability to treat them differently. A
great example of this is: what if that trademark dispute thing happened and we wanted to change the name of the Kiruna table, or maybe we wanted to add to the instructions, right? People handle this differently in RAG, but a very common pattern, probably the best and simplest pattern today, is that you drop all the derived data and you recreate it when the RAG corpus changes, right? Some people try to apply change sets and stuff like that, but I think that's tricky in reality. Okay. So then we would drop all the yellow stuff, right? And recreate this when the RAG corpus is updated, which is great. But if it's co-mingled with the purple stuff, you don't want to accidentally kind of cascade across that and delete some of that information. So the fourth property of a kick-ass data layer for AI applications is the ability to disambiguate between first-party data and derived data so that we can handle them differently in the application. It'll be different in different types of applications, because people have different strategies for it, but it is all based on this ability to know: is this first-party data or derived data?
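One hedged way to sketch that fourth property is a convention where every node carries an origin flag, so a corpus refresh drops and rebuilds only the derived part of the graph. This is an illustration of the pattern described above, not a built-in Neo4j feature; the property name and labels are made up.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

DROP_DERIVED = """
MATCH (n {origin: 'derived'})
DETACH DELETE n                      // remove only RAG-derived nodes and their edges
"""

REBUILD_DERIVED = """
UNWIND $entities AS e
MERGE (n:Entity {id: e.id})
SET n += e.properties, n.origin = 'derived'
"""

def refresh_rag_corpus(extracted_entities: list[dict]) -> None:
    """Re-derive the 'yellow' part of the graph after the RAG corpus changes,
    without ever touching nodes marked as first-party."""
    with driver.session() as session:
        session.run(DROP_DERIVED)
        session.run(REBUILD_DERIVED, entities=extracted_entities)

refresh_rag_corpus([
    {"id": "kiruna-table", "properties": {"name": "Kiruna table", "steps": 8}},
])
driver.close()
```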
Okay. So we've gone through a little bit
of history. We've talked about these
four properties of a kickass data layer
for AI application and properties that
are required I believe in order to make
it really convenient to write great AI
applications. It's the ability to handle
structured, unstructured and
semi-structured data. It's the ability to extract entities from unstructured data, to link them between both agentic memory and your RAG corpus and application state, and then finally the ability to disambiguate between first-party and derived data.
So I don't know if any of you are uh
work in big enterprises and you're
buyers of application software. I think
I've seen this many many times happen
where someone is in front of me a
salesperson and they talk about okay
here are the needs that you have and
then magically they show up with a
product that solves exactly those needs
right so that is not what we're trying
to do here these are our objective
intellectually honest observations we
don't have this at Neo4j today, but the reason why we're here, the reason why we spend so much of our cycles focused on AI engineers, is that this is of course where we want to go, and as the CEO I think I have some amount of influence, at least over the product roadmap. So this is exactly what we're building towards at Neo4j. So let me give you a little bit of a quick demo of the first two principles, which are not perfect yet but are really making progress really fast, and what it looks like in Neo4j. So imagine that
we have a Wikipedia page, as an example of unstructured data, that we want to add to the database, and we're going to see our ability to look at structured, semi-structured, and unstructured data in one platform, and the ability to extract entities from unstructured into structured form. So in theory this should work. So here we have
the Wikipedia page for Paris, right? We
take that and we copy paste the URL. We
go into the knowledge graph builder here
which will upload it and it's going to
start processing. So couple of things
will happen right now. We start
processing and chunk up that Wikipedia
page and we put it into the graph in two
forms. The raw data, the chunks, but
then also the extracted entities. And
what you can see here is both of those
things at the same time. We're going to
remove the entities for a moment. And
here you have the raw unstructured data.
These are the chunks of that Wikipedia
thing. But much more interestingly, of
course, are the entities. Those are the
ones that mean something to to us, to
humans, to applications. So here you see
that we've extracted some demographic data. It's too hard for you to see, but British citizens and, whatever, US citizens in the city of Paris, and it's in the country of France. And then it's automatically extracted notable individuals, right? So these are the, whatever this color is, greenish-something color, and you can see a bunch of really notable individuals out of Paris, which is a very humbling list of people, by the way. This is Igor Stravinsky as an example. So this is all
automatically extracted from that unstructured source, and you can imagine that once you have these entities, it's really easy to reflect them out into your application state in terms of typed objects. It's also really easy to query across them in a semantically useful way. So those are the two properties at play here as an example.
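Once the page is ingested, a query along these lines can pull the raw chunks together with the entities extracted from them. The labels and relationship names are hypothetical, roughly mirroring what the knowledge graph builder demo shows on screen.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

# List each chunk of the ingested page alongside the entities mentioned in it.
QUERY = """
MATCH (d:Document {title: 'Paris'})-[:HAS_CHUNK]->(c:Chunk)
OPTIONAL MATCH (c)-[:MENTIONS]->(e)
RETURN c.index AS chunk, collect(DISTINCT e.name) AS entities
ORDER BY chunk
"""

with driver.session() as session:
    for record in session.run(QUERY):
        print(record["chunk"], record["entities"])

driver.close()
```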
So that's it for me. The four properties
that I believe are required to build
great applications. If you think this is
interesting and we're going in this
direction, uh Neo4j is available for free in our cloud service, Aura. So you can check that out with the QR code. Uh we have a great GraphAcademy where we teach people how to get up and running and build these types of applications. A few weeks ago, and I don't know how many here are at startups, a few weeks ago we launched a startup program where we give away free credits and stuff like that. But maybe even more importantly, we're building a team of experts that help startups get up and running with Neo4j, right? So apply for that at neo4j.com/startup
program. And that's it for me. Thank you
very much.
All right. Thanks, Emil, for the amazing
presentation.
>> Thank you.
>> Awesome. Let's go now for a few
follow-up questions if you don't mind.
>> Yes.
>> Cool.
>> I have the coffee. That's the reward for
getting through the presentation. You
get the coffee.
>> Very important, guys. Um,
all right. So, you touched on memory and
the AI layer. Um, what do you think
agentic memory would um would look like
in the future?
>> Yeah, it's interesting, right? Um, how
many in here are using some kind of an
agentic memory system today? Like, uh, so probably a third or something like that. It's funny, right? Like, so this is more my objective observation, right? As a database guy, as a Neo4j guy, as a graph guy: there's a lot of people who independently come to the conclusion that agentic memory is intrinsically graph-oriented, right? There's a couple of YC startups like Zep, right? Um, Mem0, Cognee, right?
Also, we mentioned MCP. Everyone in here
knows what MCP is. Few people know that
the initial launch of MCP actually
shipped a tiny little agentic memory
implementation.
It's a toy. It's 300 lines of Python, right? But what is it? It's actually a graph, right? And so there's a lot of people
who haven't talked to me who
independently come to the conclusion
that graph is a natural form factor for memory. And so that's one of the key reasons why we're really, really interested in it, and especially, like, being able to marry memory and the RAG corpus, like I just talked about, I think is so powerful.
>> Interesting.
All right. Um, on to my next question then. From what you are seeing in the enterprise, do you think that more companies are building, using AI to solve their internal problems, or do you think they're buying AI solutions instead?
>> Like building application solutions rather than, sorry, buying application solutions rather than building?
>> Yeah, it's interesting. Um, so obviously we've watched this very carefully. We sell primarily into the Global 2000, right? That's who we sell to. That's why we can do startup programs and give it away for free to startups, because we make a lot of money from Bank of America, you know, kind of thing, right? Um, and so obviously, in the early days, let's call it 12 to 18 months ago, almost all of enterprise adoption of AI was in the form of applications. So in other words, they bought solutions, right? We've seen that change a lot, maybe experimentally a year ago, where people started trying to build a lot, but then in 2025 people really put it in production. So I think there's a massive surge of build. And then the whole thing of whatever AI amplification of software engineering means that engineers are more productive, and also the barriers of adoption for becoming a software engineer, or being able to produce code, are going down, right? So I think the way that it's trending right now, and will continue, is way more build than buy in terms of AI.
>> At least that's what I'm hoping since
look, they build on us, right? Like, so that's my hope, but I think that's where it's going.
>> And do you think that lowering the
barriers for software engineering is the
primary motive here or do you think that
there are other factors that are pushing
that? Are they seeing better quality
just by building or maybe the products
out there are not fully responding to um
what they're expecting from an AI
product? I think they see the promise of higher quality, but I don't think it's realized yet. Right. If you think about a company like Klarna, who here knows about Klarna? Yeah, most people. So, they just went public two weeks ago. Swedish-origin fintech. The CEO there publicly wrote this super long-form tweet saying, you know, we kicked out Salesforce, we kicked out Workday and 1,200 other SaaS tools, thanks to this AI platform that we built. And in that long-form tweet it was very clear that the key secret sauce was graphs and Neo4j. So they built this entire platform, kicked out everything, right? And when I talk to CIOs of Fortune 500 companies,
I've never seen the amount of emotion against SaaS sprawl. Not since the hate for Oracle. By the way, I like Oracle. I think Oracle is a good database, but man, have CIOs hated Oracle over the years. Right now, CIOs hate the SaaS sprawl that they see. Look, I run a small company. We're a thousand-person company, right? Our software budget is massive. I have no idea what SaaS products we're buying, right? Now imagine, you know, Verizon or massive companies like that, right? And so I think there's a
massive promise in doing the Klarna thing of rationalizing your kind of software ecosystem: get software that is bespoke to you and not get 100 features of which you use four, right?
>> Right, right.
>> I think there's a lot of promise in that. I see a lot of traction inside of the enterprise, but it's too early. It hasn't yet been realized.
Uh, one final question for you. So, uh,
Neo4j, you mentioned it, is a European
company, right? You're based out of
Sweden.
And, uh, but you also have a huge
presence in, uh, in the Bay Area. So,
um, how do you see innovation happening
in the AI space? Where do you see it
most? Do you see it in Europe?
>> Technically, we're actually an American
company. Um, we're incorporated in
Delaware. We're headquartered in Silicon
Valley, but I founded it in Sweden. All
of our engineering is in Europe, primarily Sweden and London, not a lot in Paris, but we are hiring, by
the way. So, if you find this kind of
stuff interesting, uh please please
apply. Um and we talk internally about
building an American company with a
Swedish soul, trying to marry the best
of both worlds. Probably going to end up
in some 2 by two where we get the worst
of both worlds, but we're trying hard to
get the like the best of both worlds
kind of thing, right? Um I think it's
shifted a lot. When I moved to Silicon
Valley, I moved there in 2011 and then I
moved back before the pandemic. So I now
live in Sweden. But when I moved there
in 2011, really to build a company like
us, you had to be in the valley, right?
Like developer, deep tech,
infrastructure, there's just no other
place. And you probably didn't have to,
but I liken it to running uphill versus
running downhill, right? You know, the
center of gravity for everything was
Silicon Valley, right? And in the early days of AI, it felt exactly the same. Like in 2023, maybe early 2024, it felt exactly the same way. AI innovation wasn't even in Silicon Valley. It was like an eight-block radius in San Francisco,
>> right? So it was right there. I think
things are shifting really really
quickly. I love being here in Paris
catering to a home crowd but like it's
just been amazing to see what's going on
here in Paris. Even Stockholm in Sweden is getting some real traction there, with the kind of Lovables of the world. Berlin, of course. London is a little bit slower than I would expect, but it's really starting to happen in Europe right now. If I were starting out
today from scratch, right? It's not
obvious to me that I would move to
Silicon Valley.
>> Interesting. Well, that that's it for
me, Emil. Thank you so much.
>> Perfect. Thanks everyone for paying
attention. Let's give it up for Emil.
All right, so on to our next speaker. Our next speaker leads engineering at Docker. Please join me in welcoming to the stage the VP of product engineering at Docker, Tushar Jain.
[Applause]
[Music]
Hey everyone, hope you're all having a
good time. Uh, this is a great
conference by the way. Really glad to be
here. All right. I hope everyone here
will agree with me. Hey, I hope everyone
here knows Docker. I'm going to assume
you all do. And I hope you all agree
that like a key thing Docker has done
over the last 10 years is make it easy
for all of us and all developers to
adopt microservices and containers, bringing standards, easy tooling, and trust in the ecosystem with things like official images and Hub. We now see a
need to do the same for agents and
tools. And so that's what I'm going to
talk to you about today. A framing I like, and we like to think about this, is that agents are the new microservices. The same way we moved from monoliths to microservices and needed containers, a similar shift is happening with agents. We're now going to move to agents calling each other. Containers are still the right paradigm, but we need to build on top of them. We need to build standardized packaging for agents that understands what agents are, what their dependencies are. We need trusted catalogs and we need to make it easy
for everyone to share and use these. So
today I want to briefly talk to you
about two things we're doing in this
space.
First, this is an early exploration
we're doing which is we think there
should be standard packaging for agents.
You can package them as containers, but that's not aware of what an agent is. Today, if you just package one in a container, we don't know what tools it's using. If you share them and use them, you have the same problems you had earlier, which is like: what are my tools, what are my configs, how do I run them in any environment, what's the runtime I need? There should be something similar to a Dockerfile, like an agent file. Maybe imagine a docker agent build/push/pull/run. So we're going to go do that, and we're exploring this and building this. And to get started here, to kickstart this, we've open sourced cagent. That's the GitHub link, please go check it out. It's an easy-to-use agent builder that makes it easy to build agents, but importantly to package them up as OCI artifacts, and makes it easy for you to share them around via an OCI registry or Hub. And so this is
early exploration for us on how to
package agents and share them. So we'd
love for you all to go try it out and
give any feedback, and we're going to work
more here. And if you're interested in
talking more about this, please do find
us at a booth.
Okay, next: um, for agents to work well, you know, everyone here has heard about MCP or is using MCP. But we think, for developers to use this and to make it easy, there are a few things needed. One, same as everything else, you need good packaging; packaging local MCP servers as containers is, we think, the right thing, and you get security with it. You need easy discovery, and you still need trust: like, how do you go find what is the right MCP server for the thing we're using, like what's the Docker official images version of MCP, how do I know what's trusted here? And then you need security and easy tooling around this: prevent rug pulls, prevent all other security threats. So to do this, there are two things we've done that we'd love to talk to you about. First is our
MCP catalog. You can go to hub.docker.com/mcp and you'll see a trusted catalog of MCP servers. Think of this as the Docker official images for MCP, the trusted, verified MCP servers. We'll be adding to this. We'll have community servers. We'll have a way for anyone to add images. But on top of that, we'll add a bunch of security: prevent rug pulls, containerize local servers. We support remote ones too. But it's an easy, trusted way to get the servers you need. And second, in Docker
Desktop, we've added tooling to make it easy to use MCP servers, because today you have to go configure each client, Claude Desktop, Claude Code, Gemini, independently. We can make it easy to discover these, configure them once, use them easily, and add a bunch of trust on your laptop when you're doing it, so you're not just npm-installing random software with access to your whole machine, but containerized and secure.
So I'm going to quickly show you a demo
of that and hopefully you can go try it
out after that. All right. So bear with
me as we do this demo. Okay. So let's
first orient ourselves. Um hopefully
it's legible. So if you go to Docker Desktop you'll see the MCP Toolkit there. Go check it out. Here's a catalog of servers. This will be growing. Um, I can easily add them. I can connect them to clients. So let's do this. The setup here is, um, you know, I've got some PMs. They've been collecting feedback in Notion. I've got feedback from, like, you
know, people here on Docker Desktop.
What I'd like to do is have something
consume this and go create issues for me
in GitHub that I can go work on. So,
let's go do this. I think I need a
GitHub client. So, cool. Let me just add
that. And I've already configured this with OAuth, as you can see. And then let's go get a Notion one too. So, let me go enable that. Cool. And I've already configured this with my secret here. It's all stored; secret management is done. And I'm going to use Claude Desktop. So I can just come here. You see a whole number of clients we support easily. You don't have to go manually edit config files. Let's just connect Claude here. So just do that. Done.
Great. Let's start up Claude. Da da da. Let it come up. All right. So you can see here in tools there's Docker and a whole bunch of tools there. Let's just do a quick test with this. Um, what feedback do you see, and what's the name of the Notion page? Uh, this will run for a second and then I'll kick something off briefly here. The problem with these demos is, hopefully Claude is fast. So you can see here it's gone and easily connected. I don't have to go muck with any config.
Easy configuration. It just worked. And
while that's running, I can just go do a
little workflow here that I can automate
that says read from here and go create
issues for me. And so what I've done here, great, it's done stuff. It's getting more. I'm going to start a new one. So we can just try
something new here.
All right. So now I'm going to say go
look at the feedback and uh categorize
it and put in GitHub.
This will take a while to run, so I don't want to make you all wait for it. The key
thing I want to show here is it was
really easy for me. I don't have to go
muck with any config files. I got
official servers that I can trust.
They're running containerized versions.
They don't have access to everything
they shouldn't. We have protection here
for any rug pulls, etc. We'll be adding lots
more security controls in here. And as a
developer, I personally find this very
easy to go use and run. And now I can
start automating my workflows here. And
now we will take this further in the
future to let you easily build agents
using this tooling. So you can automate
all this stuff.
Um, and that's running. I won't wait for
the whole thing to run. All right. I'm
going to go back here.
Cool. Here's some QR codes. Go try it out. Uh, go try out the MCP Toolkit. Hopefully, you find it useful. Give us any feedback, please. And do go play with cagent and start building agents and seeing the packaging. And we'd love to see any feedback you have there. And over time, you should see from us coming out standardized packaging, hopefully an agent file, and more software around MCP. That's
it. Thank you.
Let's give it up for Tushar. Thanks.
>> All right.
Up next is someone who's been at the
heart of how developers build and
collaborate for years. He's the vice
president of developer relations at
GitHub where he helped shape open source
communities and has been involved with
GitHub copilot since the very beginning.
Today he's here to talk about the MCP
protocol and share some hard-earned
lessons from running one of the most
widely used MCP servers at GitHub scale.
Please give a warm welcome to vice
president of developer relations at
GitHub, Martin Woodward.
[Applause]
[Music]
Hey everybody, thanks for having me.
It's good. It's always exciting when you
see the thing you're about to talk about
being demoed live on stage just before
you get on stage. So thankfully the MCP
server still works. So that was good. Um
my name is Martin Woodward. I
work at GitHub and we're going to talk
about um our MCP server at GitHub. But
actually what I'm going to be talking
about mostly is um the MCP protocol and
how you can get involved in the MCP
community. Uh we'll not be touching that
much on the uh the GitHub MCP server at
all. It's mostly mostly about the MCP
protocol. Hopefully that's good. Um so
as you know, you might have heard of
GitHub. You know, it's the home for the
world's developers. Um but we created GitHub Copilot in June 2021. Um and
that kind of changed how we think and
how a lot of people work with developer
tools. It was a very exciting project to
be on. I was lucky to be involved in the
the first version and have been involved
ever since. Um as we've been doing that
project, we've been learning a lot along
the way. Um, we've got now over 15
million users. We started by just doing
autocomplete.
So just, you know, because LLMs weren't
that good yet. So we did autocomplete
because it was only a little bit of code
that needed correcting. And then we, as
LLMs got better, we went into chat. And
now we're at where we are in 2025 where
the LLMs have got so good and can run
longer times unsupervised that we're
able to move to the world of the software engineering agent, which is
where the industry is this year and
where we've seen explosive growth um
across uh LLM usage for development.
But um you need to get data, you need to
do all that sort of thing. And that's
kind of where MCP has has come into
place to be able to do things with your
agent and be able to get data into your
agent as we just saw. But what I thought
would be useful first of all would be to
kind of um you know with all these
agents together and interoperating, how
did we get here? Where you know, how did
we get to where we are today? Uh, so it's amazing to me. I'm old, very old now, and it's interesting to be here at the beginning of yet another epoch, yet another change in the way that we build things. Um so as I say, Copilot
introduced in June 2021
uh we only got function calling in GPT
in like June 2023. So that's less than
you know just over two years ago as we
sit here in the room. So this is very
very fast. We then rapidly followed that
with um copilot extensions and that was
basically a way for end users to be able
to plug in to that whole function
calling thing and and developers to be
able to provide tools to Copilot
specifically to be able to then talk to
the rest of your development system.
We're GitHub. We know that, you know,
while we're at the center of your
developer universe, you need to be able
to talk to everything otherwise there's
no point having it there. So that's why
we introduced extensions. Then later
that year, Anthropic announced uh MCP
and the cool thing about MCP was that
the tool discovery is more dynamic. You
know, um you can tell the LLM what tools
you have available and you can give the
LLM sufficient context for it to be able
to do the tool calling. Uh rather than
the function calling which is a lot more
kind of API based. Um, we've got full
support of MCP inside of Visual Studio
Code. And then in April of this year, gosh, this year's been, it's been a long year already. So, in April, we did the local MCP server for GitHub, the official one. When Anthropic launched, they did a version of an MCP server that used our APIs, and
then we worked with Anthropic and we
have our own and made it open source and
everything's good. And then we also have
uh now um a remote MCP server. So if you
don't want to install anything um as you
saw in the demo uh from a colleague at
Docker, you can just talk to a remote
MCP server and everything works. So
that's kind of the history and it's
moved very very fast. But it's important
to know that you're all here at the
beginning of this new wave. And so for things where you might think, why doesn't this work yet? It's because it's so early. A lot of times,
as we've been building MCPs, we've
learned a few lessons. Um the first
lesson that we've learned actually is
that um you know everybody knows MCPs
for tools. It's what made MCPs
successful uh and why people start using
MCPs. Tools allow you to do things like create issues, send email, execute scripts, perform actions, as well as allow you to get data to add to the context of your prompt for your LLM. So, those are the key values. Tools are at the center. But it's much more than just
tool calling. Um, if you just call
tools, then you miss out on some of the
other parts of the MCP protocol that
allow you to call tools better, more
efficiently.
Here's a quick update on the MCP protocol. These are the basic constructs in MCP. So we obviously have
tools at the center on the server. But
what you can do as a client, you can ask
for resources from the server. So you
know if you're talking to GitHub, that's
things like files, issues, you know,
like data that you need to be able to be
part of your context. Or if you're
talking to Notion, that's your, you know, Notion forms. If you're talking to
a database, that might be database
schema, that sort of thing. So the
resources you can separately access them
as a client of an MCP server. We also
have prompts which is cool. Prompts are
a bit like the the stored procedures of
the uh MCP world. You can ask an MCP
server, hey, what good prompts do you
have? Um and there's variable
substitution in there that you can
insert data. So if you know, can I have
a prompt to do this thing please? Can I
have a prompt to do that? You can
actually ask the server for some good
prompts. and the LLM can do that and
it'll be great. Over on the client, um
there the client can also provide a
couple of things. One is um sampling. So
a client can allow the MCP server to do a lookup against the LLM that the client is using. So at GitHub and GitHub Copilot, we allow the developer to pick whatever model they want from whatever development environment they're in. And so it's essential for an MCP server like ours to be able to use whatever the customer wants their LLM to be. So if you want to use Claude Sonnet, great, we can do that. And now our MCP server can go to the client and say, "Hey, use the user's chosen LLM to do this bit of work for me." Or if you want to use Gemini, or whatever models. So that's why that
exists. And then finally, roots is the ability for the client to specify where in the file system, where on the client side, resources live and what the server can access. So you don't go accessing things that you might need root-level permissions to see, for example, or, you know, you stay within your sandbox.
Um, and then dynamic discovery is kind of part of the MCP protocol, but people know it as discovering tools. Lastly is actually elicitation. That's very underused, as you'll see, but it allows the MCP server to tell the client: hey, I need more information about X. I need to make a decision. Can you ask a question? Do I do A or B? That usually goes to the client as a question: do you want me to do A or B? Or it might be that the LLM makes that decision for you, but you're asking the client for input at this point. And that helps prevent hallucinations. That helps prevent ungrounded work.
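To ground those constructs, here is a minimal server-side sketch using the official MCP Python SDK's FastMCP helper, exposing one tool, one resource, and one prompt. Sampling, roots, and elicitation are capabilities negotiated with the client, so they are only hinted at in the comments; the issue-tracker domain is made up for illustration.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-issues")

@mcp.tool()
def create_issue(title: str, body: str) -> str:
    """Create an issue in the (pretend) tracker and return a confirmation."""
    return f"Created issue: {title}"

@mcp.resource("issues://{issue_id}")
def get_issue(issue_id: str) -> str:
    """A resource the client can pull into context, like a file or an issue."""
    return f"Issue {issue_id}: assembly instructions unclear at step 8."

@mcp.prompt()
def triage_prompt(issue_text: str) -> str:
    """A reusable, parameterized prompt -- the 'stored procedure' of MCP."""
    return f"Categorize this piece of feedback and suggest a label:\n\n{issue_text}"

if __name__ == "__main__":
    # Runs over stdio by default; a client (for example VS Code) can then list
    # the tool, read the resource, and fetch the prompt, and may additionally
    # offer sampling, roots, and elicitation back to this server.
    mcp.run()
```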
If we look at um you know the tools,
it's great. Um, and the reason why MCP
was super successful is because it is
such a pragmatic way of calling tools.
But the problem that we have today is a
lot of our examples with MCPs are just
focused on the tools rather than the
bits around the tools. And so unless we
as a community do more work to, you
know, get better at tool calling,
nobody's going to be aware of what else
is possible, which is one of the reasons
I'm here to try and help you see what
else is possible uh and help you dig in,
encourage you to dig in. Um I did a
quick survey of all of the MCP clients
uh on Monday and of those clients,
there's about 80 today, different client
implementations of MCP.
Um, 79 of them know about tools, but if we look here, only three of them know about the elicitation part of the protocol or have it implemented. So it's very, very early in terms of implementation; the spec was only written in June, so it's nobody's fault, but it's still early days.
Um, if you are experimenting and, you know, you're building an MCP server and you want to test out everything, actually I would encourage, and I'm not on the VS Code team, they're a sort of sister team to my team at GitHub, but I would encourage you to give VS Code a try. It's the only end client that
implements everything in the MCP
protocol and all the authentication
specs. So try it there and then that
will help you hopefully plug it in
everywhere else.
Right. So the first lesson that we
learned once we started adding lots of
tools is that actually more tools are not the answer. Just like humans, the more choice you give to your LLM, the more likely it is to get confused, and the same is true with tools for LLMs. LangChain did some great research around the degradation in performance of the LLM for coding, the more tools that it had access to. So you get overloaded, it can break down, and it can slow down the performance as well. Therefore, what we did
initially: I have a very, very easy job because you are my end users, you're developers, you're technical. So it's easy for me to do things, because I can ultimately just give you control and you can decide which tools to switch on and off, and you like that as developers. Many people who are not building developer tools don't have that luxury, because if I went to a normal person when they were booking a flight or trying to do some insurance comparisons, you know, agent-to-agent protocols, and asked them to pick tools, they would, no, they're not going to. Normal people don't want to do that; developers like to do that. So we initially gave you all of the control in the client to be able to control which tools you select. But what we
obviously need to do is dynamic tool
discovery. That's what's great about the
MCP protocol in that you can query the
server to say which tools are available.
But what's been added in the June version of the protocol is actually the ability for the server to say: okay, we've been having this conversation, I'm now going to add a new tool that you, the LLM, can call, one that kind of got unlocked at this point in the conversation, and it can broadcast to the calling client the fact that this tool is now available. And then it can broadcast the fact that this tool has gone away to the LLM as well. And by doing that, it reduces the number of tools available to the calling model at any one time, to again minimize the number of tools, to increase the accuracy and prevent confusion. So that's what dynamic discovery is good for.
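At the wire level, dynamic discovery boils down to a handful of JSON-RPC messages. The method names below come from the MCP spec; how a given SDK lets you emit the list-changed notification varies, so treat this as an illustrative sketch rather than a specific API.

```python
import json

# 1. The client asks which tools exist right now.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# 2. Later in the conversation the server unlocks (or retires) a tool and
#    tells the client its tool list changed, so the client re-fetches it.
list_changed_notification = {
    "jsonrpc": "2.0",
    "method": "notifications/tools/list_changed",
}

# 3. The client calls tools/list again; only the currently relevant tools are
#    offered to the model, keeping the choice set small and accurate.
refetch_request = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}

for msg in (list_request, list_changed_notification, refetch_request):
    print(json.dumps(msg))
```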
Um the next lesson that we learned
unfortunately is uh installing MCPs is a
pain. Uh, nobody wants to do it. You know, trying to do cross-platform MCP installation is an absolute nightmare. Now, Docker, as we've just seen, is fantastic as a developer for getting a containerized MCP server, but that's fine for us as developers; that isn't going to work for end users, you know, for normal people. Um so containerization is great but only gets
you so far. And if you look at the
amount of commands you can use when
you're defining your MCP server, you
know, there's everything here. Trying to
get it to run on Linux, on Mac, on
Windows, basically impossible to do
reliably um as a local MCP.
Now, oh, the other lesson that we learned: I did a quick grep of the logs for people calling our endpoints from MCP servers. Nobody upgrades a working local MCP installation, ever. Once it works, I am not touching it until it stops working. Now, this protocol is changing so fast. Our world is changing so fast. That's not sustainable for us. You know what I mean? People are never going to upgrade, and because of the way that we do installation, trying to do auto-upgrade is basically impossible.
So local, again, is great for development but is not going to work at scale once we start rolling these out to normal
people. So um remote MCP servers have
been introduced and this was kind of in
the protocol from the very early days.
MCP has always been a network protocol.
You know, a remote MCP server is obviously installed remotely. It doesn't have any access to client things like the file system without asking the client, and it's very easy to upgrade and to scale. So as a service
provider remote MCP is great and as an
end user it's fantastic too. But as a
developer there are some downsides you
know um if I'm building an MCP server
and I have to stand up a remote server
first it's hard. So local MCP has some
great advantages to you as a developer
when you're building MCP servers uh for
speed of iteration and for
experimentation.
The critical thing that you need with remote MCP servers, as we mentioned there, is good authentication, but
we're going to touch on that in a
second. Locally you can rely on local
secrets and the permissions of the local
user.
Um, so the trick that we did is we built a local MCP server and then we used that exact same logic, hosted, to be our remote MCP server. So, our local
server fully open source, everybody can
develop on it. Anybody can add features.
It's all good. We can as an internal
team iterate very quickly. And then we
can take that logic, host it remotely,
and then we upgrade our remote server as
well. And that then enables server to
server scenarios. So you can go to
github.com talk to the GitHub MCP server
and you know manage that whole thing
because there's a remote instance rather
than having to rely on your local
instance.
The next lesson that we then had once we
did remote MCP servers is that password- or PAT-based authentication is bad. It's an anti-pattern. Managing those secrets is hard; they always become long-lived secrets, which is bad for security. And, you know, preventing them from being accessed by other MCP servers that have access to the file system is very, very complicated. There are lots of reasons why password- and PAT-based authentication is bad, which is why in June we actually added OAuth support to the MCP protocol. So now MCP
supports proper tokenized security with
shortlived tokens. It supports OICD
connect as well. So OOTH support is key.
But not every client SDK has OOTH
support yet. And we all have if people
here have built OOTH integrations. It
can be, you know, there's more pain
there. There's more friction to go
through, but it's necessary friction
before you productionize something
that's talking to MCPs. Um local you can
get away with it fine, but once you go
into production, you really need to be
supporting OAR.
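As a rough illustration of why short-lived tokens beat PATs, here's a tiny sketch; the function names are illustrative, not the MCP SDK's auth API, and the introspection callback stands in for real OAuth token validation.

```python
# Sketch: a server that trusts a short-lived bearer token only while it is unexpired.
# A PAT has no expiry to check, so once leaked it works until someone revokes it.
import time

def check_bearer(headers: dict, introspect) -> bool:
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return False
    # "introspect" stands in for OAuth token introspection or JWT verification.
    claims = introspect(auth.removeprefix("Bearer "))
    return bool(claims) and claims.get("exp", 0) > time.time()

# Toy check with a token that expires in five minutes.
print(check_bearer({"Authorization": "Bearer abc"},
                   lambda tok: {"sub": "user-1", "exp": time.time() + 300}))
```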
Okay. And then the final kind of lesson
that we learned is your MCP is
completely pointless until people can
find it. You saw in the the Docker
example just now uh going into a
marketplace, finding MCPs that did what
they needed and plugging them in.
Without that discoverability, nobody can
use your MCP server. Um so we need to
make it easier for people to be able to
find MCPs. So what we've done there is, as a community, we've created an open-source MCP registry. This just shipped like two weeks ago. There's a link there — go and have a look. You basically now have the ability to use an API to publish details about your MCP server to the world, as part of an open-source, shared base repository of the MCP servers that are available.
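For a feel of how that API might be consumed, here's a hedged sketch; the base URL, path, and query parameter below are assumptions about the community registry's read API, so check its documentation for the real ones.

```python
# Sketch only: the point is that discovery is a plain HTTP API that any agent
# or internal catalog can build on. Endpoint details here are assumptions.
import requests

REGISTRY = "https://registry.modelcontextprotocol.io"   # assumed base URL

def list_servers(limit: int = 5):
    resp = requests.get(f"{REGISTRY}/v0/servers",        # assumed path and parameter name
                        params={"limit": limit}, timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(list_servers())
```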
Now, I've been around, like I said, I'm
old. I've been around through lots of
different uh programming languages with
package managers. I've been around with
lots of different, you know, like npm
and uh .NET and things like this. I've
been around in different ecosystems. The
thing that I'm very keen we prevent is
uh creating like a single point of
failure, a single monopoly of registries
because no one registry solves
everybody's problems. Um there's a lot
of trust involved in which MCPs you
switch on and you enable and so we need
to be able to make it so people can have
their own MCP registries and be able to
distribute that load. GitHub now have an instance of the registry — you can get the GitHub catalog of MCP servers, which adds additional data on top of the base catalog, such as how many stars a server has got, how popular it is. So we can see Context7 is really popular; the GitHub one obviously is too. MarkItDown — that's a cool plugin from Microsoft, actually, which turns things like doc files into Markdown so you can crawl them — is cool and super popular. What we want to do is build
an ecosystem just like you saw with the
Docker example just now where we have a
central registry where we can make MCPs
discoverable and then you have local
registries. So Docker has a registry
that you can go to when you're using
stuff inside the Docker area. Your
company probably should have a registry
at some point that allows you to decide
which ones you trust in your environment
and all those sorts of things. And that discoverability is kind of the key to the next level, to make MCP the success it can be. MCP has now become the API layer of AI. It's how it's going to work; I don't see anything else coming along here. There are complementary things like A2A for being able to talk between agents. But in terms of tool calling, being able to do things, MCP is the protocol we're going to be using. MCP is fully open. It's available for everybody to
use and talk to. So I would encourage
everybody to get involved. If you want
to be part of this community, now is the
time to influence the direction of what
we're probably all going to be relying
on for the next 20 years. So, um, if you
want to go to the MCP community site and
get started, that's there. And then
there is also a a Discord as well that
you can join, um, in terms of, uh, being
involved and being part of the community
steering the future of MCP. Leave that
briefly up for a second till I see the
cameras go down. Great. One more second.
There we go. Okay. So, there was the
lessons we learned very briefly. I'm
around all day today if people want to
talk more and chat a bit more. But uh
what I want to do is encourage everybody
to get involved in the MCP community.
Let's build from here together. And now
on to questions. Thank you for your
time.
Awesome.
Thank you, Martin. All right. Would you
like to join me for a few questions?
>> Oh, we're sitting down. I'm I'm I'm
cool.
>> Let's do it.
>> Yeah.
>> All right.
Wow. So so many lessons about MCP,
right? Um, and I I actually loved when
you mentioned uh uh that you dynamically
pull the tools.
>> Yeah.
>> Um, and you spoke about remote MCP and
OOTH.
>> Um, where do you see the protocol going
or what do you see is there anything
that you're actually like needing in the
protocol at the moment you feel like?
>> Yeah, I mean, the protocol as it's done in June is pretty complete now for what we know we need. I think the biggest thing is to get all the client implementations — so getting the SDKs to make it easy for developers to use the full protocol, because you don't want to care about elicitation and dynamic discovery and prompts and things. You just want to go to an API and say, call this in a good way, you
know. So um we the next bit of work is
to kind of simplify access to the
underlying protocol I think and then
probably the discoverability in the
registries is the next area
>> right yeah I I totally feel you with
discoverability I think like maybe
agents should be smart enough to be able
to go and and find out about the right
MCP server just
>> That's totally why there's an API there, so that agents can autonomously do that. But as we've seen, there are judgments to be made in terms of trust, you know, cuz you're having a thing execute things on your behalf. And there are also value decisions: say you had five flight-booking MCP servers, which one of those do you call? That logic as to which one you call is critical business logic that you want to have control over. So, yeah, it's an interesting area, but I think the fundamentals are now probably in there in the protocol.
>> Awesome.
>> Um, you also mentioned that MCP is the new API layer.
>> Yeah. So do you encourage everybody to get started with just turning their APIs into MCP servers? Is that the right idea to get started? Like, if I start with an MCP server, what do you think?
>> Yeah, I think um LLMs are ridiculously
good at writing MCP servers. So the
easiest thing is to, you know, crack
open VS Code with Copilot, crack open Claude Code, whatever you want to use, and get it to build you a quick MCP server that you can play with. We've just done an MCP server that allows you to play a game of tic-tac-toe, or, you know, allows you to kind of
interact. Doing that locally and being
able to see um how it's working just
helps you kind of understand the base
protocol and then yeah sure um having a
remote MCP endpoint for your business
might make sense for a lot of people. I
also think there's probably the need for
some MCP servers uh for like platform
engineering to help us in how we you
know build and deploy things inside of
our organizations. If we can make it so
that every developer from their
development environment can say give me
a new development environment please you
know add these add this database
connection like all those things where
you need to go talk to a human and need
to get into a ticket queue and wait. If
you can automate some of those with MCP servers
inside your organization, that'll really
speed up your development flow, I think.
>> Awesome. Um,
>> uh, yeah, you you mentioned that you've
been involved with with the developer
community for quite some years now.
>> Um, so when you meet new students, um,
>> yeah,
>> who want to get into computer science, what do you tell
them usually?
>> Yeah. Yeah — I'm looking, we've got a minute left. So yeah, I've got skin in this game, cuz my son is in the third year of a four-year, you know, master's in computer science. And at GitHub I look after our education programs as well, globally. So you know,
uh millions of students worldwide. Um I
firmly believe we need more computer
science students uh tomorrow than we
have today. The students that are coming out — we need to give them exposure to all these tools. So we give it to them for free, so that they and their teachers can have exposure to using AI tools. And it's really interesting seeing that it gives them much more exposure to working with somebody else, i.e. the AI agents, than they would typically have coming out of an education program, which is encouraging to me. I
you know typically with um when you're a
student you basically come out thinking
I need to do everything myself and group
projects are painful. I never want to
work in a team is kind of what you learn
from your degree. And so uh but when
you're working with AI you know you can
you're getting used to analyzing the
code that was created by another agent
seeing what works seeing where it's gone
wrong and being able to critically like
fix things. And that's a skill as an
engineer we need to just do every single
day. So I'm actually pretty encouraged
by what I'm seeing in the data of what's
coming out. But we need more and more engineers, and so, you know, I strongly encourage students to do that, but I also strongly encourage companies to make sure that they're hiring new junior developers, because nothing grows a senior engineer more than growing a junior developer.
>> So yeah it's very interesting when you
hear people saying that the junior
developer is kind of dead but
>> Yeah, not at all. We need them more than ever — and the impact that that junior developer can have is a lot greater than it has been.
>> I think there's been never a better time
in history to be a software developer,
>> especially if you don't speak English as
a first language as well because now the
LLM can explain code to you in your
native language and things. That's huge.
Like I take it for granted because I
speak okay English, you know, but if
you're learning computer programming to
have to learn English as well is is
horrible. So it's great we're improving
that too.
>> Awesome. Thank you so much, uh, Martin.
>> Good to meet you. Thank you for your
time, everybody. Thanks.
>> Let's give it up for Martin. Thank you
so much.
>> All right. How are you guys feeling?
>> Good. Ready for a break? Okay. So, 11:30
is when you need to to be back to this
stage. Um, so please, yeah, let's go for
a quick break and back at 11:30. See you
in a bit. Thank you.
[Music]
[Applause]
[Music]
Ladies and gentlemen, please join me in
welcoming to the stage your MC for the
AI engineer Paris, developer experience
engineer Ralph Chabri.
[Music]
And we're back.
[Applause]
You guys have enjoyed the expo. It's
incredible, honestly.
So, have you have you enjoyed the expo?
Yeah. I'm I'm I need some energy. I need
some expo energy here on on the main
stage, please. Okay. All right. Um, so
coming right up, our next speaker has
spent over a decade building
infrastructure and now he's building the
new infrastructure for AI workloads.
Uh, in his talk, he's going to talk
about what happens after the GPU gold
rush. Are agents going to need more
compute, more storage, more network?
Well, we we'll we'll hear about that. Um
and uh he's going to talk about diverse
hardware as well. So, please join me in welcoming to the stage my friend, CEO of COB, Yan Leger.
[Applause]
[Music]
All right, and we're back on stage.
So, um, this time I'm not going to speak
about AI uh, engineer Paris. I'm going
to speak about what we actually do and
how we see AI continuously redefining
cloud infrastructure. Um, my main u um
job is to uh to run a company called
COB. And as uh Ralph mentioned, I've
spent the last 14 years uh in the space
of infrastructure in general. So I've
seen a bunch of evolution uh from the
beginning of of cloud infrastructure um
to now and this new wave with AI. Uh so
we're going to look to spend 20 minutes
together looking a bit about the
evolution um and what is coming next
with the agentic era.
So the first thing I want to highlight is, if you look a bit back — so two, three years ago, with GenAI engineering — the main thing which popularized AI engineering is LLMs backed by GPUs. Typically, ChatGPT and the OpenAI models were LLMs running on GPUs. So if you look at the early 2023 AI stack, two and a half years ago, it was relatively simple in terms of number of components: you used to have an API and vector databases on one end, and then you had a limited choice of models. Typically, in early 2023 you had the GPT models from OpenAI, closed source, and you had a few open-source models — the Llama models from Meta, and Stability, but Mistral wasn't even there, for instance — and it was running on GPUs with Triton Inference Server, probably, simply because even vLLM didn't exist. And we're speaking of only two and a half years ago.
But now AI engineering is also content engineering, VLMs — so models which include image and vision capabilities — video models, vibe coding (a big trend of this year), MCP, and agents. We see all of this as an infrastructure provider, and I want to dive a bit with you into what the future of AI infrastructure looks like, how we are trying to build for this agentic era, and what changes we see.
So the agentic AI stack is way more
sophisticated.
I picked a set of technologies which we see in this new stack. It goes from the front end, which is regularly vibe coded — so we have app builders, Manus or Lovable, to vibe code your front ends; our team made an experiment on this which you can see on the expo floor. We have APIs which might be in more diverse languages because you have more SDKs available. You have the agents, which are probably still going to be written in Python, but they are different components. Then even the historical databases have vector capabilities or capabilities for AI. Then you have MCP, and for the execution and the inference of the model you also have more diversity. So AMD is a credible player now and can be used for specific models — you might have an image model or a video model running on a different kind of hardware, or LLMs running on an accelerator, here for instance Tensor, which is an upcoming hardware provider trying to disrupt the market of Nvidia. And you might not even see any of this, because you might also be operating at a higher level of the stack, with inference endpoints.
So if you look at what it means in terms of hardware and infrastructure to operate, now it's a mix of GPUs, CPUs and accelerators. You have large training, which is still performed with GPUs; fine-tuning and small training, where you will probably still have GPUs; inference, where you can have accelerators — and people are probably today starting to train with AMD GPUs. But inference is where you're going to see mostly AMD GPUs, and then agentic workloads, which are actually running on the good old CPUs.
Now let's dive into agents. Um I want to
actually start by defining what is an AI
agent, how we see it. Um the and because
the definition of agent is varies
highly. So um the the simple definition
if you look it up is it's going to be a
software system that uses AI to pursue
goals and complete tasks on behalf of
users. They show reasoning, planning and
memory and have a level of autonomy to
make decisions, learn and adapt.
For us there is a key question: what is important in this definition, what impacts infrastructure? So, it's a software system — at the end it's still Python or TypeScript or another language which is going to run on our servers. It's going to complete tasks, and they have a level of autonomy. What that level of autonomy means technically is that you're going to run untrusted code, in general, which is generated by a model which is nondeterministic. So anybody who has ever worked in security should think about the danger that involves, and actually the requirements it involves.
And the first requirement we see with agent workloads is secure sandbox environments, to make sure that this untrusted code is not dangerous to your production platform or environment. We see requirements in terms of performance, because they still need to run fast. The agents are mostly ephemeral, so they run for a short period of time. Because there is a high volume, efficiency is a key topic, and deployment speed is also a key topic that we see.
Now on the COB side, we build a global serverless platform, and we do this for agents and inference. So we provide a diverse set of hardware to support all of these use cases. And I'm going to briefly show you — I'm going to do a live demo, if it goes well — where we're going to execute sandboxed code using the COB MCP server.
Let's do it.
All right. So we are going to use Code with an MCP connector here. The tool is called the COB sandbox: what it's going to do technically is create a sandbox to execute our untrusted code. Today it will be relatively safe — we will execute some simple addition — but we'll see what happens behind the scenes. So here I have something which is preconfigured — okay, I should be fine — which should be preconfigured to run using the COB sandbox project. That's interesting. Okay, so we do have network here. And basically, okay, here we go. It basically instructs the agent to use a COB sandbox to execute untrusted code. So what we're going to do now is basically ask it to execute untrusted code: can you please execute this untrusted code? And it's going to use our MCP server to create a sandbox.
All right. So here we have the account where we technically run the MCP server already, and we're going to see a sandbox being created behind the scenes. Typically we've done this demo with Code; it could be done with anything else. We also have a demo running with, like, merol to create the sandbox. Here we see that it's creating a sandbox called "calculation sandbox" and that it's getting created behind the scenes by the prompt. And eventually — so this sandbox is healthy, and we are going to get the result here. Now let's look at what's happening behind the scenes
and what it means. Right now I just did this quickly: I went to Code and executed code, but coding agents are doing this automatically for you, without you specifically asking. So behind the scenes, as a service provider, what we see is that agentic workloads need to create thousands of secure sandboxes daily with subsecond starts — like in this case, you want the result to be immediate. And this year is the year where we started seeing people coming to us telling us, hey, we need to deploy 10,000 containers a day, because we have these agentic workloads which do this automatically. So if you use Lovable, behind the scenes they are going to automatically do this for you, and they are going to create a container on an infrastructure provider like us. One of them, for instance, is doing this on COB.
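To picture what a tool like the one in the demo might look like, here's a hedged sketch of a "run untrusted code in a sandbox" MCP tool. It assumes the Python MCP SDK's FastMCP helper, and the local subprocess is only a stand-in for the isolated microVM a real provider would boot.

```python
# Sketch only: tool name, parameters, and the subprocess backend are illustrative,
# not COB's actual MCP server API.
import subprocess
import tempfile
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sandbox-demo")

@mcp.tool()
def run_in_sandbox(code: str, timeout_s: int = 10) -> str:
    """Execute the given Python snippet in an isolated environment and return its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # A real provider boots an isolated microVM here; a subprocess is just a placeholder.
    out = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout_s)
    return out.stdout or out.stderr

if __name__ == "__main__":
    mcp.run()   # default stdio transport, so a local coding agent can call it
```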
One of the key questions for us, from a technical standpoint, is how fast can we boot an agent and how many agents can we boot, so that the experience for end users is seamless. The technical problem behind it is that we deploy Docker containers: they are stored in our registry, and we need to transform them into agents executing your code. Behind the scenes, our stack is actually leveraging virtualization to isolate the containers. So we use Cloud Hypervisor to run the containers we have; on top of this we have the container and the agents, and if it's an inference workload we might have an inference engine like vLLM; and below, we're operating on bare metal machines. So when someone types a prompt like this, we need to get all of this ready.
So, bottlenecks. You have multiple bottlenecks, knowing that we're aiming for subsecond starts. If you were to create a bare metal server from scratch, it would take minutes. If you want to pull an image from a registry, it can take several seconds for small images; if it's a large model, it's going to be way longer. Your networking is going to be a problem, because you have to converge the networking configuration between all your servers. And then you also still have to execute the runtime of your agent — typically the Python runtime is going to take some time to start.
In our case, one of the main bottlenecks when we started seeing this increase in volume was networking. Some people might think it's actually the virtualization engine — Firecracker has been quite popular lately, even though it's not totally new — but in practice, for us, the virtualization engine is not the bottleneck. In most cases it's really the networking part, and it might even be something completely different in your system: if you have a vault system, it might start taking a lot of requests and just be the bottleneck. We were able to mitigate most of these problems. We preemptively start the machines, we cache the images on the hypervisors, and on the networking side we made a lot of optimizations to reduce the time. The last part, which we don't control, is actually the agent runtime — we don't optimize the Python runtime ourselves — so this part is still one of the components which might take time to start. We do have a mitigation, though, to be able to support this workload.
And the mitigation is actually called scale to zero. That's how we mitigate most of the cold starts. So let me back up a bit. On the agentic workloads we have two key patterns: you have ephemeral sandboxes, which are going to be executed for, like, one minute, and if you are for instance using Lovable, it's going to create a web server at the end which is going to run constantly. One of the challenges for these providers is that a lot of people try it out, and then you have a huge fleet of services which are actually idling most of the time. But someone might still be using them, so they are looking to keep them up and running. So to mitigate this cost problem, we scale the workloads down to zero. If you've been in infrastructure in the past, one of the challenges when you scale to zero is how fast you can restart the server, because we completely shut down the VM, and so you will hit this startup time. So we have two techniques, scale to zero and autoscaling, to increase efficiency, and we're able to reduce cold starts with snapshots. What we do technically behind the scenes is memory snapshots: for your agents, we're going to save the memory. We call this mechanism light sleep, and basically it reduces the cold start time from several seconds to 100-200 milliseconds, which is not visible at all for end users. And so we end up being able to manage large fleets of machines without increasing your cost as an operator of such a technology.
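Here's a conceptual sketch of that scale-to-zero / light-sleep idea: idle instances get memory-snapshotted so a later request pays a short resume instead of a full cold boot. Every name and the timings are illustrative, not COB's actual implementation.

```python
# Toy scale-to-zero controller: anything idle gets "light-slept" via a snapshot.
import time

IDLE_TIMEOUT = 60.0                       # seconds with no traffic before an instance sleeps

class Instance:
    def __init__(self, name):
        self.name = name
        self.state = "running"            # "running" or "sleeping"
        self.last_request = time.time()
        self.snapshot = None

    def handle_request(self):
        if self.state == "sleeping":
            # Resume from the memory snapshot: ~100-200 ms instead of a full VM boot.
            self.state = "running"
        self.last_request = time.time()

def scale_to_zero_pass(instances, now=None):
    # Periodically snapshot anything idle instead of keeping it hot and billed.
    now = now or time.time()
    for inst in instances:
        if inst.state == "running" and now - inst.last_request > IDLE_TIMEOUT:
            inst.snapshot = f"memory-snapshot:{inst.name}"
            inst.state = "sleeping"

fleet = [Instance("web-app-1"), Instance("sandbox-42")]
scale_to_zero_pass(fleet, now=time.time() + 120)   # simulate two idle minutes
print([(i.name, i.state) for i in fleet])          # both instances are now in light sleep
```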
This technology is available now on CPU. We also have scale to zero on GPU — we're bringing the same principle of memory snapshotting to GPUs too for this kind of workload. If you want to know more, you're welcome to get started with our platform, and we have our engineers on site today, so please don't hesitate to ask them any question if you're curious about how it can be helpful. Thank you everyone.
[Applause]
>> All right. Thanks. Yeah. Thanks, Yan. Would you like to join me for a couple of questions?
>> Yeah, let's do this.
>> Let's do this.
I really enjoyed your intro yesterday, and I think we need to make sure that everybody understands that COB is not an agent company, it's not an event company — it's actually an infrastructure company. Right. So
>> um and you said that you were in the
space for over 14 years which is super
impressive. So uh can you touch on a
little bit on uh what changed you know
in this 14 years building infrastructure
um, like, you know, for cloud and now for AI?
Yeah, I mean, I do love this question, because we saw a bunch of changes while some things stayed consistent. For instance, what is funny is, if you look at our engineering technologies and our technical stack, we started four years ago with Firecracker as a virtualization technology. And then at some point, when we added support for GPUs, we went back to QEMU/KVM, because GPUs were not supported with Firecracker — and QEMU/KVM is a technology we were using
>> 14 years ago already
>> it's been around for a while.
>> So there are some things which are kind of consistent. In terms of changes, it's the scale at which we have to operate, in terms of number of containers. Before, you would have this volume but spread over several customers; now you have a single customer coming to you and bringing all these workloads, for instance. So some of the topics are consistent, like GPUs: if you look back in 2017 we were already deploying GPUs, right? I was mentioning 2023 — 2017 is, what, eight years ago. So we were already deploying GPUs. The difference is, at the time it was actually consumer-grade GPUs from Nvidia, and they were only starting to create data-center-grade GPUs. So you have this stuff which is consistent, not changing over time, and you do have things which are completely new, like the diversity in accelerators. This is still maturing, and we believe it's going to take at least five more years to have decent competition where Nvidia starts losing market share.
>> Is that what you're predicting? This
this is not financial advice by the way.
So just saying um
>> Yeah, I mean, we do believe that the monopoly of Nvidia is going to fall. We saw it in the past on CPUs. If you look back 10 years, ARM was completely nonexistent in the data center market, Intel was the dominant player, and AMD was struggling in the data center market.
>> Then AMD came to market, pushed back the market share of Intel, and now Intel is not in a great situation. So
>> yeah, we do believe that the same is going to happen with GPUs. Then the question is how fast.
>> Wow. Okay. So you're saying that we had
the building blocks to build the AI
infrastructure before and and now it's
just the scale of it, the scale of the
demand is just higher than what we used
to see before.
>> The scale is completely different, and if you look back at the slide I made on the AI stack, it looks again like a microservices stack. At the beginning, like two and a half years ago, everyone was just racking GPUs, and you had a set of new players who came to market by just providing GPUs, right?
>> Now, I think if you want to have a decent infrastructure, you need to have actually a set of services. It's not as simple — even if behind the model itself you have a lot of complexity, it's also now several services that you need to coordinate again.
>> All right. Okay. So, um in your
presentation you touched on something
that is really important to me which is
ephemeral environments, and you talked about secure sandboxes.
>> Yeah.
>> So do you see anything missing in those secure sandboxes for AI agents? What's the next step for them?
So on the agentic side, we see a lot of development at actually higher levels of the stack. You need these primitives where the container starts fast. You need to be able to execute containers, because people want the flexibility to execute different runtimes.
>> Yeah.
Then you need this isolation, which we're providing as a building block, and then people tend to look at SDKs — which don't need to be sophisticated, but basically let them quickly create sandboxes. So we are looking into this too, helping AI engineers create sandboxes without having to think really about the life cycle of the sandbox. Typically that's what the MCP tools are doing in the demo I just made.
>> All right. Well, thanks Yan.
>> Thank you
>> for your time. Thanks for being here and
thanks for the talk.
>> Let's give it up for Yan once again.
>> Thank you.
>> Thank you.
>> All right. So, I'm particularly excited
for our next speaker.
Coming up next is someone whose work has shaped the entire generative AI landscape. He's the co-founder of Black Forest Labs, the team behind the state-of-the-art model Flux. Before that, he was a researcher at LMU Munich, Nvidia, and Stability AI, where he co-invented latent diffusion, which powers Stable Diffusion, Midjourney, and DALL·E. He was even named one of Germany's top 40 under 40. Today he's going to pull back the curtain on Flux and tell us how it's built. So let's see about that. Please join me in welcoming the co-founder of Black Forest Labs, Andreas Blattmann.
[Music]
[Music]
So hey everyone, thanks for having me today. Thanks, Ralph, for the nice intro. I want to talk today about our most advanced image model family, Flux, and I want to explain to you the concepts you really need to understand how it works. I think Ralph has said a lot about me already. So I'm Andy, I'm a co-founder of Black Forest Labs. We are the company behind the well-known Flux model family.
And at BFL, we believe that visual media
will be the central interface for human
communication in the future. And our
mission or our vision is to become the
central infrastructure provider to power
all the human communication through that
visual media in the future. Not only
what cameras can do, but way beyond
that. With that mission in mind, we've
incorporated the company in April 2024
and launched it in August 2024. We've
grown it since to 40 full-time employees
that are distributed among two
headquarters. Our main office is in
Freiburg in Germany, actually in the Black Forest, and we also opened an
office earlier this year in San
Francisco.
So, we launched our first model family, the Flux model family, on the day we launched the company. And since then we've mainly structured our releases in three tiers; I just want to shed a bit of light on those. We have the Flux Pro models. Those are our best, most advanced, highest quality and fastest models, and they are only available via the BFL API. They are enterprise grade, and you can scale from zero to massive volumes nearly instantly and without any infra hassle. As some of you might know, my co-founders and me, we've been the original developers behind latent diffusion and Stable Diffusion, so we still have very strong roots in the open source community, and that's why we are also still publishing openly available models, also here in two tiers. We have the Flux Dev models that are open weights. These are available for everyone, downloadable on Hugging Face basically, and they are self-hostable — so perfect for someone who wants to host a text-to-image model in their own infrastructure — and they are fully customizable. You can fine-tune them to any extent, and in fact they have by now a huge ecosystem attached to them of LoRAs, of fine-tunes — like, a lot of stuff going on in the open source community. Super nice to see this. And finally we have the fully open source Flux Schnell model. This is super fast and ultra lightweight, and it's basically the perfect entry point into the Flux ecosystem.
Um and if you talk about the ecosystem
we can look at the um model atlas on
hugging face here um which basically
shows all the ecosystem meaning fine
tunes and stuff around the most
important foundation models across
domains
and guess which single model has the largest ecosystem attached to it: it's actually our Flux Dev model. So you see that we already shape, I think, the image generation space in the open source community very heavily. And, as said, our vision is really to advance this further, to become the central provider of the images and videos that humans will communicate with in the future. All right —
that much to the company. Let's now come
to the core part of the talk. How to
unify text to image generation and image
editing.
Why is that important? I think yeah,
first of all, image generation has made
huge leaps in the past five years. We've
been really at the forefront of this, I
think. But image editing had, until very recently, not really kept up with this speed of advancement in image generation. I would argue that image
editing is at least equally if not more
important than text to image generation
because it allows us to iterate over
content multiple times and gradually
refine it. I'll show you what that means
in a second. And by that it just gives
us much more control over the output.
Whoever has professionally worked with
images I think will totally understand
what I mean here. Um so I think that's
why it's a very interesting and
important problem to solve.
And with Flux Kontext, which we released in June 2025 — so earlier this year — we published the first diffusion model, or flow matching model, that combined text-to-image generation and editing. And that really unlocked
new properties that we have not seen
before. Things like character
consistency, style reference, local and
global editing. everything within one
model available within seconds once you
prompt it. So at really high speeds and
I think the top row of images here that
I brought to you pretty much visualizes
this. So we start here at the right uh
um part here um with this image we
remove the object from her face.
We keep the character then and transform
it into a completely new scene. So in
this case this inner city and in the um
rightmost example
we just change it to a winter scene.
This is all possible within single seconds — and arguably, whoever has used Photoshop knows this took just very long in Photoshop before we released this model. Right? So here are a
couple of more examples. Style transfer
is really nice. So here at the left
example, we just take the input image and transfer its style to a new prompt, basically to new content. Or we can do fun things like text editing: here, in the right example, we changed Montreal to Freiburg while keeping the font exactly the same. Importantly, Flux Kontext also solves a
lot of interesting business problems. So
we can actually get from an in the wild
image of this skirt to a fullblown
product shot within a couple of seconds
or on the right side we see that we can
transform a simple sketch into a
full-blown render also in a matter of um
yeah seconds.
So let's look a bit more at the pipeline
and how image generation is different
from image editing. Let's start with a
classic image generation or text to
image generation pipeline. We just use a
prompt that describes a scene. We push
it through the network and we arrive at
an image that hopefully follows that
prompt. In this case, it does, I think.
Um, for image editing, it's quite a bit
different. Here, we start with an image
and instead of describing a whole scene
that we want to generate, we only
describe what I now call an instruction
text prompt that actually tells the
model how it should change the initial
image. So here I say convert this to a
Lego scene and we get the image of this
church as a Lego scene. This all again
in a matter of seconds.
Combining these two aspects in a single
model is super important because it just
gets rid of very manual fine tunes that
we had to do before or complex
workflows. So this is again text to
image generation and text-based generative image editing combined in one model, and we don't need to fine-tune anymore or add workflows, which was the primary means to get to these editing capabilities before we released the model.
All right, I think this gives a bit of a glimpse of the impact. I'll
also show you a live demo of the model
later.
Now I really want to dive into how this
model actually works.
And in order to understand this,
we have to look at an algorithm that's
called latent flow matching, which is
defined based on two concepts. It's
latent and flow matching. And I want to shed light on these two concepts because these are really important to
understand how the model works. So let's
start with latent. This comes from the
algorithm latent generative modeling
that me and my co-founders invented five
years ago while we still were students
at the university.
Um, and to start with, I just want to
show you these two images. The one, the
left one is a JPEG image, which is an
approximation of the right one, which is
a PNG image. I would say these images
look exactly the same. But if you look
at the file size, we we see something
remarkable. I think the JPEG image is
actually 10 times or like close to 10
times or close to an order of magnitude
smaller than the PNG. So apparently we
can remove a lot of information without
actually noticing it. And this is very
interesting and we can even visualize it
a bit more differently if we plot the
perceptual similarity of an
approximation of an image in the last
example the JPEG and the original image
in the last example the PNG against the
file size we get a plot that looks
conceptually quite like this. So we have
here on the left side that the um
perceptual similarity quickly increases
with file size but then stays very very
constant. You might ask okay what what
does this now have to do with generative
modeling? This actually has a lot to do with generative modeling, because it shows
us that when we train the model on a
perceptual signal like an image for that
image to look real you don't need to
model all the high frequency
imperceptible details. So this part,
this flat part of the curve because here
we don't increase the perceptual
similarity anymore. This doesn't
actually matter to our eye and training
on this would just be a great waste of
compute and time. So we should not do
this. And this is at the core of latent
generative modeling. We really want to
find a representation of an image that
only contains those details in an image
that actually matter to us. And this is how we do it — well, we call this representation the latent space; this is where the term latent generative modeling comes from: generative modeling in the latent space. And this is how we train the pipeline: we train a so-called autoencoder to extract that latent space before we train the actual generative model. And
it works like this. We start here on the left side with an input image, push this through a CNN encoder, and extract that latent representation which we will later train the generative model on. Then we apply an operation that's called regularization; this forces the model to discard information from this latent representation. It can be done with discrete or continuous operations — we mostly use continuous ones here. And then we push it again through a decoder to reconstruct the original image, and we train this model end to end. So we use a reconstruction loss that just minimizes the difference between the reconstructed
and the input image. And importantly, we add — here on the top right — this discriminator loss, which is basically a prior on human perception. It takes care that the details contained in the image are actually only those that matter to our perception. And like this we can arrive at a latent space that really only contains those details that matter to us.
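As a rough illustration of that autoencoder training step (reconstruction loss plus a perceptual/adversarial prior), here's a toy PyTorch sketch; the tiny conv nets, the noise-based regularizer, and the stand-in critic are all illustrative, not the real model.

```python
# Toy latent-autoencoder training step: reconstruct the image, regularize the
# latent, and add an adversarial "look real" term from a stand-in critic.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Conv2d(3, 8, 4, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 4, 4, stride=2, padding=1))           # image -> latent
dec = nn.Sequential(nn.ConvTranspose2d(4, 8, 4, stride=2, padding=1), nn.ReLU(),
                    nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1))  # latent -> image
critic = nn.Sequential(nn.Conv2d(3, 8, 4, stride=2, padding=1), nn.ReLU(),
                       nn.Flatten(), nn.LazyLinear(1))                 # stands in for the perceptual prior

opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)

def train_step(x):                       # x: a batch of images, (B, 3, H, W)
    z = enc(x)
    z = z + 0.1 * torch.randn_like(z)    # crude continuous regularization: force z to discard detail
    x_hat = dec(z)
    rec = F.l1_loss(x_hat, x)            # reconstruction loss
    adv = -critic(x_hat).mean()          # "look real to the critic" term
    loss = rec + 0.1 * adv
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(train_step(torch.rand(2, 3, 32, 32)))
```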
So this is the first part. Once we've done this, we train the generative algorithm to generate images from an easy, tractable distribution — I come to this in the next slide. And this algorithm is
called flow matching. Flow matching is a general family of algorithms that wants to find a vector field, parameterized by a neural network, that maps from a simple distribution — visualized here on the right side, which is always the normal distribution — to a very complex and unknown distribution, which is our natural distribution of images, visualized here. This is again happening in the latent space, as I just explained: we have the encoder to encode, and we model everything in the latent space.
So we want to learn a vector field that
is parameterized by a neural network and
flow matching actually gives us a super
simple algorithm to learn this. All we
do is we couple each sample from our
data distribution with a random sample
from our normal distribution. So say
this one and we just repeat this for
every sample in the data set. And like
this we can construct an artificial vector field. This obviously looks really wrong, because in a vector field trajectories can never cross —
otherwise it's uh by definition not a
vector field.
The interesting thing with flow matching
is if I just do this and train the model
to approximate this horribly wrong
vector field, I end up with a true
vector field actually. So if I do this
very often, the model in the end
approximates the true vector field where
we see that trajectories don't cross.
This is the flow matching algorithm. And
obviously since we want to control the
output based on text prompts, we
condition our deep neural network always
on those text prompts to be able to
control it later. And once we sample the model — so say now we've learned this vector field and the network represents it — we just apply the learned vector field step by step in a numerical integration scheme. We start with a sample from the easy, tractable distribution, the normal distribution, and just apply a numerical integration scheme where each integration step is actually a forward pass of the neural network. Like this, in 50 steps more or less, we arrive from a sample from this easy distribution at a sample from the data distribution, which we then push through the decoder, and we arrive at a generated image in the end. So this is the flow matching algorithm combined with this latent generative modeling.
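To make the recipe concrete, here's a minimal, self-contained PyTorch sketch of flow matching on toy data: pair each data sample with a Gaussian sample, regress the straight-line velocity between them, then sample with Euler integration where every step is a forward pass. Text conditioning and the latent encoder/decoder are omitted, and all sizes are toy-scale.

```python
# Toy flow matching: learn a velocity field v(x, t), then integrate it with Euler steps.
import torch
import torch.nn as nn

dim = 16
net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))  # velocity field v(x, t)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def train_step(x1):                       # x1: batch of "data" latents, (B, dim)
    x0 = torch.randn_like(x1)             # sample from the easy normal distribution
    t = torch.rand(x1.size(0), 1)          # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1              # point on the straight path x0 -> x1
    target = x1 - x0                         # velocity of that (naive, crossing) path
    pred = net(torch.cat([xt, t], dim=1))
    loss = ((pred - target) ** 2).mean()      # averaging over many pairs yields the true field
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample(n=4, steps=50):                 # Euler integration: each step = one forward pass
    x = torch.randn(n, dim)
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        x = x + net(torch.cat([x, t], dim=1)) / steps
    return x                               # in the real pipeline, this would go through the decoder

for _ in range(200):
    train_step(torch.randn(64, dim) * 2 + 3)   # toy "data" distribution
print(sample().mean().item())
```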
All right. So this is how you can
generate images based on text. But now
how does this apply to Flux Kontext, which can also do image editing, right?
We do this with a very simple trick, I would say. I want to share a bit about the architecture of the model. We train a general transformer model, which is the backbone of what we're doing for the Flux models, and we condition it — this is the same for image editing and text-to-image generation — on a text prompt. This is here on the top left part. But instead of only conditioning on the one image that we want to generate, we condition on an additional image that is now the context image for the model. So here we have two images. We push them through the latent space, and in the latent space we form a sequence of tokens: we have text tokens, and we have image tokens for the two images — the one we want to generate and the one that is our context image. And we handle this token stream with a transformer architecture which is composed of double-stream blocks, where we have domain experts for the visual and the text tokens, and then classic transformer blocks that just handle all the tokens similarly. And this is how we parameterize our network. We can switch from text-to-image generation to image editing by adding or removing this context image: if you do text-to-image generation, we just remove that image. So this is how it works — just a conditioning trick in the end.
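A toy illustration of that conditioning trick: the transformer just sees one long token sequence, and editing versus text-to-image differs only in whether context-image tokens are appended. Shapes and names are illustrative.

```python
# Toy sequence building: text tokens + target-image tokens (+ optional context-image tokens).
import torch

def build_sequence(text_tokens, target_image_tokens, context_image_tokens=None):
    parts = [text_tokens, target_image_tokens]
    if context_image_tokens is not None:          # editing: condition on an extra image
        parts.append(context_image_tokens)
    return torch.cat(parts, dim=1)                # (B, total_tokens, d_model)

d = 32
text = torch.randn(1, 10, d)
target = torch.randn(1, 64, d)                    # latent tokens of the image being generated
context = torch.randn(1, 64, d)                   # latent tokens of the input image to edit

print(build_sequence(text, target).shape)            # text-to-image: torch.Size([1, 74, 32])
print(build_sequence(text, target, context).shape)   # editing:       torch.Size([1, 138, 32])
```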
Last question I want to answer. How is
it so fast? I think whoever knows the
flux models knows that they are pretty
pretty fast. What do I mean when I say fast? They are up to an order of magnitude faster than the competition. We see the Flux models here on the left side of each plot, and this is both for image editing and image generation — so they are always on the left side. For image generation we are 25 times faster than GPT Image 1, for instance, while for image editing we're still up to 10 times faster than the slowest competitive model. So how is that possible? The algorithm that powers this is called adversarial diffusion distillation.
Here the goal is to bring down the
number of numerical integration steps — if you remember, I said these were 50 for trained models, more or less, as a rule of thumb — down to as little as four. And in order to do this, we do two things. We take a pre-trained flow matching model and initialize two new networks from it. First the teacher — you see it here on the bottom part of the plot — which is just this model itself. And then we initialize another new model, the student. The student should learn to generate images that are as good as the teacher's images, but in only four steps. This is the goal, and here's how we do it. We start with an image, again encoded into the latent space, and then we use the student — this is visualized here on the top part — to generate an image from it in four steps. Then we arrive here, at the top right side, at an image that in the beginning, when I start training, is very blurry and looks very unrealistic. And I want to improve these images as much as possible. How do I do this? We use this generated image from the student, push it again through the encoder, and do the same generation procedure with the teacher, but here I use 50 steps, and I arrive at a very faithful and nice-looking image. And we just train a distillation loss that compares the images of the student with the images of the teacher and pushes the student to follow the teacher distribution.
This alone is not enough, which is why we add another loss on top of it, which is visualized in the bottom left part and initialized with a feature extractor called DINOv2. This is a discriminator loss, which is again a prior on human perception. It takes care that the images the student creates are actually perceptually pleasant-looking for us humans — so, again, it makes really sharp edges in the student images and makes them look realistic in the end.
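Here's a very rough sketch of that distillation loop in PyTorch — a four-step student regressed onto a fifty-step teacher, plus an adversarial "look realistic" term. It reuses the toy sampler style from the flow matching sketch above; every network and name is illustrative, not BFL's code.

```python
# Toy adversarial diffusion distillation: student matches teacher in 4 steps instead of 50.
import torch
import torch.nn as nn

def generate(model, z, steps):
    """Run the toy Euler sampler with the given velocity network (see the sketch above)."""
    x = z
    for i in range(steps):
        t = torch.full((z.size(0), 1), i / steps)
        x = x + model(torch.cat([x, t], dim=1)) / steps
    return x

dim = 16
teacher = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))  # frozen, pre-trained
student = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
student.load_state_dict(teacher.state_dict())            # initialize the student from the teacher
critic = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, 1))           # perceptual prior stand-in
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(z):
    fast = generate(student, z, steps=4)                   # student: 4 integration steps
    with torch.no_grad():
        slow = generate(teacher, z, steps=50)              # teacher: 50 steps, the target to match
    distill = ((fast - slow) ** 2).mean()                  # distillation loss
    adv = -critic(fast).mean()                             # adversarial "look realistic" term
    loss = distill + 0.1 * adv
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(distill_step(torch.randn(8, dim)))
```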
So this is super nice: with this algorithm we can get the number of numerical integration steps down from 50 to four, which is great, but it's very complicated — it's, like, two times encoding. We see here in the top row an encoding operation, and also in the bottom row an encoding operation, and the same with two times decoding. So it's super tedious to train this and very computationally heavy, and this makes us arrive at the final algorithm that we actually used to train the Flux model: latent adversarial diffusion distillation. You now know what latent generative modeling is, so I guess you know the motivation behind this. We just want to get rid of all the encoding and decoding, and we want to put the entire algorithm in the latent space. And this is what the latent adversarial diffusion distillation algorithm does: it does exactly the same thing as the last algorithm, but it applies it completely in latent space. And for that, we change two things.
We replace this DINO — let me go back once to see it — this discriminator loss, this feature extractor, with the teacher itself: we use the pre-trained model as the prior to calculate this adversarial loss. And we throw away the distillation loss completely. By that we can actually achieve much, much less compute effort and can speed up the algorithm a lot. So, in a nutshell, latent adversarial diffusion distillation is just adversarial diffusion distillation applied in the latent space, and by that we can actually bring down the number of integration steps from 50 to four, which is again this order-of-magnitude speed-up that we've seen in the plots. Here we get a 12 times — 12.5 times — speed-up. All right. Now you know why it's so
fast. And as a final thing that I want to show you, I have brought a short demo to actually see how the model works. So let's maybe use this image. Here we have an image of my favorite football club, SC Freiburg, which is the local football club in Freiburg. And we can now just play around with this motif. So we say: put this logo onto a t-shirt. Oops.
All right.
Here we go. Hope internet is okay. Yes.
Cool. Then we wait for a couple of
seconds
and what we get is
live demo.
Nice. Okay. Now, now now we're hot. I
think now it should be faster. This was
too too uh slow for Flux normally. So
now we get this nice uh picture of the
logo onto an image. The nice thing now
is I can use this image and iterate over
it. This is what I why I think editing
is super super exciting. So now I can
say okay this logo is too large make the
logo smaller
and put it on the
top right part
of the t-shirt.
Again, we submit here.
And here's what I get. Again, four
images. Uh, and I can I can just go on
like this.
Let's make the t-shirt red.
And I think there's the next thing I
want to do some some something more.
challenging. Um, let's put this t-shirt
onto a human and transform them into the
wild. Because so far we've done like
somewhat local image edits. Let's do
something more global.
Oops.
Oh, in the park.
All right,
I'm now generating two examples — I don't know what's happening here. Okay, apparently the model could also infer what I wanted to do. That's great. So, nice — here, also good.
Super. And one final edit, maybe. So this was a global transformation, which is nice. And now, as an example of style transfer, let's make this a watercolor painting as a final one.
Come on. Last one. Ah nice. Cool. And
here we go. Now we could print it out uh
and place it somewhere — I don't know, take it home, whatever. I think the general thing is coming across. It's very powerful,
it with this demo. I want to say thank
you. We are hiring. Come visit our
booth. Um we're just right next to the
main stage here. Um we have a lot of
openings. So, if you if you're
interested in what we're doing, please
apply or visit our playground. This is
where I basically just demonstrated the
model. Thank you so much and have a
great conference.
>> What a great presentation. Thank you,
Andy. Uh, would you like to join me for
a couple of questions?
>> Absolutely.
>> And by the way, for Zuckerberg, this is
how we do a demo.
>> Just Just saying.
>> Thank you. This was live.
>> All right. Please have a seat.
Um
I have to say I'm blown away to see how fast the Flux models are. It's incredible if you've tried Flux before.
>> Wow. Yeah. So everybody has noticed like
how fast it is. Um but how much faster
could you make it actually? So do you have a theoretical speed, anything you think we can reach?
>> I think once we reach larger models, we can actually get to one-step generation. I think that is really the goal. Flow matching models, diffusion models — they came from this kind of, what I said, numerical integration scheme or sampling process that just takes a lot of steps, and I think our goal is to really get down to one step. Like, our Flux Schnell model, for instance, does four steps, and I think once we get to larger models we can actually go down to one step. And sure, I think the challenge, or the goal, is in the end real-time generation, and I think that is very feasible: getting better hardware, getting better algorithms, getting better optimization procedures — like on Blackwell chips we can do lower precision, so we can do FP4 quantizations for the models — and I think we will definitely get to real-time
>> wow
>> generation with flow matching models, yes.
>> wow Wow. So you're saying by having
bigger models we're going to make them
faster. So it's
>> A bigger model can effectively represent a more complex transformation. In the end, like what I showed, this vector field — or approximating this vector field — is effectively just a mapping from an input to an output. And the more complex your model gets, or the more parameters I add, the more expressive these functions are. So I can model way more complicated mappings, and that allows me to do fewer steps. I can look at it from a linearization perspective: let's say I have a curve and I want to approximate it with linear pieces — the more steps I make for this approximation, the better my approximation gets. Mhm.
>> And making the model more complex makes the function you're learning more nonlinear, so you can also add things like curvature into what you're modeling in one part. And by that you can model more complex curves in fewer steps, in that way.
>> Wow. Okay. Um, and do do you think that
in the future uh people like like me who
like to use Flux for uh image editing
will just drop Photoshop altogether and
just be uh editing using these models.
I don't know how you will be using these models — they could actually be getting integrated into Photoshop soon. So
>> wow.
>> Um
>> breaking news.
>> I I I think yeah how you will certainly
use them for image editing. where you
will use them, I cannot yet say, but I I
guess BFL will also make a a good
proposal for um using the models on our
platform soon.
>> I see. Okay. So, it's going to just make
like these tools a lot better.
>> Yes. And and and I'm 100% sure that um
>> these models will be the backbone for
all the image editing um that we will
see in the future. Yes.
>> Wow. Incredible. Okay. So, yeah, you touched on the different models that you guys offer — Pro, Dev, and Schnell — and some of them are
open weight. Um, but how do you balance
your focus between, you know, building for open source, but also, like, focusing on state-of-the-art
at the same time?
>> Yeah, I think um we've just seen it as a
very nice value proposition to structure
it like having state-of-the-art models
in our API. These are just the best
models for people who really want to get
very fast results um and don't want to
deal with things like customization and
fine-tuning because to be fair that's
not everyone's business and not
everyone's interested in that. So I
think that's always good for people who
want to get to the most powerful models
in as little time as possible. But then you have this huge customization need that a lot of real-world applications have — like when I'm building a face-gen or face-editing app, say, I need a crucial amount of fine-tuning to just get very good at this very specific problem. And for this, open-weights models are perfect, because everyone can use them, everyone can customize them or get help with customization, and they can tailor them
>> to their specific use case. And I think this need for customization goes very well with an open source approach, because effectively the whole community can work for you. You can just use all the insights that the open research community
>> yeah
investigated in the past, right? And by that, this gives you just a huge tool stack that you, as someone who wants to customize the model, can get when you just build on our Flux platform. So I think we see this as a very complementary kind of approach that we are definitely also planning for future releases.
>> Absolutely. Yeah. So um for most
people, they're going to just go to the
state-of-the-art model, but for people
who have specific problems, those
open-source solutions might help them.
>> Absolutely. And I cannot predict how this
will continue in the future. I would not
say that most people are going to
closed models. I think it's just a
distribution, and I see this as a kind of
constantly evolving
and changing distribution. So we might
end up in a future where
everyone uses open models,
or everyone uses closed models, I don't
know. But I think from what we see right
now,
playing on both fields is very important
to us, and we just love to see
what people are doing with our models.
That's why we also, from a, I think,
personal standpoint, cannot
stop working on open models.
>> Awesome. That was my final question.
Uh Andy, thank you so much for this
presentation.
>> Thank you.
>> Let's give it up for Andy.
Thank you. All right. So,
2 p.m. is when we need to be
back here for uh yeah our next talk.
Okay. So now we can wrap it up
and go for lunch, and uh please enjoy and
see you at 2 p.m. Thank you.
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
Ladies and gentlemen, please join me in
welcoming to the stage your MC for the
AI Engineer Paris developer experience
engineer, Ralph Chabri.
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
I always dreamt of doing that. My god, the
timing was perfect. Hello everybody.
How's it going? Great. I had so much fun
this morning. So, we had Docker, Neo4j,
we had Black Forest Labs, and you know
what? The fun is not over, because our
next speakers,
yeah, they're going to be so good. And
our next speaker spent nearly a
decade at DeepMind, where he was working
on super cool projects like AlphaGo,
AlphaFold, AlphaStar, and he also worked
on Chinchilla, Retro, Gemma, and
Gemini. But now he has co-founded, and is the
CTO of, the H company. Um so please,
without further ado, um join me in
welcoming to the stage uh Laurent, co-founder
and CTO at the H company.
[Applause]
[Music]
[Music]
Hello everyone. Thanks, Ralph, for the intro.
Um I'm Laurent. I'm the CTO and
co-founder at H. And uh today I'm going
to talk to you a little bit about our
work, and really our work literally, like
what it is. Um so me, I like to start the
day saying to myself, I'm
going to do one thing, one thing, very well.
So I'm going to do this slide deck, and
then uh I'm going to open my email. I
can see I have 289 emails, and I'm kind
of a zero-inbox person, so it really
disturbs me quite a lot. And then I open
my pro email and I have 1,000 emails, and
then I'm going to open my calendar, and
then I'm going to do a bunch of stuff,
and I think we're going to do a lot
of micro tasks uh throughout the day.
And uh this is my desktop,
what it looks like. I just put
everything in it. I think it's simpler,
and then uh eventually I just create a
new folder, old, and uh I put everything
in it, and old is already taken, so I have
to name it old four, for example. And then
um so there has to be a better way, um
and this is uh what we are trying to do
at H: we're trying to build something uh
at the intersection of agentic, uh UI
automation, and also uh models. Uh so
agentic is basically anything that has
to do with autonomy, decision
making, planning. Uh UI automation is
how do you deal with software, um
but through the user interface.
Um so it can be RPA or uh task execution.
And models: we think the model is
important, um because yeah, you can build
on big generalist models, and I will
say more about that, but uh we think it's
massively inefficient, and uh we can do a
lot better with a specialized model. So
um really, what we are building at H, we
are building computer use agents, so um
agents that can control a computer the
same way uh a human would: through
the graphical user interface. So
basically you take a screenshot and
uh some context, you pass that to the
agent, and the agent outputs actions, uh
but in the same action space as you
would use: mouse clicks, scrolls, uh
keystrokes.
And um the reason for that is that right
now, agent is a big word. You have many
agent companies, and a lot of that is really
RAG, or uh tool calls, MCP, APIs and so on:
plugging an LLM, interfacing the
LLM, with like um strongly
typed tools. But we think, you know, it's
not going to work for the long tail of
tasks that I showed at the beginning. Um
you know, not all software will have APIs or
MCPs soon, um and I think there are two
reasons for that. One is inertia,
especially in the big enterprise world; uh
you know, things are slow. And then there
are more adversarial reasons for that, um
like business models. I think there are a
lot of legacy software vendors that um have
incentives so that this doesn't happen,
uh because they sell by the license, by
the seat. So if with one license you can do
more, because you have an agent talking
to the license, then they kind of lose
money, so they have strong incentives to
make that transition slower, and since
sometimes they have a monopoly, really,
it's going to take a while for this
transition, to agents doing a lot
more in enterprise, to happen. Um there are also
more positive reasons: like, there's
actually a lot of intelligence that has
been put into making the UI.
I mean, um the UI, you know,
it's an effective way to present
the information,
um and uh yeah, maybe you can
already use that context that has been
built by many developers as the
context for your LLM. Um and uh also, if
you want to assist humans, uh it's good
if you work in the same action space as
they do, so that they can show the
agent uh how they work, what they do, and
the agent can show them uh and
automate. So um there's already RPA,
robotic process automation, um where you
can automate
things through the UI, through scripts and so
on. But it's very heavy to deploy. It's
brittle. So if the
workflow changes a little bit, or if
there is ambiguity, it's not going to
work.
So I think those are the reasons why we
uh need, and why we are
building, computer use agents. Um so
concretely, I can show you maybe an
example. Um so the example is a task. Uh
here it's like, okay, we want to go to
Google Flights, we want to find the one-way
business-class flight from Buenos Aires to
Amsterdam on January 10. Um and we want
the details of the shortest flight. Um
so there are a lot of kind of weak
constraints in this. Um and let's see, um
you know, what an agent is doing.
So yeah, if you can please uh maybe
enlarge the… Yeah. So, so we can see. So,
what happens is that
you have basically um the software. So,
here it's a web browser,
um where um we're going to navigate on
Google Flights, and then the agent is
going to take actions. So, every
time we see uh um the action. So here
it's clicking on the
departure. Now it's writing the um
origin, Buenos Aires.
Um and um yeah, so maybe I can speed it
up a little bit. Um let's put uh 2x.
Um but um yeah, so the agent doesn't
really know how the website works, uh
you know, if you click on
something, what's going to happen. It's
really adapting in real time. Uh here, uh
at every step we can see uh
basically it outputs three things: um
some notes uh of what it
thinks happened on the previous steps, uh
some thoughts, so what it wants to do now —
so now it wants to um
select um one way — um and then the
action, uh a click on uh the
dropdown for round trip, because it wants
to um set uh one way. Okay. So,
so yeah, and um now we are coming to the
departure. So, departure is really hard,
because you have to navigate through a
calendar. Every calendar on every
website is different. And here we want
January. So, we need to click several
times on the next month uh until we
finally see January. And then we can
click on 10. And uh if you want to
script that in RPA, it's a lot
of work to explain the loop and all of
that. Um so now it's selecting the
class. Um, web search: so now it's going
to search. It's going to see the
flights. It's going to expand. So it
found the shortest one. It's
expanding, and uh and yeah, so you could do
that with RPA, but it would be a lot of
work. And here it's just uh one prompt,
basically.
And um yeah, so this is
um our agent, Surfer H. Um okay,
how do I…
Okay, maybe you can uh lower the
screen, please. Um, and the agent, the
way it works: so, you have the task. Uh,
that's the prompt I showed at the
beginning. Um, and then you have the
memory of the agent. Um,
and the memory, there are a few
things in it. There's the task, the
thoughts, uh, the actions, the notes,
and the previous screenshots. All of
that goes through a policy model. So
it's a VLM, a visual language model; so
it can be GPT-5, or Sonnet, or another
model. Um and then it outputs actions,
and we have a few actions. Um we have
refresh, go to, we have scroll, wait,
and write, and click. So the agent can
write some text onto some element, or
it can click on some element. And then
we have um a localizer model. So when
you use an action that has an
element target, uh there's a second call
to a model to uh output the coordinates
um of that uh element; that's the
localizer. And finally, when we answer,
when we produce the final answer, we have
another call to a um validator model,
which validates whether the task has
been correctly executed or not. If it
has been, um we just return the answer, or
we incorporate the feedback into the
memory and we continue. And every time,
the action is uh executed uh in the
browser.
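As a rough illustration of that loop — not H's actual code; the model calls, action set, and helper callables here are hypothetical stand-ins — the control flow might look like this:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Memory:
    task: str
    thoughts: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    notes: list = field(default_factory=list)
    screenshots: list = field(default_factory=list)

def surfer_step(memory: Memory, browser: Any,
                policy_vlm: Callable, localizer_vlm: Callable,
                validator_vlm: Callable):
    """One turn: policy -> (localizer) -> execute -> (validator). All callables are hypothetical."""
    screenshot = browser.screenshot()
    memory.screenshots.append(screenshot)

    # 1. The policy VLM proposes notes, a thought, and the next action.
    step = policy_vlm(screenshot, memory)
    memory.notes.append(step["notes"])
    memory.thoughts.append(step["thought"])
    action = step["action"]  # e.g. {"type": "click", "target": "round-trip dropdown"}

    # 2. Actions with an element target trigger a second call to the localizer,
    #    which returns pixel coordinates for that element.
    if action["type"] in ("click", "write"):
        action["x"], action["y"] = localizer_vlm(screenshot, action["target"])

    # 3. Execute in the same action space a human would use (clicks, keystrokes, scrolls).
    browser.execute(action)
    memory.actions.append(action)

    # 4. A final answer goes through the validator; on failure its feedback is
    #    added to memory and the loop continues.
    if action["type"] == "answer":
        verdict = validator_vlm(memory, action["text"])
        if verdict["success"]:
            return action["text"]
        memory.notes.append(verdict["feedback"])
    return None
```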
And for that, as I said, we can use uh any
model, but we can also use our own model
that we developed. Um so
those are our Holo models, um that
uh we open-sourced. Um they are really
state-of-the-art on this task of uh
localization. Uh so I'm
going to talk more about this later, but
yeah, I think one question is, why would
you uh use uh, you know, a specialized
model? This is 2025, so everyone is just
prompting a big model: put
your API key in and just ship the product.
That's what everyone does um nowadays. So
why bother, and why, you know, try to
build
another model that does something. Um,
And the reason is really um
efficiency. Um, and if you look at
efficiency, let's take an example. Let's
take chess. In chess, you
have GPT-5 or Grok that can play chess uh
very well, like 1,600 Elo. That's
probably stronger than
most people in this room, stronger than
me. Um so it's quite remarkable that
such a big um model can play at this
level without having been, you know,
specialized to do it. But if you compare
this to a specialized agent and model,
for example AlphaZero: AlphaZero
plays at uh 3,600 Elo. So 2,000 Elo
more. That means GPT-5 is going to win
one game out of every 100,000.
And it's also a lot bigger. It's like
one trillion parameters compared to 20
million parameters. This one runs on
maybe a big pod of GB200, a $1 million
supercomputer, and that one can basically
run on a cheap gaming laptop. So,
all in all, GPT-5 can play chess, but it's
maybe five billion times less
efficient than AlphaZero at it, so
something like 100,000 to
100 million times less cost-effective.
So yeah, you have many orders of
magnitude, and it really doesn't make
sense, um, you know, economically, to play
chess with GPT-5.
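The "one game in 100,000" figure follows from the standard Elo expected-score formula; a quick sanity check of the arithmetic (my check, not the speaker's slide):

```python
# Expected score of player A against player B under the Elo model:
#   E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400))
def elo_expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

p = elo_expected_score(1600, 3600)   # a 1,600-Elo player vs a 3,600-Elo player
print(p)                              # ~1e-5, i.e. roughly one game in 100,000
```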
And um so that's what we are trying to
do here for computer use agents. We are
trying to basically move the
Pareto front of uh performance of the
agent. So this is um WebVoyager.
WebVoyager is a benchmark. Basically,
the task that I showed at the beginning
comes from this benchmark, and so you
have like 600 tasks, and we measure
success rate. And that is basically the
success rate of the agent when you use
different models: uh when you use uh GPT-4,
when you use GPT-4.1 —
uh yeah, 4.1 is here — and this
is when you use our model that has been
specialized for this. And we have,
currently, a modest, say 5x, um cost-
effectiveness factor, but we are, you
know, aiming at something closer to
uh 5 billion, hopefully. Um this is um
yeah, the task I described about
localization. So given a screenshot and
an intent — here the intent is "click on
departure date input" —
you want the model to predict the
coordinates.
Um and um that's what we worked a lot on,
and for this particular task we also
have measured the performance
uh against the size of models. And here
it's not just us being better, but
what's also interesting, I think, is
that you have this, what I call the
specialization frontier, where basically
every model below it is a generalist
model — so here we have Qwen, we have
Sonnet, we have actually uh maybe GPT
much below on this particular task — and
then up here, these are specialized models.
Uh and so all of these models have
been trained specifically for that.
These are the UI-Venus models from uh Ant Group;
they have a big team working just on that.
This is UI-TARS from ByteDance, also a big team
working just on this problem. And uh
this is our latest release from two
weeks ago, Holo1.5.
So yeah, I think we're going to see a
lot of these Pareto plots where
specialized models really shine, and
people start to move away from
generalists because they find it's
uh not very effective. This is how
we train um Holo. Um so we initialized
the model with um Qwen2.5-VL, which is
probably the state-of-the-art in
terms of uh open-source uh VLMs, visual
language models. We need to give it a
screenshot, so it needs to be a VLM.
Um we start from that, and then
we do a little bit of fine-tuning uh on
a mixture consisting of UI localization
examples — so we've scraped many
different software and websites uh to
get screenshot, intent, coordinates.
Um we have other auxiliary tasks that we
think are interesting, like question
answering on tables, on UIs. And then
we have a lot of data um that
consists of executions of our agent on
many synthetic tasks. So we have
actually expanded this WebVoyager to
something like 100,000
uh synthetic tasks, on which we've run the
agent, and we collected the
successful trajectories, and we are um
training on the successful trajectories,
for web, for Android.
And um so that's phase one,
supervised fine-tuning, and then we do a
little bit of GRPO, which is
reinforcement learning.
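For a rough sense of what a GRPO fine-tuning step looks like in the open ecosystem, here is a minimal, hedged sketch assuming TRL's documented GRPOTrainer interface, with a hypothetical well-formedness reward standing in for a real localization reward — this is not H's actual pipeline, which is vision-based and far larger:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Tiny toy dataset of prompts; a real run would use UI-localization tasks.
train_dataset = Dataset.from_dict(
    {"prompt": ["Where is the departure date input? Answer as x,y.",
                "Where is the search button? Answer as x,y."]}
)

def reward_well_formed(completions, **kwargs):
    # Hypothetical reward: 1.0 if the completion looks like "x,y" coordinates.
    def ok(text):
        parts = text.strip().split(",")
        return len(parts) == 2 and all(p.strip().isdigit() for p in parts)
    return [1.0 if ok(c) else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # small stand-in model
    reward_funcs=reward_well_formed,      # GRPO optimizes task reward, not imitation
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4, max_steps=10),
    train_dataset=train_dataset,
)
trainer.train()
```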
Uh and for now, um, we
are training the model
to optimize end to end for being
successful at this task, not really
reproducing the training data, and uh
right now we are doing it on UI
localization tasks, but we hope to
expand to um many more tasks in the
future. So why open? Uh, we could also
keep the model to ourselves. Um and
these are, I think, all of the reasons why
we want to do open-weights models. Uh I
think for the customer, uh it builds
trust. Uh they know what's in the agent.
Uh and we are a startup, so
they don't want to be fully reliant on
us. If we gave them the model, they know
they can run it if we go under. Um it's
a good conversation starter. Uh people
know, like, maybe uh they've seen the
model or the benchmark, so it's a
good conversation starter, and they can
try it before they buy it. Um I think
it's good for the brand, because um, like,
you don't have to trust me: you can
download the weights and, you know, run
the evaluation. So it's verifiable
performance.
Um, for the tech also, as a consumer: so
when we build open weights, we uh also
build on open weights — we start from
Qwen-VL in our case — so um we use
existing training
pipelines, maybe TRL from Hugging Face, maybe uh
we can also use inference engines like vLLM;
all of the stack is basically already
done, uh and we don't have to reinvent
the wheel; we can just uh focus on where
we add value. And then for the employees, I
think it's a good driver as well. Uh they
get visibility along the way. It can be a
long way for a startup to be successful,
but uh this gives them intermediate
milestones and visibility, and it helps
with retention and attractiveness. So
I think, all in all, it's a
good deal for everyone.
Uh, what is next for us? Um, in terms of
research, we want to extend: we've done
the web, we want to extend to desktop
and mobile. And we want to generalize:
before, we trained on WebVoyager and we
evaluated on WebVoyager, and because
it's reinforcement learning, it's kind of
okay uh to train on the test set — but uh
really, it's not. So now we are moving
away from that, and we are training
only on this big um
synthetic task mixture that I described.
Um we want to use more
humans in the loop for annotation. Sometimes
the agent is really stuck, and you need to
just show it uh what to do. So we are
building tools for that.
Um and then um reinforcement: so, do more
reinforcement learning to optimize end
to end for task execution success,
um so multi-turn and so on. Uh and then
in terms of platform: the
stuff I showed at the beginning, the
demo, we're gonna uh make it public
soon. So that's the Surfer showcase. We're
going to have a portal where you can go
and create um, you know, an account. You
have your API key, you can uh launch the
agent, you can do inference with the
model. Um so that's coming um very uh
very soon. And um so with this, uh I
thank you for your attention. I leave
you with this quote, uh which is um, you
know, why we think it's important to um,
you know, assist with uh some
kind of uh low-level tasks. Um, thank
you.
>> Thank you. Would you like to join me for a
few questions? Yeah.
>> Thank you so much. All right. Please
have a seat.
Okay. Wow, what a great presentation. I
really enjoyed it. Actually, I wrote so
many questions, but I'm gonna try to
keep it short. Um, so I
really liked when you highlighted the
limitations of APIs and um MCP,
and you spoke about the contrast with
computer use agents.
Um, I was just wondering, what's the
biggest challenge, in your opinion, in
building computer use agents?
>> Um so I think there are many
challenges uh in terms of deployment. Um
there are many questions,
like, do you run it locally? Do you run
it on a virtual machine? Uh what are the
credentials? Um, it's a big question.
Yeah, we book a flight, but then what's
next? Like, are we logged in? Are we
going to just have the, you know, credit
card uh number uh in the prompt?
Probably not a great idea. So, um, yeah,
I think many questions around roles and
permissions, and I think it's a
great opportunity, because it's
already done in human interfaces. Yeah.
>> And we can reuse some of that.
>> Yeah. So if uh an agent needs to act on
behalf of a human, what
kind of data can it have access to,
etc.? What kind of permissions are we
allowing uh them to have, very
much.
>> Yeah, so it's really dependent on
the deployment. Um and we are uh just
getting started on the go-to-market side
and so on, but um yeah, I think either it
takes control of your laptop and you
have your credentials there,
um but then maybe it needs to send you a
confirmation
before it, you know, places an order or sends
an email. But yeah.
>> Yeah, I would like that too, if I'm
informed. You spoke of uh um go-to-market
uh strategy. So uh can you tell
us a little bit more about the business
model that the H company uh actually
has for this?
>> Yeah. So I think we're going to have
the Surfer showcase, but I don't think
it's going to be uh something you
pay for; you can try it. But uh
we don't want to basically build
a per-token business model; we think it's
kind of a race to the bottom. So we try
to uh do uh this uh forward-deployed-
engineer model, and we have Gautier, our
CEO, who comes from Palantir, and he's been
very successful at uh doing this in
his previous role. And uh we'll try to
charge by the value — like, how
much uh more can you do, and we take a
fraction of that —
compared to, uh, you know, how many tokens.
Like, OpenAI is uh like $2 per million
tokens; you know, we're not
going to make money like that, I think. So
yeah.
>> Yeah, it's quite hard to actually
understand how much, or
how many, tokens a computer use agent
needs, right?
So that makes sense to me. Um, well, this
is it for me. Thank you so much, uh, Laurent.
>> Let's give it up for Laurent.
>> Thank you so much.
>> All right,
so we're ready for our next speaker.
This is going to be a speedrun, but I
think it's going to be exciting. So, our
next speaker calls himself a
technological humanist. He's built
systems for NASA, brought medical
informatics to Zambia, and today at
Neo4j, he's helping democratize graph
databases. His motto: everything is
connected.
Okay. So, uh he's here to show us how
generating data can be fun. So, please
join me in welcoming to the stage the
generative AI lead for developer
relations at Neo4j, Andreas Kollegger.
[Applause]
[Music]
[Applause]
[Music]
Thank you for that introduction. Hello
everybody. Hello Paris. So we've all
probably done a little bit of vibe
coding over the last year or so. I work
at a database company. So I think a lot
about well can you do vibe coding with
data? Can you vibe with data? So this is
my talk. This is the snack version of a
fuller talk that I've got um that
I'll share later. And actually, I loved
the previous talk. It's all about the
idea that as soon as you go down the
route of making any kind of agents, you
run into all kinds of challenges that
lead you toward making a multi-agent
system. And you do multi-agent systems
for a couple of reasons. You do that
because the agents have got too many
responsibilities. They get confused.
They have context rot that happens,
semantic drift that happens. So you
break down the problem to make it
simpler and have more agents that are
focused on different parts of the
problem. Along the way, because you've
got multiple agents, you have the
opportunity to use specialized models
the way that we saw in the previous
talk. Specialized models outperform
general models in all tasks. So you have
the opportunity as soon as you break
things up to have models that do just
one part of a puzzle. Okay, that's the
quick setup. So I've been building a
multi-agent system to actually vibe with
data, to help you build a knowledge
graph, because I work for a graph database
company. And this starts with a top-level
agent that interacts with the user and
helps figure out what the user
wants to get done. But then there are
three specialized channels here, for
either taking in structured data or
unstructured data, and then, once you've
built a graph, retrieving information
out of it — what's called GraphRAG
from the graph perspective. So
I'm going to take you through a speedrun
of what this looks like in action. I
have a longer talk about the
details of how you implement this. All
the code is open source. I'm happy to
share that with you guys as well. Okay.
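For a sense of the kind of query such a bill-of-materials graph supports once it is built, here is a minimal sketch with the Neo4j Python driver — the connection details, labels, and relationship types are hypothetical, not the talk's actual schema:

```python
from neo4j import GraphDatabase

# Hypothetical local instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Trace a product down through assemblies and parts to its suppliers.
query = """
MATCH (p:Product {name: $product})-[:HAS_ASSEMBLY]->(a:Assembly)
      -[:HAS_PART]->(part:Part)<-[:SUPPLIES]-(s:Supplier)
RETURN p.name AS product, a.name AS assembly, part.name AS part, s.name AS supplier
"""

with driver.session() as session:
    for record in session.run(query, product="Stockholm"):
        print(record["product"], "->", record["assembly"], "->",
              record["part"], "<-", record["supplier"])

driver.close()
```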
So here's the speedrun in video form.
Imagine you want to build a bill of
materials graph. You don't know how to
do that. So you turn to this agent. You
say, "Hey, I would like to create a bill
of materials graph." Send that off. The
first job of the agent is to figure out
what do you really mean? What are you
trying to get done? It goes through a
bit of bouncing around here, figuring out
which agent should pick up this
whole thread, and realizes, okay, I don't
quite understand what you're trying to
get done here; let me ask some
clarifying questions. So it asks: tell me
more about what you're trying to do with
the knowledge graph. And I tell it: okay,
great, I want to trace from the products,
all the way through the bill of
materials, all the way down to the
suppliers, so I can do root-cause
analysis for product issues that
are reported, like on a website review or
something. Okay. Human in the loop is
important to me. So after the agent
decides it thinks it understands what
I just said, it says: this is what I heard
you say — was that correct? I approve it,
and the next step having understood what
I'm trying to get done it goes off to
find what data is available that might
satisfy creating that kind of a graph it
has tools available to it to look
through the file system for all the
files that are available it can kind of
graph through those and find the content
of the files, and then here it makes a
recommendation: here are the files
that I just found — some product files,
some assembly files, and also some
supplier files.
It presents that to me again for
approval. I said, "Sure, that looks
great." It goes on to the next agent.
Now, we know the user's intent. We know
the data available. The third agent's
job is just to focus on: given the data
and the user's goal, what would the data
model look like? So, it goes off and
does some analysis on that. And the
outcome of the analysis is that it comes
up with a proposal for what the graph
could look like. And again, it's just a
prop proposal. It doesn't actually do
anything yet. Shows that to me. I can go
ahead and look through it and see that
looks correct. It looks like it covers
all the kind of data that I'd like to
have inside of the graph. And here I'm
carefully scrolling through,
making sure all the nodes are there,
that the nodes are connected to each
other in a good way — I've got a good
connected graph. And here again, I'm
just going to go ahead and approve that
part of the workflow. With that
approved, we can finally transfer over
to the agent that's going to go ahead
and build the graph. And the building
part of it — this is where
specialization, both in agents and in
models, would really shine: if
you start from just ideation with
the user, you kind of maybe want an OpenAI
generalist model for that; as soon as
you go down to a specialized task,
having specialized models would be
amazing. And here we've gone through all
the way pulling in those data files,
creating the graph, and it's even
recommended some queries that I might be
able to run there inside of the
interface. Now, instead of, you know,
dealing with interface, I'm going to pop
over to the database side to see what
the the data actually looks like. And
here, I've done a match looking for um
the products that are inside of the
graph that was just created. This
particular product is the Stockholm. I
forget if it's a table or a chair, but
that chair is made out of some assembly,
some different pieces, and those pieces
have some parts, and those parts come
from suppliers. All of that was created on
the fly by this multi-agent system. That
is my speedrun. If you'd love to hear
the long view, come talk to me
afterwards. I've got about 10 seconds
left, so I can't take any questions.
Thank you.
[Applause]
>> Thank you. Thank you, Andreas. Awesome.
So,
our next speaker leads developer
experience and community at Hugging Face,
uh, where he's been championing open
source, audio, and on-device machine
learning. He's here to walk us through
the state of open source and uh open-
source LLMs in 2025. Please join me in
welcoming to the stage the head of developer
experience and community at Hugging
Face, VB.
[Applause]
[Music]
[Music]
Hello AI engineers. Uh it's nice to be
here. Um and uh thank you so much for
tuning in for the talk. Um in the next
20 to 25 minutes or so, I'll present a
report on the current state of open LLMs
in 2025. Uh and hopefully you will uh
you'll be inspired to try them out by
the end of the day or by the end of the
talk. Uh let's see. Um so when we talk
about open LLMs, one of the
first questions that pops to
mind is how do they stack up
against uh the likes of GPT-5, the likes of
um Claude Sonnet, and so on and so forth.
Thankfully for us, there is um an evaluation
service, um or rather a company, which uh
has made it its
mission to benchmark closed as well as
uh open models against a standard set of
evaluations. What you see on the screen
is the average performance of recent
LLMs on a bouquet of evals, ranging from
uh math, coding, um scientific rigor and
so on and so forth. The black bars that
you see are um proprietary models,
um and the blue bars that you see are uh
open models. And um what you can see
right now on the screen is that, in
the top 10, there is a pretty
sizable number of um open models. So uh
I can see three or four here. And uh if
you look at it in absolute
scores, uh the open models are
actually quite close to the closed
models as well. You can see that GPT-5,
currently with reasoning
effort high, is at a 67 absolute score,
whilst uh GPT-OSS, another open model from
um OpenAI, is at 58. And uh there is a
small caveat here, which is that um whilst
the open models are tracking proprietary
models quite closely, uh the closed
models on the screen that you see uh
come at a higher cost, um as well as
at a higher token budget, meaning
you have to um reason for longer and
so on to get to the same
um stage.
Now that we've established that um
open models are sort of okay, um
the next question is: are
these easy to use? Um surely,
like, you know, you have to deal with
model weights, you have to deal with
different runners, you have to deal with,
you know, um setting up your virtual
machine and so on and so forth. So
surely this must be difficult to
deal with. Um well, let's look at it. Uh
there are typically three ways of um
working with open LLMs. Uh
number one is a serverless API. This
is similar to how you would interact
with um OpenAI, with Anthropic, with
uh Gemini and so on and so forth, um
which is essentially that you take a
five-line snippet, you give the model
ID, and you uh pass a prompt in, and you
get some sort of generation out. Um
second is a managed deployment,
wherein you select the model weights, you
click on a few buttons, and
automatically a provider — this
could be Modal, could be
Koyeb, um Hugging Face and so
on and so forth — would package these
model weights up and you get an endpoint
out. Um or, last but not least, since
you have the model weights
themselves, you can deploy them yourself,
um which means that you set up your own
virtual machine. This could be a VM
right in your basement, or it could be a
GPU cluster on AWS, Google Cloud — pick
your favorite cloud provider — um or at
your company, and so on and so forth, and
you set it up.
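For the self-hosted route, a minimal sketch of serving an open model with the vLLM Python API might look like this (the model ID is just an example; a real deployment would usually expose an OpenAI-compatible HTTP server instead):

```python
from vllm import LLM, SamplingParams

# Load an open-weights model onto your own GPU(s); example model ID.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what a serverless API is in one sentence."], params)

for out in outputs:
    print(out.outputs[0].text)
```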
Now let's look at these in a
bit more detail, right? Um, let's talk
about the serverless API. I'm pretty
sure everyone in this room would
have um dabbled around with a code
snippet similar to this. This is a
boilerplate uh OpenAI chat completions
um code snippet that you would use to
interact with GPT-4, GPT-4o, uh GPT-5 and
the likes of it. Um in the year 2025,
we've adopted this as the
standard: be it the chat completions API, be
it the responses API, you can uh access
pretty much any open um LLM um with
pretty much the same standard, uh which
means you have a
familiar SDK. So in this case, you
can swap GPT-5 with, for example, GPT-OSS,
or one of the latest Qwen
models, um and so on and so forth. You
can choose from a bouquet of um
routers. So you could use um Hugging
Face Inference Providers, OpenRouter,
and there are many other providers like
this. So there is quite a lot of
optionality for you — a minimal version
of the swap is sketched below.
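Here is that swap as a hedged sketch, assuming the OpenAI Python SDK pointed at an OpenAI-compatible router; the base URL, environment variable, and model ID are illustrative, not a specific recommendation:

```python
import os
from openai import OpenAI

# Same familiar SDK; only the base URL and model ID change.
client = OpenAI(
    base_url="https://router.huggingface.co/v1",   # example: Hugging Face Inference Providers
    api_key=os.environ["HF_TOKEN"],                 # provider-specific token
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",                    # or a Qwen model, etc.
    messages=[{"role": "user",
               "content": "Summarize why open LLMs matter in two sentences."}],
)
print(response.choices[0].message.content)
```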
And um that's it: then just plug
in your prompt and go
build your um applications. So it's
pretty much the same experience as the
proprietary frontier models,
um and so on. You can go one level down the
stack as well. Um, which is what we were
talking about in managed deployment. Um,
in this case, you typically go to a LLM
market space. Uh, this could be, you
know, uh, Hugface inference endpoints,
Lambda Labs, Prime Intellect, model.
There's there's a lot of competition.
There's a lot of healthy competition out
there, healthy providers out there each
trying to um, give the best sort of
service to you. Uh, pick your own uh,
GPU. So this could be an H100, A100, T4,
L4, whatever works with your budget
size, whatever works for your for your
specific use case. Um, and that's it.
And then just deploy, you know, like in
in most CAS cases, it is as the as the
image on the on the slide is. Um, you
pretty much just select a model. In this
case, we select Quinn next. And um, you
just hit deploy, right? Um, and that's
it. And within like two or three
minutes, you have a deployed endpoint
that you can then use. Um, last but not
the least is um is how to use um um is
is is if you want to sort of deploy
these models yourself, you might want to
do this in case you want to have the
maximum control. Um you know, you want
to sort of make sure that wherever your
prompts go is is something that you you
want to have full control on. you want
to make sure that the that the model
abides by your specific rule cases and
so on. Um in this case you you would
typically start by choosing your own
inference engine. Uh there is a huge
variety of inference engines uh namely
VLM, SGLAN, TGI, there's a lot of other
like niche uh inference engines which
you can choose from. Um once you've
choose these um these inference engines,
you provision your own um sort of cloud
GPU cluster. This could also be just you
know like a GPU lying in your uh
basement and um you set you set it up
and that's it. And you have like a
private and secure deployment. Uh each
of the three sort of um ways that I
spoke about has its own utility,
right? So if you want to move fast,
uh you would probably want to start with
something which is serverless. Um if you
want to have a bit more control, and you
want to manage the intelligence that you
supply to your app
uh without having a dedicated
DevOps or MLOps team, uh then you might
want to look at managed deployments. If
you want to have really a lot of
control over what is going through, um
you want to have full provenance of your
prompts, you want to have full
provenance of your outputs, then you
want to have something um that
you deploy yourself. Um the best
part about all of this is that, given you
have the model weights, you have the
optionality to, you know, choose,
essentially. Um and you can go through
the stack, you can work through
the competition, you can work through
all the platforms and so on. Um but now
you might be asking, why do I even
need so many deployment options, right?
This is a classic sort of buy-
versus-build um sort of, you know, um
argument. No matter where you are in
the life cycle of a
startup, of a business, and so on, you
want to always increase the
optionality that you have with respect
to um what you're providing, right? So in
case your startup or your
application depends on uh LLMs, you
want to always
have a failover, one over the other;
you don't want to always depend on just
one provider. Uh a case in point is um
the recent issues faced by Anthropic. Um
no shade to Anthropic — I use Claude
every day. If there's someone here from
Anthropic, I love you guys. Uh but this
is a recent uh blog post from them which
was about the timelines of um
issues that they faced while serving uh
Claude Sonnet, um as well as, if I'm not
mistaken, Opus models. And the issues here
specifically were that um they had
an issue with the context window, and
they had an issue with uh the way
they were routing um the prompts. And
what all of this led to is that you
were calling the Anthropic
models the same way as you always were, but a
small percentage of these requests
were not being fulfilled the way
they were supposed to, meaning you were
getting slightly lower quality uh
outputs at the end. Given you have no
sort of visibility into what's going
wrong, this can cause quite a bit
of issues, because without having any
visibility on what's going wrong up the
stack, um you just get degraded outputs.
And um this is something that, you know,
based on whatever use case you're
working on, you would want to
optimize for, right? And of course, like,
we've all heard the lore, right?
Like, Sonnet is dumber during the day, but
you know, it becomes much smarter during
the evening; like, you know, uh
when you're coding at
night, Claude Sonnet is much nicer, and so
on. And um of course, a lot of this is
placebo, but the fact that we don't
really have access to the model weights
makes it much more difficult to be able to
um narrow it down to one thing. Um and as
uh one of my favorite
researchers that I look up to quite a
bit, Andrej, says: um, not your weights, not
your brain.
Um and again, you know, you
want to have as much optionality as
possible, um and so on. So now that
we've established that open LLMs
are good enough, and that
there are quite a lot of options
for you to use these LLMs, uh let's look
at some of the recent um trends in
the open LLM landscape. Um so we'll go
through three trends. Um the first one:
up until like last
year, um up until December of 2024,
OpenAI o1 um was the state-of-the-art
reasoning model. Um and uh when they
released the model,
it was one of the
first um sort of thinking models, in the
sense that before you get to the final
response, the model sort of contemplates
about the response itself. And because
it contemplates, you get a higher
quality of output. Uh when they released
the model, they decided to hide
this chain of thought, this
contemplation, from the user. Um fast
forward to January of 2025: DeepSeek, one
forward to January of 2025. Deepseek one
of the um one of the key um LLM players
from China released Deepseek R1. Uh this
was a very huge 685 billion parameter
model. Uh MIT licensed meaning you can
use it for commercial uh use cases. You
can use it for any of your own bespoke
use cases. Um it like they released this
model which was competitive with uh
OpenAI 01 with Gemini um at that time
and um and a lot of other models. The
best part about this was that they
opened the entire chain of thought for
anyone to use. Uh which means that chain
of thought was not sort of commoditized
anymore. Um any sort of open
participant, any startup, any business
could have the same sort of capability
of getting higher result from the same
model. Um from there on in fact they
didn't just stop there. Um they proved
that because the because the chain of
the thoughts are now because the chain
of thought is now public, you can
distill this chain of thought to smaller
models. Um the way they proved this was
by distilling the the R1 chain of
thoughts um into a smaller Quinn 8B LLM.
Um and to their surprise, um when they
when they trained it specifically on
math, uh this model beat a 25 times
larger open model. So that's an 8B model
sort of uh beating um 235 billion
parameter model. Um
And this is where the sort
of reasoning revolution sort of started
this year. Um uh and now pretty
much reasoning has become a
standard across all models. Uh in fact,
in the recent GPT-5 series, all of the
models have reasoning, except they
have a reasoning effort. So the
models can have, you know, low reasoning
effort, medium reasoning effort, high
reasoning effort, um and so on and so
forth, and pretty much all open
models have a thinking variant and
a non-thinking variant, and so on.
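As an illustration of that effort knob, here is a hedged sketch using the OpenAI Python SDK's chat completions call; parameter support and the exact accepted values vary by model and provider, so treat the model ID and settings as examples:

```python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

# Same model, different "thinking budgets" via the reasoning_effort parameter.
for effort in ("low", "medium", "high"):
    response = client.chat.completions.create(
        model="gpt-5",                      # example reasoning-capable model
        reasoning_effort=effort,
        messages=[{"role": "user",
                   "content": "Is 9.11 larger than 9.9? Answer briefly."}],
    )
    print(effort, "->", response.choices[0].message.content)
```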
Next, um, in general, LLMs, um, are as
good as the amount of context you give
them, right? So, the more context you
give them, the better they will be at
their downstream task. Let's say you
want to summarize a research paper: if
you just provide the abstract of the
research paper to the LLM, um, the
output itself will be quite
sub-optimal. But at the same time, if,
for the same LLM with the same prompt,
you provide the entirety of the research
paper, the summary of the research
paper will be much more guided,
much more on
point, and so on and so forth. What this
means is that these LLMs require a ton
of context to be effective. Back in 2024,
um your average open LLM had a really
small context. So uh 32k to like 64k
tokens was uh kind of the
standard: you know, you would see
your Llama, you would see your uh Qwen
and Gemma and so on, and they would
pretty much range around 32 to 64k, which
means that they were good enough, but
they weren't really useful for uh
large-context tasks the way, you know,
OpenAI was, or probably Anthropic was, and
so on. Fast forward to 2025: 128K, 256K
has become, you know, the standard,
which means that all the use
cases that you previously had to depend
on um proprietary models for, you
can now do the same with open
models. And not just that: um, you know, 1
million tokens of context is even feasible
now. This is also feasible with
proprietary models, so this is not
something new, but um we have Qwen models
which are capable of 1 million context;
you have Llama 4, which is capable of
even 10 million tokens of context. So you
really are now not bounded by the
context itself; you can really just
try a lot of um these um experiments.
As we started sort of increasing
this context, right — as we
started being able
to sort of squeeze as much
information into these models — there was
another interesting trend, uh which is
that the cost of these models
also started decreasing. A lot of this
sort of came from, um, you know,
optimizations at the software level,
at the hardware level, as well
as at the architecture level. But the
fact of the matter is, if you had one
euro or one USD, um you get much
more bang for your buck now than you got
at the same time last year, right? Uh this
is again like a chart explaining the
same thing from the good folks at
Artificial Analysis, um and so on.
Heading to the third trend: um
back in 2024, um open LLMs required a
steep learning curve. Um there were a
lot of problems. So um for you to be
able to deploy um a local LLM, you had
to check whether or not the chat template
was correct. A chat template is a
way in which your model learns how
to map an input to an output. So if you
say, hey, summarize this paper, um
it would format it in
a way that uh the LLM understands, and
then, you know, you get an output. Um and
these were quite malformed; they didn't
have any sort of, you know, standards. Um
tool calls were very difficult. Um we had
uh model precision issues: um, so
essentially, we didn't know
whatever precision the model was trained
on versus whatever precision the model
was deployed at. So there were quite a
lot of issues like this, and of course
there were latency issues and, you know,
memory requirements um for those.
Uh fast forward to 2025: uh we now
have standards for chat templates, so
everyone has defaulted to the ChatML
format. Um 4-bit and 8-bit quantization
have become first-class citizens. In fact,
all the new LLMs that you see are
pretty much FP8-first, which means that
you do not require as much um
VRAM on your GPU to um use these. And
these are native, which means that you
don't really lose out on
um any of the performance. Uh and in
fact, now, with the recent OpenAI GPT-OSS,
we've seen that 4-bit is also becoming
kind of like a standard.
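To make the "standard chat template plus cheap quantization" point concrete, here is a small sketch with Hugging Face transformers and bitsandbytes; the model ID is an example, and the exact precision a given model ships with varies:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"   # example open model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The standardized chat template turns a message list into the model's expected prompt format.
messages = [{"role": "user", "content": "Summarize this abstract in one sentence: ..."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 4-bit loading cuts the VRAM requirement substantially.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```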
Um but surely there has to be a catch,
right? Like, all of this is too good to
be true, right? Um and in some cases,
maybe it is. Um let's look at where
proprietary models still win today. Um
first of all, general reasoning.
So what we've seen is that open models
are very good at certain specific tasks.
So you would see open models being
really good at, for example, tool calling,
really good at, you know, math, really
good at uh science, really good at
coding, and so on. But there's no one
model which is good at everything,
um much like what your proprietary
models offer, um similar to, you know,
Anthropic, similar to OpenAI, and so on
and so forth. So there is still a
hill to climb for
general-purpose reasoning when it
comes to it. End-to-end multimodality is
uh quite superior for um proprietary
models, and more specifically, in
this case, OpenAI. Uh, OpenAI has a huge
sort of margin when it comes to GPT
Realtime, um, GPT-4o, advanced voice mode.
Just the fact that you can have an
actual 5-minute chat with uh the
advanced voice mode — ask it
to go fast or slow, um,
or ask it to teach you
French, teach you German, whatever it may
be, you can just ask things — um when it
comes to advanced voice mode and also GPT
Realtime, this is something which we haven't
really gotten to when it comes to the open
ecosystem yet. Um,
last but not least, um proprietary
models have a sort of very nice, well-
defined safety and jailbreak
scaffolding. Uh when it
comes to open models, you often have to
define this yourself: you have to make
sure that they are, um, you know, covered
from all sorts of issues, all sorts of
uh potential um jailbreaks, and so on, and
this is something which proprietary
models have kind of mastered at this
point.
So if we were to summarize this
in one slide, um what would be a
playbook for you to simply just try open
models: pick a simple
project, any project um that there may
be; swap the proprietary model with the
open model; um evaluate on the same tests
that you have been evaluating on; swap
with another model; tune your prompt; and
let it rip, pretty much. Um
Last but not least: um what's next,
and what is some stuff that I'm
quite excited about from these trends?
Um first of all, I'm quite excited about
smaller and domain-specific models.
Um recently, um the Gemma team
at Google DeepMind released a 270-
million-parameter Gemma LLM. It's a small
multilingual LLM which can run in your
browser, can run on your devices, and so
on. Um and these sorts of small
domain models are, in my opinion, the way
to go: you lower the cost,
you increase the access for a lot more
people, um and so on. Um second is effort-
based reasoning. So this has started to
become a thing, which OpenAI is
doing, which is um, essentially, from the
same model you can just define what kind
of effort you want from the model. So
do you want a low reasoning
effort, a medium reasoning effort, or a
high reasoning effort? And this
effectively makes your singular model uh
effectively three models, and uh for the
same deployment you get much more bang
for the buck. Um better quantization
schemes: so, you know, FP4 becoming uh
like a norm, and 4-bit um quantization
becoming the go-to for uh all
deployments. Um and last but not
least, sparser and faster. So at this
point we all know that all major
frontier LLMs are MoE, which is mixture of
experts, and the sparser the MoE is,
the fewer the active parameters and the
faster the inference will be, and we're
seeing some trends
towards sparser MoEs, and I
hope the community decides to sort of
double down on that. Um, that's it. You
know, I would recommend: be close to
the source. Um, and always default to
open. Um, thank you very much.
[Applause]
Thanks, VB. Please, let's give it up for
VB.
>> Thank you. And if uh you have any
questions for him, I think uh I believe
you're still around, right?
>> Yeah. Yeah, I'm around. Yeah.
>> Awesome. Yeah. Thank you.
>> All right. So our next speaker,
our next speaker built machine learning
infrastructure at Uber, Apple, and Adobe
before she co-founded Arize AI. She has
also been recognized on the Forbes 30
Under 30 list for her impact on AI. Today,
she'll show us why system prompts
shouldn't stay static and how agents can
actually evolve their instructions in
real-world environments. Please
join me in welcoming to the stage the chief
product officer at Arize AI, Aparna
Dhinakaran.
[Applause]
[Music]
[Music]
Hey everyone, welcome, welcome, welcome.
All right, today I'm going to talk to
you all about prompt learning. Um
hopefully we have a good next 20 minutes
together. So a little bit about what
I do: I'm one of the founders of
Arize AI. Arize AI is one of the leaders
in AI observability. We help teams go
all the way from development to
production. So we help teams actually
trace their applications and evaluate them —
we're going to talk a little bit about
Swix's controversial statements today
uh about evals — um and then also
actually help them develop and iterate
using prompt iteration. Um, so let's
jump into it. So this is actually a real
post from Hacker News that I just put on
this slide, which is I think just really
emblematic of what people are feeling
today. Somebody's asking, are there any
real examples of AI agents doing work?
Um, and he's asking, kind of, does anyone have
an example — which I understand to be
intelligent — that isn't just a glorified
or rebranded workflow automation? And I
think you get a sense of this kind of
skepticism from people who are building
with agents today because they're just
really brittle. It's really hard to get
them to work in the real world. And um I
think what we're starting to see is that
there's common patterns um that really
good agents are starting to have. And
when they don't have them, they tend to
be brittle. One of them is the system
instructions: if they remain static and
they're not consistently being
updated — this is what we're going to
dive into today — that tends to
keep the agent from actually
learning from its environment. I think
there are screenshots now of
all the to-do lists inside of Cursor and
inside of Claude, but actually
being able to have, you know, planning
that updates is something that's
starting to become a really common
pattern in agents. Um how they call
tools, or the guidance around tools, tends
to be something that I think some of the
better agents that we're using do really
well. And then context engineering is a
whole domain where, you know, when
the context isn't passed correctly
between agents in a
handoff, we start to realize that this is,
you know, where agents end up
tending to be brittle. But today, what
we're going to actually talk about is
system prompts. This is a really viral
system prompts. This is a really viral
tweet that Andre Karpathi kind of
tweeted in May. Um and I think he's
actually starting to hit on something
which is really interesting in this
space. What he's talking about here is
that there's a major paradigm kind of
missing for LLM learning. He's given it
a name, system prompt learning. He's
like pre-training for knowledge,
fine-tunings for habitual behavior. But
there's a lot of human learning that
actually feels like a change in the
system prompt. So, you learn something
from your environment. Something doesn't
go right. You end up taking a note for
yourself. Hey, next time I see this
scenario, I'm going to act like this.
And this needs to go somewhere. And if
you end up putting it in pre-training, I
mean, that's a lot of work. It's a
significant amount of effort. Uh you can
do it in fine-tuning, again, significant
amount of work, but you actually have
English feedback and an explanation that
you can just put in to the system
prompt. And this is significant a
significantly kind of higher dimensional
feedback than just a scalar score with
typical, if many of you guys are
familiar with like RL type of um
approaches. And so why not use these
system prompts to actually pass in this
type of feedback and make sure that the
system prompts don't just remain static.
This is actually Claude's system prompt
that got leaked on GitHub uh a couple of
months ago — roughly 24,000 tokens, uh
about 18K words. And you can see here that
this is pretty detailed in terms of all
sorts of conditions and how Claude
should behave. And I think the key
takeaway from this is that this didn't
happen in a single iteration. No one
just wrote the system prompt overnight
and then put it into production. This
was meticulously shaped by collecting
data, looking at where Claude didn't do
well, and then actually using that to
iterate on the system prompt. If there's
anything you take away from everything
I've said so far, it's that system
prompts are actually really key to
building effective agents. And as agent
builders here, that is something that
you can actively shape. And there are all
sorts of different prompt-optimization
approaches that we're going to talk
through today that you have in your
toolkit. There's the old
approach, which is more traditional RL-
type approaches. So in the traditional RL-
type approach, what you have is a scalar
reward. You have a scalar reward, and,
you know, there's some sort of gradient-
descent-type algorithm that actually
uses it to go update the RL system. It
requires a lot of examples. It's, you
know, really expensive in this world.
There's another type of approach called
meta-prompting, where basically you are
now actually updating the prompt
itself, but you're still using kind of a
scalar score to identify what are the
tweaks in the system prompt
you should actually make. What we're
going to be talking about today is kind
of a new approach, prompt learning,
inspired by Karpathy, which actually uses
the English feedback — so not just the
scalar scores, but it actually uses the
English feedback — to improve the
prompt. And I'm going to show you on
actually a type of agent
that's really successful: coding agents.
Um so just to recap what system
prompt learning is: it basically takes
the data — so this is the inputs, the
outputs, and also the explanations or the
annotations — it takes the original
prompt, you pass that into a
meta-prompt, and then you get
a new prompt out. Um this is kind of the
theory. Let's see how this actually
works. We put this to the test, and I'm
going to run you through the benchmarks
that we did on the Cline system prompt. So,
for those of you who don't know what
Cline is, Cline is one of the leading
open-source coding agents. So
it's entirely open source; you can go look
at the system prompt today. Um we heavily
cut the system prompt — it's like 30
pages long — um but the system prompt
basically has, you know, whatever the
system prompt is, and then a section
where you can actually add rules.
So if you're familiar with Cursor
rules, Cline also has, you
know, its own rules that you can go
in and add. Typically, when you use
out-of-the-box Cline, the rules are
empty. So we started off with
the system prompt with empty
rules. And the first thing we did was
run an initial benchmark: how does the
initial system prompt do, um just
on its own, without any modifications? We
tested it on SWE-bench Lite. SWE-bench
Lite is, you know, 300-plus software
engineering problems. Um and we ran
Cline in plan mode on SWE-bench
Lite. So Cline, just for context, has
both plan mode and act mode. Plan mode
is basically where it generates a plan,
but it's not actually generating the
actual code and then running the code on
the actual problem. We're
currently working on act-mode results,
so, you know, hopefully I'll tweet about
this soon. Um but we tested this using
plan mode, and uh in plan mode, I mean, the
results were, you know, okay — as you can
see, it's not great, but around 31%
basically on SWE-bench Lite.
Um and if you go in and you actually
look at, you know, where Cline
actually fails — and these are some
examples just to kind of build some
intuition — so this is actually one of
the problems from uh a library called
marshmallow. It converts complex data
types, and Cline was asked to fix some
type of bug where, um, you know, if the
input was None, then the
program used to just crash. So, can you
handle that better? Um so Cline
basically, you know, did exactly
that: if data is None, just return.
Um but it fixed just that single case.
But the input can actually be not
just a single kind of value; it can
actually be a list, it can be a complex
data type. So it actually has to go in
and check all these different scenarios
before it can just return. So this was
one type of, you know, scenario where
Cline jumped to a very minimal fix
without looking at all the
different types of cases that the input
can be. Um here's another one that it
initially it failed at. uh another kind
of repository, Simpy uh Simpai, Python
library for kind of symbolic mathematics
where basically uh if there was some
sort of operand like you know two times
a matrix but you had some other type of
operand like an at sign it would uh
incorrectly behave like multiplication
instead of just failing. Um what it
should do is just like raise some type
of error or um you know suggest some
type of fix. In this scenario, what it
basically did was that it actually, you
know, it ignored kind of the the actual
Python language kind of contract around
what you should do if it's, you know,
the fallback option for Python and it
wrote some sort of ad hoc fix which
didn't take care of multiple scenarios.
Um, again, another type of situation
where the fix kind of looked right, but
you dug in deeper and it wasn't actually
the right solution. So, we decided to go
and actually test out prompt learning on this. We took Cline with the original system prompt it came with, ran it across the entire SWE-bench Lite set of problems, and it generated a whole suite of outputs. We then took those outputs, and the test set from SWE-bench actually has a golden data set: the ground truth with the actual PR and the actual test patch. We passed the solution Cline generated, along with that ground truth, to an LLM as a judge to evaluate: is this the right solution? Did it generate a plan that would actually solve this
problem? We wrote a whole template and passed those inputs in. And this is the key part: we didn't just ask for correct or incorrect, we asked it to give an explanation. Why was it wrong? Give me a reason why. This is important because, whether you do it with an LLM as a judge or with human annotation, this is the English feedback Karpathy was talking about in his tweet. This is the higher-dimensional feedback that an LLM can actually take and use to update its system prompt rules.
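To make the "explanation, not just a score" point concrete, a judge template along these lines could be used. The wording and field names here are hypothetical, not the actual template from the benchmark:

```python
# Illustrative LLM-as-judge template: it asks for an explanation,
# not just a correct/incorrect label. Field names are hypothetical.
JUDGE_TEMPLATE = """\
You are evaluating a coding agent's proposed plan against the ground-truth fix.

Problem statement:
{problem}

Agent's proposed plan:
{plan}

Ground-truth patch (golden data set):
{golden_patch}

Answer with:
1. verdict: "correct" or "incorrect"
2. explanation: 2-4 sentences on WHY, e.g. which cases the plan misses
   (None inputs, lists, operator contracts, etc.).
"""
```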
So we took these explanations and passed all of this into the meta prompt. The meta prompt now has the original prompt we started with, which, remember, had no rules. We then passed in the data: what was the problem to solve, what was the solution Cline came up with, and whether it got it right or wrong, along with the explanation. All of this got passed into the meta prompt. There's a lot to unpack around how you manage the context window, like whether all of this will fit and how many examples to use, so hold that thought for a bit. But basically all of this goes inside the meta prompt, and we were able to generate a new system prompt.
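A rough shape for such a meta prompt is sketched below. It is illustrative, not the exact prompt used in the benchmark, and it folds in a length constraint like the one discussed in the Q&A later:

```python
# Illustrative meta prompt: original prompt + annotated examples in, new rules out.
META_PROMPT = """\
You maintain the rules section of a coding agent's system prompt.

Current system prompt (rules may be empty):
{current_prompt}

Annotated examples from recent runs. Each has the problem, the agent's plan,
whether a judge marked it correct, and the judge's English explanation:
{annotated_examples}

Update the rules so the agent avoids these categories of mistakes.
Merge similar rules, keep the rules section under {max_words} words,
and return the full updated system prompt.
"""
```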
This is just a diff view showing that in the old world the rules were empty, and in the new world there are all of these rules. I couldn't fit them all on a slide, but even in something like the five to ten loops we did, there were close to 100 different rules, covering all the different types of use cases and errors Cline would make, and all of these got passed into the rules section. Fun fact: if you're using Cursor or Cline, this is something you can do today. So all of this goes into the rules.
You might be asking: okay, does this really work? What was the performance? Tell me what the benchmarks were. Before I get into that, I'll tease a little with the same problems we saw earlier, the marshmallow library and the SymPy library. We reran both with the new, updated system prompt, and this time it was actually able to generate a correct solution. Each one of them had a corresponding rule that had been added to the system prompt. Overall, with about 5 to 10 loops, we saw roughly a 15-point improvement in test accuracy, going from around 30% to around 45% after 10 loops. Just in case any of you are curious, on SWE-bench Lite we split it 50% train, 50% test, ran this across the roughly 300 software engineering problems, and it took about 5 to 10 loops to see improvement. We also tested this on BBH, which is known for much more difficult tasks for language models, and BBH has different categories of problems. What we were really excited about, and we'll share these results more publicly soon, is that on some of the harder types of problems, like salient translation error detection or snarks, even after just five loops, something the model was not doing well on saw a pretty massive jump in performance, with very minimal regressions in the other categories. So this is one where just adding the annotations, adding where it actually didn't go well, produced some pretty incredible improvements.
So what are the takeaways for this group, for agent builders? We just ran this on what you could call the most successful type of agent out there, coding agents. The key takeaway we wanted to share is, one, collecting those errors and actually running evals or doing some kind of annotation on them is really important, because you won't be able to understand what to go fix if you're not collecting those examples. This is an example of one of the things we do, which is tracing the entire application. This is typically what people mean when they talk about online evaluation versus offline evaluation. One of the things Swix was saying yesterday in his keynote was: should evals really be blockers before you ship something into production? I actually don't think they should be. I think you should ship things and put them into production. But what most people don't talk about is that online evals, actually running evals in production on top of your data, are probably way more important, because now you have the data, the traces, your logs, that you can use to identify what's working and what's not. And that's more important because it's your own data. You're not writing some BS tests that you use as a blocker before you roll things out; you're actually evaluating your traces, and then you trust those traces to feed back into iteration, just like we did when we grabbed those annotations and fed them back into prompt iteration. While we were iterating, we were running those same evals. This is just an experimentation view where we tested those system prompt changes, tested which rules were better, and used that almost like unit testing before deploying into production. So when people tell you to run evals offline, run them before you deploy, the way we think about it is that those are unit tests, and unit tests are important, but real data, real production systems, getting visibility into that, is often more important.
Some of you might ask: okay, did prompt learning only work on coding agents, or is it useful in other domains? We've actually been using prompt learning across a lot of different domains, with a ton of our customers and use cases, and even on our own agent. This is another set of results we've published, around structured JSON webpage generation, so more structured outputs. Basically, after five loops we were able to see significant increases in accuracy. One interesting thing here: as the rule set grew, accuracy didn't necessarily keep going up consistently. So how many rules you actually pass in is something you need to iterate and tweak on for your use case. This is another use case, support query classification, so think of customer-support-bot-type agents. Here it goes vertically, from one loop all the way down to five loops, and we saw about an 8 to 9% accuracy increase as we increased the loops. So the big takeaway: I think what we're going to see people talk about a lot more this year is how to have prompts consistently learn from the environment. Some of these updates will be like I showed you, human in the loop, a human going and updating those system prompts. But I actually think a lot of these updates are going to be totally automated: you're going to collect data, run evals, run a prompt optimization approach, and update your system prompt. And this is going to be a big paradigm shift in how people build agents. Awesome. Thank you so much for your time. Happy to take any questions.
>> Awesome. Thank you, Aparna. Would you
like to join me for a few followup
questions, please?
>> Yeah, let's do it.
>> Awesome. Okay,
please.
>> This was fascinating. I like that you used a software engineering agent to demonstrate it; that was pretty cool. So actually, yeah, I have a question regarding the context window, since you said that.
>> Yeah,
>> we're going to talk about it later.
>> Yeah.
>> Um, yeah. So, how do you manage that?
>> Okay, good question. There are maybe two parts to this question that people commonly ask. One is: won't the new prompt that's generated just become super big,
>> because you have all these rules that can be generated. Is it creating a new rule every time it sees some type of error?
>> One of the things you can actually do is,
>> in the meta prompt, pass in rules like: hey, keep the new prompt to a certain length, keep it to X number of words. So the meta prompt gives you some ability to force the new prompt to fit within certain parameters, so you don't end up passing every one of those rules into your new prompt. The other question you might be asking is: okay, what if I just have a lot of examples, will all of those fit inside the meta prompt, and what does the context window management of the meta prompt actually look like?
>> I think this one's actually the harder question, because the more you have this application or this agent running in production,
>> right,
>> you're going to collect more and more scenarios of failures. That's a good thing.
>> But
>> every time you do these types of identifications, we see teams trying to say, okay, this is one specific scenario, this is another specific scenario. What we end up recommending is trying to find categories of problems. So do error analysis, but also error categorization, so you don't have to pass every single example into the meta prompt; instead you pass examples that are representative of a larger category of problems.
>> Okay. But when you pass all that data, all that information, to the meta prompt, can we get to a place where we start seeing some degradation in the
>> maybe in the generations?
>> For sure. I think this is a classic ML-type issue: do you get to a point where you overtrain or overfit, where you overfit the new prompt to be representative of just your
>> errors,
>> but it doesn't actually translate. So this is where it's important to make sure you're tracking the train accuracy, seeing that the train accuracy doesn't go down, comparing that with your test set, but then also having a completely blind set that you're not overfitting your meta prompt to.
>> And also LLMs are getting so good, right? We now have context windows of a million, two million tokens, and I think it's only going to get better going forward with all the infrastructure we're building. Hopefully,
>> yeah,
>> that's what's going to happen.
>> All right. Okay. So, can you tell me more about the difference between prompt learning and how it compares to frameworks like DSPy?
>> Yeah, this is a good question. We love DSPy. I think DSPy is awesome, and they've put out a lot of different prompt optimization approaches. I don't think about this as us versus them; there are going to be more prompt optimization techniques out there. DSPy has a couple that use a slightly different strategy, like MIPRO and few-shot learning. Some of those are a lot more, call it, programmatic, in that they take a few examples and just pass those in, and those end up being what gets passed into your new prompt. Some of them still use scalar rewards. Mhm.
>> So like I was explaining earlier, those don't actually use the English feedback; they just use the scalar reward to improve the prompt. So some of those strategies are just different from what we're proposing. But they did release a new approach called GEPA. It's very recent, and it's philosophically similar to what we just talked about with prompt learning. We're actually currently running some benchmarks on GEPA to see how it performs, so we'll probably tweet out some of those results soon. Awesome. One final question for you.
>> So, how do the prompt improvements fit into the context window of the meta prompt?
>> Okay. Yeah. So this is what we were talking about earlier, around managing your context window. Again, I think there are two things that really matter, that we think about all the time. Identifying and doing error analysis is really important: looking at your traces or your application and finding those errors. And then you don't have to pass every single error back into the meta prompt; you can start to pick examples that are representative of an entire category. That way you're being smart about what you pass to the meta prompt.
>> All right. Well, thank you so much,
Aparna. Thanks for your time and for the
talk.
>> Thanks everyone.
>> All right. Oh, and if you're interested
at all in any of the things we talked
about today, there's a workshop I think
my colleague's running actually on
offline evaluation and prompt iteration.
So, it's a little bit more hands-on if
you want to try it on your own agent.
>> Awesome. Thank you. Let's give it up for Aparna.
Awesome. So, our next speaker was part of the founding team of Zenly, later acquired by Snap, and he's the founder of ZML, a high-performance AI inference technology aiming to push the limits of what's possible beyond GPUs. So, this is going to be fascinating. Today, he's introducing a breakthrough attention mechanism. Please join me in welcoming to the stage the founder of ZML, Steeve Morin.
[Applause]
[Music]
[Music]
Hey, thank you. Thank you everyone. I'm Steve, and today I'm here to talk to you about some breakthrough technology we've been working on for LLMs, which hopefully paves the way for unlimited context. First of all, we are ZML, pronounced "zee-em-el" or "zed-em-el" depending on which side of the Atlantic you're watching from. We are building a universal inference stack and engine: essentially, it's literally the same code, even the same binaries, that runs any language model, or any model, on any chip. But this is not what I'm here to talk to you about today. We're going to do a little bit of a deep dive into Llama, and into what we think is one of the most fundamental problems in LLMs today. So this is the Llama architecture.
Usually, every time you generate a token, this happens. That loop you see in the middle happens for each layer. What we're interested in today is one mechanism that is fundamental to the way a transformer works, and it's called attention. This happens at every single layer. There are a few operations per layer, but this one happens in every layer, and it has a tiny problem. This is the mathematical formula of attention. You don't need to get into the details, but the problem we all have with attention, and ultimately what constrains everything we see, from the hardware to the reason there's a context window, is that the algorithm runs in quadratic complexity. That means for everything you add to the input, you get to do the square of it, which makes us pretty sad and also pretty limited in what we can do. But there's something about that formula: if you squint your eyes, you might see something of interest, and that interest is the softmax. So what's with the softmax? Well, this is the equation of the softmax. Long story short, you don't need to read it, but it's essentially the exponential of each element divided by the sum of the exponentials. And what that means is that tiny signals become a lot bigger.
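For reference, the two slide formulas he's describing are the standard scaled dot-product attention and the softmax, as they are usually written (the exact slide notation may differ):

```latex
% Scaled dot-product attention
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V

% Softmax over a vector z of length n: each element's exponential
% divided by the sum of all the exponentials
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
```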
And what is the end result? The end result is that this is an actual output of attention over tokens, and what you can see is that it's mostly sparse. There's a lot of empty room. So, in the end, a lot of close-to-useless calculations are being done every time you run that algorithm. As an ex-backend engineer myself, I look at this, I squint my eyes, and I see: is this really a graph? Could we only do the yellow dots, and not the black dots? The GPU will compute everything. So, is this a graph? It turns out, yes, it is very much a graph. It's a graph in latent space, but still a graph. And what's very cool about modeling this as a graph problem instead of a pure raw matrix multiplication is that we can run it in log of n.
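To give a feel for the "only compute the yellow dots" idea, here is a toy NumPy sketch that computes attention for one query restricted to an explicit subset of key indices (the edges of the graph). This is only an illustration of the sparsity idea; it is not ZML's actual log-n algorithm, and how the neighbor set is chosen is not shown here.

```python
import numpy as np

def sparse_attention_row(q, K, V, neighbor_idx, scale=None):
    """Attention output for one query, restricted to a chosen set of keys.

    q:            (d,) query vector
    K, V:         (n, d) key / value matrices
    neighbor_idx: indices of the keys we bother to look at (the "yellow dots")
    """
    if scale is None:
        scale = 1.0 / np.sqrt(q.shape[-1])
    K_sub, V_sub = K[neighbor_idx], V[neighbor_idx]   # skip the "black" entries
    scores = (K_sub @ q) * scale                      # only |neighbors| dot products
    scores -= scores.max()                            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()                          # softmax over the subset only
    return weights @ V_sub

# Toy usage: 8 tokens, but we only attend to 3 of them for this query.
d, n = 16, 8
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = sparse_attention_row(q, K, V, neighbor_idx=[0, 3, 7])
```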
And as you can see on that graph, log of n flattens as the context size grows. This ultimately paves the way for unlimited context, because it's not n squared anymore. It's very nice, very promising, but there's a tiny problem. The tiny problem is that because it's a graph, we need branching. And that means GPUs are close to useless for this. They are very bad at branching. They can do it, but if you have worked on GPUs you might understand a bit more deeply why. CPUs are good at it, though. So we get into this situation where we might do the calculation, and model this as a graph problem, but only on the CPU. So the question is: can we do it fast enough? Let's do some math real quick. This is the per-layer picture, one layer of Llama, and there are 32 of them in an 8B model. If we want to run at 100 tokens per second, that works out to about 300 microseconds per layer. This is our time budget if we want to achieve that throughput, which is roughly 5090 territory.
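A quick sanity check on that per-layer budget, assuming the 32 layers and 100 tokens per second just stated:

```latex
% Per-token time at 100 tokens/s, split across 32 layers
\frac{1}{100\ \text{tok/s}} = 10\ \text{ms per token},
\qquad
\frac{10\ \text{ms}}{32\ \text{layers}} \approx 312\ \mu\text{s per layer} \approx 300\ \mu\text{s}
```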
If we look at everything around attention, we see that about 200 microseconds are already spoken for by all the other operations, which leaves us about 100 microseconds for the actual attention calculation. That's not a lot of time. So the question becomes: is it enough? Do we have enough time to do the attention calculation, the vector and matrix multiplications and so on, as a graph, on the CPU, all in less than 100 microseconds?
And actually we can, and the reason is that we only compute, essentially, the yellow dots. This is what you see as a trace of one whole layer, and it's pretty encouraging, because we only spent about 30 microseconds doing the actual calculation. And remember, we skip the black parts. But there's another side benefit: if we run the attention on the CPU and not on the GPU, we get more GPU memory. Normally, if you run a model on the GPU, part of the GPU is dedicated to the model and part of it, sometimes the majority, is dedicated to the KV cache. That creates a plethora of problems for those deploying LLMs in production environments, static routing and all of these things. But if we do the attention on the CPU, in system memory, then suddenly the KV cache doesn't need to live on the GPU, which gives us more memory for the model and also makes the GPU completely stateless. So in this case, the KV cache now lives in system memory: the GPU sends data to the CPU, the CPU runs the calculations, updates the KV cache, and sends the attention output back to the GPU.
But there's another catch: we now need a lot of CPU cores, roughly one per KV. So the question is, how do we get these cores, say if we want to do a batched run? Perhaps there's a way to get these cores, maybe not locally on the machine, which often has an underpowered CPU that's pretty much useless except for its PCIe lanes, but somewhere else. Because remember, the attention and the KV cache are now completely separate from the GPU. So maybe we could get this CPU power over the network. Remember, we spend about 30 microseconds on the raw calculation, which leaves us about 70 microseconds if we want to hit that throughput. So let's see whether physics is on our side; not easy, but let's do some calculations. Roughly, the payload we would need to send covers the attention heads, KV heads, and so on, which comes to about 10 KB for an 8B model, per layer, per request.
Now, let's be conservative and say we're running on a 10 gigabit network connection. No InfiniBand, no crazy stuff; a somewhat premium 10 gigabit link, but nothing unattainable. And physics tells us that the round trip of that data would take about 16 microseconds.
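Again a rough sanity check, under the stated assumptions of a roughly 10 KB payload and a 10 Gbit/s link, ignoring switching and software overhead:

```latex
% 10 KB payload over a 10 Gbit/s link, one way, then doubled for the round trip
\frac{10\,\text{KB} \times 8\ \text{bits/byte}}{10\ \text{Gbit/s}}
= \frac{8 \times 10^{4}\ \text{bits}}{10^{10}\ \text{bits/s}} = 8\ \mu\text{s},
\qquad
\text{round trip} \approx 16\ \mu\text{s}
```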
So suddenly we might ask ourselves: are we ready to spend 15% of the time budget on essentially unlimited CPU cores, since we can provision and deprovision them over the network?
I think that's a pretty good deal, if you ask me. And this is what it would look like: you do the dense calculations on the GPU, then extract the data, send it over the network, as UDP of course, because you want to be compatible, feed it to the other machine, compute the attention on the CPU, and send everything back. And you need to do this faster than the GPU would calculate it itself. The way we cheat is that we do much less calculation than the GPU would.
So, I mean, let's try it, right? Let's do a ping and... oh man. Oh crap. So, physics is on our side, but engineering obviously isn't. So how do we square that circle? There must be something we can do, because physics is on our side; at least at this point, we're not fighting physics. So maybe we have a little trick up our sleeve we can pull, and it's time for kernel bypass. At this stage, thankfully for us, or maybe not so thankfully, the Linux kernel might be too slow for what we're trying to do. There's too much latency. And so this is where we bring another technology into the mix, called DPDK, for those who recognize it. What it allows us to do is talk directly to the network card in a way that is very, very low latency; the driver is actually built inside the application itself. And what this gets us is much, much closer to the theoretical latency. The theoretical latency in this case would be about 16 microseconds; we measure about 20. So that is still within the budget, and this is exactly what we've been building. We call it attention D.
And the clicker has a slight lag.
And if you're up for it, I'm here to
give you a demo.
We're good.
All right. So just so you know, this is what it's running on: what you see is a machine with GPUs and a machine without. All the attention is running on the machine without the GPU, so nothing is missing. Let's crack this one. Can you see it? Nope. But it's fine, because I can do this and that, right? Okay. So, of course, this is a video, because, demo gods, right? But what you see on the left is attention D running on the left machine; you see all the cores at 100%. And what you'll see on the right is a standard Llama model running standard inference. Nothing special is happening on the model side. They're communicating over the network, and I'll let you see how it performs.
So we're loading the weights
and then off it goes.
But oh, where's my
I have lost the uh
quick time. Okay,
but we're not done. I think you are in
uh in luck because maybe we can do a
little bit more than this.
You see, the KV cache is not on the GPU anymore. That leaves us more room. So what if we could fill the entire GPU, without any room for the KV cache, only the model, completely stateless on the GPU? And so this is what we have for you: a live demo I'm actually going to do as we speak, in which we're going to run a 32B model in FP8 on a 32 GB GPU, which, as far as we know, has never been done, because it's useless in the real world except if you have this.
All right. So, oh, I have the feedback
here. Good. Um,
so are we
live? Okay. Hopefully the Wi-Fi is with
us. So I'm going to run attention D on
the bottom terminal.
Yeah, it's live.
And on the top terminal, I'm going to run a 32B model, Qwen 32B, on a 32 GB GPU. Just so you know, when that model is done loading, there are about 20 megabytes remaining on the GPU. So, essentially nothing left. So
let's try it. I'm going to tell it to
write an MLP CUDA kernel. Hopefully it
doesn't crash.
It crashed.
That's how you know it's a real demo,
right?
And there we go.
So just so you know, this is as far as
we know, it's the first time it's ever
been done. That model isn't supposed to
run. Thank you.
So, and just so you know that it's not,
you know, a fake thing, I'm going to
kill attention D. And yes, generation
stops because it's now, you know,
shooting packets into the void.
All right, let me just
go back. So, the demo gods were with us
today. Pretty good. Thank you very much.
Uh that is us.
Feel free to or be sure to check out our
repo. This is open source technology. Um
and you know would love to see you part
of the community. Thank you.
>> Amazing talk and demo. Would you like to join me for a few questions?
>> All right.
>> Okay.
>> Um you can have a
>> Oh, water. Thank you. Yeah.
>> It's surprisingly hot. I know, right?
Especially when you're on a live demo,
>> right?
>> And you don't know if it's going to
flash. Yes.
>> Um,
>> all right. So, I'm going to start with
the elephant in the room. What What is
written in your t-shirt?
>> Ah.
>> So, for those like me and like our team who are terminally online, this is a reference to Zuck.
>> Okay.
>> He had a shirt like this, but it said essentially "Zuck or nothing", and ours says "ZML or nothing". So I'm like, well, no, this is us, right?
>> This is ZML now, not Zuck.
>> All right. Okay. So, um, you touched on
branching and you said the GPUs are bad
at branching. Can you can you explain
that to the rest of us don't understand?
>> Yeah. Well, GPUs are very good at parallel processing, and the reason is they operate in a mode called SIMT, single instruction, multiple threads. We could go into the nitty-gritty details, with warps and so on, but at the bird's-eye level:
>> long story short, when there's a branch in GPU code, the cores that take one side of the branch will wait for the other side to finish.
>> Yeah. So essentially, let's say you take modulo two and you want half the GPU to work on odd numbers and the other half on even, and you put an if in the middle:
>> half of the GPU will always be asleep, waiting for the other side, because the instruction sequence is common to at least a whole group of processing elements. So they're really bad at this, which is why people spend a lot of time on clever tricks. But GPUs are very good if you're doing image processing, for instance, because you do the same thing for every pixel, or matrix multiplication for that matter.
All right. Okay. So, you showed this and you spoke about attention. I'm wondering if this is specific to the transformer architecture, or can we imagine other architectures also being compatible with it?
>> At least it's related. The transformer architecture uses attention. There are other architectures; everybody's rushing against that n squared.
>> N squared is the reason we have HBM. It's the reason data sizes are shrinking, or trying to shrink, with quantization and so on. So it's part of the transformer architecture, at least attention, but not only; sometimes you see assemblies of common architectures, and there are also linear architectures, Mamba for instance. But for now, and for a good while, everybody's been trying to unseat it and it hasn't been
>> unseated. Yeah, transformers are here for now, at least, you know, until next time. Famous last words, right? But
I had one more question for you if
that's all right. Okay. Um so I think
where do you see ZML going after this?
Like what would you like the community
to do with it? Obviously it's open
source. I see that you have we have a
GitHub link here. What do you think is
the next step for it
>> as a community? I mean what we're about
is being
very very radical about inference
um at essentially all the the the layer
in the stack from you know single digit
microscond stuff to you know throughput
whatever uh you know loading a model in
like we have benches in which we load
like an entire model in like one second
from SSD
um so we're very radical about very high
performance, very low level um and
building the tools to achieve this,
right? Um it's I mean number one, it's
way too expensive. Um but also we are we
haven't been you know computebound in a
long time
>> and so what we're trying to push with
with with ZML is
how can we build you know an ecosystem a
software stack um a product line
which is entirely built around the
thesis which is universal you know chip
support very low latency
uh you know everything is a hot pass you
know type of thing because that you know
attention mechanism you know we demoed
uh you could you know in theory
implement it in PyTorch right but it
would run at like maybe a tok maybe a
token per second
>> uh because that 30 microcond would
become millisecond right so at some
point fighting latency is very hard and
this is what we're about um but from the
ground up and as a community yes we have
a framework that that is open source we
want people to use it there's refes we
are the first users of our framework so
you know bear with
uh and it's constantly changing etc. But
it's you know improving very very fast
and we have some cool demos many many
more incoming.
>> Awesome. So, pushing the boundaries and trying to keep physics on your side.
>> Stay compatible, but stay closer, I would say, to physics.
>> Awesome. Well thank you so much Steve.
That was fascinating.
>> Thanks man.
>> All right. Thank you.
>> Let's give it up for Steve one more time
please. All right.
Okay, so we're back at 4:30. Now it's time for a well-deserved break. We still have coffee from our friends at Tim Foil. So, I'll see you here at 4:30. Thank you.
[Music]
Heat. Heat.
[Music]
[Applause]
[Music]
Heat.
[Music]
[Applause]
[Music]
[Applause]
[Music]
Heat.
[Music]
Ladies and gentlemen, please join me in
welcoming to the stage your MC for the
AI engineer Paris, developer experience
engineer, Ralph Chabri.
[Music]
[Applause]
[Music]
Yes. All right.
Welcome back to the main stage. We still have three amazing speakers left, so I hope you're going to enjoy this. I know our speakers have prepared a lot for this. Up next is a developer relations engineer at Llama Index, where she helps developers build production-ready agentic applications. Today she's here to show us, realistically, the abstractions you need to build an alternative to Notebook LM. So you've got to pay attention to this. Please join me in welcoming to the stage, developer relations engineer at Llama Index, Tuana Çelik.
[Music]
Hello everyone. It's great to be here at the first AI Engineer in Europe. Let me just make sure this screen is on. All right. So, hopefully I can get my slides up. I'm going to be talking about building a Notebook LM alternative, fully open source, and hopefully I can inspire some of you to try it out. Without further ado, this is me. My name is Tuana, although you just had an introduction, and I have been with Llama Index since May. So, why Notebook LM? Well, Notebook LM is pretty cool. If you have not used Notebook LM yet, I am going to show you quickly what it's all about. And this whole talk is about, realistically, what abstractions we need to actually build an alternative to Notebook LM.
So this is Notebook LM. When you start up a notebook in Notebook LM, you're greeted by this kind of page, where you're asked to drag and drop a file, or you can link it up to your Google workspace, provide some URLs or YouTube links, or simply copy and paste content in there. And what you get out of it is actually a lot of stuff. Here you see me the other day: I've provided a document, an IKEA guide to building a kitchen, and you have a summary here. You can do some question answering, and you can do so much more. On the right-hand side you see that you can get an audio overview, you can get flashcards for an FAQ, you can even get a video generated, and, one of our favorites, you can create a mind map out of the whole context present in your document, which is a great way to abstract over all of the information in your document and make it easier to think about.
So we set ourselves a challenge, and we also decided to make our lives a bit difficult with these constraints, because we really wanted to focus on creating an alternative while doing it justice. These are the three main things we wanted to focus on. We wanted to be able to handle complex documents, and when I say complex documents, I'm talking about documents with tables, images, what I love to call a layout bonanza, because not every document has a uniform layout. We also wanted to make it, and I'm going to show you later what was really the pain here, reusable, especially the functionality and tools we were building. Again, this is an open source project: we wanted you to be able to use any of those tools and functionality separately, without having to adopt our whole project to use them. And finally, and you'll see why this is important, we also wanted to have some level of control over the flow, which is where things start to differ from what you might perceive as classic agents, where the agent gets all the agency to decide what to run and when. We wanted a bit more control over the process.
And it so happens that I work at a company that provides many tools and products that help us get to this project and make it easy to build. So I'm going to talk about Llama Index very briefly. We have two main sets of products at Llama Index. A lot of you, I hope, know about Llama Index, the open source framework that allows you to build your own agents and design your own logic. That is our open source framework; that is one layer. But we also have Llama Cloud, and Llama Cloud has quite a few products now, growing every day, which let you parse complex documents and extract structured information out of them. You can also use Llama Cloud as your own managed vector store that you use to chunk and embed your own documents. So great, we have some ingredients to start the process. So we set off, and the two main Llama Cloud products we initially started with are LlamaParse and LlamaExtract.
So one thing we wanted to do is make sure that users of what we're going to call Notebook Llama from now on can upload documents of any shape or form. LlamaParse is super useful here: we don't worry about tables, images, or layout differences at all. And also LlamaExtract. LlamaExtract is my favorite of the bunch; I love working with it. A lot of the time we see people use LlamaExtract in situations that are probably a lot more boring than what we're going to talk about today. We often see it used in situations like this, where you may have long, complex financial documents, and really what you want is to extract a specific set of information from them; you don't care about the rest. So here we have the SEC filing by Nvidia, and we want to take that complex PDF and focus on just a few key components from it. You can do that in a few ways. You can either select one of the predefined schemas we have in Llama Cloud, or, if you're like me and prefer being in code, you can do the following, where you define a Pydantic schema. What's important here is that not only are you defining the schema, so the data types you're after, but you're also providing, for a model of your choosing, a description of what it should be looking for within the document. That description is then used to go and extract the relevant information.
And then you can go ahead and create what we call an extraction agent. You can do this within Llama Cloud or in code: you deploy an extraction agent which can then be reused over and over again. And so here you see my extraction agent for SEC filings.
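As a rough sketch of what that looks like in code: the Pydantic schema below is illustrative, the agent name and file name are made up, and the import path and method names follow the public LlamaExtract examples as I recall them, so check the current docs before relying on them.

```python
from pydantic import BaseModel, Field
# Import path per the LlamaExtract docs at the time of writing; adjust if the
# package layout has changed.
from llama_cloud_services import LlamaExtract

# Schema for bootstrapping the notebook: the Field descriptions are what the
# extraction agent uses to know what to look for in the document.
class NotebookSummary(BaseModel):
    summary: str = Field(description="A concise summary of the whole document")
    highlights: list[str] = Field(description="Bullet-point highlights of the key content")
    questions: list[str] = Field(description="FAQ-style questions a reader might ask")
    answers: list[str] = Field(description="Answers corresponding to each question")

extractor = LlamaExtract()  # reads LLAMA_CLOUD_API_KEY from the environment
agent = extractor.create_agent(name="notebook-bootstrap", data_schema=NotebookSummary)
result = agent.extract("ikea_kitchen_guide.pdf")  # hypothetical input file
print(result.data)  # structured output matching NotebookSummary
```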
Now, what we noticed is that we can use our very own LlamaExtract to start off the creation of the initial notebook in Notebook Llama, where, just like you saw in the original Notebook LM, you had FAQs ready, a mind map ready, a summary ready, and so on. We decided to use LlamaExtract as the way to start the notebook up. You'll notice here we've defined a schema where we're asking for a summary of the document, highlights, so bullet-point highlights of the content in the document, and we're asking for questions and answers.
So that's step one. The next thing we decided to do was make use of our open source framework, specifically workflows, because this is a way we can define our business scope, what the agent is and isn't able to do, and architect the pathway our application takes without relying on an LLM to make that decision in the first place. We're going to see how our workflow is built a bit later on, but this is the point where I want to shout out Clelia, because this is really her brainchild. This is my colleague Clelia; you can follow her on LinkedIn if you like. She's our open source engineer, and she put a lot of effort into Notebook Llama. So I hope you like it, and I hope you decide to contribute to it in the end as well.
So, what does Notebook Llama do? I want to point out that I'm definitely not presenting a finalized, polished product here. Of course you can go ahead and use it, but really our aim is for this to be a start, and if you want to contribute, or you see that something's missing, it's obviously not a complete Notebook LM replica, so please do feel free to contribute. We do a few things. We have mind map generation; that's the first thing we decided to implement. We can ask questions and do QA over files; we're going to have a look at that in a bit. We can extract tables and images as well. We have podcast generation, and we've very recently added observability to the product. And now I'm going to quickly show you what we can do with it. All right. So,
I've decided I'm not running this
entirely live because I don't uh trust
the Wi-Fi. So, this is something I've
done just before. And I'm going to just
increase uh this. Uh so, here you see me
uploading um the IKEA kitchen guide. uh
once you've uploaded it, very similar to
Notebook LM, we're getting a summary,
we're getting bullet points, we're
getting an FAQ,
and we're getting a mind map that I have
to be careful when I scroll over because
it moves around a lot. Uh our mind map
does not look as sleek as you may see.
It does not look as sleek as Notebook
LN, but nevertheless, we have it. Uh you
also have document management where you
can look at previously uploaded
documents. This is the same one. So,
it's not very interesting. Um, I've
tried the IKEA kitchen guide many times.
Um, and we also have document chat. So,
here again is me asking the question,
what are the main steps I should take to
install countertops in my new kitchen?
And what I get is a very sort of
in-depth answer with citations. And I
can also look into a bit more detail
about what the sources are and why it's
claiming that this is the best way to
install countertops.
Now before I switch back to the
presentation, one quick thing I want to
point out is uh let's not worry about
the PDF loading. But here we see our
extraction results of the IKEA kitchen
guide. You saw earlier the schema that I
had defined. The schema is right here.
And here is the result. And you'll
notice that these are the exact results
you saw in the uh UI for notebook llama
as well. This is how we set up the
notebook. We get the extracted summary,
highlighted bullet points, question and
answers, and that's basically our entry
point into notebook llama. With that,
let's switch back and go on. All right. So, I'm going to focus on explaining two main components of Notebook Llama. Let's first start with the homepage, where we load the file and generate the initial notebook with the mind map, the bullet points, and so on. How does that all look? We want to end up with the image you see on the right-hand side. So we want to start with a PDF, next we want to be able to generate some sort of mind map, and then we want to generate the whole notebook with the summary, bullet points, and so on. So actually, by thinking about this, we've already described to ourselves exactly what the logical flow should ideally be. And how do we do that in Llama Index? We have an abstraction called workflows, which is a class you can extend to create any workflow. A workflow really consists of two main things: you'll notice there are methods called steps, and steps expect events and can also emit events.
In this case, you see a single-path system where we have three steps, each of them expecting a certain event. But the cool thing is that we can define our own events completely from scratch. So what we did is define three events. We've defined our own custom start event, which we call the file input event; it's simply a file name. We want to end up with a stop event, which is again our own custom stop event, which we've called the notebook output event. You'll notice that the notebook output event holds the mind map, summary, highlights, and so on. And in between we have a mind map generation event; this is again custom code by my colleague Clelia, where she used the summary and highlights to generate a mind map from scratch.
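A minimal sketch of how such custom events and steps might be declared with the LlamaIndex workflow abstraction is below. The event fields and placeholder logic are simplified guesses, not the actual Notebook Llama code, and the run invocation in the comment assumes the current workflow API.

```python
from llama_index.core.workflow import Workflow, step, StartEvent, StopEvent, Event

# Custom events: names mirror the talk, fields are simplified assumptions.
class FileInputEvent(StartEvent):
    file_name: str

class MindMapCreationEvent(Event):
    summary: str
    highlights: list[str]

class NotebookOutputEvent(StopEvent):
    summary: str = ""
    highlights: list[str] = []
    mind_map: str = ""

class NotebookWorkflow(Workflow):
    @step
    async def extract_file_data(
        self, ev: FileInputEvent
    ) -> MindMapCreationEvent | NotebookOutputEvent:
        # In the real project this calls LlamaParse / the extraction agent.
        try:
            summary, highlights = "placeholder summary", ["placeholder highlight"]
            return MindMapCreationEvent(summary=summary, highlights=highlights)
        except Exception:
            return NotebookOutputEvent()  # empty notebook on failure

    @step
    async def generate_mind_map(self, ev: MindMapCreationEvent) -> NotebookOutputEvent:
        mind_map_html = "<svg>placeholder</svg>"  # stand-in for the mind-map generator
        return NotebookOutputEvent(
            summary=ev.summary, highlights=ev.highlights, mind_map=mind_map_html
        )

# e.g. await NotebookWorkflow(timeout=120).run(
#          start_event=FileInputEvent(file_name="guide.pdf"))
```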
Now that we have the steps, we have to put them in some logical order. So again we define our own custom steps, and the first step in the process is one we call extract file data. It expects the file input event, and it can actually branch. We account for failures: maybe a mind map wasn't generated for some reason, in which case we return the notebook output event, an empty notebook with a message indicating why the failure happened. But if we're successful, we return the mind map creation event. Next we have a step called generate mind map, and this is where, in the successful scenario, we generate the mind map and can create what we call the notebook output event. One thing you may have
noticed here is that most of the logic is within MCP tool calls, and the reason for that is the challenge we set ourselves in the first place: we wanted any custom logic we built into this application to live in an MCP server, provided as MCP tools. So if you like the functionality you see and want to extend the repository in any way, you can use all of those tools in isolation, or go ahead and use the mind map generator with, say, Claude. That's why we decided to put all of the main functionality into MCP tools. So let's have a look at our first tool, which we call the process file tool. The most important thing here is the description. If you've been using MCP tools and servers with your MCP clients, you'll know that the description is pretty important, because it's the one thing the LLM has to use to decide whether it's time to run that tool or not. We've described this one as: this tool is useful to process files and produce summaries, question answers, and highlights.
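For a sense of what such a tool definition looks like, here is a minimal sketch using the FastMCP helper from the Python MCP SDK. The server name, tool body, and return shape are placeholders, not the actual Notebook Llama server:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("notebook-llama-tools")  # server name is illustrative

@mcp.tool()
def process_file(file_path: str) -> dict:
    """This tool is useful to process files and produce summaries,
    question answers, and highlights."""
    # In the real server this would call the LlamaExtract agent; here we
    # just return a stub so the shape of the tool is clear.
    return {"summary": "...", "highlights": [], "q_and_a": []}

if __name__ == "__main__":
    mcp.run()  # exposes the tool over stdio to any MCP client
```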
So what's going on here? Again, you might notice there's an extract agent. This is the point in our workflow where we call our extraction agent to return all of the highlights and FAQs. We already have our extraction agent deployed in Llama Cloud, so we don't have to worry about that. But when we hit this point in the workflow, we call that extraction agent and wait for a structured output. Once we have that, we're done and we can start generating the initial notebook.
The next thing is the document QA functionality you saw. This we manage with Llama Cloud indexes as our sort of database, our vector store if you will, and we also make use of claim verification. For that we have Llama Cloud indexes, which you can connect up to any of your own vector stores, and once your data is there, you get to decide, at the RAG step, whether you want to do retrieval-augmented generation, claim verification, and so on. This is where we again picked our Llama Index open source framework, because it allowed us to make it so you can switch models easily, or build it into a completely different workflow if you wish. So how does that look? Again, we've wrapped this in an MCP tool. We've described it very simply as: query a Llama Cloud index. But you'll notice here we're pulling in a query engine called the citation query engine from the Llama Index framework. The LLM is again switchable, and the retriever is our Llama Cloud index. This tool runs whenever there's a question from a user, and we return the answer to the user in the UI as well.
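A minimal sketch of wiring a citation query engine to a Llama Cloud index is shown below. The index name, project name, and model choice are placeholders, and the exact parameters should be checked against the current LlamaIndex docs:

```python
from llama_index.core import Settings
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
from llama_index.llms.openai import OpenAI

# The model is switchable: swap this for any other LLM integration.
Settings.llm = OpenAI(model="gpt-4o-mini")

# Index and project names are placeholders for your own Llama Cloud setup.
index = LlamaCloudIndex("notebook-llama-docs", project_name="Default")

query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,        # how many chunks to retrieve
    citation_chunk_size=512,   # granularity of the cited source snippets
)

response = query_engine.query(
    "What are the main steps to install countertops in my new kitchen?"
)
print(response)                # answer text with [n] citation markers
for node in response.source_nodes:
    print(node.node.get_text()[:120])  # the cited source passages
```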
All right. Finally, I want to talk about something we've been working on very recently. We've now made it so that if you already have an MCP server running with tools, and you want to use preconfigured workflow agents with Llama Index, you can bring those tools in like any other tool you might provide to a Python agent or a function agent. You don't have to define any tools and functions yourself: from now on, if there is an MCP tool out there that you want to use, you're free to do so with a Llama Index workflow or a predefined agent. The same is true for all of the Llama Cloud tools we mentioned throughout this presentation, including LlamaExtract: any extraction agent you create with Llama Cloud can run as its own individual MCP tool. The same goes for indexes: any indexes you have within Llama Cloud can run as an MCP tool, described with your own custom definition. And that is a video of me explaining the whole process. This is again an open source tool, an open source server, that you can run yourself.
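As a rough illustration of bringing existing MCP tools into a Llama Index agent: the server URL, model, and prompt below are placeholders, and the package and class names follow the llama-index MCP tool integration as I recall it, so treat this as a sketch rather than verified code.

```python
import asyncio
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI
from llama_index.tools.mcp import BasicMCPClient, McpToolSpec

async def main():
    # Point the client at whatever MCP server you already have running;
    # the URL here is a placeholder.
    mcp_client = BasicMCPClient("http://localhost:8000/sse")
    tool_spec = McpToolSpec(client=mcp_client)
    tools = await tool_spec.to_tool_list_async()  # MCP tools become regular tools

    agent = FunctionAgent(
        tools=tools,
        llm=OpenAI(model="gpt-4o-mini"),
        system_prompt="You answer questions using the notebook tools.",
    )
    print(await agent.run("Summarize the IKEA kitchen guide."))

asyncio.run(main())
```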
And with that, um, the whole point of
this presentation was hopefully that
you'll try out Notebook Llama. I hope
that you contribute to it and also it's
a great way to discover basically all of
the products we have with Llama Index
and Llama Cloud. So with that, thank you
very much.
>> And I can take questions.
>> Thank you, Tuana. Would you like to join
me for a couple of questions, please?
Thank you so much for the presentation
by the way.
Wow,
very nice slides. Very colorful. I love
it.
>> Thank you.
>> Um,
>> our logo changed a bit too, so it's a
bit more colorful now.
>> I see that.
All right. Um, question. What's the best
framework to build agents?
>> Oo, spicy question.
Um, I love to give this answer. Everyone calls it a very political answer, but I think it's really true. They're all different means to the same end, if I'm saying that correctly in English. But I think it really depends on the type of developer experience you're after. Some frameworks are a lot faster at integrating the latest research. Some frameworks are a lot more stable. Some do a great job of having observability and tracing built in, and some focus on other, more AI-specific functionality. So that's my political answer to that.
>> Awesome. Hope you enjoyed that.
>> Llama Index, guys, if you care about developer experience, from what I'm hearing.
>> I'm being truthful here. I'm DevRel, so there you go. That's the answer. But, Llama Index.
>> Awesome. So, yeah, you mentioned you're in DevRel, so you meet a lot of Llama Index users. What are some of the cool things you see being built?
>> Oh wow, that is a good question. So we have two very distinct groups of people among Llama Index users. I will say Llama Index users seem to be a lot more creative; it is an open source product, so I guess that's expected. And Llama Cloud users are usually doing this for business purposes, maybe enterprise customers and so on, but still very fun projects. Okay, that's going to be a tough one. We had one specific use case, a very difficult situation, where a user was trying to use Llama Cloud to parse incredibly complex architectural charts, I think it was a system grid or something. So that one always sticks out to me; it's a very difficult one to solve as well.
>> Wow. Awesome. Um, one final question.
>> Yes. In the architecture overview we also see Postgres, the best database, by the way, just saying, no bias here. What is that used for?
>> So, all of the data you saw, the context and the documents, is stored in Llama Cloud. However, we do run Postgres locally, and that stores everything the Notebook Llama UI needs. So, for example, the mind map is generated and you have HTML content for it; we store that in Postgres so that you can go back to it.
>> Awesome. Thank you so much, Tuana. Let's give it up for Tuana, please.
>> Thank you.
>> Thank you.
All right. Are we ready for our next
speaker?
>> Yeah. Okay. Let's do it. So, our next speaker leads AI developer experience at Google DeepMind, where she's helping shape how, it's 5:00 p.m., but we're going to get through this together, how builders everywhere use the latest generations of models. And I think she's going to talk about a lot of cool models today, so I'm very excited about this talk. Today she'll take us beyond chatbots, showing us live demos of Veo 3, Genie 3, and Gemini 2.5 Pro. So please join me in welcoming to the stage, from AI developer experience at Google DeepMind, Paige Bailey.
[Music]
Thank you so much. I'm so excited to be here, and even though it's 5:00 p.m., I think there will be some things to wake us all up later today. So, hi everyone. My name is Paige. I'm fortunate enough to be here with my excellent colleagues from Google DeepMind. Everybody from GDM, kind of raise your hand. Excellent. Ahmed, Guiam, Patrick, Ian, they've all been valiantly sharing demos, staffing the booth, and presenting all throughout the day, and we're here today to wrap it all up. Some of these things might look a little bit familiar if you were at some of our earlier talks, but hopefully there will be enough new stuff that it'll be exciting even for people who have experimented with our models before. So, as mentioned, these are the folks from DeepMind who are here today. And I also just want to say that Google has been a little bit busy. If folks remember back to a year ago, the release cadence was not as rapid; there were fewer models released to the market, fewer features. Over the last year and a half we've been accelerating at a pace that's really exciting to see. I'm a long-term Googler, and I've never really seen us ship at this kind of rate. I think we're releasing a new model or a new feature every five days or thereabouts. A couple of the models that have recently come to market are Gemini 2.5 Pro: show of hands, how many people have used that? Excellent. Gemini 2.5 Flash Image Preview, aka Nano Banana: how many folks? Excellent. Also, we're not the best at naming things. Veo 3, which is our video generation model: show of hands. Awesome. Gemma 3n, part of our open model family, because open source kind of rules and fuels the world. And then Genie 3, which is our world model. So, you'll see all of these today, and we'll also talk through some applications and when you might want to use one versus the other. So, Gemini is special in a
the other. So, Gemini is special in a
number of ways. One of which is that
it's natively multimodal. It can
understand video and images and audio
and text and code and all of the above
all at once in multiple languages. But
it can also output multiple modalities.
So Gemini models are kind of unique in
the market in the sense that they can do
many many things. They can output text
and code but also images. They can edit
images. They can output audio, like you might have seen in Guillaume's section of the workshop earlier today. Um, but that means that you can do all sorts of things. You can have image understanding, editing, speech-to-text, text-to-speech, all kind of natively incorporated into one model, in
addition to things like PDF
understanding. One of the folks here
today has been using Gemini to kind of
pull in PDFs and extract out images, um,
extract out bounding boxes for different
geospatial coordinates, um, and then
also using it for real-time
conversations. Um, so this is just an example of something that you can do with the Nano Banana model: give it an image of a car and have it picture-perfectly turned into a convertible. Um, our robotics team
has been using something called Gemini
Live to kind of orchestrate many of the
different robotics behaviors. Um, Gemini
being able to automatically detect
bounding boxes and different features
kind of helps the robotics models
understand where to grip a specific
object, where to head if you ask it in
natural language to kind of rotate um, a
tool or to build you a salad or
something similar. Um, we've been
incorporating Gemini into our smart
glasses, so it can kind of give you
recommendations on the fly as you're
navigating across cities. Um, it can
give you help with your math homework or
your physics homework as you're looking
at something on the screen. Um, we saw
this uh in action at Google IO earlier
this year if you want to take a look.
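As an aside, an image edit like the car-to-convertible example a moment ago boils down to roughly the following with the google-genai Python SDK. This is a sketch; the model name, file paths, and response handling are assumptions to verify against the current docs.

# Sketch: edit an input image with the Gemini 2.5 Flash Image ("Nano Banana") model.
# Model name and paths are placeholders; check the docs for exact identifiers.
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=[
        "Turn this car into a convertible, keeping everything else the same.",
        Image.open("car.png"),
    ],
)
# The edited image typically comes back as an inline part alongside any text.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        open("convertible.png", "wb").write(part.inline_data.data)
    elif part.text:
        print(part.text)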
Um, and it can also do things that feel
a little bit like Hitchhiker's Guide to
the Galaxy. So, if you remember the
Babelfish where you have one person
speaking in one language and you hear it
in your native language and you speak
something back and they hear it in
theirs, um, that is possible today with
the Gemini Live API. um which is pretty
remarkable. Um but I am not a fan of
slides. I'm actually very bad at slides.
Um and so I think it's much more
interesting to kind of see some of these
things in action um in AI Studio uh and
in some of our model uh some of our
model scenarios. So if you haven't seen
AI Studio before, this is the best place
to go to get access to DeepMind's models
as soon as they're released. You can
select different models here off to the
right, see different details about them,
um, including details like pricing, the
specific model name. So this is what you
would incorporate into the API. Um, if
you wanted to call that specific model
as you interact with the models, you can
also click this get code button and you
can see the libraries called here to
engage with the Gemini models. So we
have a Gen AI SDK which is unified with
our Vertex AI compadres over in Google
Cloud. Um, so instead of having a distinct SDK for the Gemini API and one for our enterprise customers, these are now merged for Python and for
TypeScript. So if you need to toggle
from one to the other, um, it's pretty
seamless. And then we also have a couple
of features like the streaming feature,
um, generating media, build, um, and the
like. Um, and so hopefully folks saw our
build feature a little bit earlier
today. Show of hands. How many folks um
saw that at the expo as well as the
others? Um given that there are a few
folks who haven't seen it, I'm going to
do a live demo of build featuring the
nano banana model um including
deployment and then we're going to race
over back to show a couple of the other
capabilities. So with Build, one of the fun things is that you can describe in natural language an app that you would like to create, and then you can build it in real time in the UI and deploy it via Google Cloud and Cloud Run.
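For folks who would rather call the API directly than use the UI, the get code button produces roughly this shape of call with the unified google-genai Python SDK. A minimal sketch; the API key, model name, and prompt are placeholders.

# Minimal sketch of calling Gemini through the unified Gen AI SDK (Python).
# The same client can also target Vertex AI (genai.Client(vertexai=True, ...)).
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Explain what a Dungeons and Dragons character sheet contains.",
)
print(response.text)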
So as an example
um let me uh let me toggle back. Um so
hopefully
uh where did the the other sort of
screen go?
Um
h
there we go. Excellent. So the uh with
the build feature, you can describe in
natural language the app that you would
like to create um and you can deploy it
in real time via Google Cloud. So I
could say something like um create an
app that takes uh a webcam photo
um of the user. Uh the app should then
uh use that photo uh to ground um the
creation of a uh let's say Dungeons and
Dragons character.
The app should use Gemini 2.5 Flash Image Preview, aka Nano Banana, to modify the image to show the D&D character so that it looks similar to the user. Make sure the
app is well designed and that the uh the
stats for the character are included. Um
and I'm going to hit control enter. And
immediately we see Gemini 2.5 Pro kind
of breaking down every step of creating
this app, kind of walking through all of
the different architectural
considerations that it would need to
make. Um selecting the different models
that are available. Uh, and since this
is baked in natively to AI Studio, it's
incorporating the latest models and the
latest features from the Generative AI
SDK. So, if you ever played Sim City
2000 or any of the the other kind of Sim
games, that loading screen probably
looked a little bit similar. Um, all of
the code is getting written here on to
the right. Um, you can see this really
nice file explorer here in the center.
Um, and then we also have a really handy
save to GitHub feature off to the right
if you wanted to save to a public or a
private repo. Um, all of these things
are kind of uh generating prompts on the
fly. So if you need to have a specific
prompt um to engage with the model as
part of the app, you can. Um, it's
handling all of these services like the
webcam. If it encounters any errors, it
will take the error, feed it back to the
model, and then use Gemini in order to
resolve it. Um, and at the end of this,
we should be able to deploy to Cloud Run as well. This is all part of an
improved code generation and vibe coding
experience with our latest Gemini
models. They've been topping all of the
leaderboards, including LMArena, for these code generation scenarios,
and it's been really exciting to see how
they've improved over time. So, let's
try and see if this works. Um, I'm going
to begin the quest.
Uh this does uh hopefully
uh look look okay. So it's consulting
the ancient tomes, rolling the dice of
destiny. Um powered by Gemini, unleash
your own imagination.
Um so let's uh also zoom out a little
bit so we can see the full app. Um
bargaining with a mischievous fay. Oh,
that sounds very uh that sounds very
cute.
Um, and then that does look very much
like me. Um, well, in my dreams, right? But it does have
kind of the original portrait. Um,
neutral good bard. I'm an entertainer.
Well, that's that's uh nice to say. Warm
smile and gesture of peace. And then you
can see the character backstory along
the way. I actually really love this and
think that this should probably be my
Facebook image. Um, but Lyra Brighton.
So cool enough. If you select a Google
Cloud project, um I'm just going to hit
this one. Uh it verifies the project. Um
and then you can click deploy app. Um it
creates a unique URL behind the scenes
that you can share with your friends,
your family, um uh other folks that you
want to join your D&D campaign. Um and
then after this is created, so you can
see the app, it gets deployed. Um, but
even cooler, you can also take a look at
all of the logs, the Google Cloud, um,
uh, kind of services that have been
created along the way. Um, and all of
these things are kind of scaled out to production. We take care of the headaches of Kubernetes and the like, so you don't really have to. Um, you can
also see billing associated with that
account. So, if I pull up the billing um
for my linked account, which is where
all of the apps uh that I've deployed
are uh are getting utilized, you can see
that clearly I have a problem, but also
that uh there are a lot of Gemini APIs
uh Cloud Run and Cloud Storage um
components just kind of added behind the
scenes so I don't have to worry. Um,
another cool feature, which my colleague Guillaume shared a little bit earlier today, is the Live feature. So, I'm going to go ahead and
click stream. Um, you can add to the
system instructions. So, you can say
something to the effect of uh please
only speak to the user in French
and uh kind of have that added. Um, and
then you can do all sorts of things. You
can share your screen, you can share a
video feed, and the model kind of
interacts with you dynamically um using
uh the languages that you specify or
using a variety of languages based on
the user conversation. I'm going to go
ahead and ground with Google search and
then I'm going to go ahead and share my
screen um and ask Gemini what it sees.
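For reference, the same system instruction and Google Search grounding can also be set on an ordinary call in code. A minimal sketch with the Python SDK; the config and tool type names are my assumption of the current API, so verify them against the google-genai docs.

# Sketch: system instruction plus Google Search grounding on a Gemini call.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Quel temps fait-il à Paris aujourd'hui ?",
    config=types.GenerateContentConfig(
        system_instruction="Please only speak to the user in French.",
        tools=[types.Tool(google_search=types.GoogleSearch())],  # grounding with Search
    ),
)
print(response.text)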
Um and I've already got kind of pulled
up into a tab, uh, Google Colab. Do not
judge me by the number of untitled
notebooks that I have. Um, but I'm going
to go ahead and ask Gemini what it sees.
Hey there, Gemini. What do you see on
the screen?
Google
starting or generate with AI.
Was that correct? Like, okay, cool.
Cool. Excellent. Good job. Excellent.
More cheers. The um but the uh but even
cooler, you can also ask for help with
the UI. So, you could say something to
the effect of, "Hey, Gemini, how would I
change the runtime type in this collab
notebook?
change runtime type.
Awesome. So, it was able to kind of
navigate me through the UI, tell me what
to select, and since I've turned on
Google search grounding, I can even ask,
"Hey, Gemini, is it going to be raining
uh in Paris today? Like, what is the
weather like in Paris?"
[Music]
>> Yeah. So, it sounded like it didn't get it right. Um, but
it does have the ability to look up
information on Google search. So you can
ask it natural language questions. Um
even specify your own function calls and
the model should be able to to kind of
pull in that information. Um in addition
to all of this kind of operating within
Google um AI studio. Uh again if you
click get code you get everything that
you need in order to replicate the
experiment. We've also been baking
Gemini into Google Colab itself. So,
one feature that feels very underloved
and underutilized, at least to me, um,
who really enjoys working with data, is
that you can specify a different data
set. So, I'm going to copy the path to
this CSV file, um, which you can just
kind of see here on the left. You can
also add a link to a URL and ask for
Gemini to help with web scraping it. But
you can say something to the effect of
please
do exploratory
data analysis,
if I could type on this CSV file and
build a model to predict California
housing prices, which is always like a
very depressing thing. Um, but kind of
similar to what we just saw in Google AI
Studio, what happens behind the scenes
is that Gemini builds a plan. So, it
kind of creates um a step-by-step
instructions for what it would need in
order to accomplish the task. So, you
can see here um you can see here loading
data kind of doing EDA, feature
engineering, data prep-processing, model
selection um and all of that kind of
gets incorporated into the step-by-step
process. uh and then Gemini writes the
code and executes it within the context
of the notebook. So if you've never used
Colab before, this is kind of a
notebook based interface on top of uh
some compute with a whole bunch of
Python libraries installed. What's
happening is that Gemini is doing each
one of these tasks. It's describing its
reasoning. It's writing the code. It's
executing the code. It's using the
outputs for each of the cells that it's
created to to kind of inform the next
steps that it uses for its analysis. So,
it's analyzing these data. It's creating
these really rich detailed plots for the
CSV file, but it could also be JSON. It
could be TSVs. It could be like a table
that you're importing from a database or
from BigQuery. It's doing feature
engineering. It's deciding based on the
structure and the shape of the data what
kind of model it should be using to
train. Um, and then afterwards it's
giving you kind of a summary of the
results and explaining all of its
reasoning traces along the way. So, this
is pretty cool. I I started doing
machine learning back in 2009, like back before scikit-learn even found its legs. Um, and it's been really amazing to me to see how all of these tools have evolved over the last year or so. And
so it looks like it's decided to create
a linear regression model. It's training
the model. Um and then it should give us
some sort of insight into the results that it found for the data set and the R-squared values. So you see MSE and R², and then
hopefully a summary of all of the
results at the very very end. So this is
pretty cool. I highly encourage everyone to play with these agents that are baked into Google Colab, the features that are baked into AI Studio, and the models that we have available, and to test them out for all of your use cases.
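To give a flavor of the kind of notebook code the Colab agent ends up writing for a request like that, here is a hand-written sketch. The CSV path and column names are placeholders, not the actual dataset schema.

# Sketch of the EDA plus baseline-model workflow the agent walks through.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("california_housing.csv")
print(df.describe())                         # quick exploratory summary

X = pd.get_dummies(df.drop(columns=["median_house_value"]))  # naive feature prep
y = df["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R2:", r2_score(y_test, pred))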
Um but in the interest of time, um I'm
going to go ahead and go back to our
original presentation. Um, and we're going to careen along the way for the rest of the examples. So, Veo 3. Hopefully everybody is excited about Veo 3. Veo 3 is one of three of our generative media models that have been released to the public: Imagen 4 is for image generation, Lyria for music generation. Um, and given that no AI Engineer presentation is complete without a quote from Andrej Karpathy,
I'm really excited about video in
particular because it's an incredible
medium to kind of help educate, but also
to help communicate, especially for
audiences that are like my niece and
nephew's age. You know, they already
tell me like, you know, Paige, nobody
reads anymore. Um, but I I think video
is something that can resonate and
really find places um with people who
have different learning styles, with
people who maybe uh don't want to take
the time to uh to to kind of learn um
through alternative means. And so V3 is
kind of the first step in the path of
making that a reality. So uh in the
interest of not cycling through all of
the text on the screen, I'm going to do
a lot of showing. So, we've got a lot of
new features as part of our video
models. One of which is uh kind of this
character consistency or reference
powered video. So, here you see a person
and a hallway that have both been
incorporated into um into kind of the
the prompt for the Veo model. This is still using Veo 2. Um, but then you see kind
of that same character walking down the
hall um as you describe. um reference
powered video, very similar. You can
have the same character just in a
variety of scenes, a variety of lighting
cases. Um and it still looks like the
same happy little monster um kind of
swimming or walking or hanging out at a
gas station. Um we also have a feature
where you can take in an image and
animate it. Um a feature where you can
have an image with a guide for the
animation. So, not just kind of like
creating what it looks like might happen
is the next step, um, but actually
nudging it to be like a woman walking
down the road in Texas. Um, you can
have, uh, these sort of camera controls
where you can control the style of the
video outputs, outpainting, which was really important for a recent exploration that DeepMind did to restore The Wizard of Oz into a state such that it could be displayed on the Sphere in Las Vegas.
um adding objects to scenes, removing
objects, playing around with
perspectives and camera controls. Um uh
there's just an example of removing
objects. You can do the same with our
nano banana models, by the way. Um you
can have reference face movements, so an
avatar and kind of reference face
movements that control it. Um first and
last frame. So you can define the first
frame, the last frame, and ask for the Veo models to interpolate between them. And again, all of that was possible with our last iteration of Veo models. Veo 3 kind of takes it to the next level. So with Veo 3
um you can do things like uh you can do
things like have
um photorealistic video
of people and places. So, these were all
created just through prompts and just
through prompts coupled with input
images. Um, hopefully you can hear the
audio also displayed in the room. Um,
but these are uh everything from kind of
these futuristic landscapes
um, to things that feel a little bit like a scene from perhaps Lord of the Rings, right before Frodo throws the ring in. Um, but lots and lots of cool things to do with Veo. Again, in the interest of
time, I'm going to show you just one
more video um to kind of get the gist of
what's possible. Um created by our
friends in the Google uh Google Paris
office.
>> Yep.
It's like Daft Punk. I can't believe
this new Veo model. It is amazing.
Artificial.
Artificial.
Artificial.
[Music]
artificial.
[Music]
>> Awesome. So if you see M Blancc anywhere
around the Google Paris offices um
definitely tell him that he has excellent taste in filmmaking. Um, so
moving along to the next slide or
attempting to move along to the next
slide.
Yep. So, one of the one of the things
that I wanted to try with the Veo 3 model when it was first released was to see if I could create, or really replicate, a commercial that I had seen on television. I wanted to see how much easier it would be to do with Veo 3 versus Veo 2, and also how many other models I
would need to stitch together in order
to make that to make that possible. Um,
so this is the this is the advertisement
in question.
>> Hey, my name's Paige and what makes the
Chick-fil-A chicken sandwich original to
me is the crispiness of the breading and
the tenderness of the fillet. It's
tasty, it's warm, it's total
satisfaction. So, that's not me, Paige.
That's like some other Paige that also likes chicken sandwiches. Um, but the next step, to build a similar style of experience with Veo 2, was: I take the original image, I give it to Gemini 2.5 Pro, I ask it to segment it into 8-second clips and generate a unique prompt for each 8-second clip. That was all used to guide and steer the Veo 2 outputs, which, as a reminder, don't have audio. So to have audio incorporated, I used the text-to-speech version of Gemini to create the audio track. I used a music model to generate a 30-second clip for the audio in the background, which I also used Gemini 2.5 Pro to describe. I stitched it all together using Camtasia, though you could also use MoviePy. And then I got that final video, and it looked a little bit like this.
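To make that orchestration concrete, here is a rough sketch of the prompt-segmentation step she describes: Gemini 2.5 Pro turns the source description into one detailed prompt per 8-second clip, and each prompt is then handed to the video model, with voice and music layered on afterwards. Only the Gemini call is shown; the video, TTS, and music steps are omitted, and the prompt wording is my own.

# Sketch: ask Gemini 2.5 Pro for one video-generation prompt per 8-second clip.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

def segment_into_clip_prompts(description: str, n_clips: int = 4) -> list[str]:
    """Returns one detailed prompt per clip, one prompt per line."""
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=(
            f"Split this commercial into {n_clips} clips of 8 seconds each and write one "
            f"detailed video-generation prompt per clip, one prompt per line:\n{description}"
        ),
    )
    return [line.strip() for line in response.text.splitlines() if line.strip()]

for prompt in segment_into_clip_prompts("A woman describes her favorite chicken sandwich."):
    print(prompt)   # each prompt would then be sent to the video model (no audio at this stage)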
[Music]
Hey, my name's Paige and what makes a
Chick-fil-A chicken sandwich original to
me is the crispiness of the breading and
a tenderness of the fillet. It's tasty,
it's warm, it's total satisfaction.
And I actually kind of like this one
better. It's a lot more chill. Um, but again, this is using Veo 2 and a collection of models. It took about, you know, 25 to 30 minutes to create end to end, which is still a lot less
than it would have taken um for for kind
of a professional movie making team, I
think. Um but one of my colleagues also
told me that clearly it was wrong
because like the breading on that
chicken was like completely different
than Chick-fil-A breading. Um it was
definitely Popeye's breading. Um so, uh
like you know, like some work to be
done. Um but with V3, the process is
much simpler as you can see. So you just
have the original video. You ask Gemini
2.5 Pro to create the prompt or the
collection of prompts. In this case, I
only wanted the first 8 seconds. Um, it
generated the detailed text description
and then that's what I used to kind of
give to the Veo model to get that final
output. Um, and so this is what it
>> Hey, my name's Paige and what makes the
Chick-fil-A chicken sandwich to me is
the crispiness of the breading and the
tenderness of the fillet.
>> And so that's pretty strong with just a
single prompt. Um, Genie 3 is our new
frontier model for uh kind of exploring
worlds. You can navigate just with the
arrow keys. Um, behind the scenes, it's
powered by techniques similar to our VO
models as well as a lot of uh
integrations with Gemini. So, it's it's
kind of a a harness style approach for
for generating these worlds and allowing
you to navigate through them. Um, but
you can just create via a prompt or via
an input image the kind of worlds that
you would like to see. Um, navigate
through them. Um, and there's even
consistency. So, if you interact with
the world, if you draw something on the
wall, if you look down and you see that
your galoshes are yellow, um, it will
remember that and persist it through the
duration of the exploration. Um, you can
even see what it would feel like to
experience a hurricane in Florida um,
from from a road. Um, which is pretty
rad. We should be having a trusted
tester program for it coming pretty
soon. Um, but stay tuned. Uh, and can't
wait to get that out to folks. So, we've
talked about a variety of models, Pro,
Flash, Flash Light. Um, and I also want
to talk a little bit about our nano
model family, which is small enough to
fit on mobile devices and to be embedded
within browsers. um specifically uh
Gemini Nano and then also some of our
other open models like Gemma 3 and 3N.
Um so interestingly uh for Gemma 3, how
many folks have used Gemma 3 or heard
about it? Um Gemma 3 is remarkable in a
few ways, right? Like so you can see
here that our 27 billion parameter
version of Gemma 3 um is uh kind of able
to fit on just a single H100. Um, so
just one GPU as opposed to the 32 that
you would need to run DeepSeek R1 or DeepSeek V3. Um, and even cooler, uh,
Gemma 3N, which is 4 billion parameters
in size, so small enough to fit on your
laptop, like small enough to be, you
know, locally hosted. Um, it's free to
use because you can download the
weights. It's actually exceeding the
capabilities of our Gemini 1.5 Pro
model, which was our best model, you
know, six or seven months ago. So I I
just want to like underscore that,
right? Like our best model, the one that
took multiple TPUs to run um is now uh
you know being bested by an open model
that is small enough to fit on your
laptop and that you can use for free. So
if you fast forward 6 months from now,
right, like and you look at the frontier
like what's possible today. So Genie 3, Veo 3, Nano Banana, like all this
cool stuff. Um, you could easily imagine
that probably all of those models or
equivalents would be friendly enough to
run on your local devices without you
having to send data anywhere, without
you having to send something to a
server. And also, the frontier would
probably look dramatically different
than it does today. So, I I truly think
that open models and locally hosted
models are the future. Um, and we're
deeply deeply investing in that, not
just for Gemini on Pixel devices with
Gemini Nano, but also something called Gemini in Chrome. We announced this about 5 days ago. Um, but with Gemini in Chrome, uh, you can do a variety of
things. You can summarize information
across tabs. And coming soon is a
feature that feels very similar to
Project Mariner if you've seen that
before, but it allows you to kind of ask
in natural language, hey, here's an
email with a request for me to make
something. Go on, uh, Instacart or go
order me all of the groceries and put
them into my grocery cart and check out,
please. Um, and all of this would be
available embedded within the browser.
um and just kind of free for you to use
or to interact with as part of your
Chrome native experience.
So this is a you know it feels very
futuristic. It feels like sci-fi um but
it's it's stuff that is definitely
coming down the pipe from Google's
perspective.
I also want to underscore there has
never been a better time to be a
founder. So, if any of you in this in
this room are like startup curious, uh
there's uh almost nothing stopping you
from creating a company um that's, you
know, cash flow positive and getting it
out into the world. Um we've also seen a
whole bunch of VCs that we partner with
founding or uh not founding but
supporting solo founders much more than
they have previously. So if you've
always felt a little bit unsure just
because you know you wanted to build a
business but you didn't necessarily have
a partner um again never been a better
time. Um smaller teams also capable of
doing outsized amounts of work. Like if
you talk to the Black Forest Labs guys
or the Mistral guys, it still boggles my
brain how much they're able to do with
just a relatively small number of
engineers. Um, so I just want to
encourage you if you need any
encouragement to go build, go create,
um, get things out into the world. With
that, um, hopefully I didn't go too much
over time. Um, but I wanted to just say
thank you. Thank you so much for having
us. Um, and, uh, thank you for all your
great questions, for sharing your use
cases. We're really excited to see what
you are about to build and have been
building with the Google DeepMind
models. Um, and your homework for today,
if you haven't already tried it, is to
go to ai.dev and start experimenting
with Gemini, generate an API key, and
start using it in your projects. So,
thank you so much. Thank you to the team
from DeepMind who is here today that has
been doing all of this work. Um, and uh,
we appreciate you.
>> Thank you, Paige. That was amazing.
>> Excellent. Uh, do we have time for
questions?
>> Yes, I have I have a few follow-up
questions for you.
>> Excellent. And it's okay if they are
spicy. We can we can talk space.
>> Let me think now.
>> All right.
>> Um All right. They're not that spicy,
but I think
>> so. Amazing presentation. I'm I'm like
the mind blown emoji all over. I'm I'm
>> Well, I was blown away. I I'm speechless
now.
>> Um, I've seen you use Veo 2 instead of Veo 3. I know that you said Veo 3 is Veo 2 but supercharged.
>> Yep.
>> Um, but are there real use cases for Veo 2 now, even if Veo 3 is available?
>> There are some features that are still only available via the Veo 2 APIs. Um, but we're quickly trying to integrate them with Veo 3, so hopefully in the future there won't be a need to rely on the Veo 2 models; you can just use Veo 3 natively. Um, from a pricing perspective, we're also bringing down the costs of Veo 3 pretty significantly. They've already dropped over the last month or two. So definitely, if you're thinking about building something or creating video, prioritize Veo 3 explorations compared to Veo 2.
>> Um, awesome. Yeah, you spoke about so many models, and you mentioned Genie 3, and I was wondering how much of Genie is actually used for autonomous vehicles and to train autonomous robots as well.
>> Gotcha. So, so, uh, the Genie 3 the
Genie 3 models, I don't think that
they've necessarily been generating data
to to be training autonomous vehicles at
the moment. Um, but we do uh include
synthetic data into pre-training and
post-training for the Gemini family of
models. So, I'm sure that folks have
played video games before or if you
haven't like live a little, you know,
like there there are a lot of there are
a lot of really cool games out there.
Um, but uh like a lot of the footage of
agents interacting in video games kind
of accomplishing task exploring the
world are also really useful training
data for models such as world models or
even models like Veo 3 or Gemini's uh sort
of video understanding capabilities. So,
so I definitely think that that those
kinds of data would be really really
interesting for models uh to incorporate
back. But even just kind of the agents
exploring world style video game video
footage is really really helpful for
model training too.
>> Very cool. Um I have one last question
for you.
>> Yep.
>> So um
>> just one
>> I have so many written down but I yeah
for the sake of time I can only have
this one but um I think I'll catch you
later for sure. The um so what's unique
about Deep Mind's models and what are
you excited about? Yeah. So, Deep Mind's
models, I really really love how we're
kind of pushing the boundaries for
multimodal outputs. You know, the audio
outputs, the video outputs, but also,
you know, open- source is kind of near
and dear to all of our hearts. I think
the Gemma team is doing phenomenal work
and I really have been excited to see
how these smaller models can be
incorporated into, you know, things like
Chrome and to mobile devices. um and
really can't wait to see more of the
road map of Gemini and Chrome. I think
it's truly magical to kind of imagine
this world where you can uh you know
have all of your data kept local um but
it can accomplish a variety of tasks for
you um or can toggle efficiently between
models that are zero cost um because
they're just running on local devices
and then model uh models that need to be
uh to be used that might be hosted
server side.
>> Wow. Well, thank you so much. Thanks.
Uh, thanks, Paige. You guys have been
cooking for sure. Shipping every five
days is incredible.
>> Well, everybody on the team has been contributing. It's been fantastic
to get to work with everyone and I I
feel fortunate and honored to go to work
every single day.
>> Thank you so much.
>> Excellent. Thank you.
>> Give it up for Paige.
>> All right,
time to announce our last speaker of the
day.
Up next, um, our speaker led generative audio research at Google Brain and earlier worked on speech recognition at Facebook AI Research, before he co-founded Kyutai, where he now serves as chief modeling officer. Today he'll talk about full duplex conversation with Moshi, speech-to-speech translation, and more. Please join me in welcoming to the stage Chief Modeling Officer Neil Zeghidour.
[Music]
[Music]
Hi everyone. Uh thanks a lot for having
me today. Uh I'm happy to talk about our
work on scaling realtime voice AI. And
uh before that I would like to say a few
words about Kyutai. So Kyutai is a nonprofit AI research lab we created in Paris two years ago, thanks to generous donations from Xavier Niel, Rodolphe Saadé, and Eric Schmidt. It's an AI lab that is focused on open research and open science. So the main mission is to make big advances in AI, in particular around multimodal LLMs. And the specific thing about Kyutai is that, since it's a nonprofit, all our inventions are published and shared in open source. We train PhD
students, we collaborate with academia
and so on and so forth. And so what do I
mean by scaling real time uh audio? So
you may be familiar with solutions.
There are a lot of them around, like ElevenLabs, Azu, and so on. And so currently
the main applications in uh AI voice are
around offline content. So typically you
will generate an audio book uh with a
synthetic voice or you would have uh a
character in a movie that could say a
few artificial uh sentences or a small
character that can uh in a video game
also interact with you and so on and so
forth. So this is mostly offline
content. So this is content that is
highly qualitative and is generated at low volumes.
a huge opportunity that is currently not
really addressed is everything that is
interactive and very high volume. So
taking again the example of the difference between offline content and the rest: when you make an audiobook, you can pay a lot, you can
spend a lot of time iterating because
you generate it once and it's consumed
by a lot of people on the other end
there is a lot of content that is
generated on the fly and needs to be
processed right now and is consumed by a
few people and then thrown in the trash.
So for example, if you look at uh
gaming, if you have interactive NPCs
with whom you're going to talk, uh these
interactions, there will be a lot of
them. They are heard by a single person
and so they need to be very very
scalable, right? It's the same for
robotics. Eventually, humanoid robots are going to play a huge part in our society, and interaction with them will be mostly vocal, right? So we're talking about lots and lots of volumes of audio, much more AI speech than currently exists. Same for media. Audiobooks are mass media, where you generate one audio for a lot of people. But now, as I'll show with some examples, something that is starting to be a new product is personalized news, where
you have your own news instead of
consuming the same as everyone you have
some personalized news digest about your
interest. It's the same in that context.
This is very focused on a on a single
person. So you need to generate massive
volumes of audio.
Um, and what do we look for when we say voice AI? So the first thing is the kind of application you are interested in. People will naturally think about speech synthesis or voice agents, but there are a lot of things you can do. Synthesis is the task where you give a text and generate the corresponding audio. Transcription is, on the other hand, writing down what is being said; it's very useful, for example, for meetings and so on. Translation is translating speech into another language, hopefully in real time, conserving the voice and so on. Transformation is all the kinds of audio effects you can do on a voice. If you look, for example, at the visual effects in apps like TikTok, Snap, and so on, there's a lot of AI in them, right? You can make a lot of AI-based transformations on your face. But the audio effects, like pitch up, pitch down, slower, faster, are not really using AI right now. So you could imagine much richer transformations. And the last one is the
full conversational experience, the
voice agent. So now when you have this
set of capabilities, what are you
looking for in terms of quality? First
one is fidelity, right? So, it needs to
sound like if it was recorded in a
studio and not on a smartphone in the
subway. Uh, the second thing is you want
to be able to design voices either by
cloning them or by writing uh natural
language description like I want a
middle-aged man who smokes way too much
and has a deep voice or something like
that. Um, you want emotions to be uh
rightfully understood by the AI and also
be produced consistently. So right now
you might be crying uh with your AI
voice and it's going to be like oh
that's so great and so and so you know
it's not really exactly the kind of
interaction you are looking for. The
flow should be very natural right now
and I talk a bit about it. You need a
lot of discipline when you talk to an AI
and so you'd rather want to have
something that is you know much more
natural like a human conversation.
Finally the latency needs to be very
very low. And now about scalability. I think these are the two main challenges we are facing right now, and Paige said a few words about on-device models. Um,
if we want to scale audio generation to
reach all the use cases I showed before
and gaming assistants uh personalized
media and so on either we need to be
able to generate very large scale
volumes on the cloud or we need to be
able to have small scale generation on
device. Right? So let's say you want to
make NPC in a video game. either it's
hosted on the cloud and you need to
generate millions of hours of NPC voices
uh you know every month or so uh or
everybody is running on their PS5 uh the
local TTS device and so in that case
it's scaling through devices right
so in particular if you look at quality
and scalability there is kind of a
trade-off where people currently have to
choose so uh if we take speech synthesis
again very high quality low scalability
the audio book right you generate it
once it needs to sound very good but you
don't need to generate a lot of them
right you you make the Harry Potter's
books uh and you sell them and you make
a lot of money uh on the other end
voicemail requires a lot of scalability
because everyone has their own voicemail
it's a you know interaction that you
cannot predict so it needs to be
generated for everyone uh but the
quality can be pretty crap right if you
call your voicemail you don't need a
full emotional understanding uh you know
just want to to know what people wanted
to tell you at the intersection of
quality and scalability there is a
personal assistant. Now the content is
not mass media. Everybody wants to have
their own interaction but they want it
to be as qualitative as if it was the
premium audio book. So in that case you
need to nail both aspects at the same
time. It's the same for translation in
uh if you do diplomatic interpretation
really not our focus right now because
the you know the expectations on
reliability and accuracy are extremely
high. Same if you are asked by I don't
know Netflix to dub a movie uh you know
the expectations on the quality are
going to be very very very high. On the
other hand if you're translating uh you
want to be translated while you travel
it's a bit related to the you know the
recent release of in the AirPods Pro
then it's okay I guess if the quality is
a bit lower uh as long as it's useful
and kind of reliable and in particular
can run on device so you can bring it
with you in your travels. And at the
intersections of quality and
scalability, you have meetings, phone
calls or small creators. So small
creators, it's a bit like in opposition
to movies. So in the case of movies, you
make one movie for a lot of people. Uh
for small creators, it's the content is
consumed by a few people, but there is a
lot of content that is generated at the
same time. So you need to translate much
much more content.
Finally, for voice agents, one example I
like is a startup working on airline
claims. So if you do airline claims, it's like an AI that calls the airline and says, "Yeah, the flight was late by 1 hour, so you need to give us 300 bucks." In that context, if you pay $10
for the voice, it's fine, right? Because
you get a lot of money. But now, uh, if
you have a bot at McDonald's that wants
to take your order, it needs to be super
cheap. But at the same time, you know,
the quality can be pretty crap as long
as it takes the command, right? And at
the intersection, you have video game
NPCs, interactive podcasts, uh,
e-learning. So if you want to learn a
language through voice, not only it
needs to be very cheap for you because
you don't want to spend like 500 bucks a
month to talk to it, but also you want
the experience to be enjoyable and to be
qualitative and so on. So you need to
nail both aspect at the same time. So
what I'm going to show is how you can
address all these aspects, and the story at Kyutai around them. So the first one is about quality. The first project we did at Kyutai, called Moshi, was about creating the first full duplex conversational AI. It was more than a year ago, before the release of the advanced voice mode from OpenAI. Now
it's something that people are used to
to have uh you know conversational
chatbots
all of them, without exception, even today still rely on a half-duplex setting. It's a bit like a walkie-talkie: either the AI is speaking or it's listening, which means
that it requires this discipline. So you
know it's always a bit awkward when you
talk to an AI because if you interrupt
it or you cough it thinks you are
interrupting and so it starts breaking
the flow of conversation and so on and
so forth. So it's not very natural
because you need to adapt to the
limitations of the voice rather than the
voice adapting to the fact that I don't
know maybe you are erratic in the way
you speak. Full duplex is the fact that
the model always speaks and always
listens. It can interject at any time.
It can be interrupted at any time.
Exactly like in a human conversation. In
a human conversation, if you're on the
phone with a relative from your family,
the amount of overlap speech is around
20%. Which means that 20% of the time
people are speaking on one another. And
that makes it, you know, like a a rich
conversation in a way.
And so how do we address that? That's a
small technical part in this uh in this
talk. Uh we wanted also to address
another aspect which is that people rely
on cascaded system. So typical chatbot
uh is uh speech to text and then you
have an LLM and then you have text to
speech. How can we merge all these steps
into a single one? Because if we merge
them into a single one, we don't lose
emotional information because we don't
go through text and the latency is going
to be much better. So the way we did it
is by taking inspiration from text
models. So very quickly, a text LLM is a probabilistic model where you give a sequence of words and predict the next one. So "Kyutai is an", you predict "AI", and then you inject "AI" and predict "research", and so on and so forth. So our main algorithm, and this is the expertise of Kyutai, is audio language models.
How do you make an audio language model?
The most basic thing you could do is say
okay I'm going to put audio as input and
audio as output of the LLM. It doesn't
work at all for a very simple uh reason.
Look at the sentence "Kyutai is an AI lab based in Paris." It's eight words. So if
you want to pass it into an LLM it's
very few tokens. It's around 3 seconds
to to pronounce the sentence. Even at 24
kHz, which is not studio quality audio,
you get 72,000 values, which means that
you have a super long sequence to pass
through your LM. And given that self
attention has a quadratic cost, so it's
proportional to the square of the length
of the sequence. It means that it's 100
million times more costly to process
audio than text with an LLM. So what we
did instead is that we invented Mimi.
It's a codec. It's like an alternative
to MP3 or Opus. It takes a large audio
file and it compresses it so densely
that it's a bit like a text
representation. In the case of the sentence I showed, instead of having 72,000 samples, you get 37 tokens. So now, you know, it's almost as if it was text, and you can just train your LLM to predict these tokens.
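A quick back-of-the-envelope check of those numbers, assuming the quadratic self-attention cost he describes:

# Why raw waveform values are hopeless for an LLM with quadratic self-attention,
# and what Mimi-style compression buys you. Pure arithmetic, no external libraries.
words = 8                                   # "Kyutai is an AI lab based in Paris"
seconds = 3
sample_rate = 24_000
raw_values = seconds * sample_rate          # 72,000 values at 24 kHz
cost_ratio = (raw_values / words) ** 2      # quadratic attention cost vs. the text tokens
print(f"raw audio values: {raw_values:,}")            # 72,000
print(f"~{cost_ratio:,.0f}x more costly than text")   # ~81,000,000, on the order of 100 million

mimi_tokens = 37                            # same sentence after Mimi compression
print(f"with Mimi: {mimi_tokens} tokens, close to the {words}-word text")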
Then your LLM predicts these audio tokens, and you have a decoder. It's a generative adversarial network, and it's going to reconstruct high quality audio from it. So now we have turned the task
of audio generation into a task of
language modeling. And in particular, that's the main standard architecture used by Google, OpenAI, and so on for modeling conversations. The way they do it is that they have an audio LM, and they have this sequence of audio tokens that either represents the LM or the user. So in that context the model is going to predict the next turn based on the previous turn. It works nicely and can give all the chatbots that you have right now. But it's still a walkie-talkie, because either the model is speaking or the person is speaking. So what we did is we invented what we call multistream modeling. It's a very simple idea: we just have two streams in parallel, which means that both can be speaking at the same time, both can be silent at the same time, or one can talk and not the other. And what it gives, and I'm going to show you, is the demo we did more than a year ago now at CAM. You're going to see two things. The first thing is a conversation with Moshi, where you're going to see that the model still has the lowest latency ever at this point. Sometimes when it guesses where you're going, it starts answering before you're done. You're going to see Alex speak at the same time, and there is no problem with this ambiguity where both are talking at the same time. And the second video is me doing construction work in our office, where it's super noisy, and a good advantage of full duplex models is that, since they don't have this awkward handling of speaker turns, you can have a lot of background noise and they just work pretty well. The planet is
Sirius 22. Can you plot a trajectory
course to it, please?
>> Yes, sir.
>> Okay. How long is it going to take us to
get there?
>> I've mapped it out. It's approximately 5
months to get there.
>> Okay, that's that's not too bad. Uh, do
you think we have all we need on board
the ship to start the mission?
>> Yes, sir. We have everything we need.
>> Okay.
>> Good day. How are you doing?
>> Hey, uh, I'm doing some construction
work right now, as you can see. Uh, I'm
I'm looking for some advice. Can you
help me with that? Sure, I'm happy to
help. What's the issue you're facing
with your project?
>> So, in that context, I couldn't even
hear anything and the model has no issue
handling noise. Uh, the second thing we
we tackled is capability. So, the
multistream architecture we invented,
it's not an architecture only for
dialogue. It's basically a general
framework for modeling any kind of audio
task. In particular, if you take your
two streams, and instead of having the LM speech and the user speech, you can have me speaking in French and me speaking in English, and now you can predict me speaking in English from me speaking French, and you get real-time speech-to-speech translation.
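As a toy illustration of the multistream idea (purely schematic, not Moshi's actual implementation): at every time step the model sees one token per stream and predicts the next token for each, so dialogue and translation are just different choices of what the two streams contain.

# Toy sketch of multistream modeling: two parallel token streams advance frame by frame.
# The "model" is a stand-in that only shows the data flow, not a real audio LM.
from dataclasses import dataclass

@dataclass
class Frame:
    stream_a: str   # e.g. the user's audio tokens, or French speech
    stream_b: str   # e.g. the model's audio tokens, or English speech

def toy_model(history: list[Frame]) -> str:
    """Stand-in for the audio LM: predicts the next stream_b token from both streams."""
    last = history[-1]
    return f"<b given a={last.stream_a}, b={last.stream_b}>"

history = [Frame("<silence>", "<silence>")]
for tok_a in ["bon", "jour", "tout", "le", "monde"]:   # tokens arriving on stream A
    tok_b = toy_model(history)                         # B can "speak" even while A speaks
    history.append(Frame(tok_a, tok_b))
    print(tok_a, "||", tok_b)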
Hi everyone. I'm delighted to be able to
speak to you today in French since my
phone will
translate what I am telling you into
English. As you can see, it is in
airplane mode. It doesn't have an
internet connection because our machine translation model allows for real-time voice translation entirely on the phone. You can find the model for free download on Hugging Face.
So in this context, this model was made by a single master's intern in a few months, because since it's the same architecture, you just generate new training data. It's very easy to go from conversation to translation. It's again extremely robust to noise.
>> My name is Alexandri and I am testing
this model in extreme conditions. At
this moment there is very loud music and
I can barely hear what I'm saying.
However, our model is able to translate
live.
>> So that's Gavinsky in the background. You probably cannot recognize him. But anyway, so this was for translation. Again, there is a live demo coming and everything can go wrong, so please bear with me. From translation to transcription now. So okay, I showed two tasks. Now, instead of predicting English from French, you can predict text from speech. And in that context, basically the idea is the following: it's still the same architecture, but now what it can do is real-time transcription. At the moment, it's the most accurate and fastest real-time transcription in the world. It's open source. And the way it works is that it predicts, in a continuous fashion, the text from the speech. And so this
and now I can go back to English and
it's going to work hopefully. I hope so.
Okay. Uh first demo. Okay. Back to the
presentation now. Uh you can also do it
the other way around. So instead of predicting text from speech, you can predict speech from text. So it's, uh,
in that case streaming text to speech.
So, text to speech: there are a lot of them out there. The specificity about this one is that it's streaming on text, which means that as you type words, they start being generated. It's useless for a human, because it would be super weird to be typing and have it start speaking. But when you want to make it work with an LLM, it's very useful. Let's say you want to make a conversational agent from an LLM, and the LLM is going to produce a huge paragraph. You don't want to wait
for the LLM predicting the paragraph to
start predicting the audio. You want
both to be predicted at the same time.
So in that context, what you're going to
see is the actual realtime latency.
>> Coming up next, we've got something
special. The AI engineer Paris
afterparty. Enjoy drinks and fruitful
conversations. See you next time.
>> Please stay. I should have put that at the end of the presentation. Please, a few minutes more and then we can go have some drinks. Um, and interestingly, there was a nice talk earlier by our friends at pyannote, but again, instead of predicting just text, you can also predict text with the label of who is speaking when. So if you have a meeting with a lot of people, you want to be able not only to transcribe a single stream of text, as is often done because it's hard to parse, but rather have the model detect who is speaking and then associate the right transcript to them. All of this still in real time.
>> Understand that?
>> Yes. Uh,
>> okay. I tell you, if you need to tell me
something, I want I'll let you come up
here to the podium so you can speak into
the microphone and I can hear you.
>> Yes. Uh, the attorney here. I'm wanting
to fire him.
>> Uhhuh.
>> And you know, I don't feel like he's
doing any
>> Who are you going to Who are you going
to hire?
>> I'm not going to hire nobody. I'm going
to try to get a different public
defender
>> there. There isn't You have a right to
an attorney.
>> Yeah. Anyway, you know, I don't want to put too much content because it's going to get a copyright strike. Anyway, so that shows the flexibility of this model. And now what you can do is take the real-time speech to text and the real-time text to speech, and you can put an LLM in between. So in a way we are back to what I showed before, which is the cascaded system. It has more latency and it's not speech to speech. But the very nice thing about it is that it's completely customizable. You don't have to touch the LLM. So you have an LLM with, say, vision understanding and function calling and tool use and RAG, whatever you name it. Most people just want to make it speak. They don't want to train a new model that completely replaces their text stack. So in that context, and that's going to be the demo that worked perfectly before and then started freaking out 30 minutes ago, so let's see, the way it works is, for example, I can upload a 10-second sample. In that context I uploaded a voice sample of General de Gaulle, and I write a personality for the LLM, just a prompt explaining that I'm talking to the General. And now, with a 10-second sample and a very short text, you can design a new conversational experience, basically.
Excuse me, my general. For the sake of the audience, who is mostly English speaking, maybe you can switch to English. So right now I'm giving a
talk on audio language models and I was
thinking maybe you could have some
advice uh for me. Maybe you could try
explaining it in your own words.
Ah,
when I think of language models, I am
reminded of the importance of
communication. During the darkest days
of World War II, we had to convey
messages clearly and with conviction to
unite the French people. In the same
way, these models must be precise.
>> Okay. Thanks a lot, my general. Uh, thanks a lot. Have a good one.
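A schematic sketch of the cascaded loop described above: streaming speech-to-text feeding an unchanged text LLM feeding streaming text-to-speech. The three stage functions are hypothetical stand-ins, since the point here is the wiring, not any particular API.

# Schematic cascade: streaming STT -> your existing LLM -> streaming TTS.
from typing import Iterator

def streaming_stt(audio_chunks: Iterator[bytes]) -> Iterator[str]:
    """Stand-in: yields partial transcript text as audio arrives."""
    for i, _chunk in enumerate(audio_chunks):
        yield f"word{i}"

def llm_reply(transcript: str) -> Iterator[str]:
    """Stand-in: the unchanged text LLM (with its tools, RAG, persona prompt) streams tokens."""
    for token in f"(reply to: {transcript})".split():
        yield token

def streaming_tts(text_tokens: Iterator[str]) -> Iterator[bytes]:
    """Stand-in: starts producing audio as soon as the first tokens arrive."""
    for token in text_tokens:
        yield token.encode()

mic = iter([b"\x00" * 320 for _ in range(5)])          # fake microphone chunks
transcript = " ".join(streaming_stt(mic))              # in practice these stages overlap
for audio in streaming_tts(llm_reply(transcript)):
    print("play", audio)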
And so finally, the last one is about scalability. So the demo I showed of the on-device speech-to-speech translation, you can also run it on the cloud, and on a single H100 you can process 320 concurrent conversations. That means that on a single GPU you can serve hundreds of concurrent conversations. Similarly, if we look at our speech to text and text to speech: our speech to text, in terms of streaming solutions, is more accurate than Whisper, for example. And the throughput is the number of seconds of audio you can process in one second, right? That means that we have something like a 400 times real-time factor, against a five times real-time factor for Whisper streaming.
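A quick note on how to read those throughput numbers, as a back-of-the-envelope sketch:

# Real-time factor (RTF): seconds of audio processed per second of wall-clock time.
# An RTF of N roughly means one instance can keep up with about N concurrent real-time streams.
def realtime_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    return audio_seconds / wall_clock_seconds

print(realtime_factor(400.0, 1.0))   # ~400x: hundreds of concurrent streams per GPU
print(realtime_factor(5.0, 1.0))     # ~5x: only a handful of streams per instance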
Same for our TTS, which is open source. If we compare to the main open-source TTS models out there, like Dia, CSM, Chatterbox, we have more than 100 times higher throughput while having a smaller word error rate, right? Which means that we have better pronunciation and fewer errors. What that means is that now you have something
that can scale to address all the
applications that I I've started
describing because now if you want to
make NPCs in video games where everybody
is going to have uh 20 villagers uh they
can talk to in a huge open world and so
on it starts to become you know
believable which was really not the case
uh, until now. And to give you a final demonstration of this, you can, as of today, on La Provence, listen to our news articles with our voices.
So, uh, yeah, probably not the best topic. I really apologize about that. But here what you see is the difference between two types of media. What I just played is the journal for everyone, right? So that's kind of the audiobook: it's a single media generated for everyone. On the right, unfortunately, it's not available on the web yet, it's on the app; it's a personalized news digest. In that context, there is only a solution like ours that can scale to generate, for each person reading the journal, their own news stream. So
basically our goal is to go and remove
the trade-off between quality and
scalability and provide both. And in particular, what we have been doing with La Provence, incorporating our models into products, is the kind of collaboration we are also looking for right now. So you can find all our code on GitHub and our pre-trained models on Hugging Face. We publish all our research on arXiv and in conferences. So all of this can be found, and we are maybe interested in opening limited access to an API that we can serve to people who want to try to prototype large-scale generation. So what we're excited about is, if someone wants to make a video game with voiced NPCs, or someone wants to automate customer support, or create an education app and so on, those kinds of use cases are super interesting. So please don't hesitate to reach out. This link brings you to a Google form where you can reach out to us. Thanks a lot for your interest.
Thanks a lot for bearing with me. Uh
yeah, thanks for your attention.
>> Wow.
Thanks Neil. Thank you Neil would like
to join me for a couple of questions.
Awesome.
All right.
There's so much to unpack there. Um All
right. So uh yeah you you showed
multiple models. I don't know where to
start from but uh what's the most
challenging part of developing a model
like a full duplex model? So the reason why we did the full duplex model is, when we started working on it in early 2024, dialogue at this time was really a task that nobody had been able to tackle at all, right? TTS had started to work, we had done work at Google on music generation and so on, so all of this was kind of working well, and there was really nothing for dialogue. So at Kyutai, our
strategy is always the same right we are
a small lab a small team uh so what we
try to target as projects are things
that are very new and can be tackled by
a deep expertise of the topic rather
than you know scaling a lot the
resources and so on so what is the most
challenging is that, in particular, and I really love this about audio, audio models are typically very small; they are around one to three billion parameters. So it's not about scaling
huge infrastructure and so on. It's
really about understanding how the human
hearing works, how speech production
works, how you can put that knowledge
into machines. And so it's it's really
about having a very deep knowledge
interest of the field of audio rather
than you know just uh a general
knowledge of machine learning I would
say.
>> All right. Very cool. Yeah. You you
mentioned that you have a small team but
is also nonprofit. So can you expand a
little bit about that? So the reason why
we created Kyutai as a nonprofit is, um, the ambition is to create
something that can foster uh new ideas
that are groundbreaking. And what is interesting is that if you look at the main inventions in AI, for example the transformer, right, the
transformer architecture that was
invented at Google it was invented with
a specific application in mind which is
machine translation but it was invented
at Google brain the fundamental research
lab and the reason why it's so is
because it was a complete rethinking of
the way you do sequence modeling people
were using recurrent neural networks
there is no relation between recurrent
neural networks and transformers which
means that you really need to be ready to say, okay, let's rethink everything from first principles; maybe we waste a year, but you know, if it works out it's going to be huge. It's very hard to do that in a startup. So we decided to do a nonprofit, because we really wanted to have a mission where we can try stuff, fail, and just focus on making big things without having distractions, and Xavier and Eric Schmidt were very happy to accompany us with that.
>> That's very cool. But you still have
some um commercial partnerships as well
right? So we started getting interested in that for the main reason that, as I showed you in the benchmarks, we are a nonprofit and at the same time we make research models and realized they are very competitive, to the point where they can be competitive with commercial solutions. And so we'd be
interested in providing them to people
not only through uh you know open source
models but as well uh specific models
that where let's say someone needs a
model in a specific language and they
are okay to open source it and they want
us to work on it. Obviously we need to
find a way to make it work because we
are a small team but uh we are very
happy that our open source models can
also be used in uh in products.
>> Awesome. So what's next for Kyutai and what
are you excited about? Uh so what we're
excited about in a way and I think it's
quite interesting, is that Moshi was the first full-duplex model, and then Unmute, what I showed with the general, is a cascaded model, and the whole field seems to be going back to cascaded models because it's just so much simpler and more convenient for products, which means that the progress on full duplex models has kind of remained almost flat since Moshi,
and it's clearly the you know the end
goal is this. So for us it makes a lot
of sense to keep walking that path
towards models uh where you you really
feel like there is a deep understanding
uh mutual understanding between you and
the machine uh with whom you talk and it
it makes the the conversation very
enjoyable but at the same time it's very
useful. So you cannot just give up RAG and function calling and so on, because then you have a chitchat buddy, and it's pretty fun and very natural, but it's kind of useless, except as a chitchat buddy. So you want to have a chitchat buddy that can access the knowledge of the world and do a lot of complex stuff. Exactly.
>> Yeah.
>> Awesome. Well, thank you so much, Neil.
Thanks for the presentation. Uh that was
my last question.
>> Thanks a lot.
>> So, all right. Let's give it up for
Neil.
>> Thank you.
>> Awesome. Very excited for Kyutai.
So, uh before you leave, I would like to
say one more thing. Hold on.
>> Oh.
>> Oh, sorry. Yep. I got I got a clicker
here. I want to well, thank you so much
for bearing with us and staying with us
until now. I know it's been a long two
days, but we're very excited for having
you here. I personally really enjoyed
this experience with you guys. So, thank
you. I'd like to thank every one of you.
So, thanks. Let's give it up for
everybody here. All right.
Look, this event wouldn't have happened
without you, without your support, the
support of the community, and the
support of our sponsors. So I just
wanted to thank uh everyone. So I wanted
to thank Docker, Neo4j, Sentry, DeepMind, Arize, Algolia, and everybody else
who supported us throughout this this uh
this event. And uh and we're super happy
uh like the quality of the speakers were
just amazing. I don't know about you
guys, but I really enjoyed and I had the
front row uh uh to to watch all those
talks and it's fantastic. But before you
leave, we have two things for you. So,
we have if we're all going to go
upstairs by the expo, we're going to
take a group photo and that's going to
happen in 3 minutes. And we have a very,
very special announcement. So, make sure
to be around for that. Okay? So, we can
all walk together and in 3 minutes,
group photo and a special announcement.
And thank you so much again.
Bye. Woo. Yeah. Yeah.
[Music]
[Applause]
Heat.
[Music]
[Applause]
[Music]
Heat. Heat.
Heat.
[Music]
Heat.
[Music]