AI Engineer World’s Fair 2024 — GPUs & Inference Track

Channel: aiDotEngineer

Published at: 2024-06-27

YouTube video id: JVSKlEmUr0k

Source: https://www.youtube.com/watch?v=JVSKlEmUr0k

[Pre-event audio: sound checks and walk-in music]
Hello everyone, we'll begin in a minute.

Okay, we're going to get started. Hello everyone, my name is Nyla Worker, and I am here in the GPU and Inference track. I hope you all are very excited, because I think this is top of mind for all of us: either you have a lot of GPUs and you're GPU-rich and want to optimize them, or you're GPU-poor and need to leverage every second of your GPUs, or maybe it's not even a GPU and you want to learn about different hardware. So let's get started and enjoy this track. Our first speaker of the day is Santosh, and he is going to be speaking about Agnostiq's AI platform; in particular, the talk is going to be about how to accelerate training and fine-tuning for Llama. So with that, I'll let you go.

Thank you.
All right, great. So the talk is actually going to be about how you run things extremely easily, directly from Python, and the example I'm going to show — obviously I only have five minutes on my end, but I'm going to try my best — is how you can fine-tune pretty much 20 models (20 is an arbitrary number here; it could be hundreds) right from Python, without needing anything like Kubernetes or Docker on your side. Before that: you can find the talk and the actual code for what I'm going to do at this QR code, and you'll find a lot more interesting examples over there to try out and run as well.

Okay, so what do we do? Covalent is an open-source, open-core product, and what we do is help people write Python locally and ship the code to any kind of compute backend you need to send it to. What that means is: say you have a Python function you want to run on a GPU. On your laptop, open up a notebook, add a single decorator on top that says "I want to run this on an H100 with 36 gigs of memory and a two-day maximum time limit," and press Shift+Enter in your Jupyter notebook. That's it: the code gets shipped to a backend with a GPU, and you get the result back on your side. In the open-source case it goes to your own compute: you can attach your own compute cluster and it runs there. In the cloud case it runs in our GPU cluster, and you just pay for the GPU time it uses: if it runs for 5 minutes you pay for 5 minutes of H100; if it runs for 10 seconds you pay for 10 seconds of H100s. You can also bring your own compute and attach it to us, and we'll help you orchestrate the entire fleet you're handling on your side, be it your own cloud, on-prem systems, or whatever it is on your end.
Covalent basically gives you a set of primitives that you define. You can submit jobs, which are just single functions: as I said, all you need to do is add a single decorator on top and say what compute it should be shipped to. It goes there, it runs, you get the Python object back, and you pay only for the function you're running.
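As a rough illustration of that pattern, here is a minimal sketch using Covalent's open-source primitives plus a cloud-style executor. The executor class and its parameter names (cc.CloudExecutor, gpu_type, time_limit) are assumptions modeled on Covalent Cloud's public examples, not code shown in the talk:

```python
import covalent as ct
import covalent_cloud as cc  # assumed import name for the cloud client

# Assumed executor shape: "run this on 1 H100 with 36 GB of memory, 2-day limit".
gpu_ex = cc.CloudExecutor(num_gpus=1, gpu_type="h100",
                          memory="36GB", time_limit="2 days")

@ct.electron(executor=gpu_ex)   # the single decorator: ship this function to a GPU
def train(model_name: str, data_path: str):
    ...  # ordinary training code; runs remotely, result returns as a Python object

@ct.lattice
def workflow(model_name: str, data_path: str):
    return train(model_name, data_path)

# Shift+Enter in a notebook: dispatch and fetch the result.
dispatch_id = ct.dispatch(workflow)("llama-3-8b", "s3://my-bucket/data")
result = ct.get_result(dispatch_id, wait=True)
```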
We also let you run inference services, and again it's completely Pythonic: you don't dockerize, you don't run a Kubernetes cluster, you don't do anything. You just say, "I have an initializer function, and I need an endpoint called /generate," you define your Python functions, you run a single cc.deploy command in your Jupyter notebook, the entire service gets shipped to us, and we scale it. You get back an API endpoint that scales to zero or scales out as new requests come in. You can define your own custom autoscaling mechanism: "autoscale to 10 GPUs at exactly 9:00 every day," or "autoscale whenever my GPU utilization hits 80%," or "autoscale whenever the number of requests hits a thousand." You can define whatever autoscaling you want, you can define authentication, and everything happens in the background for you; you never touch a single line of Kubernetes or Docker on your side.
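Here is a hedged sketch of what the service definition the speaker describes might look like. The decorators (@cc.service, @<function>.endpoint) and cc.deploy follow Covalent Cloud's documented style, but treat every name as illustrative; load_model and model.generate are hypothetical stand-ins:

```python
import covalent_cloud as cc  # assumed cloud client

service_ex = cc.CloudExecutor(num_gpus=1, gpu_type="h100", memory="48GB")

@cc.service(executor=service_ex, name="llm-server")
def init_service(model_path: str):
    model = load_model(model_path)   # hypothetical initializer helper
    return {"model": model}          # state shared with the endpoints below

@init_service.endpoint("/generate")
def generate(model, prompt: str) -> str:
    return model.generate(prompt)    # hypothetical generate call

# One deploy call; back comes a scale-to-zero HTTPS endpoint.
client = cc.deploy(init_service)("s3://my-bucket/llama-3-8b")
print(client.generate(prompt="Hello"))
```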
very tiny example um of what we do from
our side but if you go to this Linkin
there's a whole host of examples uh that
you can run in right from realtime time
series analysis to uh you know using
inverter Transformers for time series
which is like a state of the art U time
series Transformer on its end uh running
in large systems um large language
models on your serving uh systems and
even building an entire AI model Foundry
out of our just pure pythonic code uh on
your side
So, without further ado, I'll quickly run through the code example of how you fine-tune a big set of models directly from Python, and I'll also show you how it looks from the frontend side. It's rather simple. I've written a bunch of normal Pythonic training functions in my local package, called finetune and evaluate. What we're going to do is define a Python task that calls my fine-tune function, which accepts a model and data and returns a fine-tuned model. This is a simple Python function, and I'm going to say I want it to run on a 24-core CPU with one GPU of type H100 with 48 gigs of memory, with a maximum time limit of 18 hours. Then, once the model is done, I'm going to accept the model and evaluate its accuracy. Next, I'm going to sort all the models by accuracy and pick the best one, and I want that step to run on a CPU-based machine: I don't want to waste a GPU on sorting. Finally, I'm going to deploy the best-performing model, and this is again a simple decorator: I say "this is my initialization service," I create an endpoint called /generate, and it generates the text and gives back the prediction. Then, to tie all of these things together — this is where the magic happens — I create a workflow where I pretty much just loop over a bunch of models to fine-tune, call the fine-tune function, evaluate each one and get its accuracy, make a list of all the models and accuracies, sort the best models, and then deploy the best one.
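A minimal sketch of that fan-out, assuming the same open-source primitives and the illustrative gpu_ex executor from the sketch above; train_model and compute_accuracy are hypothetical helpers standing in for the speaker's finetune and evaluate package:

```python
@ct.electron(executor=gpu_ex)      # GPU task: one fine-tune per model
def finetune(model_name: str, data):
    return train_model(model_name, data)     # hypothetical training helper

@ct.electron(executor=gpu_ex)      # GPU task: score a fine-tuned model
def evaluate(model):
    return compute_accuracy(model)           # hypothetical eval helper

@ct.electron                       # CPU task: no GPU wasted on sorting
def pick_best(models, accuracies):
    return max(zip(accuracies, models), key=lambda t: t[0])[1]

@ct.lattice
def sweep(model_names, data):
    models = [finetune(m, data) for m in model_names]   # fans out in parallel
    accs = [evaluate(m) for m in models]
    return pick_best(models, accs)

dispatch_id = ct.dispatch(sweep)(["llama-3-8b", "mistral-7b"], data="s3://...")
```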
This is completely Pythonic, and once you dispatch it to our server — essentially a single line — what you'll see is a new job created in our application, and every function you called runs on the respective device you defined. For instance, here is one of the evaluation steps that ran: it has its own machine, it ran on an L4 for 6 minutes, and it cost just 87 cents to evaluate the model. Another model ran on a V100, again for 6 minutes, and it cost 11 cents. In total, you have deployed, fine-tuned, and trained completely in Python, without needing anything like Docker or Kubernetes on your end. We have a booth over there; do visit us and we can chat more. Thank you, guys.
Awesome, that was great. Now for our next speaker: we have Dylan, who is going to be speaking about the work he's been doing at SemiAnalysis. If anyone has read any of his analysis of the supply chain of GPUs, TPUs, and so on, this is really exciting work. So, Dylan.
Hello, hello. I'm going to talk about a couple of different things here, mostly running models, as well as frontier models. A couple of different things: people have been talking about stagnation — I don't think anyone here sees it that way, but a lot of people have been talking about stagnation of models — and a lot of that has to do with the fact that we haven't seen a big capabilities leap lately. But that really comes from the fact that the models we're using today are largely the same as the models trained in 2022. GPT-4, 4 Turbo, 4o: the newer ones are just smaller models trained for longer, so similar quality. Claude 3.5 Sonnet came out recently, but again, that's actually smaller than Opus, yet somehow better, because they trained it for longer. We haven't seen an extremely large model come out yet, but we will soon. One interesting thing: GPT-4 is something like 1.8 trillion parameters, crazy expensive to run, with around 200 billion parameters active; each token requires almost 600 gigaflops. And yet that is almost going to be considered a last-generation model a year from now. So there are a couple of things I wanted to talk about regarding that, mostly on the inference side, because I don't think anyone here is going to try to train that kind of next-generation model, but we definitely need to be able to run it.
So, a few things: let's break down inference in detail. There are two parts of inference: prefill and decode. Prefill is the prompt processing, and the interesting thing is that if you have a 2K-context-length prompt — 2,000 tokens input into GPT — that's a petaflop by itself. And if you enter a 32,000-token prompt, it's actually around 20 petaflops. So an incredible amount of compute is required just to process the prompt.

While prefill is very compute-intensive, decode is actually the opposite. Decode is generating each token iteratively: you process the prompt, generate a token, feed it back in, and keep going. Decode is extremely memory-bandwidth-intensive: you have to load the entire model, all the weights, into the chip (or chips) for every decode step. The big challenge is that with 1.8 trillion parameters, running at a reasonable batch size and activating all the experts, you need to load all 1.8 trillion parameters for every single token generation, even when serving multiple users at once. That means moving on the order of 1.8 terabytes of weights per decode step; if you want 30 tokens per second — which I think is a minimum bar for most people, and a lot of people want hundreds — for a batch of 64 users, you need something like 60 terabytes per second of memory bandwidth. An H100 has about three. So this is an extremely challenging systems problem.

Moreover, while decode is very bandwidth-intensive, it's actually quite cheap on compute, which is why, if you look at OpenAI or Claude pricing, you see a three-or-four-to-one ratio between prefill and decode pricing: input tokens cost a third or a quarter of what output tokens cost. Today the best models — I think 4o and Claude 3.5 Sonnet — are around $5 per million input tokens and $15 per million output tokens: five for prefill, fifteen for decode.
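As a quick sanity check on those figures, here is the back-of-envelope arithmetic in Python; every constant is the talk's rough estimate, not a confirmed spec:

```python
# Prefill: compute-bound. Using the talk's ~600 GFLOPs-per-token estimate.
flops_per_token = 600e9
print(f"2K prompt : {flops_per_token * 2_000:.1e} FLOPs (~a petaflop)")
print(f"32K prompt: {flops_per_token * 32_000:.1e} FLOPs (~20 petaflops)")

# Decode: bandwidth-bound. ~1.8 TB of weights stream through the chips per
# decode step; with batching, one step serves every user in the batch.
weight_bytes = 1.8e12   # 1.8T params at ~1 byte/param (talk's rough figure)
steps_per_sec = 30      # 30 tokens/s for every user in the batch
needed = weight_bytes * steps_per_sec / 1e12
print(f"Bandwidth needed: ~{needed:.0f} TB/s (near the quoted ~60) "
      f"vs ~3 TB/s on a single H100")
```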
Soon, in the open source, what everyone here can touch is Llama 3 405B, and that's going to be a real capability unlock for the open-source market as well as for the builders here. And there are a couple of things people really need to be able to implement. You can't just run llama.cpp on Llama 405B; it's just not going to work. There's a bunch of stuff people have to work on, whether it's using closed-source libraries like TensorRT-LLM, which only works on Nvidia, or vLLM, which is an open-source library that works on AMD and Intel and, soon, other people's chips as well. There's a lot that people need to figure out.

One of those things is continuous batching, because running inference at batch size one is horrendously expensive. It's fine on your own personal devices, but if you're running in the cloud, renting GPUs, batch size one will cost you 10x more — and 10x is a low bar; it could actually be 10x to 100x more than running at a high batch size. So you have to figure out how to run high batch sizes, and batch size is how many concurrent users you're serving. One of the things that makes this difficult is that user requests come in at different times: one person sends a request now, another sends one five seconds later, but the first person's request isn't done yet. So you need continuous batching, i.e., the ability to run through the model iteratively every step and bring in new users as you go. Continuous batching is something you have to have support for, and a lot of software today, like llama.cpp, doesn't support it, so either you build it yourself or you contribute to an open-source project that builds it, to enable low-cost inference for models like Llama 405B.
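To make the idea concrete, here is a toy scheduler loop — a minimal sketch of continuous batching, not any particular engine's implementation (a real scheduler like vLLM's also handles KV-cache paging, preemption, and prefill interleaving):

```python
from collections import deque

MAX_BATCH = 64
waiting = deque({"id": i, "generated": 0, "max_tokens": 10 + i} for i in range(100))
running = []

def decode_step(batch):
    """One forward pass: every request in the batch emits one token."""
    for req in batch:
        req["generated"] += 1

while running or waiting:
    # Key idea: new arrivals join between steps, at token granularity,
    # instead of waiting for the whole batch to drain.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    decode_step(running)
    # Finished requests retire immediately, freeing their batch slot.
    running = [r for r in running if r["generated"] < r["max_tokens"]]
```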
Another one of those things is disaggregated prefill, or disaggregated batching, depending on what you call it. Going back to earlier: prefill is very compute-intensive, decode is very bandwidth-intensive. These are two different workloads, but when you're serving a user — whether in your own app or through an API — the user doesn't care that it's two different workloads. It's one workload to them: I submit something and I get tokens back. But anyone running the infrastructure needs to be keenly aware that these are two different workloads.

So one thing a lot of people have started to do — Google has publicly said they're doing it, I believe OpenAI and Anthropic are also doing it, and other firms like Together and Fireworks have hinted at it — is disaggregated prefill. Once your inference volumes are high enough, you don't just replicate the model across however many chips you have. Say it takes four chips to serve Llama 405B: if you have enough users, you don't just go four, then eight, then sixteen, replicating that across the world. Instead, one set of accelerators does the prefill, which is very compute-intensive, and then hands off to another set of accelerators to do decode. Today everyone uses the same accelerator for both — H100 or A100, maybe L40 or something, but mostly H100.

There's a big reason you do this: noisy neighbors. If you've ever worked on CPUs or anything in cloud computing, noisy neighbors are a huge issue, and it's actually fairly trivial to dramatically slow down most inference providers' services just by sending queries in a certain, sort of malicious, way. That impacts users' time to first token, which is a huge issue: if time to first token is too long, people will just quit using your service. And if tokens per second varies a lot — for a moment you get 100 tokens per second, then it drops to 30, then it goes back up to 100 — that's really annoying to the user. So there's a lot around SLAs and reliability you have to guarantee, and disaggregated prefill is one of the techniques for that.

You don't want someone to say, "I have a database, I want to run an LLM query across every single row, and I'm just going to submit it to you, my service provider, because you have this cool model fine-tuned on some dataset," and have those 10,000 rows, submitted at once, kill everyone else's performance. This is one of the techniques people use to make sure that person — who you definitely want to serve — doesn't impact everyone else's usage, because once you open your service to the real world you can't control who submits what, and rate limits are the most annoying thing ever, so that's not the correct way to go about it.
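A conceptual sketch of the routing idea, with thread pools standing in for the two accelerator pools (purely illustrative; real systems transfer the KV cache between GPU fleets over the network):

```python
from concurrent.futures import ThreadPoolExecutor

prefill_pool = ThreadPoolExecutor(max_workers=4)  # stands in for compute-heavy GPUs
decode_pool = ThreadPoolExecutor(max_workers=4)   # stands in for bandwidth-heavy GPUs

def run_prefill(prompt: str) -> dict:
    # Real systems build the KV cache here; we fake it with metadata.
    return {"kv_tokens": len(prompt.split())}

def run_decode(kv: dict, max_tokens: int) -> str:
    return f"decoded {max_tokens} tokens against a {kv['kv_tokens']}-token cache"

def serve(prompt: str, max_tokens: int = 128) -> str:
    # A giant prompt saturates only the prefill pool; decode latency for
    # everyone else's in-flight requests is unaffected.
    kv = prefill_pool.submit(run_prefill, prompt).result()
    return decode_pool.submit(run_decode, kv, max_tokens).result()

print(serve("row " * 10_000))
```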
Another thing is context caching. Google launched this recently; they're the only ones offering it today, but I think it's a really big deal, because when people talk about fine-tuning models, that's great, but in reality the best models are really expensive or impossible to fine-tune. I can't go fine-tune Claude 3.5 Sonnet, and fine-tuning Llama 405B is going to take dozens and dozens of GPUs. So instead of that — and Google mostly does closed-source models for the big ones, so on Gemini 1.5 Pro they brought this out recently — there's context caching. Instead of fine-tuning your model, why not just fill out the context length? They offer, I think, two million tokens of context now. Why not fill it out with your data?

There are a couple of advantages. One is that you can use the best models: with fine-tuned models you're really stuck with Llama 7B or Mixtral or Llama 70B, much lower-quality models than what's available in the closed-source world. So one thing you can do is implement what Google calls context caching in the open-source world; we'll have super-long-context open models soon enough. And economically: we talked about $15 per million output tokens and $5 per million input tokens on the best closed-source models today. If you submit a prompt of, say, a million tokens — and most of the time you're looking at a document and getting a short answer back, so your output is very small — almost all of the cost is just sending them that document, and that's going to really hurt you. For people targeting legal AI, contract-review AI, or a lot of these enterprise use cases, prefill is going to dominate your cost if you're using APIs.

So Google has context caching, and open source will have it too, in models you can run yourself and that others will deploy over time. The basic idea is that you don't recompute the KV cache — the processed context — every single time; you cache it instead. The problem is that saving it takes an incredible amount of memory, so you don't keep it in the GPU's memory: you save it in the CPU's memory, or in storage. vLLM, the open-source inference library, is building this currently, so if you're interested in contributing, check it out; or if you're interested in using it, just start a project. While most of the models we can run ourselves today are only 32K or 8K or 4K context length, longer ones are coming, and being able to cache the context is going to dramatically reduce cost.
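To see why a cached context spills out of GPU memory, here is rough KV-cache sizing arithmetic; the model shape is an assumed Llama-3-70B-like example (80 layers, 8 KV heads via GQA, head dim 128, fp16), not a figure from the talk:

```python
layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2  # fp16
# Per token: a K vector and a V vector for every layer.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # ~320 KB/token
for ctx in (32_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {per_token * ctx / 1e9:6.1f} GB of KV cache")
# 1M tokens is ~330 GB: far beyond one GPU's HBM, hence CPU RAM or storage.
```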
Now I'm going to talk about head-in-the-clouds stuff instead of immediately usable things: what's coming down the pipeline. GPT-4 was something like 20,000 chips for 90 to 100 days, using around 38 gigawatt-hours. Very, very expensive; cool. But what are they building now? OpenAI, xAI, Anthropic, and many others are building 100,000-chip clusters, and a cluster like that would train GPT-4 in about three days, so that scale of model is almost irrelevant now. I'll skip over this part; it's not that relevant.

What's a modern system capable of? The H100 is pretty fast relative to the A100, and coming down the pipeline are the new Nvidia chips. But what comes with these 100,000-GPU clusters? It's not going to be 1.8-trillion-parameter models: it could be models in the tens of trillions of parameters. On training compute, GPT-4 was roughly 2×10^25 FLOPs; with a 100,000-GPU cluster you can do 10^26 or 10^27 FLOPs, and running the resulting model is going to require on the order of 200 terabytes per second of memory bandwidth.

What does that look like physically? On the top right is an image of Microsoft's data centers in Arizona, where they're making GPT-5. They have about 100,000 GPUs there, at 150 megawatts — that's on the order of tens of thousands, if not hundreds of thousands, of homes' worth of power consumption. It's kind of insane. Elon has talked about his next-generation cluster: he's building a 100,000-GPU cluster today, but his next-generation cluster is 300,000 GPUs, which is kind of insane — the power cost for that alone would be something like $500 million a year. People are kind of insane, but it's pretty cool.
The interesting thing on training is that when you try to train a model today, people just talk about fully connected clusters: every GPU connected to every other GPU at some speed, and you run all your operations over that. But that's not really possible when you go to these super-large clusters. The 100,000-GPU clusters are being built this year, and next year they're already planning multiple-hundred-thousand-GPU clusters; you can see a single cluster spanning multiple buildings, so there's a lot of complicated networking going on to connect these data centers together.

One other thing I think is interesting — again, head-in-the-clouds — is that when you connect these chips together there are a lot of optics: you convert from electrical to optical and run over fiber to connect between chips, with transceivers and so on. These are extremely unreliable; they tend to have a mean time to failure of around five years. So in a 100,000-GPU cluster, or a 500,000-GPU cluster, you're going to have something fail roughly every five minutes. Which is insane: how do you even deal with something in your cluster failing every five minutes while you're training a model? So this is again more of a hardware-oriented thing.
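The arithmetic behind "a failure every five minutes" is straightforward; the transceiver count per GPU below is an assumption for illustration, since the talk doesn't give one:

```python
gpus = 100_000
optics_per_gpu = 5   # assumption: several transceivers per GPU across the fabric
mttf_years = 5       # per-transceiver mean time to failure, per the talk
parts = gpus * optics_per_gpu
minutes_per_year = 365 * 24 * 60
print(f"~one failure every {mttf_years * minutes_per_year / parts:.1f} minutes")
# -> ~5 minutes; a 500,000-GPU cluster would be five times worse.
```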
The other interesting thing is that when you get chips, they're not all the same speed: an H100 is not an H100. There are stragglers. With a large distribution of chips you get what the industry calls the silicon lottery: you can buy a gaming GPU and compare it to other people's gaming GPUs on the forums, and there are percentage-level differences in performance. But when you run a massive training cluster, training is a synchronous workload: you run through a bunch of data, pass the gradients around, update the weights, and repeat. Because it's synchronous, if one chip is 10% slower, everything is 10% slower. ByteDance had a cool paper where they saw a 25% decrease in speed just because of one random GPU they got: while it technically worked, and according to Nvidia it was within spec, it was 25% slower than what they wanted — and this was on a 20,000-GPU cluster. So it's quite interesting that these are the problems people run into at scale. They pulled that GPU out, and you can see their performance dramatically uplifted during training. Again, that's ByteDance on a 20,000-GPU cluster, so it's a big, big issue.

I think some of the other stuff in this presentation is not really relevant, but "what do these next-generation systems look like?" is a very important question to ask yourself, along with "what do I do when I have to deal with that?" A lot of the scaffolding people are building today for LLMs deals with hallucinations and things like that, and the hope that everyone has — or at least a lot of the AGI people have — is that when you 100x the compute, when you build a cluster that takes $500 million of yearly electricity (and the cluster itself costs over $10 billion, by the way) and train a model with it, it's going to get rid of a lot of the hallucinations and let us do a lot of interesting things.

So I think that's basically all for the talk. I just wanted to mention the reasonable stuff — how do you run Llama 405B, some strategies people need to implement that aren't necessarily in the open source yet but are implemented at the labs — but then also what the labs are doing, because they're not worried about Llama-405B-capable models anymore.

[Applause]
Awesome, that was a great talk. Now we have Sunny Madra from Groq.

Hey everyone. Great talk, Dylan.
All right, thanks everyone. Exciting to be here. This is basically a brand-new talk, so I apologize if it's a bit wonky as we go through it, but I think it should be fun. What we really wanted to pay homage to today is that just 25 years ago we crossed the one-gigahertz speed barrier in microprocessors. What's really crazy is that when we started thinking about this talk, I actually thought it happened well before 1999 — I was remembering my own arc of getting involved with computers — but really it was 1999; I had to double- and triple-check it. This is the exact press release from when Intel broke the one-gigahertz speed barrier. That was interesting from a couple of perspectives: one, it was this really big number and moment; but two, it was really after this that Intel started to change how they thought about how processors would be used, and went for multicores and things like that. And it's really something we need to think about in terms of what's going to happen with LLMs. If you go back to the rate of increase, it only took about two decades to get three orders of magnitude of speed improvement in microprocessors.

So take a step back and look at where we are with LLMs and whether we're anywhere close to that speed of innovation. In fact, what we hear a lot of people say, including Jensen, is that we're beyond the curve of Moore's Law, so we're actually innovating even faster than that in LLMs today. Just to look at what we've been able to do at Groq in a short amount of time: between April and June of this year, we were able to increase the speed of Llama 3 8B by over 50%. The improvements happening in this area are really, really quick and super exciting, and we're keen to dive into what could happen here.
So let's think about the state of the art. There are models today that we, and others, can run that process huge inputs — the equivalent of 10,000 input tokens per second — which gets a big prompt down to, say, a third of a second to process. When you do that, you end up with capabilities that, from a speed perspective, far exceed human capabilities for both integrating and analyzing information, and it's happening really fast. The example I like to talk about — I don't know if you've used it, but I highly recommend it — is this really cool service called globe.engineer. You give it a task; the example I use here is "help me plan a trip to New York to try the best pizza," or something like that. I couldn't even capture the whole screen here, but it basically figures out all the different elements that have to happen, and it does this live, connected to the internet: everything from the flights to the taxi options to the hotel options, then the food options, then the itinerary and how I can do it. And it does it all in maybe less than five seconds. Think about what's really happening there: when I plan trips myself, I end up opening tens, sometimes even hundreds, of tabs, each of which is a research stream for me. Now all of that is solved in a simple interface, really enabled by these LLMs being able to, one, process input tokens faster, and ultimately output tokens faster. It's giving us a huge leg up in how we operate as humans.

Where does this all go? If we start thinking about human superintelligence, and about optimizing and accelerating models, it takes us to interesting paradigms. We'll talk about this more in a second, but the high-level way to think about it is: what if an LLM really becomes either an operating system, or the core of how we think about compute, and we think about it completely differently than any of the approaches we've had before — the way we program these things, our expectations of how they analyze things? That's interesting in terms of superintelligence — staying away from AGI, but more about changing the paradigm from where we are today.
from where we are today and you know the
thing that crosses my mind here is what
happened in the industrial revolution
you know if we think about three
Industries let's think about making food
making cars and making clothes all of
those before the Industrial Revolution
were bespoke right so you'd have you
know people that would make one or two
cars a day you'd have people work on
farms that could you know maybe farm for
less than a city even a small village or
someone that was making sweaters could
you know make them you know one one a
day or maybe even one a week and when we
had the Industrial Revolution show up we
basically had this ability to make
hundreds or thousands of cars a day food
farming at a scale that could be
national uh clothing that could be made
at national scale and we're really you
know we haven't had that in technology
um The Arc of technology has been um and
this isn't my own framework it comes
from pal meritz uh you know who was a a
longtime Microsoft guy and then VMware
and then pivotal where where he and I
met um you know he said the first era of
computing was just taking paper
processes and making them digital and he
goes That's evident in the way if you
think about how the operating system is
structured files folders inbox outbox
those are all paper processes that got
turned into you know digital processes
the next era for us was basically making
those things connected right that's the
internet era and what we've been through
now you know maybe in the last 15 years
is form factor changes right either
pushing things into the cloud for scale
or mobile so you can do it on your phone
but finally with AI we're we're starting
to get to a place where we have the
industrialization in the same way we saw
for those you know manufacturing and
physical Industries we see that for
technology so you know 18 or maybe 24
months ago if you needed to have a um a
Photoshop made of some kind of artifact
that you're going to put in a
presentation you'd go to your designer
and maybe the designer would make one or
two a day for you now you can go to Mid
journey and get a th made in the next
minute if you want to so we're going
through that same kind of
industrialization for tech technology
If we dive in deeper into where this goes: we can get to something like 10,000 complex decisions per second just by getting latency down to 0.1 milliseconds, and if we really keep increasing that, it becomes viable to think about the core of our computing becoming an LLM. I think this is a real challenge for a lot of people, because we obviously have existing paradigms we're really locked into, but this paradigm shift is fundamentally different in terms of how software will be built, how software will run, and how software will scale. We don't think about it much today, because we think about the speed limits of running LLMs and their current capabilities. But if we imagine the same growth we saw in CPUs happening in this era, we can imagine the core of these devices changing to become — and this is a hat tip to Karpathy; this is a diagram he drew — an LLM at the center: what happens in video and audio (we're starting to see that today), what happens in our browsers, how we interact with other LLMs, with code interpreters, and even with our file systems. So what is the art of the possible if we start doing this?
I'll just rattle off some things here that crossed our minds as we were putting this presentation together. We really don't spend a lot of time thinking about it, but many responses today from LLMs are sort of near-real-time — at reading speed. If we go to instantaneous responses and decision-making, everything becomes a lot faster. Again, this is really evident with that Globe example I showed: what you're really able to do there is take a task that would probably take you an afternoon, an evening, or a number of evenings, and it's done in just a few seconds.

Then there are personalized experiences. Today we don't really have a lot of personalization; we're starting to see elements of it. I think OpenAI has started to launch features that let it understand specifics of your world — your pet's name, your kids' names, your spouse's name. But where this really goes — and a lot of people push on this; I know two of my friends, Bill Gurley and Brad Gerstner, talk about this a lot on their pod — is that they view personalization as the next major frontier, and personalization and speed are going to go hand in hand if we're going to make that work seamlessly for folks.

Next is a kind of universal natural-language interface. Think about our interface to software today: we started with point-and-click and keyboards, we've gone to touch with our mobile devices, but you start to see the power of natural language. I think everyone has been super excited about the release of GPT-4o and the voice agents; I don't think we've fully gotten there yet, but they showed the art of the possible with voice. And then there's mixed interaction, which we refer to as "xRx": any type of input, reasoning, and any type of output. The example I like to give: if you're trying to order something, you may want to interact with an agent by voice but see the responses in text. Think about booking a haircut: you ask what times are available, and it tells you 9:00 a.m., 11:00 a.m., 3:30, and 5:30. That's hard to remember if it only comes back in voice, so you want these interactions to be multimodal — which touches on my second point — and I think we're going to see a lot more of those interface changes as well.

Then advanced virtual assistants: complex task scheduling. A lot of what we'll see in the back half of this year is agents becoming much more complex, with a lot of focus from LLM providers on making complex tasks something that gets solved. It's interesting, because today we measure the efficacy of LLMs generally through single-shot evaluation, and I think we do that because of the performance barrier we talked about at the start of the conversation. But naturally, if you take any existing LLM today and multi-shot it, its scores get a lot better, and a couple of papers came out recently showing that if you just have multiple agents working together on a problem, a lower-parameter model can compete with higher-parameter models just by doing multi-shot reasoning or working together. So I think we'll see a lot more of that as speed improves, and there's incredible optionality there.
We saw the first cut of collaborative AI agents with Apple Intelligence, where you see something running on-device interacting with something off-device. I think it's a very early implementation, and these things will get much more sophisticated and better.

An area we've spent a lot of time in during our careers is analytics and predictive analytics. Today everything is pretty much action-oriented, driven off a human action. If the speed goes up, analytics can become a lot more predictive. What does that really mean? It's just an agent that's always running in the background, because the compute cycles are next to free. We don't see that today, but I think we get there as we get higher up the curve. Context-awareness too: today we're generally limited in how much context we can provide, and even with models with bigger context windows we still have to be conscious of how many compute cycles we're going to use. If that becomes next to free, it becomes quite powerful for us.

Creative tools and customizable content: I'll focus on the second one here; this is an area where I think many of us would like to see things go. One of my favorite shows was Seinfeld — obviously it's not on anymore — and one of the things I like to do when I'm bored is go into the LLM of my choice and have it write a Seinfeld episode made up of modern-day things that are happening. If you ever try that, it's super fun, because it does an incredible job of identifying which character in the scenarios you give it would have the funny or odd thing happen to them. The idea of taking that beyond writing, into multimedia forms, is going to be really powerful going forward.

Complex decision-making: before our company was acquired by Groq, we were building a company called Definitive Intelligence, so we spent a lot of time in this space — not only doing natural-language analysis of SQL, "text-to-SQL" as a lot of people call it, but Rick, who's sitting here with us, was working on this really cool product for us called Pioneer, an automated data-science agent meant to run almost endlessly on a problem, where you define a cap. Think about how a business runs: it has a bunch of KPIs and a bunch of data coming in, and usually humans take that data, analyze it against the KPIs, and create PowerPoints and spreadsheets to tell senior management or the world how well they're doing. There's no reason that shouldn't just happen automatically, with an agent constantly looking at the new data coming in, asking additional questions, and diving into it. We had a lot of interesting things emerge: we let Pioneer loose on a dataset of human workers and their performance reviews, and one of the things we saw was that it correlated things we hadn't thought about — depending on your age and your performance review, your output, your productivity, was really affected. It discovered that if you're of a certain age and you got a certain type of performance review, your productivity would fall off. Maybe Rick can correct me if I'm wrong later, but it was something along those lines, and it was always an interesting example for us.

And then obviously a lot of really interesting things around dynamic optimization. This is an area we're familiar with from before: when a bunch of us were at Ford after the acquisition of Autonomic, we really saw this in the supply chain. If you think about how cars are produced and shipped, there's pretty sophisticated software that does this, but it's still not efficient, and the art of the possible with what we were talking about earlier could be very interesting for some of our old colleagues at Ford.
I'll touch on a couple more things, and then leave a couple of minutes for questions if there are any. Edge AI and decentralized AI: this is pretty cool. There's a really cool project called Hyperspace; what they're doing is sort of like SETI@home, or even Render, where they're basically letting people take their unused GPU compute and make it available in the cloud. Why that's interesting is that certain use cases don't necessarily require something to be real-time, so I think we'll see a lot more of that. This intersects really well with us getting more throughput and lower latency out of existing systems, especially given the amount of power consumption that's required: if you distribute it, that could be really interesting.

A couple more: enhanced security and privacy. This is a big area. I was talking to one of our colleagues last night, and he was subject to a really scary kind of phishing call, where someone called in, sounded very formal, and had access to a lot of information. We've all seen the people who run scam call centers, and the people who go and attack them, but these folks, armed with AI, are much more sophisticated, because they can create stories and narratives that are much deeper than the call-center worker of the past. I think in order to protect against these systems you'll almost need to have something on your side. With our colleague, he was just so confused because the narrative was so good; the only way he could really figure out that this person was a scammer, other than hanging up on them, was asking for some kind of formal message. And we'll need those defenses to run incredibly fast.

And I think this is the last set: education is something that's really important to us broadly at Groq. We think about making tokens available more cheaply and more broadly, and about personalization. Sal Khan has a very good TED talk from a couple of years ago — the "two sigma" talk — where he says you can take any student at any level, the highest levels or someone performing lower, and if you give them a personalized tutor, they can improve their test scores by two standard deviations. Imagine doing that with AIs that are, one, very cheap to use, and two, personalized to the student's learning experience. I was speaking to someone recently who was building an AI service for homeschooling, and what was powerful about that particular service is this: say you have a young child who's really into unicorns or ponies, and you want to teach them math — subtraction, addition, multiplication. It's a lot easier if you frame it in the context of those things: "hey, you have three ponies times two unicorns — what do you get?" I'd never thought about that before, but for learning, customizing to the interests of the person is quite powerful, so we'll see more of that.

And then just interoperability and compatibility. If you've ever been in enterprise software, the majority of money spent deploying and maintaining enterprise software is really related to interconnectivity, interoperability, and compatibility, so having really fast and cheap AI technologies will help us reduce a huge burden that exists in the enterprise today. So, that's it. Hopefully you guys enjoyed that.
[Music]
Thank you so much.

Awesome, we had such a great talk from Groq. Now we have a product manager from Crusoe, and he'll be speaking about accelerating mixture-of-experts.

Awesome, thank you. Okay.
Great. Thank you for coming over to our session today. Lots of really interesting stuff is happening in the AI world these days: with all the recent model developments and GPU developments, it's really cool to see all the use cases. Here, however, I want to talk a little bit about the infrastructure — the way we can support the newest GPUs and the newest machine-learning models, and how we can help everything work smoothly, fast, and productively.

My name is Yen Vinko, and I'm a product manager at Crusoe. My main responsibility is infrastructure, specifically GPU networking infrastructure, and we're always looking for ways to increase the performance of that network because, as we'll see later in the presentation, it really matters.

Now, a little bit about Crusoe. Crusoe is an AI cloud platform with one mission I think is very important for all of us: to align the future of computing with the future of the climate. There is really strong demand right now for computing power, GPUs are really energy-hungry, and there's a lot of investment going into data centers, which of course puts additional pressure on the grid and on energy sources. What we're trying to do at Crusoe is utilize stranded energy sources, wasted energy, and renewables to power our data centers. We want to make sure that every time you train your model, and every time you use a GPU for inference, you're not causing any negative impact on the climate.

Whenever we build the cloud — the AI cloud — we build it on three important pillars. First, high performance: as customers buy our services, procure GPU time, and train their models, we have to ensure that all the infrastructure is optimized for that training. Every time it isn't — every time there's a delay, a glitch, any sort of outage, or simply poor performance — it has a direct impact on the customer's bottom line: it directly affects time-to-train and raises the cost of training the model. The second pillar, which I think is very important for everybody here, is ease of use. We really want to separate ourselves from the general-purpose clouds. We know the hyperscalers are building great infrastructure and trying to support each and every use case a customer might have for cloud computing; in our case, we want to focus on the experience of the AI engineer. We want to provide a simpler user interface that lets developers spin up compute resources and deploy models — to train them, to use them for inference, and so on. All the underlying complexity of the infrastructure is hidden by us, and I believe it's our job to make sure that stays the case. And third, as I mentioned, we are climate-aligned, which means we as a company are aiming to power 100% of our data centers with renewable, wasted, or some form of stranded energy sources, to ensure we are net zero on emissions from a carbon perspective. We have a big story around that: feel free to check it out on our website, or come over to our booth on the show floor and the team will be happy to talk about it.
number of the data centers located
across the US uh as you see three of
them in the continental United States
and they are generally located close to
the energy sources I was mentioning
before so we have the one in Texas we
have uh the one in the northern central
part of the country and on the East we
are also building right now one big data
center in Iceland that will be powered
by the geothermal energy I mean again a
amazing way to use the constrained
energy sources or the renew Ables to
power the data center we are trying to
follow that model hence we are placing
our data center strategically the
placement of the data center in Iceland
though will be also very important for
our Emir customers given the latency and
the and the general connectivity to the
Europe that is something I think uh
might be helpful for them as
Now, what is our platform? I say Crusoe Cloud, but generally, whenever we are talking about any cloud, we are talking about three general types of products. First and foremost, we have compute: we offer VMs with GPUs attached to them, so every time a customer wants access to GPUs, they are able to get it through a VM. They can get a bunch of VMs connected together and use them as one single training cluster. We also offer CPU instances for any potential data pre-processing or general-purpose compute tasks you might have, for data preparation, for offload, whatever you need.

From the storage perspective, we offer ephemeral and persistent disks on the node, delivered from the NVMe on the local server where your VMs are placed. We also have a persistent block storage solution available for our customers, and we are working on delivering managed file systems, network file systems, for customers as well.

On the networking side, there is of course the more traditional, more typical VPC networking. That is the network we sometimes call the front-end network: it is used to deliver customer traffic from the Internet, or from the customer environment, wherever the customer might have their data sources, towards the VM. So that is your main connectivity path to the outside world.
Now, we do offer a number of additional services on top of that; it is not just connectivity. We also have firewalls, and we will be offering load balancers soon, but generally we are trying to follow the more traditional path for VPC networking and the requirements customers usually have there.

What is more interesting, and what we will be talking about in greater detail a little later today, is our rail-optimized InfiniBand cluster networking. For those of you who don't know, GPU providers typically separate their networks: they have the front-end network, which is used for general-purpose traffic, but all the communication between the GPUs happens on a standalone, separate network that is really high performance, low latency, and high bandwidth, and whose whole topology is optimized for GPU-to-GPU communication.
Now, last but not least: the user experience. As I mentioned before, our main customers, our main persona, the people using Crusoe Cloud, are AI developers and machine learning engineers, so we want to make sure they have what they need in order to be successful without having to think too much about infrastructure. We offer a CLI, we have APIs, we have a GUI, so everything can be automated, and everything can be consumed and configured the way you like it.
We do have a lot of customers already, and it was very fun for me to see on the floor that some of them are here, talking about their solutions. This is probably the first time in my life that I'm attending a conference, standing at a booth, and I don't have to compete with all the people around us: we see the companies presenting their solutions right now as our partners. We do partner with a bunch of them already, Together AI among others, and they are using our infrastructure for different purposes. Together AI, for example, is really into using Crusoe infrastructure for ML training, for fine-tuning their models, and sometimes for inference; others are using our compute infrastructure to train new foundational models. This is really great: if you are a customer of Together AI, for example, or Codeium, or whatnot, it is likely that you have somehow been exposed to Crusoe infrastructure.
Now, distributed training has a very specific set of problems. There is the compute part of it, when the computation is being done on the GPUs. But since we are talking about distributed training, which means there are a lot of GPUs, at certain stages, whenever a training step is completed, all the GPUs have to exchange the data they calculated on their own. This is typically done through the all-reduce or all-gather collective operations: training consists of a forward pass and a backward pass, but then the networking part, without any optimization, takes about 25-30% of the training time.
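(For reference, a minimal sketch of what that exchange step looks like in code, assuming a standard PyTorch distributed setup launched with torchrun; the tensor shape and backend choice are illustrative, not Crusoe-specific.)

```python
# Minimal sketch of the gradient all-reduce step described above.
import os
import torch
import torch.distributed as dist

def training_step(grads: torch.Tensor) -> torch.Tensor:
    # Each rank computed its own gradients; now every GPU must exchange
    # them so all ranks end up with the same averaged values.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()
    return grads

if __name__ == "__main__":
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK; NCCL is the usual
    # backend for GPU-to-GPU collectives on the cluster fabric.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    fake_grads = torch.randn(1024, 1024, device="cuda")
    fake_grads = training_step(fake_grads)
    dist.destroy_process_group()
```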
This is time when your GPUs are sitting idle; they are not able to compute anything because they have to wait for all the information to be gathered. That is a bad thing for everybody: it is bad for the customers, because they still pay for that infrastructure while they wait and it delays the model training, but it is also bad for us, because our infrastructure is not being performant enough. There are a couple of tricks we can do. First of all, computation/communication overlap allows you to start the network exchange, the data exchange, while the computation is still ongoing. But even with that, when we were working with customers, we saw a reduction of only about 10%, so about 25% of the training time was still spent on the network. As a product manager on the infrastructure side, I am constantly asked: how can we reduce that? How can we utilize the network as much as possible and close that gap?
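(A hedged sketch of that overlap trick: PyTorch's DistributedDataParallel buckets gradients and starts the all-reduce for completed buckets while the backward pass is still running. The model, sizes, and bucket setting below are illustrative assumptions.)

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()
# bucket_cap_mb sets the overlap granularity: smaller buckets start
# communicating sooner, at the price of more per-call overhead.
ddp_model = DDP(model, bucket_cap_mb=25)
opt = torch.optim.AdamW(ddp_model.parameters())

x = torch.randn(32, 4096, device="cuda")
loss = ddp_model(x).square().mean()
loss.backward()   # bucket all-reduces overlap with gradient computation here
opt.step()
dist.destroy_process_group()
```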
So we have been looking into that, trying to figure out what the right cluster networking topology would be: how can we make sure that the data fabric used for connecting the GPUs is fully optimized and able to provide the bandwidth and latency needed? The standard fat tree, for those of you who have worked in data center infrastructure before, is something we have traditionally been building for years, and it is a great way to build a scalable, maybe non-blocking fabric. But there are a bunch of issues with it. First of all, if we connect the servers shown below to a single leaf, that introduces a single choke point as well as a single fault domain: if we lose the leaf, we lose all the GPUs connected to it.
Now, what else were we thinking about? Look, we have that switch that can be used for back-end traffic propagation, so why don't we use it from a bandwidth perspective and gain an additional path?

Let me use a simple two-node example to explain the topology and how we are using it.
First of all, whenever GPUs want to communicate within one server, they can use the embedded NVLink and NVSwitch, and that provides good communication; they don't have to go out to the external fabric at all. Now, when we have data communication between GPUs on different nodes, if they are connected to the same leaf, that is something we call one single rail: the traffic passes through that one leaf, just one hop away, and you get to the destination. What is interesting is that when we want to talk across different rails, we have to go all the way up to the spine, and that introduces an additional hop, besides potential bandwidth saturation problems, which may add latency that really matters for your all-reduce operations.
Luckily for us, NVIDIA, with a recent version of NCCL, introduced a feature called PXN, which allows you to use the internal NVSwitch inside the host to communicate across rails. So whenever we want GPU 0 to communicate with GPU 8 on another host, we can use the internal switch to hop the traffic between the GPUs and then send it out through the leaf it is connected to. That still allows us to use one single hop while having access across the different rails of GPUs.
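(A sketch of how one might reproduce this comparison with the standard nccl-tests suite. NCCL_PXN_DISABLE is a real NCCL environment variable; the binary path and message-size sweep below are assumptions about a typical setup.)

```python
import os
import subprocess

def run_allreduce_perf(pxn_disabled: int) -> None:
    env = dict(os.environ)
    # NCCL_PXN_DISABLE=1 forbids the PXN path; 0 (the default) allows it.
    env["NCCL_PXN_DISABLE"] = str(pxn_disabled)
    env["NCCL_DEBUG"] = "INFO"  # logs which transports/paths NCCL selects
    # Assumed nccl-tests build location; sweep 8 bytes -> 8 GB on 8 GPUs.
    subprocess.run(
        ["./build/all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2", "-g", "8"],
        env=env,
        check=True,
    )

for flag in (1, 0):   # run without PXN, then with it, and compare bus bandwidth
    run_allreduce_perf(flag)
```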
We ran some NCCL tests and saw quite a significant improvement: about 50% for small messages and 50% for large messages. For the smaller messages these numbers are about latency; for the larger ones we care more about bandwidth, because latency tends to stay roughly the same. Those numbers are great, and everybody would love them, but not the customers, and that makes sense: those numbers are synthetic, and mostly show a raw workload applied to your network.
What customers care about is the time to train their particular model. So we used a sparse mixture-of-experts model as an example. I'm not going to dive into the details of how it works, but essentially a sparse mixture of experts has different layers of experts and routes the traffic between them. When you deploy that on a really large GPU cluster, it creates a ton of traffic: all the GPUs have to send traffic to each other and exchange information, so the load on the network is pretty significant.
So we used Mixtral, the open-source sparse mixture-of-experts model, which consists of eight feed-forward expert blocks at seven billion parameters (the 8x7B model), and we fine-tuned it on 240 H100 GPUs. We saw quite a significant improvement with PXN enabled versus without it: 14% of improvement, which translates directly into the time to train the model and directly into the cost of training the model. That is something everybody got really excited about; I definitely got excited, and our customers did as well, because it shows them real value they can get with the model.

That was it from my side. Sorry for going through it so fast; it's a very large topic and it's hard to talk about, but I'm happy to answer any additional questions you might have. Thank you. Feel free to come over to our booth if you want to chat more.

Awesome, with that this track has concluded for now, so feel free to go for lunch. Have a good day.
[Music]
Hi everyone. I'm a research engineer at Google DeepMind and, as mentioned, I'm the technical lead of Gemma. Before I get started, I just wanted to say how awesome it is to be here with you all today. When we were building Gemma, our North Star, the thing we were most excited about, was building something to empower and accelerate the amazing developer community. And since we launched our first models in February, I have been absolutely blown away by the incredible projects, research, and innovations that have already been built on top of Gemma. So I'm particularly excited to be here with so many developers today, and especially delighted to unveil the latest advancements and additions to the Gemma model family. So, without further ado, we'll get started.
As many of you probably know, Google Research has been a pioneer in AI and ML research publications for the past decade, including publishing some of the key research that sparked the recent innovations we've seen in AI, like the Transformer, SentencePiece, and BERT, to name a few. Google DeepMind has continued this tradition and is actively working to share our research for the world to validate, examine, and build on. But Google's support of the open community for AI and ML is not limited to publishing research: we've also been doing work to support ML across the entire technical stack for a long time, from hardware breakthroughs like TPUs, which I imagine is especially relevant for this crowd and this track, all the way to an evolution in ML frameworks from TensorFlow to JAX.
Throughout all of this, open development has been especially critical for Google. Our ability to collaborate with the open-source community has helped us all discover more, innovate faster, and really push the limits of what AI is capable of. This long history of support for the open-source community leads us to today, and to Google's latest investment in open models:
Gemma. Gemma is Google DeepMind's family of open-source, lightweight, state-of-the-art models, which we build from the same research and technology used to create the Gemini models. I'm so sorry, I think that's my phone going off during this talk; please feel free to rummage through that bag. Sorry folks, thank you. Wow, lesson learned: even the speaker needs to remember to silence her cell phone. All right, back to it.
There are a couple of key advantages of the Gemma models that I want to highlight today. The first is that Gemma models were built to be responsible by design. I can tell you from personal experience that from day zero of developing a Gemma model, safety is a top priority. That means we are manually inspecting datasets to make sure we are training not only on the highest-quality data but also on the safest data we can. It means we are evaluating our models for safety starting with our earliest experimentation and ablations, so that we select training methodologies we know will result in a safer model. And at the end of our development, our final models are evaluated against the same rigorous, state-of-the-art safety evaluations that we evaluate Gemini models against. We really do this to make sure that no matter where or how you deploy a Gemma model, you can count on having a trustworthy and responsible AI application; no matter how you've customized Gemma models, you can trust that the result will be a responsible model. Gemma models also achieve unparalleled, breakthrough performance for models of their scale, including outperforming significantly larger models, but more on that very shortly.
We also designed the Gemma models to be highly extensible, so that you can use a Gemma model wherever and however you want. This means they're optimized for TPUs and GPUs as well as for use on your local device. They're supported across many frameworks: TensorFlow, JAX, Keras, PyTorch, Ollama, Transformers, you name it, Gemma is probably there. And finally, the real power of the Gemma models comes from their open access and open license. That, period, is what's powerful about Gemma: we put state-of-the-art technology into your hands so you can decide what the next wave of innovation looks like.
When we decided to launch the Gemma models, we wanted to make sure we could meet developers exactly where they are, which is why Gemma models are available anywhere and everywhere you can find an open model. I will not list all of the frameworks on this slide; this is only a fraction of the places where you can find Gemma models today. It means you can use Gemma how you need it, when you need it, with the tools you prefer for development.
Since our initial launch back in February, we've added a couple of different variants to the Gemma model family. We of course have our initial models, Gemma 1.0, which are our foundational LLMs. Shortly after that we released CodeGemma, which is the Gemma 1.0 models fine-tuned for improved performance on code generation and completion. And one variant I am particularly excited about is RecurrentGemma, which is a novel architecture, a state-space-style model designed for faster and more efficient inference, especially at long contexts. We've also updated all of these models since their initial release: we now have Gemma 1.1, which is better at instruction following and chat; we've updated CodeGemma for even more improved code performance; and RecurrentGemma is now available not only at the original 2B size but also at a 9-billion-parameter size.
So there's a lot going on in the Gemma model family, and I'm especially excited to tell you about our two most recent launches. The first one is actually our most highly requested feature since day zero of launch, and that is multimodality: we launched PaliGemma. Oh, thank you, I appreciate it.
Gemma oh thank you I appreciate
it this is why I love the open source
Community truly the most passionate
developers that there are P Jemma is a
combination of the Sig lip Vision
encoder combined with the Gemma 1.0 text
decoder this combin allows us to do a
variety of image text sort of tasks and
capabilities including question
answering image and video captioning
object detection and object
segmentation the model comes in a couple
of different variants it's currently
only available at the 2v size we have
pre-trained weights that are available
that can be fine-tuned for specific
tasks we have a couple of different
fine-tuned variants as well that are
already targeted towards things like
object detection and object segmentation
and we also have transfer checkpoints
that are models that are specialized to
Target a couple of academic
benchmarks up until this morning that
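(A minimal sketch of driving PaliGemma through Hugging Face Transformers; the checkpoint id follows the public release, while the image URL and prompt below are placeholder assumptions.)

```python
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Placeholder image URL; any RGB image works.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
# The "mix" checkpoints take task prefixes such as "caption en" or "answer en".
inputs = processor(text="answer en what is in this image?", images=image,
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```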
Up until this morning, that was our latest release, but I'm very excited to be here with you today because it is Gemma 2 launch day. Woohoo! Thanks. We have been working very hard on these models since the Gemma 1.0 launch. We tried to do as much as we could to gather feedback from the community, to learn where the 1.0 and 1.1 models fell short and what we could do to make them better, and so we created Gemma 2. Gemma 2 comes in both a 9-billion-parameter size and a 27-billion-parameter size. Both models are without a doubt the most performant of their size, and both also outperform models that are two to three times larger than these base models.
But Gemma 2 isn't just powerful; it's designed to integrate easily into the workflows you already have. Gemma 2 uses all of the same tools and frameworks as Gemma 1, which means that if you've already started developing with Gemma 1, you can switch to the Gemma 2 models with only a couple of lines of code and get increased performance and more power behind your applications. We also have the same broad framework compatibility: again TensorFlow, JAX, Transformers, Ollama, all of the ones I previously named, we have them for Gemma 2 as well.
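(What that "couple of lines" swap might look like with Hugging Face Transformers; the checkpoint ids are the public Hub names, and the dtype/device settings are assumptions.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Changing the checkpoint id is essentially the whole migration.
model_id = "google/gemma-2-9b-it"   # was, e.g., "google/gemma-1.1-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Why are open models useful?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```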
We also have significantly improved documentation, with more guides and more tutorials, so that we can coach you through how to get started not only with inference but with advanced and efficient fine-tuning from day zero. And finally, we really wanted to target fine-tuning as one of the key capabilities of these models. We did extensive research into how our core modeling decisions impact users' ability to do downstream fine-tuning, so we believe these models are going to be incredibly easy to fine-tune, and you can customize them to whatever your use case may be.
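(A hedged sketch of parameter-efficient fine-tuning of Gemma 2 with LoRA via the peft library; the rank, target modules, and training loop are illustrative assumptions, not an official recipe.)

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b")
lora = LoraConfig(
    r=16,                      # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a small fraction of the 9B weights
# ...then train with your usual Trainer / training loop on your own data...
```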
In addition, to make it especially easy to get started with the Gemma 2 models, we have made the 27B model available in Google AI Studio. This means you can go to the AI Studio homepage, select Gemma 2 right now if you want to, and start playing around with prompts right away; you shouldn't have to do anything except come up with an idea for how you want to push the limits of our model. I am especially excited to see what you all end up doing with AI Studio and Gemma. We have a couple of different ways for you to let us know what you're building, which I'll get to down the road, but if you have ideas, I'll be here all day and want to hear what you're doing with the Gemma models.
But let's dive a little bit more into performance. We are incredibly proud of the models we've made. As I mentioned, they are without a doubt the best, most performant models of their size, and they are also competitive with models two to three times larger. Our 27B model has performance in the same ballpark as Llama 3 70B, and it outperforms Grok models on many benchmarks, by a fairly significant margin in some cases. But I think academic benchmarks are only part of how we evaluate Gemma models; these benchmarks are not always indicative of how a model will perform once it's in your hands. So we've done extensive human evaluations as well, where we find that the Gemma models are consistently, heavily preferred over other open models, including larger open models. I'm also proud to say that the Gemma 27B model is currently the number-one open model of its size, and it currently outranks Llama 3 70B, Nemotron 340B, Grok, and Claude 3, along with many other models. Thank you, wow, you guys are very supportive, I appreciate it. The only other open model of any size that outperforms the Gemma 27B model is the Yi-Large model on LMSYS. So we expect you'll have some fun playing around with it, especially for chat applications: we found in our evaluations that the Gemma 2 models are even better at instruction following, even more creative, better at factuality, and better all around than the Gemma 1.0 and 1.1 models.
The other important thing I want to make sure to highlight from our most recent launch is the Gemma cookbook. The Gemma cookbook is available on GitHub now and contains 20 different recipes, ranging from easy to very advanced applications of how to use the Gemma models. And the thing I am most excited about is that the Gemma cookbook is currently accepting pull requests, so this is a great opportunity to share with us what you're building with the Gemma models so we can help share it with the rest of the world. And of course, I have to say, we also wouldn't mind if you star the repository; go take a look and tell us what you're building with Gemma.
So, there are a couple of different ways you can get started with the Gemma 2 models. Of course, I just mentioned the cookbook. You can also apply for GCP credits to accelerate your research using Gemma 2: we have a lot of funding available to support research, and I would really encourage you to fill out an application regardless of how small or big your project is. We also, as I mentioned, have significantly improved documentation, with many guides, tutorials, and Colabs across every framework, so you can get started doing inference, fine-tuning, and evaluation with the Gemma 2 models. You can download them anywhere open models are available, and please chat with us on Discord or other social media channels so we can learn more about what you're building.
And that's about all from me today. I am so excited to see what you all build with Gemma. I have been working on this project for almost two years now, and I started working on it because, as a researcher in academia, I was disappointed to see how far behind open foundational LLMs were compared to the rapid improvements we were seeing in proprietary models. So this is something that's very near and dear to my heart, and something I wish I had had when I was actively part of the open-source community. I'm very excited to see the projects and the research that you all do with these models. Please engage with us on social media, on GitHub, on Hugging Face, and here at the event, and let us know what you think of the models and what you think we can do better next time. Thank you all very much; I really appreciate your time.
Awesome, so we have maybe two more minutes. Would you like to say what is an exciting thing you'd like to see built with the Gemma models?

Yes, I love this question. As I mentioned at the beginning of the talk, I think safety and responsibility are a really crucial part of building the Gemma models, but I also think it's an area where, in general, we have not seen nearly enough research, and there's a lot of interesting work going on in the open-source community right now about how to build more private, more secure models; that will be particularly interesting. I'm also really excited by some of the novel architectures we're seeing people adapt Gemma models to, things like the state-space-style model, like RecurrentGemma. And probably the thing I'm most excited about is the research areas I know nothing about: I'm really excited to see people blow our minds with research ideas that never would have occurred to us. That's really what I'm looking forward to most.
And just one more question: Gemma V1 was out for a while; what is something you saw that blew your mind in the past couple of months?

One of my favorite uses of the Gemma V1 models builds on the fact that the Gemma models use the same tokenizer as the Gemini models. While Gemma is trained primarily on English data, the Gemini models are multimodal and multilingual, and this means the Gemma models are very easily adaptable to different languages. So one of my favorite projects, which was also highlighted at I/O, was a team of researchers in India who fine-tuned Gemma to achieve state-of-the-art performance on over 200 variants of Indic languages, which had never been achieved before: to have such a small model cover so many languages at such high quality. That was pretty crazy to see; that was one of my favorites. We also had, and this should be in the cookbook fairly soon, a cookbook recipe inspired by a use case where a small business was getting emails describing what their new orders should be. The emails were all over the place, with no structure and no format, but all sorts of different lists of requirements for what the small business needed to make for each order. The small business fine-tuned a Gemma model to take in this list of emails and output, in priority order, which orders needed to be taken care of and when. So I was excited to see that very practical application of a Gemma model as well.
Awesome. Okay, without further ado, we should move on to the next speaker, from Fireworks AI: here we have Dima, the CTO. The CEO couldn't make it today, but Dima will give us an excellent talk.
Right, the CEO couldn't make it today because of a personal emergency, so you got me. And as you saw, we don't yet have AI to figure out the projector, but we have AI for a lot of other things. So today I'm going to talk about Fireworks AI, and generally I'm going to continue this theme the previous talk started about open models, and how we focus on productionizing models and inference at Fireworks.

But first, as an introduction, what's our background? The founding team of Fireworks comes from the PyTorch leads at Meta and some veterans from Google AI, so combined we have probably a decade of experience productionizing AI at some of the biggest companies in the world. I myself have personally been a core maintainer of PyTorch for the past five years, so the topic of open source is really close to my heart. And since we kind of led this revolution of the open-source toolchain for deep learning, through our work on PyTorch and some of the Google technologies, we really believe that open-source models are the future for gen AI applications as well, and our focus at Fireworks is precisely on that.
So, how many people in the audience actually use GPT and deploy it in production? And how many folks use open models in production? Oh, okay. I was about to convince you that the share of open-source models is going to grow over time, but it looks like in this audience it's already sizable. Nevertheless: why this trade-off, why go big, why go small? Currently the bulk of production inference is still based on proprietary models, and the catch with those is that they are really good models, often at the frontier in many domains; however, each is one model that is good at many, many things, and it is often served the same way regardless of the use case. That means that if you have batch inference on some narrow domain, or a super-real-time use case where you need to build, say, a voice assistant, those are often served from the same infrastructure without customization. In terms of model capabilities, it also means that, yes, GPT-4 is great, Claude is great, they can handle a lot of things, but you are often paying a lot for additional capabilities that are not needed in your particular case. You don't really need a customer-support chatbot to know about 150 Pokémon or be able to write poetry; you want it to be really good in your particular narrow domain.
This kind of discrepancy for large models leads to several issues. One, as I mentioned, is high latency: using a big model means longer response times, which matters in particular for real-time use cases like voice assistants. It gets more and more important with agentic stuff, because for something like an agent-like application, and the next talk is going to be about exactly that, you need to do a lot of steps, do reasoning, and call the model many times, so latency is really important. And often you can pick smaller models like Llama or Gemma, which we just heard about, and achieve the same or better quality for a narrow domain while being up to ten times faster; for example, for some of the function-calling cases externally benchmarked by Berkeley, you get similar performance from fine-tuned Llama 3 at 10x the speed. Cost is also an issue if you're running a big model on a lot of traffic: say you have 5K-token prompts and 10,000 users, and each of them calls the LLM 20 times per day. On GPT-4, even on GPT-4o, that probably adds up to something like $10K per day, or several million per year, which is a sizable cost for a startup. You can easily cut that with much smaller models, and that is often what we see as the motivation for reaching out for smaller, more customizable models.
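(The back-of-the-envelope version of that cost math; the per-token prices below are illustrative mid-2024 list prices, not quotes.)

```python
users, calls_per_day, prompt_tokens = 10_000, 20, 5_000
tokens_per_day = users * calls_per_day * prompt_tokens   # 1.0e9 tokens/day

price_big   = 5.00 / 1e6   # assumed $/input token, large proprietary model
price_small = 0.20 / 1e6   # assumed $/input token, small open model

for name, price in [("large proprietary", price_big), ("small open", price_small)]:
    per_day = tokens_per_day * price
    print(f"{name}: ${per_day:,.0f}/day, ${per_day * 365:,.0f}/year")
# large proprietary: $5,000/day, $1,825,000/year
# small open:        $200/day,   $73,000/year
```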
But where open models really shine is domain adaptability, and that comes in two aspects. First, there are so many different fine-tunes and customizations; I think the previous speaker mentioned the Gemma adaptation for Indian languages, and there are models specialized for code, for medicine. If you go to Hugging Face, there are tens of thousands of different model variants. And second, because the weights are open, you can always customize for your particular use case and tune quality specifically for what you need.
So open-source models are great; what are the challenges? The challenges really come from three areas. First, what we usually see when people try to use an open model, something like Gemma or Llama, is that you run into complicated setup and maintenance: you need to go find GPUs somewhere, figure out which frameworks to run on them, download the models, maybe do some performance optimization and tuning, and you have to repeat this process end to end every time the model gets updated or a new version is released. Second, on optimization itself, especially for LLMs but for generative models in general, there are many attributes and settings that really depend on your use case and requirements: somebody needs low latency, somebody needs high throughput, prompts can be short, prompts can be long, and choosing the optimal settings across the stack is actually not trivial; as I'll show later, in many cases you can get multiple-x improvements from doing this efficiently. And finally, just getting production-ready is actually hard. As you go from experimentation to production, even just provisioning GPUs on public clouds is not easy, because GPUs are not always reliable, and getting to enterprise scale requires all the scalability technology, telemetry, observability, and so on. Those are the things we focus on solving at Fireworks.
So, starting with efficiency: we built our own custom serving stack, which we believe is one of the fastest, if not the fastest. We did it from the ground up, from writing our own CUDA kernels all the way to customizing how the stack gets deployed and orchestrated at the service level. That brings multiple optimizations, but most importantly, we really focus on customizing this serving stack to your needs: for your custom workload and your custom cost and latency requirements, we can tune it for those settings. What does customization mean in practice? For example, many use cases use RAG and very long prompts, and there are many settings you can tune at the runtime level and the deployment level to optimize for long prompts, which are often repeatable, so caching is useful, or simply tuning settings so throughput is higher while maintaining latency. This is independently benchmarkable: if you go to Artificial Analysis and select long prompts, Fireworks is actually the fastest, even faster than some of the other providers who are over there at the expo.
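(A toy illustration of why long, repeatable prompts benefit from caching: the prefill work for a shared prefix is paid once and reused. This is a conceptual sketch, not the actual serving stack.)

```python
import hashlib

kv_cache: dict[str, str] = {}

def prefill(text: str) -> str:
    # stand-in for the expensive attention prefill over `text`
    return f"kv-state[{len(text)} chars]"

def generate(system_prefix: str, user_suffix: str) -> str:
    key = hashlib.sha256(system_prefix.encode()).hexdigest()
    if key not in kv_cache:              # first request pays the full prefill
        kv_cache[key] = prefill(system_prefix)
    prefix_state = kv_cache[key]         # later requests reuse the cached state
    return f"decode({prefix_state} + {prefill(user_suffix)})"

# A RAG app reusing the same long instructions + retrieved context:
generate("LONG SYSTEM PROMPT ...", "question 1")
generate("LONG SYSTEM PROMPT ...", "question 2")  # cache hit on the prefix
```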
And we don't only focus on long prompts; we focus on many modalities. As an example, for image generation we are the fastest provider serving SDXL, and we're also the only provider serving SD3, Stability's new model, because their API actually routes to our servers. And finally, as I mentioned, for LLMs especially, customization matters a lot. One useful paradigm for thinking about LLM performance is minimizing cost under a particular latency constraint. We often have customers come to us saying: hey, I have this interactive application, and I need to generate this many tokens in under two seconds. That's really where cross-stack optimizations shine: by tuning for a particular latency cutoff and changing many settings, you can deliver multiple times higher throughput, and higher throughput basically means fewer GPUs and lower cost.
In terms of model support, we support the best-quality open-source models: we heard about Gemma, obviously the Llamas, some of the ASR and text-to-speech models, pretty much models from many providers. We also work with model developers; for example, Yi-Large in the US is also served on Fireworks, launched last week. And as platform capabilities, as I mentioned, we have a lot of open-source models to get you started, or ones where we do some of the fine-tuning in house: I'm going to talk a little bit about specialized function-calling models later on, and we do some of the vision-language model fusion ourselves, which we release as well. And of course the key to open-model development is that a model can be tuned for a particular use case, so we provide a platform for fine-tuning, whether you're bringing a dataset collected elsewhere or collecting it live, with feedback, on our platform.
Specifically on customization, one interesting feature, which a lot of people starting to experiment with models find interesting, is how to serve the resulting model efficiently once you fine-tune and deploy it. It turns out that if you do LoRA tuning, which a lot of folks do, you can use some smart tricks and deploy multiple LoRA models on the same GPU, actually thousands of them, which means we can still give you serverless inference with pay-per-token pricing even if you have thousands of model variants sitting deployed there, without you having to pay any fixed cost.
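From the caller's side this is invisible: you address your variant by model id and pay per token. Here is a minimal sketch, assuming the OpenAI-compatible endpoint mentioned at the end of the talk; the account and model path below are placeholders, not a real deployment.

```python
# Minimal sketch: calling a fine-tuned LoRA variant served serverlessly
# through Fireworks' OpenAI-compatible endpoint. The model id is a
# placeholder -- substitute your own account and model name.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

resp = client.chat.completions.create(
    model="accounts/your-account/models/your-lora-variant",  # hypothetical id
    messages=[{"role": "user", "content": "Summarize our Q2 launch notes."}],
)
print(resp.choices[0].message.content)
```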
Of course, a single model is all great, but what we see increasingly in applications is that the model is not the product by itself: you need a bigger system in order to solve the target application. The reason is that models by themselves tend to hallucinate, so you need some grounding, and that's where RAG, or access to external knowledge bases, comes in. Also, we don't yet have in industry a magical multimodal AI across all modalities, so you often have to chain multiple types of models. And of course there are all these external tools and external actions which end-to-end applications might want to take in agentic form. So I really like the term popularized by Databricks, "compound AI systems": we're increasingly seeing a transition from just the model being the product to a combination of maybe RAG, function calling, external tools, etc. built together as the product, and that's pretty much the direction we see this field moving over time.
it mean from uh from our perspective
what we what we do in this case uh so we
see kind of as a function calling like
agent as a at the core of this uh
emerging architecture which might be
connected to either domain specialized
models served on our platform directly
or maybe tuned for part different needs
and connected to external tools maybe
it's a qu interpreter or maybe it's like
external apis somewhere uh with really
like this kind of Central agentic View
uh Central Central model kind of
coordinating and trying to trash the uh
user user requirements if it's for
example a chatbot or something uh you
probably all uh heard about like
function calling you know popularized by
open Ed initially that's that's
basically the same idea uh so yeah def
qu is really like a how to how to
connect llm to external tools and exter
and external elements what does it mean
What does this mean in practice? We actually focus on fine-tuning models specifically for function calling, so we've released a series of such models; the latest one, FireFunction v2, was released two weeks ago. What you can do with it, if I manage to click on this button, is build applications which combine free-form general chat capabilities with function calling. In this case, FireFunction has some chat capabilities, so you can ask it what it can do, and it has some self-reflection to tell you. It's also connected in this demo app to a bunch of external tools, so it can query stock quotes and plot charts through external APIs, and it can also generate images. But what it really needs to figure out is how to translate a user query, via complex reasoning, into function calls. For example, if we ask it to generate a bar chart with the stocks of the top three cloud providers, it actually needs to do several steps: it needs to understand that "top three cloud providers" means AWS, GCP, and Azure (Azure being owned by Microsoft); it then needs to make function calls querying their stock prices; and finally it needs to combine that information and send it to the chart-plotting API, which is what just happened in the background.
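For a sense of the mechanics, here is a hedged sketch of a tool-calling request through the OpenAI-compatible chat completions API. The `get_stock_quote` tool and its schema are invented for this illustration; only the general `tools` shape follows the standard API.

```python
# Sketch of function calling via an OpenAI-compatible chat endpoint.
# get_stock_quote is a hypothetical tool our app would implement; the
# model decides when to call it and with what arguments.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key="YOUR_FIREWORKS_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_quote",  # our backend function, not the API's
        "description": "Get the latest stock price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # per the talk
    messages=[{"role": "user",
               "content": "Chart the stocks of the top three cloud providers."}],
    tools=tools,
)

# The model should answer with tool calls (e.g. MSFT, AMZN, GOOG) that the
# application executes before sending results back for the final answer.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```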
Another important aspect for efficient combined function-calling and chat capabilities is contextual awareness. If I ask it to add Oracle to this graph, it needs to understand what I'm referring to, keep the previous context, and regenerate the image. And finally, if I switch to a different topic, it needs to drop the previous context and understand that the historical context is now less important; I'm starting from scratch, so there's no Oracle in that cat photo or whatever. This particular demo is actually open source: you can go to our GitHub and try it out. It's built with FireFunction and a few other models, including SDXL, which run on our platform. The function-calling model itself is also open; it's on Hugging Face. You can of course call it on Fireworks for optimal speed, but you can also run it locally if you want. It uses a bunch of functionality on our platform, for example structured generation with JSON mode and grammar mode, which I think is similar to some of the previous talks from the Outlines folks who were speaking here yesterday.
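As a hedged example of the structured-generation feature just mentioned: OpenAI-style JSON mode constrains the output to valid JSON that an application can parse directly. (Grammar mode generalizes this to arbitrary grammars, but its request shape is platform-specific, so only JSON mode is sketched; the model id below is an example.)

```python
# Minimal JSON-mode sketch: force the model to emit valid JSON.
# The schema in the prompt is illustrative.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",
                api_key="YOUR_FIREWORKS_API_KEY")

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",  # example id
    messages=[{
        "role": "user",
        "content": 'Extract {"name": str, "ticker": str} for Microsoft.',
    }],
    response_format={"type": "json_object"},
)
print(json.loads(resp.choices[0].message.content))
```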
So finally, try it out. Generally, to get started with Fireworks: if you head out to Fireworks' model catalog, you'll find a lot of the open-weight models I mentioned, and they're available in the playground. In terms of product offering, we have a range that can take you from early prototyping all the way to large scale. You can start with serverless inference, which is no different from the OpenAI playground or something similar: you pay per token at a constant price, and you don't need to worry about hardware settings or anything. As I mentioned, you can still do fine-tuning: you can do LoRA fine-tuning on our platform, or bring your own LoRA adapter, and still serve it serverless. As you graduate, maybe as a startup, to more production scale, you might want to go to on-demand, which is dedicated hardware with more settings and modifications for your use case; you can bring your own custom model tuned from scratch or tune it on our platform. And finally, if you scale up to bigger volumes, you can go to the enterprise level, with discounted long-term contracts, and we'll also help you personalize some of that performance tuning I talked about earlier.
In terms of use cases, we're running production for many, many companies, ranging from small startups to big enterprises; last time I checked, we served almost 150 billion tokens per day. Companies like Quora build chatbots like Poe on us; Sourcegraph and Cursor, and I think Cursor had a talk here yesterday, use us for some of their code-assistant functionality, where latency is really important, as you can imagine; and folks like Upstage and Liner are building different assistants and agents on top. So we're definitely production-ready; go try it out. Finally, we care a lot about developers, you guys. These are external numbers from last year's LangChain State of AI report, where it turns out we are the most popular platform people pull models from after Hugging Face, which was very nice to hear.
Again, to get started, just head out to our website; you can go play in the playground right away, for example running Llama or Gemma or whatever at top speeds, and start building from there. I'm really excited to see what you can build with open models, with FireFunction, or with things you fine-tune on your own. And one last point: as I mentioned, we're OpenAI API compatible, so you can keep using your favorite tools and the same clients, or frameworks like LangChain, LlamaIndex, etc.
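Concretely, OpenAI compatibility means existing clients usually just need a different base URL and key. For instance, a minimal hedged sketch of pointing LangChain's standard OpenAI chat wrapper at Fireworks:

```python
# Using Fireworks through LangChain by pointing the stock OpenAI chat
# wrapper at the OpenAI-compatible endpoint; the model id is an example.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
    model="accounts/fireworks/models/llama-v3-8b-instruct",
)
print(llm.invoke("Write a haiku about GPUs.").content)
```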
So yeah, I'm really excited to be here and to tell you a little bit about open models and how we at Fireworks focus on productionizing them and scaling them up. Go try them out, and you can also find us at the booth at the Expo. Thank you.
[Applause]
That was a great talk. And now, who here has heard about Devin? Most of the room. Well, this talk is going to be really interesting for you all. Here we have Scott Wu.
[Applause]
Cool. Okay, I'm Scott from Cognition AI, and I'm going to tell you a little bit about the early makings of Devin (we're still super, super early on) and also a little bit about the space as a whole and what's coming next. I thought it'd be nice to start with a demo. It sounds like some of you have already seen some of the videos, but I brought a nice custom one here for the World's Fair today, so I'll show that quickly. Here I basically said: hey Devin (this was this morning, by the way, which is huge), I want you to build a mobile-friendly website to play the name game. I have a lot of trouble memorizing names and faces, I don't know about you guys. So I basically just said: here's a TSV file with a bunch of names and faces, these are all the speakers here at the World's Fair this week, and can you set up the game so that you show two random faces and the name of one of them, and have me guess which one is which? And I gave a few instructions on how the game should work.
Devin is a fully autonomous software engineer, and what that means is that Devin has access to all the same tools that a human software engineer would have when they're building something like this. So the first thing Devin does is make a plan, and you can see a basic plan coming out here. One of the interesting things is that the plan changes a lot over time: as you get new information or new feedback, you update your plan accordingly. After that, Devin is basically just working the same way a human would. If you take a look, Devin makes a new directory for the name-game website, starts a new React app, all the same primitives, works on building out the code, reads the TSV file to see what's going on, and generally works through it. It comes out and deploys a first version after some minutes.
I'll pull this up quickly; that's what it looks like. It's close, but not quite there: it's still showing both names, and maybe I didn't quite specify that exactly, but you can click a name and get it correct. So I just went ahead and gave it more feedback in plain English. I said: hey, can you hide the two names until I click on the answer, and can you also restyle the "play again" button, it's somehow a little off on this page. And I kept going, giving it more and more feedback over time. I also asked: hey, can you add a streak counter as well, can you keep track of how many I got correct and reset to zero, a few of these other things. The website it ultimately deployed is this one right here. This is Justine, for example. It keeps track of my streak, and you can see it ramping up; if I got one wrong on purpose, you'd see the streak reset to zero, and it goes on. I actually played this game and learned the names of everyone, which was super helpful, by the way, and you guys can play it too; it's right here if you want to try it out. It has all the speakers, something like 170 speakers here at the World's Fair this week. So this is kind of a cool example.
But I want to highlight how different the world is if software engineering is just this easy, if you can just explain exactly what you want in plain English and get it out. This is obviously kind of a toy use case, and it's perhaps useful, but we use Devin all the time ourselves when we're building Devin. And by the way, I obviously didn't make this website myself; I just said, hey, build me this website with the QR code and whatnot, and Devin built that too. Here's a quick example of Devin that we're using ourselves in production. If you take a quick look here, there's this whole search bar, with all the sessions, and you can search across sessions; Devin actually made that, in the Devin repository. You can see here that Bryce, who is on our team, was asking: hey Devin, can you go into the Devin sessions list and create a search bar component, here's what I need you to do.
There are a few features here that are obviously tuned for working in a production codebase. You can see that Devin started from a snapshot, so we have a machine instance loaded where the repo is cloned; it has a playbook, so it knows a lot of the details about our repositories; and it's able to work within our git environment generally, so you'll see it make a PR and interact with all those same tools. I'll go through this quickly. Devin says absolutely and makes the first pull request. Bryce continues, again just giving feedback in plain English: hey, this is a great start, now could you add a magnifying glass and make it idiomatic, use Phosphor or Lucide, it's up to you. Devin says: yeah, sure, I'll build that. Bryce says: oh, by the way, no need to test, I trust you. And Devin says: by the way, I'm dealing with a bit of an issue with the login process; it's just like you're working with another engineer. Bryce says: okay, bro. It builds it all, gets the PR up, and this PR was actually merged; this is the search bar.
Similarly, a lot of the API integrations that Devin has were built by Devin, and a lot of our own internal dashboards and metrics tracking within Devin were also built by Devin. It's been kind of a fun one to see Devin building the company with the company. So, cool. I want to talk a little bit about our journey so far and about what's happening in the space as well. We got started back in November, so it's been about seven months now. It's kind of funny: we started in a hacker house in Burlingame, and basically a lot of us had already lived together at that point. We'd all had our own journeys in AI, we knew we wanted to build something together, and we knew we wanted to do something in code and build a coding agent. After that hacker house in the Bay Area there was another hacker house in New York, then another in the Bay Area; we've been going back and forth between New York and the Bay for basically the last seven months. I think at this point we're going to settle in the Bay, but it's been back and forth, getting a slightly bigger Airbnb each time because the team also gets a little bigger.
But why Devin in particular? This is a question I'm really passionate about. Language models have been pretty big; I think that's fair to say. The first wave of generative AI is what I generally call text-completion products, and that makes a lot of natural sense if you think about it, because the interface of a language model is text completion: you give it a prefix and it completes the suffix from there. So if you think about ChatGPT, about a lot of these Q&A products, about writing marketing copy or answering customer support, or even GitHub Copilot and Cursor and products like that, a lot of these are really great products and very natural use cases where you have the prefix so far and you're asking the model to complete what's next, and it does that for you, and that's a useful tool. I think we're now entering a new wave where we're going beyond that and actually introducing some amount of autonomous decision-making, which is typically referred to in our space as agents. There are all sorts of new things you unlock with that; there's a much higher bar of consistency required, but there are new things you unlock. And it's been an interesting one, because it's both a very deep core-capabilities question of getting Devin to solve these tasks, and also a pretty interesting product-design problem, because I think the UX of agents is something extremely new.
Then why code in particular? A few different things. Obviously we're all coding nerds; we're all engineers, and the idea of teaching AI to code is one of the coolest things we could think of. But beyond that, I think there are a few particular reasons code with agents works especially well. One is that there's so much more to being a software engineer than typing the code: a lot of the work is looking into a bug, looking at a different file of the codebase, running this or that command, pulling up documentation, running the front end yourself to reproduce the bug, looking at the thing, making an edit, trying again. All of that work is what software engineering really is, more so than just typing code into a file, which leads very naturally to an agentic workflow.
ability to iterate with code feedback um
and so what I mean by that is you know
if you were given an entire production
code base and you were told hey this has
this one bug I need you to fix it here's
the bug um you know and and it's let's
say it's like thousands of files and you
know hundreds of thousands of lines of
code I mean it'd be pretty tough
honestly for most humans it's also going
to be quite tough for AIS as well and
obviously the way that we do this in
practice is you know you you you go and
add print statements you pull up the
logs you check the monitoring you know
you you jump back and forth between
different files you try and diagnose it
right each of these things that you're
doing you know you're you're making a
decision and then you're running actual
code to find out what happened and from
that you're you're able to iterate and
it just gives you a much cleaner path to
solve the problem in front of you um and
similarly you know that that kind of
lends very well to agents and the last
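To make that loop concrete, here is a deliberately generic sketch of the decide, run, observe cycle described above. This is not Cognition's implementation; `propose_next_command` is a stand-in for the model call that would drive a real agent, scripted here so the sketch runs standalone.

```python
# Generic agentic debugging loop: decide -> run real code -> observe -> repeat.
import subprocess

def propose_next_command(history: list[str]) -> str | None:
    # Stand-in for an LLM policy; a real agent would condition on `history`.
    scripted = ['python -c "print(1 + 1)"', None]  # toy two-step "plan"
    return scripted[len(history)] if len(history) < len(scripted) else None

def agent_loop(max_steps: int = 5) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        cmd = propose_next_command(history)
        if cmd is None:  # the agent decides it is done
            break
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Real command output becomes the feedback for the next decision.
        history.append(f"$ {cmd}\n{out.stdout}{out.stderr}")
    print("\n".join(history))

agent_loop()
```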
The last thing I just want to mention is how fast model agentic capabilities are improving. Two years ago, even something as simple as this name-game demo would have been almost unthinkable, and if you think about where things are going to be two years from now, the data, the right training, and so on are all really rapidly improving in the space.
And then again, beyond the capabilities problem, there's actually a really deep UX problem as well. At a high level, when we're building agents, and I think all of us in the space are quite new to agents, the immediate first things to map to are how we use software today and how we talk with other humans. Even a lot of the features in Devin are essentially looking over your own intern's shoulder: you can see their computer, you can see what commands they're running, things like that. The thing is, I think an agent is actually pretty different from both: there are a lot of nuances and details around parallel work, information gathering, how it manages context, etc. that are super different, and it's actually a quite deep problem from a product perspective as well.
Just to give you guys a bit of a sense of that, here's a short list of some of the features we've built into the product. Obviously there's Devin being able to use the shell, edit code, and browse the web, but there are all these other things: being able to fork and roll back sessions, handling integrations with Slack and GitHub, handling playbooks, storing machine snapshots, keeping track of secrets, being able to work with the right tools for verification. All of this is part of the actual product iteration, which on its own is already an incredibly dense problem, and honestly I think we're going to see a lot more iteration on it over time. I also wanted to show a new feature we just recently shipped, which is the ability to use Devin's machine. Again, it's the kind of thing that doesn't necessarily have a very close parallel in the software we have today: the ability to just have a VS Code Live Share into Devin's machine, so if you want to collaborate with Devin and say, hey, for these couple of lines you should make this edit, I went ahead and did that edit for you, you can just talk with Devin and do that. So there's a lot more room to go and a lot to iterate on in this space.
One of the other things I wanted to mention is just how much it's changed our own workflow. You saw a simple example of Devin building the search bar, but we actually handle tasks in a much more async way now. One of the cool features of Devin, I'd say, is that if, as an engineer, you're working on, say, four different tasks today, you give one to Devin number one, the second to Devin number two, the third to Devin number three, and you have four Devins all running in parallel. It's kind of turning every engineer into an engineering manager, is almost how I describe it. The Devins are very enthusiastic interns, is what I'd say: they try very hard, they obviously don't know everything, they get little things wrong, they ask a lot of questions, but you're working with each of them and having them iterate. Here's a fun example, literally from earlier today: we were talking about some particular feature and what we wanted to build, in this case a pretty simple thing of changing a color, and it's as simple as saying in Slack, in the conversation, hey @Devin, can you just change this thing, and then Devin goes and makes a PR and you hit merge. We've had a lot of occasions where we're in the gym or in the car or something, and now you can actually write code, because you can tell Devin exactly what you want it to do. You just don't have your whole computer with you and can't type everything, but being able to describe what you want to Devin and then review the code afterward actually works really well.
So what's next? I think this is a really important question, and obviously the technology is extremely early today, but where do these things go in a few years, and what happens with software engineering? There's been a lot of uncertainty about that question, and as we're using Devin more and more, one of the big things we see, and this is perhaps obvious, is that Devin is not the one that decides what to do or what to build. There's this core part of software engineering; the way I describe it is that software engineers everywhere are really doing two jobs at once. The first job is problem-solving with code: you're given a problem and you're breaking down exactly what solution you're going to build, what architecture you're going to use, what all the flows and details and edge cases are that might come up, and architecting your exact solution. The second part is, once you have that, dealing with debugging, implementing different functions, writing unit tests, and all the other things that go into the implementation of the thing you want to build. Right now, I think the average software engineer is probably spending 10 or 20% of their time on that first, thinking part and 80 or 90% on the implementation part.
What we really see is that Devin frees you up to do more of the first part. The future of Devin, again, is very, very early, but I think as Devin gets better we're going to see more of that, where Devin just takes the implementation off your plate: you don't have to go figure out how to set up Kubernetes, you don't have to debug all these broken APIs, you don't have to deal with version changes or migrations or all these other things that take up a lot of time in software engineering. Instead, you spend your time figuring out how to solve the problems in front of you; it's a bit more like a mix between a technical architect and a product manager. So I think the job we call software engineering is going to change, but practically I think there are going to be way more software engineers than ever, and there's a lot of precedent for that. Programming used to mean punch cards, then it meant assembly, then it meant C, and as these things have gone on, most people aren't using punch cards anymore, but there are actually way more programmers than before. One of the things that's easy to underestimate is just how much more code there is to write. It's funny to think about, because obviously we all love software here in this room. I'd say software has been the number-one driver of progress in the world over the last 40 or 50 years, and yet despite that, I think our demand for software to be built is probably a lot more than 10x what we're currently getting. So I think what happens is we open up the power of software engineering to a lot more people, every single software engineer gets to be five or 10x more effective, and we actually do a lot more software engineering overall.
Cool, yeah, that's all I had, but we'd love to open the floor if there are any questions. Yeah, right here in the front. Great question. So we've been ramping up access: every week we've been letting in more and more people, and we've also been scaling up with our enterprise customers. We have a long waitlist to get through, so we're doing it as fast as we can, but we'd love to get you guys access as soon as possible. Yeah, all the way in the back over there, yeah, in the red.
Yeah, exactly. So in our codebase, for example, Devin has all the setup it needs: it has a machine that's basically instantiated where it can run the dev environment, the server, and the front end. So if you're asking it, hey, I need you to debug this particular thing, it'll just pull it up itself, reproduce it, debug it, and try again. Yeah, exactly. Any other questions? Right here.
Yeah, oh, of course, sorry. So someone asked: with all of these simpler tasks getting solved, what happens to all the junior engineers or the interns who obviously need to learn how to code? I think what happens, honestly, is that demand is going to just keep rising with supply, and I think the training process is going to change a little bit, but a lot of the core fundamentals stay. When you say someone is a really great engineer, you typically don't mean that they type really fast, although maybe they do that too; you typically mean that they have a really great understanding of problems, they know all the different architectures, they never miss an edge case, stuff like that. Those are the fundamentals that I think are always going to matter, and I think interns and junior engineers are going to get exposed to using those fundamentals earlier and earlier.
Yeah. Okay, so someone asked what the biggest challenges are to realizing that vision of the future. There's a lot; it's basically everything, as you can imagine: there's speed, there's consistency, there's access, there's integrations, there's the right product UX. One of the cool things, I think, is how much of a rising tide there is everywhere. Obviously we're going to do our best work on it, but every new hardware release is amazing for it, every new foundation model that comes out is amazing, every new piece of agentic research. I think this is the kind of thing where there will be a lot of different optimizations coming in different parts of the stack that make this agentic flow better and better and better. It won't just be one small thing, but I think it'll be pretty fast.
That's all the time we had, so thank you so much. Thank you guys so much.
[Applause]