AI Engineer World’s Fair 2024 — GPUs & Inference Track
Channel: aiDotEngineer
Published at: 2024-06-27
YouTube video id: JVSKlEmUr0k
Source: https://www.youtube.com/watch?v=JVSKlEmUr0k
Hello everyone, we'll begin in a minute. Okay, we are going to get started. Hello everyone, my name is Nyla Worker and I am here in the GPU and inference track. I hope you are all excited, because I think this is top of mind for all of us: either you have a lot of GPUs, you are GPU-rich, and you want to optimize them; or you are GPU-poor and need to squeeze every second out of your GPUs; or maybe it is not even a GPU and you want to learn about different hardware. So let's get it started and enjoy this track. Our first speaker of the day is Santosh, and he is going to be speaking about Agnostiq's AI platform; in particular, the talk is about how to accelerate training and fine-tuning, for Llama specifically. With that, I let you go.

Thank you. All right, so the talk is actually going to be about how you run things extremely easily, directly from Python. I only have five minutes, but I am going to try my best to show how you can fine-tune, say, twenty models (twenty is an arbitrary number; it could be hundreds) right from Python, without needing anything like Kubernetes or Docker on your side. You can find the talk and the actual code behind this QR code, and you will find a lot more interesting examples there to try out as well.

So what do we do? Covalent is an open-source, open-core product, and what we do is help people write Python locally and ship the code to whatever compute backend they need. What that means is: you have a Python function that you want to run on a GPU. On your laptop, open up a notebook, add a single decorator on top that says "I want to run this on an H100 with 36 GB of memory and a maximum time limit of two days," press Shift+Enter in your Jupyter notebook, and that's it: the code gets shipped to a backend with a GPU and the result comes back to you. In the open-source case it goes to your own compute; you can attach your own cluster and it runs there. In the cloud case it runs in our GPU cluster and you just pay for the GPU time it uses: if it runs for five minutes you pay for five minutes of H100, if it runs for ten seconds you pay for ten seconds of H100s. You can also bring your own compute, attach it to us, and we will help you orchestrate all of it, whether that is your own cloud, on-prem systems, or whatever you have.
Covalent has a set of primitives. The first is jobs, which are single functions: as I said, you add a single decorator on top, say what compute it should be shipped to, it goes there, it runs, and you get the Python object back; you pay only for the function you ran. We also let you run inference services, again completely Pythonically. You don't dockerize, you don't run a Kubernetes cluster, you don't do anything; you just say, "I have an initializer function and I need an endpoint called /generate," you define your Python functions, you call a single cc.deploy command in your Jupyter notebook, the entire service gets shipped to us, and we scale it. You get back an API endpoint that scales to zero or scales up as new requests come in. You can define your own autoscaling mechanism: "autoscale to 10 GPUs at 9:00 every day," or "autoscale whenever GPU utilization hits 80%," or "autoscale whenever I receive a thousand requests." You can define authentication and everything else, and it all happens in the background for you; you never touch a line of Kubernetes or Docker.

This talk is a very tiny example of what we do, but if you follow the link there is a whole host of examples you can run, from real-time time-series analysis, to inverted Transformers (a state-of-the-art time-series Transformer), to serving large language models, to building an entire AI model foundry out of pure Pythonic code.

Without further ado, I'll quickly run through the code example of how you fine-tune a whole set of models directly from Python, and I'll also show what it looks like from the front end. It's rather simple. I have written a bunch of normal Pythonic training functions in a local package, fine_tune and evaluate. First I define a Python task that calls my fine_tune function, which accepts a model and data and returns a fine-tuned model. This is a plain Python function, and I say I want to run it on a 24-core CPU machine with one GPU of type H100 and 48 GB of memory, with a maximum limit of 18 hours. Then, once a model is done, I take it and evaluate its accuracy. Next I sort all the models and pick the best one, and I want that to run on a CPU-based machine; I don't want to waste a GPU on sorting. Finally I deploy the best-performing model, again with a simple decorator: this is my initialization service, I create an endpoint called /generate, generate the text, and return the prediction. And this is where the magic happens: to tie it all together, I create a workflow that simply loops over a bunch of models, calls the fine-tune task, calls the evaluate task to get the accuracy, collects the models and accuracies into lists, sorts them, and deploys the best model. It is completely Pythonic.
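The slides with the actual code are not reproduced in the transcript, so here is a minimal sketch of the pattern being described, using the open-source Covalent decorator names (`ct.electron`, `ct.lattice`). The `fine_tune`/`evaluate` helpers, the model list, and the cloud executor arguments are illustrative assumptions modeled on what the speaker says on stage, not his actual code or the exact SDK signature.

```python
import covalent as ct
import covalent_cloud as cc  # Covalent Cloud SDK; executor arguments below are assumptions

# Hypothetical helpers living in a local package, as described in the talk.
from finetune import fine_tune, evaluate

# One executor per resource profile; argument names/values are illustrative.
gpu_ex = cc.CloudExecutor(num_cpus=24, num_gpus=1, gpu_type="h100",
                          memory="48GB", time_limit="18 hours")
cpu_ex = cc.CloudExecutor(num_cpus=4, memory="8GB", time_limit="1 hour")

@ct.electron(executor=gpu_ex)
def fine_tune_task(model_name: str, data):
    return fine_tune(model_name, data)       # returns a fine-tuned model

@ct.electron(executor=gpu_ex)
def evaluate_task(model):
    return evaluate(model)                    # returns an accuracy score

@ct.electron(executor=cpu_ex)
def pick_best(models, accuracies):
    # Runs on a cheap CPU machine; no GPU is needed just to sort results.
    best = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return models[best]

@ct.lattice
def workflow(model_names, data):
    models, accs = [], []
    for name in model_names:                  # fan out one fine-tune per model
        m = fine_tune_task(name, data)
        models.append(m)
        accs.append(evaluate_task(m))
    return pick_best(models, accs)

# Dispatching is the "single line" mentioned in the talk; the job then shows up
# in the UI with per-task machines, runtimes, and costs.
dispatch_id = cc.dispatch(workflow)(["llama-3-8b", "mistral-7b"], data=None)
```

A served endpoint follows the same pattern: per the talk, you define an initializer plus a `/generate` handler and call a single `cc.deploy(...)`, with autoscaling rules attached.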
Once you dispatch this to our server, which is just a single line of code, what you see is a new job created in our application, and all the functions you called run on the respective devices you defined. For instance, one of the evaluation steps ran on its own machine for six minutes and cost just 87 cents; another model ran on a V100 for six minutes and cost 11 cents. In total, you have fine-tuned, evaluated, and deployed models completely in Python without needing Docker or Kubernetes at all. We have a booth over there; do visit us and we can chat more. Thank you.

Awesome, that was great. For our next speaker we have Dylan, who is going to be speaking about the work he has been doing at SemiAnalysis. If anyone has read his analysis of the supply chain for GPUs, TPUs, and related hardware, this is really exciting work. So, Dylan.

Hello. I'm going to talk about a couple of different things here, mostly running models as well as frontier models. People have been talking about stagnation of models, and a lot of that just has to do with the fact that we haven't seen a big capabilities leap recently. But that comes from the fact that the models we are using today are largely the same as the models trained in 2022. GPT-4, GPT-4 Turbo, and GPT-4o are just smaller models trained for longer, so the quality is similar. Claude 3.5 Sonnet came out recently, and again it is actually smaller than Opus but somehow better, because they trained it for longer. We haven't seen an extremely large model come out yet, but we will soon. One interesting thing: GPT-4 is around 1.8 trillion parameters, which is crazy expensive to run; roughly 200 billion parameters are active, and each token requires almost 600 gigaflops. And a year from now, that is almost going to be considered a last-generation model.

There are a couple of things I wanted to cover on that, mostly on the inference side, because I don't think anyone here is going to try to train that kind of next-generation model, but we definitely need to be able to run it. So let's break down inference in detail. There are two parts: prefill and decode. Prefill is prompt processing. The interesting thing is that if you have a 2K-context-length prompt, 2,000 tokens that you input into GPT, that is about a petaflop of compute by itself; and if you enter a 32,000-token prompt, it is roughly 20 petaflops. So an incredible amount of compute is required just to process the prompt. While prefill is very compute-intensive, decode is the opposite.
Decode generates each token iteratively: you process the prompt, generate a token, feed it back in, and keep going. Decode is extremely memory-bandwidth-intensive: you have to load all of the model's weights onto the chip, or chips, for every decode step. The big challenge is that with 1.8 trillion parameters at a reasonable batch size, you are activating all the experts, so you need to load all 1.8 trillion parameters for every single token generation, even if you are serving multiple users at once. That means moving on the order of 1.8 terabytes of weights per step. Thirty tokens per second is, I think, the minimum bar for most people; a lot of people want hundreds of tokens per second. But even at 30 tokens per second per user, serving 64 users, you need something like 60 terabytes per second of memory bandwidth, and an H100 has about three. So this is an extremely challenging systems problem. While decode is very bandwidth-intensive, it is actually quite cheap on compute, which is why, if you look at OpenAI or Claude pricing, you see a three- or four-to-one ratio between prefill and decode pricing: input tokens cost a third or a quarter of output tokens. Today the best models, GPT-4o and Claude 3.5 Sonnet, are, I want to say, $5 per million input tokens and $15 per million output tokens: five for prefill, fifteen for decode.

Soon, in the open source, what everyone here can touch is Llama 3 405B, and that is going to be a real capability unlock for the open-source market and for the builders here. There are a couple of things people really need to implement. You can't just run llama.cpp on Llama 405B; it's just not going to work. There is a bunch of stuff people have to work on, whether that is closed-source libraries like TensorRT-LLM, which only works on NVIDIA, or vLLM, an open-source library that works on AMD and Intel and soon other vendors' chips as well.

One of those things is continuous batching. Running inference at batch size one is horrendously expensive. It is fine on your own personal device, but if you are running in the cloud, renting GPUs and running at batch size one, you will cost yourself 10x more, and 10x is a low bar; it could be 10x to 100x more than running at a high batch size. So you have to figure out how to run at high batch sizes, and batch size is essentially how many concurrent users you are serving. One thing that makes this difficult is that user requests arrive at different times: one person sends a request now, another sends one five seconds later, but the first request isn't finished yet. So you need continuous batching, i.e., the ability to step through the model iteratively and bring new users into the batch on every iteration.
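As an aside, the prefill and decode arithmetic above is easy to make concrete. The sketch below uses the speaker's rough figures (about 600 GFLOPs per prompt token, about 1.8 TB of weights streamed per decode step, roughly 3.35 TB/s of HBM bandwidth per H100); these are approximations quoted on stage, not measured values.

```python
# Back-of-the-envelope inference arithmetic using the rough figures from the talk.

GFLOP_PER_TOKEN = 600        # approx. forward-pass compute per prompt token
WEIGHT_BYTES = 1.8e12        # ~1.8 TB of weights streamed per decode step
H100_BW = 3.35e12            # ~3.35 TB/s HBM bandwidth per H100 (approximate)

def prefill_flops(prompt_tokens: int) -> float:
    """Total compute to process the prompt, in FLOPs."""
    return prompt_tokens * GFLOP_PER_TOKEN * 1e9

def decode_bandwidth(tokens_per_second: float) -> float:
    """Aggregate memory bandwidth needed to stream the weights each decode step.

    With batching, one weight read per step serves every user in the batch, so
    the requirement scales with steps per second rather than with user count."""
    return tokens_per_second * WEIGHT_BYTES

print(f"2K prompt:  {prefill_flops(2_000) / 1e15:.1f} PFLOPs")   # ~1.2 PFLOPs
print(f"32K prompt: {prefill_flops(32_000) / 1e15:.1f} PFLOPs")  # ~19 PFLOPs
bw = decode_bandwidth(30)                                        # 30 tokens/s
print(f"30 tok/s decode: {bw / 1e12:.0f} TB/s "
      f"(~{bw / H100_BW:.0f} H100s of bandwidth at a minimum)")  # ~54 TB/s, ~16 GPUs
```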
A lot of software today, like llama.cpp, doesn't support continuous batching, so you either need to build it yourself or contribute to an open-source project that does, to enable low-cost inference for models like Llama 405B.

Another technique is disaggregated prefill, or disaggregated batching, depending on what you call it. Going back to what I said earlier: prefill is very compute-intensive and decode is very bandwidth-intensive. These are two different workloads, but when you serve a user, whether in your own app or through an API, the user doesn't care that it is two workloads. To them it is one: I submit something and I get tokens back. But anyone running the infrastructure themselves needs to be keenly aware that these are two different workloads. So one thing a lot of people have started to do (Google has said publicly that they do it, I believe OpenAI and Anthropic do as well, and firms like Together and Fireworks have hinted at it) is disaggregated prefill. Once your inference volumes are high enough, you don't just replicate the model across however many chips you have. Say it takes four chips to serve Llama 405B: with enough users, you don't simply go from four to eight to sixteen and replicate that across the world. Instead you have one set of accelerators do the prefill, which is compute-intensive, and then hand off to another set of accelerators to do decode. Today everyone uses the same accelerator type for both, H100 or A100, maybe L40, but mostly H100, yet there is a reason to split them, and a big one is noisy neighbors.

If you have ever worked with CPUs or anything in cloud computing, noisy neighbors are a huge issue. It is actually fairly trivial to dramatically slow down most inference providers' services if you just send queries in a certain, somewhat malicious way. That impacts users' time to first token, which is a huge issue: if time to first token is too long, people will simply quit using your service. And if tokens per second varies a lot (for a moment you get 100 tokens per second, then it drops to 30, then back up to 100), that is really annoying to the user. So there is a lot around SLAs and reliability that you have to guarantee, and disaggregated prefill is one of the techniques for that. You don't want someone to say, "I have a database, I want to run an LLM query across every single row, and I am going to submit it all to you, my service provider, because you have this cool model fine-tuned on some dataset." If they submit 10,000 rows at once, that kills everyone else's performance.
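Because the split between a prefill pool and a decode pool is easier to see in code, here is a toy, runnable sketch of the idea. The `Request` and `ToyModel` classes and the in-process queue are stand-ins invented for illustration; a real disaggregated system transfers actual KV-cache tensors between separate accelerator pools rather than Python objects.

```python
import queue
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int = 4
    out: list = field(default_factory=list)

class ToyModel:
    def prefill(self, prompt):          # compute-bound pass over the whole prompt
        return {"ctx": prompt}          # pretend KV cache
    def decode_step(self, kv_caches):   # bandwidth-bound: one token per request
        return [f"tok{len(kv['ctx']) % 7}" for kv in kv_caches]

def serve(requests, model, max_batch=64):
    decode_q = queue.Queue()
    # Prefill stage: in a disaggregated setup this runs on its own accelerators,
    # then hands the finished KV cache to the decode pool.
    for req in requests:
        decode_q.put((req, model.prefill(req.prompt)))
    # Decode stage with continuous batching: admit new work every iteration and
    # retire requests as they finish, so late arrivals never wait for a full batch.
    batch = []
    while batch or not decode_q.empty():
        while len(batch) < max_batch and not decode_q.empty():
            batch.append(decode_q.get())
        for (req, kv), tok in zip(batch, model.decode_step([kv for _, kv in batch])):
            req.out.append(tok)
        batch = [(r, kv) for r, kv in batch if len(r.out) < r.max_new_tokens]
    return requests

for r in serve([Request("hello"), Request("a much longer prompt")], ToyModel()):
    print(r.prompt, "->", r.out)
```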
Disaggregated prefill is one of the techniques people use to make sure that the customer you definitely want to serve doesn't wreck everyone else's usage, because once you open your service to the real world you can't control who submits what, and rate limits are the most annoying thing ever, so that is not the right way to go about it.

Another thing is context caching. Google launched this recently; they are the only ones offering it today, but I think it is a really big deal. When people talk about fine-tuning models, that's great, but in reality the best models are really expensive or impossible to fine-tune. I can't go fine-tune Claude 3.5 Sonnet, and fine-tuning Llama 405B is going to take dozens and dozens of GPUs. So instead, in closed-source models generally (Google mostly keeps the big ones closed), they brought out context caching with Gemini 1.5 Pro. Rather than fine-tuning your model, why not fill out the context length, which they offer at up to two million tokens today, with your data? There are a couple of advantages. One is that you can use the best models: with fine-tuned models you are really limited to things like Llama 7B, Mixtral, or Llama 70B, which are much lower quality than what is available in the closed-source world; and in the open-source world we will have super-long-context models soon enough.

Economically: we talked about roughly $15 per million output tokens and a few dollars per million input tokens on the best closed-source models. If you submit a prompt of a million tokens, and most of the time you are looking at a document and getting a short answer back, so your output is tiny, then almost all the cost is just sending them that document. That really hurts. For people targeting, say, legal AI or contract-review AI and a lot of these enterprise use cases, prefill is going to dominate your cost if you are using APIs. So Google offers context caching, and open source will have it too, for models you run yourself and that others will deploy over time. The basic idea is that you don't recompute the KV cache for the context every single time; instead you cache it. The catch is that saving it takes an enormous amount of memory, so you don't keep it in GPU memory; you keep it in CPU memory or storage. vLLM, the open-source inference library, is building this right now, so if you are interested in contributing, check that out, or if you are interested in using it, just start a project. While most of the models you can run yourself today only have 4K, 8K, or 32K context lengths, longer ones are coming, and being able to dramatically reduce your cost by caching the context is going to matter a lot.
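As a toy illustration of the economics being described (pay the prefill for a long document once, then reuse the cached KV state for every later query against it), here is a minimal sketch. The `expensive_prefill` function and the dict-based store are made-up stand-ins; in practice this is what features like Gemini's context caching, or prefix caching in an engine such as vLLM, do with real KV tensors spilled to CPU memory or storage.

```python
import hashlib

PREFILL_CALLS = 0  # count how often the expensive step actually runs

def expensive_prefill(document: str):
    """Stand-in for computing the KV cache over a very long document."""
    global PREFILL_CALLS
    PREFILL_CALLS += 1
    return {"kv": f"kv-cache-for-{len(document)}-chars"}  # pretend KV tensors

kv_store = {}  # stand-in for CPU memory / storage holding cached KV blocks

def answer(document: str, question: str) -> str:
    key = hashlib.sha256(document.encode()).hexdigest()
    if key not in kv_store:            # pay the prefill cost only once per document
        kv_store[key] = expensive_prefill(document)
    kv = kv_store[key]
    return f"answer to {question!r} using {kv['kv']}"  # decode is the cheap part

doc = "a very long contract " * 10_000
for q in ["What is the term?", "Who are the parties?", "Any penalties?"]:
    answer(doc, q)
print("prefill computed", PREFILL_CALLS, "time(s) for 3 queries")  # -> 1
```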
Now I'm going to talk about head-in-the-clouds stuff instead of really usable things: what is coming down the pipeline. GPT-4 was something like 20,000 chips running for 90 to 100 days and used around 38 gigawatt-hours; very, very expensive. But what are they building now? OpenAI, xAI, Anthropic, and many others are building 100,000-chip clusters, which could train GPT-4 in about three days, so that scale of model becomes kind of irrelevant. I'll skip over part of this. What is a modern system capable of? The H100 is pretty fast relative to the A100, and the new NVIDIA chips are coming down the pipeline. But what comes out of these 100,000-GPU clusters is not going to be a 1.8-trillion-parameter model; it could be in the tens of trillions of parameters. On training compute, GPT-4 was roughly 2e25 FLOPs; with a 100,000-GPU cluster you can do on the order of 1e26 to 1e27 FLOPs, and running the resulting model is going to require hundreds of terabytes per second of memory bandwidth.

What does that look like? On the top right of the slide is an image of Microsoft's data centers in Arizona, where GPT-5 is being built. There are about 100,000 GPUs there, drawing around 150 megawatts; that is tens if not hundreds of thousands of homes' worth of power consumption. Elon has talked about his next-generation cluster: he is building a 100,000-GPU cluster today, but the next one is 300,000 GPUs, which is kind of insane; the power cost for that alone would be something like $500 million a year. People are kind of insane, but it is pretty cool.

The interesting thing on training is that today people talk about fully connected clusters: every GPU connected to every other GPU at some speed, and you run all your operations across that. That is not really possible when you get to these super-large clusters. The 100,000-GPU clusters are being built this year, and next year they are already planning clusters of multiple hundreds of thousands of GPUs; you can see that they span multiple buildings, so there is a lot of complicated networking going on to connect these data centers together.

One other thing I find interesting, again head-in-the-clouds, is that when you connect these chips there is a lot of optics: you convert from electrical to optical and go over fiber between chips, through transceivers and so on. These are extremely unreliable; they tend to have a mean time to failure of around five years. What is interesting is that in a 100,000-GPU cluster, or a 500,000-GPU cluster, something will fail roughly every five minutes. How do you even deal with something in your cluster failing every five minutes while you are training a model? This is, again, more of a hardware-oriented point.
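The "failure every few minutes" figure follows from simple arithmetic; here is a quick worked version. The per-GPU transceiver count used below is an illustrative guess, not a number from the talk, and the per-part lifetime is the roughly five-year figure the speaker quotes.

```python
# Rough MTBF arithmetic behind "something fails every few minutes" at cluster scale.
HOURS_PER_YEAR = 8766

def cluster_mtbf_minutes(num_gpus: int, parts_per_gpu: int, part_mtbf_years: float) -> float:
    """Mean time between failures for the whole cluster, assuming independent parts."""
    total_parts = num_gpus * parts_per_gpu
    return part_mtbf_years * HOURS_PER_YEAR * 60 / total_parts

# Assume ~5 optical transceivers' worth of failure-prone parts per GPU (illustrative)
# and a ~5-year mean time to failure per part.
for gpus in (100_000, 500_000):
    print(f"{gpus:>7} GPUs: one failure every "
          f"{cluster_mtbf_minutes(gpus, 5, 5.0):.1f} minutes on average")
# 100k GPUs -> roughly every 5 minutes; 500k GPUs -> roughly every minute
```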
The other thing that is interesting is that when you get chips, they are not all the same speed: an H100 is not an H100. There are stragglers. Buy a large number of chips and you get a distribution; in industry this is called the silicon lottery. You can buy a gaming GPU, compare it to other people's gaming GPUs on the forums, and see percentage-level differences in performance. In a massive training cluster this matters because training is a synchronous workload: you run through a batch of data, pass the gradients around, update the weights, then do it again. Because it is synchronous, if one chip is 10% slower, everything is 10% slower. ByteDance had a cool paper where they saw a 25% decrease in speed just because of one random GPU they received; it technically worked, and according to NVIDIA it was within spec, but it was about 25% slower than expected, and this was on a 20,000-GPU cluster. These are the problems people run into at scale. They pulled that GPU out, and you can see their performance dramatically uplifted during training. Again, that was ByteDance on a 20,000-GPU cluster, so it is a big issue.

Some of the other material in this presentation is not really relevant here, but I think "what do these next-generation systems look like?" is a very important question to ask yourself, along with "what do I do when I have to deal with that?" A lot of the scaffolding people are building for LLMs today deals with hallucinations and things like that, and the hope that a lot of the AGI folks have is that when you 100x the compute, when you build a cluster that costs $500 million a year in electricity and over $10 billion for the cluster itself and train a model with it, it will get rid of a lot of the hallucinations and let us do a lot of interesting things. So that is basically the talk. I wanted to cover the reasonable thing, which is how you run Llama 405B and some strategies people need to implement that are not necessarily in the open source yet but are implemented at the labs, and then also what the labs are doing, because they are not worried about Llama-405B-class models anymore. [Applause]

Awesome, that was a great talk. Now we have Sunny Madra from Groq.

Hey everyone; great talk, Dylan. Thanks, everyone. This is a brand-new talk, so I apologize if it is a bit wonky as we go through it, but I think it should be fun. What we really wanted to pay homage to today is the fact that just 25 years ago we crossed the one-gigahertz speed barrier in microprocessors.
What is really crazy is that when we started thinking about this talk, I assumed it had happened well before 1999; I was just remembering my own arc of getting involved with computers. But it really was 1999, and I had to double- and triple-check it. This is the exact press release from when Intel broke the one-gigahertz speed barrier. That was interesting from a couple of perspectives: one, it was this really big number and a big moment; but two, it was really after this that Intel started to change how it thought about how processors would be used and went toward multicore and things like that. And it is really something we need to think about in terms of what is going to happen with LLMs. If you go back to the rate of increase, it only took about two decades to get three orders of magnitude of speed improvement in microprocessors. Now step back and look at where we are with LLMs and the speed of innovation; in fact, what we hear a lot of people say, including Jensen, is that we are beyond the curve of Moore's Law, so we are actually innovating even faster than that in LLMs today. Just to look at what we have been able to do at Groq in a short amount of time: between April and June of this year we increased the speed of Llama 3 8B by over 50%. The improvements happening in this area are really quick and super exciting, and we are keen to dive into what could happen here.

So let's think about the state of the art. There are models today that we, and others, can run that process huge inputs at the equivalent of 10,000 input tokens per second, which gets you down to, say, a third of a second to process all of them. When you do that, you end up with capabilities that, from a speed perspective, far exceed human capabilities for both ingesting and analyzing information, and it is happening really fast. The example I like to talk about, and I highly recommend it if you haven't used it, is a really cool service called globe.engineer.
What it does is take a task you give it, say the example I use here, "help me plan a trip to New York to try the best pizza," and figure out all the different elements that have to happen. It does this live, connected to the internet: everything from the flights to the taxi options to the hotel options to the food options, and then the itinerary and how to do it, all in maybe less than five seconds. Think about what is really happening there. When I plan trips myself, I end up opening tens, sometimes hundreds, of tabs, and each of those tabs is its own research stream. Now all of that is solved in a simple interface, really enabled by these LLMs being able to, first, process input tokens faster and, ultimately, output tokens faster. It gives us a huge leg up in how we operate as humans.

Where does this all go? If we start thinking about human superintelligence, and about optimizing and accelerating models, it takes us to interesting paradigms. We will talk about this more in a second, but the high-level way to think about it is: what if an LLM really becomes an operating system, or the core of how we think about compute today, and we think about it completely differently from any of the approaches we have had before, in the way we program these things and in our expectations of how they analyze things? That is interesting in terms of where this goes; staying away from AGI, it is more about changing the paradigm from where we are today.

The thing that crosses my mind here is what happened in the Industrial Revolution. Think about three industries: making food, making cars, and making clothes. Before the Industrial Revolution all of those were bespoke. People made one or two cars a day; people working farms could feed less than a city or even a small village; someone making sweaters could make one a day, maybe one a week. When the Industrial Revolution showed up, we suddenly had the ability to make hundreds or thousands of cars a day, farm food at national scale, and make clothing at national scale. We really haven't had that in technology. The arc of technology, and this is not my own framework, it comes from Paul Maritz, a long-time Microsoft guy, then VMware, then Pivotal, where he and I met, is that the first era of computing was just taking paper processes and making them digital. That is evident in how the operating system is structured: files, folders, inbox, outbox; those are all paper processes turned into digital processes. The next era was making those things connected; that is the internet era. And what we have been through in maybe the last 15 years is form-factor changes:
either pushing things into the cloud for scale, or mobile, so you can do it on your phone. But finally, with AI, we are starting to get to the place where we have industrialization in the same way we saw it for those manufacturing and physical industries. Eighteen or maybe twenty-four months ago, if you needed a Photoshopped artifact for a presentation, you would go to your designer, and maybe the designer would make one or two a day for you; now you can go to Midjourney and get a thousand made in the next minute if you want. We are going through that same kind of industrialization for technology.

Diving deeper into where this goes: we can get to something like 10,000 complex decisions per second just by getting latency down to 0.1 milliseconds, and if we keep increasing that, it becomes viable to think about the core of our computing becoming an LLM. I think this is a real challenge for a lot of people, because we are obviously locked into existing paradigms, but this paradigm shift is fundamentally different in terms of how software will be built, how it will run, and how it will scale. We don't think about it much today because we think about the current speed limits of running LLMs and their capabilities, but if we imagine the same growth we saw in CPUs happening in this era, we can imagine the core of these devices changing. This is a hat tip to Karpathy, whose diagram this is: an LLM at the core of what happens in video and audio (we are starting to see that today), in our browsers, in how we interact with other LLMs, with code interpreters, and even with our file systems.

So what is the art of the possible? I'll rattle off some things that crossed our minds while putting this presentation together. We don't spend a lot of time thinking about it, but many responses from LLMs today are near-real-time, roughly reading speed. If we move to instantaneous responses and decision-making, everything gets a lot faster. This is evident in that Globe example: a task that would probably take you an afternoon, an evening, or several evenings is done in a few seconds. Then there are personalized experiences. Today we don't really have much personalization; we are starting to see elements of it, for instance OpenAI launching features that let it understand specifics of your world, your pet's name, your kids' names, your spouse's name. A lot of people push on this; two of my friends, Bill Gurley and Brad Gerstner, talk about it a lot on their podcast, and they view personalization as the next major frontier. Personalization and speed will have to go hand in hand to make that work seamlessly. Next is a kind of universal natural-language interface.
If we think about our interface to software today, we started with point-and-click and keyboards, and we have gone to touch with our mobile devices, but you really start to see the power of this with voice. I think everyone has been super excited about the release of GPT-4o and its voice agents. I don't think we have fully gotten there yet, but it showed the art of the possible with voice. Then there is mixed interaction, what we refer to as xRx: any type of input, reasoning, and any type of output. The example I like to give: if you are trying to order something, you may want to talk to an agent by voice but see the responses as text. If you are booking a haircut and ask what times are available, and it tells you 9:00 a.m., 11:00 a.m., 3:30, and 5:30, that is hard to remember if it only comes back as voice. So you want these interactions to be multimodal, which touches on my second point, and I think we will see a lot more of those interface changes.

Then there are advanced virtual assistants and complex task scheduling. A lot of what we will see in the back half of this year is agents becoming much more capable, with a lot of focus from LLM providers on making complex tasks something that actually gets solved. It is interesting: today we generally measure the efficacy of an LLM on single-shot answers, and I think we do that because of the performance barrier we have been talking about. But if you take any existing LLM today and multi-shot it, its scores get a lot better, and a couple of recent papers showed that if you have multiple agents working together on a problem, a smaller model can compete with higher-parameter models just by doing multi-shot reasoning or collaborating (a toy sketch of that idea follows at the end of this passage). So we will see a lot more of that as speed improves, and there is incredible optionality there. We saw maybe the first cut of collaborative AI agents with Apple Intelligence, where something running on-device interacts with something off-device; it is an early implementation, and these things will get much more sophisticated.

An area we have spent a lot of our careers in is analytics and predictive analytics. Today, pretty much everything is action-oriented and driven off a human action. If the speed goes up, systems can become far more predictive: an agent that is always running in the background because compute cycles are next to free. We don't see that today, but we get there as we move up the curve. Context awareness as well: today we are limited in how much context we can provide, and even with models with bigger context windows we have to be conscious of how many compute cycles we will burn; if that becomes next to free, it becomes quite powerful for us.
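Circling back to the multi-shot point above, the effect is easy to see with a toy simulation. The stub model below is entirely made up (it answers correctly 55% of the time), so the numbers illustrate the mechanism of sampling plus majority voting, often called self-consistency, rather than any real model's scores; a real client behind the same one-argument interface could be swapped in.

```python
import random
from collections import Counter

def ask_once(question: str) -> str:
    """Made-up stand-in for a single LLM call that is right only ~55% of the time."""
    return "42" if random.random() < 0.55 else random.choice(["41", "24", "7"])

def ask_multi_shot(question: str, n_samples: int = 15) -> str:
    """Sample several answers and majority-vote them (self-consistency).

    More samples trade extra tokens for accuracy, which is exactly the trade
    that gets cheaper as inference gets faster."""
    votes = Counter(ask_once(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

random.seed(0)
q = "What is six times seven?"
single = sum(ask_once(q) == "42" for _ in range(1000)) / 1000
multi = sum(ask_multi_shot(q) == "42" for _ in range(1000)) / 1000
print(f"single-shot accuracy ~{single:.0%}, 15-vote multi-shot accuracy ~{multi:.0%}")
```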
Then there are creative tools and customizable content; I'll focus on the second one, because it is an area where I think many of us would like to see things go. One of my favorite shows was Seinfeld, and obviously it isn't on anymore, but one of the things I like to do when I am bored is go into the LLM of my choice and have it write a Seinfeld episode made up of modern-day things that are happening. If you ever try it, it is super fun, because it does an incredible job of identifying which character would have the funny or odd thing happen to them in the scenario you give it. Taking that beyond writing into multimedia forms is going to be really powerful going forward.

Then complex decision-making. Before our company was acquired by Groq, we were building a company called Definitive Intelligence, so we spent a lot of time in this space, not only on natural-language analysis of SQL, text-to-SQL as a lot of people call it, but also on a really cool product Rick, who is sitting here with us, was building for us called Pioneer, an automated data-science agent meant to run almost endlessly on a problem. You define a cap, and think about how a business runs: it has a bunch of KPIs and a bunch of data coming in, and usually humans take that data, analyze it against the KPIs, and create PowerPoints and spreadsheets to tell senior management or the world how well they are doing. There is no reason that shouldn't just happen automatically, with an agent constantly looking at the new data, asking additional questions, and diving into it. Interesting things emerged: we let Pioneer loose on a dataset of human workers and their performance reviews, and it correlated things we hadn't thought about; depending on your age and the type of performance review you got, your productivity would fall off. Maybe Rick can correct me later if I am wrong, but it was something along those lines, and it was always an interesting example for us.

Then there are a lot of interesting things around dynamic optimization. This is an area we know from before: when a bunch of us were at Ford after the acquisition of Autonomic, we saw it in the supply chain. If you think about how cars are produced and shipped, there is pretty sophisticated software doing this, but it is still not efficient, and the art of the possible with what we were talking about earlier could be very interesting for some of our old colleagues at Ford.

I'll touch on a couple more things and then leave a couple of minutes for questions. Edge AI and decentralized AI; this is pretty cool. There is a really cool project called Hyperspace.
What they are doing, a bit like SETI@home or Render, is letting people take their unused GPU compute and make it available in the cloud. Why that is interesting is that certain use cases don't necessarily need to be real time, so I think we will see a lot more of that. It intersects really well with getting more throughput and lower latency out of existing systems, especially given the amount of power consumption required; distributing it could be really interesting.

A couple more. Enhanced security and privacy is a big area. I was talking to one of our colleagues last night, and he had been subjected to a really scary kind of phishing call: someone called in, sounded very official, and had access to a lot of information. We have all seen the people who run scam call centers, and the people who go after them, but these folks armed with AI are much more sophisticated, because they can create stories and narratives that go much deeper than the call-center scammer of the past. To protect against these systems you will almost need to have something on your own side. Our colleague was so confused because the narrative was so good; the only way he could really tell the person was a scammer, other than hanging up, was to ask them to send some kind of formal message. And those defenses will need to run incredibly fast.

This is the last set. Education is something that is really important to us broadly at Groq; we think about making tokens available cheaper and more broadly, and about personalization. Sal Khan has a very good TED talk from a couple of years ago, the "two sigma" talk, where he says you can take any student at any level, the highest performers or someone performing lower, and if you give them a personalized tutor, their test scores improve by two standard deviations. Imagine doing that with AIs that are, one, very cheap to use and, two, personalized to the learner. I was speaking recently with someone building an AI service for homeschooling, and what was powerful about it is this: say you have a young child who is really into unicorns or ponies and you want to teach them math, subtraction, addition, multiplication. It is a lot easier if you frame it in the context of those things: "you have three ponies times two unicorns, what do you get?" I had never thought about that before, but customizing learning to the interests of the person is quite powerful, so we will see more of that.

And then interoperability and compatibility. If you have ever been in enterprise software, the majority of money spent deploying and maintaining enterprise software is really related to interconnectivity, interoperability, and compatibility.
Having really fast and cheap AI technologies will help us reduce a huge burden that exists in the enterprise today. So that's it; hopefully you enjoyed it. Thank you so much.

Awesome, that was a great talk from Groq. Now we have a product manager from Crusoe, and he will be speaking about accelerating mixture-of-experts.

Thank you, and thanks for coming to our session today. A lot of really interesting stuff is happening in the AI world these days, with all the recent model and GPU developments, and it is really cool to see all the use cases. Here, though, I want to talk a little bit about the infrastructure: how we can support the newest GPUs and the newest machine-learning models, and how we can help make sure everything works smoothly, fast, and productively. My name is Yen Vinko, and I am a product manager at Crusoe. My main responsibility is infrastructure, specifically GPU networking infrastructure, and we are always looking for ways to increase the performance of that network because, as we will see later in the presentation, it really matters.

A little bit about Crusoe. Crusoe is an AI cloud platform with one mission I think is very important for all of us: to align the future of computing with the future of the climate. There is very strong demand right now for computing power, GPUs are really energy-hungry, and a lot of investment is going into data centers, which of course puts additional pressure on the grid and on energy sources. What we are trying to do at Crusoe is use stranded energy, wasted energy, and renewables to power our data centers. We want to make sure that every time you train a model, and every time you use a GPU for inference, you are not causing a negative impact on the climate.

Whenever we build the AI cloud, we build it on three important pillars. First, high performance: as customers buy our services, procure GPU time, and train their models, we have to ensure the infrastructure is optimized for that training. Every time it is not, every time there is a delay, a glitch, an outage, or simply mediocre performance, it directly impacts the customer's bottom line; it lengthens the time to train and raises the cost to train the model. The second pillar, which I think is very important for everybody here, is ease of use. We want to differentiate ourselves from the general-purpose clouds. The hyperscalers build great infrastructure and try to support every use case customers might have for cloud computing, but we really want to focus on the experience of the AI engineer: a simpler user interface that lets developers spin up compute resources, deploy models, train them, and use them for inference, with all the underlying complexity of the infrastructure hidden by us; I believe it is our job to keep it that way. And third, as I mentioned, we are climate-aligned.
That means we aim to power 100% of our data centers with renewable or wasted energy sources, some form of stranded energy, to ensure we are net zero on carbon emissions. We have a big story around that; feel free to check it on our website or come by our booth on the show floor, and the team will be happy to talk about it.

Where are we present right now? We have a number of data centers located across the US; three of them are in the continental United States, generally located close to the energy sources I mentioned: one in Texas, one in the north-central part of the country, and one in the East. We are also building a big data center in Iceland right now that will be powered by geothermal energy, again an amazing way to use constrained energy sources or renewables to power a data center. We try to follow that model, so we place our data centers strategically. The Iceland data center will also be important for our EMEA customers, given the latency and general connectivity to Europe.

What is our platform? I say Crusoe Cloud, but whenever we talk about any cloud we are talking about three general types of products. First and foremost, compute: we offer VMs with GPUs attached, so whenever a customer wants access to GPUs they can get it through a VM, and they can connect a set of VMs together and use them as a single training cluster. We also offer CPU instances for data pre-processing or any general-purpose compute tasks, data preparation, offload, whatever you have. On storage, we offer ephemeral and persistent disks on the node, delivered from NVMe on the local server where your VMs are placed; we also have a persistent block-storage solution, and we are working on delivering managed network file systems. On the networking side there is the more traditional VPC networking, sometimes called the front-end network, used to deliver customer traffic from the internet or from the customer environment, wherever the data sources are, to the VMs; that is your main connectivity path to the outside world. We offer additional services on top of that, firewalls today and load balancers soon, but generally we follow the traditional path for VPC networking and the requirements customers usually have there.

What is more interesting, and what we will talk about in greater detail a little later, is our rail-optimized InfiniBand cluster networking. For those who don't know, GPU cloud providers typically separate their networks: there is the front-end network for general-purpose traffic, and then all the communication between GPUs happens on a standalone network that is high-performance, low-latency, and high-bandwidth, with the whole topology optimized for GPU-to-GPU communication.
Now, last but not least: the user experience. As I mentioned before, our main customers, our main persona, the people using Crusoe Cloud, are AI developers and machine learning engineers, so we want to make sure they have what they need to be successful without having to think too much about infrastructure. We offer a CLI, we have APIs, we have a GUI, so everything can be automated and everything can be consumed and configured in whichever way you like best. We do have a lot of customers already, and it was very fun for me to see on the floor that some of them are here and are presenting their solutions. This is probably the first time in my life that I'm attending a conference, standing at a booth, and not having to compete with the people around us; we see many of the companies presenting their solutions right now as our partners. We already partner with a bunch of them: we have Together AI here, and a number of others, all using our infrastructure for different purposes. Together AI, for example, is really into using Crusoe infrastructure for ML training, for fine-tuning their models, and sometimes for inference; others are using our compute infrastructure to train new foundation models. This is really great: if you are a customer of Together AI, or Codeium, or others, it is likely that you have already been exposed to Crusoe infrastructure in some way.

Now, distributed training has a very specific set of problems. There is the compute part, where the computation is done on the GPUs, but since we are talking about distributed training, there are a lot of GPUs, and whenever a training step is completed, all the GPUs have to exchange the information, the data, that they each calculated on their own. This is typically done through all-reduce or all-gather collectives; there is a forward pass and a backward pass, and then the networking part, without any optimization, takes about 25 to 30 percent of the training time. That is time when your GPUs are sitting idle, unable to compute anything, because they have to wait for all the information to be gathered. That is bad for everybody: bad for the customers, because they still pay for the infrastructure, they still have to wait, and it delays the model training; but it is also bad for us, because it means our infrastructure is not performant enough. There are a couple of tricks we can apply. First of all, computation and communication overlap allows you to start the network exchange, the data exchange, while the computation is still ongoing. But even with that, when we were working with customers we saw a reduction of only about 10 percent, so roughly 25 percent of the training time was still being spent on the network.
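As a side note, the computation/communication overlap described above is exactly what PyTorch's DistributedDataParallel does out of the box: gradients are grouped into buckets and all-reduced over NCCL while the rest of the backward pass is still running. A minimal sketch (the model, sizes, and bucket_cap_mb value are placeholders, not anything Crusoe-specific):

# Minimal sketch: PyTorch DDP overlapping gradient all-reduce with backward compute.
# Launch with: torchrun --nproc_per_node=<gpus_per_node> ddp_overlap_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # NCCL handles the GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                     # placeholder model
        torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
    ).cuda()
    # Gradients are grouped into ~25 MB buckets; each bucket's all-reduce starts
    # as soon as it is ready, so communication overlaps with the remaining backward pass.
    ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    batch = torch.randn(8, 4096, device="cuda")
    loss = ddp_model(batch).square().mean()
    loss.backward()                                  # the overlap happens inside this call
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()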
As a product manager on the infrastructure side, I am constantly being asked how we can reduce that, how we can use the network as efficiently as possible and close that gap. So we have been looking into what the right cluster networking topology would be, and how to make sure that the data fabric connecting the GPUs is fully optimized and able to provide the bandwidth and the latency needed.

The standard fat-tree: those of you who have worked in data center infrastructure before will know it is something we have traditionally been doing for years, and it is a great way to build a scalable, possibly non-blocking, fabric. But there are a couple of issues with it. First of all, if we connect the servers shown below to a single leaf, that introduces a single choke point as well as a single fault domain: if we lose the leaf, we lose all the GPUs connected to it. The other thing we were thinking about is that we have these leaf switches that can be used for back-end traffic propagation, so why not use them from the bandwidth perspective and get an additional path?

Let me use a simple two-node example to explain the topology and how we use it. First of all, whenever GPUs want to communicate within one server, they can use the embedded NVLink and NVSwitch, and that provides good communication: they never have to go out to the external fabric at all. Now, for data communication between GPUs on different nodes: if they are connected to the same leaf, that is what we call a single rail, which means the traffic passes through that one leaf, just one hop away, and you get to the destination. What is interesting is when we want to talk across different rails: then we have to go all the way up to the spine, and that introduces an additional hop, on top of potential bandwidth saturation problems, which may add latency that really matters for your all-reduce operations. Luckily for us, NVIDIA, with a recent version of NCCL, introduced a feature called PXN, which allows you to use the internal NVSwitch inside the host to communicate across rails. So whenever we want GPU 0 to talk to GPU 8 on another host, we can use the internal NVSwitch to hop the traffic between GPUs inside the node and then send it out to the leaf that GPU is connected to. That still allows us to use one single hop while reaching GPUs across different rails.

We ran some NCCL tests and saw quite a significant improvement: about 50 percent for small messages and about 50 percent for large messages. For the smaller messages those numbers are mostly about latency; for the larger ones we care more about bandwidth, because latency tends to stay roughly the same.
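For reference, PXN shipped in NCCL 2.12 and is controlled by the NCCL_PXN_DISABLE environment variable, so a rough way to reproduce this kind of comparison is to time the same all-reduce with the flag set to 0 and then to 1. The message sizes and iteration counts below are arbitrary placeholders of mine; whether PXN changes anything at all depends on having a rail-optimized fabric like the one described here.

# Rough all-reduce timing sketch; run it twice, e.g.:
#   NCCL_PXN_DISABLE=0 torchrun --nproc_per_node=8 allreduce_bench.py
#   NCCL_PXN_DISABLE=1 torchrun --nproc_per_node=8 allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    for n_elems in (1 << 20, 1 << 26):               # "small" vs "large" message, fp16
        buf = torch.randn(n_elems, dtype=torch.float16, device="cuda")
        for _ in range(5):                           # warm-up iterations
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(20):
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        if dist.get_rank() == 0:
            ms = (time.perf_counter() - start) / 20 * 1e3
            print(f"{n_elems} fp16 elems: {ms:.3f} ms/all-reduce "
                  f"(NCCL_PXN_DISABLE={os.environ.get('NCCL_PXN_DISABLE', 'unset')})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()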
Those numbers are great, and everybody would love them, except the customers, and that does make sense, because those numbers are synthetic and mostly show the raw workload applied to your network. What customers care about is the time to train their particular model. So we used a sparse mixture of experts as an example. I'm not going to dive into the details of how it works, but essentially a sparse mixture of experts gives you different layers of experts and routes the traffic between them, and whenever you deploy that on a really large GPU cluster it creates a ton of traffic: all the GPUs have to send data to each other and exchange information, so the workload on the network is pretty significant. We used the Mixtral model, the open-source sparse mixture of experts that consists of eight feed-forward expert blocks of roughly seven billion parameters each, and we fine-tuned it on 240 H100 GPUs. We saw quite a significant improvement with PXN enabled compared to without it: a 14 percent improvement, which translates directly into the time to train the model, and therefore directly into the cost of training the model. That is something everybody got really excited about; I definitely did, and our customers as well, because it shows the real value they can get with a model.

That was it from my side. Sorry for going through it so fast; it's a very large topic and it's hard to cover, but I'm happy to answer any additional questions you might have. Thank you, and feel free to come over to our booth if you want to talk; I'm happy to chat more.

Awesome. With that, this track has concluded for now, so feel free to go for lunch. Have a good day.
So for this next one, we take a small tour towards open models.

Hi everyone! I'm a research engineer at Google DeepMind and, as mentioned, the technical lead of Gemma. Before I get started, I just wanted to say how awesome it is to be here with you all today. When we were building Gemma, our North Star, the thing we were most excited about, was building something to empower and accelerate the amazing developer community, and since we launched our first models in February, I have been absolutely blown away by the incredible projects and research and innovations that have already been built on top of Gemma. So I'm particularly excited to be here with so many developers today, and especially delighted to unveil the latest advancements and additions to the Gemma model family. So, without further ado, we'll get started.

As many of you probably know, Google Research has been a pioneer in publishing AI and ML research for the past decade, including some of the key research that sparked the recent innovations we've seen in AI: the Transformer, SentencePiece, BERT, to name a few. Google DeepMind has continued this tradition and is actively working to share our research for the world to validate, examine, and build on. But Google's support of the open community for AI and ML is not just limited to publishing research; we've also been doing work to support ML across the entire technical stack for a long time, from hardware breakthroughs like TPUs, which I imagine is especially relevant for this crowd and this track, all the way to an evolution in ML frameworks from TensorFlow to JAX. Throughout all of this, open development has been especially critical for Google: our ability to collaborate with the open-source community has helped us all discover more, innovate faster, and really push the limits of what AI
is capable of. So this long history of support for the open-source community leads us to today and to Google's latest investment in open models: Gemma. Gemma is Google DeepMind's family of open, lightweight, state-of-the-art models, which we build from the same research and technology used to create the Gemini models.

I'm so sorry, I think that's my phone going off during this talk. Please feel free to rummage through that bag. Sorry, folks. Thank you. Wow, lesson learned: even the speaker needs to remember to silence her cell phone. All right, back to it.

There are a couple of key advantages of the Gemma models that I want to highlight today. The first is that Gemma models were built to be responsible by design. I can tell you from personal experience that from day zero of developing a Gemma model, safety is a top priority. That means we manually inspect data sets to make sure we are training not only on the highest-quality data but also on the safest data we can. It means we evaluate our models for safety starting with our earliest experimentation and ablations, so that we are selecting training methodologies that we know will result in a safer model. And at the end of our development, our final models are evaluated against the same rigorous, state-of-the-art safety evaluations that we evaluate Gemini models against. We do this to make sure that no matter where or how you deploy a Gemma model, you can count on having a trustworthy and responsible AI application; no matter how you've customized Gemma models, you can trust that the result will be a responsible model.

Gemma models also achieve unparalleled, breakthrough performance for models of their scale, including outperforming significantly larger models, but more on that very shortly. We also designed the Gemma models to be highly extensible, so that you can use a Gemma model wherever and however you want. This means they're optimized for TPUs and GPUs as well as for use on your local device, and they're supported across many frameworks: TensorFlow, JAX, Keras, PyTorch, Ollama, Transformers, you name it, Gemma is probably there. And finally, the real power of the Gemma models comes from their open access and open license, period; that's what's powerful about Gemma. We put state-of-the-art technology into your hands so you can decide what the next wave of innovation looks like.

When we decided to launch the Gemma models, we wanted to make sure we could meet developers exactly where they are, which is why Gemma models are available anywhere and everywhere you can find an open model. I will not list all of the frameworks on this slide; this is only a fraction of the places where you can find Gemma models today. It means you can use Gemma how you need it, when you need it, with the tools that you prefer for development.

Since our initial launch back in February we've added a couple of different variants to the Gemma model family. We of course have our initial models, Gemma 1.0, which are our foundational LLMs. Shortly after that we released CodeGemma, which is the Gemma 1.0 models fine-tuned for improved performance on code generation and code evaluation. And one variant I am particularly excited about is RecurrentGemma, a novel architecture, a state-space model, designed for faster and more efficient inference, especially at long contexts. We've also updated all of these models since their initial release: we now have Gemma 1.1, which is better at instruction following
and in chat; we've updated CodeGemma to have even better code performance; and RecurrentGemma now comes not only in the original 2B size but also in a 9-billion-parameter size. So there's a lot going on in the Gemma model family, and I'm especially excited to tell you about our two most recent launches.

The first one is actually our most highly requested feature since day zero of the launch, and that is multimodality. So we launched PaliGemma. (Oh, thank you, I appreciate it. This is why I love the open-source community: truly the most passionate developers there are.) PaliGemma is a combination of the SigLIP vision encoder with the Gemma 1.0 text decoder. This combination allows us to do a variety of image-and-text tasks and capabilities, including question answering, image and video captioning, object detection, and object segmentation. The model comes in a couple of different variants; it's currently only available at the 2B size. We have pre-trained weights available that can be fine-tuned for specific tasks, we have a couple of fine-tuned variants already targeted towards things like object detection and object segmentation, and we also have transfer checkpoints, which are models specialized to target a couple of academic benchmarks.

Up until this morning that was our latest release, but I'm very excited to be here with you today, because it is Gemma 2 launch day. Thanks. We have been working very hard on these models since the Gemma 1.0 launch. We tried to do as much as we could to gather feedback from the community, to learn where the 1.0 and 1.1 models fell short and what we could do to make them better, and so we created Gemma 2. Gemma 2 comes in both a 9-billion-parameter size and a 27-billion-parameter size. Both models are without a doubt the most performant of their size, and both also outperform models that are two to three times larger than these base models. But Gemma 2 isn't just powerful; it's designed to integrate easily into the workflows you already have. Gemma 2 uses all of the same tools and all of the same frameworks as Gemma 1, which means that if you've already started developing with Gemma 1, you can switch to the Gemma 2 models with only a couple of lines of code and get increased performance and more power behind your applications. We have the same broad framework compatibility, again TensorFlow, JAX, Transformers, Ollama, all of the ones I previously named, for Gemma 2 as well. We also have significantly improved documentation, with more guides and more tutorials, so we can coach you through how to get started not only with inference but with advanced and efficient fine-tuning from day zero. And finally, we really wanted to target fine-tuning as one of the key capabilities of these models; we did extensive research into how our core modeling decisions impact users' ability to do downstream fine-tuning, so we believe these models are going to be incredibly easy to fine-tune, and you can customize them to whatever your use case may be.

In addition, to make it especially easy to get started with the Gemma 2 models, we have made the 27B model available in Google AI Studio. This means you can go to the AI Studio homepage right now if you wanted to, select Gemma 2, and start playing around with prompts right away; you shouldn't have to do anything except come up with an idea for how you want to push the limits of our model.
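The "couple of lines of code" point is easy to picture with Hugging Face Transformers: switching from a Gemma 1 checkpoint to a Gemma 2 one is essentially a model-id change. A minimal sketch, assuming a recent transformers release with Gemma 2 support and that you have accepted the Gemma license on Hugging Face; the prompt is just an example:

# Minimal Gemma 2 inference sketch with Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"        # previously e.g. "google/gemma-1.1-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain what an open-weight model is in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))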
I am especially excited to see what you all end up doing with AI Studio and Gemma, and we have a couple of different ways for you to let us know what you're building, which I'll get to shortly, but if you have ideas, I'll be here all day and want to hear what you're doing with the Gemma models.

Let's dive a little more into performance. We are incredibly proud of the models we've made. As I mentioned, they are without a doubt the best, most performant models of their size, and they are also competitive with models two to three times larger. Our 27B model has performance in the same ballpark as Llama 3 70B, and it outperforms Grok on many benchmarks, by a fairly significant margin in some cases. But academic benchmarks are only part of how we evaluate Gemma models; these benchmarks are not always indicative of how a model will perform once it's in your hands. So we've done extensive human evaluations as well, where we find that the Gemma models are consistently and heavily preferred to other open models, including larger open models. I'm also proud to say that the Gemma 2 27B model is currently the number one open model of its size, and it currently outranks Llama 3 70B, Nemotron 340B, Grok, Claude 3, and many other models as well. Thank you, wow, you guys are very supportive, I appreciate it. The only other open model of any size that outperforms the Gemma 2 27B model is the Yi Large model on LMSYS. So we expect that you should have some fun playing around with it, especially for chat applications: we found in our evaluations that the Gemma 2 models are even better at instruction following, even more creative, better at factuality, better all around than the Gemma 1.0 and 1.1 models.

The other important thing I want to highlight from our most recent launch is the Gemma cookbook. The Gemma cookbook is available on GitHub now and contains 20 different recipes, ranging from easy to very advanced applications of how to use the Gemma models. The thing I am most excited about is that the Gemma cookbook is currently accepting pull requests, so this is a great opportunity to share with us what you're building with the Gemma models so we can help share it with the rest of the world. And of course, we also wouldn't mind if you star the repository; come take a look and tell us what you're building with Gemma.

So there are a couple of different ways you can get started with the Gemma 2 models. I just mentioned the cookbook; you can also apply for GCP credits to accelerate your research using Gemma 2. We have a lot of funding available to support research, and I would really encourage you to fill out an application regardless of how small or big your project is. We also, as I mentioned, have significantly improved documentation: many guides, tutorials, and colabs across every framework, so you can get started doing inference, fine-tuning, and evaluation with the Gemma 2 models. You can download them anywhere open models are available, and please chat with us on Discord or other social media channels so we can learn more about what you're building.
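Since fine-tuning comes up repeatedly here, this is a compressed sketch of the parameter-efficient route the guides generally cover, using the peft library; the dataset, hyperparameters, and output paths below are placeholders of mine, not values from any official Gemma tutorial:

# LoRA fine-tuning sketch for Gemma 2 (placeholder data and hyperparameters).
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "google/gemma-2-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora_config)       # only the adapter weights are trainable

texts = ["<your domain-specific training text goes here>"]       # placeholder corpus
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma2-lora", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4, bf16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("gemma2-lora-adapter")      # saves only the small LoRA adapter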
And that's about all from me today. I am so excited to see what you all build with Gemma. I have been working on this project for almost two years now, and I started working on it because, as a researcher in academia, I was disappointed to see how far behind open foundational LLMs were compared to the rapid improvements we were seeing in proprietary models. So this is something that's very near and dear to my heart, and something I wish I had had when I was actively part of the open-source community. I'm very excited to see the projects and research that you all do with these models. Please engage with us on social media, on GitHub, on Hugging Face, and here at the event, and let us know what you think of the models and what you think we can do better next time. Thank you all very much; I really appreciate your time.

Awesome. We have maybe two more minutes. Would you like to say what's an exciting thing you'd like to see built with Gemma models?

Yes, I love this question. As I mentioned at the beginning of the talk, I think safety and responsibility are a crucial part of building the Gemma models, but I also think it's an area where, in general, we have not seen nearly enough research, and there's a lot of interesting work going on in the open-source community right now about how to build more private, more secure models; that will be particularly interesting. I'm also really excited by some of the novel architectures we're seeing people adapt Gemma models to, things like the state-space model, like RecurrentGemma. But probably the thing I'm most excited about is the research areas that I know nothing about: I'm really excited to see people blow our minds with research ideas that never would have occurred to us. That's really what I'm looking forward to most.

And just one more question: the V1 was out for a while; what is something you saw that blew your mind in the past couple of months?

One of my favorite uses of the Gemma V1 models: the Gemma models use the same tokenizer as the Gemini models, so while Gemma is trained primarily on English data, the Gemini models are multimodal and multilingual, which means the Gemma models are very easily adaptable to different languages. One of my favorite projects, which was also highlighted at I/O, was a team of researchers in India who fine-tuned Gemma to achieve state-of-the-art performance on over 200 variants of Indic languages, which had never been achieved before: to have such a small model cover so many languages at such high quality. That was pretty crazy to see; that was one of my favorites. We also had, and this should be in the cookbook fairly soon, a cookbook recipe inspired by a use case where a small business was receiving emails describing what their new orders should be. The emails were all over the place, with no structure and no format, just all sorts of different lists of requirements for what the small business needed to make for each order. That small business fine-tuned a Gemma model to take in this list of emails and output, in priority order, which orders needed to be taken care of and when. So I was excited to see that very practical application of a Gemma model as well.

Awesome. Okay, without further ado, we should move on to the next speaker, from Fireworks AI: here we have Dima, the CTO. The CEO couldn't make it today, but Dima will give us an excellent talk.

Yes, she couldn't make it today because of a personal emergency, so you've got me. And as you saw, we don't yet have AI to figure out the projector, but we have AI for a lot of other things. So today I'm going to talk about Fireworks AI, and generally I'm going to continue the theme the previous speaker started about open models, and how we focus on productionizing models and inference at Fireworks.
But first, as an introduction, what's our background? The founding team of Fireworks comes from PyTorch leads at Meta and some veterans from Google AI, so combined we probably have a decade of experience productionizing AI at some of the biggest companies in the world, and I personally have been a core maintainer of PyTorch for the past five years, so the topic of open source is really close to my heart. And since we led this revolution of the open-source toolchain for deep learning, through our work on PyTorch and some of the Google technologies, we really believe that open-source models are the future for GenAI applications as well, and our focus at Fireworks is precisely on that.

So, how many people in the audience actually use GPT and deploy it in production? And how many folks use open models in production? Okay, I was about to convince you that the share of open-source models is going to grow over time, but it looks like in this audience it's already sizable. Nevertheless: why this trade-off, why go big, why go small? Currently the bulk of production inference is still based on proprietary models, and the catch is that those are really good models, often at the frontier in many domains. The other catch, however, is that it's one model that is good at many, many things, and it's often served in the same way regardless of the use case. That means that if you have batch inference on some narrow domain, or you have some super-real-time use case where you need to build a voice assistant or something, those are often served from the same infrastructure without customization. In terms of model capabilities, it also means that yes, GPT-4 is great, Claude is great, they can handle a lot of things, but you are often paying a lot for additional capabilities which are not needed in your particular case: you don't really need a customer-support chatbot to know about 150 Pokémon or to be able to write poetry, but you do want it to be really good in your particular narrow domain.

This discrepancy for large models leads to several issues. One, as I mentioned, is high latency, because using a big model means longer response times, which is particularly important for real-time use cases like voice assistants. It gets more and more important with agentic stuff, because, for example, the next talk is going to be about Devin: for an agent-like application you need to do a lot of reasoning steps and call the model many times, so latency is really important. And often you can pick smaller models, like Llama or Gemma, which we just heard about, and achieve, for a narrow domain, the same or better quality while being up to 10 times faster; for example, for some of the function-calling use cases, in an external benchmark from Berkeley, you get similar performance from a fine-tuned Llama 3 at 10x the speed. Cost is also an issue. If you're running a big model on a lot of traffic, with, say, a 5K-token prompt, 10,000 users, and each of them calling the LLM 20 times per day, then on GPT-4, even on GPT-4o, it probably adds up to something like $10K per day, which is millions per year, a sizable cost for a startup. You can easily cut that with much smaller models, and that is often what we see as the motivation for reaching out for smaller and more customizable models.
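The back-of-the-envelope math behind that cost point looks roughly like this; the per-million-token prices are illustrative assumptions of mine, not figures quoted in the talk:

# Rough cost estimate for the workload sketched above (prompt tokens only).
users = 10_000
calls_per_user_per_day = 20
prompt_tokens_per_call = 5_000

tokens_per_day = users * calls_per_user_per_day * prompt_tokens_per_call   # 1e9 tokens/day

assumed_price_per_million = {        # $/1M input tokens -- illustrative, not quoted pricing
    "large proprietary model": 10.00,
    "small fine-tuned open model": 0.20,
}

for name, price in assumed_price_per_million.items():
    per_day = tokens_per_day / 1_000_000 * price
    print(f"{name}: ${per_day:,.0f}/day, about ${per_day * 365:,.0f}/year")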
But where open models really shine is domain adaptability, and that comes in two aspects. First, there are so many different fine-tunes and customizations; the previous speaker mentioned the Indic-language adaptations built on Gemma, and there are models specialized for code, for medicine; if you go to Hugging Face there are tens of thousands of different model variants. And second, because the weights are open, you can always customize for your particular use case and tune quality specifically for what you need.

So open-source models are great; what are the challenges? The challenges really come from three areas. First, what we usually see when people try to use an open model, something like Gemma or Llama, is that you run into complicated setup and maintenance: you need to find GPUs somewhere, figure out which frameworks to run on them, download the models, maybe do some performance optimization and tuning, and you have to repeat this process end to end every time the model gets updated or a new version is released. Second, on optimization itself, especially for LLMs but for generative models in general, there are many attributes and settings that really depend on your use case and requirements: somebody needs low latency, somebody needs high throughput, prompts can be short or long, and so on, and choosing the optimal settings across the stack is not trivial; as I'll show later, in many cases you can get multiple-X improvements from doing this efficiently. And finally, just getting production-ready is hard: as you go from experimentation to production, even just provisioning GPUs on public clouds is not easy, because GPUs are not always reliable, and getting to enterprise scale requires all the scalability technology, telemetry, observability, and so on. Those are the things we focus on solving at Fireworks.

Starting with efficiency: we built our own custom serving stack, which we believe is one of the fastest if not the fastest. We built it from the ground up, from writing our own CUDA kernels all the way to customizing how things get deployed and orchestrated at the service level. That brings multiple optimizations, but most importantly we focus on customizing this serving stack to your needs, which basically means that for your custom workload and your custom cost and latency requirements, we can tune it for those settings. What does customization mean in practice? For example, many use cases use RAG and very long prompts, and there are many settings you can tune at the runtime and deployment level to optimize for long prompts. Long prompts are often repeatable, so caching is useful, or you can tune settings so the throughput is higher while maintaining latency. This is independently benchmarkable: if you go to Artificial Analysis and select long prompts, Fireworks is actually the fastest, even faster than some of the other providers who are over at the expo. And we don't only focus on long prompts; we focus on many modalities. As an example, for image generation we are the fastest provider serving SDXL, and we're the only provider serving SD3, Stability's new model, because their API actually routes to our servers. And finally, as I mentioned, especially for LLMs, customization matters a lot.
One paradigm for thinking about LLM performance that is often useful is minimizing cost under a particular latency constraint. We often have customers come to us saying: I have this interactive application, I need to generate this many tokens in under two seconds. That is really where cross-stack optimizations shine: by tuning for a particular latency cutoff and changing many settings, you can deliver multiple times higher throughput, and higher throughput basically means fewer GPUs and lower cost.

In terms of model support, we support the best-quality open-source models: we heard about Gemma, and obviously the Llamas, plus ASR and text-to-speech models from many providers. We also work with model developers; Yi Large, for example, is also served in the US on Fireworks and launched last week. As for platform capabilities, as I mentioned, we have a lot of open-source models to get you started, or ones we fine-tune in-house; I'm going to talk a little bit about function-calling specialized models later on, and we do some of the vision-language model fusion ourselves, which we release as well. And of course the key to open model development is that models can be tuned for a particular use case, so we provide a platform for fine-tuning, whether you're bringing a dataset collected elsewhere or collecting it live with feedback on our platform. Specifically on customization, one interesting feature, which a lot of people starting to experiment with models find interesting, is how to serve a fine-tuned model efficiently once you deploy it. It turns out that if you do LoRA tuning, which a lot of folks do, you can do some smart tricks and deploy multiple LoRA models on the same GPU, actually thousands of them, which means we can still give you serverless inference with pay-per-token pricing even if you have thousands of model variants deployed there, without having to pay any fixed cost.
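Fireworks' multi-LoRA serving stack is its own system, but the underlying idea can be sketched with the open peft API: keep one copy of the base weights resident on the GPU and swap lightweight adapters per request. The base model id below is real, while the two adapter repos are hypothetical placeholders, and a production system would also batch requests across adapters rather than switching serially as this sketch does:

# Sketch of the multi-LoRA idea: one resident base model, many cheap adapters.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")

# Two adapters share the same base weights (each adapter adds only a few MB on top).
model = PeftModel.from_pretrained(base, "my-org/llama3-support-lora", adapter_name="support")  # hypothetical repo
model.load_adapter("my-org/llama3-sql-lora", adapter_name="sql")                               # hypothetical repo

def generate(prompt: str, adapter: str) -> str:
    model.set_adapter(adapter)                       # per-request switch, no base-model reload
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(generate("Summarize this support ticket: ...", adapter="support"))
print(generate("Write a SQL query for monthly revenue by region.", adapter="sql"))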
Of course, a single model is all great, but what we see increasingly in applications is that the model is not the product. By itself it is not enough; you need a bigger system to solve the target application. The reason is that models by themselves tend to hallucinate, so you need some grounding, and that's where RAG or access to external knowledge bases comes in. Also, we don't yet have in the industry a magical multimodal AI across all modalities, so you often have to chain multiple types of models, and of course you have all these external tools and external actions that end-to-end applications might want to take in agentic form. The term I really like, popularized by Databricks, is the compound AI system: we are increasingly seeing a transition from just the model being the product to a combination of, say, RAG, function calling, and external tools built together as the product, and that's pretty much the direction we see this field moving over time.

So what does that mean from our perspective? We see a function-calling agent at the core of this emerging architecture, which might be connected to domain-specialized models served on our platform directly, or maybe tuned for different needs, and connected to external tools, maybe a code interpreter or external APIs somewhere, with this central agentic view, a central model coordinating and translating the user's requirements if it's, for example, a chatbot. You have probably all heard about function calling, popularized initially by OpenAI; it's basically the same idea: how to connect an LLM to external tools and external elements. What does it mean in practice? We actually fine-tune models specifically for function calling, and we release a series of such models; the latest version of FireFunction was released two weeks ago. What you can do with it, if I manage to click on this button, is build applications that combine free-form general chat capabilities with function calling. In this case, the FireFunction demo has some chat capabilities, so you can ask it what it can do and it has some self-reflection to tell you. It's also connected in this demo app to a bunch of external tools, so it can query stock quotes, it can plot charts, all through external APIs, and it can also generate images. What it really needs to figure out is how to translate the user query, do the complex reasoning, and turn it into function calls. For example, if we ask it to generate a bar chart with the stocks of the top three cloud providers, the big three, it actually needs to do several steps: it needs to understand that the top three cloud providers means AWS, GCP, and Azure (Azure being owned by Microsoft), it then needs to make function calls querying their stock prices, and finally it needs to combine that information and send it to the chart-plotting API, which is what just happened in the background. Another important aspect you need for efficient function-calling chat is contextual awareness: if I ask it to add Oracle to this graph, it needs to understand what I'm referring to, keep the previous context, and regenerate the image. And finally, if I switch to a different topic, it needs to drop the previous context and understand that the historical context is now less important and start from scratch, so there's no Oracle in that cat photo or whatever. This particular demo is actually open source: you can go to our GitHub and try it out; it's built with FireFunction and a few other models, including SDXL, which run on our platform. The function-calling model itself is also open source and is on Hugging Face. You can of course call it on Fireworks for optimal speed, but you can also run it locally if you want. It uses a bunch of functionality on our platform, for example structured generation with JSON mode and grammar mode, which I think is similar to some of the previous talks from the Outlines folks who were speaking here yesterday. So, finally, try it out.
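Since the platform is OpenAI-API compatible (as the speaker notes just after this), a tool-calling request against a FireFunction-style model can be sketched with the standard openai client. The base URL, the model id, and the get_stock_quote tool below are my assumptions and placeholders, not the actual demo code:

# Sketch: tool/function calling through an OpenAI-compatible endpoint.
import json
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",   # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_quote",                       # placeholder tool
        "description": "Get the latest stock price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

response = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",   # assumed model id
    messages=[{"role": "user", "content": "Plot a bar chart of the top three cloud providers' stock prices."}],
    tools=tools,
)

# The model is expected to answer with tool calls (e.g. for MSFT, AMZN, GOOGL) instead of free text.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))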
Finally, how do you get started with Fireworks? If you head to the models page on fireworks.ai, you'll find a lot of the open-source, open-weight models I mentioned, and they're available in the playground. In terms of product offering, we have a range that can take you from early prototyping all the way to large scale. You can start with serverless inference, which is not that different from going to the OpenAI playground: you pay per token at a constant price and don't need to worry about hardware settings or anything. As I mentioned, you can still fine-tune there — you can do LoRA fine-tuning on our platform, or bring your own LoRA adapter and still serve it serverless. As you graduate, maybe as a startup reaching more production scale, you might want to move to on-demand, which is dedicated hardware with more settings and modifications for your use case; you can bring your own custom model trained from scratch or tune it on our platform. And finally, if you scale up to bigger volumes, there's the Enterprise level with discounted long-term contracts, where we'll also help you personalize and tune some of those performance settings I talked about earlier.

In terms of use cases, we're running production for many, many companies ranging from small startups to big enterprises; last time I checked, we were serving almost 150 billion tokens per day. Companies like Quora build chatbots like Poe on us; Sourcegraph and Cursor — I think Cursor had a talk here yesterday — use us for some of their code-assistant functionality, where, as you can imagine, latency is really important; and folks like Upstage and Liner are building different assistants and agents on top of us. So we're definitely production ready — go try it out.

Finally, we care a lot about developers — you all. These are external numbers from last year's LangChain State of AI report: it turns out we're one of the most popular platforms people pull models from, after Hugging Face, which was very nice to hear. Again, to get started just head to our website and go play in the playground right away; you can run Llama or Gemma or whatever at top speed and start building from there. I'm really excited to see what you can build with open models, with FireFunction, or with things you fine-tune on your own. And one last point: as I mentioned, we're OpenAI API compatible, so you can keep using your favorite tools and the same clients, or frameworks like LangChain or LlamaIndex. So yeah, really excited to be here and tell you a little about open-source models and how we at Fireworks focus on productionizing and scaling them. Go try it out, and you can also find us at the booth at the Expo. Thank [Applause] you.

That was a great talk. Now, who here has heard about Devin? Most of the room — well, this talk is going to be really interesting for you all. Here we have Scott. [Applause]

Cool, okay. I'm Scott from Cognition AI, and I'm going to tell you a little bit about the early makings of Devin — we're still super, super early on — and also a little bit about the space as a whole and what's coming next.
I thought it'd be nice to start with the demo first. It sounds like some of you have already seen some of the videos, but I brought a nice custom one for the World's Fair today, so I'll just show that quickly. Here I basically said — and this was this morning, by the way — hey Devin, I want you to build a mobile-friendly website to play the name game. I have a lot of trouble memorizing names and faces, I don't know about you guys. I gave it a TSV file of a bunch of names and faces — these are all the speakers here at the World's Fair this week — and said, can you set up the game so that you show two different random faces, show the name of one of them, and have me guess which one is which, and I gave a few instructions on how the game should work.

Devin is a fully autonomous software engineer, and what that means is that Devin has access to all the same tools a human software engineer would have when building something like this. The first thing Devin is going to do is make a plan, and you can see a basic plan coming out here. One of the interesting things is that the plan changes a lot over time: as you get new information or new feedback, you update your plan accordingly. After that, Devin is basically just running through this the same way a human would. If you take a look, Devin makes a new directory for the name-game website, starts a new React app — all the same primitives — works on building out the code, reads the TSV file to take a look at what's going on, generally works through it, and deploys a first version after some minutes.

I'll pull that up quickly; this is what it looks like. It's close but not quite there: it's still showing both names, and I think maybe I didn't quite specify that exactly, but you can click a name and get it correct. So I just went ahead and gave it more feedback in plain English: hey, can you hide the two names until I click on the answer, and can you also restyle the "play again" button — it's somehow a little off on this page. I kept going and gave it more and more feedback over time, and I also asked it to add a streak counter: keep track of how many I got correct, reset it to zero, a few other things. The website it ultimately deployed is this one right here. This is Justine, for example; it keeps track of my streak, and you can see it ramping up, and if I got one wrong on purpose you'd see the streak reset to zero, and so on. I actually played this game and learned the names of everyone, which was super helpful, by the way, and you can play it too — it's right here if you want to try it out. It has all the speakers; I think it was something like 170 speakers here at the World's Fair this week. So this is a cool example, but I want to highlight how different the world is if software engineering becomes just this
easy — if you can just explain exactly what you want in plain English and get it built. This is obviously kind of a toy use case, and perhaps useful, but we use Devin all the time ourselves when we're building Devin. By the way, I obviously didn't make this website with the QR code myself either; I just said, hey, build me this website with the QR code and whatnot, and Devin built that too.

Here's a quick example of Devin that we're using ourselves in production. If you take a quick look here, there's this whole search bar, there are all the sessions, and you can search across sessions; Devin actually made that, in the Devin repository. Bryce is on our team, and Bryce was asking, hey Devin, can you go into the Devin sessions list and create a search bar component — here's what I need you to do. There are a few things about this in particular that are obviously tuned for working in a production codebase: you can see that Devin started from a snapshot, so we have a machine instance loaded that it's cloned from; it has a playbook, so it knows a lot of the details about our repositories; and it's able to work generally within our git environment, so you'll see it make a PR and interact with all of those same tools.

I'll go through this quickly. Devin says absolutely and makes the first pull request. Bryce continues — again, you're just giving feedback in plain English — and says, hey, this is a great start, now could you add a magnifying glass and make it idiomatic, use Phosphor or Lucide, it's up to you. Devin says, yeah, sure, I'll build that. Bryce says, oh, by the way, no need to test, I trust you. Devin says, by the way, I'm dealing with a bit of an issue with the login process — it's just like working with another engineer — and Bryce says, okay, bro. It builds it all, puts up the PR, and this PR was actually merged; this is the search bar. Similarly, a lot of the API integrations Devin has were built by Devin, and a lot of our own internal dashboards and metrics tracking within Devin were also built by Devin. It's been a fun one to see Devin building the company, with the company, as well.

Cool. I want to talk a little bit about our journey so far and about what's happening in the space. We got started back in November, so it's been about seven months now. It's kind of funny — we started in a hacker house in Burlingame, and a lot of us had already lived together at that point; we'd all had our own journeys in AI, and we just knew we wanted to build something together, and we knew we wanted to do something in code and build a coding agent. After that hacker house in the Bay Area there was another hacker house in New York, then another hacker house in the Bay Area, so we've actually been going back and forth between New York and the Bay for basically the last seven months. I think at this point we're now going to settle in the Bay, but it's been back and forth, getting a slightly bigger Airbnb each time because the team also gets a little bit bigger.
So why Devin in particular? This is a question I'm really passionate about. Language models have been pretty big — I think that's fair to say — and the first wave of generative AI is what I generally call text-completion products. That makes a lot of natural sense if you think about it, because the interface of a language model is text completion: you give it a prefix and it completes the suffix from there. If you think about ChatGPT, a lot of the Q&A products, writing marketing copy or answering customer support, or even GitHub Copilot and Cursor and products like that — a lot of these are really great products, and it's a very natural use case where you have the prefix so far and you're asking the model to complete what comes next, and that's a tool that's useful.

I think we're now entering a new wave where we go beyond that and actually introduce some amount of autonomous decision-making, which is typically referred to in our space as agents. There are all sorts of new things that unlocks; there's a much higher bar of consistency required, but there are new things you unlock with it. It's been an interesting one because it's both a very deep core-capabilities question of getting Devin to solve these tasks and a pretty interesting product-design problem, because I think the UX of agents is something that's extremely new.

And then, why code in particular? A few different things. Obviously we're all coding nerds — we're all engineers — so the idea of teaching AI to code is one of the coolest things we could think of. But beyond that, I think there are a few particular reasons that code with agents works especially well. One is that there's so much more to being a software engineer than typing the code: a lot of the work is looking into a bug, looking at different files in the codebase, running this or that command, pulling up documentation, running the frontend yourself to reproduce the bug — you look at the thing, you make an edit, you try it again. All of that work is what software engineering really is, more so than just typing code into a file, and it leads very naturally to an agentic workflow.

Another part, which I think is closely related, is the ability to iterate with code feedback. What I mean by that is: if you were handed an entire production codebase and told, hey, this has one bug, I need you to fix it, here's the bug — and let's say it's thousands of files and hundreds of thousands of lines of code — it would honestly be pretty tough for most humans, and it's also going to be quite tough for AIs. The way we do it in practice is that you go add print statements, you pull up the logs, you check the monitoring, you jump back and forth between different files, and you try to diagnose it. With each of those steps you're making a decision and then running actual code to find out what happened, and from that you're able to iterate, which gives you a much cleaner path to solving the problem in front of you.
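As a concrete (and deliberately generic) picture of that iterate-with-code-feedback pattern, here is a minimal sketch of a decide-act-observe loop; the run_command helper and the decide callback are my own assumptions for illustration, not Devin's actual implementation, and in a real agent the decide step would be a model call that reads the accumulated output.

```python
# Minimal, generic decide -> act -> observe loop for "iterating with code feedback".
# This is a sketch of the pattern being described, not Devin's implementation.
from typing import Callable, Optional
import subprocess

def run_command(cmd: str) -> str:
    """Run a shell command and return its output as the observation for the next step."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return f"$ {cmd}\n{result.stdout}{result.stderr}"

def iterate_with_code_feedback(
    goal: str,
    decide: Callable[[str, list[str]], Optional[str]],  # in practice: an LLM call
    max_steps: int = 20,
) -> list[str]:
    """Each step: pick a command from the goal plus past output, run it, record the result."""
    observations: list[str] = []
    for _ in range(max_steps):
        cmd = decide(goal, observations)
        if cmd is None:          # the decider believes the goal is met (or gives up)
            break
        observations.append(run_command(cmd))
    return observations

# Toy usage: a hard-coded "decider" that runs the test suite once and then stops.
if __name__ == "__main__":
    log = iterate_with_code_feedback(
        "make the tests pass",
        decide=lambda goal, obs: "pytest -x -q" if not obs else None,
    )
    print("\n".join(log))
```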
That kind of loop lends itself very well to agents. The last thing I want to mention here is just how fast agentic model capabilities are improving. Two years ago, even something as simple as this name-game demo would have been almost unthinkable, and when you think about where things are going and where they'll be two years from now — the data, the right training, and so on are all improving really rapidly in this space.

Then, beyond the capabilities problem, there's actually a really deep UX problem as well. At a high level, when we're building agents — and I think all of us in the space are quite new to agents — the immediate things to map them onto are how we use software today and how we talk with other humans. A lot of the features in Devin are essentially about looking over your own intern's shoulder: you can see their computer, you can see what commands they're running, and things like that. The thing is, I think an agent is actually pretty different from both. There are a lot of nuances and details around parallel work, information gathering, how it manages context, and so on that are super different, and it's a quite deep problem from a product perspective as well.

Just to give you a bit of a sense of that, here's a short list of some of the features we've built into the product. There's obviously Devin being able to use the shell, edit code, and browse the web, but there are all these other things: being able to fork and roll back sessions, handling integrations with Slack and GitHub, handling playbooks, storing machine snapshots, keeping track of secrets, working with the right tools for verification. All of this is part of the actual product iteration, which is on its own already an incredibly dense problem, and honestly I think we're going to see a lot more iteration on it over time.

I also wanted to show a new feature we just recently shipped, which is the ability to use Devin's machine. Again, it's the kind of thing that doesn't necessarily have a very close parallel in the software we have today, but it's the ability to just have a VS Code Live Share into Devin's machine, so if you want to collaborate with Devin and say, hey, these couple of lines — you should make this edit, or I went ahead and made that edit for you — you can just talk with Devin and do that. So there's a lot more room to go and a lot to iterate on in this space.

One of the other things I wanted to mention is just how much we've seen it change our own workflow. You saw a simple example of Devin building the search bar, but we actually handle tasks in a much more async way now. One of the cool features of Devin, I'd say, is that if, as an engineer, you're working on, say, four different tasks today, you just give one to
Devin number one, you give the second one to Devin number two, the third one to Devin number three, and you have four Devins all running in parallel. It's kind of turning every engineer into an engineering manager, is almost how I describe it. The Devins are like very enthusiastic interns, is what I'd say: they try very hard, they obviously don't know everything, they get little things wrong, and they ask a lot of questions, but you're working with each of them and having them iterate. Here's a fun example — this is literally from earlier today — where we were talking about some particular feature and what we wanted to build. In this case it was a pretty simple thing, changing a color, and it's as simple as saying in Slack, right in the conversation, hey @Devin, can you just change this thing, and then Devin goes and makes a PR and you hit merge. We've had a lot of occasions where we're in the gym or in the car or something and can now actually write code, because you can tell Devin exactly what you want it to do — you just don't have your whole computer with you and can't type everything — and being able to just describe what you want to Devin and review the code afterward actually works really well.

So what's next? I think this is a really important question, and obviously the technology is extremely early today, but where do these things go in a few years, and what happens with software engineering? I think there's been a lot of uncertainty about that question. As we use Devin more and more, one of the big things we see — and this is perhaps kind of obvious — is that Devin is not the one that decides what to do or what to build. There's a core part of software engineering here; the way I describe it is that software engineers everywhere are really doing two jobs at once. The first job is basically problem solving with code: you're given a problem and you're breaking down exactly what solution you're going to build — what architecture you're going to use, what the flows and the details and the edge cases that might come up are — and architecting your exact solution. The second part is, once you have that, dealing with debugging, implementing different functions, writing unit tests, and all the other things that go into the implementation of what you want to do. Right now the average software engineer is probably spending something like 10 or 20% of their time on that first, thinking part and 80 or 90% on the implementation part. What we really see is that Devin frees you up to do more of the first part. I think the future of Devin — again, it's very, very early — but as Devin gets better, we're going to see more of that, where Devin takes the implementation off your plate: you don't have to go figure out how to set up Kubernetes, you don't have to go debug all these broken APIs, you don't have to deal with version changes or migrations or all the other things that take up a lot of time in software engineering. Instead you actually get to spend all your
time figuring out how to solve the problems in front of you. It's a little more like a mix between a technical architect and a product manager, almost. So I think the job we call software engineering is going to change, but practically I think there are actually going to be way more software engineers than ever, and there's a lot of precedent for that. Programming used to mean punch cards, and after that it meant assembly, and after that it meant C, and as these things have gone on — most people aren't using punch cards anymore — there are actually way more programmers than before. One of the things that's easy to underestimate is just how much more code there is to write. It's funny to think about, because obviously we all love software here in this room; I would say software has been the number-one driver of progress in the world over the last 40 or 50 years, and yet despite that, I think our demand for software to be built is probably a lot more than 10x what we're currently getting. So I think what happens is that we open up the power of software engineering to a lot more people, every single software engineer gets to be five or ten times more effective, and we actually do a lot more software engineering. Cool — that's all I had, but we'd love to open the floor if there are any questions.

Yeah, right here in the front. Great question. So we've been ramping up access: every week we've been letting on more and more people, and we've also been scaling up with our enterprise customers. We have a long waitlist to get through, so we're doing it as fast as we can, but we'd love to get you access as soon as possible. Yeah — all the way in the back over there, yeah, in the red. Yeah, exactly — so in our codebase, for example, Devin has all the setup it needs. It has a machine that's basically instantiated where it can run the dev environment, run the server, and run the frontend, so if you ask it, hey, I need you to debug this particular thing, it'll just pull it up itself, reproduce it, and then debug it and try again. Yeah, exactly. Any other questions? Right here, yeah. Oh, of course — sorry, so someone asked: with all of these simpler tasks getting solved, what happens to all the junior engineers or the interns who obviously need to learn how to code? I think what happens, honestly, is that demand is going to keep rising with supply, and I think the training process is going to change a little bit, but a lot of the core fundamentals will stay. When you say someone is a really great engineer, you typically don't mean that they type really fast — although maybe they do that too — you typically mean that they have a really great understanding of problems, they know all the different architectures, they never miss an edge case, that kind of thing. Those are the fundamentals that I think are always going to matter, and I think interns and junior engineers are going to get to exercise those fundamentals earlier and earlier. Yeah — okay, so someone asked what
the biggest challenges are to realizing that vision of the future. There's a lot — it's basically everything, as you can imagine: there's speed, there's consistency, there's access, there's integrations, there's getting the product UX right. One of the cool things, I think, is how much of a rising tide there is everywhere. Obviously we're going to do our best work on it, but every new hardware release is amazing for it, every new foundation model that comes out is amazing, every new piece of agentic research helps, and I think this is the kind of thing where there will be a lot of different optimizations coming in different parts of the stack that make this agentic flow better and better and better. It won't just be one small thing, but I think it'll be pretty fast.

That's all the time we had, so thank you so much. Thank you guys so [Applause] much.