AI Engineer World’s Fair 2024 - Keynotes & Multimodality track
Channel: aiDotEngineer
Published at: 2024-06-28
YouTube video id: vaIiNZoXymg
Source: https://www.youtube.com/watch?v=vaIiNZoXymg
[Music]

Good morning, ladies and gentlemen. We will start our day in the Ballroom in 10 minutes. Thank you.

[Music]

Ladies and gentlemen, our program begins in 5 minutes.

[Music]

Ladies and gentlemen, please take your seats; our program is about to begin. Thank you. Ladies and gentlemen, please welcome back to the stage your host and co-founder of the AI Engineer Summit, Benjamin Dunphy.

Good morning, good morning, good morning! How are we doing? We all had fun last night? Is this the crowd that had fun and just came back, or is this a crowd that did not have fun? Looks like some people had too much fun. But thank you so much for coming back for the final day of the AI Engineer World's Fair; it's so great to see you all. I have 3 minutes and 45 seconds and a couple of announcements to make, so I'd like to get right to it.

I've been talking to a few people, when I get a chance, about the mobile app we put a lot of effort into and how it's going to help you with your attendance and interactions at this event, so I wanted to clear up a couple of things. Number one, the schedule: you can see session details, favorite sessions, and then go to My Schedule and see them all there. For a multitrack event, I just want to make that clear: you can build your own schedule here, and some people didn't get that memo. Of course you can also see all the sponsors, and you've got the map there, but the real killer feature is the generative matching and the other networking features. So I'd like to invite Simon Sturmer, the lead architect for this, on stage to help explain some features. Can we have a round of applause for Simon, please?

So Simon, we've got some generative profiles going on here, and these look to all be unique for each interaction. Tell me a little bit about what's going on behind the scenes. "What we do is we pull in your profile, or we build a profile from your socials and from the questions you answered when you registered. Then we create embeddings from there, put those into a vector database, do a cosine-similarity search, and pull your five most similar profiles." Very nice. Now I'm going to Apoorva's profile, and I can see all her talks there; I can also see some talking points, and these are generated for me as well. "That's right, those were generated in real time while that spinner was going." Awesome. And if we go to one of her sessions, you can rate it after you watch the talk; we've got session feedback built in. So this is a fully custom app.
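For reference, the matching flow Simon describes (build a profile, embed it, run a cosine-similarity search, take the top five) can be sketched in a few lines of Python. This is an illustrative sketch, not the app's actual code: the embed function stands in for a real embedding model, and the profiles are made up.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for a real embedding model; returns unit-length vectors."""
    rng = np.random.default_rng(42)  # deterministic placeholder
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

profiles = {
    "alice": "ML engineer, interested in RAG and evals",
    "bob": "frontend dev exploring AI agents",
    "carol": "infra engineer focused on GPUs and inference",
}
names = list(profiles)
matrix = embed([profiles[n] for n in names])  # one row per attendee

def top_matches(name: str, k: int = 5) -> list[tuple[str, float]]:
    """Rank everyone else by cosine similarity to one attendee."""
    query = matrix[names.index(name)]
    sims = matrix @ query  # unit vectors, so dot product == cosine similarity
    order = np.argsort(sims)[::-1]
    return [(names[i], float(sims[i])) for i in order if names[i] != name][:k]

print(top_matches("alice"))
```

In production you would hand the top-k search off to a vector database, which is exactly what Simon describes.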
All right, but here's the real killer feature, the one I like, because when I first came to San Francisco in 2015, I went to a tech conference and everyone was handing out business cards. When you meet someone at an event, how do you connect with them? "Let's exchange emails," right? That's one. "Give me your Twitter, your LinkedIn." It's always different with different people. Why don't we get them all right now, Mr. Simon? All right, I'm going to scan your badge. Boom, there we go. We have badge scanning for everyone. Now I'm connected: as soon as I scan his badge I get his email, even if he made it private at first; he can change his settings to make it default to sharing when I scan him. I've still got the talking points, the notes, etc., and now he's on my short list of scans. After the event I can review all my scans and even export all my connections, which will email them to me. Is that pretty cool? What do we think? All right, cool. So that's the matching. Thank you so much, Simon, I appreciate your help. Very cool. So that's the app; I just want to make sure we all understand that.

The other thing is that we have AI Engineer Jobs, at ai.engineer/jobs. Currently we're featuring jobs from our Expo partners, so go check those out if you're looking for your next opportunity. Any Expo partners who aren't up there yet, please email me and I'll get you up there before end of day; anyone else who wants to get jobs up there, email me and we can talk.

Lastly, one more announcement; I've got 20 seconds left. We're excited to announce that 2025 is now on sale. If you enjoyed your time here, this is the best time to lock in the best rate: go to ai.engineer/2025 or scan that QR code, and that will reserve your seat and lock in the best rate. We don't have dates locked in yet, but it will likely be around the same time of year. That's my time for today, so I'm pleased to welcome to the stage our opening keynote speaker, an absolute legend in this industry. Please join me in welcoming Chris Lattner. [Music]

All right, good morning everyone. I'm here to talk to you about Modular and accelerating the pace of AI. You all know what GenAI is, so I'm not going to tell you all about it. Let me tell you one thing I think is really cool about it, and very different from certain other technologies: it's super easy to deploy. There are lots of great endpoints out there, a lot of good implementations, a lot of ways to build a prototype and get going very quickly. But despite the availability of all these endpoints, sometimes you have other needs. Sometimes you want to control your data instead of sending it to somebody else. Sometimes you want to integrate it into your own security, because you've got critical company data in your model and you don't want to fine-tune it somewhere else. Sometimes you want to customize the model: there's research happening all the time, and a lot of things in building proprietary models work best for your use cases and can make your applications even better. And of course, inference endpoints are expensive, so sometimes you want to save money. Sometimes there's hardware out there that's really interesting and you want to explore beyond the mainstream. If you care about any of these things, what you need to do is go beyond the endpoint. So how do you do that? Many of you have explored this, I'm sure.
The answer has shifted. It used to be that we had things like PyTorch and TensorFlow and Caffe. As inference became more important, the world shifted: first we got ONNX and TensorRT and things like that, and today we have an explosion of different frameworks, some of which are specific to one model. That's cool if you care about that one model, but if you have many different things you want to deploy and work with, it's very frustrating to switch between all these different technologies. And of course it's not just the model: you all know there's a gigantic array of different technologies that get used to build real-world things in production, and none of these were really designed for GenAI.

My objection to the status quo is that this fragmentation slows down getting the research and innovation in GenAI into your products. We've seen so many demos; last year was really the year of the GenAI demo, but we're still struggling to get GenAI into products in an economical and good way. So whose fault is it? Is it our fault? Many of you are AI engineers, so let's sympathize with the plight of the AI engineer. The folks building this have new models and optimizations coming out every week. Every product needs to be enhanced with GenAI; this is not just one thing being dumped on us, and there's so much to do we can't even keep up. There's no time to deal with new hardware and all the other exciting new features. And once you get something that actually works, the costs end up making it very difficult to scale, because getting things into production means suddenly you're paying on a per-unit basis. So it's not the AI engineer's fault; we should look at the concerns and the challenges faced here. I think we need a new approach. We've learned so much; let's look at what we need to do, and how we solve and improve the world here. That's what Modular is about, so I'll give you a quick intro to what we're doing and our approach.

First of all, who are we? Modular is a fairly young company; we've been around for a couple of years. We've brought together some of the world's experts who built all of these things: we built TensorFlow and PyTorch, and we built compilers like LLVM, MLIR, and XLA. What I can say is that we learned a lot, and I apologize, because we know why it is so frustrating to use all these things. But the world looked very different five years ago; GenAI didn't exist. It's understandable; we tried really hard, but we've learned. Our goal is to make it so you can own your AI: you can own your data, control your product, and deploy where you want, and to make it much easier than the way current systems work today.

So how? What we're doing is really going back to basics. We're bringing together best-in-class technologies into one stack, not one solution per model. Our goal is to lift Python developers and PyTorch users; this is where the entire industry is, so we want to work with existing people. We're not trying to say, "hey, ditch everything you know and try something new"; we want to gradually teach and give folks new tools so they can have superpowers.
And finally, I spent a lot of time at Apple, and I want things that just work. You want to build on top of infrastructure; you do not want to have to be an expert in the infrastructure. That's the way all of this stuff should work, and unfortunately it's just not the case today in AI.

So at Modular we're building this technology called MAX, and I'll explain super fast what it is. MAX is two things. One is an AI framework, which I'll spend a bunch of time on; the framework is free and widely available, and we'll talk about it today. The other is our managed services; this is how Modular makes money, very traditional, and we're not going to spend a lot of time on that today.

If you dive into the AI framework, we see it as two things: it's the best way to deploy PyTorch, and it's also the best way to do GenAI, and both halves are really important. MAX is currently very focused on inference; these are the areas where PyTorch is challenging at times and where GenAI is driving us crazy with cost and complexity, so really focusing on this problem is something we're all about. The other thing, as I said before, is Python: we natively speak Python, because that is where the entire world is. We also have other options, including C++, which we'll talk about later.

How do we approach this? As I said, we work with PyTorch out of the box: you can bring your models, your model works, and we can talk to the wide array of PyTorch things like ONNX, TorchScript, torch.compile, and all this stuff, so you can pick your path, and that's all good. If you want to go deeper, you can use native APIs. Native APIs are great if you speak the language of KV caches and paged attention and things like that, and you care about pushing the state of the art of LLMs and other GenAI techniques. Also, MAX is very different in that it really rebuilds a ton of the stack, which I don't have time to go into: we do not build on top of cuDNN and the Nvidia libraries, or on top of the Intel libraries. We replace all of that with a single consistent stack, which is a really different approach, and I'll talk about what that means later. What you get is a whole bunch of technology you don't have to worry about: as a next-generation technology you get a lot of fancy compiler technology, runtimes, high-performance kernels, all of this in the box, and you don't have to worry about it, which is really the point.

Now, why would you use MAX? It's an AI framework, and you already have one, right? There are lots of reasons people might want an alternative. For example, developer velocity, your team being more productive: that's incredibly important, particularly if you're pushing the state of the art, but it's also very hard to quantify. So I'll do what people generally do and talk about the quantifiable thing, which is performance. I'll give you one example. We just shipped a release that has our fancy Int4/Int6 K-quantization approach, and this is actually 5x faster than llama.cpp.
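For background on what an Int4 quantization scheme is doing under the hood (the talk doesn't go into MAX's internals, so this is a generic illustration, not Modular's implementation), here is a minimal sketch of symmetric blockwise 4-bit weight quantization; the block size of 32 is an assumption:

```python
import numpy as np

def quantize_int4(weights: np.ndarray, block: int = 32):
    """Symmetric blockwise int4: one float scale per block, codes in [-8, 7]."""
    w = weights.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # guard against all-zero blocks
    codes = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_int4(codes, scales, shape):
    return (codes * scales).reshape(shape).astype(np.float32)

w = np.random.randn(4, 64).astype(np.float32)
codes, scales = quantize_int4(w)
w_hat = dequantize_int4(codes, scales, w.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The practical win is memory bandwidth: 4-bit codes mean roughly 8x less weight data to stream than fp32, and weight streaming is often the bottleneck for LLM inference on CPUs.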
So if you're using llama.cpp on cloud CPUs today, this is actually a pretty big deal: 5x can have a big impact on the perceived latency of your product and on its performance and cost characteristics. The way this is possible is, again, that combination of really crazy compiler technology and other stuff underneath the covers, and the fact that you don't have to care about that is pretty nice. It's also nice that this isn't just one model: we make the technology easy to use and then demonstrate it with a model people are very familiar with. So if you care about this kind of stuff, this is pretty interesting: it's a next-generation approach to a lot of things that are very familiar, but done in a generalizable way.

Now, CPUs are cool, and so far we've been talking about CPUs, but GPUs are also cool. What I've seen is that CPUs and AI are fairly well understood, but GPUs are where most of the pain is, so I'll talk a little about our approach there. Before I tell you what we're doing, let me tell you our dream, and this is not a small ambition; this is kind of a crazy dream. Imagine a world where you can program a GPU as easily as you can program a CPU, in Python, not C++. In Python. That is a very different thing from the world today. Imagine a world in which you can actually get better utilization from the GPUs you're already paying for. I don't know your workload, but you're probably somewhere between 30% and maybe 50% utilization, which means you're paying for two to three times the amount of GPU that you should be. That's understandable given the technology today, but it's not great, for lots of obvious reasons. Imagine a world where you have the full power of CUDA, so you don't have to choose between the powerful thing and the easy-to-use thing; you can have one technology stack that scales well. This is something that's really hard. Nvidia has a lot of very good software people, and they've been working on this for 15 years, but I don't know about you, I don't run 15-year-old software on my cell phone; it doesn't run BlackBerry software either. I think it's time to really rethink this technology stack and push the world forward, and that's what we're trying to do.

So how does it work? Well, "it's just like PyTorch: you use one line of code and switch out CPU for GPU." Ha, we've all seen this, right? This doesn't say anything. I actually hate this kind of demo, because the way it's usually implemented is with a big fork at the top of two completely different technology stacks, one built on top of Intel MKL and one built on top of CUDA, and as a consequence nothing actually works the same except the thing on the slide. What Modular has done is go down and say: let's replace that entire layer of technology. Let's replace the matrix multiplications, the fused attention layers, the graph pieces, all of this stuff, and make it work super easily and super predictably, all stitched together. Yes, it looks fine on a slide, but the slide is missing the point.
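For anyone who hasn't seen the one-liner Chris is poking at, this is the familiar PyTorch pattern; his point is that everything below this line dispatches into two very different vendor stacks:

```python
import torch

model = torch.nn.Linear(1024, 1024)
x = torch.randn(8, 1024)

# The famous "one line": identical user code, but underneath it forks into
# completely different backends (CPU kernels vs. CUDA kernels).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, x = model.to(device), x.to(device)
print(model(x).shape, "on", device)
```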
If you are an advanced developer, and I know many of you don't want to know about this, and that's cool, but if you are, you get the full power of CUDA: if you want, you can write custom kernels directly against MAX, and that's great. For advanced developers, and I'm not going to dive too deeply into this, it's way easier to use than things like the Triton language; it has good developer tools and all the things you'd expect from a world-class implementation of GPU programming technology. For people who don't want to write kernels, you also get a very fancy auto-fusing compiler, so you get good performance in the normal cases without having to write hand-fused kernels, which again is a major usability improvement.

Now, there are a lot of things out there that promise to be easy, but what about performance? A lot of the reason to use a GPU in the first place is performance. One of the things that's very important to Modular is that we're not comparing against low standards; we're comparing against the vendor's best, in this case Nvidia, and they're experts in their own architecture. There are a million ways to measure things, but take a micro-benchmark of the core operation within a neural network, matrix multiplication, the most important thing for a wide variety of workloads. Again, this is one set of data, but we compare against cuBLAS, the hardcoded thing, and also against CUTLASS, the more programmable C++-ish thing, and MAX is meeting and beating both of these, by a little bit. It depends on your bar, and data is complicated, but if you're winning by 30%, that's actually a pretty big deal given the amount of cost, complexity, and effort that goes into these kinds of things.

I've talked a lot about the what, but I haven't talked about the how, and the how is actually a very important part of this, so I'll give you a sample. We are crazy enough that we decided to rebuild the world's AI stack from the bottom up for GenAI, and as part of doing that we realized we had to go even deeper: we built a new programming language called Mojo. The thing about Mojo is, if you don't want to know about Mojo, you don't have to use it; you can just use MAX and that's fine. But we had to build Mojo in order to build MAX. I'll tell you just a couple of things about it. Our goal is for Mojo to be the best way to extend Python, which means you can get out of C, C++, and Rust. What is it as a programming language? It's Pythonic: it looks like Python, it feels like Python, and everything you know about Python comes over, so you don't have to retrain on everything, which is a really big deal. You get a full toolchain you can download on your computer and use in Visual Studio Code; it's open source and available on Linux, Mac, and Windows, with a couple hundred thousand users and twenty thousand people in Discord. It's really cool, and I'd love for you to check it out if you're interested.

But what is Mojo, actually? Fine, there's a programming language thing going on. What we decided is that AI needs two things. It needs everything that's amazing about Python: in my opinion that's the developers, the ecosystem, the libraries, the community, even the package managing, all the things people are already used to using. Those are the things that are great about Python.
What is not great about Python, unfortunately, is its implementation. So what we've done is combine the things that are great about Python with some very fancy high-performance compiler stuff, MLIR, all this good stuff, which then allows us to build something really special. While it looks like Python, please do forget everything you know about Python, because this is a different beast. I'm not going to give you a full hour-long presentation on Mojo, but I'll give you one example of why it's a different beast, and I'll pull it back to something many of you care about, which is performance. Mojo is fast. How fast? It depends. This isn't a slightly faster Python; this is a working-back-from-the-speed-of-light-of-the-hardware kind of system, and many people have found that it's 100 to 1,000 times faster; in crazy cases it can be even better than that.

But the speed is not the point; the point is what it means. In Python, for example, you should never write a for loop if you care about performance; Python is just not designed for writing for loops. In Mojo, you can write code that does arbitrary things. Here's an example pulled from our Llama 3 implementation written in Mojo that does tokenization using a standard algorithm: it's chasing linked lists, it has if statements and for loops, it's just normal code, and it feels like Python. That is really the point. So for you, the benefit of Mojo is, first of all, that you can ignore it if you don't care. But if you do care, you don't have to learn C or C++; you get lower cost by default versus Python, because performance is cost; as a researcher, you can have full-stack hackability; and as a manager, it means you don't have to have people on your team who know Rust and C++ and things like that. You can have a much more coherent engineering structure, where you're able to scale into the problem no matter where it is. And if you want to see something super polarizing, go check the Modular blog, where we explain how it's actually faster than Rust, which many people consider the gold standard, even though it is, again, a 15-year-old language.

I have to wrap things up; they'll get mad at me if I go over. The thing I'm here to say is that many of you may want to go beyond the API. The APIs are fantastic, there's amazing technology out there, and I'm very excited about them too. But if you care about control over your data, integrating into your security, customization, saving money, or portability across hardware, then you need to get onto something else. If you're interested in those things, MAX can be very interesting to you. MAX is free, you can download it today, it's totally available, go nuts. We didn't talk about production or deployment, but if you want to do that we can also help: we support production deployment on Kubernetes and SageMaker, and we can make it super easy for you. Our GPU support, like I said, is actually really hard; we're working really hard on it and we want to do it right, so it will launch officially in September. If you join our Discord you can get early access, and we'd be very happy to work with you ahead of that too. We're cranking out new stuff all the time, so if you're interested in learning more, check out modular.com, find us on GitHub (a lot of this is open source), and join our Discord. Thank you, everyone. [Applause]
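As a quick aside before the next talk: Chris's "never write a for loop in Python" point is easy to feel with a toy micro-benchmark. This is pure illustration of interpreter overhead (the exact ratio depends on your machine, and it says nothing about Mojo itself):

```python
import time
import numpy as np

xs = np.random.rand(10_000_000)

t0 = time.perf_counter()
total = 0.0
for v in xs:  # interpreted loop: one bytecode dispatch per element
    total += v
t1 = time.perf_counter()

vec_total = xs.sum()  # vectorized: a single compiled loop over the buffer
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.2f}s  vectorized: {t2 - t1:.4f}s  "
      f"(~{(t1 - t0) / (t2 - t1):.0f}x difference)")
```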
Ladies and gentlemen, please welcome to the stage Principal Developer Advocate at AWS, Antje Barth.

Hi everyone, I'm so excited to be part of this conference and to share with you five practical steps from software developer to AI engineer. And if anyone is wondering about these avatars on the slide: this is what happens if you ask AI to make you look a little bit more agentic. All right, let's get started.

I'm pretty sure everyone is familiar with this image and the post from swyx that defines the new role of the AI engineer. As you've probably experienced daily in your jobs, you don't need to be a full ML researcher or data scientist anymore; things that took months or years to get AI projects into production can now be just a couple of API calls. Super exciting. Still, if you're working with AI, it makes sense to understand the basics of the technology, and that involves a couple of things. You have to understand, at a basic level, how foundation models work and why they sometimes produce output you don't expect in your application code. You have to understand how you can customize models, for example by fine-tuning them to adapt them to your specific use cases and datasets, and how to include functions in your application code to give models access to additional systems. The good news is that if you're just starting on this journey to becoming an AI engineer, there are plenty of resources available these days. I want to call out one specific course, "Generative AI with Large Language Models": a few colleagues of mine collaborated with Andrew Ng and the team at DeepLearning.AI to put this course together and help you understand the fundamentals of generative AI so you can build real-world applications. If you're curious, it's available on deeplearning.ai and on Coursera.

The second step in this journey is to start getting hands-on with the AI developer tools that help you increase your productivity. I think we've all seen this quote, and we experience it in our daily jobs: how we do work and how we develop applications has changed a lot. We can literally use natural-language inputs to interact with applications, and English really has become one of the hottest programming languages. We can see this happening everywhere: you can go from English to code, for example by asking AI to turn a README file into working code, and you can do code to English, for example by asking AI to document the functions in your code. But that's not all. If we look at the software development lifecycle, I think many of us can agree that the majority of our time is usually spent not on writing valuable code but on all the other things around it, sometimes up to 70% on less valuable tasks: writing boilerplate code, writing documentation, trying to maintain old codebases. Sometimes we only have a fraction of the time, maybe 30%, for the creative tasks in software development that actually create joy. This is what inspired us at AWS to create Amazon Q. Amazon Q is a generative-AI-powered assistant specifically developed for software development, and it's much more than just a coding assistant: Q Developer actually uses agents to perform much more complex tasks and help you automate them, for example feature development and code transformation. Think about working with old Java-based codebases that you need to migrate to a newer Java version. To show you how this works, I asked my colleague Mike Chambers to put together a quick demo. Let's have a look.
"With Amazon Q installed inside my IDE, I can go to a new tab and start a conversation with Amazon Q Developer. I can do the kinds of things you'd expect, such as asking 'how can I create a serverless application, how do I get started?', and the chat session brings back a list of instructions: start by installing the AWS SAM CLI, how to do that, where to get it, and how to step through creating a project. If I've done that, then SAM, for example, might come back with some generated code, and here is that code. Maybe I don't quite know what this code does, so I can right-click on the code and send it to Amazon Q, asking it to explain; the code goes into a prompt along with 'explain' and generates an answer. This is great for code that's been generated for us, but also imagine code from legacy systems, something that was worked on years ago by somebody else, where you can get Amazon Q to help explain it.

We can also get Amazon Q to generate code. This is, again, probably the kind of thing you'd expect: I can put a comment line inside my code, in this case saying I want to create an input-checking function, and I'll give it some more definition, that I actually want it to trim any string that's sent into this function, and yes, Amazon Q can generate this small function. That's great, but what if I've got more code that needs to be generated? I can go to the chat, type /dev, and put in a much more comprehensive description of something I'd like. In this particular case, I'm going to ask it to write a function to search by category in DynamoDB, with a bunch of details about how I want the output to be formatted, so this is much more than a single line or a few lines of code. What's going to happen is that it will come back with a step-by-step list of what's required.
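To make the request concrete, the kind of function being asked for might look roughly like this. This is a Python/boto3 sketch for illustration only: the demo itself targets Node.js, and the table and index names here are invented.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Products")  # hypothetical table name

def search_by_category(category: str, limit: int = 25) -> list[dict]:
    """Query items by category via a (hypothetical) GSI on 'category'."""
    response = table.query(
        IndexName="category-index",  # assumed index name
        KeyConditionExpression=Key("category").eq(category),
        Limit=limit,
    )
    return response.get("Items", [])

print(search_by_category("books"))
```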
I need to add to template.yaml, and it's recommending that I create search-by-category.mjs, among many other things. But this isn't just a big shopping list of things I need to do; it's actually a plan, and it's a plan that Amazon Q can follow for us. It generates code as a change set, something we can review as a diff between our current code and what it suggests, and if we like it we can click the insert-code button and it will add all of that code into our project, way more than just a couple of lines. So Amazon Q Developer is much more than code completion."

All right, if you're curious to learn more about Amazon Q Developer, we have a couple more sessions throughout the day, so make sure you check out those Expo sessions, and we also have a session at our AWS booth. You can also visit our Amazon Q Developer Center for many more examples of what you can do with it.

Let's come to step three, and this is where the fun starts: prototyping and building with AI. The fun includes a couple of steps, and everyone developing with AI knows this. It all starts with defining your use case, and then you're on this road of choosing between different models and customizing them to your use case: deciding whether prompt engineering is enough, whether you do RAG, or whether you need to do a little fine-tuning with your data. Across the whole development workflow you have to incorporate responsible-AI policies, making sure data is private and secure, and implement guardrails in your application. Then, when you integrate, there's another fun part, working with agents, which we're hearing a lot about throughout this conference, and then the fun topic of keeping everything up to date; there are a lot of terms for that: GenOps, FMOps, LLMOps. So there's really a lot to consider here.

I want to dive briefly into the topic of choosing models, because this is really important: you have to evaluate models thoroughly, because most likely there is no one size that fits all. In fact, if you look at all the use cases you want to implement, there's likely no one model to rule them all. This is why we developed Amazon Bedrock. Bedrock is a fully managed service that gives you access to a wide range of leading foundation models you can start experimenting with and implementing into your applications. It also integrates the tooling you need to customize your models, whether that's fine-tuning, RAG workflows, or building agents, and everything runs in a secure environment where you are in full control of your data. Speaking of choice, just to give you a quick overview, as of today this is the selection of models you can choose from: we're working with leading companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, and Stability AI, and we also offer our own Amazon Titan models. And I'm super excited to call out that last week, together with Anthropic, we launched Claude 3.5 Sonnet on Amazon Bedrock, so since last week you can use this model as well.

Now, with choice also comes responsibility, and we continuously innovate to make it easier for you to build applications across the different model types. Just a few weeks ago we introduced the new unified Converse API in Amazon Bedrock. What does it do?
The unified Converse API gives you a single, structured invocation method, meaning you can use the same parameters and request bodies regardless of which model you choose. On the platform side, we handle the translation when parameters are named differently across models, we handle the system, user, and assistant prompts for you, we give you a consistent output format, and there is native function-calling support as well. Let me show you how this looks in code. Here's a Python example using the new API: we start by creating the client with the Python SDK, then define a list of messages, and this is where you put your user message prompts; you can include system prompts as well. Then you pass that message list in a single API call using the Converse API: in the model ID you choose which model you want to test (here I'm using an Anthropic model), and you pass the messages along with the inference parameters. Again, in this API all those parameters are standardized, and we do the work behind the covers to convert them to the specific format the model expects, so you have an easy way to work across different models. Similarly, for function calling, support is built in for the models that support it: you define a tool list, where the tools are equivalent to the functions you want to give the model access to, and when you make the Converse API call you pass in this list of tools. If you want to find out more about the Converse API, here's a link to our generative-AI space on community.aws, which has a lot more tutorials and code examples, not just for Python but across different languages as well, so check it out. The author, Dennis Traub, is also somewhere here in the audience this week, so if you want to connect with him and talk about different code examples and how to use the API, feel free to reach out.
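The call Antje walks through looks roughly like this; a minimal sketch where the region, prompt, and tool definition are placeholders, shown with a real Bedrock model ID for Claude 3.5 Sonnet:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Optional "tool list": the Converse API's native function-calling support.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_weather",  # hypothetical client-side function
            "description": "Look up the current weather for a city.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            }},
        }
    }]
}

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    system=[{"text": "You are a helpful, concise assistant."}],
    messages=[{"role": "user", "content": [{"text": "What should I pack for Seattle?"}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.5},  # standardized params
    toolConfig=tool_config,
)
print(response["output"]["message"]["content"])
```

Swapping in another provider's model is, in principle, just a change to modelId; the message shape and inference parameters stay the same, which is the point of the unified API.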
All right, now let's integrate AI into our applications. This could be a whole session on its own, but I want to focus on one of the hottest topics we're discussing during this conference, which is of course agents, and I have one more demo: I asked my colleague Mike to put together another exciting demo to show you what you can do with agents. Mike?

"Hi! We're able to create agentic workflows right inside the AWS console, and inside the service it works fully serverless. I've used it to create an agent that plays Minecraft; let me show you how I did it. If we jump into the AWS console and go down the menu on the left-hand side to Agents, you can see the agents screen, and I can open up my Minecraft agent. If I go into the Agent Builder and expand the screen a little, you can see some of the parameters I used to create this agent: the large language model I used, in this case Claude 3 Haiku, and the instructions for the agent. These are not notes for myself; this is actually prompt engineering, explaining how we want the agent, in this case the Minecraft bot, to play the game. Then we also have to add some tools, some Minecraft tools, and we do that through actions, inside action groups. I've got a couple of different action groups, Minecraft Actions and Minecraft Experimental, so let's have a look at Actions. Inside here we can see some really simple actions the bot will be able to do, all linked up to code. We've got the action to jump, and we've got the action to dig; you can see the description for the action to dig has some instructions, and again, this is prompt engineering. Then we've got some parameters that we can collect; in fact, we require these parameters, so the bot needs to get them for us. If I scroll down a little further, there are a couple of really simple actions in here: an action to get a player's location and an action to move to a location. I want to show you those in action, because the bot can actually problem-solve and reason its way through using these tools to solve simple problems. Let's jump into the game. It's nighttime, so let's set it to daytime so we can see what's going on: 'set time to day'. Okay, and there in the middle of the screen you can see Rocky. Rocky is the Bedrock agent running inside the game, and we can talk to it and have a chat session. But what if we want it to come to us? There is no tool to come to us. So I'm going to back up a little further, to make it a bit more of a challenge, and I'm going to say 'come to me' in chat. What happens now is that the agent reasons through a whole set of actions: it looks to see who requested that, takes that name (that's my name), finds the location of that player, and then maps a path from where it was to me. All of those things happened in the blink of an eye, and it's an agentic workflow making it all happen. This is super exciting; I'm discovering new things this bot can do every day. But with that, back to you."

All right, thank you Mike. If you're curious how we did this, check out our booth session (we're running the demo there as well), and we have another session in the agents track later today, so make sure you pop in if you want to know more. You can also find the project code on GitHub, so if you want to play with it on your own and see how you can integrate agents into something fun, check out the project.

We're almost there. The last step I really want to call out is to stay up to date. There's so much happening in this space, as you all know, and a really good way to keep up is to engage with the community. Speaking of community, I have one last announcement, and I'm super excited to make it: we're transforming our AWS Loft here in San Francisco into an AI engineering hub for the community. We're excited to host workshops, events, and meetups there. If you want to suggest the topics you're most interested in, to make those events most valuable to you, fill out this quick survey; and if you're interested in speaking or hosting a meetup yourself, let us know. We also have another event tonight, which I think has just reached capacity: a happy hour with Anthropic at the Loft. In case you didn't make it in, don't worry, we're working on putting together many more events like this in the upcoming weeks and months, so keep an eye out for those.

And with that, I'm coming to the end of my presentation. That wraps up the five practical steps to become an AI engineer. Let's innovate together; I'm looking forward to seeing what you build with AI. Thanks so much, make sure you check out the rest of the sessions, and pop by our booth outside. Thanks so much.

Ladies and gentlemen, please welcome to the stage, from Developer Relations at Anthropic, Alex Albert.
All righty, morning everyone. Today I want to start with a little story, a short history lesson if you will, so sit back and get comfortable. I'm going to take us back to the year 1882. It's the dawn of the electrical revolution; the world's first commercial power plant has just opened. Electricity, this amazing new force, is all the rage in the manufacturing industry, and people are claiming it's going to change everything. And yet something very interesting happened around this time, or rather, it didn't happen. Despite electricity's obvious superiority over the traditional techniques of the time, like steam engines, it didn't immediately improve manufacturing productivity. Why? Because factory owners were simply trying to drop the new technology into an outdated paradigm. Picture a typical factory of the time: a huge coal-fired steam engine on one end, and a network of transmission lines running across the top, driving hundreds of machines locked in the same rhythm. These legacy steam-powered factories were incredibly inefficient: if one station needed power, you had to turn on the entire steam engine and power all of them. Factory layouts were dictated by the limitations of the transmission lines, not by what was best for the process or for the workers. When electricity arrived, many factory owners simply swapped out the steam engine for an electric one, and sure, they added some lights, and workers didn't have to toil next to a coal-fired furnace all day, but the fundamental limitations of the factory remained. The real electrical revolution didn't come until we reimagined factories from the ground up with electricity at their core. Factories became flexible and adaptable; they allowed for smaller, specialized tools; workers could bring their tools to the items instead of having to lug the items back to their workstations. The entire manufacturing process became more efficient, more humane, and more productive.

Now let's fast-forward 140-something years to today, and you can see we find ourselves at a similar point with AI and LLMs. Enterprises, startups, and developers are all building and integrating LLMs into their products, but often they're just tacking them onto their existing product surface, adding a few star-icon buttons in the top-left corner and calling it a day. And this is not the first time we've seen this in Silicon Valley. Think back to when mobile first emerged: companies simply tried to shrink down their website and put it on a phone. It wasn't until we redesigned apps from the ground up around the unique capabilities of mobile, like the always-on camera and GPS, that we began to see true innovation and adoption in the space; that's when the Snapchats and Ubers of the world started to emerge. So just as factories went through their replace-the-steam-engine-with-an-electric-one phase, and tech companies went through their just-hire-a-couple-mobile-web-devs phase, we're now in our magic-star-icon phase with respect to AI. And yeah, it's funny, but the thing is, you can't blame any of the companies or developers trying to do this right now; all of us are trying. In many ways we're just still so early. LLMs are non-deterministic, they're hard to build on, and they're completely different from what most developers are used to. Reliability is still an issue.
Prompts still take rounds and rounds of optimization, and we've only just started to scratch the surface of the potential product opportunities. So far, not much has really stuck beyond the text box; we've been missing something that's a little hard to put a finger on. But just last week, I think we scratched the surface of a potential new product future we can build. As some of you may have heard, last Thursday we released our new model, Claude 3.5 Sonnet. 3.5 Sonnet is the first model we've released in the new Claude 3.5 family; it's only the middle model, and yet it's better than our last best model, Claude 3 Opus. In my opinion, Claude 3.5 Sonnet is one of the best models in the world right now, and the benchmarks seem to back it up: MMLU, HumanEval, GPQA, tool use, all the common characters. It's top of its class in many regards in these academic, lab-type environments, but what I'm most excited about is how it does in the real world. The model is particularly strong in RAG use cases thanks to its 200K context, and it has near-perfect recall over that entire context as well. On coding tasks, 3.5 Sonnet seems to grasp debugging problems better; it's not getting stuck in the same loops as much as previous models. One of the best methods we've found for measuring more complicated chains of reasoning is pull requests: they have a defined task, they usually take a few steps to solve, and the model is able to iteratively write and test its way to a solution. In our own internal pull-request evals, Claude 3.5 Sonnet scores 64%, and to put that number in context, Claude 3 Opus scored only 38%. 3.5 Sonnet also has state-of-the-art vision abilities, showing considerable improvement over 3 Opus in basically every benchmark we tested. Things like table transcription and OCR are a breeze now: I passed this table to 3.5 Sonnet and it basically replicated it perfectly in Markdown. You probably can't read all those numbers, but trust me, I double-checked them to make sure they're right. The vision capabilities were actually what amazed me most when I started playing with this model; it feels like we're really on the cutting edge of unlocking many more use cases.

As you hear me say all this, you might be thinking, "that's great, Alex, but it doesn't mean anything if I can't actually use the model." You're right, and we heard you; that's why 3.5 Sonnet is available on our API, AWS Bedrock, and Vertex AI. We understand that developers want choice when they're building, and we want Claude to be available wherever you are. In terms of pricing, 3.5 Sonnet is five times cheaper than 3 Opus: it's only $3 per million input tokens and $15 per million output tokens. 3.5 Sonnet's combination of speed, intelligence, and low cost makes it much more economical to use and embed in your apps than 3 Opus.

But 3.5 Sonnet is not all we've released in the past week. We also released a new product feature that I think is even more inspiring for developers thinking about building AI products from the ground up: it's called Artifacts. Artifacts separate the content Claude produces from the chat dialogue itself, which allows you to work collaboratively with Claude on everything from documents to SVGs to React websites. Artifacts become really powerful when you combine them with 3.5 Sonnet: those coding skills, plus that reasoning ability, plus that strong visual acuity enable a new product experience that's really fun to use. It's also a developer's best friend.
It allows you to quickly take screenshots and Figma diagrams and turn them into code and components you can actually go use. As you can see here, I basically cloned the entire claude.ai chat layout in React from a single screenshot. This capability has practically been hiding in plain sight, just waiting to be discovered, for over a year and a half. Maybe this tweet is right and we really are early on the S-curve of productionizing LLMs, which I think is actually pretty inspiring.

And Artifacts is not the only AI feature we've launched recently. On Tuesday we released Projects. Projects enables dev teams to work and collaborate much more efficiently by grounding Claude's outputs in your own knowledge, whether that's style guides, codebases, transcripts, or even your past work, on our Claude Team plan. You can even share these projects and chats with all your teammates. At Anthropic, our engineers now upload the code repos and documentation they use, and I've started to see people share the chats and the artifacts instead of Google Docs or site documentation. Projects is another great example of how, when you think from an LLM and AI standpoint first, you can build product experiences that complement these technologies and don't feel like a simple add-on to what you already have.

So now that the creative product juices are hopefully flowing in everyone's minds, I want to dive into the API improvements we've rolled out recently, the things that let you actually build this cool stuff, and give a preview of what's coming next that will enable you to build even more. A month ago we released our new tool use API. Tool use allows you to give Claude custom client-side functions that it can then intelligently leverage, and it also enables things like consistent structured JSON output. With 3.5 Sonnet, I've actually started to see devs give Claude hundreds of tools at a time.
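A minimal sketch of that tool-use flow with the Anthropic Python SDK; the tool here is a made-up example, and in a real app you would execute the function yourself and send back a tool_result message:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_stock_price",  # hypothetical client-side function
    "description": "Get the current stock price for a ticker symbol.",
    "input_schema": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}]

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's Amazon trading at?"}],
)

# If Claude decides to call a tool, the response includes a tool_use block
# with structured JSON arguments that conform to the input_schema.
for block in message.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```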
On the developer console front, we're also continuing to iterate. We added a prompt generator that uses Claude to write prompts for you based on a task description; you can see in this video that we put in a task description and out comes an optimized prompt, and once the prompt is done you can start editing it right in the Workbench itself. We've also added support for variables, so you can edit prompt templates and test things like RAG use cases. And finally, we're working on a new Evaluate feature, which is in the console right now with a beta tag; we plan to share more on it and continue to iterate on it very soon.

So what else is next? There are two things I can share right now. First, you can expect more models: 3.5 Haiku and 3.5 Opus are coming later this year. With each model generation, we're looking to increase the intelligence, decrease the latency, and decrease the cost. The number one thing I tell developers is to build with that in mind: models will become smarter, cheaper, and faster in a matter of months, not years, so when you're planning your product roadmap, be ambitious enough to build with the belief that new models may arrive during your development period.

We're also working on other areas of research, like interpretability. In one of our latest papers, called Scaling Monosemanticity, we explained how we've been able to find features within models that activate for different topics. Once you identify a feature, you're able to clamp its value and turn it up or down to actually steer the model's outputs. A few weeks ago we showed claude.ai users how this works through Golden Gate Claude, a version of Claude that had the Golden Gate Bridge feature turned up significantly. Yeah, a fan favorite. We currently have a few beta testers experimenting with a steering API as well: it allows developers to find and clamp features for specific attributes and turn that dial up or down, which again lets you control Claude's outputs in addition to just prompting it. We hope to roll this out to more developers in the very near future.

Now, if anything in this talk has sparked any ideas, I want to encourage you to go out there and build, and to make quick prototypes as fast as you can to get that validation and feedback loop started. For even more of an incentive, we actually just launched another Build with Claude contest yesterday. It runs until July 10th, and the top three projects will each receive $10K in Anthropic API credits. To see more details, visit the link below; it's also at the top of our docs page, so you can find it there too. I'll leave that up for a second. And finally, if you have any questions or want to hear more about what we're thinking about, I'll be at the AWS booth down the hall for the next few hours. You can also find me on X/Twitter at alexalbert__ (with two underscores); I do try to read all my DMs, and I spend way too much time on that site, so feel free to ask questions there as well. With that, thank you very much, and enjoy the last day of the summit. [Applause]

Ladies and gentlemen, please welcome to the stage the CEO of LangChain, Harrison Chase.

Hello. Today I want to talk about agents. LLM-powered agents are really nothing new: the ReAct paper came out in October of 2022, LangChain launched about a month after that, and AutoGPT is over a year old. To me, AutoGPT represents the peak of hype in agents, and I actually think that for a few months after that there was a bit of a falloff in interest, as people realized the generic agent architecture wasn't reliable enough to build systems to ship to production. While there was this falloff, I do think some really interesting work was being done; OpenAI's Assistants API, I think, was really novel in a few regards, and I'll come back to that in a little bit. Earlier this year we launched LangGraph. While LangChain did agents, it also did a bunch of other things; LangGraph is purpose-built for agents. What exactly does that mean? LangGraph is highly controllable and low-level. As mentioned, we saw that these generic agent architectures weren't reliable enough, and the companies that were shipping agents to production were building custom cognitive architectures, encoding little differences in how they wanted their agents to behave. This was super important, so we made LangGraph extremely low-level and controllable. It also comes with a built-in persistence layer, which enables a lot of really cool human-in-the-loop interaction patterns, and it's streaming-first, because streaming is really important for LLM UIs. And just to emphasize: LangGraph works with or without LangChain, and it integrates seamlessly with LangSmith, our testing and observability platform.
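To ground what "low-level and controllable, with built-in persistence and human-in-the-loop breakpoints" looks like in code, here is a minimal LangGraph sketch; the node logic is a placeholder, and exact import paths may differ between LangGraph versions:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class AgentState(TypedDict):
    question: str
    draft: str

def plan(state: AgentState) -> dict:
    # Placeholder: a real agent would call an LLM here.
    return {"draft": f"Plan for: {state['question']}"}

def act(state: AgentState) -> dict:
    # Placeholder: this is where a tool call would happen.
    return {"draft": state["draft"] + " [executed]"}

graph = StateGraph(AgentState)   # you define the state shape, not the framework
graph.add_node("plan", plan)
graph.add_node("act", act)
graph.set_entry_point("plan")
graph.add_edge("plan", "act")
graph.add_edge("act", END)

app = graph.compile(
    checkpointer=MemorySaver(),  # the built-in persistence layer
    interrupt_before=["act"],    # breakpoint: pause before the tool step
)

config = {"configurable": {"thread_id": "demo-1"}}
app.invoke({"question": "book a flight", "draft": ""}, config)
print(app.get_state(config).next)  # ('act',) while paused, awaiting approval
app.invoke(None, config)           # resume once the human confirms
```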
LangGraph is already being used in production by a lot of our awesome partners, ranging from cutting-edge open-source projects like GPT Researcher, to trailblazing unicorns like Replit, to public companies innovating at scale like Norwegian Cruise Line, Ally Bank, and Elastic. Today we're excited to announce the first stable version of LangGraph, reaffirming our commitment to building an agent framework that allows you to build the custom cognitive architectures necessary for bringing agents to production.

But that's not the only thing we're launching. I want to go back to the Assistants API, which I mentioned earlier. I think there were a lot of really cool and novel things there. It wasn't just a framework for building agents; it also introduced infrastructure components. It had built-in persistence: it kept track of the messages and stored them for you, so you didn't have to do that. It had this concept of background runs, really good for longer-running asynchronous workloads, and it allowed you to configure agents. The downside is that it didn't give you full control over the cognitive architecture of your application: it came with a specific state it expected your application to have, a list of messages, and it was a little rigid and didn't easily let you do other things beyond that. That got us thinking: what if we took LangGraph, which lets you build these custom cognitive architectures, and combined it with those generic agent infrastructure pieces? Today we're excited to announce LangGraph Cloud, which is a step in that direction.

With LangGraph Cloud, you can take your LangGraph applications, written in Python or JavaScript code, and with no changes get a production-ready agent API. That agent API has all the benefits of the Assistants API: it comes with built-in persistence for whatever the state of your LangGraph agent is, it comes with a task queue to manage background runs, and you can configure different instances of your graph to swap out the LLM or prompt that's used. But we've also added a few other things that we're really excited about. When you kick off an agent run and send it a message, and it goes off and does a bunch of work, what happens when you send it another message before it's finished? We call this double texting, and we've introduced four different modes to handle it. Agents aren't just invoked through chat; they're also triggered on schedules, and LangGraph Cloud comes with built-in cron jobs to easily support this. I talked about human-in-the-loop: one of the really important human-in-the-loop features we're seeing is the ability to break before specific steps, for example when you have a tool that you really want the user to approve access to. LangGraph Cloud comes with breakpoints, which let you add this and then resume once you get that confirmation. And that's not the only human-in-the-loop feature that's supported: with LangGraph Cloud and the built-in persistence, you can easily go back to any step in the agent's trajectory, edit it, and then resume from there. This supports a bunch of really cool time-travel-like features that we think will be very important for the UXs of the future. And just to emphasize another benefit of LangGraph Cloud: it's not bound to OpenAI, and it supports any cognitive architecture you can build with LangGraph.

Finally, I want to talk about LangGraph Studio. LangGraph Studio is what we think is the best way to build, debug, and share agents. Taking a look at this video, you can see that you can easily see the graph of the agent; you can invoke it and get streaming output of all the steps and the tokens; you can go back, modify steps, edit them, and resume from there; and you can also add breakpoints, so that on future iterations you have to explicitly approve a step before it continues.
So all those human-in-the-loop features I mentioned are built into LangGraph Studio, and we're providing a developer experience specifically aimed at building agents. LangGraph Cloud is in private beta; we're excited to work with cutting-edge companies to figure out how to deploy agents reliably. We think a framework like LangGraph, which gives you complete control over the cognitive architecture of your application, is just one part — we strongly believe that everyone should be building that, that it is business logic you should be incorporating, and that it is your moat in some sense. However, we also think there are generic infrastructure pieces that just slow down the time to get to production, and that's what we want to be building with LangGraph Cloud. Thank you all.

All right, how did we enjoy the opening keynote? Yeah — really great time. I want to run through a couple of the tracks coming up. Agents: Andrew Ng famously said that AI agents could drive more progress in 2024, even more than the next generation of frontier models. Our host Demetrios from the MLOps Community just came off organizing his own AI Quality Conference next door, and brought his ukulele to guide us through building crews and factories of AI agents. Join him in salons 14 to 15 — to get there, head to the AWS booth down there and turn right at the end of the hallway. Evals and LLM Ops: AI engineers should build real moats instead of GPT-4 wrappers. Twitter loves hacky MVPs, but serious AI engineering means writing great evals and building good operational discipline, so you can ship better, faster, and cheaper. You might think our track sponsor Galileo AI ships a UI generation platform, but you might be surprised by how they ship the evals that ship the platform — mouthful to say; this is swyx's copy, and I'm reading it for the first time. So join Osman in salons 2 to 6, just outside the doors behind you and to the left. At our summit last year, OpenAI launched the year of multimodal AI with vision and image generation; this year GPT-4o is just one of many frontier models we will use to push into applications for on-device vision, real-time TTS, character simulation, and classroom education. One of the track speakers, Ben Hylak, worked on AI and the Apple Vision Pro at Apple, and his co-founder Alexis will guide us through the dawn of multimodality — join her right here in Salon 7. And AI engineer is a convention of both GPU-rich and GPU-poor, and we are excited to learn the state of the art from the people making it — didn't get a chance to rehearse this, this is going well. Our host Nyla worked in product at NVIDIA and Convey and is excited to accelerate our learning from the top teams in GPUs and inference. Due to scheduling conflicts we also have two talks here from yesterday's tracks: Scott Wu from Cognition, AKA Devin, on codegen, and Kathleen Kenealy of Google Gemma on open models. Join Nyla and her speakers right here in Salon 8. Last but not least, day two of the AI Leadership track addresses more needs of VPs of AI, from understanding enterprise RAG, building evals, and security, privacy, and compliance safeguards, to hiring and growing an AI engineering org, as well as case studies from Cohere and Twilio. As a reminder, this track is exclusive to AI Leadership track attendees with the green lanyards and green badges; if there is room at session start time, we can let in speakers with the blue badges and blue lanyards,
but anyone else — again, please do not attempt to attend these sessions; they are exclusive, or you'll be escorted off the premises by security. Please don't make us do that. If you want to attend these sessions next year, the tickets are there for you to purchase. These take place right across the hall in salons B to D. We have a number of things happening after this block of talks: Expo and Expo sessions from 10:30 to 11:10, breakout tracks from 11:15 to 12:15, then lunch, then more Expo and Expo sessions. Then HF0 is doing a demo day in salons 2 to 6 over there, just outside the doors, followed by more Expo sessions, and then we'll be back here for closing keynotes from 4 to 5, which includes the CEO of GitHub, Thomas Dohmke. But I believe, if the demo gods are good, we might actually have a special guest with us today. Tim, did we get that set up? Oh hey — hello, who is this? Hey, what's going on everyone — please welcome swyx, from Singapore. Hey everyone — I'm watching along live as well; it is 1 a.m. our time right now. Well, glad you could join us, swyx. How's it going, Ben — do we want to give any more context to the folks here? Scheduling speakers is hard, especially when they're busy launching Gemma 2, and it's been a fun experience putting this together. I hope everyone's having fun and that the talks have been engaging; I wish I was there with you. We absolutely miss you — I know a lot of people thought it was a joke for a long time that you didn't get to make it, so I'm really glad that you're supporting us. This guy has been so helpful in everything; all the sessions you see here, he's curating, with help from his friends. But what is it like from afar, watching the live stream — do you have time to catch any of the talks? Yeah — if people are watching the YouTube live stream, that's me in the chat, hanging out with folks, saying things I regret sometimes. It's very fun to just be one of the audience. My dream was always to show up at one of these conferences and just be an attendee and not know what's going on — I think there's a little bit too much not knowing what's going on, even as a core organizer — but it's been fun nonetheless. I think the lessons learned from this event will let us implement a lot of procedures and policies that smooth out the organization task that bit, and we might have a little bit more budget now that we've proven the model here. Any parting thoughts as we enter the last day? Yeah — there are lots of amazing talks. I can't pick a favorite; I'm personally excited about all of them, and I'll catch up on the live stream afterwards. I think there are some after-parties as well, right? I'm not sure what the social schedule is. There are tons of after-parties — they should all be on the homepage; the homepage is getting very long at this point. Let's put those in Slack — is everyone on Slack? We'll drop those in announcements or general. And I think I restored the QR code for World's Fair 2025 — I broke it just before you went on stage,
because I told you I was going to change the URL, and then we didn't change the QR code — so try it again. That's why it broke; I thought it was our QR provider that just got a spike and cut us off. Interesting — no, it's me; swyx is to blame. For the demo gods, swyx is the demo demon. All right, well, thank you so much for showing up. So sad that you couldn't be here with us, but it's really great to be a partner with you at this event — let's hear it for swyx, everyone. So we're going to go to break now; we will see you all back here at 4:00 p.m., but enjoy the breakouts until then. See you. [Music]
We are super excited to welcome you to the multimodality track for the AI Engineer World's Fair. Things have been changing a lot in the world of AI and LLMs, going from just text-to-text to now a whole world of multimodal inputs. We have a really exciting set of speakers here today to help you learn a ton more about this, and so I'm excited to welcome Rob Cheung to talk about Substrate — it is their launch week, so give them a round of applause.

Hey — yeah, it's really good to be here. This is a particularly exciting talk for us, because we've been working with private clients for about a year now, but this is the first time we've really talked about it in public since our launch last week. I'm incredibly proud of the work we've done so far and excited to take a few minutes to tell you about it. If you look at the products out there that have really successfully leveraged this generation of AI, I think one thing is true about nearly all of them: they're using more than one inference run, often many different types of models in tandem, to accomplish a specific kind of task really well. People quickly realized that the foundation model alone is not enough. From very simple tasks, like summarizing a document, to much more complex tasks, like solving coding problems end to end, I think the best products right now are all using systems of inference runs in a logical structure. At Substrate, we believe that building with modular intelligence is always going to be more effective than building with a monolithic intelligence. These systems are inherently more legible, which means you can understand them structurally, which means they're debuggable and extensible — and evals become a lot easier, because the decision trees are explicit and you can verify at every step what's going on and what's going wrong. Substrate is a new approach to this. I think our model is fast in ways that other paradigms can't be, it's flexible enough to build any AI product out there, and it works at scale by default. So what is it? At its core, Substrate is a coupling of two things. First, it's a really elegant developer SDK that lets you describe a computation graph over any number of nodes — the abstractions here are really general, and we have intelligence nodes across all the modalities you might care about: generating images, transcribing speech, generating text, JSON, embeddings, executing code. But second, Substrate is also an inference engine, specifically built to run these computation graphs as efficiently as possible.
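I don't have the exact SDK surface from the talk, so here is a purely hypothetical sketch of the graph-over-nodes idea he describes. The `Graph`/`Node` classes and names like `GenerateText` are illustrative assumptions, not the real Substrate SDK:

```python
# Hypothetical sketch of a "computation graph over nodes" SDK in the
# spirit of the talk. Class and method names are invented for
# illustration; they are not the actual Substrate API.
class Node:
    def __init__(self, **params):
        self.params = params

class GenerateText(Node): ...
class GenerateImage(Node): ...

class Graph:
    """Collects nodes and edges so an engine could batch, cache, and
    schedule the whole plan at once, instead of the caller dispatching
    separate API calls step by step."""
    def __init__(self):
        self.nodes, self.edges = [], []

    def add(self, node, depends_on=()):
        self.nodes.append(node)
        self.edges += [(dep, node) for dep in depends_on]
        return node

graph = Graph()
summary = graph.add(GenerateText(prompt="Summarize this story: ..."))
# The image prompt depends on the summary node's (future) output.
art = graph.add(GenerateImage(prompt_from=summary), depends_on=[summary])
# A real engine would now topologically sort and co-schedule both runs.
```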
These graph representations describe many tasks and their relationships, and since we run a very coordinated compute cluster, we can statically and dynamically optimize things like batching, caching, networking, concurrency, and physical placement — which really makes a big difference. If you look at most frameworks out there, they typically involve dispatching a bunch of API calls separately, and if you look at what happens mechanically when you do that, every step means you've got to resolve DNS, go through proxies, go through authentication and balance checks — and all of that adds hundreds of milliseconds of latency on every single step. Contrast that with Substrate, where we transfer data from node to node, process to process, on the order of microseconds — some 10,000 times faster — meaning it's actually feasible now to run online applications that involve dozens of nodes. We've also noticed that JSON decoding is one of the most useful patterns for multi-inference runs, and we've invested a lot into offering a best-in-class JSON mode, both in terms of reliability and speed. If you look at all of this together, I think what it means is that Substrate is really a way to enable higher-quality outcomes with AI, letting you work in a system that's more flexible, more legible, and more verifiable than any of the current paradigms that exist. There's a lot more to say, but that's all the time I have today — it's only five minutes. If you're curious, please come say hi at the Expo floor — you can scan this QR code and get some credits — go to the website, substrate.run, or shoot me an email at rob at substrate.run. [Applause]
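As a back-of-envelope check on the latency claim, here is the arithmetic behind "hundreds of milliseconds per step" versus microsecond hand-offs (the per-call overhead figure is an assumption chosen to match the "10,000x" ratio from the talk):

```python
# Overhead comparison for a pipeline of "dozens of nodes".
nodes = 30                      # "dozens of nodes"
per_call_overhead_s = 0.300     # assumption: DNS + proxy + auth per API call
in_cluster_hop_s = 30e-6        # assumption: microsecond-scale hand-off
                                # (0.300 / 30e-6 = 10,000x, as in the talk)

print(nodes * per_call_overhead_s)  # ~9.0 s of pure overhead per request
print(nodes * in_cluster_hop_s)     # ~0.0009 s — feasible for online apps
```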
Awesome — thank you so much, Rob. And now we're excited to welcome Vik Korrapati, who is the CTO at M87 Labs, to talk about Moondream.

Hi, can y'all hear me? I can hear myself. Hi, my name is Vik, and I work on an open source vision model called Moondream. A little bit about myself before I dive in: I was at AWS for about nine years before I started working on this model. Looking at where the stock price is going, I'm not sure that was the right financial decision, but I'm very happy with the work I'm doing. So let's dive into it — I'll talk about Moondream a little bit. It is a tiny vision language model: it's less than two billion parameters, so it can run anywhere, and it's open source, Apache 2.0, so you can use it to do anything. Here are some examples of things you can do with Moondream: you can ask it questions about images, you can caption images, it can detect specific objects inside images — here I asked it to tell me where the peak is and it gives me coordinates — it can count stuff, all sorts of things. I had the audacity to title my talk "How can a tiny vision model slap so hard?", so I have to back that up a little bit, and here's me doing that. These are two vision question answering benchmarks — one's called VQAv2, the other is called GQA — and as you can see, Moondream has been steadily improving over the releases I've made over the last three months. I've included a reference line there for LLaVA 1.5, which is a popular 7-billion-parameter vision model, so this shows you that Moondream gives you performance comparable to models that are about four times bigger than it. I didn't really set out to build a vision model — I kind of got roped into it. I was originally trying to build an application that required an AI agent, so I needed to be able to see what was going on on the user's screen and have it describe what's on the browser page, for QA testing automation. I tried to do this at first with GPT-4V, but there were too many safety refusals back then — if there was any human being present in the image, it would just refuse to process it — and it was also going to be really slow and expensive. So I realized that if this is a product I'm trying to build, I really need to have control over the model itself, and I figured, how hard can it be, let me just go try and build this model myself. Now, the task I was trying to perform here was fairly constrained — I just needed to describe screens and answer questions about screens, so it doesn't need to be generally intelligent. I had a couple of 3090s at home, so I figured I'd train a small version of the model at home and then rent some machines in the cloud to train a bigger version. Once I got done training the small version, I was like, hey, this actually works pretty well, so I posted it on Twitter. I thought I might get 20 likes off of it and then move on with my side project. It blew up far beyond expectations — I was a little surprised, pleasantly surprised, but surprised nonetheless — and I immediately started seeing other automated testing companies reach out, asking, hey, can I use this to describe browser screens? Because this would work really well for us. As well as other companies — shout-out to our friends at Open Interpreter from Seattle. So basically I thought, this is getting a lot of traction; let me pause on the whole automated testing app for a couple of weeks, focus on Moondream, and see where it goes.
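As a sketch of the developer experience he's describing — asking the model questions about an image — here is roughly how the open source model could be queried through Hugging Face Transformers, following the project's README around that time (method names like `encode_image`/`answer_question` and the recommended revision pinning may have changed in later releases):

```python
# Sketch of querying Moondream locally via Hugging Face Transformers
# (README-style usage circa mid-2024; may differ in newer releases).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("road_camera.jpg")   # any local image
enc = model.encode_image(image)         # one-time image encoding
print(model.answer_question(enc, "Is there a bus in this picture?", tokenizer))
```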
Yeah, so let me dive into a couple of the technical details around what makes the model succeed despite being small. The first thing we did that I think really helped was deciding what problems the model should solve and what it should not solve. Moondream wants to be a developer tool: we focus on being really accurate and not hallucinating. It doesn't have a lot of knowledge about the world, so if you ask it to write a poem, it's probably not going to help you — it's really focused on helping you understand images. This is really important, because it affects the type of data that you use and the sort of benchmarks that you want to focus on. There's a popular vision language model benchmark, MathVista, which measures how good models are at solving math problems: you take a picture of a differential equation and see whether the model can solve it. That was an example of a non-goal for us, because we just want the model to be good at looking at images — the most we'd do is generate a LaTeX representation of the problem; we don't want to even attempt to solve calculus. The model was not pre-trained from scratch: we fuse a vision encoder called SigLIP, from Google, with a pre-trained text model called Phi-1.5, from Microsoft. The notable thing here is that Phi-1.5 was also trained on mostly synthetic data, which is very similar to our pipeline, so it works very well for this sort of task. Pre-training from scratch doesn't really make a difference versus using pre-trained models, and it is cost-prohibitive — so unless you want the brownie points for saying you trained it from scratch, it's probably not worth doing. I've experimented with a bunch of other models as they were released, and nothing really made too much of a difference. What does make a difference is training data. The latest release of Moondream is trained on around 35 million images, and the problem, especially when you're on a budget, is that high-quality multimodal training data is really hard to come by. There are a lot of companies out there that will annotate data with humans, but it's really expensive — I've heard a rumor recently that they won't even talk to you anymore unless you're willing to sign an upfront seven-figure commitment. There's a lot of data on the internet — image/alt-text pairs — but the problem is that it's often not in the format you want, and it's really noisy, and the noise is really problematic when you're training small models. Synthetic data is a way to solve this, where you take that alt-text information and process it. It's a bit of an open secret that a lot of people are training on outputs from GPT-4. You probably don't want to do that: besides being questionable in terms of the terms of use, it's often not helpful. GPT-4 is a very powerful model; it has reasoning capabilities and knowledge that your small model is never going to be able to get, and so when you train on GPT-4 outputs, what your model learns instead is to hallucinate — it will generate plausible-sounding outputs that include details it cannot possibly have memorized, and you end up in trouble. So this is important — I'm going to get a little more technically detailed for a couple of minutes and dive into how to do synthetic data, so bear with me for a second; we'll pop back up.
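To make the "fuse a vision encoder with a pre-trained text model" step concrete, here is a schematic sketch of the common projector pattern — SigLIP patch features mapped into the language model's embedding space by a small MLP. This is the generic recipe, not Moondream's actual training code; the dimensions are illustrative (1152 matches SigLIP's large variant, 2048 matches Phi-1.5's hidden size):

```python
# Schematic of the vision-encoder + LLM fusion pattern (generic recipe,
# not Moondream's actual code).
import torch
import torch.nn as nn

class Projector(nn.Module):
    """MLP that maps vision features into the text model's embedding space."""
    def __init__(self, vision_dim=1152, text_dim=2048):  # illustrative dims
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features):       # (batch, patches, vision_dim)
        return self.mlp(patch_features)      # (batch, patches, text_dim)

# Training idea: the projected image tokens are prepended to the question's
# text embeddings, and the fused sequence is fed to the language model.
proj = Projector()
fake_patches = torch.randn(1, 729, 1152)     # 729 tokens/image, as in the talk
image_tokens = proj(fake_patches)
print(image_tokens.shape)                    # torch.Size([1, 729, 2048])
```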
Here's an example of how not to do it. COCO is a dataset with around 200K images; each image has five short descriptions and a bunch of object annotations — like, "hey, there's a bicycle at these coordinates" and whatnot. Let's say you want to take the short descriptions and these object annotations and generate more detailed captions that include the union of all the information present. If you just naively call GPT-4 with this information, it generates this — not important to read all of it, but there are two important things to note. The first is that it hallucinates: it says, in the second paragraph, that there's a person near the right side of the harbor. I think there are maybe five pixels way back there — it may be a post, it may be a person, we don't really know. That's because the object annotations were bad. But besides that, the model is also taking a lot of creative liberties, saying things like "five yachts standing out from the rest" and whatnot. So you need to do a little more pre-processing of your data before you feed it to the model. Here's another example: there's a dataset from Google called Localized Narratives. The task annotators are given is: verbally describe this image, and as you're describing it, hover your mouse over the part of the image that you're describing. It's nice in that it encourages people to create really detailed descriptions that capture spatial positioning in the image — for example, here it says the girl in the front is playing the guitar — and spatial reasoning is something that vision language models typically struggle with. I ended up having to build a fairly sophisticated data processing pipeline to get really good results with this. It's not really important to dive into the details here, but the important thing to note is that it gets really expensive: each image ends up being 20 LLM calls, and the LLM here is Mixtral 8x7B, so it adds up. But it was necessary — the training data is the biggest needle-mover in terms of model performance — and because of this, I'd say we spent maybe one or two orders of magnitude more compute on generating training data than on actually training the model itself. This particular dataset we've open sourced; it's available on Hugging Face. Here's an example of the type of questions it generates for this image. There's an interesting question towards the end — "what theory does the kid have about the existence of pleasure in the image?" — I'll talk about that in a second, but basically you want to generate a few distractor questions, so the model learns not to always agree with the question the user is asking. A couple of the challenges involved in working with synthetic data: there was an interesting incident early on where a user said, hey, I asked a relatively simple question — why couldn't the model answer it? When I looked at it, it turned out they hadn't capitalized the first letter of their question, and the model had never seen anything like that during training. So what do you do here? It's really important to make sure your training data has roughly the same distribution as your real-world queries, so we ended up adding an extra step where we artificially inject capitalization issues, typos, and the like into the training data before training on it.
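A minimal sketch of that augmentation step — randomly degrading capitalization and injecting typos into synthetic questions so the training distribution matches messy real-world queries. The probabilities and the doubled-letter typo are illustrative assumptions, not Moondream's actual pipeline:

```python
# Sketch: degrade clean synthetic questions to look like real user input.
import random

def degrade(question: str, p_lower=0.5, p_typo=0.3) -> str:
    if question and random.random() < p_lower:
        # "what is..." instead of "What is..."
        question = question[0].lower() + question[1:]
    if len(question) > 1 and random.random() < p_typo:
        # Simple doubled-letter typo at a random position.
        i = random.randrange(len(question))
        question = question[:i] + question[i] + question[i:]
    return question

random.seed(0)
print(degrade("What color is the bus?"))  # e.g. "what ccolor is the bus?"
```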
There's also this risk of what we call model collapse, where your model has biases ingrained into it. For example, if you ask Mixtral to generate distractor questions — questions completely irrelevant to the image — it'll always generate something about dinosaurs and aliens, and if you train your model on that, it'll instead learn "if the question is about dinosaurs and aliens, always say no," which doesn't really help. So you need to inject some entropy into the process of generating synthetic data to avoid this. In the case of synthetic captioning, you can do something like: "describe this image, but also consider the alt text on the image, which may or may not be relevant — and if it is relevant, use relevant facts from it." That tends to help a lot. All right, popping back up. There were a couple of important learnings I had over the last three months that I'd like to share with you. The first was that the community was really critical in this whole journey. Seeing the original engagement we got from the community helped me realize that maybe this is more valuable than the QA testing application I was working on, because a lot of people have a need for this to build applications like that. Coming from an enterprise company, it's been really refreshing to be able to just talk to customers directly — someone will Twitter DM me like, "hey, just saw you're looking for this, what do you think?" — and it's also helped us connect with a lot of partners and mentors and get a lot of support from the community. Being open source was critical. I kind of didn't have a choice here, because the competition was free — so what am I going to do? But when you're in the dev tool space, it's pretty important: open source matters to a lot of developers, who want the ability to run it in different environments. It's also pretty important for a lot of enterprise users — in a lot of cases they don't actually want to run the software themselves, but having the option is very important to them, because most enterprises have had situations where a vendor goes out of business or decides to screw them in some other capacity. It's also been really critical for engagement: we've had a lot of people in the community help out, porting it to different platforms, running the model in the web browser, and whatnot, so it's been very valuable for us. This one is a little controversial — not sure everyone agrees with it — but I feel pretty strongly that safety guardrails should be implemented at the application layer, not baked into the model itself. This was one of my learnings from my first attempt to build a QA testing application with GPT-4V: it made no sense for that application to reject any picture that contained a human being. I understand why they felt it was important, but dev tools are B2B, not B2C, so it's important to make it easy for developers to decide what guardrails they want and implement them themselves, as opposed to deciding for all users. I'm not saying this isn't important at all — it kind of makes sense, if you're building an assistant, to bake that stuff directly into the model — but when you're building for developers, it makes less sense. And yeah, I now believe pretty strongly that tiny models are going to run the world. In computer vision, perhaps even more so than in text models, efficiency is really important.
In a lot of cases you're really worried about cost, because you're processing video at 30 frames a second, and 7/10 of a cent per second (that's about $25 per hour of video) adds up very quickly and doesn't give you a lot of room to work with. But there are also situations where you're really worried about privacy or latency, and therefore want to run the model really close to where decisions need to be made. Which is not to say big models are not useful — I think they're very useful; I just think we'll mostly be running them in our development environments, maybe for generating training data, while the artifact you're going to want to deploy is most likely a smaller model. Another thing that was a little surprising to me was looking at the different things people were doing with Moondream. There were a lot of people building net-new applications that weren't possible before, because the model can understand language as well as images. But there were also a lot of people doing traditional computer vision things with the model — like "is there a person in the scene?", or "is there something suspicious going on?", or "tell me where the bus is in this picture from a road camera" — all of which was possible to do before we had Transformers: just train a YOLOv3 model or whatnot. The lesson I took from this was that prompting is a much better developer experience than having to train a custom model. For a lot of developers who would be interested in incorporating vision into their applications, before, they'd decide it wasn't worth spending two weeks learning how to collect data, annotate it, and train a custom model. Giving them the option to, fairly cheaply, just describe in English what they want extracted from an image makes it something they'll actually consider doing now. All right — I think I'm a little ahead of time, so I'm excited to maybe do a live demo, if the demo gods smile upon me; we'll see. In conclusion, where is Moondream going? We're not AGI people — I'm really focused on making it easy for developers to build amazing applications with vision. There are a bunch of model improvements I'm working on right now; I'll talk about some. Right now we use 729 tokens to represent an image, so you can only really send one image to the model at a time. We're working on giving users the option to give a more compressed representation to the model, which makes sense if you're not trying to read text from the image — if you're just doing classification or whatnot. That makes the model run a lot faster, which is important especially if you're on CPU as opposed to GPU, since CPUs can't do as much parallel compute, and so that sort of thing ends up being really important. We also just raised a seed round from Felicis, Ascend, and also the GitHub Fund, which I forgot to include in the slide — sorry, GitHub. This means more GPUs, but more importantly it means I can finally get some sleep, because we're able to get a couple more people to join the team. If you're interested, please reach out — we have a contact email on the website, or just hit me up on Twitter. We also have an exciting release coming up later this summer that I'm super pumped for, so stay tuned. I think that's about it, and I have a couple of minutes left, so I'm going to try doing something that may not be the wisest idea, but we'll see how it goes.
All right — I've turned the Wi-Fi off; this whole thing is running locally. What this is going to do is start taking my webcam input, and it's going to use Moondream in an infinite loop to describe what it sees, and we can ask it different questions. So we'll see how that goes. You can ask it different things — let's say, "is the person wearing glasses?" You do have to tell the model to answer briefly if you want a yes or no; otherwise it gives you more than a single word. Let's try that. Yes — okay, I'll take them off. I can't see — did it get it? [Applause] Let's do that; I'll go back to the old prompt. [Applause] All right, well, that was it for me. Thank you all.

Thank you so much, Vik. We are super excited to welcome our next speaker, Ben Hylak, who is one of the founders of Dawn Analytics, and he's going to be talking about the era of unbounded products.

Hey all — I guess I actually have this on, I don't need that. I'm so excited to be here with you today, at what is probably the coolest AI conference in the world, at such an exciting time in history — especially, I think, for AI products. If you don't already know me, from either demos on Twitter or sometimes probably ill-advised spicy takes on Twitter, my name is Ben Hylak and I'm the founder of Dawn. At Dawn we help some of the best companies in the world, everyone from GitHub to can-of-soup, build better, more predictable AI products. My entire life I've been really obsessed with building and designing unbounded products — products that transcend the mouse and the monitor in some way. For me that started with robotics, I think the first one was when I was in eighth grade, and eventually rockets at SpaceX — these are very unbounded products. And most recently I was on the design team for the Apple Vision Pro for four years, where we designed the first version of visionOS. I think that AI makes products less bounded than they've ever been: you can type, you can talk, you can show images or video, just like we just saw — you can also sort of plead, you can bargain, you can confide. These are very interesting input modalities, and this unboundedness often makes products unpredictable: confusing, hard to understand. Users assume your product can do things that it can't; they try to do those things, it doesn't work, and they walk away thinking that it can't even do the things that it can. When you talk to people — specifically people that are not in this room — about how they use ChatGPT and how they learned to use it, it's often word of mouth: they hear one of their friends say they used it for travel planning, and then they go use it for travel planning. A lot of us in this room, especially people who are more technical, learn through trial and error — we just keep trying, because we know that these models are good, we know it's impressive. But a lot of people don't do the trial-and-error thing: they try it once, it doesn't work, and they don't try again. So this talk is about making good AI products, and to that end I'm going to cover just three things: the past — how products have become more unbounded, and what has worked for unbounded products in the past; the present — AI products today, and the good and bad design patterns;
and the third point is going to be the future. So again, just three things — the past, the present, and the future. Easy. We're going to start with the past. Most software that we use lives on a screen, and you use it primarily by swiping, clicking, and tapping. When you click something, whatever the developer expected to happen is what happens — depending on how good of a developer you are. It's easy for users to understand what your app can do: they look, they see the buttons, they get it. It's also very easy for you to understand what your users are doing: you just add an Amplitude or Mixpanel call on a button press and you see what they did. One of the biggest changes to this, prior to the last two years, was multi-touch — instead of one pointer you have two — and just by adding that second pointer you get relative distance, you get rotation. That one little change largely made the smartphone possible; it largely made it easy to use a screen that small. And now it's just getting crazy: we have unbounded products everywhere — products are so unbounded you have software freely roaming the streets of San Francisco, getting attacked by fiery mobs. So I want to talk a little bit about one unbounded product I got to work on, the Vision Pro, and about three lessons we learned while designing it — lessons I think aren't as intuitive looking from the outside in. I think unbounded products are often defined by this "what if" question. When we were starting, it was like: users get themselves into the craziest situations. Something as simple as, well, what if someone's in the living room, and then they move to the bedroom and lie down on their bed — what should happen to your apps? If you're designing macOS on a laptop you don't have to worry about that, but that's something we had to think about, and there are hundreds more questions like this: what if someone's on a plane, what if someone's next to their friend, what if someone has a disability of some sort — they can't move, they can't move their neck, they're bedridden. All these what-ifs are, again, what define unbounded products. All of us building AI products are constantly thinking, what if someone puts in this, what if someone puts in that — and there are evals, et cetera. And so, without structure, you just have chaos: a blank slate, all these what-ifs, an infinite world of possibilities. It's really on us as product designers to add structure, and structure is what creates clarity. So I want to talk about three ways we added structure. The first was highlighting what matters, and doing it really fast. The first thing you see in visionOS is a home screen: it has apps, it has people, and it has environments. Those are the things we think matter when you're using visionOS, so they're the first thing you see. That might not sound that novel — and in a lot of ways it's the same thing that happens on your iPhone — but when you compare it to the VR products that came before, where it's very hard to understand what the thing is actually good for when you look at the menu. The second point is hierarchy. Hierarchy is what gives unbounded products a shape and a purpose;
it's what helps users understand what it's good for and what they should use it for. So again, we have the home menu — that's kind of where everything starts and ends for visionOS — we have windows, which have bounds, you can resize them and move them, and any individual window can go full screen. That was our hierarchy. The last point, which is really important: the easiest way to make unbounded products feel intuitive is familiarity. It was something we hit when we were building Dawn — our first prototype was this star-cluster thing you could explore, which was really fun; nowadays it looks a lot more like this, which is tables and graphs and examples. Again, it's just structure and clarity. And I think it's no accident that the TV app on visionOS looks a lot like the TV app on tvOS — it wasn't laziness. When people are in uncharted territory, you want to give them as many signs of home as possible. Same thing for Control Center: when people see it in visionOS, they already know how to use it. So again, these three points: highlighting what matters and bringing it to the forefront, establishing hierarchy, and leveraging familiarity. All right, that was the past; now we're going to talk about the present, and specifically about AI products — ways, both good and bad, that products have been incorporating structure into their AI features. It's really important to note that the right structure is very unique to your app; that's the whole point — it gives your app a shape and helps your user understand what it's actually for. Take something like Dot. Dot is a companion — sort of a journal, at least for me — and the structure they added is that if you pinch out, you can see each day separated, so it feels a lot like a journal. And if you tap a person, or two people — in this case my co-founders — I can see structured information about them, in a timeline, from every time I've mentioned them to Dot. Again, you're pulling that structure out of the chat. Perplexity does a really good job of using structure to make their experience feel more like a search engine and less like ChatGPT, less like a chat you're having a conversation with. They do this by pulling your query up top as a title, highlighting the sources it came from, and having the answer below that — and having that take up the full page regardless. It makes it feel more one-shot, less like something you're having a back-and-forth with. Now I want to talk about an anti-pattern I've seen, which is in the Vercel chatbot demo. I think Vercel does some of the coolest design work in the entire world — I just didn't like this one. This is the idea of almost ephemeral UI, but inside the flow of chat. And I get the appeal — actually, if we go back here (sorry, this was a video meant to show it), you have a slider: you inquire about buying Doge, and it shows you this UI so you can adjust exactly how much, instead of having to do it over text. It could be good. The problem is that when it's stuck inside this unstructured thing, it starts floating away as I try to ask follow-ups — and at some point I even have two of them.
So I go back up to the first one, I press purchase, and now I'm interacting with something that's completely different. It reminds me a lot of the house in Up — it just kind of floats up, up, and away. So instead of trying to stuff structure into this unstructured thing, I think the answer is to pull it out — pull it off to the side — which means that as the conversation continues, you can just update that structure without disrupting where the user is. And that's exactly what Claude did with Artifacts. I think the reason it's so successful is that they pulled the structure — the app you're working on and iterating on — out of the actual conversation, so as you make changes you can even go between versions here without having to scroll in the conversation. It's beautiful, and it brings us to another thing that I think has been really effective for AI apps, which is this concept of version control. This was actually one of the original shipping ChatGPT features, which is kind of crazy: if you edit a message, you can go between the versions, and it actually maintains this entire tree — very complicated, but super powerful. With v0, Vercel did something again amazing, where it feels extremely familiar, almost like you're working in Google Slides, but you can go back and keep iterating on UI without having to be afraid that you're losing something. So again: versions. And again, I think familiarity is really one of the most important things for unbounded products. I think Claude did an excellent job with this too — I'm hyping them up here. ChatGPT introduced memory across all of your chats, completely unbounded: when I tell it something about some sort of medical problem while I'm working on JavaScript, it knows that, which is very weird to me. The idea of projects, and the structure of a project, is very familiar, so sharing context across a project makes more sense. Agents are something that is extremely unfamiliar to most people — this idea of having all these different tasks, feeding data between steps, whatever. But you know what is familiar? Spreadsheets. Spreadsheets are extremely familiar — not to me, actually, but to a lot of people — and I think the only real uses of agents I've seen in the real world are spreadsheets. This is Clay: each column is essentially a step that an agent is taking, defined by the user, so it's going across, building up context across the spreadsheet, row by row. You almost do it like an eval: you run the first 10 rows, and then you run the next 50,000 or 100,000, so you get it right. And you can see here that eventually you end up with a personalized email as the last column, with all these steps in between. The next thing that I think is extremely effective in helping people understand what your app is for, skipping all the noise of prompt hacking and prompt engineering, is examples and presets. ChatGPT, I think, was the first at this, with those suggestions like "message to comfort a friend," "plan a relaxing day," and so on. v0 does an awesome job with this too: not just having those suggestions below, but also having an explore page where you can see what other users are doing and what's actually working — again, trying
to shortcut that blank-canvas prompt problem. Notion as well: they have a simple menu where you can change the tone of text, instead of having to write "you are a very concise GPT," whatever whatever — you're just using tried-and-proven things that Notion can validate. And that last point brings us to the future: where are interfaces going? Linus gave an awesome talk last year where he described prompt engineering as almost like a llama trying to drive a car with a pool noodle from the back seat — I think that was his metaphor — and there's some real truth to it. So first of all, I think the future has a lot less prompt engineering, and we're already seeing this. We're already seeing it with generative images — the way that Apple designed it, where you're mixing and matching these different concepts, and there's a ton of demos on Twitter of people going between emotions in a more intuitive way. And just yesterday Figma released a way of adjusting the tone of text, where you're going between professional, casual, expanded, concise. The problem with this is that "casual" means a lot of different things: casual for a Fortune 500 company and for a direct-to-consumer cosmetics brand with ads on TikTok — these are very different things; casual when talking to your best friend or to a coworker — these are different. So how do we avoid being reductive when offering these sorts of presets? The answer is that you just — I don't know exactly how many zeros to put here — million-x or billion-x the number of presets, so you have enough presets for everything. And I think sparse autoencoders show a really promising path towards that. If you've tried Golden Gate Claude, where you can identify the one Golden Gate Bridge feature and amplify it, it makes Claude obsessed with bridges — Golden Gate bridges specifically, or the Golden Gate Bridge. My friend Gytis has an amazing demo towards this, but for manipulating images: you can see here he's increasing the amount of "play of light and shadows," increasing the amount of "serene forest streams" or "Venetian canals" — again, a very controllable and predictable way. Okay, but now we have a million, a billion options, whatever — how do we avoid too many options? I think this gets to point three, which is ranked presets: presets that are personalized, searchable, and even invoked through natural language. They might not even be directly visible to the user — the user types in something like "more friendly," and you pull up the corresponding presets, like how close you are, how confrontational it is; maybe they're directly editing them, maybe not. This gets to the second-to-last point, which is developer-defined personalization. As soon as you're able to define those sorts of features, you can start tuning them per user — in a way that you can't do with just text prompts today, because text prompts are this fragile house of cards where if you remove one word, the whole output changes. And the last point, which is especially true as your app becomes increasingly different per user, is shifting from evals to analytics. I don't think there's going to be some objectively correct answer for a lot of domains.
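As a sketch of the sparse-autoencoder steering idea he references — decompose an activation into sparse features, amplify one, and reconstruct — here is the generic pattern in schematic form. The dimensions, the (untrained) SAE weights, and the feature index are all illustrative assumptions, not any particular model's internals:

```python
# Schematic of sparse-autoencoder feature steering (generic pattern only).
import torch
import torch.nn as nn

d_model, d_features = 512, 4096          # illustrative sizes
enc = nn.Linear(d_model, d_features)     # stands in for a trained SAE encoder
dec = nn.Linear(d_features, d_model)     # ...and its decoder

def steer(activation, feature_idx, scale=5.0):
    """Decompose an activation into sparse features, amplify one
    (e.g. a hypothetical 'casual tone' feature), and reconstruct."""
    f = torch.relu(enc(activation))      # sparse feature coefficients
    f[..., feature_idx] *= scale         # turn the dial up on one feature
    return dec(f)

resid = torch.randn(1, d_model)          # a model activation to modify
steered = steer(resid, feature_idx=123)  # 123 is an arbitrary example index
print(steered.shape)                     # torch.Size([1, 512])
```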
For things like "who was the first president?", yes — but the right sort of tone for a summary for a specific user? I don't think so. So I think it's increasingly going to be about how you understand whether you're meeting the needs of your users and what they're asking for. That's it — thank you so much. Oh yeah, we'll skip this one — and thank you so much for coming.

Awesome, thank you so much, Ben. And now we're excited to welcome Karan Goel from Cartesia to talk about state space models.

Can everybody hear me? Awesome — we'll see if I can stay on track for 18 minutes. Great to be here. I'm Karan, the CEO of a small company called Cartesia, which is an eight-month-old company. I'll tell you a little bit about what we're working on, try to describe some of the challenges that we see in emerging multimodal intelligence and how we plan to fix them, and hopefully show a demo as well, if the Wi-Fi works. So maybe to set the stage a little bit: the last four or five years of AI have been really focused on this idea of batch intelligence, which is pretty core to the idea of building an AI system that can reason for long periods of time on a problem and then solve it — think hard math or physics problems. But there are a lot of applications where what you actually need are systems that are streaming: they're real-time, they work instantly. Imagine generating video or audio, or doing understanding applications on sensor streams, et cetera. So it bifurcates into these two different types of applications, similar to how there's generally this idea of batch workloads and streaming workloads. A lot of what we've seen over the last few years has been focused on batch APIs, where you call a model in the cloud, it takes a few seconds, and you get a pretty good response back. Now we're seeing a shift towards more real-time applications, where you'll constantly be querying a model, asking it to return responses at low latency, and using that to incorporate or generate information. I think this area is really exciting, because it's going to be transformative for a lot of interesting applications that so far haven't been the main focus of the last few years. Conversational voice is an example: you should be able to interact with a system, talk to it, and it should be able to understand you and do all kinds of tasks on your behalf. Similarly, assistants that are on-device and run really efficiently, at low power, at all times, regardless of whether you're on a phone or a laptop. Then things like world generation, where you can imagine actually playing a game that is generated in real time, similar to how graphics are rendered on GPUs — and all of this should be able to happen in real time, at low power, on your phone, on your MacBook, et cetera. Robotics is another great example, where it all culminates on a single device that is trying to interpret everything in the world. So I think this is the exciting intersection: how do we make intelligence faster and cheaper, so that we can put it everywhere? And here are a couple of examples that are really
A couple of examples where this could be really powerful. Real-time intelligence for conversational interfaces is going to be really interesting, because you'd be able to have an agent that can provide customer support for a problem, answer questions about health insurance, or call your vendor to arrange a shipment pickup — all these coordination tasks that are generally annoying should be automated, with real-time intelligent agents doing them. Humans can then spend their time on harder, more interesting problems; in customer support, that could mean dealing with the tail of customers who matter more because they're upset, or because they have more customer value. Similarly, in robotics there's this idea of ingesting audio, video, and sensor data, much as humans do, and responding instantly to all those pieces of information. That's the world we should be living in: intelligent models that run super fast, solve all these different problems, and power new experiences that are interactive at their core.

So this is where we come in: we're building real-time foundation models. Some of what I'll talk about is the work we've done on genuinely new ideas for how you create deep learning models. I did my PhD before this, working with a lot of these folks — Chris was our PhD adviser — and we were focused on the idea that a model should be able to compress information as it streams in, and use that to build powerful systems that are streaming at their core. That's the technology we've been developing in academia for the last four or five years. Some of you might have heard of Mamba, a more recent iteration of this line of work; I did my PhD on some of the early iterations that nobody uses anymore, but they're the precursors to a lot of the modern stuff that's now more widely used. What we're doing at Cartesia is taking this and trying to understand how to improve it: how do we push the boundaries of what architectures can do? It's an interesting question, because we shouldn't settle for one way of doing things — that's a poor way to think about the future. Our approach is to think about new ways of designing models that aren't necessarily built on, say, the Transformer architecture and the standard deep learning recipe that's prevalent today.

I think it boils down to this: efficiently modeling long context is a huge problem, because a lot of practical data is long-sequence data. Text is maybe the least interesting long-sequence data, because text is already fairly compressed — a lot of information is embedded in two sentences. But in all these other domains — audio, video, and so on — there's vastly more raw information. Imagine watching a security camera feed for a day: an enormous amount of information comes into the system, and very little of it is useful.
So compression is really fundamental to intelligence: we can take in all this information and compress it down to whatever is necessary to remember or understand. So far, the AI systems we've built haven't exhibited that behavior. They're built not on principles of compression but on retrieval — keeping all the context around and using it to reason over everything you've seen. Our point of view is that multimodal AI will remain challenging as long as you work in that paradigm. If you think about what a human does in a year, you're processing and understanding roughly a billion text tokens, ten billion audio tokens — these are back-of-the-envelope calculations I did — and about a trillion video tokens (probably an underestimate of how much video we process), not counting all the other sensory information. You do it all simultaneously, on a computer that fits in your brain, sometimes while skipping food and drink, and you still function fine — the system runs on variable amounts of power. So the idea that intelligence is solved is very far from the truth; humans are extraordinary machines that do something remarkable, in a very compressed way, that our AI models can't match. That's the reason we get up in the morning: we think about this and realize we're very far from where we should be. The best models today are in the 10-million-to-100-million-token range — which is really good, a lot of progress has been made — but what we aspire to is machines that are long-lived and can understand information over very long periods of time. The cool thing is that, as a human, you can remember things that happened 30 years ago with very little effort. You don't need RAG or retrieval; you just remember it — it's lodged in your brain — and you figure it out. That's an extraordinary capability we should be able to put into our AI models as well.

Some of the big problems with models today: they're built on Transformers, which are really optimized for the data center. We saw this in our work on sub-quadratic models. Quadratic scaling in context length means the computation required to process long context is very large, and right now the predominant approach is to throw compute at the problem and hope it scales. Compute is obviously an important piece of the puzzle — you do need more computation to do more difficult things — but quadratic scaling behaves poorly with very large multimodal context. Text contexts tend to be shorter; multimodal contexts will get much larger, because far more tokens and information flow into the system. That's going to be a big challenge for these models, especially doing inference efficiently, so you're not burning down data centers to do a fairly limited amount of inference.
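To put the quadratic-versus-linear point in perspective, here is a toy back-of-the-envelope calculation — illustrative constants only, not Cartesia's numbers:

```python
# Rough cost comparison: attention touches every (query, key) pair, so
# per-sequence work grows as n^2; a streaming/recurrent model does a
# fixed amount of work per token, so work grows as n. All constants
# here are made up purely for illustration.

def attention_cost(n_tokens: int) -> int:
    return n_tokens ** 2                 # O(n^2)

def recurrent_cost(n_tokens: int, state_size: int = 4096) -> int:
    return n_tokens * state_size         # O(n)

# Using the talk's yearly human estimates: ~1e9 text, ~1e10 audio,
# ~1e12 video tokens.
for n in (10**9, 10**10, 10**12):
    ratio = attention_cost(n) / recurrent_cost(n)
    print(f"{n:.0e} tokens: attention is ~{ratio:,.0f}x more work")
```

At trillion-token video scale, the quadratic term works out to hundreds of millions of times more work than the linear one — which is the gap the talk is pointing at.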
You have to imagine that we'll be doing a thousand, or a hundred thousand, times more inference — and if the models scale the same way, it's going to be really, really expensive, and you won't be able to permeate all those applications I talked about very easily. So that's a big challenge. Again, our hypothesis is that you need new architectures, and that's where we spend our time: making models more efficient, faster, and more capable while handling these long-context problems. (This slide is about Transformers being somewhat inefficient at this — though obviously a very good recipe for scaling models out.)

Some of the work we've been doing is on fundamentally efficient new architectures with compression at their core. They scale more linearly in context length, so you can have lower-power implementations; you compress information as it enters the system, memory usage stays low, and you can scale to much more massive context as a result. This is the work around SSMs — I threw in a nice slide which I thought was cool; Jensen had an interesting quote about SSMs in one of his Wired articles that I like to keep bringing up — and I think it's a cool technology with a lot of potential, which is where we spend a lot of our time. If you're interested in reading more, there are lots of videos on YouTube and plenty of resources that make it more accessible.

The working intuition is this: Transformers generate quadratically by attending to every past token. As tokens come into the system, you keep them around and look back at all of them. To generate the word "jumps" after "the quick brown fox", you look at the entire context, figure out what the next word should be, generate it, push it into the context, and do it again. With SSMs, you have a streaming system instead: tokens stream in, each one updates an internal memory in the model, and then the token is thrown away. That really simplifies the system, and it's why the interface is streaming at its core: you're not keeping all that memory about the past around; you're compressing it into a sort of zipped-file state inside the model, which is then used for future generation. This takes advantage of recurrence, which is core to how even humans do a lot of their reasoning. Over the last few months a lot of these models have been getting adopted, so it's great that folks are excited about this alternate way of doing things — oriented around recurrence rather than retrieval. I think we'll see a lot more activity here, especially as multimodal data becomes more important and its efficiency challenges mean these models have more of a role to play over the next three to five years, as we also do our work scaling them up and making them more interesting.
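To make that intuition concrete, here is a deliberately tiny sketch of the token-in, state-update, token-out loop — a toy linear state-space recurrence, not Cartesia's or Mamba's actual parameterization:

```python
import numpy as np

# Toy SSM layer: h_t = A @ h_{t-1} + B @ x_t,  y_t = C @ h_t.
# The model's entire memory of the stream is the fixed-size state h.
d_state, d_model = 16, 8
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))  # state transition
B = rng.normal(size=(d_state, d_model))             # input projection
C = rng.normal(size=(d_model, d_state))             # output projection

h = np.zeros(d_state)  # the compressed "zipped file" of everything seen
for x_t in rng.normal(size=(1000, d_model)):        # tokens stream in...
    h = A @ h + B @ x_t     # ...each one updates the internal memory...
    y_t = C @ h             # ...an output is emitted...
    # ...and x_t is thrown away. Memory stays O(d_state) forever,
    # unlike attention's key/value cache, which grows with every token.
```

The contrast with the Transformer loop is exactly the one described above: no growing context to re-read, just a constant-size state folded forward at each step.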
A lot of people ask me about quality. I only have a few minutes, so I'll go through the rest of this fast, but SSMs generally have the right quality. There's obviously a trade-off between compressing and keeping all the information around, but compression can actually help: in the security-camera example, if you're watching 24 hours of footage, compressing that information on the fly lets you solve tasks and answer questions better than re-examining all 24 hours every time. So the rule of thumb is: compression is super helpful for long context, less helpful for short context. And we see that quality is in fact very good on long-context and multimodal problems.

Let me talk quickly about some of our recent work. We've started working on multimodal data, and a few weeks ago we released a voice generation model — text to speech — in line with our effort to bring more multimodal data into a single model and use SSMs to power training and inference. It's a model you can actually play with; I'll try to show you a demo. One of the things we're proudest of is that we really shrunk the latency: when you play with it on the playground, you get instant voice back, generated from the data center. And there's some cool work underway to run these models on Macs and other devices, so you can get the same experience you have from the data center on any device, efficiently. How much time do I have? Okay, we're out of time — but I was almost done. Go to the website, play.cartesia.ai.
I unfortunately couldn't walk through the demo, but please play with it and send us feedback. This is my email in case you want to send me a note — I'd love to hear feedback and anything you folks find interesting. Thank you. [Applause] [Music]
Hello, hello, and welcome back to the afternoon session of the multimodal track. We are super excited to get started with Chang She and Noah, talking about the hierarchy of needs for training dataset development.

Oh wait — okay, I think this thing is on. Hey everyone, thanks for coming to our talk. Hopefully you didn't eat too much and you're not too sleepy from lunch, but we're excited to be here and to talk to you about training dataset development for LLMs. My name is Chang. I'm the CEO and co-founder of LanceDB. I've been creating data tools for data science and machine learning for almost two decades, starting as one of the co-authors of the pandas library a long time ago; I spent a bunch of years in big data systems and recommender systems, and most recently I started this company, LanceDB, which is the database for multimodal AI. These days I spend roughly equal time tweeting and on GitHub.

Hi everyone, I'm Noah. I currently lead the AI data platform at Character.AI, one of the leading personalized AI platforms. We train our own foundation models as well as run a direct-to-consumer online platform, and I focus on data research. Since we train our own foundation models, we need to learn what to train on to engage our users, so we focus both on academic benchmarks and on things like A/B tests, trying to drive more engagement on our platform. My team also works on research acceleration, so we tend to build a lot of tools — which led to this collaboration with Lance and to how we think about storing our data.

If there's one thing I want to convey with this talk, it's that you should really care about what you're training on, and you should care for it by giving it a nice format that does a lot of nice things for you. I wanted to start with broad strokes on how we think about pre-training versus post-training. There's definitely a lot of overlap, but in pre-training you tend to think wider: which domains you're training on — books, or more chat data — and quantity: how big is your model, how many tokens do you need. In post-training, you're looking at very specific tasks, and not just the context of a task but how difficult that math problem is, or how easy that multiple-choice question is. You have to get much deeper and more granular in what you understand about your data, at scale. In the middle I've grouped some of my favorite problems that a lot of people are looking into right now: data-efficient learning (how do we reduce the amount of data needed to get good results from a similarly sized model), how we sample from data (what kinds of metrics do we need), and how we look at diversity (measuring diversity is very difficult; there are various automated approaches used in industry and across all the papers out there).
Everyone loves a good hierarchy of needs. For us, it always starts with clean data and goes quickly right up to evaluations — we always start there, because it's hard to measure anything without a compass. And since we're so focused on post-training nowadays, systems for dataset management are becoming more and more of a need. When we think about mixtures — collections of datasets, usually with different ways of including them in your batches and training sets — you want to understand those collections not just in terms of which dataset they are (is this Wikipedia or something else?) but also what's actually in there. That naturally rolls up into analytics: we want token counts, an understanding of length, and sometimes more complicated signals — you might classify your code data not just by whether it's Python or Java, but by how difficult it is, how many functions are in the problem, how many classes you were supposed to generate. More and more analytics lets you understand your data better. More than anything, though, reading data has probably been the biggest win; the rest is largely ways of automating things we've learned from looking at data, looking at outputs, looking at performance, and trying to understand what is going on.

Everything in the top half of the hierarchy is about using language models to improve language models: synthetic data, quality scoring, dataset selection. Dataset selection is probably the simplest, and one of my favorites: you're looking for ways to match distributions between the behavior you want from your model and the data you have. A lot of what we do is retrieval or clustering — you can embed the web pretty quickly nowadays — and then pick the data we like, according to the evaluations we're looking at. Quality scoring is similarly simple: we've built a lot of classifiers in-house for a variety of things, and there's a lot of cool work on doing this with just prompted classification, which is even simpler than building a classifier, evaluating it, and running that whole loop. Synthetics, given how we've structured our platform, are also super powerful for us: we have an ecosystem of big data tools like Spark and Trino alongside GPU-backed services for prompting, embedding, and classification, so we can enrich and augment our datasets. You can generate quick examples of, say, preference pairs and explore a method before it's at peak quality — synthetics will have problems, but you can start getting signal on what types and shapes of data help, and then loop in human labeling to make it even better. At the top, we use human labeling a lot to improve these classifiers, and we also want to use it for rewriting synthetic data that has issues — or real data that just has issues in itself.
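As a toy illustration of the distribution-matching idea behind dataset selection (hypothetical names and random stand-in embeddings — not Character.AI's actual pipeline), you can embed a small set of examples of the behavior you want, then keep the candidate documents that land closest to them:

```python
import numpy as np

def select_closest(candidates: np.ndarray, targets: np.ndarray, k: int):
    """Indices of the k candidates most cosine-similar to any target."""
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    best = (c @ t.T).max(axis=1)     # each candidate's closest target
    return np.argsort(-best)[:k]

# Stand-ins for real document embeddings from an embedding model.
rng = np.random.default_rng(7)
web_docs = rng.normal(size=(100_000, 128))  # embedded candidate pool
wanted = rng.normal(size=(500, 128))        # examples of desired behavior
keep_ids = select_closest(web_docs, wanted, k=10_000)
```

Swapping the top-k for clustering, or the cosine score for a prompted quality classifier, gives the other selection variants mentioned above.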
All of this comes together to motivate a lot of our platform tooling, and as I said at the beginning, accelerating research is a big part of it. I've included some beautiful YAML here — hopefully people can see there's a SQL block in there — and I think it's pretty motivating in terms of how we materialize datasets. If you've worked in machine learning at all, you know you usually have a specific training format — maybe TFRecords, maybe JSON Lines, depending on where you're coming from — and in my experience it's one of the most error-prone components of training: I don't know what data this is, I'm training on it, I'm getting weird results. Since our team iterates so much on data, separating concerns — how your data is materialized versus what your training job is doing — is really, really nice, and this is where Lance started becoming a big deal for us, especially as we move into multimodal, where data volumes are much larger and the problems we're trying to solve are more complicated. The materialization service aside — it's a nice interface where you send a request and get back a list of files — things really hit the road at data loading, which is its own problem once data volume gets really large. Lance has a nice property, which Chang will talk about a lot more, that allows quick random access and lets us shuffle data very cheaply: it essentially lets you shuffle references to rows rather than shuffling the rows themselves, which saves a lot of iteration time. At the end of the day, we just want to watch the GPUs go brr and the numbers go up. I'll pass over to Chang, who can talk a lot more about the Lance format in detail.

Thanks. Cool — so you've heard from Noah about the importance of data in developing models, and if data is critical, then it's also critical to have the right data infrastructure for your workloads. Now, AI workloads tend to differ from traditional data warehousing, OLAP, and analytics workloads in a couple of ways; let me give you one motivating example. A distributed training workload typically breaks down into three steps. First a filter: you select the right samples from your raw dataset. Then a shuffle: you draw random rows from the filtered set. Then, if the dataset is large, you stream those observations — text, images, or videos — from object storage into your GPUs. So in that one workload you need fast scans to run the filter, fast random access to do the shuffling, and the ability to handle potentially very large binary data — large blobs — to stream quickly into your GPUs. These three properties are often required together in one AI workload, from training to search and retrieval, but existing data formats and infrastructure are good for at most two of the three — often just one. This is what I'm calling the new CAP theorem for AI data, and it's the motivation for designing the Lance format, around which we've built LanceDB.
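Here is a minimal sketch of the shuffle-references pattern that both speakers describe, assuming a format with cheap random row access; `fetch_rows` is a hypothetical stand-in for that random-access read, returning dummy rows here so the sketch runs:

```python
import numpy as np

def fetch_rows(row_ids):
    # Hypothetical stand-in for a random-access read (e.g. fetching
    # exactly these rows of a dataset on object storage).
    return [{"id": int(i)} for i in row_ids]

def shuffled_batches(num_rows: int, batch_size: int, seed: int = 0):
    """Shuffle cheap integer row ids instead of the heavy rows."""
    order = np.random.default_rng(seed).permutation(num_rows)
    for start in range(0, num_rows, batch_size):
        yield order[start:start + batch_size]

# The multi-gigabyte rows are never copied or rewritten; only a
# permutation of integers is shuffled, and each batch is fetched
# on demand via random access.
for row_ids in shuffled_batches(num_rows=1_000_000, batch_size=256):
    batch = fetch_rows(row_ids)
    break  # in real code, hand `batch` to the training loop
```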
This problem is, of course, exacerbated by the scale of AI data, especially multimodal data. Looking at tabular data from the past, one row of simple scalar columns averages about 150 bytes. Add embeddings and it gets roughly 20 to 25 times larger, depending on the number of dimensions; add images and that's another 20x; add videos and it gets pretty astronomical — and that's a single row. And with generative AI, data is no longer limited by the speed at which manual human interaction can generate observations: new rows of data are being generated at thousands of tokens per second, so scale blows up quickly. In the past — and I've been in data a long time — if you were in the tens of terabytes, you were a fairly large company. These days, if you're working in generative AI, it's not unheard of for ten- or twenty-person teams to be managing tens of terabytes to even petabytes of data.

So what does the Lance format do to solve these problems? First, it's a columnar file format — like Parquet, but optimized for AI. It gives you the ability to do fast scans, like Parquet, and it supports fast lookups, unlike Parquet: we got rid of a big limiting factor in Parquet called row groups, which also allows you to store blobs inline. Second, Lance is a lightweight table format: as you add data, it's automatically versioned, and you can add columns without copying the original dataset, which makes it a lot easier to add experimental features to a large multimodal dataset and roll them back later — we call this zero-copy schema evolution. It also supports time travel, so if you make a mistake, or there's an error or bad data, rolling back to a previously known-good version is instantaneous and doesn't corrupt downstream model-training processes. The third interesting aspect is indexing extensions. Parquet has indices, and they can quickly tell you which rows you need, but because Parquet doesn't support random access, fetching those rows is still slow even once you know which ones you want. Not so with Lance. We've added indexing extensions for embeddings, so you can do essentially billion-scale vector search directly off S3; scalar indices that make filtering on metadata columns really quick; and full-text search indices for keyword or fuzzy search, directly from your S3 dataset — you don't really need that Elasticsearch cluster anymore.

What Lance gives you is the ability to have a single table for many, many different workloads. If you have metadata or time-series columns, you can plug Lance directly into, say, DuckDB or Trino or Spark and run SQL on it. If you're storing large blobs and tensors — videos, text, images — you can plug the same table into PyTorch training. And if you have embedding vectors, you can use the vector index to do similarity search. That makes the full AI workflow — analyzing and exploring your dataset, searching and retrieving over it, fine-tuning and training your model — a lot easier. Around this format we've built LanceDB: the vector database, and the more general database for multimodal AI. One big feature is distributed vector search — searching through billions of vectors at low latency and very high QPS with an order of magnitude less infra than other vector databases — and it provides data infrastructure for all of your multimodal data needs.
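As a hedged sketch of what that single-table experience looks like from Python (using the lancedb package's basic API as I understand it; the table name and rows are invented):

```python
import lancedb

# Connect to a local directory; object-store URIs work similarly.
db = lancedb.connect("./my_lance_data")

# One table holding scalar metadata alongside an embedding vector.
table = db.create_table(
    "clips",
    data=[
        {"id": 1, "caption": "sunrise over hills", "vector": [0.1, 0.9]},
        {"id": 2, "caption": "city at night", "vector": [0.8, 0.2]},
    ],
)

# Vector similarity search over the same table you could also scan
# with SQL engines or feed into a training loop.
hits = table.search([0.15, 0.85]).limit(1).to_pandas()
print(hits[["id", "caption"]])
```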
When we talk about multimodal, we often think narrowly about image or video generation, but when you look at the data, "multimodal" has several meanings. First, the data itself can be multimodal: unlike traditional tabular data, we can store features, audio waveforms, images — everything we're already familiar with — and of course vectors (and a vector is a vector, whether it's an image embedding or a text embedding). Second, the workload can be multimodal: not just OLAP SQL, but vector search, full-text search, filtering, and other dataframe and SQL workloads. And finally, the use case can be multimodal: operational scenarios, where you're running a production service for RAG, search and retrieval, or personalization; training; or your data lake, for analyzing and exploring all that multimodal data you have.

Yeah — so from my team's experience, and a lot of what Chang is describing, we just think speed is probably our best strategic bet. A lot of the tools we worked with really slowed down under load and under new multimodal needs, and we're looking to develop what the future of those data systems looks like. So thanks so much for listening to our talk. [Applause] QR codes: both Character.AI and LanceDB are hiring. On your left is the QR code for Character's job board, and our LanceDB Discord is on the right. Check us out — and thank you for coming.

Up next, we are really excited to welcome Stefania Druga, who is going to be talking about the multimodal future of education. She's been doing research in this domain since 2015, so we're super excited to hear from her.

Thank you so much. Is this on? Can you hear me? Hi everyone, I'm Steph, and I'm going to talk about the future of education with multimodal AI. We're here at an AI engineering summit, and AI engineering starts very, very early — so I'm curious: how many of you have kids? How many people in the room have kids? Okay, wonderful. And how many of your kids have played with generative AI so far? Okay. So you won't be surprised by the next slide: basically 70% of generative AI users are from Generation Z — this is a recent study from Salesforce. So it starts very early, and the reason I care about the future of education with generative AI is that education needs a wake-up call. We know that early literacy rates ... she can run around the room and the robot is constantly going to try to find her. Now, this is the first step. It uses, as I mentioned, a block-based programming language extending Scratch, and at the time it allowed kids not only to program their smart lights and voice assistants but also to train their own custom models — models trained on examples of images or text that they could then use in their own games and applications. For example, here a student trained a model to distinguish between unicorns and narwhals; when he plays the game, he not only gets a prediction but also the confidence level — how confident is his custom model that the drawing is a unicorn — and we can see the confidence is pretty low.
They made all sorts of things: looking at what's in their food, programming games like rock-paper-scissors, getting the robot to talk like Shakespeare. This was used all over the world, translated into more than 30 languages. And the good news is that we evaluated it, to see how it increases critical understanding of AI and how it helps with AI literacy. To do that, I ran a longitudinal study in public and private schools: we asked kids what they think about AI, then let them engage in AI learning activities, and then asked the same questions at the end. What we found, after they learned how to do text training, image training, and smart-home programming, is that they became much more skeptical of the AI's smarts. In the beginning they would say things like "yes, Google Home is smarter than me" or "this model is much better than me"; after they learned how it works and how to train it, they were not so sure it was smarter than they are. I'll show you a quick video of how that went. [Video] "For a lot of the more tech-savvy parents, it's: go for it — technology is going to be a huge part of their lives, much more so than of my life. If this AI technology is scary for some people, I totally get it, but as a parent and as a teacher I thought it was really important, because these are skills that 21st-century kids need to have. When my dad was young, he bought a car and took it apart to see how it worked. So you teach people that young how these things — which grown-ups mostly program — actually work."

So, as I was saying, AI engineers are years in the making, and this is the significant difference in kids' perception of the smarts of AI before and after these learning activities. So why did that happen? Why did they become more skeptical, more critical, and also more literate in how to read and write with AI? Because by providing this platform and allowing them to tinker, form hypotheses, and test them, we let them engage in the scientific process — just like researchers do, just like we do. But we needed the right sandbox, the right platform, for them to be able to tinker and iterate quickly. And kids are not alone in learning this: parents need to learn too, and so do teachers. During the pandemic, when kids were stuck at home with their parents, we saw a huge opportunity for them to learn together. I'll show you one of the early demos of Cognimates. (Oh, the audio is not working on this one, I'm not sure why... there you go.) "You did it! Now I need you to help me ask a question. For that we'll need the ask block — see if you can find it." Awesome. So the thing you're programming is collaborating with you to teach you how to program it. Now imagine applying that to any of the chatbots we have today — when you're not happy with the answer, or the answer is not age-appropriate, or you want to teach the model something about your language, your culture, the weird facts you're interested in. How do we do that?

So I did another study, with kids and parents in ten different US states over multiple weeks, where we first wanted to learn how to design a copilot for programming for families: before we start and build it, what do they want, what works, and what doesn't? What we found was that one of the things kids and parents liked most was generating coding ideas with an AI friend — having a copilot in Scratch.
This was very, very helpful — here are some quotes. One participant said most people would like coding with an AI friend because one of the hardest parts of a project is when you start: you run into a wall because you're out of ideas — and the AI friend helped with that. It also allowed them to express and elaborate their ideas in code: if they had an idea for a game — "I want to make the bear jump over the hedgehog" — but didn't know how to do it, it would help them find the right code constructs. More importantly, it supported their creative coding identity: it wasn't the bot doing all the coding; they were doing it, and the bot just helped when they were stuck. It also encouraged kids and parents to work together, which is not always easy — I've been working with kids and families since 2015, and it's not always easy — so having a third moderator saying "what does Mommy say? what does Daddy think? take turns, try this" really helped with family joint engagement. It doesn't always work, though: sometimes it's too distracting, so it was very important to enable families to shut it off — maybe they want to do the game or the coding alone — and stop it whenever they wanted. If you have multiple siblings fighting over the laptop, it cannot help with that. And if the concepts were too complex, it couldn't always scaffold and break them down, so parents were very helpful there.

After understanding the core things families want from co-creating and learning to program with an AI friend, we went and evaluated the generative AI models to see whether they could deliver. For Scratch, the top generative AI models are pretty good at generating explanations and offering ideas or questions that help kids and parents explore and test new games. This was published, and we created a benchmark for measuring it as well. And this is just an example of what the future of education with multimodal AI could look like — applied to Minecraft, to games, to physics and science simulations. It can become a creative sidekick: there are a lot of people who love to build things with their hands — what if I could take pictures of flowers and colors I like, get ideas, and have it help me generate 3D models I could then print and paint? Or, if I'm into knitting, use a generative AI model to inspire my knitting projects. It can also be a learning companion and a coach. It can help with math: together with Nancy Otero, we created the first benchmark for math misconceptions, covering the most common math problems kids have in K-12 and evaluating how good state-of-the-art generative AI models are at identifying those misconceptions when kids talk with a chatbot — I put up a link if you want to download it. So I am here to invite you to think about AI engineering and AI tinkering for all ages, and about how we go from my experiments with Cognimates to the things people are building and tinkering with on Hugging Face — making sure we open up the space, so that we use AI not just to teach, but so that people use AI to learn how to tinker, learn by playing, and learn by doing.
And I like to do what I preach, so I'm going to show you what I tinkered with last night — these are very fresh demos, built on the latest Gemini API. I have three demos; let's hope they work. Let's start with the science one. I was hoping to draw in real time, but I don't have a table, so luckily I have some drawings, and we'll see how well this works. So I have a drawing — oops — of a scale with a weight on each side. "What would happen if you had another 5 kilograms?" So it's asking me questions based on my drawings, and then I can make a new drawing with 10 kg and 10 kg and see if that gets better. Let's try another one: water and CO2 — what happens if they get mixed? And imagine I had a webcam over a table and were drawing in real time; we could play with it — it's very interactive. Let's try this one: the Earth is being hit by something — hopefully not. Let's add one more arrow and see what happens. Ah, that was fun — it finally understood it was the moon. Let's play with the math one: solve the expression inside the parentheses. I have one where I did that — okay, solve the multiplication with the parentheses — and let's assume I've done that too, so I have the next question. (I need a better background for this demo, that's for sure.) "The first step is to simplify" — no, no, go back — okay: "any number divided by itself equals one." You see, it doesn't give me the answer; it gives me a question, so I can keep trying and keep learning. Let's try a more complicated one: locate four on the axis — I was hoping it would give me a better question. The last one is the one that encourages curiosity. This one: "What is the lady doing?" Okay. "What are the colors on the flag? What shape is the star?" Oh, it asked me things about Jordan. Let's see what it does with an apple: "Do you know what apple this is?" "What does the apple smell like?" I had a nice origami piece as well... I don't know where the origami Dory went, but you sort of get the gist. Do you want me to draw something, or do you want to ask a question of the science or math or objects demo? Any requests from the audience? Don't be shy. What should we ask? A system of equations — yeah, can you tell me what to write? 2x + 7 = 2. Thank you, let's try it; let's see if it does well with my handwriting. "Subtract seven from both sides of the equation" — not bad. And now if I do that: "divide both sides by two," and so on and so forth.

Now, the cool thing is that I made the code open source, as a template, so you can play with it too. It's less than 100 lines; you just need to create an API key, which is free, and you can create your own instructions. So hopefully I've inspired you to think beyond chatbot interfaces and delegated questions — to think more like a tinkerer, and to think about how we can put these tools in the hands of young people, because they are the future and they need to learn about this technology and how it works as well. I think that's my time. All my research is on my website, and I put up a QR link for that as well. I look forward to your questions afterwards. Thank you so much. [Applause]
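In that spirit, here is a minimal sketch of what a Socratic drawing tutor could look like with the google-generativeai Python SDK — the model name, system instruction, and image path are assumptions for illustration, not Steph's actual code:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # free key from Google AI Studio

# The system instruction is what turns answers into guiding questions.
model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction=(
        "You are a tutor for kids. Look at the drawing and reply with one "
        "short, encouraging question that nudges the student toward the "
        "next step. Never state the final answer."
    ),
)

drawing = Image.open("whiteboard_snapshot.png")  # e.g. a webcam frame
response = model.generate_content([drawing, "What should I try next?"])
print(response.text)
```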
And next we're excited to have Kwindla Hultman Kramer, talking about building voice AI agents.

Thank you. So, my name's Kwindla; I work at a company called Daily. We make real-time audio and video infrastructure, and when GPT-4 was released last year, we started building lots and lots of experiments. We thought GPT-4 was so good at conversational exchange in text that it really should be able to talk to you — literally talk to you, with a voice. So we built a bunch of things like this. This is probably a year ago, which in gen AI terms is ancient, ancient history, but it's probably worth looking at as context. [Demo] "I'm your friendly chatbot, here to assist you with any questions or tasks you have in mind. How can I help you today?" "Uh, can you tell me a joke?" "Absolutely! Why don't skeletons fight each other—" "You always start with that one. Give me a different joke." "All right, I'll give it another go. What do you call fake spaghetti? An—" "Heard that one too. Come on." We've used "tell me a joke" from very early on all the way through today, and I think we now know all the jokes that all the LLMs we use regularly tell — which is funny, but not necessarily funny in the same way they were the first time we heard them.

So this is a really high-level schematic of what we're trying to do: we've got a user on a phone or a laptop who wants to talk to their device, and somewhere in the cloud we've got a bunch of GPUs doing a whole lot of heavy-duty compute, and we need to talk to that cloud computing resource somehow. As soon as you build stuff like the video we just saw, a couple of things come very much top of mind. One is that speed really matters; the other is that architectural flexibility is really, really important. I'll talk about both of those things today.

Let's start with architectural flexibility. That really nice, clean diagram gets really messy really fast — not unusual in an engineering or software-development problem domain — but I made a slide based on looking at a bunch of source code and thinking about all the conversations I've had with colleagues, customers, and friends who are building this stuff. It turns out you have to be aware, at some level, of a bunch of these things if you want to build real-time, robust voice AI deployed at production scale. It's a little bit of an intimidating map — we're definitely putting the "multi" in multimodal AI here — running from audio processing (echo cancellation, CPU management while encoding and decoding audio and video), through networking issues like firewall traversal, all the way to building retrieval-augmented generation and tool calling so that your real-world applications are really, really useful.

We can collect that messy map into a few higher-level categories; they're worth going through quickly because they give you a sense of the territory. You need really robust, low-latency media processing and transport: you've got to encode the media and send it over the network, and that has to work really well and really fast. You need really good, fast transcription — at least until the future of truly multimodal, audio-native models arrives, which will happen at some point; and even after that, you'll probably still need to go from audio to text for lots of kinds of AI use cases. You have to do a lot of real-time data pipeline and buffer management: in Discord I've probably answered the question "why is my audio stream not working when I did local development on my Mac and then pushed it to an Intel box in the cloud?" twenty or thirty times — endianness issues always get you when you're writing low-level audio code — and there's a lot of pipeline and buffer management that comes into play here.
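As a quick aside on that buffer-management bucket, here is the endianness trap in miniature (generic PCM decoding, not Daily's code): the same two bytes of a 16-bit audio sample decode to different values depending on assumed byte order, so a pipeline that trusts the host default can break on redeployment:

```python
import struct

raw = b"\x01\x02"                   # two bytes of one 16-bit PCM sample
print(struct.unpack("<h", raw)[0])  # little-endian: 513
print(struct.unpack(">h", raw)[0])  # big-endian:    258

# Safe habit: decode whole buffers with an explicit byte order
# (little-endian is the WAV/PCM convention).
buf = b"\x01\x02\x03\x04"
samples = struct.unpack("<" + "h" * (len(buf) // 2), buf)
print(samples)                      # (513, 1027)
```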
There's a lot of pipeline and buffer management that comes into play here. You want to be able to swap between models for a whole bunch of reasons, or use multiple models together. You generally need to call out to external systems. You have to do things like phrase endpointing, which is the fancy academic term for figuring out when the person is done talking and expects the AI to talk. You need to handle interruptions gracefully, and that is a whole rabbit hole of its own: people will interrupt the bot, and you need to figure out how you're going to handle that and how you're going to maintain state when they do. You have to do echo cancellation, unless you can convince everybody in the world who uses your thing to wear headphones 100% of the time. I've been doing audio and video development on the internet for a really long time (our core product at Daily is a WebRTC tech stack), so I can tell you that you will not convince everybody in the world to wear headphones all the time. Maybe if you're only deploying to people in a professional call-center context, but even then it's pretty tough. And you need good, fast text-to-speech, or voice generation, on the other end. It's a lot, but when it all comes together it really is amazing to build with these new genAI tools. You can build stuff that feels magical and new to people but is genuinely useful, stuff that just feels like how things should be. Here's an example of something we built for a healthcare customer that solved a real pain point. If you've been to the doctor, which I'm sure all of us have, you've had to fill out way more forms than you thought you should at subsequent visits. Maybe you filled them out on paper; maybe you got a link to a form to fill out before the visit, which is the current state of the art. What if an LLM could just call you on your phone, or you could click a link in an app, and you could have a conversation and never fill out a form again? "Hello Chad, my name is Jessica and I am with Tri-County Health Services. To ensure your privacy and to confirm your identity, could you please provide me with your birthday, including the year?" "Uh, yeah, it's January 1st." "I appreciate that, Chad. Can you also tell me..." "Oh, sorry, yeah, 1983." "Thank you for providing your birthday, Chad. Now could you please list your current prescriptions, including the medication name and the dosage for each?" "Uh, yeah, I take Amoxicillin and also Gabapentin." "Great, could you please provide the dose?" "Oh yeah, sorry, 10 milligrams and 100 milligrams respectively." "Thank you for that information, Chad. Additionally, do you have any allergies that we should be aware of? If so, please list them." "Uh, yeah, I'm allergic to latex and I'm also allergic to penicillin." "Thank you for letting us know about your allergies, Chad." So I think you can see how all those pieces we just talked about are actually necessary to pull together something like that, but when you do, it feels simple; it feels like it's supposed to. That's a real video, nothing in there is edited. The call-outs to the external system are mocked (it's not a real electronic health record system on the back end), but the tool calling is real. And that stuff is now starting to move into production in the real world.
In a lot of cases the genAI frontier is moving very quickly, and people are really pulling it out of our hands, which is super exciting. So let's switch gears and talk about fast, because that's the other really big piece here: how fast do we need to be, and how fast is fast? A quick side note: there are always two kinds of fast when you're talking about engineering things, throughput and latency. These days, for conversational interactions, throughput is pretty okay for all the tools we use; LLMs and other tools can generate content as fast as people can read it or listen to it. What's hard is latency, and latency here is the time to first byte, the time to first token. In lots of engineering contexts there are trade-offs and complicated relationships between throughput and latency. One of the graphs I sometimes show in these talks is that throughput tends to improve by an order of magnitude every couple of years in lots of domains, while latency improvements tend to be linear and way behind throughput improvements. So latency is hard, and latency is mostly what bites us here. Consider human conversational latency: if I'm talking to another person, it feels weird to me if that person doesn't respond in about half a second. Sometimes people respond a lot faster. As humans, we seem to be natively doing something like speculative decoding, next-token prediction; that's what we do. I know what you're going to say four or five words before you finish saying it, I'm queuing up my response, I'm doing my inference in this greedy fashion. If you say something I didn't expect, I can reroute, but most of the time I'm right. If you actually record people in conversation, they'll commonly respond in 200 or 300 milliseconds, and if they don't, they'll give you some kind of cue. So the 500 millisecond target is pretty important, because we hit the uncanny valley pretty quickly when we're above it. In fact, in that video of my colleague Chad you just saw, if you watched it with a critical eye, what I hope you saw was pretty cool orchestration of state-of-the-art genAI stuff, and probably slower response times than really should be there. So I've spent the last couple of months thinking a lot about how to improve these response times. Just as a benchmark of how hard this is: Gemini Pro's time to first token is something like 900 milliseconds, so if you're aiming for 500 milliseconds total, you're already almost double before you do anything else, before you even send anything over the network to other services. So which models and tools you choose matters a lot, and it matters a lot how you string them together. To pop up a level again: this is what we're trying to achieve, and the most powerful tool we have today for making everything run fast in this domain is putting as much as possible together into one compute container. If the really big things we're trying to do are natural-language speech-to-text, then phrase endpointing (when should the bot process or talk), then LLM inference, then voice output, and we can run all of those locally and colocated, we're way ahead of where we'd be if we can't. And this is worth emphasizing, because for 95, 98, 99% of the stuff we're all building today with genAI, we're calling out to hosted services.
There are a lot of really good reasons for that, but it's tough in this domain if latency is what you're prioritizing. Latency might not be what you're prioritizing, and that's okay; there are lots of different trade-offs you can make. But if you're trying to make things really, really fast, you need to figure out how to host things yourself, in a way where you can tune, control, and combine everything. This is the part of the talk where I look at the clock, look out at all of you, and try to figure out how much tolerance you have for me talking about latency, because I will, maybe ironically, talk about latency for hours; it's what I'm obsessed with as an engineer. I do think it's worth quickly going over this list of best-case latency numbers for a typical voice AI context, because some of them are non-obvious. First, what are we actually measuring? We measure from the time I stop talking. Picture a green waveform on one side and a purple waveform on the other side of an audio editor: there's the time I stop talking, then some kind of gap (usually silence; we could play hold music if it's too long), and then another waveform on the other side when I first start to hear the LLM talking to me. That's the gap we care about, the voice-to-voice latency, and it has to include everything: audio encoding, sending stuff over the network, all the processing, sending stuff back, and playing it out of the speakers. The very first number here is actually kind of shockingly high: if you're using the laptop mic on macOS, it's something like 30 to 40 milliseconds of delay. There are some good reasons for that; a lot of audio processing (again, echo cancellation, background noise removal, gain control) plays a role here. As these use cases get more important, maybe hardware and operating system manufacturers will push this down, but today you're about 40 milliseconds into your 500 millisecond target before the bytes ever get into any of what we think of as software we control. Then you're encoding the audio, sending it over the network, doing some packet handling, reassembling those packets on the other side in what we call a jitter buffer, and decoding, and then you're finally at the transcription stage. If you're really careful about transcription and endpointing, you can probably get that down to a couple hundred milliseconds. Then you're doing LLM inference, and in this context you're really hoping to get your time to first token from the LLM in maybe another 100 milliseconds. Then you're doing text-to-speech, and then you're going back over the network and playing audio out on the other side. These are real measured numbers, roughly medians; there's obviously some hand-waving, because it's hard to measure everything perfectly and there's huge variation, but you're at something like 600 to 700 milliseconds, optimistically but realistically, in this world. That gives you a sense of where we are. Yesterday we posted a demo on Twitter of really trying to drive all this latency down, and we have an open-source container that you can look at, play with, and hack on that gets you to about 500 to 700 milliseconds of voice-to-voice latency.
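Summing the stage-by-stage numbers Kwindla quotes gives a useful back-of-envelope budget. Here is that arithmetic as a quick sketch; the values are the rough medians from the talk, and the aggregated network and playout entries are assumptions added to round out the path:

```python
# A rough voice-to-voice latency budget. These are illustrative numbers,
# not measured constants; the talk's estimate lands at 600-700 ms.
BUDGET_MS = {
    "mac_mic_and_os_audio_processing": 40,
    "encode_network_jitter_decode": 60,    # assumption: rough aggregate
    "transcription_and_endpointing": 200,
    "llm_time_to_first_token": 100,
    "text_to_speech_first_audio": 100,     # assumption
    "return_network_and_playout": 100,     # assumption
}
total = sum(BUDGET_MS.values())
print(f"voice-to-voice: ~{total} ms")      # ~600 ms
```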
"Hello, I'm fbot. How can I help you today?" "Let's play a geography game. Do you want to give me some geography questions to answer?" "That's the spirit. I'll give you a question: what is the world's largest desert?" "Oh, I think it's the Sahara." "That's correct. The Sahara Desert is indeed the largest hot desert in the world, covering most of North Africa. Well done. Next question: which river is the longest in South America?" "It's got to be the, uh, not the Nile, sorry, the Amazon." "No need to apologize, you're absolutely right again. The Amazon River is indeed the longest river in South America and one of the longest rivers in the world. Good job. Next question: what is the capital city of Australia?" "Canberra." "Correct again, Canberra is the capital city of Australia. You're on a roll. Next question: which mountain range runs along the border between France and Spain?" "The Pyrenees." "Exactly right, the Pyrenees mountain range does indeed run along the border between France and Spain. You're really showing off your geography skills here. Next question: what is the largest city in Scandinavia?" "Visit... Stockholm." "I think there might be a slight mistake: visit.com is not a city but rather a travel website." "I was trying to say Stockholm." "No worries, I knew what you meant, and you're absolutely right again. Stockholm is indeed the largest city in Scandinavia and the capital of Sweden. You're doing great. Next question: which city is the largest in Poland?" "Kraków." "Close, but not quite. Kraków is a beautiful city in Poland, but it's not the largest. The correct answer is Warsaw, which is the capital and largest city of Poland." "Oh, I feel like I should have known that. If we're going to talk about capitals, maybe I'll ask you some questions. What's the capital of France?" "I'm ready for a challenge. The capital of France is Paris." "Germany." "Not quite, Germany is a country, not..." "What's the capital of Germany?" "Good one. The capital of Germany is indeed Berlin." "And what's the capital of Stockholm? I mean, sorry, what's the capital of Sweden?" "No worries, the capital of Sweden is Stockholm." So the idea there was not to do prompt engineering; clearly I could have done a better job on the prompt. It was to really test latency and show all the latency numbers in one place, because it's useful to visualize this stuff, and that UI is open source too. The big number in the middle is the voice-to-voice latency measured on the client side, and the numbers around the edge are per-stage metrics sent back from the server doing all the work, so you can see the variation, correlate things, and build good intuitions. The architecture here is two models by a company called Deepgram for the transcription and the voice generation, which are really good compromises between quality and speed; Deepgram has a hosted service, but they also let you run those models on premises in little Docker containers. And the LLM is Llama 3 8B, I think, because I couldn't quite get the 70B to run as fast as I wanted, although in theory that's possible. I'll post some links to this if you want to look at it more.
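The demo's actual source isn't shown in the talk, so here is only an illustrative, blocking sketch of the colocated speech-to-text, LLM, text-to-speech turn it describes. The three stage functions are stand-ins, not Deepgram's or Llama's real APIs:

```python
# Illustrative only: one server-side conversational turn, with stand-in
# stage functions in place of the real colocated services.
import time

def transcribe(audio_chunk: bytes) -> str:
    # stand-in for a local Deepgram STT container (~200 ms with endpointing)
    return "what is the largest desert"

def llm_reply(prompt: str) -> str:
    # stand-in for a colocated Llama 3 8B server (~100 ms to first token)
    return "The Sahara is the largest hot desert."

def synthesize(text: str) -> bytes:
    # stand-in for a local Deepgram TTS container
    return b"\x00" * 16000

def handle_turn(audio_chunk: bytes) -> bytes:
    t0 = time.monotonic()
    text = transcribe(audio_chunk)
    reply = llm_reply(text)
    audio = synthesize(reply)
    print(f"server-side turn took {(time.monotonic() - t0) * 1000:.0f} ms")
    return audio

handle_turn(b"...")  # in the real system this runs per utterance, per WebRTC track
```

In the real system each stage streams into the next rather than blocking, which is how the demo keeps time to first audio low even though the total work is the same.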
Because we solved so many problems over and over, we thought it would be great to have an open-source framework for this stuff. We've seen this in other parts of the AI landscape: things like LangChain and LlamaIndex are really valuable, and this is sort of that for real-time and multimodal AI. This slide probably looks familiar, because I stole the list of hard problems from it and made the slide I moved higher up in the talk today. This is an open-source framework called Pipecat. It's gotten a bunch of traction recently, and it's vendor-neutral, even though it came out of work we did at Daily early on. We're really excited about it, and it's super fun to be getting lots of community contributions now. If you're trying to build really fast multimodal AI stuff, I think it's at least worth taking a look at. You can build things like conversational bots, speech-to-speech language translation apps, voice-controlled agents of various kinds that control your software user interfaces, and real-time vision-model stuff like the awesome last presentation, which is also baked into Pipecat services now. Here's all the stuff that's supported in Pipecat today; we're adding things all the time, and you can add things too. So if you're interested in building, please hang out with us in the Pipecat Discord. If you want to contribute a service plugin, please do, and if you want to be a maintainer for an open-source project, which is a lot of fun, ping me; maintainers are gold in the open-source world, and we're always trying to recruit great ones. Just one last slide about the context here: this is the Pipecat GitHub star chart, and the day it went vertical was the GPT-4o announcement. We are going to get great multimodal models, and they'll be incredibly useful; they'll make building super fast stuff easier and easier. But (a) they're not here yet, and (b) we're still going to need orchestration layers for all this stuff. Also, the demo I showed a minute ago, which I posted yesterday, is now at 175,000 views on Twitter, so there's more and more interest in voice AI, and we'd love to have people come build with us. [Applause] Awesome, thanks so much, everyone, for joining us for the multimodal track, and that is the end of the sessions today. [Music]
Ladies and gentlemen, please return to the ballroom to take your seats for our afternoon program. We will start our program in 10 minutes. Thank you. [Music] Ladies and gentlemen, our program begins in five minutes. [Music] Ladies and gentlemen, please take your seats, our program is about to begin. Thank you. Ladies and gentlemen, please welcome to the stage the head of developer experience at OpenAI, Romain Huet. Wow, good afternoon everyone, super excited to see you all here today; such incredible energy here. I'm Romain, I lead developer experience at OpenAI, and before joining OpenAI I was also a founder, and like many of you in this room I experienced firsthand the magic of building with the frontier models. Now I'm working on making sure we offer the most delightful experience for all of you builders in this room, and what I love most about this role is showing the art of the possible with our AI models and technologies. So today we're going to go through a few of the great capabilities the OpenAI team has built recently, and we'll show some live demos to really bring them to life. But first I'd like to start with a quick zoom out on how we've gotten to where we are today. OpenAI is a research company, and we're working on building AGI in a way that benefits all of humanity. To achieve this mission we believe in iterative deployment: we really want the technology to enter into contact with reality as early and as often as possible, and for that, a top focus for us at OpenAI is all of you, the best builders in the world. We believe strongly that the best developers and startups are integral to this AGI mission; you are the ones who are going to build the AI-native products of the future. Today we have three million developers around the world building on the OpenAI platform, and we are very fortunate to have so much innovation. But I'd like to rewind a little bit. Today, outside of this room, when people think of OpenAI they often think of ChatGPT first, because that's the product that has taken the world by storm, but the first product was actually not ChatGPT. The first product we put out there was the developer platform. Back in 2020 we had GPT-3, and that's when we first started launching it to the public behind an API. Maybe a quick show of hands: who in this room played with the API back in the GPT-3 days? Wow, that's more than half of you; you're really my crowd here, that's awesome. At the time we got a taste for what AI would be capable of doing: basic coding assistance, copy editing, maybe some very simple translation.
To really put things in perspective, at the time this was one of the most popular use cases on the platform: AI Dungeon. This was a role-playing game purely based on text; it generated open-ended stories, you could navigate the world, and at each scene, when you tried to look around, it would generate new text. That was the state of the art at the time. Then, in 2023, GPT-4 changed the game. It completely changed the way we thought about AI: it got better at reasoning, more creative, more specific; it could start being better at coding and at reasoning about complex problems; and it could use tools and interpret data, which dramatically expanded the aperture of what was possible with the platform. We've had the great fortune of working with many developers and companies integrating GPT-4 into their own apps and services, and this is just one example among many: Spotify took our models to generate playlists on the fly based on your music taste and history. But the one thing I want to highlight in this talk is that GPT-4 was also the beginning of our multimodality journey. It was the very first time we introduced vision capabilities: suddenly GPT-4 could analyze and interpret data such as images and photos, not just text. Then, with GPT-4 Turbo, for the first time we brought vision capabilities into the same model, so that the exact same model handles both modalities at once. And of course, last month we took a step further with multimodality, and that brings us to GPT-4o, our new flagship model that can reason across audio, vision, and text, all in real time. Thanks to its native image and audio capabilities, we really think you can now build natural human-computer interactions in a way we've never seen before. The reason we call it GPT-4o is that the O is for Omni, an omni model if you will, because we brought all of these modalities into one single model for you to interact with. There are many things that are unique and interesting about GPT-4o, but I'd like to call out a couple of step-function changes. The first is that it's a big step toward this multimodal future, with ultra-fast latency. I'm not sure how many of you have played with ChatGPT's voice mode before, or tried to build your own voice assistant, but you used to have to stitch three models together: Whisper for transcription, then GPT-4 for reasoning, and then text-to-speech at the end. Of course, that added a loss of context and some latency along the way; now all of that is in a single model. And for all of you developers, what's really exciting is that for the same state-of-the-art reasoning capabilities we offer with GPT-4, we also shipped massive efficiency improvements, and we've been able to turn those efficiency gains back to you. As a result, GPT-4o is twice as fast as GPT-4 Turbo, it's half the price, and we drastically increased the rate limits. I have 5x higher rate limits on this slide, but we're always pushing them higher, and we really want to live in a world where there aren't any rate limits at all, so you can build with no limits.
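From the API side, the one-model-for-all-modalities point shows up in how a single request is shaped. Here is a minimal sketch of a mixed text-and-image call using the OpenAI Python SDK; the image URL and prompt are placeholders, not from the talk:

```python
# A minimal sketch of a multimodal GPT-4o request: one model, one call,
# text and image together in the same message.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What landmark is in this photo?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/bridge.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```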
But with all that said, I'd like to give you a glimpse into GPT-4o, so we're going to switch over to my Mac. For this first live demo I'm going to bring up the ChatGPT desktop app; all I have to do is hit Option-Space, and I can bring up voice mode right here. "Hey ChatGPT, how are you?" "Hey, I'm doing great, thanks for asking. How about you?" "I'm great. I'm actually on stage at the AI Engineer World's Fair in San Francisco. Could you greet the audience? We have some amazing developers here in the room." "Hey everyone, it's awesome to be virtually here at the AI Engineer World's Fair in San Francisco." "So, ChatGPT, I kind of like how you started, but I really want you to bring the energy; people are very excited here, so make this more live. And don't forget, people are also tuning in from the live stream, so don't forget about them." "Hello San Francisco, and to everyone joining us on the live stream: welcome to the AI Engineer World's Fair! We've got an incredible lineup and a ton of energy here today. Let's dive into all the exciting tech and innovation!" "That's much better, thank you ChatGPT, I think they really enjoyed that." "If you need anything else, just let me know. Hope the event goes fantastically." "Yeah, maybe one thing you could do, since we have a crowd of the best builders out there: can you whisper one secret that's really precious for building the future of their AI-native products? What should they know?" "All right, just between us: remember, the key to building great AI-native products is focusing on responsible and ethical transparency." "Glad you liked it. Can you whisper another one, but even lower, very, very slow?" "Sure, here's one more: it's crucial to keep your AI adaptable and scalable. Technology evolves fast; your products should too." Cool, thank you. A couple of things to call out here. First of all, you can see the latency is pretty impressive; it's very natural in conversation, there's literally no difference from talking to humans, and it's almost too fast at times, I would say. The second thing that's really cool is that it understands emotion and can generate tones: it can whisper, it can bring the energy, it can be more alive. And I can jump back into the conversation at any time; I can interrupt the model, and I don't have to wait for the audio output to finish, which makes things much more natural versus what we knew before. The next thing I'd like to show you, since we're talking about modalities, is the second modality: let me turn on the camera here. "Hey ChatGPT, can you see me now?" "Yes, I can see you." "You know, you don't have to whisper anymore, you can talk normally now." "Got it. How's the event going?" "It's good. Do I look good, actually? Because I'm in front of a lot of people here." "You look great, ready for the spotlight." "Awesome. Okay, give me a second, I'm going to try to draw something for you, and I'll show it to you." "Sounds good, I'm curious to see what you come up with." "Just give me a few more seconds." "You know, take your time. I'm here whenever you're ready. Patience, no rush." "Okay, so based on what I drew here, can you take a guess at where I am? And also, can you translate what I wrote?" "Yes, that's the Golden Gate Bridge, so you must be in San Francisco, and you wrote 'bonjour développeurs,' which means 'hello developers' in French." Pretty cool. That was a pretty poor bridge drawing, but ChatGPT is pretty clever. Next, I have this book. "Can you take a look, ChatGPT? What is this?" "Yeah, that's Poor Charlie's Almanack by Charlie Munger."
"It's a collection of the speeches, talks, and life philosophy of Munger, who is known for his wisdom on investing and decision-making." "Yeah, can you just give me a random number between 1 and 400?" "Sure, how about 126?" "126, here we go. Okay, page 126: can you give me a brief overview of what's happening on this page?" "It looks like this page talks about Coco Chanel's rule for success. It mentions that Chanel focused on identifying the qualities her customers wanted and made sure to offer them; essentially, understanding your audience's desires is key to success. Would you like a bit more detail from the page?" "No, I think I'm good. This is pretty amazing; you read faster than I could read just one line." So, pretty amazing. All right, next I'd like to show you something else. Instead of sharing the camera, I'll share my screen. I'm going to bring up ChatGPT one more time with audio capabilities and reduce this window a little bit. "Okay ChatGPT, I'm going to share my screen with you. Can you see what I see?" "Yes, I can see your screen. What are we looking at here?" "Well, I'm currently working on this app, a travel app, and I'd like to make it a little better, but there's one thing I'm not quite sure how to do, and I'd like your help. If I go to this tab here and try to put the final touches on it, it feels like the responsiveness is not really good. What do you think?" "I see what you mean. When you switch to the mobile view, the layout looks a bit cramped and the images seem squished. Have you considered using media queries to adjust the layout for smaller screens?" "So let's take a look at the code; maybe you could help me. I'm going to send you this whole component first, I'll copy the code over to you. Can you give me a one-line overview of what's going on in that one?" "This code defines a React hook, useAssistant, that manages interactions with an AI assistant. It handles sending and receiving messages, file uploads, and streaming responses from a server." "Okay, cool. I don't think that's where we need to change things; I think it might be over here in the Discover component. What do you think?" "Yeah, it does look like this is where the layout is being set for those descriptions." "I'm using Tailwind CSS, so what exactly should I do to change it and make it responsive here?" "You could adjust the grid-cols attributes for mobile sizes." "What's the prefix I should use for mobile screens?" "You can use the sm prefix for small screens." "Okay, I'll try that. And maybe for medium screens, two columns, and then we keep the three for large screens, something like that?" "Yeah, that looks good." "Take a look now, what do you think?" "It looks much better; the images and layout adjust nicely. Great job solving that. Anything else you want to tweak?" So I really like using ChatGPT on desktop as a coding assistant, and of course this was a very trivial use case, but what's even more interesting is when you start reasoning out loud with ChatGPT to build something and then say, hey, actually I'm going to get Cursor to do it, so what should I prompt Cursor with? I've done that many times, and it's pretty amazing to see how the two of them can interact across modalities. But let's go back to my presentation. I'd like to give you a sneak peek of what's on our mind, what we're working on next at OpenAI, as we think about these modalities and the future of our models.
There are four things that are currently top of mind for us, especially for all of you developers building on the platform. The first is textual intelligence. As you can tell, we're extremely excited about modalities, but we also think that increasing textual intelligence is still key to unlocking the transformational value of AI, and we expect the potential of LLM intelligence to still be very large in the future. Those models today are pretty good; we're building things with them. But at the same time, what's really striking to realize is that today's models are the dumbest they'll ever be; there will always be better models coming. It's almost like we have first-graders working alongside us: they still make mistakes every now and then, but a year from now they might be completely different and unrecognizable from what we have today. They could become master's students in the blink of an eye, in multiple disciplines like medical research or scientific reasoning, and we expect the next frontier model to bring a step-function change in reasoning improvements. The second area of focus we're excited about is faster and cheaper models. We know that not every use case requires the highest intelligence. GPT-4 pricing has decreased significantly, 80% in fact, over a year, but we also want to introduce more models over time: we want these models to be cheaper for you all to build on, and we want models of different sizes. We don't have timelines to share today, but that's something we're very excited about as well. And we want to help you run async workloads: we launched the Batch API a couple of months ago and we're seeing tremendous success already, especially for the modalities. Say you have documents to analyze with vision, or photos or images: all of that can be batched for another 50% discount on pricing. Third, we believe in model customization. We really believe that every company and every organization will have a customized model, and we have a wide range of offerings here. I'm sure many of you have tried our fine-tuning API, which is available for anyone to build with, but we also assist companies all the way through; Harvey, for instance, a startup building a product for law firms, was able to customize GPT-4 on case law, and they've seen amazing results in doing so. And last, we'll continue to invest in enabling agents. We're extremely excited about the future of agents; we shared a little of that vision back in November at DevDay. Agents will be able to perceive and interact with the world using all of these modalities, just like human beings, and once again that's where the multimodality story comes into play. Imagine an agent being able to coordinate with multiple AI systems, but also securely access your data, and even, yes, manage your calendar and things like that. Devin, of course, is an amazing example of what agents can become: Cognition Labs has built this awesome software engineer that can code alongside you; it's able to break down complex tasks, browse documentation online, submit pull requests, and so on and so forth. It's really a glimpse into what we can expect from the future of agents.
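As a side note on the Batch API mentioned a moment ago, the shape of an async workload submission looks roughly like this; a sketch using the OpenAI Python SDK, where the file name and its contents are illustrative:

```python
# Hypothetical example of submitting a batch of requests for the 50% batch
# discount. Each line of batch_requests.jsonl is one /v1/chat/completions
# request in the Batch API's JSONL format.
from openai import OpenAI

client = OpenAI()

batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch",
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # results arrive asynchronously
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```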
And with all of that, it's no surprise that Paul Graham observed a few months ago that 22-year-old programmers are often as good as, if not better than, 28-year-old programmers, and that's because they have these amazing AI tools at their fingertips. So with that, I'd like to switch to another demo, this time to show you not ChatGPT but what we can build with these modalities. In the title of this talk I did not mention video, but I'm sure most of you have seen Sora, the preview of our diffusion model that can generate videos from a very simple prompt, and this is one of them. In the interest of time, I've already sent this prompt to Sora, describing a documentary with a tree frog, very precise about what I'm expecting, and if I click here, this is what came out of Sora. It's pretty cool. But next, what I'd like to do is bring this video to life. What I'm doing here is very simple: I sliced frames out of the Sora video, and I'm going to send these six frames over to GPT-4o with vision, with this prompt if you're curious, telling it to narrate what it sees as if it were a documentary narrator. Going back here, I'm going to click analyze and narrate. Again, this is all happening in real time, so every single time the story is unique, and I'm discovering it along with all of you. And boom, that's it: that's what GPT-4o with vision was able to create based on what it saw in those frames. It's pretty magical. Last but not least, I wanted to show you one thing that we also previewed recently: our Voice Engine model. Voice Engine is the ability to create custom voices based on very short clips, and of course we take safety very seriously, so this is not a model that's broadly available just yet, but I wanted to give you a sneak peek of how it works. Voice Engine is also what we use internally with actors to bring you the voices in the API and in ChatGPT. So here I'm going to show you a quick demo. "Hey, so I'm on stage at the AI Engineer World's Fair. I just need to record a few seconds of my voice. I'm super excited to see the audience that's really captivated by these modalities and what we can now build on the OpenAI platform." All right, that sounds like it's perfect; that's all we need. So now, to bring it all together, I'm going to take this clip and the script that we just generated, send all of it to Voice Engine, and we'll see what happens. "In the heart of the dense, misty forest, a vibrant frog makes its careful way along a moss-covered branch. Its bright green body, adorned with black and yellow patterns, stands out amidst the lush foliage." I can also have it translate into multiple languages, so let's try French; for those who know me, that's actually how I sound when I speak French. Maybe one last one with Japanese. All right, thank you. Let's go back real quick to the slides. Of course, this was one very specific example of bringing modalities together, with Sora video, GPT-4o with vision, and the Voice Engine that we have not released yet, but I hope it inspires you to see how you can picture the future with these modalities combined.
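The frame-slicing-and-narration step isn't shown as code on stage, but a minimal sketch of it, assuming OpenCV for frame extraction and the OpenAI Python SDK for the vision call (file name, frame count, and prompt are illustrative), could look like this:

```python
# Grab six evenly spaced frames from a video, then ask GPT-4o to narrate
# them. This is a reconstruction of the demo's idea, not its actual code.
import base64
import cv2
from openai import OpenAI

client = OpenAI()

video = cv2.VideoCapture("sora_frog.mp4")
total = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
frames = []
for i in range(6):
    video.set(cv2.CAP_PROP_POS_FRAMES, i * total // 6)  # jump to the i-th slice
    ok, frame = video.read()
    if ok:
        _, jpg = cv2.imencode(".jpg", frame)            # encode frame as JPEG
        frames.append(base64.b64encode(jpg.tobytes()).decode())

content = [{"type": "text", "text": "Narrate these frames like a nature documentary."}]
content += [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
    for f in frames
]
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(reply.choices[0].message.content)
```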
So, to wrap up, we're focused on these four things: driving textual intelligence up; making our models faster and more affordable so you all can scale; customizable models for your needs; and finally, making sure you can build for this multimodal future and for agents. If there's one thing I want to leave you with today, it's that our goal is not for you to spend more with OpenAI; our goal is for you to build more with OpenAI. Let's remember, we're still in the very early innings of this transition, and it's a fundamental shift in how we think about and build software every day. We really want to help you in that transition; we're dedicated to supporting developers and startups, and we love feedback, so if there's anything we could do better, please come find me after this talk. This is really the most exciting time to be building an AI-native company, so we want you to bet on the future of AI, and we know that bold builders like all of you are going to come up with the future and invent it before anyone else. So with that, thank you so much, and we can't wait to see what you're going to build with these new modalities as you reinvent software 2.0. [Applause] Ladies and gentlemen, please welcome to the stage the authors of "What We Learned from a Year of Building with LLMs," Bryan Bischof and Charles Frye. [Music] [Applause] Hey everyone. So you're about to experience something of a strange talk, and not just because Bryan and I are strange, but because something kind of strange happened over the last year. A bunch of us were posting things on Twitter and writing blog posts complaining about LLMs, and we formed a little group chat, where we continued to complain about LLMs to each other and share what we were working on, when we realized we were all about to write the exact same blog post: what we learned in the last year. So we got together and turned what was initially a couple of short blog posts into a long white paper on O'Reilly, combining our lessons across the strategic, operational, and tactical levels of building LLM applications. The response to that white paper was overwhelmingly positive: we heard from everybody, from people who contribute to Postgres to venture capitalists to tool builders, saying "we loved what you wrote in that article; I felt that pain too." On the strength of that, we were invited to give this keynote address, and so we faced a kind of funny challenge. Part of the appeal of the article was that the six of us all came together to write it; as Scott Condron put it, it was like an Avengers team-up. So we had to figure out a way to deliver one keynote talk from six people, and we pulled the Avengers together for one night only, to deliver some of the most important insights from that 30-page article, to add some of the spicy extra takes that ended up on the cutting room floor, and to respond to the allegations: I'd like to state unequivocally that we are not, in fact, crypto bros who just found out that GPT-4 is the new Web3. We all trained our first neural networks back when you had to write the gradients by hand. So we split the article, and the talk, into three pieces. First you're going to hear from me and Bryan on the strategic considerations for building LLM applications: how do you look to the future, how do you see around corners, how do you make big decisions? Then we'll hand the clickers and the stage over to Hamel Husain and Jason Liu, who will share the operational considerations:
how do you put together processes, how do you put together teams, how do you think about workflows around delivering LLM applications? And then they will hand the clickers and the stage over to Shreya Shankar and Eugene Yan, who will talk about the tactical considerations for building LLM applications: what are the specific techniques, tactics, and moves that have stood the test of one year's time? All right, so Bryan: how do you build an LLM application without getting outmaneuvered and wasting everybody's time and money? Ah yes, well, many of you may be thinking that there's really only one way to win in this new, exciting, dynamic, and very scary industry, and that of course is to train your own custom model: pre-training, fine-tuning, a little RLHF here and there; you'd better start from scratch, buddy. Eh, not quite. The model is actually not your moat; for almost no one in this audience is the model the moat. You all, as AI engineering devotees, should be building in your zone of genius: you should be leveraging your product expertise, or your existing product if you've got one, and you should be finding your niche, digging into it, and exploiting it. You should be building what the model providers are not. There's a high likelihood that the model providers have to build a lot of things for all of their customers; don't waste your calories on building those things. The Sam Altman phrase "steamrolling" is appropriate here. And you should be treating the models like any other SaaS product: quickly drop them when there's a competitor that's clearly better. No offense to GPT-4o, but Sonnet 3.5 is looking pretty sharp. It's important to keep in mind that a model with high MMLU scores is not a product. 87% on Spider SQL does not automate all data requests, or even 87% of them. You can't sell a HumanEval pass rate; at least, my go-to-market team doesn't know how. An excellent LLM-powered application is an excellent product: it's well designed, it solves a job to be done, and it enhances your user. Why are we so excited about AI? Human enhancement. So what should you build, if not all of these things? Things that generalize to smarter and faster models, things that help you maintain your product's quality bar under uncertainty, and things that help you continuously improve. Whoa, Bryan, continuous improvement: that's my trigger phrase. The idea of continuous improvement has been brought to the world of LLM applications by this shift in focus that we've all felt since the previous AI Engineer Summit, a focus on evaluation and data. It's nicely crystallized by this diagram from our co-author Hamel Husain, showing a virtuous cycle of improvement. It has evals and data at the center, but the core reason to create those evals, the core reason to collect that data, is to drive forward this loop of continuous improvement. And despite what your expensive consultants, or the many LinkedIn influencers posting about LLM apps, might say, this is not actually the first time that engineers have tried to tame a complex system and make it useful and valuable. This same loop of iterative improvement was at the core of MLOps, the operationalization of machine learning models before LLMs; this figure from our co-author Shreya Shankar's paper had that same loop of iterative improvement, centered also on evaluation and data collection. And MLOps was not the first time that engineers faced this problem: the problem of complexity, of non-determinism and uncertainty.
The DevOps movement that gave MLOps its name also focused on this kind of iterative improvement, and on monitoring information in production to turn it into improvements to products. But dear reader, DevOps was not the first time that engineers tackled this problem of uncertainty and solved it with iterative improvement. DevOps built on the ideas of the Lean Startup movement from Eric Ries, which focused not just on building an application, or a machine learning model, or an LLM agent, but on building the entire business, and it used this same loop, centered on measurement and data, to drive the improvement and the building of a business. And this idea itself was not invented in Northern California, despite what some people might say: it has its roots in the Toyota Production System and in the idea of kaizen, or continuous improvement. Genchi genbutsu is one of the core principles from that movement that we can take forward into the development of LLM applications. It means "real things, real places," and at Toyota it meant sending executives out to factory floors to get their khakis a bit dirty. For LLM applications, the equivalent is looking at your data: that data is the real information about how your LLM application is delivering value to users, and there's nothing more valuable than that. Finally, there are lots of people selling tools at this conference, including myself, and it's easy to get overly excited about the tools and the construction of this iterative loop of improvement, and to forget where value actually comes from. There's a great, pithy, earthy statement from the Toyota Production System, from Shigeo Shingo, that I really like: value is only created when metal gets bent. So we have to make sure we don't get lost just building our evals and calculating concept drift, and instead make sure we continue to get out there, bend metal, and create value for our users. Not going to lie, I might have misunderstood earlier when you said "let's get bent." Okay, so right off the bat we need to spin that data flywheel, Bob. Oh wait, sorry, wrong game show. Point is, we need to get this moving; we need to get this in front of users, in front of human beings; we need to express the goals for our system. And how do we do that? With evals. Remember, evals are not convenient, weird, bespoke metrics; evals are objectives. They're what we want our system to do, and any system for capturing this behavior is good enough. I don't have an eval framework to sell you, but what I do have to sell you is the idea that you should be getting out there and getting started. But wait, Bryan, I'm really nervous: what if this isn't good enough for my customers? Fear is the mind-killer. Put it out there in beta. If it's good enough for these incredible companies, Apple Intelligence, Photoshop, and Hex (that's me), it's good enough for you. You need to collect this data; you need to put something in the wild; you need to start looking at your user interactions, the real user interactions. LLM responses deserve human eyes; you can give them some AI eyes too, but definitely look at them with your human eyes. Binary human feedback is valuable; it's nice to add some rich feedback too, which can be interesting, but start with binaries. And finally, user requests will reveal the product-market-fit opportunities that lie below your product's substrate. Where is your PMF? Everybody wants to know. It's in your user interactions: what are they asking your chatbot that you haven't yet implemented? That's a really nice direction to skate, if that's where the puck is going.
And despite the focus on the user interactions you can have today, the things you can ship right now, it's important to also think about the future. The best way to predict the future is to look at the past: find people predicting the present and copy what they did. In designing many of the components of the personal computing revolution, Alan Kay and others at PARC adopted, as a core technique, projecting Moore's Law out into the future. They built expensive, unmarketable, slow, and buggy systems for themselves, so they could experience what that future was like, build for it, and create it. We don't have quite the industrial scaling information Moore had when he wrote down his predictions, but we do have the beginnings of the same kind of law: there's been an order-of-magnitude decrease in cost every 12 to 18 months at three distinct levels of capability. At the capability level of davinci, the original GPT-3 API model that excited a lot of us about the idea of building on foundation models; at the capability level of text-davinci-002, the model lineage underlying ChatGPT that brought the rest of the world to excitement about this technology; and at the latest and greatest level of capability, with GPT-4 and Sonnet. In each case, around 15 months is enough time to drop the cost by an entire order of magnitude. This is faster than Moore's Law, so the appropriate way to plan for the future is to ask what this implies about which applications that are not economical today will be economical by the time you need to raise your next round. In 2023 it cost about $625 an hour to run a video game where all the NPCs were powered by a chatbot; that's pretty expensive. In 1980 it cost about $6 an hour, inflation adjusted, to play Pac-Man. That suggests that if we just wait for a two-order-of-magnitude cost reduction, or about 30 months from mid-2023, it should be possible to deliver a compelling video game experience with chatbot NPCs at about $6 an hour, and people will probably pay for it. You can't sell it now, but you can live it, you can design it, and you can be ready when the time comes.
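The projection in that Pac-Man example is simple enough to check in a few lines. Here is the arithmetic as a sketch; the dollar figures and the 15-months-per-10x rate are the ones quoted in the talk, not independent data:

```python
# Back-of-envelope version of the cost projection above: if cost per unit
# of capability drops ~10x every ~15 months, when does a $625/hour
# experience reach the $6/hour price point of 1980s arcade play?
import math

cost_now = 625.0        # $/hour for chatbot-driven NPCs, mid-2023
target = 6.0            # inflation-adjusted $/hour for Pac-Man in 1980
months_per_10x = 15

orders = math.log10(cost_now / target)   # ~2 orders of magnitude
print(f"{orders:.1f} orders -> ~{orders * months_per_10x:.0f} months")  # ~30 months
```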
So that's how to think about the future, and how to think strategically when building LLM applications. I'd like to call to the stage my co-authors Jason Liu and Hamel Husain to talk about the operational aspects; let's give them a hand. [Applause] All right. So Hamel and I have basically been doing a lot of AI consulting in the past year. We've worked with about 20 companies so far, everything from pre-seed all the way to public companies, and I'm pretty bored of giving generic good advice, especially because there's such a range of operators here. So instead I'm going to invert: my goal today is to tell you how to ruin your business. First of all, everyone knows that in a gold rush you sell shovels, and so if you want to get gold, you've got to buy shovels too, right? If you want to find more gold: keep buying shovels. Where do I dig? Keep buying shovels. How do I know when to stop digging? The shovel will tell you. And how do I choose between digging one deep hole versus making investments in plenty of shallow holes? Again, the answer is more shovels, clearly. This might be too generic, so I'll give you some more specific advice. If your RAG app doesn't work, try a different vector database. If the methodology doesn't work, implement a new paper, and maybe if you update the embedding model you'll finally find product-market fit. Because, truth be told, success does not lie in developing expertise or processes: try more tools. There's no need to balance exploring and exploiting the mechanisms that work for you; just change the tools, because processes and decision-making frameworks don't matter. The right tool will solve everything. Number two: find a machine learning engineer who can fine-tune, as quickly as possible. A $2,000-per-month OpenAI bill is very expensive, so instead hire someone for a quarter of a million dollars, give them 1% of your company, and have them fight CUDA build errors and figure out server cold starts. Because what's the point of growing your company if you're just a wrapper? And if your margins are too low, try fine-tuning; it's much easier than figuring out how to build something worth charging for. I cannot reiterate this enough: it's very important to hire a machine learning engineer as quickly as possible, even if you have no data-generating products. They love fixing Vercel TypeScript build errors. And generally, if you hire a full-stack engineer who's really caught the LLM bug, they're going to lack real experience, because Python is a dead language, right? Machine learning engineers and research engineers can easily pick up TypeScript, and the ecosystem that exists in Python could be quickly reimplemented in a couple of weekends. The people who have written Python code for the past 10 years doing data analysis will easily transition their tools, and if anything, it's really easy to teach things like product sense and data literacy to the JavaScript community. And most important of all, in order to find this kind of magic talent, we need to create a very catchall job title: let's use words like ninja and wizard, or data scientist, or prompt engineer, or even AI engineer. In the past 10 years we've learned that this works really well: we know exactly who we want, as long as we cast a very wide net of skills, and it doesn't really matter that we don't know what outcomes we're looking for. Anyway, to dig me out of this hole, I'll have Hamel explain; you know, take a deep breath and think out loud, step by step. Thank you, Jason. [Applause] So that was really good. I mean, let's step back from the cliff a little bit and linger on the topic of the AI engineer; I heard some booing in the audience. I love the term AI engineer, much props to swyx for popularizing it; it allows us all to get together and have conversations like this. But I think there's a misunderstanding of the skills of the AI engineer, of what skills you need to be successful, and there are a lot of inflated expectations. As a founder or engineering leader, your talent is the most important lever you have, so I'm going to talk about some of the problems, and perhaps some solutions, when it comes to this talent misunderstanding. Just to review: what is an AI engineer? This is a diagram everyone has probably seen. There's a spectrum of skills in the AI space, with an API dividing line in the middle, and to the right of the API dividing line we have the AI engineer. AI engineer skills are focused on things like chains, agents, tooling, and infra, and conspicuously missing from the AI engineer are things like evals and data. I think a lot of people have taken this diagram too literally, taken it to heart, and said, hey, we don't really need to know about evals, for example. The problem is that you can go from zero to one really fast; in fact, you can go from zero to one faster than ever before with all the great tools out there, just by using vibe checks.
than ever before, with all the great tools out there, just by using vibe checks and implementing the tools we talked about. However, without evals, you can't make progress quickly; it leads to stagnation, because if you can't measure what you're doing, you can't make your system better, and you can't go beyond zero to one. So what can we do about this eval skill set and data literacy? Jason and I have found that people can get really good at writing evals and at data literacy with just four to six weeks of deliberate practice; it's very effective. We think these skills, evals and data, should be brought more into the core of the AI engineer. It really helps solve this problem, and it's something we see over and over again.

The next thing I want to talk about is the AI engineer job title itself. Vague job titles can be problematic. What we see over and over again in our consulting is that this kind of catchall role carries very inflated expectations: any time anything goes wrong with the AI, people look toward that role to fix it, and sometimes that role doesn't have all the skills needed to move forward. We've seen this before with the role of data scientist; titles and names really matter. What I want to emphasize is that AI engineer is very aspirational: you should keep learning, and it's a good thing to strive toward, but you need to have reasonable expectations. To bring it back to data science: about a decade ago, when that role was coined, it was a unicorn that had all these skills, software engineering, statistics, math, domain expertise, and we found out as an industry that we had to unroll it into many different roles, such as decision scientist, machine learning engineer, data engineer, and so on. I think similar things may be happening with the role of AI engineer, and it's good to keep that in mind. What we both see in consulting is that it's helpful to be more specific, more deliberate, about what skills you need and at what time, and depending on your maturity, it's helpful to specify not only the skills but the kinds of products you'll be working on. These are some job titles from GitHub Copilot that are very specific about the skills needed at that time. Really, it's important to hire the right talent at the right time on the maturity curve: when you're first starting out, you only need application development, software engineering and/or AI engineering, to go from zero to one; then you need platform and data engineering to capture that data; and only after that should you hire a machine learning engineer. Do not hire a machine learning engineer without having any data. But again, you can get a lot more mileage out of your AI engineer with deliberate practice on evals and data; we usually find four to six weeks of practice does the job. So, in recap: one of the biggest failure modes is talent. We think the AI engineer is often over-scoped but under-specified, but we can fix that by learning evals. Next, I want to hand it over to Shreya Shankar and Eugene Yan to dive into evals and data literacy. [Applause]

Thank you, Jason; thank you, Hamel. Next up, Shreya and I are going to share with you the tactical aspects of building with LLMs in production, specifically evals, monitoring, and guardrails. Here's a Hacker News quote: how important evals are to the team is a differentiator between teams shipping out hot garbage and those building real products. I would agree. Here's an example from Apple's recent LLM work, where they shared how they collected 750 summaries of push notifications and email summarizations, because those datasets are representative of their actual use case. So how do we build evals for our own products? The simple thing is to make it simpler. For example, if you're trying to extract product attributes from a product description, break it down into title, price, and rating, and then you can simply do assertions. Similarly for summarization: instead of trying to eval that amorphous blob of a summary, break it down into dimensions such as factual inconsistency, relevance, and information density. Once you've done that, assertion-based tests can go a long way: are we extracting the correct price? Are we extracting the correct title? Or, if you're doing natural-language-to-SQL generation, is it using the expected table? Is it using the expected columns? These are very simple to eval, and it reiterates what Hamel mentioned about keeping it simple.
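As a concrete illustration, here is what assertion-based evals like these can look like. The fields and SQL checks mirror the examples in the talk; the function names and expected values are illustrative placeholders, not anything prescribed by the speakers.

```python
# Sketch of assertion-style evals: cheap, deterministic unit tests on one
# dimension of the output at a time. All names here are illustrative.
import re

def eval_product_extraction(output: dict, expected: dict) -> list[str]:
    """Check extracted product attributes field by field."""
    failures = []
    if output.get("title") != expected["title"]:
        failures.append(f"title: got {output.get('title')!r}")
    if abs(float(output.get("price", 0)) - expected["price"]) > 0.01:
        failures.append(f"price: got {output.get('price')!r}")
    return failures

def eval_generated_sql(query: str, expected_table: str,
                       expected_columns: list[str]) -> list[str]:
    """Check a natural-language-to-SQL result against the expected schema."""
    failures = []
    if not re.search(rf"\bfrom\s+{re.escape(expected_table)}\b", query, re.IGNORECASE):
        failures.append(f"expected table {expected_table!r} not used")
    failures += [f"expected column {col!r} missing"
                 for col in expected_columns if col not in query]
    return failures

# A passing case: the expected table and columns all appear.
assert eval_generated_sql(
    "SELECT title, price FROM products", "products", ["title", "price"]
) == []
```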
Lastly: assertions alone can't do everything; they can only go so far. So consider model-based evaluation: maybe training a classifier for factual inconsistency, or a reward model for relevance. This is easier if your evals are classification- or regression-based. But, that said, I don't know how I feel about LLM-as-judge.

What do you mean, you don't like LLM-as-judge? I personally am super bullish on LLM-as-judge, and I'm curious how many of you are exploring LLM-as-judge or have implemented it. Oh, there's a judge right here; do you want to stand up? An actual judge, an LLM judge, here. Anyway, we're going to go through some points on what to consider when deploying LLM-as-judge. First of all, it's a no-brainer that LLM-as-judge is the easiest to prototype: you just have to write a prompt to check for the criterion or metric you want, and you can even align it toward your own preferences by providing few-shot examples of good and bad for that criterion. On the other hand, fine-tuned evaluator models, where you have to collect a lot of data and set up a training pipeline, are not as easy to prototype and carry a lot of upfront investment.

Yeah, but that said, with LLM-as-judge it's pretty difficult to align it to your specific criteria in the business. Who here has not had any difficulty aligning an LLM judge to their criteria? Anyone? Okay, we've got to talk later, Shreya. I think that if you have just a few hundred to a few thousand samples, it's very easy to fine-tune a simple model that can do it more precisely. Second, if you want LLM-as-judge to be fairly precise, you generally need to use chain of thought, and chain of thought is going to take maybe five to eight seconds per request. A simple classifier or reward model, on the other hand, takes maybe ten milliseconds per request; that's two orders of magnitude lower and would improve throughput.

Next, we want to think about technical debt when implementing our validators in production, whether they run asynchronously or in the critical path: how much effort do we need to put in to keep them up to date? With LLM-as-judge, if you don't keep your few-shot examples dynamic, or have some way of keeping the judge prompt aligned with your definition of good and bad, then you're toast.
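For concreteness, here is a minimal sketch of the judge pattern being discussed, with the few-shot good/bad examples baked into the prompt. The criterion, the examples, and the `call_llm` completion function are all illustrative placeholders you would supply, not anything prescribed in the talk.

```python
# Minimal LLM-as-judge sketch: one prompt encoding the criterion plus
# few-shot anchors, so the judge is aligned to your preferences.
from typing import Callable

JUDGE_PROMPT = """You are judging summaries for factual consistency with the source.
Answer with exactly one word: pass or fail.

Source: The meeting moved from Tuesday to Thursday.
Summary: The meeting is now on Thursday.
Verdict: pass

Source: Revenue grew 3% year over year.
Summary: Revenue grew 30% year over year.
Verdict: fail

Source: {source}
Summary: {summary}
Verdict:"""

def judge(source: str, summary: str, call_llm: Callable[[str], str]) -> bool:
    """Return True if the judge passes the summary; `call_llm` is any completion fn."""
    verdict = call_llm(JUDGE_PROMPT.format(source=source, summary=summary))
    return verdict.strip().lower().startswith("pass")
```

Note that the few-shot examples in this prompt are exactly the artifact the speakers warn about: they have to be kept current with your evolving definition of good and bad, or the judge drifts.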
The effect is not as pronounced for fine-tuned models, but if you don't continually fine-tune your validators on new production data, they will also be susceptible to drift. So, overall, when do you want to use LLM-as-judge? It's honestly a question of resources and of where you are in your application development. If you're starting to prototype, need quick evals with minimal dev effort, and have a lowish volume of evals, start with LLM-as-judge and invest in the infrastructure to align it over time. If you have more resources, or you know your product is going to be sticky, go for a fine-tuned model.

Next I'm going to talk about looking at the data. Eugene mentioned that you should create evals on your custom or bespoke criteria, but how do you know what criteria you want? Simple answer: look at your data. "Great AI researchers look at their data," but we changed that to engineers: great AI engineers look at their data. So how do we do this? Actually, before the how, the first question is when you look at the data. I know people who never look at their data at all, or who look at it only once, right after deployment. Wrong answer: you want to look at it regularly. I work with a startup that, whenever they ship a new LLM agent, creates a new Slack channel into which all of the agent's outputs come in real time; after a couple of weeks, they transition this to daily batch jobs, and they make sure they're not running into errors they didn't anticipate. The second question is what, specifically, you are looking for. You want to find slices of the data that are simple or easy to characterize in some way: for example, data that comes from a particular source, or that has a certain keyword or phrase, or that is about a certain topic. Simply saying "all of these are bad," with no way of characterizing them and then improving your pipeline based on that, is not going to help. Finally, something to keep in mind throughout this whole looking-at-your-data experience: your codebase is changing very rapidly over time, probably your prompts, components of the pipeline, et cetera. So when you're inspecting traces, it's super helpful to be able to know which GitHub commit, which model version, or which prompt version a trace corresponds to. This is one of the things traditional MLOps tools did very successfully, MLflow for example: they made it very easy to trace back and then, hopefully, replay something. Well, I see the judge shaking his head, but great. And finally, when using LLMs as APIs, pin model versions. LLM APIs are known to exhibit different behavior that is very hard to quantify for certain tasks, so pin gpt-4-1106, pin gpt-4o, whatever it is you're using.

So Shreya mentioned that we need to look at our data, but how do we look at our data all the time? I think the way to do this is via automated guardrails. Here's Brandolini's law, adapted: the amount of energy needed to catch and fix defects is an order of magnitude larger than the amount needed to produce them. And that's true: it's really easy to call an LLM API and just get something back, but how do we know whether it's actually bad? I think it's really important that we have some basic form of guardrails, and some of them are just table stakes: toxicity, personally identifiable information, copyright, and expected language. Now, you may imagine this is pretty straightforward, but sometimes you don't actually have control over the context.
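As a rough sketch of what two of those table-stakes checks can look like: the language check below uses `langdetect`, a real library chosen here for illustration, while the PII regexes are deliberately naive placeholders; a production system would use a proper PII and toxicity service.

```python
# Sketch of basic output guardrails: expected language and a naive PII screen.
# Patterns and thresholds are illustrative only.
import re
from langdetect import detect  # pip install langdetect

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b")

def guardrail_violations(output: str, expected_lang: str = "en") -> list[str]:
    """Return a list of violated guardrails for one model output."""
    violations = []
    if detect(output) != expected_lang:
        violations.append(f"output not in expected language {expected_lang!r}")
    if EMAIL.search(output) or PHONE.search(output):
        violations.append("possible PII in output")
    return violations
```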
For example, if someone posts an ad on your English-language website in a different language, and you're asking your LLM to extract the attributes or to summarize it, you may be surprised that, some nonzero proportion of the time, the output is actually in a different language. Similarly, hallucinations happen more often than we would like. Imagine you're trying to summarize a movie based on its description: you just have the description for the trailer, but the summary may actually include spoilers, information that's not in the source, because the model is trying so hard to be helpful, and that's a bad user experience. So here's a tip: if we spend a little bit more time building reference-free evals, we can use them as guardrails. Reference-based evals are when we generate some output and compare it to some ideal, gold sample; this is pretty expensive, and you actually have to collect all those gold samples. On the other hand, if we have those labels, we can train an evaluator model and just compare the output to the source document. For example, for summarization we can just check whether the summary entails or contradicts the source document, and now we have a hallucination eval. So if we spend some time building reference-free evals once, we can use them to guardrail all new output.
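A minimal sketch of that idea: gate each new output on an entailment score against the source, with no gold reference needed at serving time. The `entailment_prob` scorer stands in for whatever NLI-style evaluator model you trained, and the threshold is an assumption to tune on your own data.

```python
# Reference-free hallucination guardrail: pass a summary only if the source
# sufficiently entails it. The scorer and threshold are placeholders.
from typing import Callable

def hallucination_guardrail(source: str, summary: str,
                            entailment_prob: Callable[[str, str], float],
                            threshold: float = 0.8) -> bool:
    """Return True if the summary is sufficiently supported by the source."""
    return entailment_prob(source, summary) >= threshold

# Usage: plug in any scorer, e.g. a fine-tuned NLI cross-encoder.
# ok = hallucination_guardrail(doc, model_summary, my_nli_scorer)
```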
Cool, thanks. So we're going to wrap up in the next minute or so with some high-level, bird's-eye-view, 20,000-foot-view, whatever you want to call it, takeaways. First off, how many of you remember this figure from the pretty seminal paper on MLOps that came out, maybe ten years ago? 2015, so nine years ago. I think this paper really communicated the idea that the model is a small part of the picture: when you're productionizing ML systems, there's so much more around the model that you have to maintain over time: data verification, feature engineering, monitoring, your infrastructure, et cetera. So you might be wondering: now that we have LLMs, does any of this matter? Yeah, seeing a few nods here. Absolutely. With LLMs, all of these tech-debt principles still apply, and you can map every single component here to its LLM equivalent. For example, maybe we don't have feature engineering pipelines, but cast in a new light, it's RAG: we're looking at context, trying to retrieve what's relevant and engineer it so it doesn't distract the LLM too much, and we have a ton of experimentation around that. All of this needs to be maintained over time, especially as models change under the hood. Similarly for data validation and verification: we have evals, we have guardrails that need to be deployed. It's not simply "wrap your model, or GPT, in some software and ship it"; no, there's a lot of investment that needs to happen around the model. All right, I'd like to end with this quote from Karpathy-senpai: there's a large class of problems that are really easy to imagine and build demos for, but extremely hard to make products out of. For example, Charles dug up the paper on the first car driven by a neural network; that was 1988. Twenty-five years later, Andrej Karpathy took his first demo drive, in 2013. Ten years after that, and I hope all of you have had a chance to try Waymo, we got the first driverless permit for Waymo in San Francisco; maybe in a couple more years we'll have it for the whole of California. The point is, going from demo to production takes time. So that's all we had. Thank you, and let's build. [Applause]

And now I'm delighted and honored to welcome a very special guest, Thomas Dohmke. Thomas has been fascinated by software development since his childhood in Germany, and he's built a career building tools developers love and accelerating innovations that are changing software development. Currently Thomas is CEO at GitHub, where he has overseen the launch of the first at-scale AI developer tool, GitHub Copilot. So please join me in welcoming to the stage Thomas Dohmke. [Music]

Thank you, Thomas. Well, thank you everyone. Let's start with Copilot. Many people have shared their own takes on the Copilot origin story, but what was your personal experience seeing it at GitHub? Maybe you have a sneak preview for us; take us back to the start, in 2020.

So imagine it's 2020. It's lockdown, here in San Francisco, in Seattle, everywhere GitHub engineers are sitting, so, like all of you probably, we were on a Zoom call. One of us had early access to a new model that OpenAI had just released in preview, a version of GPT-3 called Codex. Oege, I think, had the keyboard, the leader of GitHub Next at the time, and we were dictating prompts and asking the model to write some code. The first aha moment I had was that you could ask it to write JavaScript code and it would put the curly braces in the right places and whatnot, and you could ask it to write Python code. The model, in a way, doesn't work like a compiler: it doesn't have a syntax tree, it doesn't know these things. Or you could also argue it knows them exactly the way we know them. That was probably the first moment. We kept exploring the model and then decided to build this autocompletion copilot, which was the first Copilot, and we built it all while being remote, while being on lockdown. So if your investors tell you today that you need to be in a room in front of a whiteboard: you can innovate, if you want to, while being in your home offices around the world. I think the next moment was when we shipped a preview to our internal engineers, what we call a staff ship at GitHub, and the NPS survey with those engineers was through the roof, I think 72, 73, something like that. Typically, for our early-stage products, especially with a large language model, with all the hallucinations, and with the UI not really figured out yet, it's much lower. So that was kind of a holy moment for us. Then, as the product shipped in mid-2021, with COVID still going on, we started looking at telemetry, and the team came and said it writes about 25% of the code in the files where it is enabled. I remember saying, I don't believe this, your telemetry is wrong, please go back and validate that. It turned out that was actually right, and by now it's about half the code that's written; some languages like Java even have a higher acceptance rate and more lines written. So this journey we went through over the first two years really was one moment after another where we saw the future of AI long before ChatGPT opened everybody else's mind.

Amazing, and now it's available to everyone here as well. So Copilot started as autocomplete in the IDE, and now it's all over GitHub; I know I have PRs, boards, et cetera. What did you do to make Copilot integrated across all of GitHub? What were some experiments, what worked, and what didn't work?

I think the first thing is to think about what I do as a leader, as the CEO
of a company, and it's really about constantly reconfiguring our approach. So much of the AI world is changing almost daily; there's some news, on The Information or elsewhere, every morning. So there is no more "I have a long-term strategy, I have my features all laid out, and I work through the backlog." It's really about operating as agilely as possible, even as a 3,000-person company that's part of one of the largest companies on the planet. The second thing is that we really try to meet the developer where they are. We say we're not trying to build an AI engineer; we're trying to build AI for engineers, a human-centric approach, which is what the name Copilot ultimately visualizes. But also, we're trying to make developers' lives better, and because we have developers ourselves, every productivity improvement we can find ultimately helps us at GitHub build our AI products. So that really is the approach: looking at what we can do next to make our own work of building more features for Copilot a little bit easier.

You mentioned a great point: you're trying to meet the developer where they are. So far we've been bringing the AI to the IDE. Are we going to try to bring the developer, the IDE, closer to the AI? How are you thinking about that?

The idea of bringing AI into the IDE, really into ghost text, into autocompletions, was a way of getting around hallucinations. It was a way of saying, okay, the model is not always going to be perfect, but neither are autocompletions, whether in your Google Docs, in your email, or in your editor in the old IntelliSense way: as you're typing, it cannot know what you wanted to type, so you're used to adjusting your typing, and then you find the moment when you press the Tab key. Even without autocompletion, think about what developers do in the editor while they write code. The best developers write a lot of code before they get stuck; the newbies, and those that rarely write code, like me, get stuck more often. Then you Control-Tab or Command-Tab into your browser, and you open Google or Stack Overflow or GitHub, and what you do there is find code and argue with other developers, and then you copy and paste that code into your editor and modify it. So in a way, Stack Overflow has as many hallucinations as the model might have, and not because the answers are bad, but because the world is changing so much. I code a little bit on iPhone projects, in Swift, and there's always a new Swift version after WWDC, or a new Xcode version, so how you use the APIs has changed. So it keeps the developer in the flow; that really was the crucial thing here. In a world ten years ago, we probably wouldn't even have called this AI; we would just have called it smarter autocompletion. The AI piece is not the core feature of Copilot; the core feature is that it helps developers stay in the flow, get the job done, and not be in this constant distraction between the editor and the browser.

That's a great point. A few months ago you wrote this post about Workspace. What was the journey to creating Workspace, and, for folks who are unfamiliar with it, what is Workspace?
So, you already mentioned autocompletion; that's how we started. Then, in November 2022, ChatGPT happened, so in early 2023 we added chat, and GPT-4, to Copilot in the IDE, as a separate sidebar window. So we have that available, and it has RAG and all the information, the context, available in the IDE. But ever since, we have been thinking about how we can make the developer flow even easier, and Workspace does exactly that. It takes a GitHub issue, or just a task, an idea that you write down on github.com, and it helps you, in the context of your codebase and repositories, figure out how to implement that change. It bridges from the issue, from the task description, to the pull request, to the code. And the magic behind this is that the human is still in the center. At every step of the way (writing a specification, analyzing the current repo and the current behavior, using your description to figure out how to modify it, writing the plan that shows you how to change all the files, through to the implementation, which is the diff view, if you will) the human can interact, can change the bullet points, can change the code. What that really gives you is a pair programmer that helps you explore the codebase, because the challenge we all have as engineers is that as soon as you move onto a new project, or you want to modify an open-source project, or you're just coming back from vacation and trying to remember what is implemented where in your thousand-plus files, navigating the codebase is the first challenge you have: figuring out what's the current behavior and what's the new behavior. So you're getting an AI-native, a Copilot-native, developer environment that helps you along the journey you're naturally also doing in your IDE. And that really is the key here: it's not about building an autonomous agent (I'm sure you've heard a lot about that in the last three days); it's about building agents that help us as humans solve a task and learn along the way, as we figure out, oh, there's this test file I also have to modify if I want to implement this feature.

I love the point you mentioned about not building autonomous agents and instead helping the developers. So how should non-developers use Workspace?

They can, and in fact, once we announced this last year at GitHub Universe in November, I think the first feedback email we got was from a program manager, or product manager, saying: this is awesome, because now I can not only write a user story or a work item, I can also see what it would mean to implement it in the codebase. In many ways, the biggest challenge we have today is whether we can be as specific as possible when we write down a task, as product managers or as engineers ourselves. Often, everything is obvious until it is not. And then you kind of need to size the task: how long will it take? The mythical man-month (I think The Pragmatic Engineer covered that a couple of weeks ago) is still true: most estimates are half the time the job actually takes, and we're really bad at estimating how much time it takes to get things done, whether it's in coding or in building houses or roads or infrastructure. So Workspace helps you with that, as it helps you figure out: is what I just described actually specific enough to write the code for it, or even to figure out what the plan
would look like?

Can you share a bit about your vision of how we will build and code in natural language, and how it will help us collaborate better: devs and PMs, coders and non-coders, across languages and across the world?

For me, the very first thing when you say natural language (I have it on my t-shirt here: "Copilot speaks your language") is that these large language models we're using today in GitHub Copilot, and in many other AI applications, are the same models that power chat agents; they speak almost every major human language. So whether you want to explore coding in English when you don't yet understand the concepts of true/false Boolean logic, or whether you want to learn that in German, in Hindi, in Brazilian Portuguese, in Spanish, in Chinese, you can do that now. And if I look at kids today in school, most of them are growing up with mobile phones. When you go into a restaurant here in San Francisco, or Seattle, or elsewhere in the world at night, you probably see a family with little kids where the kids have their phones, because the parents want to enjoy five minutes on their own. Then, as the kids grow up, they see Super Mario or Minecraft and get into gaming, and naturally that leads to: how can I create my own game? How can I create my own web page? Copilot enables that, in the language the kids grow up with, which for the majority of the humans on this planet is not English. So that's number one: it democratizes access to technology. It also democratizes access for those who don't have parents at home with a technical background, or parents with infinite patience; most parents do not. I have two kids, and I speak from my own experience: at some point you're just done explaining the world to your kids, and you just want to switch on the TV and watch that Netflix show. And if you look at the professional context, one of the biggest challenges we have is: if you joined my company, or I joined your company, tomorrow, what's all the institutional knowledge? How are things being done? What we don't like as humans is asking a thousand questions, especially as a new employee in a big company; you have this anxiety in your head that everybody else thinks you're dumb, that they wonder why you got hired in the first place. So a copilot also democratizes access to all the information in companies, and I think that is going to change how we work, not only for developers in the workforce, but for really every human.

Thank you, Thomas, for sharing your vision. The next thing I want to ask is maybe a little more unhinged. Speaking of agents: in your opinion, what makes an agent, or a copilot? What's your definition of an agent?

I think an agent is like an AI dishwasher: you fill it with the dishes, you let it do its thing, and at the end you take the output and put it back on the shelves. And today, well, we used to call them bots, or CI/CD; in many ways that's an autonomous agent: you push your pull request and your CI/CD runs, GitHub Actions or a similar product. Many compute primitives we have today are agents, as they get a
job done on their own. My monitoring, figuring out whether GitHub is up or down, is somewhat autonomous; hopefully it pages somebody without us first hearing from you that you cannot access your repository. So in many ways, what we're building is still tools that help us get the job done, and there are many jobs that developers have to get done, many jobs that AI engineers now need to get done. You saw on the slide earlier all the things that are still true even though you can automate things with large language models. A lot of work in software engineering is bogging us down: a lot of boilerplate, a lot of security compliance, that Friday evening when you want to enjoy the barbecue because the sun is out and instead you have to update all your Log4j dependencies. Security tooling, in fact, is creating more work today. It's not a dishwasher; it's a tool that tells you the dishes are dirty, and then you have to do the dishes yourself. That's security tooling: it just adds stuff to our backlog, while we actually want to work on the creative side; you want to build new features, you want to build innovative products, the creative things. I think many software developers do not understand themselves as production workers; they understand themselves as artists, as creators. But our companies, our governments, the world, require us to do a lot of other work, and we need AI tools (Autofix, for example, things that not only scan for security issues but then fix those security issues) supported by AI, so we have more time for the things we do want to do, and AI takes over the things we don't want to do. That's where the agents will go.

Fantastic. What's an agent you want to have, and how far are we from it?

I mean, I want agents that burn down my whole security backlog. As in any company, the challenge is that I have way too many of these items, and there isn't really a book you can buy that tells you, as an engineering manager, how to balance those two things. You cannot put all your work into security, compliance, accessibility, and whatnot; and you cannot put all your work into innovation, because your customers will lose all their trust the moment you have a security issue that threatens their data. So you have to balance those two things, or you find AI agents that bring the work down. Like any leader of a software development company, I always want to go faster, I always want to get that feature done faster, and I'm sure it's the same for you folks at Amazon: when I have an idea and I ask my folks how long it will take to implement, the estimate leaves me scratching my head, thinking I could do it myself faster than waiting for my team to do it. But of course that's not the truth. The truth is that there are so many other things in the process these days that we need to find new abstraction layers that help us get control over our development life cycle again.

That's a great point. So, last question: do you have any advice for devs, both new and experienced, on how they should navigate this new world of tools, this new world of abstractions, in what some say is the biggest technology innovation since...
I think the most exciting thing about this new technology, and you saw it hopefully over the last three days at this conference, is that we are moving into a new world of software development, and there have been multiple step functions over my life. I was born right before the PC was invented; I remember getting my Commodore 64, then a PC in the '90s. I remember open source and the internet; before the internet, open source meant buying CDs and DVDs in bookstores, and then the internet came, SourceForge and then GitHub came, and all of a sudden developers started collaborating. The mobile wave came. And every time we had one of those step functions, software development got more exciting, and I think we are at such a step function again. It means we can embrace our nerditude; we can build new things. For me, as the CEO of GitHub, I don't get to touch code often, and so when I get to touch code on a Sunday afternoon, I don't want to spend all my time updating all my dependencies, and I don't want to have to figure out everything that has changed in the API documentation, or how to deploy the container to the cloud now. There are so many things changing around us, and we want to bring the fun back. I think that's it: AI brings the fun back into software development. So I want to encourage you all: go back home, build some cool stuff, and embrace this new world of AI.

Okay, that's all we had. Thank you, Thomas. Please join me in thanking him.

Ladies and gentlemen, please welcome back to the stage your host and co-founder of the AI Engineer Summit, Benjamin Dunphy. [Music]

Oh my god, how are we feeling? Better than me, I hope; I'm exhausted. What a way to end an event: OpenAI, rapid-fire talks from legends, and the CEO of GitHub. Can we have a round of applause for our final keynote speakers? Incredible. I think the only thing left to say is thank you. Thank you all for coming to this event. We can put together the best flows, content, and productions, but it means nothing without all of you, the community members who show up to learn, engage, and connect. So we thank you for that; please give yourselves a round of applause. [Applause]

Next, I want to thank Microsoft. They were a sponsor of Summit last year, last October, and Britain Winterose in particular was our first champion, as I recall, for taking their sponsorship up a notch, so thanks to him; I couldn't be happier with their partnership. Sharon, Allison, Kayla, you've all been absolutely incredible to work with, and special thanks to all the speakers from Microsoft and GitHub for your hard work and incredible content. AWS: I've been a user and fan of this company for some time, and as someone who produces events, I'm in absolute awe of what they do for re:Invent and all the other events they run. Memo, you are just incredible; keep doing what you're doing. And Antje, I'm so excited to hear that you're turning San Francisco's AWS Loft into an AI engineering meetup hub; you announced that at your keynote, and it's incredible. San Francisco is back, baby. MongoDB, Google, Neo4j: you all brought your A-game; thanks for your partnership and support, from the planning and organizational side to the content and production side. To all of our other sponsors, especially our gold and silver sponsors: you not only help make the event a financial success, you make every aspect of this event more interesting, engaging, and valuable. This event is enriched
by your presence, and we thank you for being here. Argus HD: I mean, seriously, Argus HD, you never cease to amaze me. Tim, Ryan, Tim, the other Tim, everyone else: you guys are incredible. Our budget is small, but you all make it work; they did all the breakouts, and you put on an incredible show. Motif Events: did you guys see that Expo? That's Motif Events; they designed and built it. We selected them not only because they were the most innovative, but because it was clear to me that they put the most work and heart into what they do, so thanks, Dave, Ben, and everyone else. And thanks to the entire Local 16 crew for your excellent craftsmanship in assembling and disassembling this thing; it had to come down between the sessions, we brought the airwall in, y'all saw that, and they did it super fast, so thank you all for that. Everyone at the Marriott Marquis: you've all been absolutely incredible; the food was good, the coffee was good, everyone was super responsive, and the internet was good, at least as far as I was concerned; hopefully it was for you as well. All of the speakers: we can't have an event without the speakers, and they work incredibly hard on these talks. Like Picasso, it took them their entire lives to create these talks; unlike Picasso, it also took weeks or even months of preparation. So to all of our speakers, thank you so much for being an incredible part of this event. Coach and Vide, for managing our livestream with real-time clips of our YouTube live stream: thank you. Chris Otto, everyone at Sourcegraph, everyone at Cody: thank you for helping with the distribution of this incredible content to a wider audience. And the volunteers, the folks in the yellow staff t-shirts: they are not paid, they're volunteers. Events are super expensive, and we can't do this without you, so we thank you so much. And Leah McBride: many of you met her at registration, and she's an absolute force; seriously, she's an entire agency at this event. So I'd like to welcome her to the stage; if we could all just show our gratitude with a round of applause. And Leah, we will get you some more on-site support for next year, so you're going to be working a little less hard. Simon Sturmer is a beast of an engineer and volunteers so much of his time to help with the conference website and mobile app. It of course has bugs; we have volunteers running it, so if you want to contribute and build some of the cool stuff we're doing, email me directly
at ben@ai.engineer; we could use some help. Lastly, swyx. Seriously, you all don't know how much heart and soul and sweat and time this man has put into this event. I'm beyond disappointed that he missed it, but he is watching on the livestream, so can we show some love for swyx? [Applause] Max Video Productions, who handle our b-roll, cinematography, and speaker interviews; Randall Gee, who does our photography; Angelie, for the live voiceovers; Sasha Shang, who helped with the diversity committee; and everyone from the diversity committee: Sarah Chang, Tony Shaw, Harper Carroll, so many people. If I forgot you, I'm sorry; I put this together five minutes before I went on stage. Crazy times. But thank you all so much for being here, thank you for coming, and we will see you next year. Don't forget to secure your ticket today. Bye-bye, and thank you, everyone. [Music] [Applause]