AI Engineer Paris 2025 (Day 2)
Channel: aiDotEngineer
Published at: 2025-09-24
YouTube video id: wyUdpmj9-64
Source: https://www.youtube.com/watch?v=wyUdpmj9-64
[Music] [Applause] Ladies and gentlemen, please join me in welcoming to the stage your MC for AI Engineer Paris, developer experience engineer Ralph Jabri. [Music] Good morning, Paris. Does this work? Can you hear me? All right, nice. How's it going, guys? Did you enjoy yesterday? If you liked yesterday, today is going to be incredible. You're just going to love it. So I'm very happy to have you here today for day two. Yesterday was a fantastic day. We had an amazing opening ceremony with swyx and Ben, co-founders of AI Engineer, and we also had a great talk from Mistral AI, who talked to us about the problems they face in the enterprise world, which I found fascinating. And we had a welcome party, which was incredible in my opinion. I have to admit, those are the best moments at this type of conference: where you get to meet everybody, attendees, sponsors, all the engineers and the founders. You can hear it a little bit in my voice; I had a few chats here and there. These are my favorite moments. So why don't we get started with something? I want you to look at the person on your right and the person on your left, take 10 seconds, and introduce yourself. Let's do that. All right, that's the spirit. That's how I like it. Super cool. So, we have an amazing lineup of speakers for you today, covering topics from agents, MCP, open models, generative media, and more. I've seen some of the talks, and I have to say they're incredible, so you've got to be here. But before we get started, I would like to invite the CEO of Koyeb, Yann Léger. [Applause] >> The mic is not working for Yann. >> Oops. >> So, maybe I'll take... and now I'm back. >> Yay. >> All right. A false start, but here we are for the second day of this amazing event, this conference that we assembled. I want to speak a bit about content. You might have noticed we went from one track to five tracks, which is quite incredible in terms of coordination. We have over 35 speakers coming today from all parts of the world: some flew over from the US and landed yesterday, some came from Sweden, like Emil, who joins us just after this, and I hope you'll enjoy it. One key thing: the discovery tracks are also amazing. We had over 500 submissions on the CFP, so you will find incredible content. And if you cannot attend all of the sessions, they will all be recorded and available on our YouTube channel, so don't fear missing out; you'll be able to see them all. One last thing I wanted to say is thank you all again for joining us. One thing you should be aware of: Koyeb is our serverless platform. We provide high-performance serverless infrastructure to simplify application deployments. You will find our entire team on site today; we have two booths over there, so you're welcome to stop and chat with them. And now I'm going to hand it back over for the next stage of this morning. >> Thank you, Yann. Let's give it up for Yann. Something tells me this is not the last time we're seeing him this morning.
But yeah, if you want to have a look at the full schedule, please download the app, where you can find everything you need to know about all the tracks. This is the main stage, but we have three other stages: Discovery 1, Discovery 2, and a lot of workshops, and we also have coffee at the expo offered by our friends from Tinfoil, so please go and check it out. And speaking of the expo, I highly encourage you to go meet the engineers and founders who came here to meet you, make new connections, and find your next partnership or job opportunity. All right, so without further ado, I would also like to thank the sponsors. I would like to thank the gold sponsors: Sentry, Arize AI, DeepMind, and Algolia. And I would also like to thank our platinum sponsors, Docker and Neo4j. Speaking of Neo4j, our next speaker is the co-founder and CEO of Neo4j. His work helped investigative journalists crack the Panama Papers, enabled NASA to reach Mars two years ahead of schedule, and drove breakthroughs in cancer research, fraud detection, and many more areas. Please join me in welcoming to the stage the CEO of Neo4j, Emil Eifrem. [Applause] [Music] >> All right. Perfect. Thank you. In theory, the mic is working; I feel like I hear my own voice, so, yes, all good. Perfect. Bonjour. That's all the French that I know, my apologies. I have a French chief of staff, I have French investors, my daughter is learning French, so I'll pick it up for next year and be able to say a little bit more, I promise. So the slide here says "the state of AI engineering." I'm not going to talk about the state of AI engineering; I don't think that's my talk to give. I think that's swyx's or Ben's, and I feel like swyx actually did some version of that last night. But I've spent two decades of my life in databases and knowledge representation, so instead I'm going to talk to you all about managing state in AI engineering. More specifically, over the last couple of years we've observed hundreds of projects building AI applications inside big companies and small companies, and I'm going to share some of the observations from that. Last year in San Francisco, I did a very hands-on, practical talk about GraphRAG, the benefits of GraphRAG, and how you can get started with it. This talk is a little more high level: it's an opinionated view of where I believe the data layer for AI should go. Or in other words, we've identified four properties of a kick-ass data layer for AI applications. All right, but before that, let's start with a little bit of context engineering. Who here knows what context engineering is? Raise your hand. The morning gymnastics exercise is going well. Fantastic. Karpathy, I think, had a good one-liner for it: context engineering is the delicate art and science of filling the context window with just the right information for the next step. It was coined by this gentleman, Dex Horthy from HumanLayer, who wrote a really phenomenal blog series called "12-Factor Agents," which I really recommend for those of you who haven't read it. It's a great treatment, and in factor three he talks about context engineering. He has a really simple intro for it: everything is context engineering. LLMs are stateless functions that turn inputs into outputs. Therefore, to get the best outputs, you need to give them the best inputs. So simple: three sentences, a really good summary. And he went on and showed a Venn diagram of what context engineering is. And today in this talk, I'm going to focus on the three main sources of state in AI engineering: the state out of your RAG corpus, the state out of your agentic memory, and the state in your application. I call this the state in AI engineering.
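To make that stateless-function point concrete, here is a minimal sketch: every call assembles context from the three sources of state just named, and nothing persists inside the model between calls. All of the helper names are hypothetical placeholders, not any particular library.

```python
def search_rag_corpus(question, top_k=5):
    # placeholder retrieval over unstructured documents
    return ["Paris is the capital of France."]

def recall_memory(user_id, question):
    # placeholder agentic memory lookup
    return ["User prefers short answers."]

def load_app_state(user_id):
    # placeholder first-party application data
    return {"name": "ABK", "plan": "pro"}

def call_llm(prompt):
    # placeholder model call; a real one would hit an LLM API
    return f"(model output for a {len(prompt)}-character prompt)"

def answer(question, user_id):
    chunks = search_rag_corpus(question)          # state from the RAG corpus
    memories = recall_memory(user_id, question)   # state from agentic memory
    profile = load_app_state(user_id)             # state from the application
    prompt = "\n\n".join([
        "Answer using only the context below.",
        "## Documents\n" + "\n".join(chunks),
        "## Memory\n" + "\n".join(memories),
        "## User\n" + str(profile),
        "## Question\n" + question,
    ])
    # The model keeps no state between calls; only this assembled prompt matters.
    return call_llm(prompt)

print(answer("What city is this conference in?", user_id="abk"))
```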
Okay. But before that, we're going to talk a little bit about a brief history of application architecture. I'm going to start by bringing you back to ancient times, many, many years ago, before 2022, when we were all building some version of simple CRUD apps. There were lots of architectures around, but at a high level, this, I think, was the canonical architecture. We have some UI, we have a backend, we have a database, right? We store data in the database and in object storage. More specifically, we put our structured data in the database and our unstructured data into object storage, and between the UI and the backend we speak JSON, which is semi-structured data. Very simple. But then something happened, right? ChatGPT was launched at the end of 2022; that's why we're all here. swyx coined the "AI engineer" term six to nine months later, midsummer 2023. And we all started building chatbots, right? Relatively simple chat applications that sit on top of some kind of orchestration layer. We stuck a bunch of our data, in unstructured form, into what? Into vector databases, right? And so these very simple but pretty powerful applications ran on top of, honestly, a reduced version of our unstructured data: the embeddings. So that's what we did in 2023. And then, in a very gross annual simplification (it's only been two years that we've been doing this), 2024 is when everyone started talking about agents, at least talking, with a debate over how much was actually being done. Towards the end of 2024, Anthropic launched MCP, which really helped us think through what tools look like and simplified tool access in agents. And so as we sit here in 2025, we can see a stack starting to form. At the top, it's not just simple chatbots anymore; it's a real application, with AI features embedded on the side or at the core of it. The former orchestration layer now consists of one or multiple agents (and of course there's a big debate between single agent versus multiple agents) that wrap prompts. They use tools, they use memory, and they use one or multiple LLMs as part of that. And then we have the data layer. So what's going on in the data layer? The stack is starting to form, but that really is the upper half of the stack. So let's double-click and spend a little bit of time on the lower part of that stack. We started out with vector databases, right? But after a while we realized that just querying unstructured data in a semantic, ANN, approximate sort of way is powerful, but it's actually not enough. And so the vector databases realized they needed support for structured data as well; they tend to call it metadata, and you use it in metadata filtering, for example. But at the same time, there are obviously a bunch of other databases around that people started using for retrieval: the relational databases, the document databases, and the graph databases, for example.
And we all said, you know, that whole semantic search thing is kind of cool; we're going to add vectors as a feature, right? So here's what we have: a data store centered around unstructured data that is adding structured data; then we have, I don't know, let's call it Postgres, which is centered around structured data and is adding support for unstructured data; and a database for JSON, semi-structured data, that is adding support for unstructured data. So it's a real party down there, and there are a lot of different models. Everyone is adding what the other folks have, and honestly it's a little bit of a mess. And the big question is: where is this going? We've spent a lot of time thinking about what's really required to have a kick-ass data layer that makes it really easy to write AI applications, and we've identified four properties of a kick-ass data layer for AI applications. Four properties, and I'm going to walk you through them one by one. Okay. The first property: you can hear how I talked about it in the state of the union of the data layer, where we're going with this. The first property is that, I believe, in order to make it really easy to write AI applications, you need a data layer that, in a very easy way, manages unstructured and structured and semi-structured information, all three types, in a single data layer, and does that well: store and retrieve, of course, but you also want to index them, handle transactional scope across them, and all that kind of stuff. So that is the first property: the ability to handle structured, unstructured, and semi-structured data. Okay. So let's talk a little bit more about unstructured. Let's double- and triple-click on that. For property number two, what do we want to do with unstructured data? We just talked about how important it is to be able to handle all three types of data, but of course the fundamental currency of an AI application is the unstructured data. That's one of the L's in LLM, large language model, right? But there's another observation that has been forming over the last year or so, and you heard it yesterday from Mistral: AI applications are, well, applications. There's a lot of software engineering involved. And if you want to write an application, you probably want to do that with types; you want structures and objects and classes. And if a lot of your information is unstructured, the question then is: okay, we have an application, that application handles people, so you probably want a Person class, but you have a lot of unstructured data in your data layer. So you need to be able to bridge somehow: take the data out of your unstructured data layer and reflect it in persons and objects and types up in your application, so that you have a convenient, good developer experience. Right? So how do you do that? Let's say we have an unstructured source. I'm not sure if you can read it in the back, but we have some text, maybe out of our RAG corpus, which has three sentences: "Andreas is here today. He is a carpenter. Shout out to ABK." Three simple things. So step one here is to identify the entities, the concepts, the things, out of that data. That's called named entity recognition.
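As a toy illustration of the two steps described here and just below (recognition, then resolution), with the extraction hard-coded rather than done by a real NER model or LLM:

```python
# Named entity recognition plus entity resolution, in toy form.
text = "Andreas is here today. He is a carpenter. Shout out to ABK."

# Step 1: named entity recognition -- what an NER pass over `text` might return.
mentions = ["Andreas", "He", "ABK"]

# Step 2: entity resolution -- collapse mentions that refer to the same individual.
aliases = {"He": "Andreas", "ABK": "Andreas"}  # resolved by context/alias rules
entities = {}
for mention in mentions:
    canonical = aliases.get(mention, mention)
    entities.setdefault(canonical, {"name": canonical, "mentions": []})
    entities[canonical]["mentions"].append(mention)

# One structured entity comes out, ready to reflect into application types:
# {'Andreas': {'name': 'Andreas', 'mentions': ['Andreas', 'He', 'ABK']}}
print(entities)
```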
Named entity recognition is a technique that has been around for quite a while, actually. And we identify "Andreas" and "he" and "ABK." Then you can look at this, and we humans can read it and instinctively feel that this is probably the same individual: Andreas, "he," and ABK. So how can we resolve that? How can we figure out a way of combining these things? That is called entity resolution, and that's the next step: resolving these into a single entity. Right? Entity resolution. If you look at this data right here, it's actually structured data. This piece we can easily reflect up into our objects in application state. This process right here, the named entity recognition and the entity resolution, is going to be a really important part of writing applications, in a way that is convenient for developers, on top of unstructured data. So the second property of a kick-ass data layer for AI applications, I believe, is the ability to consistently and reliably extract entities out of unstructured data. That's the second one. Okay, the third one. Let's say we're building an application, and that application is for an online furniture retailer, right? And you can see the application: we have some kind of central part with a product gallery of some sort. It has some weird Nordic name for the table, the Kiruna table, right? And then we have an AI-powered community forum on the side, and we probably have some kind of AI bot helping out with assembly instructions. The data layer of this might look something like this: we have a product database, the purple boxes in here; we have some kind of bill-of-materials table; we have a product issue tracker table; and they're joined up in some way. But that assembly bot that helps out with assembly instructions in the community forum probably needs to read data out of our product assembly guides, right? And so we put that in there as well, and we've extracted the central entities out of that unstructured data, like we just talked about. So that's the first piece. But then we have this community forum where there's a lot of chatter, right? People are talking, logging on, maybe asking for help, commenting, and there's individual chat history, memory, but we've also extracted entities, and in particular we're focused on the entities that are globally interesting. In this case, we have a user called ABK, who's a carpenter, and he wrote that, you know what, for the Kiruna table the assembly instructions weren't really great, but when I applied two drops of wood glue to the table legs before step eight, then all of a sudden I could easily assemble it. So of course, when we look at this, we would like our AI assembly bot to be able to answer questions based not just on the data over here but also on some of the stuff in our agentic memory. But the problem we end up with is that we have multiple Kiruna tables in here. So in order to really be able to answer not just with the data out of the assembly guide but also out of these instructions, we need to be able to link them together. There are two ways you can link them together. One is to link at retrieval time: you query across all your data sources, you find the Kiruna table, and you do that in retrieval, and there you go. That's one way.
The second way is to do it in some kind of unified layer, where you link it together in the data layer of the stack. I think the second is going to be by far the most powerful for most applications out there, for three reasons. The first one is performance. If you do multiple retrievals across multiple tools, you will frequently end up with dependencies between those retrievals, which adds up the latency. You also have to manage intermediary results, which frequently end up being relatively big, actually. You reimplement joins in the application layer, and it adds a lot of memory requirements. So that ends up being tricky when you have real production-type deployments; for trivial, small stuff it's always doable that way. The second reason, and we just talked about this, is high complexity: you know, we've spent 15 years trying to implement joins well in the Neo4j database, and we're still working on it. It actually is pretty non-trivial to make that work well. And then the third is reliability. Imagine there was a trademark dispute or something like that over the Kiruna table, and we have to change the name. In this world, we then have to do some kind of text search across all of our data sources and update them, which is entirely doable, but we all know that kind of thing gets messy, especially in production, especially at scale. If it's all linked up in the data layer, it's a trivial change that automatically propagates across the entire state of your AI application. So I believe the third property of a kick-ass data layer that makes it really easy to write AI applications is the ability to link entities across persistent agentic memory and into your application data, which tends to come out of your RAG corpus. That's the third property. And then the fourth property, speaking of the RAG corpus. If we go back to prehistoric times again and think about the CRUD app, one way of thinking about it is that it has structured data and unstructured data in the data layer, which I talked about. Another way of thinking about it is that all the data in there is first-party data. So what do I mean by that? What is first-party data? Well, in this context, the context of application development, first-party data is the information that your application directly collects from its own users. It's data that you own. A couple of examples: profile data, the names that people type in, their preferences, that kind of stuff; but also activity data, how long you spent on various things; transaction data, those kinds of things. And if you think back to that CRUD era, basically all the information for most of those apps was first-party data. I'm sure there were some integrations and things like that, but by and large most of it was first-party data. But then if you look at our more modern AI app, we see that the first-party data really sits here in the purple boxes, while the yellow data at the bottom is actually derived data: it's coming out of the RAG corpus, the assembly guides for the Kiruna table. And this presents an interesting tension, because on one hand this is fantastic: we want to co-mingle first-party and derived data.
So that when we reflect that state up into your application, the Kiruna table object in your TypeScript application has all the relevant information for it. That's the best developer experience. It also leads to better queries when you treat them together. So we want that. The flip side, though, is that we have to have the ability to treat them differently. A great example of this: what if that trademark dispute happened and we wanted to change the name of the Kiruna table, or maybe we wanted to add to the instructions? People handle this differently in RAG, but a very common pattern, probably the best and simplest pattern today, is that you drop all the derived data and recreate it when the RAG corpus changes. Some people try to apply change sets and things like that, but I think that's tricky in reality. Okay. So then we would drop all the yellow stuff and recreate it when the RAG corpus is updated, which is great. But if it's co-mingled with the purple stuff, you don't want to accidentally cascade across that and delete some of that information. So the fourth property of a kick-ass data layer for AI applications is the ability to disambiguate between first-party data and derived data, so that we can handle them differently in the application. It will look different in different types of applications, because people have different strategies for it, but it is all based on this ability to know: is this first-party data or derived data?
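One hedged sketch of what this "drop and recreate derived data" pattern could look like against Neo4j from Python, using the official driver; the labels and property names here (source, corpus_doc) are made up for illustration, not a Neo4j convention, and the connection details are placeholders.

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Ingest: entities extracted from the RAG corpus carry a provenance flag.
    session.run(
        "MERGE (p:Product {name: $name}) "
        "SET p.source = 'derived', p.corpus_doc = $doc",
        name="Kiruna table", doc="assembly-guide.pdf",
    )
    # First-party data written by the application carries a different flag.
    session.run(
        "MERGE (u:User {id: $id}) SET u.source = 'first_party'",
        id="abk",
    )
    # Corpus refresh: drop only the derived data; first-party nodes stay untouched.
    session.run("MATCH (n {source: 'derived'}) DETACH DELETE n")

driver.close()
```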
Okay. So we've gone through a little bit of history, and we've talked about these four properties of a kick-ass data layer for AI applications, properties that I believe are required to make it really convenient to write great AI applications: the ability to handle structured, unstructured, and semi-structured data; the ability to extract entities from unstructured data; the ability to link them between agentic memory, your RAG corpus, and application state; and finally the ability to disambiguate between first-party and derived data. So, I don't know if any of you work in big enterprises and are buyers of application software. I've seen this happen many, many times: a salesperson is in front of me, they talk about the needs I have, and then magically they show up with a product that solves exactly those needs. That is not what we're trying to do here. These are our objective, intellectually honest observations; we don't have all of this at Neo4j today. But the reason we're here, the reason we spend so much of our cycles focused on AI engineers, is that this is of course where we want to go, and as the CEO I think I have some amount of influence, at least, over the product roadmap. So this is exactly what we're building towards at Neo4j. So let me give you a quick demo of the first two principles, which are not perfect yet but are making progress really fast, and what they look like in Neo4j. Imagine we have a Wikipedia page, as an example of unstructured data, that we want to add to the database. We're going to see our ability to look at structured, semi-structured, and unstructured data in one platform, and the ability to extract entities from unstructured into structured form. So in theory this should work. Here we have the Wikipedia page for Paris, right? We take that and copy-paste the URL. We go into the knowledge graph builder here, which will upload it and start processing. So a couple of things happen right now: we start processing, chunk up that Wikipedia page, and put it into the graph in two forms, the raw data, the chunks, but also the extracted entities. What you can see here is both of those things at the same time. We're going to remove the entities for a moment, and here you have the raw unstructured data; these are the chunks of that Wikipedia page. But much more interesting, of course, are the entities. Those are the ones that mean something to us, to humans, to applications. So here you see that we've extracted some demographic data. It's too hard for you to see, but British citizens and US citizens in the city of Paris, which is in the country of France. And then it has automatically extracted notable individuals, right? These are the, whatever this color is, greenish-something nodes; you can see a bunch of really notable individuals out of Paris, which is a very humbling list of people, by the way. This is Igor Stravinsky, as an example. So this is all automatically extracted from that unstructured source, and you can imagine that once you have these entities, it's really easy to reflect them out into your application state as typed objects, and it's also really easy to query across them in a semantically useful way. So those are the two properties at play here, as an example. So that's it for me: the four properties that I believe are required to build great AI applications. If you think this is interesting and we're going in this direction, Neo4j is available for free in our cloud service, Aura, so you can check that out with the QR code. We have a great GraphAcademy where we teach people how to get up and running and build these types of applications. A few weeks ago (I don't know how many here are at startups), we launched a startup program where we give away free credits and things like that. But maybe even more importantly, we're building a team of experts that helps startups get up and running with Neo4j, so apply for that at neo4j.com, under the startup program. And that's it for me. Thank you very much. >> All right. Thanks, Emil, for the amazing presentation. >> Thank you. >> Awesome. Let's go now for a few follow-up questions, if you don't mind. >> Yes. >> Cool. >> I have the coffee. That's the reward for getting through the presentation: you get the coffee. >> Very important, guys. All right. So, you touched on memory and the AI layer. What do you think agentic memory will look like in the future? >> Yeah, it's interesting, right? How many in here are using some kind of agentic memory system today? So, probably a third or something like that. It's funny, right? This was my objective observation: as a database guy, as a Neo4j guy, as a graph guy, there are a lot of people who independently come to the conclusion that agentic memory is intrinsically graph-oriented. There are a couple of YC startups, like Zep, Mem0, Cognee, right? Also, we mentioned MCP; everyone in here knows what MCP is. Few people know that the initial launch of MCP actually shipped a tiny little agentic memory implementation. It's a toy, it's 300 lines of Python, right? But what is it? It's actually a graph, right?
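As a toy illustration of why a minimal agentic memory ends up graph-shaped (this is not the actual reference implementation that shipped with MCP, just the shape of the idea): entities are nodes, relations are edges, and observations hang off the nodes.

```python
class GraphMemory:
    def __init__(self):
        self.entities = {}    # name -> {"type": ..., "observations": [...]}
        self.relations = []   # (source, relation, target) triples

    def add_entity(self, name, type_, observation=None):
        node = self.entities.setdefault(name, {"type": type_, "observations": []})
        if observation:
            node["observations"].append(observation)

    def add_relation(self, source, relation, target):
        self.relations.append((source, relation, target))

    def neighbors(self, name):
        # Traversal is what makes the memory useful: follow edges out of a node.
        return [(rel, target) for (src, rel, target) in self.relations if src == name]

memory = GraphMemory()
memory.add_entity("ABK", "person", "is a carpenter")
memory.add_entity("Kiruna table", "product")
memory.add_relation("ABK", "wrote_tip_about", "Kiruna table")
print(memory.neighbors("ABK"))   # [('wrote_tip_about', 'Kiruna table')]
```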
And so there are a lot of people who haven't talked to me who independently come to the conclusion that a graph is the natural form factor for memory. That's one of the key reasons we're really interested in it, and especially being able to marry memory and the RAG corpus, like I just talked about, I think is so powerful. >> Interesting. All right. On to my next question, then. From what you're seeing in the enterprise, do you think more companies are building with AI to solve their internal problems, or do you think they're buying AI solutions instead? >> Like building application solutions rather than buying application solutions? >> Yeah. >> It's interesting. We've obviously watched this very carefully. We sell primarily into the Global 2000, right? That's who we sell to. That's why we can do startup programs and give it away for free to startups: because we make a lot of money from Bank of America, that kind of thing, right? And in the early days, let's call it 12 to 18 months ago, almost all enterprise adoption of AI was in the form of applications; in other words, they bought solutions. We've seen that change a lot: maybe experimentally a year ago, when people started trying to build a lot, but then in 2025 people really put it in production. So I think there's a massive surge of build. And then the whole thing of AI amplification of software engineering means that engineers are more productive, and the barriers to becoming a software engineer, or at least to being able to produce code, are going down. So I think the way it's trending right now, and will continue, is way more build than buy in terms of AI. At least that's what I'm hoping, since, look, they build on us, right? So that's my hope, but I think that's where it's going. >> And do you think that lowering the barriers to software engineering is the primary motive here, or are there other factors pushing that? Are they seeing better quality just by building, or maybe the products out there are not fully responding to what they're expecting from an AI product? >> I think they see the promise of higher quality, but I don't think it's realized yet. Right? If you think about a company like Klarna — who here knows about Klarna? Yeah, most people. They just went public two weeks ago, a Swedish-origin fintech. The CEO there publicly wrote this super-long-form tweet saying, you know, we kicked out Salesforce, we kicked out Workday and 1,200 other SaaS tools, thanks to this AI platform that we built. And in that long-form tweet it was very clear that the key secret sauce was graphs and Neo4j. So they built this entire platform and kicked out everything, right? And when I talk to CIOs of Fortune 500 companies, I've never seen that amount of emotion against SaaS sprawl. Not since the hate for Oracle. By the way, I like Oracle; I think Oracle is a good database, but man, have CIOs hated Oracle over the years. Right now, CIOs hate the SaaS sprawl that they see. Look, I run a small company; we're a thousand-person company, right? Our software budget is massive. I have no idea what SaaS products we're buying, right?
Imagine, you know, Verizon, or massive companies like that. And so I think there's a massive promise in doing the Klarna thing: rationalizing your software ecosystem, getting software that is bespoke to you, and not getting 100 features of which you use four. >> Right, right. >> I think there's a lot of promise in that, and I see a lot of traction inside the enterprise, but it's too early; it hasn't yet been realized. >> One final question for you. So, Neo4j, you mentioned, is a European company, right? You're based out of Sweden, but you also have a huge presence in the Bay Area. So how do you see innovation happening in the AI space? Where do you see it most? Do you see it in Europe? >> Technically, we're actually an American company. We're incorporated in Delaware, we're headquartered in Silicon Valley, but I founded it in Sweden. All of our engineering is in Europe, primarily Sweden and London, not a lot in Paris, but we are hiring, by the way, so if you find this kind of stuff interesting, please apply. We talk internally about building an American company with a Swedish soul, trying to marry the best of both worlds. We'll probably end up in some 2x2 where we get the worst of both worlds, but we're trying hard to get the best of both, right? I think it's shifted a lot. I moved to Silicon Valley in 2011, and then I moved back before the pandemic, so I now live in Sweden. But when I moved there in 2011, to build a company like ours you really had to be in the Valley: developer, deep tech, infrastructure, there was just no other place. You probably didn't have to, but I liken it to running uphill versus running downhill; the center of gravity for everything was Silicon Valley. In the early days of AI, it felt exactly the same: in 2023, maybe early 2024, AI innovation wasn't even just in Silicon Valley, it was an eight-block radius in San Francisco, right? I think things are shifting really, really quickly. I love being here in Paris, catering to a home crowd, and it's just been amazing to see what's going on here in Paris. Even Stockholm, in Sweden, is getting some real traction with the Lovables of the world. Berlin, of course. London is a little bit slower than I would expect, but it's really starting to happen in Europe right now. If I were starting out today from scratch, it's not obvious to me that I would move to Silicon Valley. >> Interesting. Well, that's it for me, Emil. Thank you so much. >> Perfect. Thanks, everyone, for paying attention. >> Let's give it up for Emil. All right, on to our next speaker. Our next speaker leads engineering at Docker. Please join me in welcoming to the stage the VP of product engineering at Docker, Tushar Jain. [Applause] [Music] >> Hey everyone, hope you're all having a good time. This is a great conference, by the way; really glad to be here. All right. I hope everyone here knows Docker; I'm going to assume you all do. And I hope you all agree that a key thing Docker has done over the last 10 years is make it easy for all of us, all developers, to adopt microservices and containers,
bringing standards, easy tooling, and trust in the ecosystem with things like Official Images and Docker Hub. We now see a need to do the same for agents and tools, and that's what I'm going to talk to you about today. A framing I like, and we like to think about this with, is that agents are the new microservices. The same way we moved from monoliths to microservices and needed containers, a similar shift is happening with agents: we're now going to move to agents calling each other. Containers are still the right paradigm, but we need to build on top of them. We need standardized packaging for agents that understands what agents are and what their dependencies are, we need trusted catalogs, and we need to make it easy for everyone to share and use these. So today I want to briefly talk to you about two things we're doing in this space. First, an early exploration: we think there should be standard packaging for agents. You can package them as containers, but a container isn't aware of what an agent is. Today, if you just package an agent in a container, we don't know what tools it's using, and if you share and use it, you have the same problems you had earlier: what are my tools, what are my configs, how do I run it in any environment, what's the runtime I need? There should be something similar to a Dockerfile, an agent file, maybe; imagine a docker agent build/push/pull/run. So we're going to go do that; we're exploring and building this. And to kickstart it we've open-sourced cagent (that's the GitHub link), an easy-to-use agent builder that makes it easy to build agents but, importantly, packages them up as OCI artifacts and makes it easy for you to share them around via an OCI registry or Docker Hub. So this is an early exploration for us on how to package agents and share them. We'd love for you all to go try it out and send any feedback, and we're going to do more work here. If you're interested in talking more about this, please do find us at our booth. Okay, next: for agents to work well, you know, everyone here has heard about MCP and is using MCP. But we think that for developers to use this easily, a few things are needed. One, same as everything else, you need good packaging; packaging local MCP servers as containers is, we think, the right thing, and you get security with it. You need easy discovery, and you still need trust: how do you find the right MCP server for the thing you're doing? What's the Docker Official Images equivalent for MCP? How do I know what's trusted here? And then you need security and easy tooling around this: prevent rug pulls, prevent all the other security threats. So to do this, there are two things we've done that we'd love to talk to you about. First is our MCP catalog. You can go to hub.docker.com/mcp and you'll see a trusted catalog of MCP servers. Think of this as the Docker Official Images for MCP: trusted, verified MCP servers. We'll be adding to this; we'll have community servers, and we'll have a way for anyone to add images. On top of that, we'll add a bunch of security to prevent rug pulls, and we containerize local servers. We support remote ones too.
It's an easy, trusted way to get the servers you need. And second, in Docker Desktop we've added tooling to make it easy to use MCP servers, because today you have to go configure each client (Claude Desktop, Claude Code, Gemini) independently. We can make it easy to discover these servers, configure them once, use them easily, and add a bunch of trust on your laptop while you're doing it, so you're not just npm-installing random software with access to your whole machine, but running it containerized and secure. So I'm going to quickly show you a demo of that, and hopefully you can go try it out afterwards. All right, so bear with me as we do this demo. Okay. Let's first orient ourselves; hopefully it's legible. If you go to Docker Desktop, you'll see the MCP Toolkit there. Go check it out. Here's a catalog of servers; this will be growing. I can easily add them, and I can connect them to clients. So let's do this. The setup here is: I've got a PM who's been collecting feedback in Notion, feedback from, you know, people here, on Docker Desktop. What I'd like to do is have something consume this and create issues for me in GitHub that I can go work on. So let's go do this. I think I need the GitHub server, so, cool, let me just add that; I've already configured this with OAuth, as you can see. And then let's go get a Notion one too, so let me enable that. Cool. I've already configured this with my secret here; it's all stored, secret management is done. And I'm going to use Claude Desktop, so I can just come here. You can see a whole number of clients we support; you don't have to go manually edit config files. Let's just connect Claude here. Just do that. Done. Great. Let's start up Claude and let it come up. All right. You can see here in tools there's Docker and a whole bunch of tools there. Let's do a quick test with this: what feedback do you see, and what's the name of this page in Notion? This will run for a second, and then I'll kick something else off briefly. The problem with these demos is hoping Claude is fast. You can see here it's connected easily; I didn't have to go muck with any config. Easy configuration; it just worked. And while that's running, I can go set up a little workflow here that I can automate, which says: read from here and go create issues for me. And great, it's done stuff; it's getting more. I'm going to start a new one so we can try something new here. All right, so now I'm going to say: go look at the feedback, categorize it, and put it in GitHub. This will take a while to run, so I won't make you all wait for it. The key thing I want to show is that it was really easy for me. I didn't have to muck with any config files, I got official servers that I can trust, they're running as containerized versions, they don't have access to anything they shouldn't, and we have protection here for rug pulls, etc. We'll be adding lots more security controls. As a developer, I personally find this very easy to use and run, and now I can start automating my workflows. We'll take this further in the future to let you easily build agents using this tooling, so you can automate all this stuff. And that's running; I won't wait for the whole thing to finish. All right, I'm going to go back here. Cool. Here are some QR codes; go try it out. Go try out the MCP Toolkit; hopefully you find it useful. Give us any feedback, please.
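For contrast, this is roughly the kind of per-client JSON that manual MCP setup means editing by hand for each client. The mcpServers shape follows the convention clients such as Claude Desktop document; the image name, environment variable, and token handling below are illustrative placeholders, not an exact catalog entry.

```python
import json

# A hand-maintained client config pointing one client at one containerized server.
claude_desktop_config = {
    "mcpServers": {
        "github": {
            "command": "docker",
            "args": ["run", "-i", "--rm",
                     "-e", "GITHUB_PERSONAL_ACCESS_TOKEN",
                     "example/github-mcp-server"],      # illustrative image name
            "env": {"GITHUB_PERSONAL_ACCESS_TOKEN": "<token>"},
        }
    }
}
# Each client (Claude Desktop, Claude Code, Gemini, ...) would need its own copy.
print(json.dumps(claude_desktop_config, indent=2))
```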
And do go play with cagent and start building agents and looking at the packaging; we'd love to see any feedback you have there. Over time, you should see from us standardized packaging, hopefully an agent file, and more software around MCP. That's it. Thank you. >> Let's give it up for Tushar. Thanks. All right. Up next is someone who's been at the heart of how developers build and collaborate for years. He's the vice president of developer relations at GitHub, where he helped shape open source communities and has been involved with GitHub Copilot since the very beginning. Today he's here to talk about the MCP protocol and share some hard-earned lessons from running one of the most widely used MCP servers, at GitHub scale. Please give a warm welcome to the vice president of developer relations at GitHub, Martin Woodward. [Applause] [Music] >> Hey everybody, thanks for having me. It's always exciting when you see the thing you're about to talk about being demoed live on stage just before you get on. Thankfully, the MCP server still works, so that was good. My name is Martin Woodward, I work at GitHub, and we're going to talk about our MCP server at GitHub. But actually, what I'm mostly going to be talking about is the MCP protocol and how you can get involved in the MCP community; we'll not be touching that much on the GitHub MCP server at all. Hopefully that's good. So, as you know, you might have heard of GitHub: it's the home for the world's developers. We created GitHub Copilot in June 2021, and that kind of changed how we think about, and how a lot of people work with, developer tools. It was a very exciting project to be on; I was lucky to be involved in the first version and have been involved ever since. As we've been doing that project, we've been learning a lot along the way. We've now got over 15 million users. We started by just doing autocomplete, because LLMs weren't that good yet, so we did autocomplete because it was only a little bit of code that needed completing. Then, as LLMs got better, we went into chat. And now we're where we are in 2025, where the LLMs have gotten so good, and can run for longer unsupervised, that we're able to move to the world of the software engineering agent, which is where the industry is this year and where we've seen explosive growth in LLM usage for development. But you need to get data in, you need to do all that sort of thing, and that's where MCP has come into place: to be able to do things with your agent and get data into your agent, as we just saw. But what I thought would be useful first of all is, with all these agents together and interoperating, how did we get here? How did we get to where we are today? It's amazing to me. I'm old, very old now, and it's interesting to be here at the beginning of yet another epoch, yet another change in the way that we build things. So, as I say, Copilot was introduced in June 2021, and we only got function calling in GPT in June 2023. That's just over two years ago as we sit here in this room. So this is moving very, very fast.
We then rapidly followed that with Copilot Extensions, which was basically a way for end users to plug into that whole function-calling thing, and for developers to provide tools to Copilot specifically, to be able to talk to the rest of your development system. We're GitHub; we know that while we're at the center of your developer universe, you need to be able to talk to everything, otherwise there's no point in it being there. That's why we introduced Extensions. Then later that year, Anthropic announced MCP, and the cool thing about MCP was that the tool discovery is more dynamic: you can tell the LLM what tools you have available, and you can give the LLM sufficient context for it to do the tool calling, rather than function calling, which is a lot more API-based. We've got full support for MCP inside Visual Studio Code. And then in April of this year (gosh, this year has been a long year already), we did the local MCP server for GitHub, the official one. When Anthropic launched, they did a version of an MCP server that used our APIs; then we worked with Anthropic, built our own, made it open source, and everything's good. And we also now have a remote MCP server, so if you don't want to install anything, as you saw in the demo from our colleague at Docker, you can just talk to a remote MCP server and everything works. So that's the history, and it has moved very, very fast. But it's important to know that you're all here at the beginning of this new wave, so when you wonder "why doesn't this work yet?", it's because it's so early. As we've been building MCP servers, we've learned a few lessons. The first lesson we've learned is that, you know, everybody knows MCP for tools. It's what made MCP successful and why people start using MCP servers: it allows you to do things like create issues, send email, execute scripts, perform actions, as well as get data to add to the context of your prompt for your LLM. So that's the key value; tools are at the center. But it's much more than just tool calling. If you just call tools, then you miss out on some of the other parts of the MCP protocol that allow you to call tools better and more efficiently. These are the basic constructs in MCP. We obviously have tools at the center, on the server. But what you can also do as a client is ask for resources from the server. So if you're talking to GitHub, that's things like files, issues, data that you need to be part of your context. If you're talking to Notion, that's your Notion pages; if you're talking to a database, that might be the database schema, that sort of thing. So resources: you can separately access them as a client of an MCP server. We also have prompts, which are cool. Prompts are a bit like the stored procedures of the MCP world. You can ask an MCP server, "hey, what good prompts do you have?" And there's variable substitution in there so you can insert data. So, "can I have a prompt to do this thing, please? Can I have a prompt to do that?" You can actually ask the server for some good prompts, the LLM can use them, and it'll be great.
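A minimal server-side sketch of those three constructs (tools, resources, prompts), written with the FastMCP helper from the official Python MCP SDK ("pip install mcp"); the decorator names follow that SDK's documentation, but treat the specifics as an assumption rather than a definitive implementation.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("issues-demo")

@mcp.tool()
def create_issue(title: str, body: str) -> str:
    """Tool: performs an action on behalf of the model."""
    return f"Created issue: {title}"

@mcp.resource("issues://{issue_id}")
def get_issue(issue_id: str) -> str:
    """Resource: data a client can pull into the model's context."""
    return f"Issue {issue_id}: assembly instructions unclear at step 8."

@mcp.prompt()
def triage_prompt(product: str) -> str:
    """Prompt: a reusable, parameterized prompt the client can request."""
    return f"Summarize open feedback about {product} and propose labels."

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport for local use
```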
Over on the client side, the client can also provide a couple of things. One is sampling: the client can allow the MCP server to send back a request, to do a lookup against the LLM that the client is using. At GitHub and GitHub Copilot, we allow the developer to pick whatever model they want from whatever development environment they're in, so it's essential for an MCP server like ours to be able to use whatever the customer wants their LLM to be. If you want to use Claude Sonnet 4, great, we can do that, and now our MCP server can go to the client and say, "hey, use the user's chosen LLM to do this bit of work for me." Or if you want to use Gemini, or whatever model. So that's why that exists. Then there are roots: the ability for the client to specify where in the file system, where on the client, resources live and what the server can access, so you don't go accessing things that might need root-level permissions to see, and you stay within your sandbox. And then there's dynamic discovery, which is part of the MCP protocol; people know it discovers tools. Lastly, there's elicitation, which is very underused, as you'll see, but it allows the MCP server to tell the client, "hey, I need more information about X. I need to make a decision. Can you ask a question? Do I do A or B?" That usually goes to the client as a question — do you want me to do A or B? — or the LLM might make that decision for you, but you're asking the client for input at that point, and that helps prevent hallucinations and ungrounded work. If we look at tools, they're great, and the reason MCP was super successful is that it is such a pragmatic way of calling tools. But the problem we have today is that a lot of our examples with MCP are focused just on the tools rather than the bits around the tools. And so unless we as a community do more work to get better at tool calling, nobody's going to be aware of what else is possible, which is one of the reasons I'm here: to help you see what else is possible and encourage you to dig in. I did a quick survey of all of the MCP clients on Monday, and there are about 80 different client implementations of MCP today. 79 of them know about tools, but only three of them know about, or have implemented, the elicitation part of the protocol. So it's very, very early in terms of implementation; that part of the spec was only written in June, so it's nobody's fault, but it's still early days. If you are experimenting, building an MCP server, and you want to test out everything, I would actually encourage you (I'm not on the VS Code team; they're a sort of sister team to mine at GitHub) to give VS Code a try. It's the only end client that implements everything in the MCP protocol and all the authentication specs. So try it there, and that will hopefully help you plug it in everywhere else. Right. So the first lesson we learned once we started adding lots of tools is that more tools are not the answer. Just like with humans, the more choice you give your LLM, the more likely it is to get confused, and the same is true of tools and LLMs. LangChain did some great research on the degradation in the LLM's coding performance as it was given access to more tools.
So you get overloaded; it can break down, and it can slow down performance as well. Therefore, here's what we did initially. I have a very, very easy job, because you are my end users: you're developers, you're technical. It's easy for me, because I can ultimately just give you control, and you can decide which tools to switch on and off, and you like that as developers. Many people who are not building developer tools don't have that luxury, because if I went to a normal person while they were booking a flight or doing some insurance comparison, talked about agent-to-agent protocols, and asked them to pick tools, no, they're not going to do that. Normal people don't want to do that; developers like to do that. So we initially gave you all of the control in the client to select which tools to enable, but what we obviously need is dynamic tool discovery. That's what's great about the MCP protocol: you can query the server to ask which tools are available. And what was added in the June version of the protocol is the ability for the server to say, "okay, we've been having this conversation; I'm now going to add a new tool that you, the LLM, can call, which got unlocked at this point in the conversation," and to broadcast to the calling client that this tool is now available. It can likewise broadcast that a tool has gone away. By doing that, it reduces the number of tools available to the calling model at any one time, which increases accuracy and prevents confusion. So that's what dynamic discovery is good for. The next lesson we learned, unfortunately, is that installing MCP servers is a pain. Nobody wants to do it. Trying to do cross-platform MCP installation is an absolute nightmare. Now, Docker, as we've just seen, is fantastic for a developer for getting a containerized MCP server, but that's fine for us as developers; it isn't going to work for end users, for normal people. So containerization is great, but it only gets you so far. And if you look at the number of commands you can use when you're defining your MCP server, there's everything here; trying to get it to run on Linux, on Mac, on Windows is basically impossible to do reliably as a local MCP. Oh, and the other lesson we learned: I did a quick grep of the logs for people calling our endpoints from MCP servers. Nobody upgrades a working local MCP installation, ever. Once it works, I am not touching it until it stops working. Now, this protocol is changing so fast, our world is changing so fast, that that's not sustainable for us. You know what I mean? If people are never going to upgrade, and, because of the way we do installation, auto-upgrade is basically impossible, then local, again, is great for development but is not going to work at scale once we start rolling these out to normal people. So remote MCP servers have been introduced, and this was in the protocol from the very early days; MCP has always been a network protocol. A remote MCP server is obviously installed remotely; it doesn't have any access to the client, like the file system, without asking the client, and it's very easy to upgrade and to scale. So as a service provider, remote MCP is great, and as an end user it's fantastic too.
But as a developer there are some downsides: if I'm building an MCP server and I have to stand up a remote server first, that's hard. So local MCP has some great advantages for you as a developer when you're building MCP servers, for speed of iteration and for experimentation. The critical thing you need with remote MCP servers, as we mentioned, is good authentication, but we're going to touch on that in a second; locally you can rely on local secrets and the permissions of the local user. So the trick we used is that we built a local MCP server, and then we used that exact same logic, hosted, to be our remote MCP server. Our local server is fully open source; everybody can develop on it, anybody can add features, it's all good, and as an internal team we can iterate very quickly. Then we take that logic, host it remotely, and upgrade our remote server as well. That then enables server-to-server scenarios: you can go to github.com, talk to the GitHub MCP server, and manage that whole thing, because there's a remote instance, rather than having to rely on your local instance. The next lesson we had, once we did remote MCP servers, is that password- or PAT-based authentication is bad. It's an anti-pattern. Managing those secrets is hard; they always become long-lived secrets, which is bad for security; and preventing them from being accessed by other MCP servers that have access to the file system is very, very complicated. There are lots of reasons why password- and PAT-based authentication is bad, which is why in June we actually added OAuth support to the MCP protocol. So now MCP supports proper tokenized security with short-lived tokens, and it supports OpenID Connect as well. So OAuth support is key. But not every client SDK has OAuth support yet, and, as those of us who have built OAuth integrations know, there's more pain there, more friction to go through, but it's necessary friction before you productionize something that's talking to MCP servers. Locally you can get away without it fine, but once you go into production, you really need to be supporting OAuth.
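A hedged sketch of that "same logic, hosted two ways" idea from above: one server definition, run over stdio for local development and over HTTP when hosted remotely. The transport names here are those used by the official Python MCP SDK at the time of writing; treat them as an assumption rather than a definitive recipe.

```python
import os
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("github-style-server")

@mcp.tool()
def list_issues(repo: str) -> list[str]:
    """Same tool logic whether the server runs locally or is hosted."""
    return [f"{repo}#1 assembly instructions unclear"]

if __name__ == "__main__":
    if os.environ.get("REMOTE") == "1":
        mcp.run(transport="streamable-http")  # hosted: multi-user, easy to upgrade
    else:
        mcp.run(transport="stdio")            # local dev: fast iteration, local secrets
```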
There's a lot of trust involved in which MCPs you switch on and enable, so we need to make it possible for people to run their own MCP registries and distribute that load. GitHub now has an instance of an MCP registry, the GitHub catalog of MCP servers, which adds additional data on top of the base catalog, such as how many stars a server has and how popular it is. So we can see Context7 is really popular; the GitHub one, obviously; and Markdown, which is a cool plugin from Microsoft that turns things like doc files into markdown so you can crawl them — super popular. What we want to do is build an ecosystem just like you saw in the Docker example: a central registry where we make MCPs discoverable, and then local registries on top. Docker has a registry you can go to when you're working inside the Docker ecosystem, and your company probably should have a registry at some point that lets you decide which servers you trust in your environment, all those sorts of things. That discoverability is the key to the next level, to making MCP the success it is. MCP has now become the API layer of AI. It's how this is going to work; I don't see anything else coming along here. There are complementary things like A2A for talking between agents, but in terms of tool calling, of being able to do things, MCP is the protocol we're going to be using. MCP is fully open; it's available for everybody to use and to talk to. So I would encourage everybody to get involved. If you want to be part of this community, now is the time to influence the direction of what we're probably all going to be relying on for the next 20 years. If you want to go to the MCP community site and get started, that's there, and there's also a Discord you can join, in terms of being involved and being part of the community steering the future of MCP. I'll leave that up briefly until I see the cameras go down. Great. One more second. There we go. Okay. So, those were the lessons we learned, very briefly. I'm around all day today if people want to talk and chat a bit more, but what I want to do is encourage everybody to get involved in the MCP community. Let's build from here together. And now on to questions. Thank you for your time. >> Awesome. Thank you, Martin. All right. Would you like to join me for a few questions? >> Oh, we're sitting down. I'm cool. >> Let's do it. >> Yeah. >> All right. Wow. So many lessons about MCP, right? And I actually loved when you mentioned that you dynamically pull the tools. >> Yeah. >> And you spoke about remote MCP and OAuth. Where do you see the protocol going, or is there anything you feel you actually need in the protocol at the moment? >> Yeah, I mean, the protocol as it stands in June is pretty much complete now for what we know we need. I think the biggest thing is to get all the client implementations in place, getting the SDKs to make it easy for developers to use the full protocol, because you don't want to care about elicitation and dynamic discovery and prompts and things. You just want to go to an API and say, call this in a good way, you know.
So the next bit of work is to simplify access to the underlying protocol, I think, and then probably discoverability and the registries are the next area. >> Right. Yeah, I totally feel you on discoverability. I think maybe agents should be smart enough to go and find the right MCP server themselves. >> That's exactly why there's an API there, so that agents can do that autonomously. But as we've seen, there are judgments to be made in terms of trust, because you're having a thing execute things on your behalf. And there are also value decisions: say you had five flight-booking MCP servers — which one do you call? The logic of which one you call is critical business logic that you want to have control over. So it's an interesting area, but I think the fundamentals are now there in the protocol. >> Awesome. You also mentioned that MCP is the new API layer. So do you encourage everybody to get started by just turning their APIs into MCP servers? Is that the right way to get started? Like, if I start with an MCP server, what do you think? >> Yeah, I think LLMs are ridiculously good at writing MCP servers. So the easiest thing is to crack open VS Code with Copilot, or crack open Claude Code, whatever you want to use, and get it to build you a quick MCP server that you can play with. We just did an MCP server that lets you play a game of tic-tac-toe, that lets you interact. Doing that locally, and being able to see how it's working, just helps you understand the base protocol, and then, sure, having a remote MCP endpoint for your business might make sense for a lot of people. I also think there's probably a need for some MCP servers for platform engineering, to help with how we build and deploy things inside our organizations. If we can make it so that every developer, from their development environment, can say "give me a new development environment, please" or "add this database connection" — all those things where today you need to go talk to a human, get into a ticket queue and wait — if you can automate some of those with MCPs inside your organization, that'll really speed up your development flow, I think. >> Awesome. Yeah, you mentioned that you've been involved with the developer community for quite a few years now. So when you meet new students who want to get into computer science, what do you usually tell them? >> Yeah. I'm looking — we've got a minute left. So yeah, I've got skin in this game, because my son is in the third year of a four-year master's in computer science. At GitHub I also look after our education programs globally, so millions of students worldwide. I firmly believe we need more computer science students tomorrow than we have today. The students that are coming out — we need to give them exposure to all these tools, so we give it to them for free, so that they and their teachers can have exposure to using AI tools. And it's really interesting to see that it gives them much more exposure to working with somebody else — i.e. the AI agents — than they would typically have coming out of an education program, which is encouraging to me.
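For the quick, throwaway MCP server Martin suggests building a couple of answers back, here is roughly what a minimal local one looks like using the FastMCP helper from the official MCP Python SDK (module path and defaults assumed from the current SDK layout); the tic-tac-toe "game" itself is a deliberately trivial stand-in.

```python
# pip install "mcp[cli]"  -- the official MCP Python SDK
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("tic-tac-toe")   # server name shown to clients

# Extremely simplified game state: nine squares in a flat list.
BOARD = [" "] * 9

@mcp.tool()
def show_board() -> str:
    """Return the current tic-tac-toe board as three rows of text."""
    rows = ["".join(BOARD[i:i + 3]) for i in (0, 3, 6)]
    return "\n".join(rows)

@mcp.tool()
def play(square: int, mark: str) -> str:
    """Place 'X' or 'O' on square 0-8 and return the updated board."""
    BOARD[square] = mark
    return show_board()

if __name__ == "__main__":
    mcp.run()   # stdio transport by default, i.e. a local server for experimentation
```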
Typically as a student you come out thinking "I need to do everything myself", group projects are painful, "I never want to work in a team" — that's kind of what you learn from your degree. But when you're working with AI, you get used to analyzing code that was created by another agent, seeing what works, seeing where it's gone wrong, and being able to critically fix things. And that's a skill we as engineers need every single day. So I'm actually pretty encouraged by what I'm seeing in the data. But we need more and more engineers, so I strongly encourage students to do that, and I also strongly encourage companies to make sure they're hiring new junior developers, because nothing grows a senior engineer more than growing a junior developer. >> Yeah, it's very interesting, because you hear people saying that the junior developer is kind of dead. >> Not at all. We need them more than ever, and the impact that a junior developer can have is a lot greater than it has been. I think there's never been a better time in history to be a software developer, especially if you don't speak English as a first language, because now the LLM can explain code to you in your native language. That's huge. I take it for granted because I speak okay English, but if you're learning computer programming and have to learn English as well, that's horrible. So it's great we're improving that too. >> Awesome. Thank you so much, Martin. >> Good to meet you. Thank you for your time, everybody. Thanks. >> Let's give it up for Martin. Thank you so much. >> All right. How are you guys feeling? Good. Ready for a break? Okay. So 11:30 is when you need to be back at this stage. So please, yeah, let's go for a quick break and be back at 11:30. See you in a bit. Thank you. [Music] [Applause] [Music] Ladies and gentlemen, please join me in welcoming to the stage your MC for the AI Engineer Paris, developer experience engineer Ralph Chabri. [Music] And we're back. [Applause] You guys have been enjoying the expo. It's incredible, honestly. So, have you enjoyed the expo? Yeah. I need some energy, some expo energy here on the main stage, please. Okay. All right. So, coming right up, our next speaker has spent over a decade building infrastructure, and now he's building the new infrastructure for AI workloads. In his talk, he's going to cover what happens after the GPU gold rush: are agents going to need more compute, more storage, more network? Well, we'll hear about that. And he's going to talk about diverse hardware as well. So please join me in welcoming to the stage my friend, CEO of Koyeb, Yan Leger. [Applause] [Music] All right, and we're back on stage. So this time I'm not going to speak about AI Engineer Paris. I'm going to speak about what we actually do and how we see AI continuously redefining cloud infrastructure. My main job is to run a company called Koyeb, and as Ralph mentioned, I've spent the last 14 years in the infrastructure space in general. So I've seen a bunch of evolution, from the beginning of cloud infrastructure to now and this new wave with AI. We're going to spend 20 minutes together looking a bit at that evolution and at what is coming next with the agentic era.
So the first thing I want to highlight: if you look back two or three years, with gen-AI engineering, the main thing that popularized AI engineering was LLMs backed by GPUs. Typically, ChatGPT and the OpenAI models were LLMs running on GPUs. So if you look at the early-2023 AI stack, two and a half years ago, it was relatively simple in terms of the number of components. You used to have an API and vector databases on one end, and then a limited choice of models: in early 2023 you had the GPT models from OpenAI, closed source, and a few open-source models — the Llama models from Meta, and Stability — but Mistral wasn't even there yet. And it was running on GPUs with the Triton Inference Server, probably simply because even vLLM didn't exist, and we're talking about only two and a half years ago. But now AI engineering is also context engineering, VLMs — models that include image and vision capabilities — video models, vibe coding, a big trend of this year, MCP, and agents. We see all of this as an infrastructure provider, and I want to dive with you into what the future of AI infrastructure looks like, how we are trying to build for this agentic era, and what changes we see. So the agentic AI stack is way more sophisticated. I picked a set of technologies that we see in this new stack. It goes from the front end, which is regularly vibe-coded — so you have app builders, Manus or Lovable, to vibe-code your front ends; our team made an experiment with this, which you can see on the expo floor. You have APIs, which might be in more diverse languages because you have more SDKs available. You have the agents, which are probably still going to be written in Python, but as distinct components. Even the historical databases now have vector capabilities or capabilities for AI. Then you have MCP, and for the execution and the inference of the model you also have more diversity: AMD is a credible player now and can be used for specific models, so you might have an image model or a video model running on one kind of hardware, or LLMs running on an accelerator from an upcoming hardware provider trying to disrupt Nvidia's market. And you might not even see any of this, because you might be operating at a higher level of the stack with inference endpoints. If you look at what this means in terms of the hardware infrastructure to operate, it's now a mix of GPUs, CPUs, and accelerators. You have large-scale training, which is still performed with GPUs; fine-tuning and small-scale training, where you will probably still have GPUs; inference, where you can have accelerators — people are starting to train with AMD GPUs today, but inference is where you're going to see mostly AMD GPUs — and then agentic workloads, which actually run on the good old CPUs. Now let's dive into agents. I want to start by defining what an AI agent is and how we see it, because the definition of an agent varies highly. The simple definition, if you look it up, is a software system that uses AI to pursue goals and complete tasks on behalf of users. They show reasoning, planning, and memory, and have a level of autonomy to make decisions, learn, and adapt.
For us, the key question is: what is important in this definition? What impacts infrastructure? It's a software system — at the end of the day it's still Python or TypeScript or another language that is going to run on our servers — it's going to complete tasks, and it has a level of autonomy. What that autonomy means technically is that you're going to run untrusted code, in general code generated by a model that is nondeterministic. Anybody who has ever worked in security should think about the danger that involves, and about the requirements it creates. The first requirement we see with agent workloads is secure sandbox environments, to make sure this untrusted code is not dangerous to your production platform or environment. We also see requirements in terms of performance, because agents still need to run fast. They are mostly ephemeral, so they run for a short period of time; because there is high volume, efficiency is a key topic, and deployment speed is also a key topic that we see. Now, on the Koyeb side, we build a global serverless platform, and we do this for agents and inference. We provide a diverse set of hardware to support all of these use cases. And I'm going to briefly show you — I'm going to do a live demo, if it goes well — and we're going to execute sandboxed code using the Koyeb MCP server. Let's do it. All right. So we are going to use Claude Code with an MCP connector here. It's called Koyeb Sandbox, and what it's going to do, technically, is create a sandbox to execute our untrusted code — which today will be relatively safe, we'll execute a simple addition — but we'll see what happens behind the scenes. So here I have something which is preconfigured, or should be preconfigured, to run using the Koyeb Sandboxes project. That's interesting. Okay, so we do have network here. And basically, okay, here we go. This basically instructs the agent to use a Koyeb sandbox to execute untrusted code. So what we're going to do now is ask it: can you please execute this untrusted code? And it's going to use our MCP server to create a sandbox. All right. So here we have the account where we run the MCP server, and we're going to see the sandbox being created behind the scenes. We've done this demo with Claude Code, but it could be done with anything else; we also have a demo running with Mistral to create the sandbox. Here we see that it's creating a sandbox called "calculation sandbox" and that it's getting created behind the scenes by the prompt. Eventually the sandbox is healthy and we get the result here. Now let's look at what's happening behind the scenes and what it means. Right now I did this quickly — I went to the coding agent and executed code — but coding agents are doing this automatically for you, without you specifically asking for it.
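For a feel of what the agent is doing under the hood during that demo, here is a rough sketch of a client session calling a sandbox tool over MCP. The server URL and the tool names (`create_sandbox`, `exec`) are made up for illustration and are not the real Koyeb MCP server's interface; the client helper imports assume the current layout of the official MCP Python SDK.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client  # assumed SDK module

REMOTE_MCP = "https://mcp.example.com/"   # placeholder URL, not the real endpoint

async def run_untrusted(code: str) -> str:
    # Connect to a remote MCP server over streamable HTTP and open a session.
    async with streamablehttp_client(REMOTE_MCP) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Hypothetical tools: one creates an isolated sandbox, one runs code in it.
            sandbox = await session.call_tool(
                "create_sandbox", {"name": "calculation-sandbox"}
            )
            result = await session.call_tool(
                "exec", {"sandbox": sandbox.content[0].text, "code": code}
            )
            return result.content[0].text

if __name__ == "__main__":
    print(asyncio.run(run_untrusted("print(21 + 21)")))
```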
So behind the scenes, as a service provider, what we see is that agentic workloads need to create thousands of secure sandboxes daily, with sub-second starts — in this case you want the result to be immediate. And this year is the year we started seeing people coming to us saying, hey, we need to deploy 10,000 containers a day, because we have these agentic workloads which do this automatically. If you use Lovable, behind the scenes it will automatically do this for you and create a container on an infrastructure provider like us — it does this on Koyeb, for instance. One of the key questions for us, from a technical standpoint, is how fast can we boot an agent and how many agents can we boot, so that the experience for end users is seamless. The technical problem behind it is that we deploy Docker containers, so they are stored in our registry and we need to transform them into agents executing your code. Behind the scenes, our stack actually leverages virtualization to isolate the containers: we use Cloud Hypervisor to run the container; on top of this we have the container and the agents, and, if it's an inference workload, we might have an inference engine like vLLM; and below, we operate on bare metal. So when someone types a prompt like this, we need to get all of that ready. Bottlenecks: there are multiple, knowing that we're aiming for a sub-second start. If you were to create a bare-metal server from scratch, it would take minutes. Pulling an image from a registry can take several seconds for small images; if it's a large model, it's going to be way longer. Networking is going to be a problem, because configuration has to converge between all your servers. And then you still have to start the runtime of your agent: typically the Python runtime takes some time to start. In our case, one of the main bottlenecks when we started seeing this increase in volume was networking. Some people might think it's the virtualization engine — Firecracker has been quite popular lately, even though it's not totally new — but in practice, for us, the virtualization engine is not the bottleneck. In most cases it's really the networking part, and it might even be something completely different in your system: if you have a vault system, it might start taking a lot of requests and simply become the bottleneck. We were able to mitigate most of these problems. We preemptively start the machines, we cache the images on the hypervisors, and on the networking side we made a lot of optimizations to reduce the time. The last part, which we don't control, is the agent runtime itself — we don't optimize the Python runtime ourselves — so that part is still one of the components that can take time to start. We do have a mitigation, though, to support this workload, and the mitigation is called scale-to-zero. That's how we mitigate most of the cold start. So let me back up a bit. On the agentic workload side, we have two key patterns.
You have ephemeral sandboxes, which are executed for, say, a minute, and then, if you are for instance using Lovable, it creates a web server at the end which runs constantly. One of the challenges for these providers is that a lot of people try things out, and then you have a huge fleet of services which are actually idling most of the time — but someone might still be using them, so they want to keep them up and running. So to mitigate this cost problem, we scale the workloads down to zero. If you've been in infrastructure in the past, one of the challenges when you scale to zero is how fast you can restart the server, because we completely shut down the VM, so you hit that startup time. We have two techniques, scale-to-zero and autoscaling, to increase efficiency, and we're able to reduce cold starts with snapshots. What we do technically behind the scenes is memory snapshots: we save the memory of your agent. We call this mechanism "light sleep", and basically it reduces the cold-start time from several seconds to 100-200 milliseconds, which is not visible at all to end users. So we end up being able to manage large fleets of machines without increasing your cost as an operator of such a technology. This is available now on CPU, and we also have scale-to-zero on GPU: we're bringing the same principle of memory snapshotting to GPUs for these kinds of workloads. If you want to know more, you're welcome to get started with our platform, and we have our engineers on site today, so please don't hesitate to ask them any questions if you're curious about how it can be helpful. Thank you, everyone. [Applause] >> All right. Thanks. Yeah, thanks, Yan. We'd like to do a couple of questions. >> Yeah, let's do this. >> Let's do this. I really enjoyed your intro yesterday, and I think we need to make sure that everybody understands that Koyeb is not an agency, it's not an event company — it's actually an infrastructure company, right? So, and you said you've been in the space for over 14 years, which is super impressive. Can you touch a little bit on what has changed in these 14 years of building infrastructure, for cloud and now for AI? >> Yeah, I mean, I do love this question, because we saw a bunch of changes while some things stayed consistent. For instance, what is funny is that if you look at our engineering technologies and our technical stack, we started four years ago with Firecracker as a virtualization technology, and then at some point, when we added support for GPUs, we went back to QEMU/KVM, because GPUs were not supported with Firecracker — and QEMU/KVM is a technology we were already using >> 14 years ago. >> It's been around for a while. >> So there are some things which are kind of consistent. In terms of changes, it's the scale at which we have to operate, in terms of the number of containers. Before, you would have this volume, but spread across several customers; now you have a single customer coming to you and bringing all of that workload, for instance. Some topics are consistent, like GPUs: if you look back at 2017, we were already deploying GPUs, right? I was mentioning 2023, but 2017 is, what, eight years ago. So we were already deploying GPUs.
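Stepping back to the scale-to-zero and "light sleep" mechanism Yan described a moment ago, here is a toy sketch of the control flow: snapshot and stop an idle VM, then resume from the snapshot on the next request instead of cold-booting. The class, method names, and timeout value are all made up for illustration; the real platform's API and numbers will differ.

```python
import time

IDLE_TIMEOUT_S = 60   # made-up value: stop a sandbox after a minute without traffic

class SandboxVM:
    """Toy stand-in for a microVM hosting an agent or a vibe-coded web server."""

    def __init__(self, image: str) -> None:
        self.image = image
        self.snapshot = None
        self.running = False
        self.last_hit = time.time()

    def cold_boot(self) -> None:
        # Slow path: provision capacity, pull the image, start the runtime.
        self.running = True

    def light_sleep(self) -> None:
        # Snapshot guest memory, then shut the VM down (scale to zero).
        self.snapshot = f"snapshot-of-{self.image}"
        self.running = False

    def resume(self) -> None:
        # Fast path: restore memory from the snapshot (~100-200 ms in the talk).
        assert self.snapshot is not None
        self.running = True

def handle_request(vm: SandboxVM) -> None:
    if not vm.running:
        vm.resume() if vm.snapshot else vm.cold_boot()
    vm.last_hit = time.time()
    # ... run the agent / serve the request ...

def reaper(vm: SandboxVM) -> None:
    """Called periodically by the platform: put idle VMs to light sleep."""
    if vm.running and time.time() - vm.last_hit > IDLE_TIMEOUT_S:
        vm.light_sleep()
```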
The difference is that at the time they were actually consumer-grade GPUs from Nvidia, which was only starting to create data-center-grade GPUs. So you have things which are consistent over time, and you have things which are completely new, like the diversity of accelerators. This is still maturing, and we believe it's going to take at least five more years to have decent competition where Nvidia starts losing market share. >> Is that what you're predicting? This is not financial advice, by the way. Just saying. >> Yeah, I mean, we do believe that the monopoly of Nvidia is going to fall. We saw it in the past with CPUs. If you look back ten years, ARM was completely nonexistent in the data-center market, Intel was the dominant player, and AMD was struggling in the data-center market. >> Then AMD came to market, pushed back Intel's market share, and now Intel is not in a great situation. So >> we do believe the same is going to happen with GPUs. Then the question is how fast. >> Wow. Okay. So you're saying that we had the building blocks to build the AI infrastructure before, and now it's just that the scale of the demand is higher than what we used to see? >> The scale is completely different, and if you look back at the slide I made on the AI stack, it looks like a microservices stack again. At the beginning, two and a half years ago, everyone would just rack GPUs, and you had a set of new players who came to market by just providing GPUs. >> Now, if you want to have a decent infrastructure, you actually need a set of services. It's not as simple — even behind the model itself there's a lot of complexity — and it's also now several services that you need to coordinate again. >> All right. Okay. So, in your presentation you touched on something that is really important to me, which is ephemeral environments, and you talked about secure sandboxes. >> Yeah. >> Do you see anything missing in those secure sandboxes for AI agents? What's the next step for them? >> On the agentic side, we see a lot of development actually happening at higher levels of the stack. You need these primitives where the container starts fast; you need to be able to execute containers, because people want the flexibility to run different runtimes. >> Yeah. >> Then you need this isolation, which we're providing as a building block, and then people tend to look at SDKs, which don't need to be sophisticated but which let them create sandboxes quickly. So we are also looking into this: helping AI engineers create sandboxes without having to really think about the life cycle of the sandbox. Typically that's what the MCP tools are doing in the demo I just made. >> All right. Well, thanks, Yan. >> Thank you >> for your time. Thanks for being here and thanks for the talk. >> Let's give it up for Yan once again. >> Thank you. >> Thank you. >> All right. So, I'm particularly excited for our next speaker. Coming up next is someone whose work has shaped the entire generative AI landscape. He's the co-founder of Black Forest Labs, the team behind the state-of-the-art model Flux.
Before that, he was a researcher at LMU Munich, Nvidia, and Stability AI, where he co-invented latent diffusion, which powers Stable Diffusion, Midjourney, and DALL·E. He was even named one of Germany's top 40 under 40. Today he's going to pull back the curtain on Flux and tell us how it's built. So let's see about that. Please join me in welcoming the co-founder of Black Forest Labs, Andreas Blattmann. [Music] So hey everyone, thanks for having me today. Thanks, Ralph, for the nice intro. I want to talk today about our most advanced image model family, Flux, and explain the concepts you really need to understand how it works. I think Ralph has said a lot about me already. So, I'm Andy, a co-founder of Black Forest Labs. We are the company behind the well-known Flux model family, and at BFL we believe that visual media will be the central interface for human communication in the future. Our mission, our vision, is to become the central infrastructure provider powering human communication through visual media in the future — not only what cameras can do, but way beyond that. With that mission in mind, we incorporated the company in April 2024 and launched it in August 2024. We've since grown to 40 full-time employees distributed across two headquarters: our main office is in Freiburg, Germany — actually in the Black Forest — and we also opened an office earlier this year in San Francisco. We launched our first model family, the Flux model family, on the day we launched the company, and since then we've mainly structured our releases in three tiers; I just want to shed a bit of light on those. We have the Flux Pro models: these are our best, most advanced, highest-quality and fastest models, and they are only available via the BFL API. They are enterprise grade, and you can scale from zero to massive volumes nearly instantly and without any infra hassle. As some of you might know, my co-founders and I were the original developers behind latent diffusion and Stable Diffusion, so we still have very strong roots in the open-source community, and that's why we also still publish openly available models, in two tiers. We have the Flux Dev models, which are open weights: available for everyone, downloadable on Hugging Face, self-hostable — perfect for someone who wants to host a text-to-image model on their own infrastructure — and fully customizable, so you can fine-tune them to any extent. In fact, they now have a huge ecosystem attached to them, of LoRAs, of fine-tunes — a lot of stuff going on in the open-source community, which is super nice to see. And finally we have the fully open-source Flux Schnell model. It's super fast and ultra-lightweight, and it's basically the perfect entry point into the Flux ecosystem. And speaking of the ecosystem, we can look at the model atlas on Hugging Face, which basically shows the ecosystem — meaning fine-tunes and so on — around the most important foundation models across domains. Guess which single model has the largest ecosystem attached to it: it's actually our Flux Dev model. So you can see that we already shape the image generation space in the open-source community very heavily, and as I said, our vision is really to advance this further, to become the central provider of the images and videos that humans will communicate with in the future.
All right, that much about the company. Let's now come to the core part of the talk: how to unify text-to-image generation and image editing. Why is that important? First of all, image generation has made huge leaps in the past five years, and we've really been at the forefront of that, I think. But image editing has, until very recently, not kept up with that speed of advancement. I would argue that image editing is at least as important as, if not more important than, text-to-image generation, because it allows us to iterate over content multiple times and gradually refine it — I'll show you what that means in a second — and by that it just gives us much more control over the output. Whoever has worked professionally with images will totally understand what I mean here. So that's why I think it's a very interesting and important problem to solve. And with Flux Kontext, which we released in June 2025, earlier this year, we published the first diffusion model — or flow matching model — that combined text-to-image generation and editing. That really unlocked properties we had not seen before: things like character consistency, style reference, local and global editing — everything within one model, available within seconds once you prompt it, at really high speeds. The top row of images I brought pretty much visualizes this. We start here on the right with this image, we remove the object from her face, we keep the character and then transform her into a completely new scene — in this case this inner-city setting — and in the rightmost example we just change it to a winter scene. This is all possible within single seconds, and arguably, for anyone who has used Photoshop before, this took just very long in Photoshop before we released this model. Here are a couple more examples. Style transfer is really nice: in the left example we take the input image and transfer its style onto a new prompt, essentially onto new content. Or we can do fun things like text editing: in the right example we changed "Montreal" to "Freiburg" while keeping the font exactly the same. Importantly, Flux Kontext also solves a lot of interesting business problems. We can actually get from an in-the-wild photo of this skirt to a full-blown product shot within a couple of seconds, and on the right side we see that we can transform a simple sketch into a full-blown render, also in a matter of seconds. So let's look a bit more at the pipeline and at how image generation differs from image editing. Let's start with a classic text-to-image generation pipeline: we use a prompt that describes a scene, we push it through the network, and we arrive at an image that hopefully follows that prompt — in this case it does, I think. For image editing, it's quite a bit different. Here we start with an image, and instead of describing a whole scene that we want to generate, we only describe what I'll call an instruction prompt, which tells the model how it should change the initial image. So here I say "convert this to a Lego scene" and we get the image of this church as a Lego scene — again, all in a matter of seconds. Combining these two aspects in a single model is super important, because it gets rid of the very manual fine-tunes or complex workflows we had to use before.
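To make the difference between the two pipelines concrete, here is a purely illustrative toy sketch of the two call shapes. The `generate` and `edit` functions and the `Image` class are made up and just manipulate strings; they are not the BFL API, only a way to show that generation takes a scene description while editing takes an existing image plus an instruction, and that edits can be chained.

```python
from dataclasses import dataclass

@dataclass
class Image:
    description: str   # toy stand-in for pixel data

def generate(prompt: str) -> Image:
    """Text-to-image: the prompt describes the whole scene."""
    return Image(description=f"rendering of: {prompt}")

def edit(image: Image, instruction: str) -> Image:
    """Editing: an existing image plus an instruction describing only the change."""
    return Image(description=f"{image.description}, then {instruction}")

img = generate("a small church on a hill at sunset")
img = edit(img, "convert this to a Lego scene")    # local/global edit
img = edit(img, "make it a winter scene")          # iterate again on the result
print(img.description)
```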
So this is, again, text-to-image generation and text-based generative image editing combined in one model, and we don't need to fine-tune anymore or add workflows, which were the primary means of getting these editing capabilities before we released the model. All right, I think this gives a bit of a glimpse of the impact; I'll also show you a live demo of the model later. Now I really want to dive into how this model actually works. In order to understand it, we have to look at an algorithm called latent flow matching, which is built on two concepts: "latent" and "flow matching". I want to shed light on these two concepts, because they are really important for understanding how the model works. So let's start with "latent". This comes from the algorithm of latent generative modeling, which my co-founders and I invented five years ago while we were still students at university. To start with, I just want to show you these two images. The left one is a JPEG image, which is an approximation of the right one, a PNG image. I would say these images look exactly the same. But if we look at the file size, we see something remarkable: the JPEG is close to ten times — close to an order of magnitude — smaller than the PNG. So apparently we can remove a lot of information without actually noticing it. This is very interesting, and we can visualize it a bit differently: if we plot the perceptual similarity between an approximation of an image (in this example the JPEG) and the original (the PNG) against file size, we get a plot that looks conceptually like this. On the left side, perceptual similarity quickly increases with file size, but then it stays very constant. You might ask: okay, what does this have to do with generative modeling? It actually has a lot to do with generative modeling, because it shows us that when we train a model on a perceptual signal like an image, for that image to look real we don't need to model all the high-frequency, imperceptible details — this flat part of the curve, where perceptual similarity doesn't increase anymore. That part doesn't actually matter to our eye, and training on it would just be a great waste of compute and time, so we should not do it. And this is the core of latent generative modeling: we want to find a representation of an image that only contains the details that actually matter to us. Here is how we do it. We call this representation the latent space — this is where the term latent generative modeling comes from, generative modeling in the latent space — and we train a so-called autoencoder to extract that latent space before we train the actual generative model. It works like this: we start here on the left side with an input image, push it through a CNN encoder, and extract the latent representation on which we will later train the generative model. Then we apply an operation called regularization, which forces the model to discard information from this latent representation; it can be discrete or continuous, and we mostly use continuous operations here. Then we push it through a decoder again to reconstruct the original image, and we train this model end to end.
So we use a reconstruction loss that minimizes the difference between the reconstructed image and the input image. And importantly, we add — here on the top right — this discriminator loss, which is essentially a prior on human perception. It makes sure that the details contained in the reconstruction are only those that matter to our perception, and like this we arrive at a latent space that really only contains the details that matter to us. So this is the first part, and once we've done this, we train the generative algorithm to generate images starting from an easy, tractable distribution — I'll come to this on the next slide. This algorithm is called flow matching. Flow matching is a general family of algorithms that wants to find a vector field, parameterized by a neural network, that maps from a simple distribution — visualized here on the right, always the normal distribution — to a very complex and unknown distribution, which is our natural distribution of images, visualized here. This all happens in the latent space, as I just explained: we have the encoder to encode, and we model everything in the latent space. So we want to learn a vector field parameterized by a neural network, and flow matching gives us a super simple algorithm to learn it. All we do is couple each sample from our data distribution with a random sample from our normal distribution — say this one — and we repeat this for every sample in the dataset. Like this we can construct an artificial vector field. This obviously looks really wrong, because in a vector field trajectories can never cross, otherwise it is by definition not a vector field. The interesting thing about flow matching is that if I just do this and train the model to approximate this horribly wrong vector field, I actually end up with a true vector field: if I do this very often, the model in the end approximates the true vector field, where the trajectories don't cross. This is the flow matching algorithm. And obviously, since we want to control the output based on text prompts, we always condition our deep neural network on those text prompts, so we can control it later. Once we sample from the model — say we've now learned this vector field and the network represents it — we just apply the learned vector field step by step in a numerical integration scheme. We start with a sample from the easy, tractable distribution, the normal distribution, and apply a numerical integration scheme where each integration step is a forward pass of the neural network, and like this, in more or less 50 steps, we get from a sample of the easy distribution to a sample of the data distribution, which we then push through the decoder, and we arrive at a generated image in the end. So this is the flow matching algorithm combined with latent generative modeling. All right. So this is how you can generate images based on text. But now, how does this apply to Flux Kontext, which can also do image editing? We do it with a very simple trick, I would say. I want to share a bit about the architecture of the model. We train a general transformer model, which is the backbone of what we're doing for the Flux models, and we condition it — and this is the same for image editing and text-to-image generation — on a text prompt.
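As a compact, hedged summary of the two training stages described above (the talk only sketches them, so the exact weightings, regularizer, and loss variants used in the production models are not given here), one common way to write them down is the following; the linear-interpolation form of the flow matching objective below is the usual "pair data with noise and regress the straight-line velocity" recipe the speaker describes.

```latex
% Stage 1: autoencoder defining the latent space (schematic).
% E = encoder, D = decoder, x = image, z = E(x) = latent representation.
\mathcal{L}_{\mathrm{AE}}
  = \underbrace{\lVert x - D(E(x)) \rVert}_{\text{reconstruction}}
  + \lambda_{\mathrm{adv}}\, \mathcal{L}_{\mathrm{disc}}\big(D(E(x))\big)
  + \lambda_{\mathrm{reg}}\, \mathcal{L}_{\mathrm{reg}}\big(E(x)\big)

% Stage 2: flow matching in latent space, with text conditioning c.
% Couple a data latent z_1 with noise z_0 ~ N(0, I), interpolate linearly,
% and regress the constant velocity of that straight path:
z_t = (1 - t)\, z_0 + t\, z_1, \qquad t \sim \mathcal{U}[0, 1]
\mathcal{L}_{\mathrm{FM}}
  = \mathbb{E}_{z_0,\, z_1,\, t}\,
    \big\lVert v_\theta(z_t, t, c) - (z_1 - z_0) \big\rVert^2

% Sampling: integrate the learned field from noise (t = 0) to data (t = 1),
% e.g. with Euler steps (roughly 50 for the base model, 4 after distillation):
z_{t + \Delta t} = z_t + \Delta t \; v_\theta(z_t, t, c)
```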
That's here on the top left. But instead of conditioning only on the one image we want to generate, we condition on an additional image, which is now the context image for the model. So here we have two images. We push them through the latent space, and in the latent space we form a sequence of tokens: we have text tokens and we have image tokens for the two images — the one we want to generate and the one that is our context image. We handle this token stream with a transformer architecture, which is composed of double-stream blocks, where we have domain experts for the visual and the text tokens, and then classic transformer blocks that handle all the tokens in the same way. This is how we parameterize our network, and we can switch from text-to-image generation to image editing simply by adding or removing this context image: if you do text-to-image generation, we just remove that image. So this is how it works — just a conditioning trick in the end. The last question I want to answer: how is it so fast? Whoever knows the Flux models knows that they are pretty fast. What do I mean by fast? They are up to an order of magnitude faster than the competition. We see the Flux models here on the left side of each plot, both for image editing and for image generation — they are always on the left side — and for image generation, say, we are 25 times faster than GPT Image 1, for instance, while for image editing we are still up to 10 times faster than the slowest competing model. So how is that possible? The algorithm that powers this is called adversarial diffusion distillation. The goal is to bring down the number of numerical integration steps — if you remember, I said it was about 50 for trained models, as a rule of thumb — down to as little as four. In order to do this, we do two things. We take a pre-trained flow matching model and initialize two new networks from it. First the teacher, which you see here on the bottom part of the plot and which is just this model itself. Then we initialize another new model, the student. The student should learn to generate images that are as good as the teacher's images, but in only four steps. That's the goal, and here's how we do it. We start with an image, again encoded into the latent space, and then we use the student — visualized here on the top part — to generate an image from it in four steps. Then we arrive, here at the top right, at an image that, at the beginning of training, is very blurry and looks very unrealistic, and we want to improve these images as much as possible. How do we do that? We take this generated image from the student, push it again through the encoder, and do the same generation procedure with the teacher, but using 50 steps, so we arrive at a very faithful, nice-looking image. Then we train a distillation loss that compares the images of the student with the images of the teacher and pushes the student to follow the teacher distribution. This alone is not enough, which is why we add another loss on top of it, visualized in the bottom left part and initialized with a feature extractor called DINOv2. This is a discriminator loss, which is again a prior on human perception: it makes sure that the images the student creates are actually perceptually pleasing to us humans.
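Going back to the sequence-construction trick at the start of this passage, here is a toy numpy sketch of how a context image turns editing into "just another conditioning". All shapes, the 77-token text length, and the flattening scheme are arbitrary illustrations; the real model's tokenization, positional IDs, and double-stream blocks are not reproduced here.

```python
from typing import Optional

import numpy as np

rng = np.random.default_rng(0)

def latent_tokens(latent: np.ndarray) -> np.ndarray:
    """Flatten a latent feature map (h, w, c) into a token sequence (h*w, c)."""
    h, w, c = latent.shape
    return latent.reshape(h * w, c)

def build_sequence(text_tokens: np.ndarray,
                   target_latent: np.ndarray,
                   context_latent: Optional[np.ndarray] = None) -> np.ndarray:
    """Concatenate text tokens, target-image tokens, and (optionally) context-image
    tokens into one stream for the transformer. Editing = context image present;
    plain text-to-image = context image absent."""
    parts = [text_tokens, latent_tokens(target_latent)]
    if context_latent is not None:
        parts.append(latent_tokens(context_latent))
    return np.concatenate(parts, axis=0)

text = rng.normal(size=(77, 64))         # toy text embeddings
target = rng.normal(size=(32, 32, 64))   # toy latent of the image being generated
context = rng.normal(size=(32, 32, 64))  # toy latent of the image being edited

print(build_sequence(text, target).shape)           # text-to-image: (77 + 1024, 64)
print(build_sequence(text, target, context).shape)  # editing:       (77 + 2048, 64)
```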
So again, it produces really sharp edges in the student images and makes them look realistic in the end. This is super nice: with this algorithm we can get the number of numerical integration steps down from 50 to four, which is great, but it's very complicated. There are two encodings — we see an encoding operation in the top row and another in the bottom row — and likewise two decodings. So it's super tedious to train and very computationally heavy, and this brings us to the final algorithm we actually used to train the Flux model: latent adversarial diffusion distillation. You now know what latent generative modeling is, so I guess you know the motivation behind this: we just want to get rid of all the encoding and decoding and put the entire algorithm in the latent space. And this is what the latent adversarial diffusion distillation algorithm does: it does exactly the same thing as the previous algorithm, but applies it completely in latent space. For that we change two things. We replace this DINOv2 — I'll go back once so you can see it — this discriminator-loss feature extractor, with the teacher itself, so we use the pre-trained model as the prior to compute the adversarial loss, and we throw away the distillation loss completely. By doing that, we need much, much less compute and can speed up the algorithm a lot. So, in a nutshell, latent adversarial diffusion distillation is just adversarial diffusion distillation applied in the latent space, and with it we can bring the number of integration steps down from 50 to four, which is again the order-of-magnitude speedup we've seen in the plots — here we get a 12.5 times speedup. All right. Now you know why it's so fast. And as a final thing, I've brought a short demo so you can actually see how the model works. So let's maybe use this image. Here we have an image of my favorite football club, SC Freiburg, the local club in Freiburg, and we can now just play around with this motif. So we say: put this logo onto a t-shirt. Oops. All right, here we go. Hope the internet is okay. Yes. Cool. Then we wait a couple of seconds, and what we get is — live demo. Nice. Okay. Now we're warmed up; I think it should be faster now, this was too slow for Flux normally. So now we get this nice picture of the logo on a t-shirt. The nice thing is that I can take this image and iterate on it — this is why I think editing is super exciting. So now I can say: okay, this logo is too large, make the logo smaller and put it on the top right part of the t-shirt. Again, we submit, and here's what I get — again, four images. And I can just go on like this: let's make the t-shirt red. And next I want to do something more challenging: let's put this t-shirt onto a human and transport them into the wild, because so far we've done somewhat local image edits — let's do something more global. Oops. Oh — into the park. All right, I'm now generating two examples, I don't know what's happening here. Okay, the model also inferred what I wanted to do. That's great. So, nice, this one is also good. And one final edit, maybe.
So this was a global transformation, which is nice. And now, as an example of style transfer, let's make this a watercolor painting as a final one. Come on. Last one. Ah, nice. Cool. And here we go. Now we could print it out and take it home, whatever. No — I think the general point is coming across: it's very powerful, very flexible. And I think that was it for the demo. I want to say thank you. We are hiring — come visit our booth, we're right next to the main stage here, and we have a lot of openings. So if you're interested in what we're doing, please apply, or visit our playground, which is where I basically just demonstrated the model. Thank you so much and have a great conference. >> What a great presentation. Thank you, Andy. Would you like to join me for a couple of questions? >> Absolutely. >> And by the way, for Zuckerberg, this is how we do a demo. >> Just saying. >> Thank you. This was live. >> All right. Please have a seat. I have to say I'm blown away by how fast the Flux models are. It's incredible if you've tried Flux before. >> Wow. Yeah. So everybody has noticed how fast it is, but how much faster could you actually make it? Do you have a theoretical speed you think you can reach? >> I think once we reach larger models we can actually get to one-step generation; I think that is really the goal. Flow matching models and diffusion models came with this numerical integration scheme, this sampling process that just takes a lot of steps, and our goal is to really get down to one step. Our Flux Schnell model, for instance, does four steps, and I think once we get to larger models we can actually go down to one step. And sure, the challenge, or the goal, in the end is real-time generation, and I think that is very feasible: getting better hardware, better algorithms, better optimization procedures. On Blackwell chips we can use lower precision, so we can do FP4 quantization of the models, and I think we will definitely get to real-time generation with flow matching models. Yes. >> Wow. >> Wow. So you're saying that by having bigger models we're going to make them faster? >> A bigger model can effectively represent a more complex transformation. In the end, as I showed, approximating this vector field is effectively a mapping from an input to an output, and the more parameters I add, the more expressive these functions are. So I can model much more complicated mappings, and that allows me to take fewer steps. You can look at it from a linearization perspective: say I have a curve and I want to approximate it with linear pieces — the more steps I use for this piecewise-linear approximation, the better my approximation gets. Mhm. >> And making the model more complex makes the function you're learning more nonlinear, so you can also capture things like curvature in what you're modeling in a single step, and by that you can model more complex curves in fewer steps. >> Wow. Okay.
And do you think that in the future people like me, who like to use Flux for image editing, will just drop Photoshop altogether and edit using these models? >> I don't know how you will be using these models — they could actually be getting integrated into Photoshop soon. So >> wow. >> Breaking news. >> I think you will certainly use them for image editing. Where you will use them, I cannot yet say, but I guess BFL will also make a good proposal for using the models on our platform soon. >> I see. Okay. So it's going to make these tools a lot better. >> Yes. And I'm 100% sure that >> these models will be the backbone of all the image editing we will see in the future. Yes. >> Wow. Incredible. Okay. So, yeah, you touched on the different models that you offer — Pro, Dev, and Schnell — and some of them are open weights. How do you balance your focus between building for open source and focusing on state of the art at the same time? >> Yeah, I think we've just seen it as a very nice value proposition to structure it like this: having state-of-the-art models in our API. These are just the best models for people who really want to get very fast results and don't want to deal with things like customization and fine-tuning, because, to be fair, that's not everyone's business and not everyone is interested in it. So that's always good for people who want to get to the most powerful models in as little time as possible. But then there is this huge customization need that a lot of real-world applications have. When I'm building a face-generation or face-editing app, say, I need a serious amount of fine-tuning to get very good at that very specific problem, and for that, open-weights models are perfect: everyone can use them, everyone can customize them or get help with customization, and they can tailor them >> to their specific use case. And I think this need for customization goes very well with an open-source approach, because effectively the whole community can work for you. You can just use all the insights that the open research community has >> yeah, investigated in the past, right? And by that, it gives you a huge tool stack that you, as someone who wants to customize a model, get when you build on our Flux platform. So I think we see this as a very complementary approach, which we are definitely also planning for future releases. >> Absolutely. Yeah. So for most people, they're going to just go to the state-of-the-art model, but for people who have specific problems, those open and open-source solutions might help them. >> Absolutely. And I cannot predict how this will continue in the future. I would not say that most people are going to go to closed models. I think it's just a distribution, and I see it as a constantly evolving >> and changing distribution. So we might end up in a future where everyone uses open models, >> or everyone uses closed models — I don't know — but I think, from what we see right now, playing on both fields is very important to us, and we just love to see what people are doing with our models. That's why, also from a personal standpoint, we cannot stop working on open models. >> Awesome. That was my final question.
Andy, thank you so much for this presentation. >> Thank you. >> Let's give it up for Andy. Thank you. All right. So, 2 p.m. is when we need to be back here for our next talk. Okay? So now we can wrap it up and go for lunch. Please enjoy, and see you at 2 p.m. Thank you. [Music] [Applause] [Music] Ladies and gentlemen, please join me in welcoming to the stage your MC for the AI Engineer Paris, developer experience engineer Ralph Chabri. [Applause] [Music] I always dreamt of doing that. My god, the timing was perfect. Hello everybody. How's it going? Great. I had so much fun this morning. We had Docker, Neo4j, we had Black Forest Labs, and you know what? The fun is not over, because our next speakers are going to be so good. Our next speaker actually spent nearly a decade at DeepMind, where he worked on super cool projects like AlphaGo, AlphaFold, and AlphaStar, and he also worked on Chinchilla, Retro, Gemma, and Gemini. Now he has co-founded, and is the CTO of, the H company. So please, without further ado, join me in welcoming to the stage Laurent, co-founder and CTO at the H company. [Applause] [Music] Hello everyone. Thanks, Ralph, for the intro. I'm Laurent, CTO and co-founder at H, and today I'm going to talk to you a little bit about our work — really our work, literally, what it is. Me, I like to start the day by saying to myself: I'm going to do one thing, one thing very well. So I'm going to do this slide deck, and then I open my email and I can see I have 289 emails — and I'm kind of a zero-inbox person, so it really disturbs me quite a lot. Then I open my pro email and I have 1,000 emails, and then I open my calendar, and there's a bunch of stuff, and I think we all do a lot of micro-tasks throughout the day. And this is what my desktop looks like: I just put everything in it, I think it's simpler, and eventually I just create a new folder called "old" and put everything in it — and "old" is already taken, so I have to name it "old four", for example. So there has to be a better way, and this is what we are trying to do at H. We're trying to build something at the intersection of agentic, UI automation, and models. Agentic is basically anything that has to do with autonomy, decision making, planning. UI automation is how you deal with software, but through the user interface — so it can be RPA or task execution. And models: we think models are important because, yes, you can build on big generalist models, and I will say more about that, but we think that's massively inefficient and we can do a lot better with a specialized model. So really, what we are building at H is computer-use agents: agents that can control a computer the same way a human would, through the graphical user interface. Basically, you take a screenshot and some context, you pass that to the agent, and the agent outputs actions in the same action space you would use: mouse clicks, scrolls, keystrokes. And the reason for that is that, right now, "agent" is a big word. You have many agent companies, and a lot of that is really RAG, or tool calls, MCP, APIs, and so on.
So that's plugging in an LLM, interfacing the LLM with strongly typed tools. But we think that's not going to work for the long tail of tasks I showed at the beginning. Not all software will have APIs or MCP servers soon, and I think there are two reasons for that. One is inertia: especially in the big enterprise world, things are slow. And then there are more adversarial reasons, like business models. A lot of legacy software has incentives for this not to happen, because they sell by the license, by the seat. If with one license you can do more, because you have an agent working through that license, then they lose money, so they have strong incentives to make that transition slower, and since they sometimes have near-monopolies, it's going to take a while for this transition, toward agents doing a lot more in the enterprise, to happen. There are also more positive reasons: there's actually a lot of intelligence that has already been put into the UI. The UI is an effective way to present information, and you can reuse that context, built by many developers, as context for your LLM. Also, if you want to assist humans, it helps to work in the same action space as they do, so that they can show the agent how they work and what they do, and the agent can show them and automate. Now, there's already RPA, robotic process automation, where you can automate things through the UI, through scripts and so on. But it's very heavy to deploy and it's brittle: if the workflow changes a little bit, or if there is ambiguity, it's not going to work. So those are the reasons why we need, and why we are building, computer use agents. Concretely, I can show you an example. The example is a task: we want to go to Google Flights, find the one-way business-class flight from Buenos Aires to Amsterdam on January 10, and get the details of the shortest flight. So there are a lot of weak constraints in this. Let's see what the agent does. If you can please enlarge the screen... yeah, so we can see. What happens is that you have the software, here a web browser, where we're going to navigate on Google Flights, and the agent is going to take actions. At every step we see the action: here it's clicking on the departure field; now it's writing the origin, Buenos Aires. Maybe I can speed it up a little, let's put it at 2x. The agent doesn't really know how the website works, what's going to happen if you click on something; it's really adapting in real time. At every step it outputs three things: some notes on what it thinks happened in the previous steps; some thoughts, what it wants to do now, so here it wants to select one-way; and then the action, click on the dropdown for round trip, because it wants to set it to one-way. Okay. So now we are coming to the departure date. The departure is really hard, because you have to navigate through a calendar, and every calendar on every website is different. And here we want January.
So we need to click several times on the next month until we finally see January, and then we can click on 10. If you wanted to script that in RPA, it would be a lot of work to express the loop and all of that. Now it's selecting the class, then it searches, it sees the flights, it finds the shortest one, it expands it, and yes, you could do that with RPA, but it would be a lot of work. Here it's just one prompt, basically. And this is our agent, Surfer H. Okay, maybe you can lower the screen, please. The way the agent works: you have the task, that's the prompt I showed at the beginning. Then you have the memory of the agent, and there are a few things in it: the task, the thoughts, the actions, the notes, and the previous screenshots. All of that goes through a policy model, a VLM, a visual language model, so it could be GPT-5 or Sonnet or another model. It outputs actions, and we have a few actions: refresh, go to, scroll, wait, write, and click. The agent can write some text into an element, or it can click on an element. Then we have a localizer model: when an action has an element target, there's a second call to a model that outputs the coordinates of that element; that's the localizer. Finally, when we produce the final answer, there's another call to a validator model, which checks whether the task has been correctly executed. If it has, we return the answer; otherwise we incorporate the feedback into the memory and we continue. Each time, the action is executed in the browser. For all of that, as I said, we can use any model, but we can also use our own models, our Holo models, which we open-sourced. They are really state of the art on this task of localization. I'll talk more about this later, but one question is: why would you use a specialized model? This is 2025, so everyone is just prompting a big model, putting in an API key, and shipping the product; that's what everyone does nowadays. So why bother trying to build another model that does one thing? The reason is really efficiency. If you look at efficiency, let's take chess as an example. In chess, you have GPT-5 or Grok that can play chess very well, around 1,600 Elo. That's probably stronger than most people in this room, stronger than me. It's quite remarkable that such a big model can play at this level without having been specialized for it. But compare this to a specialized agent and model, for example AlphaZero. AlphaZero plays at around 3,600 Elo, so 2,000 Elo more. That means GPT-5 is going to win one game out of every 100,000. And it's also a lot bigger: around one trillion parameters compared to 20 million parameters. One runs on something like a big pod of GB200s, a million-dollar supercomputer, and the other can basically run on a cheap gaming laptop. So all in all, GPT-5 can play chess, but it's maybe five billion times less efficient than AlphaZero at it, and something like 100,000 to 100 million times less cost-effective.
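For intuition, here is a quick back-of-the-envelope check of that arithmetic; the figures are simply the ones quoted in the talk (1,600 vs 3,600 Elo, roughly one trillion vs 20 million parameters), not new measurements.

```python
# Expected score of player A against player B under the Elo model:
# E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400))

def elo_win_probability(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a scores against a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

p = elo_win_probability(1600, 3600)
print(f"Win probability: {p:.2e}")            # ~1e-05, i.e. about one game in 100,000

size_ratio = 1e12 / 20e6                       # ~1T vs ~20M parameters -> 50,000x
print(f"Size ratio: {size_ratio:.0f}x")
print(f"Combined inefficiency: ~{size_ratio / p:.0e}")   # ~5e+09, "billions of times"
```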
So with chess you have many orders of magnitude of difference, and it really doesn't make sense economically to play chess with GPT-5. That's what we are trying to do here for computer use agents: we are trying to move the Pareto front of agent performance. This is WebVoyager. WebVoyager is a benchmark; the task I showed at the beginning comes from it. You have around 600 tasks, and we measure the success rate of the agent with different models: GPT-4, GPT-4.1, which is here, and this is when you use our model, which has been specialized for this. We currently have a modest 5x cost-effectiveness factor, but we are aiming at something closer to five billion, hopefully. This is the localization task I described: given a screenshot and an intent, here the intent is "click on departure date input", you want the model to predict the coordinates. That's what we worked on a lot, and for this particular task we also measured performance against model size. Here it's not just us being better; what I think is interesting is that you see what I call the specialization frontier. Every model below it is a generalist model: here we have Qwen, we have Sonnet, and GPT is actually much lower on this particular task. Above it are specialized models, all trained specifically for this: these are the UI-Venus models from Ant Group, which has a big team working just on this; this is UI-TARS from ByteDance, also a big team working on this problem; and this is our latest release from two weeks ago, Holo 1.5. So I think we're going to see a lot of these Pareto plots where specialized models really shine, and people start to move away from generalists because they find them not very cost-effective. This is how we train Holo. We initialize the model with Qwen 2.5-VL, which is probably the state of the art among open-source VLMs, visual language models; we need to give it a screenshot, so it has to be a VLM. We start from that, and then we do some fine-tuning on a mixture consisting of UI localization examples: we've scraped many different pieces of software and websites to get screenshot, intent, coordinate triples. We add other auxiliary tasks that we think are interesting, like question answering on tables and on UIs, and then a lot of data that consists of executions of our agent on many synthetic tasks. We have expanded WebVoyager to something like 100,000 synthetic tasks, run the agent on them, collected the successful trajectories, and we train on those successful trajectories, for web and for Android. That's phase one, supervised fine-tuning, and then we do a bit of GRPO, which is reinforcement learning. There, we train the model to optimize end to end for being successful at the task, not just for reproducing the training data. Right now we are doing it on UI localization tasks, but we hope to expand to many more tasks in the future.
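To make the Surfer H-style loop from the architecture slide concrete, here is a minimal sketch of the task-memory-policy-localizer-validator cycle described above. Every class, method, and field name is a hypothetical illustration for this write-up, not H's actual API.

```python
# Sketch of a computer-use agent loop: policy VLM proposes an action, a localizer
# turns element targets into coordinates, a validator checks the final answer.
from dataclasses import dataclass, field

@dataclass
class Memory:
    task: str
    thoughts: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)
    actions: list[dict] = field(default_factory=list)
    screenshots: list[bytes] = field(default_factory=list)

def run_agent(task, browser, policy_vlm, localizer_vlm, validator_vlm, max_steps=50):
    memory = Memory(task=task)
    for _ in range(max_steps):
        screenshot = browser.screenshot()
        memory.screenshots.append(screenshot)

        # Policy VLM: given task + memory + screenshot, produce a note, a thought,
        # and the next action (refresh, goto, scroll, wait, write, click, answer).
        step = policy_vlm.next_step(memory, screenshot)
        memory.notes.append(step.note)
        memory.thoughts.append(step.thought)

        if step.action == "answer":
            # Validator model: did we actually complete the task?
            verdict = validator_vlm.check(task, step.answer, screenshot)
            if verdict.success:
                return step.answer
            memory.notes.append(f"validator feedback: {verdict.feedback}")
            continue

        if step.target is not None:
            # Localizer model: "click on departure date input" -> pixel coordinates.
            step.x, step.y = localizer_vlm.locate(screenshot, step.target)

        browser.execute(step)            # click / scroll / write / etc. in the real UI
        memory.actions.append(vars(step))
    return None                          # gave up after max_steps
```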
So, why open? We could also keep the model to ourselves. These are, I think, all of the reasons why we want to do open-weights models. For the customer, it builds trust: they know what's in the agent, and since we are a startup, they don't want to be fully reliant on us. If we give them the model, they know they can run it if we go under. It's a good conversation starter: people may have seen the model or the benchmark, and they can try it before they buy it. I think it's good for the brand, because you don't have to trust me; you can download the weights and run the evaluation yourself, so the performance is verifiable. It's also good for us as consumers of the open tech stack: when we build open weights, we start from Qwen-VL in our case, we use existing training pipelines, maybe TRL from Hugging Face, and we can use existing inference stacks like vLLM. All of that stack is basically already done, so we don't have to reinvent the wheel; we can focus on where we add value. And for employees, I think it's a good driver as well: they get visibility along the way. It can be a long road for a startup to be successful, and this gives them intermediate milestones and visibility, and it helps with retention and attractiveness. So all in all, it's a good deal for everyone. What's next for us? In terms of research: we've done the web, and we want to extend to desktop and mobile. We want to generalize: before, we trained on WebVoyager and evaluated on WebVoyager, and because it's reinforcement learning it's kind of okay to train on the test set, but really it's not, so now we are moving away from that and training only on the big synthetic task mixture I described. We want to annotate with more humans in the loop: sometimes the agent is really stuck and you just need to show it what to do, so we are building tools for that. And then more reinforcement learning, to optimize end to end for task execution success, multi-turn and so on. In terms of platform, the demo I showed at the beginning, we're going to make it public soon as the Surfer showcase. We'll have a portal where you can create an account, get your API key, launch the agent, and run inference on the model. That's coming very soon. And with this, I thank you for your attention. I leave you with this quote, which is why we think it's important to help with some of these low-level tasks. Thank you. >> Thank you. Would you like to join me for a few questions? >> Yeah. >> Thank you so much. All right, please have a seat. Okay. Wow, what a great presentation, I really enjoyed it. Actually, I wrote down so many questions, but I'm going to try to keep it short. I really liked how you highlighted the limitations of APIs and MCP and contrasted them with computer-using agents. I was just wondering: what's the biggest challenge, in your opinion, in building computer use agents? >> I think one of the biggest challenges is deployment. There are many questions: do you run it locally? Do you run it on a virtual machine? What about credentials? It's a big question.
Yeah, we book a flight, but then what's next? Are we logged in? Are we going to just put the credit card number in the prompt? Probably not a great idea. So, yes, there are many questions around roles and permissions, and I think it's a great opportunity, because this is already solved for human interfaces. >> Yeah. And we can reuse some of that. >> Yeah. >> So if an agent needs to act on behalf of a human, what kind of data can it have access to, what kind of permissions are we allowing it to have? >> Yeah, it's really dependent on the deployment, and we are just getting started on the go-to-market side. But I think either it takes control of your laptop and you have your credentials there, or it may need to send you a confirmation before it places an order or sends an email. >> Yeah, I would like that too, if I'm informed. You spoke of go-to-market strategy, so can you tell us a little bit more about the business model that the H company has for this? >> Yeah. So we're going to have the Surfer showcase, but I don't think it's going to be something you pay for; you can just try it. We don't want to build a per-token business model; we think that's kind of a race to the bottom. So we're trying the forward-deployed-engineer model. We have Gautier, our CEO, who comes from Palantir, and he was very successful at doing this in his previous role. We'll try to charge by value: how much more can you do, and we take a fraction of that. Compared to, say, how many tokens — OpenAI is at something like $2 per million tokens — we're not going to make money like that, I think. >> Yeah, it's quite hard to understand how many tokens a computer use agent actually needs, right? So that makes sense to me. Well, this is it for me. Thank you so much, Laurent. >> Let's give it up for Laurent. >> Thank you so much. >> All right, so we're ready for our next speaker. This is going to be a speedrun, but I think it's going to be exciting. Our next speaker calls himself a technological humanist. He's built systems for NASA, brought medical informatics to Zambia, and today, at Neo4j, he's helping democratize graph databases. His motto: everything is connected. He's here to show us how generating data can be fun. So please join me in welcoming to the stage the generative AI lead for developer relations at Neo4j, Andreas Kollegger. [Applause] Thank you for that introduction. Hello everybody. Hello Paris. So, we've all probably done a little bit of vibe coding over the last year or so. I work at a database company, so I think a lot about: can you do vibe coding with data? Can you vibe with data? That's my talk. This is the snack version of a fuller talk that I'll share later. And actually, I loved the previous talk. The idea is that as soon as you go down the route of making any kind of agent, you run into all kinds of challenges that lead you toward building a multi-agent system. And you build a multi-agent system for a couple of reasons. You do it because the agents have too many responsibilities; they get confused.
They suffer context rot and semantic drift. So you break down the problem to make it simpler and have more agents, each focused on a different part of the problem. Along the way, because you've got multiple agents, you have the opportunity to use specialized models, the way we saw in the previous talk: specialized models outperform general models on the tasks they're specialized for. So as soon as you break things up, you can have models that do just one part of the puzzle. Okay, that's the quick setup. I've been building a multi-agent system to actually vibe with data, to help you build a knowledge graph, because I work for a graph database company. It starts with a top-level agent that interacts with the user and helps figure out what the user wants to get done. Then there are three specialized channels: one for taking in structured data, one for unstructured data, and then, once you've built a graph, one for retrieving information out of it, what's called GraphRAG from the graph perspective. I'm going to take you through a speedrun of what this looks like in action. I have a longer talk about the details of how you implement it; all the code is open source, and I'm happy to share that with you as well. Okay, here's the speedrun in video form. Imagine you want to build a bill-of-materials graph and you don't know how to do that. You turn to this agent and say, "Hey, I would like to create a bill of materials graph," and send that off. The first job of the agent is to figure out what you really mean, what you're trying to get done. It bounces around a bit to figure out which agent should pick up this thread, and realizes: okay, I don't quite understand what you're trying to get done here, let me ask some clarifying questions. So it asks: tell me more about what you're trying to do with the knowledge graph. And I tell it: great, I want to trace from the products, through the bill of materials, all the way down to the suppliers, so I can do root-cause analysis for product issues that are reported, say, in a website review. Okay. Human in the loop is important to me, so after the agent decides it understands what I said, it says: this is what I heard you say, is that correct? I approve it, and it moves on to the next stage. Having understood what I'm trying to get done, the next step is to find what data is available that might satisfy creating that kind of graph. It has tools to look through the file system for all the available files, it can grep through them and find their content, and then it makes a recommendation: here are the files I just found, some product files, some assembly files, and also some suppliers. It presents that to me again for approval. I say, sure, that looks great, and it goes on to the next agent. Now we know the user's intent and the data available; the third agent's job is to focus on what the data model should look like, given the data and the user's goal. It goes off and does some analysis, and the outcome is a proposal for what the graph could look like. Again, it's just a proposal; it doesn't actually do anything yet. It shows it to me, I can look through it and see that it looks correct, that it covers all the kinds of data I'd like to have inside the graph.
And here I'm carefully scrolling through, making sure all the nodes are there and that they're connected to each other in a good way, that I've got a well-connected graph. Again, I just go ahead and approve that part of the workflow. With that approved, we can finally hand over to the agents that are going to build the graph. The building part is where specialized agents, both agents and models, would really shine: for the initial ideation with the user you maybe want a generalist model, like an OpenAI model, but as soon as you get down to a specialized task, having specialized models would be amazing. Here we've gone all the way through: pulling in those data files, creating the graph, and it has even recommended some queries I can run right inside the interface. Now, instead of dealing with the interface, I'm going to pop over to the database side to see what the data actually looks like. Here I've done a match looking for the products inside the graph that was just created. This particular product is the Stockholm, I forget if it's a table or a chair, but it is made out of some assemblies, some different pieces; those pieces have parts, and those parts come from suppliers. All of that was created on the fly by this multi-agent system. That is my speedrun. If you'd love to hear the long version, come talk to me afterwards. I've got about 10 seconds left, so I can't take any questions. Thank you. [Applause] >> Thank you. Thank you, Andreas. Awesome. So, our next speaker leads developer experience and community at Hugging Face, where he's been championing open source, audio, and on-device machine learning. He's here to walk us through the state of open-source LLMs in 2025. Please join me in welcoming to the stage the head of developer experience and community at Hugging Face, VB. [Applause] Hello AI engineers. It's nice to be here, and thank you so much for tuning in for the talk. In the next 20 to 25 minutes or so, I'll present a report on the current state of open LLMs in 2025, and hopefully you'll be inspired to try them out by the end of the talk. Let's see. When we talk about open LLMs, one of the first questions that pops to mind is: how do they stack up against the likes of GPT-5, the likes of Claude Sonnet, and so on? Thankfully for us, there is an evaluation service, or rather a company, which has made it its mission to benchmark closed as well as open models against a standard set of evaluations. What you see on the screen is the average performance of recent LLMs on a bouquet of evals ranging from math and coding to scientific rigor and so on. The black bars are proprietary models, and the blue bars are open models. What you can see on the screen is that in the top ten there is a pretty sizable number of open models; I can see three or four here. And if you look at it in absolute scores, the open models are actually quite close to the closed models as well. You can see that GPT-5 with reasoning effort set to high is currently at an absolute score of 67, while GPT-OSS, an open model from OpenAI, is at 58.
And there's a small caveat here: while the open models are tracking proprietary models quite closely, the closed models you see on the screen come at a higher cost, as well as a higher token budget, meaning you have to reason for longer to get to the same place. So now that we've established that open models are pretty good, the next question is: are they easy to use? Surely you have to deal with model weights, with different runners, with setting up your virtual machine and so on, so surely this must be difficult? Well, let's look at it. There are typically three ways of working with open LLMs. Number one is a serverless API. This is similar to how you would interact with OpenAI, with Anthropic, with Gemini and so on: you take a five-line snippet, you give the model ID, you pass a prompt in, and you get a generation out. Second is a managed deployment, where you select the model weights, click a few buttons, and a provider — this could be, for example, Modal, Koyeb, Hugging Face and so on — packages those model weights up, and you get an endpoint out. Or, last but not least, since you have the model weights themselves, you can deploy them yourself. That means you set up your own virtual machine; it could be a VM right in your basement, or a GPU cluster on AWS, Google Cloud, or your favorite cloud provider, or at your company, and you set it up. Now let's look at these in a bit more detail. Let's talk about the serverless API. I'm pretty sure everyone in this room has dabbled with a code snippet like this: the boilerplate OpenAI chat completions snippet you would use to interact with GPT-4, GPT-4o, GPT-5 and the like. In 2025 we've adopted this as the standard, be it the chat completions API or the responses API: you can access pretty much any open LLM through pretty much the same standard, which means you have a familiar SDK. In this case you can swap GPT-5 with, for example, GPT-OSS, or one of the latest Qwen models, and so on. You can choose from a bouquet of routers: Hugging Face Inference Providers, OpenRouter, and many other providers like them, so there's a lot of optionality for you. And that's it: plug in your prompt and go build your applications. It's pretty much the same experience as the proprietary frontier models. You can also go one level down the stack, which is the managed deployment I mentioned. In this case you typically go to an LLM marketplace — Hugging Face Inference Endpoints, Lambda Labs, Prime Intellect, Modal; there's a lot of healthy competition out there, with providers each trying to give you the best service. You pick your own GPU.
This could be an H100, an A100, a T4, an L4, whatever works with your budget and your specific use case. And that's it: you just deploy. In most cases it's as the image on the slide shows: you pretty much just select a model, in this case Qwen Next, you hit deploy, and within two or three minutes you have a deployed endpoint you can then use. Last but not least is deploying these models yourself. You might want to do this if you want maximum control: you want to be sure about where your prompts go, and you want to make sure the model abides by your specific rules, and so on. In this case you typically start by choosing your inference engine. There's a huge variety of inference engines, namely vLLM, SGLang, TGI, and a lot of other, more niche engines to choose from. Once you've chosen an inference engine, you provision your own cloud GPU cluster — which could also just be a GPU lying in your basement — you set it up, and that's it: you have a private and secure deployment. Each of the three approaches has its own utility. If you want to move fast, you probably want to start with something serverless. If you want a bit more control, and you want to manage the intelligence you supply to your app without a dedicated DevOps or MLOps team, you might want to look at managed deployments. And if you want a lot of control over what goes through, full provenance of your prompts and your outputs, then you want something you deploy yourself. The best part is that, since you have the model weights, you have the optionality to choose; you can go through the stack, work through the competition, work through all the platforms and so on. But now you might be asking: why do I even need so many deployment options? This is a classic buy-versus-build argument. No matter where you are in the life cycle of your startup or business, you want to increase the optionality you have with respect to what you're providing. So if your startup or your application depends on LLMs, you always want a failover; you don't want to depend on just one provider. A case in point is the recent issues faced by Anthropic. No shade to Anthropic, I use Claude every day; if there's someone here from Anthropic, I love you guys. But this is a recent blog post from them about the timeline of issues they faced while serving Claude Sonnet, and, if I'm not mistaken, Opus models as well, and the issues were, specifically, that they had a problem with the context window and with the way they were routing the prompts.
What all of this led to is that you were calling the Anthropic models the same way you always did, but a small percentage of those requests were not being fulfilled the way they were supposed to be, meaning you were getting slightly lower-quality outputs. Given that you have no visibility into what's going wrong up the stack, this can cause quite a few issues: you just get degraded outputs, and, depending on the use case you're working on, that's exactly what you'd want to be able to diagnose and optimize. And of course we've all heard the lore: Sonnet is dumber during the day but becomes much smarter in the evening; when you're coding at night, Claude Sonnet is much nicer, and so on. A lot of this is placebo, but the fact that we don't have access to the model weights makes it much harder to narrow things down. As one of my favorite researchers, Andrej, says: not your weights, not your brain. Again, you want as much optionality as possible. So now that we've established that open LLMs are good enough, and that there are quite a lot of ways to use them, let's look at some recent trends in the open LLM landscape. We'll go through three. The first one: up until December of 2024, OpenAI o1 was the state-of-the-art reasoning model. When they released it, it was one of the first thinking models, in the sense that before it gets to the final response, the model contemplates the response itself, and because it contemplates, you get a higher-quality output. When they released the model, they decided to hide these chains of thought, this contemplation, from users. Fast forward to January of 2025: DeepSeek, one of the key LLM players from China, released DeepSeek R1. This was a very large, 685-billion-parameter model, MIT licensed, meaning you can use it for commercial use cases, for any of your own bespoke use cases. They released this model, which was competitive with OpenAI o1, with Gemini at that time, and with a lot of other models. The best part was that they opened the entire chain of thought for anyone to use, which means chain of thought was no longer something only closed labs had: any open participant, any startup, any business could get the same kind of higher-quality results from the same model. And they didn't stop there. They showed that, because the chain of thought is now public, you can distill it into smaller models. They proved this by distilling the R1 chains of thought into a smaller Qwen 8B LLM, and to their surprise, when they trained it specifically on math, this model beat an open model around 25 times its size: an 8B model beating a 235-billion-parameter model.
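The recipe behind that result is conceptually simple: collect the teacher's chain-of-thought traces and fine-tune the small model on them with ordinary supervised fine-tuning. Below is a minimal sketch of preparing such data; the field names, the `<think>` formatting, and the example trace are illustrative assumptions, and you would hand the resulting examples to whatever SFT trainer you already use (TRL, for instance).

```python
# Hypothetical sketch of preparing chain-of-thought distillation data.
# The chat schema and tags below are illustrative, not any specific library's format.

def build_distillation_example(question: str, chain_of_thought: str, answer: str) -> dict:
    """Pack a (question, reasoning, answer) triple into a chat-style SFT example."""
    return {
        "messages": [
            {"role": "user", "content": question},
            # The student is trained to reproduce the teacher's full reasoning trace,
            # not just the final answer.
            {"role": "assistant", "content": f"<think>\n{chain_of_thought}\n</think>\n{answer}"},
        ]
    }

# In practice these triples come from running the open reasoning model (the teacher)
# on your own prompts and keeping its traces; this one is a toy placeholder.
teacher_traces = [
    ("What is 17 * 24?",
     "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
     "408"),
]

sft_dataset = [build_distillation_example(q, cot, a) for q, cot, a in teacher_traces]
print(sft_dataset[0]["messages"][1]["content"])
```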
And this is where the reasoning revolution really started this year. Now reasoning has pretty much become standard across all models. In fact, in the recent GPT-5 series, all of the models have reasoning, with a reasoning effort setting: the models can run at low, medium, or high reasoning effort, and pretty much all open models have a thinking variant and a non-thinking variant as well. Next: in general, LLMs are as good as the amount of context you give them. The more context you give them, the better they'll be at the downstream task. Say you want to summarize a research paper: if you only provide the abstract to the LLM, the output will be quite suboptimal. But with the same LLM and the same prompt, if you provide the entire research paper, the summary will be much more grounded, much more on point. What this means is that LLMs require a ton of context to be effective. Back in 2024, your average open LLM had a really small context: 32K to 64K tokens was kind of the standard. You'd see Llama, Qwen, Gemma and so on pretty much ranging from 32K to 64K, which means they were good enough but not really useful for large-context tasks the way OpenAI or Anthropic models were. Fast forward to 2025: 128K and 256K have become the standard, which means all the use cases for which you previously had to depend on proprietary models, you can now handle with open models. And not just that: 1 million tokens of context is now feasible. This is also feasible with proprietary models, so it's not something new, but we have Qwen models capable of 1 million tokens of context, and Llama 4 is capable of up to 10 million. So you're really not bounded by context anymore; you can try a lot of experiments. And as we started being able to squeeze more information into these models, there was another interesting trend: the cost of these models also started decreasing. A lot of this came from optimizations at the software level, at the hardware level, as well as at the architecture level. But the fact of the matter is, if you have one euro or one US dollar, you get much more bang for your buck now than you did at the same time last year. This is again a chart showing the same thing from the good folks at Artificial Analysis. Heading to the third trend: back in 2024, open LLMs came with a steep learning curve. There were a lot of problems. To deploy a local LLM, you had to check whether the chat template was correct. A chat template is the way your model maps an input to an output.
So if you say, hey, summarize this paper, it takes that, formats it in a way the LLM understands, and then you get an output. These templates were often malformed; there weren't any real standards. Tool calls were very difficult. We had model precision issues: essentially, we didn't know what precision the model was trained in versus what precision it was deployed in. There were quite a lot of issues like this, and of course there were latency issues and memory requirements on top. Fast forward to 2025: we now have standards for chat templates; everyone has defaulted to the ChatML format. Four-bit and eight-bit quantization have become first-class citizens. In fact, all the new LLMs you see are pretty much FP8-first, which means you don't need as much VRAM on your GPU to use them, and since it's native, you don't really lose any performance. And with the recent OpenAI GPT-OSS, we've seen that 4-bit is also becoming a kind of standard. But surely there has to be a catch, right? All of this is too good to be true. And in some cases, maybe it is. Let's look at where proprietary models still win today. First of all, general reasoning. What we've seen is that open models are very good at certain specific tasks: you'll find open models that are really good at, for example, tool calling, really good at math, at science, at coding and so on. But there's no single model that's good at everything, which is what proprietary models offer, similar to Anthropic, OpenAI and so on. So there's still a hill to climb for general-purpose reasoning. End-to-end multimodal is also quite superior in proprietary models, more specifically, in this case, OpenAI. OpenAI has a huge margin when it comes to GPT Realtime, GPT-4o, and advanced voice mode. Just the fact that you can have an actual five-minute chat with advanced voice mode, tell it whether you want it to go fast or slow, ask it to teach you French or German, whatever it may be: this is something we haven't really gotten to in the open ecosystem yet. Last but not least, proprietary models have very nice, well-defined safety and jailbreak scaffolding. With open models you often have to define this yourself; you have to make sure they are covered against all sorts of potential jailbreaks, and this is something proprietary models have largely mastered at this point. So if we were to summarize this in one slide, the playbook for simply trying open models is: pick a simple project, any project; swap the proprietary model with an open model; evaluate on the same tests you've been evaluating on; swap in another model; tune your prompt; and let it rip, pretty much.
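To make that swap step concrete, here is a minimal sketch using the OpenAI Python SDK pointed at an OpenAI-compatible router. The base URL, the environment variable, and the model ID are example values to adapt to your provider, not a recommendation of a specific service.

```python
# The "swap" from the playbook: only the base_url, the API key, and the model ID change.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",   # example: an OpenAI-compatible router
    api_key=os.environ["HF_TOKEN"],                # instead of OPENAI_API_KEY
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",                   # an open-weights model instead of "gpt-5"
    messages=[{"role": "user", "content": "Summarize this abstract in two sentences: ..."}],
)
print(response.choices[0].message.content)
```

Because the request shape is identical, the rest of the application — and, importantly, the evaluation harness — can stay exactly the same while you compare models.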
Last but not least: what's next, and what am I excited about from these trends? First, I'm excited about smaller and domain-specific models. Recently the Gemma team at Google DeepMind released Gemma 3 270M, a small multilingual LLM of around 270 million parameters that can run in your browser, on your devices and so on. These small domain models are, in my opinion, the way to go: you lower the cost and you increase access for a lot more people. Second is effort-based reasoning. This has started to become a thing, which OpenAI is doing: from the same model you simply define what kind of effort you want, low, medium, or high reasoning effort. This effectively turns your single model into three, and for the same deployment you get much more bang for your buck. Then better quantization schemes: FP4 becoming the norm, and 4-bit quantization becoming the go-to for deployments. And finally, sparse and faster. At this point we know all major frontier LLMs are MoE, mixture-of-experts, models; the sparser the MoE, the fewer the active parameters and the faster the inference, and we're seeing trends toward sparser models; I hope the community decides to double down on that. That's it. I would recommend: be close to the source, and always default to open. Thank you very much. [Applause] Thanks, VB. Please, let's give it up for VB. >> Thank you. >> And if you have any questions for him, I believe you're still around, right? >> Yeah, I'm around. >> Awesome. Thank you. All right. So our next speaker built machine learning infrastructure at Uber, Apple, and Adobe before she co-founded Arize AI. She's also been recognized on Forbes' 30 Under 30 list for her impact on AI. Today she'll show us why system prompts shouldn't stay static, and how agents can actually evolve their instructions in real-world environments. Please join me in welcoming to the stage the chief product officer at Arize AI, Aparna Dhinakaran. [Applause] Hey everyone, welcome, welcome, welcome. All right, today I'm going to talk to you all about prompt learning. Hopefully we have a good next 20 minutes together. A little bit about what I do: I'm one of the founders of Arize AI. Arize is one of the leaders in AI observability; we help teams go all the way from development to production. So we help teams actually trace their applications and evaluate them — we're going to talk a little bit about Swix's controversial statements about whether evals matter — and also help them develop and iterate using prompt iteration. So let's jump into it. This is a real post from Hacker News that I put on this slide, which I think is really emblematic of what people are feeling today. Somebody's asking: are there any real examples of AI agents doing work? Does anyone have an example, which I understand to be intelligent, that isn't just a glorified or rebranded workflow automation?
And I think you get a sense of this kind of skepticism from people who are building with agents today, because agents are just really brittle; it's really hard to get them to work in the real world. What we're starting to see is that there are common patterns that really good agents share, and when agents don't have them, they tend to be brittle. One of them is the system instructions: if they remain static and aren't consistently updated — and this is what we're going to dive into today — the agent doesn't actually learn from its environment. Another is planning: there are screenshots now of all the to-do lists inside of Cursor and inside of Claude, and planning that updates as it goes is becoming a really common pattern in agents. How they call tools, or the guidance around tools, is something the better agents we use do really well. And then context engineering is a whole domain of its own: when context isn't passed correctly, for example between agents in a handoff, that's where agents tend to end up being brittle. But today what we're actually going to talk about is system prompts. This is a really viral tweet that Andrej Karpathy posted in May, and I think he's hitting on something really interesting. What he's saying is that there's a major paradigm missing for LLM learning, and he's given it a name: system prompt learning. Pre-training is for knowledge, fine-tuning is for habitual behavior, but a lot of human learning actually feels like a change to the system prompt. You learn something from your environment, something doesn't go right, and you take a note for yourself: hey, next time I see this scenario, I'm going to act like this. And that note needs to go somewhere. If you put it into pre-training, that's a lot of work, a significant amount of effort. You can do it with fine-tuning, again a significant amount of work. But you actually have English feedback and an explanation that you can just put into the system prompt, and that is significantly higher-dimensional feedback than a scalar score, which is what you get with typical RL-style approaches, if many of you are familiar with those. So why not use the system prompt to pass in this kind of feedback, and make sure system prompts don't just stay static? This is Claude's system prompt that leaked on GitHub a couple of months ago: around 24,000 tokens, roughly 18,000 words. You can see it's pretty detailed in terms of all sorts of conditions and how Claude should behave. And the key takeaway is that this didn't happen in a single iteration. No one wrote this system prompt overnight and put it into production. It was meticulously shaped by collecting data, looking at where Claude didn't do well, and then using that to iterate on the system prompt. If there's anything you take away from what I've said so far, it's that system prompts are really key to building effective agents, and as agent builders, that is something you can actively shape.
And there are all sorts of prompt optimization approaches, which we'll talk through today, that you have in your toolkit. There's the older, more traditional RL-style approach: you have a scalar reward, and some gradient-descent-style algorithm uses it to update the system. It requires a lot of examples and is really expensive in this world. There's another approach called meta-prompting, where you're now actually updating the prompt itself, but you're still using a scalar score to identify what tweaks to make to the system prompt. And what we're going to talk about today is a newer approach, prompt learning, inspired by Karpathy, which uses the English feedback — not just the scalar scores — to improve the prompt. And I'm going to show you this on a type of agent that's really successful: coding agents. Just to recap what system prompt learning is: it takes the data — the inputs, the outputs, and also the explanations or annotations — plus the original prompt, passes all of that into a meta prompt, and you get a new prompt out. That's the theory; let's see how this actually works. We put it to the test, and I'll run you through the benchmarks we did on the Cline system prompt. For those of you who don't know Cline: it's one of the leading open-source coding agents, entirely open source, so you can go look at the system prompt today. We've heavily cut it on this slide; it's something like 30 pages long. The system prompt basically has the prompt itself and then a section where you can add rules — if you're familiar with Cursor rules, Cline also has its own rules that you can go in and add. Typically, out of the box, Cline's rules are empty, so we started with the system prompt with empty rules. The first thing we did was run an initial benchmark: how does the initial system prompt do on its own, without any modifications? We tested it on SWE-bench Lite, which has 300-plus software engineering problems, and we ran Cline in plan mode on it. Cline, for context, has both plan mode and act mode: plan mode is where it generates a plan, but it doesn't actually write and run the code on the problem. We're currently working on act-mode results, so hopefully I'll tweet about those soon. In plan mode, the results were okay, as you can see, not great: around 31% on SWE-bench Lite.
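As a rough sketch of that recap — run the agent, judge the outputs with English explanations, and ask a meta prompt to rewrite the rules — here is some illustrative Python. The helper functions run_agent, llm_judge, and call_llm are hypothetical stand-ins for your own agent harness and LLM client, and the meta-prompt wording is invented for this sketch.

```python
# Minimal sketch of a prompt-learning loop driven by English feedback.

META_PROMPT = """You maintain the rules section of an agent's system prompt.
Current rules:
{rules}

Here are recent tasks, the agent's outputs, and a judge's verdict with an
English explanation of what went wrong or right:
{feedback}

Rewrite the rules so the agent avoids these mistakes next time. Return only the rules."""

def prompt_learning(train_tasks, rules="", loops=5):
    for _ in range(loops):
        feedback = []
        for task in train_tasks:
            output = run_agent(task, rules)            # e.g. a coding agent in plan mode
            verdict = llm_judge(task, output)          # returns .correct and .explanation
            feedback.append(
                f"task: {task}\noutput: {output}\n"
                f"correct: {verdict.correct}\nwhy: {verdict.explanation}"
            )
        # The English explanations -- not just a scalar score -- drive the update.
        rules = call_llm(META_PROMPT.format(rules=rules, feedback="\n\n".join(feedback)))
    return rules
```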
If you then go in and look at where Cline actually fails, here are a couple of examples, just to build some intuition. The first is one of the problems from a library called Marshmallow, which converts complex data types. Cline was asked to fix a bug where the program would just crash if the input was None: can you handle that better? Cline did exactly that: if data is None, just return. But it fixed only that single case. The input can be more than a single value; it can be a list, it can be a complex data type, so the code has to check all those different scenarios before it can just return. So this was one scenario where Cline jumped to a very minimal fix without considering all the forms the input can take. Here's another one it initially failed at, in another repository, SymPy, a Python library for symbolic mathematics. Basically, if there was an operation like two times a matrix, but with a different operator, like an at sign, it would incorrectly behave like multiplication instead of failing. What it should do is raise some type of error or suggest some type of fix. In this scenario, Cline ignored the actual Python language contract about what the fallback behavior should be, and it wrote an ad hoc fix that didn't handle multiple scenarios. Again, a situation where the fix looked right, but when you dug deeper, it wasn't actually the right solution. So we decided to test prompt learning on this. We took Cline with the original system prompt it came with, ran it across the entire SWE-bench Lite set of problems, and it generated a whole suite of outputs. We then took those outputs, and since the SWE-bench test set has a golden dataset — the ground truth, with the actual PR and the actual test patch — we passed the solution Cline generated, along with the ground truth, to an LLM-as-a-judge to evaluate: is this the right solution? Did it generate a plan that would actually solve this problem? We wrote a whole template and passed those inputs in. And this is the key part: we didn't just ask for correct or incorrect, we asked it to give an explanation. Why was it wrong? Give me a reason why. This matters because, whether you do it with an LLM-as-a-judge or with human annotation, this is the English feedback Karpathy was talking about in his tweet, the higher-dimensional feedback an LLM can take and use to update its system prompt rules. So we took these explanations and passed all of this into the meta prompt. The meta prompt now has the original prompt we started with, which, remember, had no rules. We then passed in the data: what was the problem to solve, what was the solution Cline came up with, whether it got it right or wrong, along with the explanation. All of that went into the meta prompt. There's a lot to unpack here around how you manage the context window — will all of this fit, how many examples — hold that thought for a bit, but basically all of this goes inside the meta prompt, and we were able to generate a new system prompt. This is just a diff view, showing that in the old world the rules were empty.
In the new world, there are all these rules. I couldn't fit them all on a slide, but even with something like five to ten loops, there were close to 100 different rules, covering all these different use cases and errors that Cline would make, and all of them got added to the rules section. Fun fact: if you're using Cursor or Cline, this is something you can do today. So all of these go into the rules. You might be asking: okay, does this really work? What was the performance? What were the benchmarks? Before I get into that, I'll tease a little with the same problems we saw earlier, the Marshmallow and SymPy ones. We reran both with the new, updated system prompt, and this time it was able to generate a correct solution for each; each of them had a corresponding rule that had been added to the system prompt. Overall, with about 5 to 10 loops, we saw roughly a 15-point accuracy improvement on the test performance: from around 30% to around 45% accuracy after 10 loops. In case any of you are curious: we split SWE-bench Lite 50% train, 50% test, and we ran this across the roughly 300 software engineering problems; it took about 5 to 10 loops to see improvement. We also tested this on BBH, which is known for being a much more difficult set of tasks for language models, and BBH has different categories of problems. What we were really excited about — and we'll share more of these results publicly soon — is that on some of the harder categories, like salient translation error detection or snarks, tasks the model wasn't doing well on, we saw a pretty massive jump in performance after just five loops, with very minimal regressions in the other categories. So this is a case where just adding the annotations, adding where things actually didn't go well, produced some pretty incredible improvements. So, what are the takeaways for this group, for agent builders? We just ran this on what you might call the most successful type of agent out there, coding agents. The key takeaway we wanted to share is, first, that collecting those errors, and actually running evals or doing some kind of annotation on them, is really important, because you won't be able to understand what to fix if you're not collecting those examples. This is an example of one of the things we do, which is tracing the entire application. This is typically where you hear people talk about online evaluation versus offline evaluation. One of the things Swix was saying yesterday in his keynote was: should evals really be blockers before you ship something into production? I really don't think they should be. I think you should ship things; you should put them into production.
But what most people don't talk about is that online evals, actually running evals in production on top of your data, are probably even more important, because now you have the data, the traces, the logs you can use to identify what's working and what's not. And that matters more because it's your own data. You're not writing some BS tests to use as a blocker before you roll out; you're evaluating your traces, and then you trust those traces to feed back into iteration, just like we did when we grabbed those annotations and fed them back into prompt iteration. While we were iterating, we were running those same evals. This is just an experimentation view where we tested the system prompt changes, tested which rules were better, and used that almost like unit testing before we deployed into production. So when people tell you to run evals offline, to run them before you deploy, the way we think about it is that those are unit tests. Unit tests are important, but getting visibility into real data and real production systems is often more important. Some of you might ask: did prompt learning only work on coding agents, or is it useful in other domains? Prompt learning is something we've been using across a lot of different domains, with a ton of our customers and use cases, even with our own agent. This is another set of results we've published, around structured JSON webpage generation, so more structured outputs. After five loops we saw significant increases in accuracy. What's interesting here is that as the rule set grew, accuracy didn't necessarily keep going up, so how many rules you pass in is something you need to iterate and tweak for your own use case. This is another use case, support query classification, so think of those customer support bot type agents. Here the chart runs vertically, from one loop down to five loops, and we saw about an 8 to 9% accuracy increase as we added loops. So the big takeaway, and I think we're going to see a lot more people talk about this this year, is how you consistently have the prompts learn from the environment. Some of these updates will be human in the loop, like I showed you, with a human going and updating those system prompts. But I actually think a lot of these updates are going to be totally automated: you collect data, you run evals, you run a prompt optimization approach, you update your system prompt. And this is going to be a big paradigm shift in how people build agents. Awesome. Well, thank you so much for your time. Any questions? >> Awesome. Thank you, Aparna. Would you like to join me for a few follow-up questions, please? >> Yeah, let's do it. >> Awesome. Okay, please. This was fascinating. I like that you used a software engineering agent to demonstrate it; that was pretty cool. So actually, yeah, I have a question regarding the context window, since you said >> Yeah, >> we're going to talk about it later.
>> Yeah. >> Um, yeah. So, how do you manage that? >> Okay, good question. There are maybe two parts to this question that people commonly ask. One is: won't the metaprompt, sorry, the new prompt that gets generated, just become super big, >> because you have all these rules being generated? Is it creating a new rule every time it sees some type of error? >> One of the things you can actually do is >> you have a metaprompt, and in the metaprompt you can pass in rules like: hey, keep the new prompt to a certain length, keep it to X number of words. So the metaprompt gives you some ability to force the new prompt to fit within certain parameters, so that you don't end up passing every one of those rules into your new prompt. The other question you might be asking is: okay, what if I have a lot of examples, are all of those going to fit inside the metaprompt, and what does context window management for the metaprompt actually look like? >> I think that one is the harder question, because the longer you have this application or agent running in production, >> right, >> the more failure scenarios you're going to collect. That's a good thing. >> But >> every time you do these identifications, we see teams going: okay, this is one specific scenario, this is another specific scenario. What we end up recommending is trying to find categories of problems: doing error analysis but also error categorization, so you don't have to pass every example into the metaprompt, but instead pass examples that are representative of a larger category of problems. >> Okay. But when you pass all that data, all that information, to the metaprompt, can we get to a place where we start seeing some degradation >> maybe in the generations? >> For sure. I think this is a classic ML issue: do you get to a point where you overtrain or overfit? You overfit the new prompt to be representative of just your >> errors, >> but it doesn't actually translate. So this is where it's important to make sure you're checking that train accuracy doesn't go down, comparing that with your test set, but then also keeping a completely blind set so you're not overfitting your metaprompt to it. >> And also LLMs are getting so good, right? We now have context windows of a million, two million tokens, and I think it's only going to get better with all the infrastructure we're building. Hopefully, >> yeah, >> that's what's going to happen. >> Um, all right. Okay. So, can you tell me more about the difference between prompt learning and frameworks like DSPy? How do they compare? >> Yeah, this is a good question. We love DSPy. I think DSPy is awesome, and they've put out a lot of different prompt optimization approaches. I don't think about this as us versus them; there are going to be more prompt optimization techniques out there. DSPy actually has a couple that use a slightly different strategy, like MIPRO and few-shot learning.
Some of those are a lot more, call it programmatic, in that they take a few examples, a few shots, and just pass those in, and those end up being what gets passed into your new prompt. Some of them still use scalar rewards. >> Mhm. >> So like I was explaining earlier, they don't actually use the English feedback; they just use the scalar reward to improve the prompt. So some of those strategies are just different from what we're proposing. But they did recently release a new approach called GEPA, which is philosophically similar to what we just talked about with prompt learning. We're currently running some benchmarks on GEPA to see how it performs, so we'll probably tweet out some of those results soon. >> Awesome. One final question for you. So, how do the prompt improvements fit into the context window of the metaprompt? >> Okay. Yeah. This is what we were talking about earlier around managing your context window. Again, I think there are two things that really matter. Identifying and doing error analysis is really important: looking at the traces of your application, finding those errors, and then, instead of passing every single error back into the metaprompt, picking examples that are representative of an entire category. That way you're being smart about what you pass to the metaprompt. >> All right. Well, thank you so much, Aparna. Thanks for your time and for the talk. >> Thanks everyone. >> Oh, and if you're interested in any of the things we talked about today, there's a workshop my colleague is running on offline evaluation and prompt iteration. It's a bit more hands-on, if you want to try it on your own agent. >> Awesome. Thank you. Let's give it up for our partner. Awesome. So, our next speaker was part of the founding team of Zenly, later acquired by Snap, and he's the founder of ZML, a high-performance AI inference technology aiming to push the limits of what's possible beyond GPUs. So, this is going to be fascinating. Today, he's introducing a breakthrough attention mechanism. Please join me in welcoming to the stage the founder of ZML, Steeve Morin. [Applause] [Music] Hey, thank you. Thank you everyone. I'm Steeve, and today I'm here to talk about some breakthrough technology we've been working on for LLMs, which, we hope, paves the way for unlimited context. First of all, we are "zee-M-L" or "zed-M-L", depending on which side of the Atlantic you're watching from. We are building a universal inference stack and engine: essentially the same code, the same binaries even, that runs any model on any chip. But that's not what I'm here to talk about today. We're going to do a bit of a deep dive into Llama, and into what we think is one of the most fundamental problems in LLMs today. So this is the Llama architecture. Usually, every time you generate a token, this happens.
That loop you see in the middle runs once for each layer. What we're interested in today is one mechanism that is fundamental to the way the transformer works, and it's called attention. There are a few operations per layer, but this one happens at every single layer, and it has a tiny problem. This is the mathematical formula of attention. You don't need to get into the details, but the problem we all have with attention, and ultimately what constrains everything we see, from the hardware to the reason there's a context window, is that the algorithm runs in quadratic complexity. For everything you add to the input, you get to do the square of it, and that makes us pretty sad and pretty limited in what we can do. But there's something about that formula: if you squint, you might see something of interest, and that something is the softmax. So, what's with the softmax? Well, this is the equation of the softmax. Long story short, you don't need to read it, but it's essentially the exponential of each element divided by the sum of the exponentials, and what that means is that tiny signals become a lot bigger. What's the end result? The end result is that this is an actual attention output for a set of tokens, and you can see it's mostly sparse. There's a lot of empty room, so in the end a lot of close-to-useless calculations get done every time you run that algorithm. So, as an ex backend engineer, I look at this, squint my eyes, and ask: is this a graph, really? Could we compute only the yellow dots and not the black dots? The GPU will compute everything. So, is this a graph? It turns out, yes, it very much is a graph. It's a graph in latent space, but still a graph. What's very cool about modeling this as a graph problem instead of pure raw matrix multiplication is that we can run it in log of n, and as you can see on that chart, log of n flattens as the context size grows. This ultimately paves the way for unlimited context, because it's not n squared anymore. Very nice, very promising, but there's a tiny problem. Because it's a graph, we need branching, and for branching, GPUs are close to useless. They are very bad at it. They can do it, but if you've worked on GPUs you might understand a bit more deeply why. CPUs, however, are good at it, so we end up in a situation where we can model this as a graph problem, but only on the CPU. So the question is: can we do it fast enough? Let's do some math real quick. This is a layer of Llama, and there are 32 of them in an 8B model. If we want to run at 100 tokens per second, that is roughly 300 microseconds per layer. That's our time budget if we want to achieve that throughput, which is roughly RTX 5090 territory.
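For reference, the two formulas the slides refer to are the standard scaled dot-product attention and softmax, and the per-layer budget follows from the numbers quoted in the talk (8B model, 32 layers, 100 tokens/s):

\[
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
\qquad
\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}
\]

\[
\frac{1}{100\ \text{tokens/s}} = 10\ \text{ms per token},
\qquad
\frac{10\ \text{ms}}{32\ \text{layers}} \approx 312\ \mu\text{s per layer} \;(\approx 300\ \mu\text{s}).
\]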
If we look at everything around attention, we see that about 200 microseconds of that budget is already taken by all the other operations, which leaves about 100 microseconds to do the actual attention calculation. That's not a lot of time. So the question becomes: do we have enough time to do the attention calculation, the vector and matrix multiplications and so on, as a graph on the CPU, all in under 100 microseconds? And actually we can, and the reason is that we only compute, essentially, the yellow dots. This is a trace of one whole layer, and it's pretty encouraging, because we spend only about 30 microseconds doing the actual calculation. Remember, we skip the black parts. But there's another side benefit: if we run attention on the CPU rather than the GPU, we get more GPU memory. Normally, if you run a model on a GPU, part of the GPU is dedicated to the model and part, sometimes the majority, to the KV cache. That creates a plethora of problems for people deploying LLMs in production: sticky routing and all of those things. But if we do attention on the CPU, in system memory, then the KV cache doesn't need to live on the GPU, which gives us more memory for the model and also makes the GPU completely stateless. So in this case the KV cache lives in system memory: the GPU sends data to the CPU, the CPU runs the calculations, updates the KV cache, and sends the attention output back to the GPU. But there's another catch: we then need a lot of CPU cores, roughly one per KV head. So the question is, how do we get those cores, say for a batch run? Is there a way to get them not locally on the machine, whose CPU is often underpowered and pretty much useless except for PCIe lanes, but somewhere else? Because remember, the attention and the KV cache are now completely separate from the GPU. So maybe we could get this CPU power over the network. We spend about 30 microseconds on the raw calculation, which leaves about 70 microseconds if we want to hit that throughput. So, is physics on our side? Not easy; let's do some calculations. Roughly, the payload we would need to send, the attention heads, KV heads and so on, is about 10 KB for an 8B model, per layer, per request. Let's be conservative and say we're running on a 10 gigabit network connection. No InfiniBand, no crazy stuff; a somewhat premium 10 gigabit, but nothing unattainable. Physics tells us the round trip of that data would take about 16 microseconds. So suddenly we might ask ourselves: are we ready to spend roughly 15% of the time budget to get essentially unlimited CPU cores, which we can provision and deprovision over the network? I think that's a pretty good deal, if you ask me. And so this is what it would look like.
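(As a sanity check on those numbers, assuming a roughly 10 KB payload each way and ignoring switching overhead:

\[
\frac{10\,\text{KB} \times 8\ \text{bits/byte}}{10\ \text{Gbit/s}}
= \frac{80{,}000\ \text{bits}}{10^{10}\ \text{bits/s}}
= 8\ \mu\text{s one way}
\;\Rightarrow\;
\approx 16\ \mu\text{s round trip}
\approx 15\text{–}16\%\ \text{of the}\ \sim\!100\ \mu\text{s budget.}
\]
)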
You do the dense calculations on the GPU, then extract the data, send it over the network, as UDP of course, because you want to be compatible, feed it to the other machine, compute the attention on the CPU, and send everything back. And you need to do this faster than the GPU would calculate it itself; the way we cheat is that we do much less calculation than the GPU would. So, let's try it, right? Let's do a ping and... oh man. Oh crap. So physics is on our side, but engineering obviously isn't. How do we square that circle? There must be something we can do, because, at least at this point, we're not fighting physics. Maybe we have a little trick up our sleeve, and it's time for kernel bypass. We're at the stage, thankfully or not so thankfully for us, where the Linux kernel might be too slow for what we're trying to do; there's too much latency. So this is where another technology enters the mix, called DPDK, for those who recognize it. What it allows is talking directly to the network card with very, very low latency; the driver is actually built inside the application itself. And that lets us get much closer to the theoretical latency. The theoretical latency in this case would be about 16 microseconds; we measure about 20. So that's still within budget, and this is exactly what we've been building. We call it attentiond. And the clicker has a slight lag. If you're up for it, I'm here to give you a demo. We're good. All right. Just so you know, this is what it's running on: what you see is a machine with GPUs and a machine without. All the attention is running on the machine without the GPU, so nothing is missing. So let's crack this one. Can you see it? Nope. But it's fine, because I can do this and that, right? Okay. All right. So, of course, this is a video, because, demo gods, right? What you see on the left is attentiond running on the left machine; you see all the cores at 100%. What you see on the right is a standard Llama model running standard inference; nothing special is happening on the model side. They're communicating over the network, and I'll let you see how it performs. So we're loading the weights, and then off it goes. But oh, where's my... I've lost the QuickTime window. Okay, but we're not done. I think you're in luck, because maybe we can do a little bit more than this. You see, the KV cache is not on the GPU anymore, which leaves us more room. So what if we could fill the entire GPU, with no room left for the KV cache, only the model, completely stateless on the GPU? This is what we have for you: a live demo I'm actually going to do as we speak, in which we run a 32B model in FP8 on a 32 GB GPU, which, as far as we know, has never been done, because it's useless in the real world, except if you have this. All right. Oh, I have the feedback here. Good. So, are we live? Okay. Hopefully the Wi-Fi is with us. I'm going to run attentiond in the bottom terminal. Yeah, it's live. And in the top terminal, I'm going to run a 32B model, Qwen 32B, on a 32 GB GPU.
Just so you know, it's a 32 GB GPU, and that model just barely fits: when it's done loading, there are 20 megabytes remaining on the GPU, so essentially no headroom. Let's try it. I'm going to tell it to write an MLP CUDA kernel. Hopefully it doesn't crash. It crashed. That's how you know it's a real demo, right? And there we go. So, as far as we know, this is the first time this has ever been done; that model isn't supposed to run. Thank you. And just so you know it's not a fake, I'm going to kill attentiond. And yes, generation stops, because it's now shooting packets into the void. All right, let me go back. So, the demo gods were with us today. Pretty good. Thank you very much. That is us. Be sure to check out our repo; this is open source technology, and we would love to see you be part of the community. Thank you. >> Amazing talk and demo. Would you like to join me for a few questions? >> All right. >> Okay. >> Um, you can have a >> Oh, water. Thank you. Yeah. >> It's surprisingly hot. >> I know, right? Especially when you're on a live demo, >> right? >> and you don't know if it's going to crash. Yes. >> All right. So, I'm going to start with the elephant in the room. What is written on your t-shirt? >> Ah. So, for those like me and our team who are terminally online, this is a reference to Zuck. >> Okay. >> He had a shirt like this, but his said essentially "Zuck or nothing." So I'm like, well, no, this is us: ZML or nothing. >> This is ZML now, not Zuck. >> All right. Okay. So, you touched on branching and said GPUs are bad at branching. Can you explain that for the rest of us who don't understand? >> Yeah. Well, GPUs are very good at parallel processing, and the reason is they operate in a mode called SIMT, single instruction, multiple threads. We could go into the nitty-gritty details, because there are warps and so on, but long story short, when there's a branch in GPU code, the cores executing one side of the branch wait for the other side to finish. >> Yeah. >> So essentially, say you take modulo two and you want half the GPU to work on odd numbers and the other half on even numbers, and you put an if in the middle: >> half of the GPU will always be asleep waiting for the other side, because the instruction sequence is common to the whole group of processing elements. So GPUs are really bad at this, which is why people spend a lot of time on clever tricks. But they're very good if you're doing, say, image processing, because it's always the same thing for every pixel, or matrix multiplication for that matter. >> All right. Okay. So you showed this, and you spoke about attention. I'm wondering whether this is specific to the transformer architecture, or can we imagine other architectures also being compatible? >> It's at least related. The transformer architecture uses attention. There are other architectures, and everybody's racing against that n squared. >> N squared is the reason we have HBM. It's the reason everyone is trying to shrink data sizes, with quantization and so on.
So it's part of the transformer architecture, at least attention, but not only: sometimes you see assemblies of common architectures, and there are also linear architectures, Mamba for instance. But for now, and for a good while, everybody's been trying to unseat it and it hasn't been unseated. >> Yeah, transformers are here to stay, at least until next time. Famous last words, right? But I had one more question for you, if that's all right. Okay. So, where do you see ZML going after this? What would you like the community to do with it? Obviously it's open source; I see you have a GitHub link here. What do you think the next step is for it >> as a community? I mean, what we're about is being very, very radical about inference, at essentially every layer of the stack: from single-digit-microsecond stuff to throughput to loading models; we have benchmarks in which we load an entire model in about one second from SSD. So we're very radical about very high performance, very low level, and about building the tools to achieve that. I mean, number one, inference is way too expensive. But also, we haven't been compute-bound in a long time, >> and so what we're trying to push with ZML is: how can we build an ecosystem, a software stack, a product line, entirely built around this thesis of universal chip support, very low latency, everything-is-a-hot-path. Because that attention mechanism we demoed, you could in theory implement it in PyTorch, but it would run at maybe a token per second, >> because those 30 microseconds would become milliseconds. At some point, fighting latency is very hard, and that's what we're about, from the ground up. And as a community, yes, we have a framework that is open source and we want people to use it. There are rough edges; we are the first users of our own framework, so bear with us. It's constantly changing, but it's improving very, very fast, and we have some cool demos, with many more incoming. >> Awesome. So, pushing the boundaries of physics and trying to get physics on your side. >> Staying compatible, but staying closer, I would say, to physics. >> Awesome. Well, thank you so much, Steeve. That was fascinating. >> Thanks, man. >> All right. Thank you. >> Let's give it up for Steeve one more time, please. All right. Okay, so we're back at 4:30. Now it's time for a well-deserved break. We still have coffee offered by our friends from Tinfoil, so I'll see you back here at 4:30. Thank you. [Music] [Applause] [Music] Ladies and gentlemen, please join me in welcoming to the stage your MC for AI Engineer Paris, developer experience engineer, Ralph Jabri. [Music] [Applause] [Music] Yes. All right. Welcome back to the main stage. We still have three amazing speakers left, so I hope you're going to enjoy this. I know our speakers have prepared a lot for this. Up next is a developer relations engineer at LlamaIndex, where she helps developers build production-ready agentic applications. Today she's here to show us, realistically, the abstractions needed to build an alternative to NotebookLM. So you've got to pay attention to this.
Please join me in welcoming to the stage developer relations engineer at LlamaIndex, Tuana Çelik. [Music] Hello everyone. It's great to be here at the first AI Engineer in Europe. Let me just make sure this screen is on. All right. So, hopefully I can get my slides up. I'm going to be talking about building a NotebookLM alternative, fully open source, and hopefully I can inspire some of you to try it out. Without further ado, this is me. My name is Tuana, although you just had an introduction, and I have been with LlamaIndex since May. So, why NotebookLM? Well, NotebookLM is pretty cool. If you haven't used NotebookLM yet, I'm going to show you quickly what it's all about, and this whole talk is about, realistically, what abstractions we need to actually build an alternative to it. So this is NotebookLM. When you start a notebook in NotebookLM, you're greeted by a page asking you to drag and drop a file; you can also link it to your Google workspace, provide URLs or YouTube links, or simply copy and paste content in. And what you get out of it is actually a lot. Here you see that the other day I provided a document, an IKEA guide to building a kitchen, and you get a summary here. You can do some question answering, and much more. On the right-hand side you see you can get an audio overview, flashcards for an FAQ, even a generated video, and, one of our favorites, you can create a mind map out of the whole context present in your document, which is a great way to abstract over all of the information and make it easier to think about. So we set ourselves a challenge, and we decided to make our lives a bit difficult with these constraints, because we wanted to create an alternative that does the original justice. These are the three main things we focused on. First, we wanted to handle complex documents, and by complex documents I mean documents with tables, images, what I love to call a layout bonanza, because not every document has a uniform layout. Second, and I'll show you later where the real pain was, we wanted it to be reusable, especially the functionality and tools we were building. This is an open-source project; we wanted you to be able to use any of those tools and functionality separately, without having to adopt our whole project. And finally, and you'll see why this is important, we wanted some level of control over the flow, which is where things start to differ from what you might think of as classic agents, where the agent gets all the agency to decide what to run and when. We wanted a bit more control over the process. And it so happens that I work at a company that provides many tools and products that make this project easy to build. So I'm going to talk about LlamaIndex very briefly. We have two main sets of products. A lot of you, I hope, know about LlamaIndex, the open source framework that lets you build your own agents and design your own logic; that is one layer. But we also have LlamaCloud.
And LlamaCloud has quite a few products now; it's growing every day. It provides tools and products that let you parse complex documents and extract structured information out of them. You can also use LlamaCloud as a managed vector store for chunking and embedding your own documents. So great, we have some ingredients to start with. The two main LlamaCloud products we started with are LlamaParse and LlamaExtract. One thing we wanted is for users of what we're now going to call NotebookLlama to be able to upload documents of any shape or form. LlamaParse is super useful here: we don't worry about tables, images, or layout differences at all. And then LlamaExtract. LlamaExtract is my favorite of the bunch; I love working with it. A lot of the time we see people use LlamaExtract in situations that are probably more boring than what we're talking about today, for example long, complex financial documents where you only care about a specific set of information and nothing else. Here we have the SEC filing by Nvidia, and we want to go from that complex PDF to just a few key components. You can do that in a couple of ways. You can select one of the predefined schemas we have in LlamaCloud, or, if you're like me and prefer being in code, you can do the following, where you have a Pydantic schema. What's important here is that you're not only defining the schema, so the data types you're after, but also providing, for a model of your choosing, a description of what it should be looking for within the document. That description is used to go and extract the relevant information. Then you create what we call an extraction agent, either within LlamaCloud or in code; you deploy an extraction agent that can be reused over and over again. Here you see my extraction agent for SEC filings. Now, what we noticed is that we could use our very own LlamaExtract to kick off the creation of the initial notebook in NotebookLlama, where, like you saw in the original NotebookLM, you have the FAQ, the mind map, the summary, and so on ready to go. We decided to use LlamaExtract as the way to start the notebook up, and you'll notice we've defined a schema asking for a summary of the document, highlights, so bullet-point highlights of the content, and questions and answers. So that's step one. The next thing we decided to do was use our open source framework, specifically Workflows, because that's how we can define our business scope, what the agent is and isn't able to do, and architect the pathway our application takes without relying on an LLM to make that decision in the first place. We'll see how our workflow is built a bit later, but this is the point where I want to shout out Clelia, because this is really her brainchild. This is my colleague Clelia; you can follow her on LinkedIn if you like.
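As an illustration of that extraction step, here is a minimal sketch of what such a schema and extraction agent might look like in code. The field names are made up for this example, and the LlamaExtract client calls reflect my understanding of the llama-cloud-services Python package; treat the exact import path and method names as assumptions and check the LlamaIndex docs, since this is not the actual NotebookLlama source.

# Hypothetical sketch of a LlamaExtract-style schema for seeding the notebook.
# Field names are illustrative; the client API shown is an assumption.
from pydantic import BaseModel, Field
from llama_cloud_services import LlamaExtract  # assumed import path

class NotebookSeed(BaseModel):
    summary: str = Field(description="A concise summary of the whole document.")
    highlights: list[str] = Field(description="Bullet-point highlights of the content.")
    questions_and_answers: list[str] = Field(
        description="FAQ-style question/answer pairs grounded in the document."
    )

extractor = LlamaExtract()  # reads LLAMA_CLOUD_API_KEY from the environment
agent = extractor.create_agent(name="notebook-seed", data_schema=NotebookSeed)
result = agent.extract("ikea_kitchen_guide.pdf")
print(result.data)  # structured output matching NotebookSeed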
She's our open source engineer, and she put a lot of effort into NotebookLlama. So I hope you like it, and I hope you decide to contribute to it as well. So, what does NotebookLlama do? I want to point out that I'm definitely not presenting a finalized, polished product; of course you can go ahead and use it, but our aim is that this is a start. If you want to contribute or you see something missing, it's obviously not a complete NotebookLM replica, so please feel free to contribute. We do a few things. We have mind map generation; that's the first thing we implemented. We can ask questions and do QA over files; we'll look at that in a bit. We can extract tables and images. We have podcast generation, and we've very recently added observability to the product. And now I'm going to quickly show you what we can do with it. All right. I've decided not to run this entirely live because I don't trust the Wi-Fi, so this is something I recorded just before. I'll increase this a bit. Here you see me uploading the IKEA kitchen guide. Once it's uploaded, very similar to NotebookLM, we get a summary, bullet points, an FAQ, and a mind map, which I have to be careful scrolling over because it moves around a lot. Our mind map does not look as sleek as NotebookLM's, but nevertheless, we have it. You also have document management, where you can look at previously uploaded documents. This is the same one, so it's not very interesting; I've tried the IKEA kitchen guide many times. And we also have document chat. Here I'm asking: what are the main steps I should take to install countertops in my new kitchen? What I get is a fairly in-depth answer with citations, and I can look into more detail about what the sources are and why it's claiming this is the best way to install countertops. Now, before I switch back to the presentation, one quick thing I want to point out, and let's not worry about the PDF loading: here we see the extraction results for the IKEA kitchen guide. You saw earlier the schema I defined; the schema is right here, and here is the result. You'll notice these are the exact results you saw in the NotebookLlama UI. This is how we set up the notebook: we get the extracted summary, highlighted bullet points, and questions and answers, and that is basically our entry point into NotebookLlama. With that, let's switch back and go on. All right. I'm going to focus on explaining two main components of NotebookLlama. Let's start with the homepage, where we load the file and generate the initial notebook with the mind map, the bullet points, and so on. How does that look? We want to end up with the image on the right-hand side: we start with a PDF, then we generate some sort of mind map, and then we generate the whole notebook with the summary, bullet points, and so on.
By thinking about it this way, we've already described to ourselves exactly what the logical flow should ideally be. How do we do that? In LlamaIndex, we have an abstraction called Workflows, a class you can extend to create any workflow. A workflow really consists of two main things: there are methods called steps, and steps expect events and can also emit events. In this case, you see a single-path system with three steps, each expecting a certain event. The cool thing is we can define our own events completely from scratch. So we defined three events. We defined our own custom start event, which we call the file input event; it's simply a file name. We want to end up with a stop event, again our own custom stop event, which we've called the notebook output event; you'll notice the notebook output event carries the mind map, summary, highlights, and so on. And in between we have a mind map generation event. This is again custom code by my colleague Clelia, where she used the summary and highlights to generate a mind map from scratch. Now that we have the steps, we have to put them in a logical order. Again, we define our own custom steps, and the first step is one we call extract file data. It expects the file input event and can actually branch: we account for failures, so maybe a mind map wasn't generated for some reason, in which case we return the notebook output event, an empty notebook with a message indicating why the failure happened. If we're successful, we return the mind map creation event. Next we have a step called generate mind map, and this is where, in a successful scenario, we generate the mind map and can then create the notebook output event. One thing you may have noticed is that most of the logic lives inside MCP tool calls, and the reason is the challenge we set ourselves in the first place: we wanted any custom logic we built into this application to go into an MCP server and be provided as MCP tools. So if you like the functionality you see and want to extend the repository in some way, you can use all of those tools in isolation, or go and use the mind map generator with, say, Claude. That's why we put all of the main functionality into MCP tools. So let's have a look at our first tool, which we call the process file tool. The most important thing here is the description. If you've been using MCP tools and servers with your MCP clients, you'll know the description is pretty important, because it's the one thing the LLM uses to decide whether it's time to run that tool or not. We've described it as: this tool is useful to process files and produce summaries, question answers, and highlights. What's going on here is, again, you might notice there's an extraction agent. This is the point in our workflow where we call our extraction agent to return all of the highlights and FAQs. We already have the extraction agent deployed in LlamaCloud, so we don't have to worry about that; when we hit this point in the workflow, we call that extraction agent and wait for a structured output.
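For a rough feel of what such a workflow looks like in code, here is a minimal sketch using the llama_index Workflows API as I understand it. The event and step names mirror the talk but are simplified, the extraction and mind map logic is stubbed out, and this is not the actual NotebookLlama source.

# Minimal sketch of a two-step LlamaIndex workflow, assuming llama-index-core.
from llama_index.core.workflow import (
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)

class MindMapCreationEvent(Event):
    summary: str
    highlights: list[str]

class NotebookWorkflow(Workflow):
    @step
    async def extract_file_data(
        self, ev: StartEvent
    ) -> MindMapCreationEvent | StopEvent:
        # In the real project this calls the deployed extraction agent; on
        # failure it would return a StopEvent carrying an "empty notebook".
        summary, highlights = "placeholder summary", ["placeholder highlight"]
        return MindMapCreationEvent(summary=summary, highlights=highlights)

    @step
    async def generate_mind_map(self, ev: MindMapCreationEvent) -> StopEvent:
        # Build the mind map from the summary and highlights (logic omitted).
        notebook = {
            "summary": ev.summary,
            "highlights": ev.highlights,
            "mind_map": "placeholder mind map",
        }
        return StopEvent(result=notebook)

# Usage, inside an async context:
#   result = await NotebookWorkflow(timeout=60).run(file_name="guide.pdf")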
Once we have that, we're done and we can start generating the initial notebook. The next thing is the document QA functionality you saw. We manage this with LlamaCloud indexes as our database, our vector store if you will, and we also make use of claim verification. LlamaCloud indexes can be connected to your own vector stores, and once your data is there, you decide, at the RAG step, whether you want to do retrieval augmented generation, claim verification, and so on. This is where we again picked our LlamaIndex open source framework, because it lets you switch models easily, or build this into a completely different workflow if you wish. So how does that look? Again, we've wrapped this in an MCP tool, described very simply as "query a LlamaCloud index", but you'll notice we're pulling in a query engine called the citation query engine from the LlamaIndex framework. The LLM is again switchable, and the retriever is our LlamaCloud index. This tool runs whenever there's a question from a user, and we return the answer to the user in the UI. All right. Finally, I want to talk about something we've been working on very recently: if you already have an MCP server running with tools and you want to use preconfigured workflow agents with LlamaIndex, you can bring those tools in like any other tool, for example into a Python agent or a function agent. So you don't have to define the tools and functions yourself; from now on, if there's an MCP tool out there you want to use, you can use it with a LlamaIndex workflow or a predefined agent. The same is true for all of the LlamaCloud tools we mentioned throughout this presentation, including LlamaExtract: any extraction agent you create with LlamaCloud can run as its own individual MCP tool. The same goes for indexes: any index you have in LlamaCloud can run as an MCP tool, described with your own custom definition. And that is a video of me explaining the whole process. This is again an open source server you can run yourself. And with that, the whole point of this presentation was hopefully that you'll try out NotebookLlama. I hope you contribute to it, and it's also a great way to discover basically all of the products we have with LlamaIndex and LlamaCloud. So with that, thank you very much. >> And I can take questions. >> Thank you, Tuana. Would you like to join me for a couple of questions, please? Thank you so much for the presentation, by the way. Wow, very nice slides, very colorful. I love it. >> Thank you. Our logo changed a bit too, so it's a bit more colorful now. >> I see that. All right. Question: what's the best framework for building agents? >> Ooh, spicy question. I love to give this answer; everyone calls it a very political answer, but I think it's really true. They're all different means to the same end, if I'm saying that correctly in English. I think it really depends on the type of developer experience you're after. Some frameworks are a lot faster at integrating the latest research. Some frameworks are a lot more stable.
Some frameworks do a great job of having observability and tracing built in, and some focus on other, more AI-specific functionality. So that's my political answer. >> Awesome. Hope you enjoy. >> LlamaIndex, guys, if you care about developer experience, from what I'm hearing. >> I'm being truthful here. I'm in DevRel, so there you go. That's the answer. But LlamaIndex. >> Awesome. Yeah, you mentioned you're in DevRel, so you meet a lot of LlamaIndex users as well. What are some of the cool things you see being built? >> Oh wow, that is a good question. We have two very distinct groups of LlamaIndex users. I will say LlamaIndex users seem to be a lot more creative; it is an open source product, so I guess that's expected. LlamaCloud users are usually doing this for business purposes, maybe enterprise customers, but with very fun projects too. Okay, that's going to be a tough one. We had one specific use case, a very difficult situation, where a user was trying to use LlamaCloud to parse incredibly complex architectural charts, I think it was a system grid or something. That one always sticks out to me; it's a very difficult problem to solve as well. >> Wow. Awesome. One final question. >> Yes. >> In the architecture overview we also see Postgres, the best database, by the way, just saying, no bias here. What is that used for? >> So, all of the data you saw, the context and the documents, is stored in LlamaCloud. However, we do run Postgres locally, and it stores everything the NotebookLlama UI needs. For example, the mind map is generated and has HTML content; we store that in Postgres so you can go back to it. >> Awesome. Thank you so much, Tuana. Let's give it up for Tuana, please. >> Thank you. >> Thank you. All right. Are we ready for our next speaker? >> Yeah. >> Okay. Let's do it. So, our next speaker leads AI developer experience at Google DeepMind, where she's helping shape how builders everywhere use the latest... it's 5:00 p.m., but we're going to get through this together... how builders everywhere use the latest generations of models. I think she's going to talk about a lot of cool models today, so I'm very excited about this talk. Today she'll take us beyond chatbots, showing us live demos of Veo 3, Genie 3, and Gemini 2.5 Pro. So please join me in welcoming to the stage, from AI developer experience at Google DeepMind, Paige Bailey. [Music] Thank you so much. I'm so excited to be here, and even though it's 5:00 p.m., I think there will be some things to wake us all up later today. So, hi everyone. My name is Paige. I'm fortunate enough to be here with my excellent colleagues from Google DeepMind; everybody from GDM, raise your hand. Excellent. Ahmed, Guillaume, Patrick, Ian, they've all been valiantly sharing demos, staffing the booth, and presenting throughout the day, and we're here today to wrap it all up. Some of these things might look a little familiar if you were at some of our earlier talks, but hopefully there will be enough new material that it'll be exciting even for people who have experimented with our models before. So, as mentioned, these are the folks from DeepMind who are here today. And I also just want to say that Google has been a little bit busy.
If you think back to a year ago, the release cadence was not as rapid; there were fewer models released to the market, fewer features. Over the last year and a half we've been accelerating at a pace that's really exciting to see. I'm a long-term Googler and I've never seen us ship at this kind of rate; I think we're releasing a new model or a new feature every five days or thereabouts. A couple of the models that have recently come to market: Gemini 2.5 Pro. Show of hands, how many people have used that? Excellent. Gemini 2.5 Flash Image Preview, aka Nano Banana. How many folks? Excellent. Also, we're not the best at naming things. Veo 3, which is our video generation model. Show of hands. Awesome. Gemma 3n, part of our open model family, because open source kind of rules and fuels the world. And then Genie 3, our world model. You'll see all of these today, and we'll also talk through some applications and when you might want to use one versus another. So, Gemini is special in a number of ways, one of which is that it's natively multimodal. It can understand video, images, audio, text, and code, all at once, in multiple languages. But it can also output multiple modalities. Gemini models are fairly unique in the market in the sense that they can do many, many things: they can output text and code, but also images; they can edit images; they can output audio, like you might have seen in Guillaume's section of the workshop earlier today. That means you can do all sorts of things: image understanding, editing, speech-to-text, text-to-speech, all natively incorporated into one model, in addition to things like PDF understanding. One of the folks here today has been using Gemini to pull in PDFs and extract images and bounding boxes for different geospatial coordinates, and also using it for real-time conversations. This is just an example of something you can do with the Nano Banana model: give it an image of a car and have it picture-perfectly turned into a convertible. Our robotics team has been using something called Gemini Live to orchestrate many of the different robotics behaviors: Gemini automatically detecting bounding boxes and features helps the robotics models understand where to grip a specific object, or where to head if you ask it in natural language to rotate a tool or build you a salad or something similar. We've been incorporating Gemini into our smart glasses, so it can give you recommendations on the fly as you navigate across cities, or help with your math or physics homework as you look at something on the screen. We saw this in action at Google I/O earlier this year, if you want to take a look. And it can also do things that feel a little bit like the Hitchhiker's Guide to the Galaxy: if you remember the Babel fish, where one person speaks in one language and you hear it in your native language, then you speak back and they hear it in theirs, that is possible today with the Gemini Live API, which is pretty remarkable. But I am not a fan of slides. I'm actually very bad at slides.
And so I think it's much more interesting to see some of these things in action in AI Studio and in some of our model scenarios. If you haven't seen AI Studio before, this is the best place to get access to DeepMind's models as soon as they're released. You can select different models here off to the right and see details about them, including pricing and the specific model name; this is what you would use in the API if you wanted to call that specific model. As you interact with the models, you can also click this Get Code button and see the libraries used to engage with the Gemini models. We have a Gen AI SDK which is unified with our Vertex AI compadres over in Google Cloud, so instead of having a distinct SDK for the Gemini API and another for our enterprise customers, these are now merged for Python and for TypeScript; if you need to toggle from one to the other, it's pretty seamless. We also have a couple of features like streaming, generating media, Build, and the like. Hopefully folks saw our Build feature a little earlier today; show of hands, how many saw that at the expo? Given that a few folks haven't seen it, I'm going to do a live demo of Build featuring the Nano Banana model, including deployment, and then we'll race back to show a couple of the other capabilities. With Build, one of the fun things is that you can describe in natural language an app you'd like to create, build it in real time in the UI, and deploy it via Google Cloud and Cloud Run. So as an example, let me toggle back. Where did the other screen go? There we go. Excellent. With the Build feature, you describe in natural language the app you'd like to create and deploy it in real time via Google Cloud. So I could say something like: create an app that takes a webcam photo of the user. The app should then use that photo to ground the creation of, let's say, a Dungeons and Dragons character. The app should use Gemini 2.5 Flash Image Preview, aka Nano Banana, to modify the image so the D&D character looks similar to the user. Make sure the app is well designed and that the stats for the character are included. And I'm going to hit Ctrl+Enter. Immediately we see Gemini 2.5 Pro breaking down every step of creating this app, walking through all of the architectural considerations it would need to make and selecting the different models that are available. And since this is baked natively into AI Studio, it's incorporating the latest models and the latest features from the Generative AI SDK. If you ever played SimCity 2000 or any of the other Sim games, that loading screen probably looks a little familiar. All of the code is getting written here on the right; you can see a really nice file explorer here in the center, and there's also a handy save-to-GitHub feature off to the right if you want to save to a public or private repo. All of these things generate prompts on the fly, so if you need a specific prompt to engage with the model as part of the app, you can.
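If you want to do outside AI Studio what that Get Code button hands you, here is a minimal sketch using the unified google-genai Python SDK. The model ID and prompt are placeholders, and it assumes a Gemini API key is set in the environment.

# Minimal sketch of calling Gemini via the google-genai SDK.
# Assumes `pip install google-genai` and a GEMINI_API_KEY environment variable.
from google import genai

client = genai.Client()  # picks up the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",  # placeholder: use the model ID shown in AI Studio
    contents="Draft the stat block for a neutral good bard D&D character.",
)
print(response.text)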
It's handling all of these services, like the webcam. If it encounters any errors, it takes the error, feeds it back to the model, and uses Gemini to resolve it. And at the end of this, we should be able to deploy to Cloud Run as well. This is all part of an improved code generation and vibe coding experience with our latest Gemini models; they've been topping the leaderboards, including LMArena, for these code generation scenarios, and it's been really exciting to see how they've improved over time. So, let's see if this works. I'm going to begin the quest. This does, hopefully, look okay. So it's consulting the ancient tomes, rolling the dice of destiny. Powered by Gemini, unleash your own imagination. Let's also zoom out a little so we can see the full app. Bargaining with a mischievous fae; oh, that sounds very cute. And that does look very much like me, well, in my dreams, but it does keep the original portrait. Neutral good bard, an entertainer; well, that's nice to hear. Warm smile and a gesture of peace. And you can see the character backstory along the way. I actually really love this and think it should probably be my Facebook picture. Lyra Brighton, cool enough. If you select a Google Cloud project, I'm just going to pick this one, it verifies the project, and then you can click Deploy App. It creates a unique URL behind the scenes that you can share with your friends, your family, or anyone else you want to join your D&D campaign. And after this is created, you can see the app gets deployed. Even cooler, you can also take a look at all the logs and the Google Cloud services that have been created along the way. All of this is scaled out to production; we take care of the headaches of Kubernetes and the like, so you don't really have to. You can also see the billing associated with that account. If I pull up the billing for my linked account, which is where all the apps I've deployed are running, you can see that clearly I have a problem, but also that there are a lot of Gemini API, Cloud Run, and Cloud Storage components just added behind the scenes so I don't have to worry. Another cool feature, which my colleague Guillaume shared a little earlier today, is the Live feature. I'm going to click Stream. You can add to the system instructions, so you can say something like "please only speak to the user in French" and have that applied. And then you can do all sorts of things: you can share your screen, you can share a video feed, and the model interacts with you dynamically, using the languages you specify or a variety of languages based on the user conversation. I'm going to ground with Google Search, then share my screen and ask Gemini what it sees. I've already got Google Colab pulled up in a tab; do not judge me by the number of untitled notebooks that I have. I'm going to go ahead and ask Gemini what it sees. Hey there, Gemini. What do you see on the screen? ... Google Colab, "start coding or generate with AI." Was that correct? Okay, cool. Excellent. Good job. More cheers.
But even cooler, you can also ask for help with the UI. So, you could say something to the effect of, "Hey, Gemini, how would I change the runtime type in this Colab notebook?" Change runtime type. Awesome. So, it was able to navigate me through the UI and tell me what to select, and since I've turned on Google Search grounding, I can even ask, "Hey, Gemini, is it going to be raining in Paris today? What is the weather like in Paris?" [Music] >> Yeah. So, it sounded like it didn't quite get that one right, but it does have the ability to look up information on Google Search. So you can ask it natural language questions, you can even specify your own function calls, and the model should be able to pull in that information. In addition to all of this operating within Google AI Studio — and again, if you click get code, you get everything that you need in order to replicate the experiment — we've also been baking Gemini into Google Colab itself. One feature that feels very underloved and underutilized, at least to me, who really enjoys working with data, is that you can specify a data set. So, I'm going to copy the path to this CSV file, which you can see here on the left. You can also add a link to a URL and ask Gemini to help with web scraping it. But you can say something to the effect of: please do exploratory data analysis — if I could type — on this CSV file and build a model to predict California housing prices, which is always a very depressing exercise. Similar to what we just saw in Google AI Studio, what happens behind the scenes is that Gemini builds a plan. It creates step-by-step instructions for what it would need in order to accomplish the task. So you can see here: loading the data, doing EDA, feature engineering, data preprocessing, model selection — all of that gets incorporated into the step-by-step process. Then Gemini writes the code and executes it within the context of the notebook. If you've never used Colab before, it's a notebook-based interface on top of some compute, with a whole bunch of Python libraries installed. What's happening is that Gemini is doing each one of these tasks: it's describing its reasoning, it's writing the code, it's executing the code, and it's using the outputs of each of the cells it has created to inform the next steps of its analysis. So, it's analyzing the data. It's creating these really rich, detailed plots for the CSV file — but it could also be JSON, it could be TSVs, it could be a table that you're importing from a database or from BigQuery. It's doing feature engineering. It's deciding, based on the structure and the shape of the data, what kind of model it should train. And then afterwards it gives you a summary of the results and explains all of its reasoning traces along the way. So, this is pretty cool. I started doing machine learning back in 2009, back before scikit-learn even found its legs, and it's been really just amazing to me to see how all of these tools have evolved over the last year or so. And so it looks like it's decided to create a linear regression model. It's training the model.
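To make that plan concrete, here is a hand-written sketch of roughly the steps the Colab agent executes (load data, quick EDA, train/test split, baseline model, MSE and R²), using scikit-learn's built-in California housing data in place of the CSV from the demo:

```python
# Sketch of the agent's plan, written by hand: EDA, then a baseline
# linear regression on the California housing data bundled with scikit-learn.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = fetch_california_housing(as_frame=True)
df = data.frame

# Exploratory data analysis: shape, summary stats, correlation with the target.
print(df.shape)
print(df.describe())
print(df.corr()["MedHouseVal"].sort_values(ascending=False))

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))
```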
And then it should give us some insight into the results it found for the data set and the R-squared values — so you see MSE and R², and then hopefully a summary of all of the results at the very end. So this is pretty cool. I highly encourage everyone to play with these features, these agents that are baked into Google Colab, the features that are baked into AI Studio, and the models that we have available, and to test them out for all of your use cases. But in the interest of time, I'm going to go back to our original presentation, and we're going to careen along for the rest of the examples. So, Veo 3 — hopefully everybody is excited about Veo 3. Veo 3 is one of three of our generative media models that have been released to the public: Imagen 4 is for image generation, Lyria for music generation. But — given that no AI Engineer presentation is complete without a quote from Andrej Karpathy — I'm really excited about video in particular, because it's an incredible medium to help educate, but also to communicate, especially for audiences that are my niece and nephew's age. You know, they already tell me, "Paige, nobody reads anymore." But I think video is something that can resonate and really find a place with people who have different learning styles, and with people who maybe don't want to take the time to learn through other means. And so Veo 3 is kind of the first step on the path to making that a reality. So, in the interest of not cycling through all of the text on the screen, I'm going to do a lot of showing. We've got a lot of new features as part of our video models. One of them is this character consistency, or reference-powered video. Here you see a person and a hallway that have both been incorporated into the prompt for the Veo model — this is still using Veo 2 — and then you see that same character walking down the hall as you describe. Reference-powered video is very similar: you can have the same character in a variety of scenes, in a variety of lighting conditions, and it still looks like the same happy little monster, swimming or walking or hanging out at a gas station. We also have a feature where you can take an image and animate it, and a feature where you can have an image with a guide for the animation — so not just creating what looks like it might happen next, but actually nudging it to be, say, a woman walking down a road in Texas. You can have camera controls, where you control the style of the video outputs; outpainting, which was really important for a recent exploration DeepMind did to restore The Wizard of Oz into a state where it could be displayed on the Sphere in Las Vegas; adding objects to scenes, removing objects, playing around with perspectives and camera controls — there's an example of removing objects, and you can do the same with our nano banana models, by the way. You can have reference face movements, so an avatar and reference face movements that control it. First and last frame: you can define the first frame and the last frame and ask the Veo models to interpolate between them. And all of that was possible with our last generation of Veo models. Veo 3 takes it to the next level.
So with Veo 3 you can do things like photorealistic video of people and places. These were all created just through prompts, or prompts coupled with input images. Hopefully you can hear the audio in the room as well. These are everything from futuristic landscapes to things that feel a little bit like a scene from Lord of the Rings, right before Frodo throws the ring in. There are lots and lots of cool things to do with Veo. Again, in the interest of time, I'm going to show you just one more video to give you the gist of what's possible, created by our friends in the Google Paris office. >> Yep. It's like Daft Punk. I can't believe this new Veo model. It is amazing. Artificial. Artificial. Artificial. [Music] Artificial. [Music] >> Awesome. So if you see M. Blanc anywhere around the Google Paris offices, definitely tell him that he has excellent taste in filmmaking. So, moving along to the next slide — or attempting to move along to the next slide. Yep. One of the things that I wanted to try with the Veo 3 model when it was first released was to see if I could create, or really replicate, a commercial that I had seen on television. I wanted to see how much easier it would be to do with Veo 3 versus Veo 2, and also how many other models I would need to stitch together in order to make that possible. So this is the advertisement in question. >> "Hey, my name's Paige, and what makes the Chick-fil-A chicken sandwich original to me is the crispiness of the breading and the tenderness of the filet. It's tasty, it's warm, it's total satisfaction." >> So, that's not me, Paige. That's some other Paige who also likes chicken sandwiches. The next step, to build a similar style of experience with Veo 2, was: I take the original video, give it to Gemini 2.5 Pro, and ask it to segment it into 8-second clips and generate a unique prompt for each 8-second clip. That was all used to guide and steer the Veo 2 outputs, which, as a reminder, don't have audio. So to have audio incorporated, I used the text-to-speech version of Gemini to create the voice track, and I used a music model to generate a 30-second clip for the audio in the background, which I also used Gemini 2.5 Pro to describe. I stitched it all together using Camtasia, though you could also use MoviePy. And then I got that final video, and it looked a little bit like this. [Music] "Hey, my name's Paige, and what makes a Chick-fil-A chicken sandwich original to me is the crispiness of the breading and the tenderness of the filet. It's tasty, it's warm, it's total satisfaction." And I actually kind of like this one better — it's a lot more chill. But again, this is using Veo 2 and a collection of models, and it took about 25-30 minutes to create end to end, which is still a lot less than it would have taken a professional movie-making team, I think. One of my colleagues also told me that clearly it was wrong, because the breading on that chicken was completely different from Chick-fil-A breading — it was definitely Popeyes breading. So, you know, some work to be done. But with Veo 3, the process is much simpler, as you can see. You just have the original video, and you ask Gemini 2.5 Pro to create the prompt, or the collection of prompts.
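For the stitching step in that Veo 2 pipeline, here is a rough sketch of the MoviePy route the speaker mentions as an alternative to Camtasia, assuming the generated clips and the voiceover/music track have already been downloaded locally (MoviePy 1.x-style API; file names are placeholders):

```python
# Sketch: concatenate the 8-second generated clips and lay the generated
# audio track underneath. File names are placeholders.
from moviepy.editor import AudioFileClip, VideoFileClip, concatenate_videoclips

clips = [VideoFileClip(f"veo_clip_{i}.mp4") for i in range(4)]  # four 8-second clips
video = concatenate_videoclips(clips)

audio = AudioFileClip("voiceover_and_music.mp3").subclip(0, video.duration)
video = video.set_audio(audio)

video.write_videofile("stitched_ad.mp4", fps=24)
```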
In this case, I only wanted the first 8 seconds. It generated the detailed text description, and that's what I gave to the Veo model to get the final output. And so this is what it produced: >> "Hey, my name's Paige, and what makes the Chick-fil-A chicken sandwich to me is the crispiness of the breading and the tenderness of the filet." >> And so that's pretty strong with just a single prompt. Genie 3 is our new frontier model for exploring worlds. You can navigate just with the arrow keys. Behind the scenes, it's powered by techniques similar to our Veo models, as well as a lot of integrations with Gemini, so it's kind of a harness-style approach for generating these worlds and allowing you to navigate through them. You can create, via a prompt or via an input image, the kinds of worlds that you would like to see, and navigate through them. And there's even consistency: if you interact with the world, if you draw something on the wall, if you look down and see that your galoshes are yellow, it will remember that and persist it through the duration of the exploration. You can even see what it would feel like to experience a hurricane in Florida from a road, which is pretty rad. We should have a trusted tester program for it coming pretty soon, so stay tuned — we can't wait to get that out to folks. So, we've talked about a variety of models: Pro, Flash, Flash-Lite. I also want to talk a little bit about our nano model family, which is small enough to fit on mobile devices and to be embedded within browsers — specifically Gemini Nano — and also some of our open models, like Gemma 3 and Gemma 3n. So, interestingly, for Gemma 3 — how many folks have used Gemma 3 or heard about it? Gemma 3 is remarkable in a few ways. You can see here that our 27-billion-parameter version of Gemma 3 is able to fit on just a single H100 — so just one GPU, as opposed to the 32 that you would need to run DeepSeek R1 or DeepSeek V3. And even cooler, Gemma 3n, which is 4 billion parameters in size — so small enough to fit on your laptop, small enough to be locally hosted, and free to use because you can download the weights — is actually exceeding the capabilities of our Gemini 1.5 Pro model, which was our best model six or seven months ago. I just want to underscore that: our best model, the one that took multiple TPUs to run, is now being bested by an open model that is small enough to fit on your laptop and that you can use for free. So if you fast-forward six months from now and look at the frontier of what's possible today — Genie 3, Veo 3, nano banana, all this cool stuff — you could easily imagine that probably all of those models, or their equivalents, will be friendly enough to run on your local devices, without you having to send data anywhere, without you having to send anything to a server. And the frontier will probably look dramatically different than it does today. So I truly think that open models and locally hosted models are the future, and we're deeply investing in that — not just for Gemini on Pixel devices with Gemini Nano, but also something called Gemini in Chrome. We announced this about five days ago. With Gemini in Chrome, you can do a variety of things. You can summarize information across tabs.
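Picking up the Gemma 3 and Gemma 3n point from a moment ago: a minimal sketch of running a small Gemma checkpoint locally with Hugging Face transformers. The model id is an assumption (check the current Gemma collection on Hugging Face), and the weights require accepting the license and authenticating before download.

```python
# Sketch: local text generation with a small Gemma checkpoint.
# "google/gemma-3-1b-it" is an assumed id; larger and multimodal variants exist.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    device_map="auto",  # uses a GPU if one is available
)

out = pipe("Explain full-duplex voice interfaces in one paragraph.", max_new_tokens=200)
print(out[0]["generated_text"])
```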
And coming soon to Gemini in Chrome is a feature that feels very similar to Project Mariner, if you've seen that before, but it allows you to ask in natural language: hey, here's an email with a request for me to make something — go on Instacart, order me all of the groceries, put them into my grocery cart, and check out, please. And all of this would be available embedded within the browser, free for you to use and interact with as part of your Chrome-native experience. So this feels very futuristic, it feels like sci-fi, but it's stuff that is definitely coming down the pipe from Google's perspective. I also want to underscore that there has never been a better time to be a founder. If any of you in this room are startup-curious, there's almost nothing stopping you from creating a company that's cash-flow positive and getting it out into the world. We've also seen a whole bunch of the VCs that we partner with supporting solo founders much more than they have previously. So if you've always felt a little bit unsure, just because you wanted to build a business but didn't necessarily have a partner — again, there has never been a better time. Smaller teams are also capable of doing outsized amounts of work. If you talk to the Black Forest Labs folks or the Mistral folks, it still boggles my brain how much they're able to do with a relatively small number of engineers. So I just want to encourage you, if you need any encouragement, to go build, go create, and get things out into the world. With that — hopefully I didn't go too much over time — I wanted to say thank you. Thank you so much for having us, and thank you for all your great questions and for sharing your use cases. We're really excited to see what you are about to build, and have been building, with the Google DeepMind models. And your homework for today, if you haven't already tried it, is to go to ai.dev and start experimenting with Gemini: generate an API key and start using it in your projects. So, thank you so much. Thank you to the team from DeepMind who is here today and has been doing all of this work. We appreciate you. >> Thank you, Paige. That was amazing. >> Excellent. Do we have time for questions? >> Yes, I have a few follow-up questions for you. >> Excellent. And it's okay if they are spicy. We can talk spicy. >> Let me think now. >> All right. They're not that spicy, but I think so. Amazing presentation. I'm like the mind-blown emoji all over. I was blown away. I'm speechless now. I've seen you use Veo 2 instead of Veo 3. I know that you said Veo 3 is Veo 2 but supercharged. >> Yep. >> But are there real use cases for Veo 2 now, even if Veo 3 is available? >> There are some features that are still only available via the Veo 2 APIs, but we're quickly trying to integrate them with Veo 3, so hopefully in the future there won't be a need to rely on the Veo 2 models — you can just use Veo 3 natively. From a pricing perspective, we're also bringing down the costs of Veo 3 pretty significantly; they've already dropped over the last month or two. So if you're thinking about building something or creating video, definitely prioritize Veo 3 explorations over Veo 2. >> Awesome.
Yeah, you spoke about so many models, and you mentioned Genie 3, and I was wondering how much of Genie is actually used for autonomous vehicles and to train autonomous robots as well. >> Gotcha. So, the Genie 3 models — I don't think they've necessarily been generating data to train autonomous vehicles at the moment. But we do include synthetic data in pre-training and post-training for the Gemini family of models. I'm sure folks here have played video games before — and if you haven't, live a little, there are a lot of really cool games out there — but a lot of the footage of agents interacting in video games, accomplishing tasks and exploring the world, is also really useful training data for models such as world models, or even models like Veo 3, or Gemini's video understanding capabilities. So I definitely think those kinds of data would be really interesting for models to incorporate back, and even just the agents-exploring-worlds style of video game footage is really helpful for model training too. >> Very cool. I have one last question for you. >> Yep. >> So — >> Just one? >> I have so many written down, but for the sake of time I can only ask this one; I'll catch you later for sure. So what's unique about DeepMind's models, and what are you excited about? >> Yeah. With DeepMind's models, I really love how we're pushing the boundaries for multimodal outputs — the audio outputs, the video outputs — but also, open source is near and dear to all of our hearts. I think the Gemma team is doing phenomenal work, and I've really been excited to see how these smaller models can be incorporated into things like Chrome and into mobile devices. And I really can't wait to see more of the roadmap of Gemini in Chrome. I think it's truly magical to imagine this world where you can have all of your data kept local, but it can accomplish a variety of tasks for you, or can toggle efficiently between models that are zero cost because they're just running on local devices, and models that need to be hosted server-side. >> Wow. Well, thank you so much. Thanks, Paige. You guys have been cooking for sure. Shipping every five days is incredible. >> Well, everybody on the team has been contributing — it's been fantastic to get to work with everyone, and I feel fortunate and honored to go to work every single day. >> Thank you so much. >> Excellent. Thank you. >> Give it up for Paige. >> All right, time to announce our last speaker of the day. Up next: our speaker led generative audio research at Google Brain, and earlier worked on speech recognition at Facebook AI Research, before he co-founded Kyutai, where he now serves as Chief Modeling Officer. Today he'll talk about full-duplex conversation with Moshi, Hibiki, and more. Please join me in welcoming to the stage Chief Modeling Officer Neil Zeghidour. [Music] [Music] >> Hi everyone. Thanks a lot for having me today. I'm happy to talk about our work on scaling real-time voice AI. Before that, I would like to say a few words about Kyutai. Kyutai is a nonprofit AI research lab we created in Paris two years ago, thanks to generous donations from Xavier Niel, Rodolphe Saadé, and Eric Schmidt.
It's an AI lab that is focused on open research and open science. The main mission is to make big advances in AI, in particular around multimodal LLMs. And the specific thing about Kyutai is that, since it's a nonprofit, all our inventions are published and shared in open source. We train PhD students, we collaborate with academia, and so on. So what do I mean by scaling real-time audio? You may be familiar with existing solutions — there are a lot of them around, like ElevenLabs and so on. Currently, the main applications in AI voice are around offline content. Typically, you will generate an audiobook with a synthetic voice, or you'll have a character in a movie that says a few synthetic sentences, or a small character in a video game that can interact with you, and so on. So this is mostly offline content: content that is highly qualitative and generated at low volumes. A huge opportunity that is currently not really addressed is everything that is interactive and very high volume. Taking again the example of the difference between offline content and the rest: when you make an audiobook, you can pay a lot and spend a lot of time iterating, because you generate it once and it's consumed by a lot of people. On the other hand, there is a lot of content that is generated on the fly, needs to be processed right now, is consumed by a few people, and is then thrown in the trash. For example, if you look at gaming, if you have interactive NPCs you can talk to, there will be a lot of these interactions, each heard by a single person, so they need to be very, very scalable. It's the same for robotics: eventually, humanoid robots are going to play a huge part in our society, and interaction with them will be mostly vocal. So we're talking about lots and lots of volume — much more AI speech than currently exists. Same for media: the audiobook is mass media, where you generate one audio for a lot of people. But something that is starting to emerge as a new product, and I'll show some examples, is personalized news, where instead of consuming the same thing as everyone else, you have a personalized news digest about your interests. It's the same in that context: this is very focused on a single person, so you need to generate massive volumes of audio. And what do we look for when we say voice AI? The first thing is the kind of application you are interested in. People will naturally think about speech synthesis or voice agents, but there are a lot of things you can do. Synthesis is the task where you give a text and generate the corresponding audio. Transcription, on the other hand, is writing down what is being said; it's very useful, for example, for meetings. Translation is translating speech into another language, hopefully in real time, while preserving the voice. Transformation covers all the kinds of audio effects you can apply to a voice. If you look, for example, at visual effects in apps like TikTok or Snapchat, there's a lot of AI in them — you can make a lot of AI-based transformations to your face. But the audio effects are things like pitch up, pitch down, slower, faster; they're not really using AI right now, so you could imagine much richer transformations. And the last one is the full conversational experience: the voice agent.
So now that you have this set of capabilities, what are you looking for in terms of quality? The first one is fidelity: it needs to sound as if it was recorded in a studio, not on a smartphone in the subway. The second thing is that you want to be able to design voices, either by cloning them or by writing a natural language description, like "I want a middle-aged man who smokes way too much and has a deep voice," or something like that. You want emotions to be rightfully understood by the AI and also produced consistently — right now you might be crying with your AI voice and it's going to say, "oh, that's so great," which is not exactly the kind of interaction you are looking for. The flow should be very natural: right now — and I'll talk a bit about this — you need a lot of discipline when you talk to an AI, and you'd rather have something much more natural, like a human conversation. Finally, the latency needs to be very, very low. And now about scalability. I think the two main challenges we are facing right now — and Paige said a few words about on-device models — are these: if we want to scale audio generation to reach all the use cases I showed before, gaming, assistants, personalized media, and so on, either we need to be able to generate very large volumes on the cloud, or we need to be able to do small-scale generation on device. Let's say you want to make NPCs in a video game: either it's hosted on the cloud and you need to generate millions of hours of NPC voices every month or so, or everybody is running the local TTS on their PS5, and in that case it's scaling through devices. In particular, if you look at quality and scalability, there is a trade-off where people currently have to choose. If we take speech synthesis again: very high quality, low scalability — that's the audiobook. You generate it once, it needs to sound very good, but you don't need to generate a lot of them; you make the Harry Potter audiobooks, you sell them, and you make a lot of money. On the other end, voicemail requires a lot of scalability, because everyone has their own voicemail — it's an interaction you cannot predict, so it needs to be generated for everyone — but the quality can be pretty crap: if you call your voicemail, you don't need full emotional understanding, you just want to know what people wanted to tell you. At the intersection of quality and scalability, there is the personal assistant. The content is not mass media — everybody wants their own interaction — but they want it to be as qualitative as the premium audiobook. So in that case you need to nail both aspects at the same time. It's the same for translation: if you do diplomatic interpretation — really not our focus right now — the expectations on reliability and accuracy are extremely high. Same if you are asked by, I don't know, Netflix to dub a movie: the expectations on quality are going to be very, very high. On the other hand, if you want to be translated while you travel — it's a bit related to the recent release of live translation in the AirPods Pro — then it's okay, I guess, if the quality is a bit lower, as long as it's useful and reasonably reliable, and in particular can run on device so you can bring it with you on your travels.
And at the intersection of quality and scalability, you have meetings, phone calls, or small creators. Small creators are a bit like the opposite of movies: in the case of movies, you make one movie for a lot of people; for small creators, the content is consumed by a few people, but there is a lot of content being generated at the same time, so you need to translate much, much more of it. Finally, for voice agents, one example I like is a startup working on airline claims. If you do airline claims, it's an AI that calls the airline and says, "Yeah, the flight was late by one hour, so you need to give us 300 bucks." In that context, if you pay $10 for the voice, it's fine, because you get a lot of money back. But if you have a bot at McDonald's that takes your order, it needs to be super cheap — and at the same time, the quality can be pretty crap as long as it gets the order right. And at the intersection, you have video game NPCs, interactive podcasts, e-learning. If you want to learn a language through voice, not only does it need to be very cheap for you — you don't want to spend 500 bucks a month to talk to it — but you also want the experience to be enjoyable and qualitative. So you need to nail both aspects at the same time. So what I'm going to show is how you can address all these aspects, and the story at Kyutai around them. The first one is about quality. The first project we did at Kyutai, called Moshi, was about creating the first full-duplex conversational AI. That was more than a year ago, before the release of the advanced voice mode from OpenAI. Now it's something people are used to having, but conversational chatbots — all of them, without exception, even today — still rely on a half-duplex setting. It's a bit like a walkie-talkie: either the AI is speaking or it's listening, which means it requires discipline. It's always a bit awkward when you talk to an AI, because if you interrupt it, or you cough, it thinks you are interrupting, and it starts breaking the flow of the conversation, and so on. So it's not very natural, because you need to adapt to the limitations of the voice, rather than the voice adapting to the fact that, I don't know, maybe you are erratic in the way you speak. Full duplex means the model always speaks and always listens. It can interject at any time; it can be interrupted at any time — exactly like in a human conversation. In a human conversation, if you're on the phone with a relative, the amount of overlapping speech is around 20%, which means that 20% of the time people are speaking over one another. And that makes it a rich conversation, in a way. So how do we address that? That's the small technical part of this talk. We also wanted to address another aspect, which is that people rely on cascaded systems. A typical chatbot is speech-to-text, then an LLM, then text-to-speech. How can we merge all these steps into a single one? Because if we merge them into one, we don't lose emotional information — we don't go through text — and the latency is much better. The way we did it is by taking inspiration from text models. Very quickly: a text LLM is a probabilistic model where you give a sequence of words and predict the next one.
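As a concrete illustration of that next-word view — using GPT-2 here only because it is a small, freely downloadable text model, not one of Kyutai's — this sketch prints the most likely next tokens for a prompt:

```python
# Next-token prediction with a small off-the-shelf text LLM (GPT-2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Kyutai is an AI lab based in", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx.item()):>12s}  {p.item():.3f}")
```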
So, given "Kyutai is an ...", you predict "AI," then you inject "AI" and predict the next word, and so on and so forth. Our main algorithm — and this is the expertise of Kyutai — is audio language models. How do you make an audio language model? The most basic thing you could do is say, okay, I'm going to put audio as the input and audio as the output of the LLM. It doesn't work at all, for a very simple reason. Look at the sentence "Kyutai is an AI lab based in Paris." It's eight words, so if you pass it into an LLM, it's very few tokens. It takes around 3 seconds to pronounce the sentence, and even at 24 kHz, which is not studio-quality audio, that is 72,000 values — a super long sequence to pass through your LLM. And given that self-attention has a quadratic cost — it's proportional to the square of the length of the sequence — it means it's roughly 100 million times more costly to process audio than text with an LLM (72,000 samples versus roughly eight text tokens, and (72,000 / 8)² is about 80 million). So what we did instead is invent Mimi. It's a codec — an alternative to MP3 or Opus. It takes a large audio file and compresses it so densely that it becomes a bit like a text representation. In the case of the sentence I showed, instead of 72,000 samples, you go down to 37 tokens. So now it's almost as if it were text, and you can train your LLM to predict these tokens. Your LLM predicts these audio tokens, and then you have a decoder — a generative adversarial network — that reconstructs high-quality audio from them. So we have turned the task of audio generation into a task of language modeling. In particular, that's the standard architecture used by Google, OpenAI, and so on for modeling conversations, and the way they do it is that they have an audio LM with a sequence of audio tokens that represents either the LM or the user. In that context, the model predicts the next turn based on the previous turn. It works nicely and gives you all the chatbots you have right now, but it's still a walkie-talkie, because either the model is speaking or the person is speaking. So what we did is invent what we call multistream modeling. It's a very simple idea: we just have two streams in parallel, which means both can be speaking at the same time, both can be silent at the same time, or one can talk while the other listens. And what it gives — and I'm going to show you — is the demo we did more than a year ago now. You're going to see two things. The first is a conversation with Moshi, where you'll see that the model still has the lowest latency ever at this point; sometimes, when it guesses where you're going, it starts answering before you're done. You'll see Alex speak at the same time, and there is no problem with the ambiguity where both are talking at once. The second video is me at work in our office, where it's super noisy, and a nice advantage of full-duplex models is that since they don't have this awkward handling of speaker turns, you can have a lot of background noise and they just work pretty well. >> The planet is Sirius 22. Can you plot a trajectory course to it, please? >> Yes, sir. >> Okay. How long is it going to take us to get there? >> I've mapped it out. It's approximately 5 months to get there. >> Okay, that's not too bad. Do you think we have all we need on board the ship to start the mission? >> Yes, sir. We have everything we need. >> Okay. >> Good day. How are you doing?
>> Hey, uh, I'm doing some construction work right now, as you can see. I'm looking for some advice. Can you help me with that? >> Sure, I'm happy to help. What's the issue you're facing with your project? >> So, in that context, I couldn't even hear anything, and the model has no issue handling the noise. The second thing we tackled is capability. The multistream architecture we invented is not an architecture only for dialogue; it's a general framework for modeling any kind of audio task. In particular, if you take your two streams and, instead of the LM's speech and the user's speech, you have me speaking in French and me speaking in English, then you can predict me speaking in English from me speaking in French, and you get real-time speech-to-speech translation. >> Hi everyone. I'm delighted to be able to speak to you today in French, since my phone will translate what I am telling you into English. As you can see, it is in airplane mode. It doesn't have an internet connection, because our machine translation model allows for real-time voice translation entirely on the phone. You can find the model as a free download on Hugging Face. >> In this context, this model was made by a single master's intern in a few months, because since it's the same architecture, you just generate new training data — it's very easy to go from conversation to translation. It's again extremely robust to noise. >> My name is Alexandre, and I am testing this model in extreme conditions. At this moment there is very loud music and I can barely hear what I'm saying. However, our model is able to translate live. >> That's Kavinsky playing in the background — you probably can't recognize it. Anyway, so this is for translation. Again, there is a live demo coming and everything can go wrong, so please bear with me. From translation to transcription now. Okay, I've shown two tasks so far. Instead of predicting English from French, you can predict text from speech. The idea is the following: it's still the same architecture, but now what it can do is real-time transcription. At the moment, it's the most accurate and fastest real-time transcription in the world, and it's open source. The way it works is that it predicts the text from the speech in a continuous fashion. And now I can go back to English, and it's going to work, hopefully. I hope so. Okay. First demo. Okay, back to the presentation now. You can also do it the other way around: instead of predicting speech from speech, you can predict speech from text — in that case, streaming text-to-speech. There are a lot of text-to-speech systems out there; the specificity of this one is that it streams on the text, which means that as you type words, they start being generated. It's useless for a human, because it would be super weird to be typing and have it start speaking. But when you want to make it work with an LLM, it's very useful: say you want to make a conversational agent from an LLM, and the LLM is going to produce a huge paragraph. You don't want to wait for the LLM to finish the paragraph before starting to generate the audio — you want both to be predicted at the same time. So in that context, what you're going to see is the actual real-time latency. >> Coming up next, we've got something special: the AI Engineer Paris afterparty. Enjoy drinks and fruitful conversations. See you next time. >> Please stay.
I should have put that one at the end of the presentation. Please, a few minutes more and then we can go have some drinks. And interestingly — there was a nice talk earlier by our friends at pyannoteAI — instead of predicting just text, you can also predict text with the label of who is speaking when. So if you have a meeting with a lot of people, you want to be able not just to transcribe a single stream of text, as is often done, because that's hard to parse, but rather to have the model detect who is speaking and associate the right transcript with them — all of this still in real time. >> Understand that? >> Yes. >> Okay. I tell you, if you need to tell me something, I'll let you come up here to the podium so you can speak into the microphone and I can hear you. >> Yes. The attorney here — I'm wanting to fire him. >> Uh-huh. >> And you know, I don't feel like he's doing any— >> Who are you going to hire? >> I'm not going to hire nobody. I'm going to try to get a different public defender. >> There isn't — you have a right to an attorney. >> Yeah. Anyway, I don't want to play too much of that content because it's going to get us a strike. Anyway, that shows the flexibility of this model. And now what you can do is take the real-time speech-to-text and the real-time text-to-speech and put an LLM in between. In a way we are back to what I showed before, the cascaded system — it has more latency and it's not speech-to-speech — but the very nice thing is that it's completely customizable. You don't have to touch the LLM. You have an LLM with vision understanding, function calling, tool use, RAG, whatever you name it. Most people just want to make it speak; they don't want to train a new model that completely replaces their text stack. So in that context — and this is the demo that worked perfectly during rehearsal and started freaking out 30 minutes ago, so let's see — the way it works is, for example, I can upload a 10-second voice sample. In this case I uploaded a voice sample of General de Gaulle, and I write a personality for the LLM — just a prompt explaining that I'm talking to the General — and now, with a 10-second sample and a very short text, you have designed a new conversational experience, basically. "Excusez-moi — I mean, mon général — for the sake of the audience, which is mostly English speaking, maybe you can switch to English. Right now I'm giving a talk on audio language models, and I was thinking maybe you could have some advice for me. Maybe you could try explaining it in your own words." >> "Ah, when I think of language models, I am reminded of the importance of communication. During the darkest days of World War II, we had to convey messages clearly and with conviction to unite the French people. In the same way, these models must be precise." >> "Okay. Thanks a lot, mon général. Have a good one." And finally, the last one is about scalability. The demo I showed of the on-device speech-to-speech translation — you can also run it on the cloud, and on a single H100 you can process 320 concurrent conversations. That means that on a single GPU you can serve hundreds of concurrent conversations. Similarly, if we look at our speech-to-text and text-to-speech: our speech-to-text, in terms of streaming solutions, is more accurate than Whisper, for example.
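Before the throughput numbers, here is the rough shape of the cascaded setup just described — streaming speech-to-text into whatever text LLM you already run (with its RAG and function calling intact), then streaming text-to-speech conditioned on a roughly 10-second voice sample and a written personality. The three helpers are hypothetical placeholders, not Kyutai APIs; the actual streaming STT/TTS implementations live in their open-source repositories.

```python
# Shape of a cascaded voice agent: STT -> your existing text LLM -> streaming TTS.
# All three helper functions below are hypothetical placeholders.

PERSONALITY = "You are a calm, encouraging conference mentor. Answer briefly."
VOICE_SAMPLE = "speaker_sample_10s.wav"  # ~10-second reference clip (placeholder path)

def transcribe_stream(mic_audio):
    """Placeholder: yield partial transcripts from a streaming STT model."""
    raise NotImplementedError

def chat_llm(system_prompt, user_text):
    """Placeholder: call whatever text LLM you already use (tools, RAG, etc.)."""
    raise NotImplementedError

def speak_stream(text, voice_sample):
    """Placeholder: streaming TTS that starts producing audio as the text arrives."""
    raise NotImplementedError

def handle_turn(mic_audio):
    user_text = "".join(transcribe_stream(mic_audio))  # 1. speech -> text
    reply = chat_llm(PERSONALITY, user_text)           # 2. text -> text (your LLM)
    return speak_stream(reply, VOICE_SAMPLE)           # 3. text -> speech
```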
And the throughput — the number of seconds of audio you can process in one second — means we have something like a 400x real-time factor, against a 5x real-time factor for Whisper streaming. Same for our TTS, which is open source: if we compare to the main open-source TTS systems out there — Dia, CSM, Chatterbox — we have more than 100 times higher throughput while having a smaller word error rate, which means better pronunciation and fewer errors. What that means is that you now have something that can scale to address all the applications I started describing, because if you want to make NPCs in video games, where everybody has 20 villagers they can talk to in a huge open world and so on, it starts to become believable, which was really not the case until now. And to give you a final demonstration of this: as of today, on La Provence, you can listen to news articles with our voices — so, yeah, probably not the best topic, I really apologize about that. But what you see here is the difference between two types of media. What I just played is the journal for everyone — that's kind of the audiobook: a single piece of media generated for everyone. On the right — unfortunately it's not available on the web yet, only in the app — is a personalized news digest. In that context, only a solution like ours can scale to generate, for each person reading the paper, their own news stream. So basically, our goal is to remove the trade-off between quality and scalability and provide both. And in particular, what we have been doing with La Provence — incorporating our models into products — is the kind of collaboration we are also looking for right now. You can find all our code on GitHub and our pre-trained models on Hugging Face. We publish all our research on arXiv and in conferences, so all of it can be found. And we are interested in maybe opening limited access to an API that we can serve to people who want to prototype large-scale generation. So what we're excited about is if someone wants to make a video game with voiced NPCs, or automate customer support, or create an education app, and so on — those kinds of use cases are super interesting. So please don't hesitate to reach out; this link goes to a Google form where you can reach us. Thanks a lot for your interest, thanks a lot for bearing with me, and thanks for your attention. >> Wow. Thanks, Neil. Neil, would you like to join me for a couple of questions? Awesome. All right, there's so much to unpack there. So, you showed multiple models — I don't know where to start — but what's the most challenging part of developing a model like a full-duplex model? >> So, the reason why we did the full-duplex model: when we started working on it, it was early 2024, and dialogue at that time was really a task that nobody had been able to tackle at all. TTS had started to work.
At Google I had worked on music generation and so on, so all of that was working well, but there was really nothing for dialogue. At Kyutai, our strategy is always the same: we are a small lab, a small team, so what we target as projects are things that are very new and can be tackled through deep expertise in the topic, rather than by scaling up resources. So what is most challenging — and I really love this about audio — is that audio models are typically very small; they are around one to three billion parameters. It's not about scaling huge infrastructure. It's really about understanding how human hearing works, how speech production works, and how you can put that knowledge into machines. It's about having a very deep knowledge of, and interest in, the field of audio, rather than just a general knowledge of machine learning, I would say. >> All right. Very cool. You mentioned that you have a small team, but you're also a nonprofit. Can you expand a little bit on that? >> So the reason why we created Kyutai as a nonprofit is that the ambition is to create something that can foster groundbreaking new ideas. What is interesting is that if you look at the main inventions in AI — for example, the transformer architecture that was invented at Google — it was invented with a specific application in mind, machine translation, but it was invented at Google Brain, the fundamental research lab. And the reason is that it was a complete rethinking of the way you do sequence modeling. People were using recurrent neural networks, and there is no relation between recurrent neural networks and transformers, which means you really need to be ready to say: okay, let's rethink everything from first principles — maybe we waste a year, but if it works out, it's going to be huge. It's very hard to do that in a startup. So we decided to create a nonprofit, because we really wanted a mission where we can try stuff, fail, and just focus on making big things without distractions. And Xavier, Rodolphe, and Eric Schmidt were very happy to accompany us on that. >> That's very cool. But you still have some commercial partnerships as well, right? >> So we started getting interested in that for the reason I showed in the benchmarks: we are a nonprofit, and at the same time we make research models and realize they are very competitive — to the point where they can compete with commercial solutions. So we are interested in providing them to people, not only as open-source models, but also as specific models — say someone needs a model in a specific language, they are okay with open-sourcing it, and they want us to work on it. Obviously we need to find a way to make it work, because we are a small team, but we are very happy that our open-source models can also be used in products. >> Awesome. So what's next for Kyutai, and what are you excited about?
So what we're excited about — and I think it's quite interesting — is this: Moshi was the first full-duplex model, and then Unmute, what I showed with the General, is a cascaded model, and the whole field seems to be going back to cascaded models because it's just so much simpler and more convenient for products. Which means that progress on full-duplex models has remained almost flat since Moshi, and yet that's clearly the end goal. So for us it makes a lot of sense to keep walking that path towards models where you really feel there is a deep, mutual understanding between you and the machine you're talking to — it makes the conversation very enjoyable, but at the same time it should be very useful. You cannot just give up RAG and function calling and so on, because then you have a chitchat buddy, and it's pretty fun and very natural, but it's kind of useless beyond chitchat. You want a chitchat buddy that can also access the knowledge of the world and do a lot of complex stuff. >> Exactly. Awesome. Well, thank you so much, Neil. Thanks for the presentation. That was my last question. >> Thanks a lot. >> All right, let's give it up for Neil. >> Thank you. >> Awesome. Very excited for Kyutai. So, before you leave, I would like to say one more thing. Hold on. Oh, sorry. Yep, I've got a clicker here. Well, thank you so much for bearing with us and staying with us until now. I know it's been a long two days, but we're very excited to have had you here. I personally really enjoyed this experience with you guys. So, thank you. I'd like to thank every one of you. Let's give it up for everybody here. All right. Look, this event wouldn't have happened without you, without your support, the support of the community, and the support of our sponsors. So I just wanted to thank everyone. I wanted to thank Docker, Neo4j, Sentry, DeepMind, Arize, Algolia, and everybody else who supported us throughout this event. And we're super happy — the quality of the speakers was just amazing. I don't know about you guys, but I really enjoyed it, and I had a front-row seat to watch all those talks, and it was fantastic. But before you leave, we have two things for you. We're all going to go upstairs by the expo and take a group photo — that's going to happen in 3 minutes — and we have a very, very special announcement, so make sure to be around for that. Okay? So, we can all walk together, and in 3 minutes: group photo and a special announcement. Thank you so much again. Bye. Woo. Yeah. [Music] [Applause]