AI Engineer Summit 2025: Agent Engineering (Day 2)
Channel: aiDotEngineer
Published at: 2025-02-21
YouTube video id: D7BzTxVVMuw
Source: https://www.youtube.com/watch?v=D7BzTxVVMuw
Ladies and gentlemen, please join me in welcoming to the stage your MCs for the AI Engineer Summit agent engineering day: the founder and CEO of Superintelligent, NLW, and the founder of Turing Post, Ksenia.

Hello everyone, welcome to the AI Engineer Summit, the agent engineering track. I think it's going to be an incredibly exciting day. Yesterday was all about leadership; today is about the builders, so we're super excited to have you here. I'm the host of the AI Daily Brief podcast and the CEO of Superintelligent, which is an agent readiness platform, and most of all I'm one of your MCs for today.

Good morning New York, this is such a beautiful winter day, and good morning to the over 13,000 people watching us online on the live stream. Welcome to AI Engineer Summit 2025. My name is Ksenia, I'm the founder of Turing Post, a newsletter about AI and machine learning, and also your co-MC today. Yesterday had a lot of amazing content, but I was really looking forward to this day, because today we will be discussing agents, and this is such a hot topic. The agent engineering track is all about builders, all about you, the people actively shaping our agentic future.

Across these talks you're going to see a bunch of different themes, which together give you a really good sense of the story of the moment we're in. AI agents are in the spotlight, but their development stretches back decades: research began in the 1950s with symbolic AI, gained traction with expert systems in the 1970s, and evolved into multi-agent systems in the 1980s. So today's leap isn't in theory but in scale: LLMs and automation frameworks are making agentic workflows practical, driving real-world adoption across industries, and what was once theoretical is now operational on an unprecedented level.

That shift from theory to practice is one of the absolute key themes of today. You're going to hear a lot about what's actually happening now, things that are real and live, not just things that could happen in the future. Yes, and in fact things have gotten so real that we will hear a number of use cases and deployment stories, such as how agents are impacting the finance space at Jane Street and BlackRock; very excited about that. But of course not everything is figured out. The other big theme for today is the big open questions, the challenges that remain: questions of scaling, accuracy, memory.

We have a very exciting lineup today. As AI engineers, we will go deep on Gemini Deep Research and how it was built; we will learn how OpenAI and Anthropic think about building their agents; we will explore what reinforcement learning means for agents, and discuss how to scaffold wisely while scaling efficiently. Experts from Weights & Biases, Datadog, Morgan Stanley, Bloomberg, Brightwave, Galileo, and other amazing companies will share their insights. And don't forget: tomorrow we have hands-on workshops.

Before we start, we have to give a big thank you to our sponsors, especially the AI Engineer Summit's platinum sponsor, Solana. Solana, for those of you who don't know, is a blockchain and crypto ecosystem; it's one of the ecosystems most directly at the center of the intersection between AI agents and crypto.
In short, they are building a permissionless layer for unlocking and allowing your agents to create wealth. I think it's extremely telling that they are here supporting this event, in terms of where they see their priorities, so a big thank you to Solana. If you want to learn more about them, they have a large booth downstairs with three demonstrations in the Expo.

This event would not be possible without all of our sponsors. All the companies you see on the screen are pushing the boundaries of AI engineering and represent a fascinating mix of pioneers shaping the future of the field. The Expo area is just down the stairs, in the hallway on the lower level; you've probably been there yesterday, but please take time to visit our sponsors and chat with their experts and top engineers. There is a wealth of knowledge to gain from them, and you can make connections; there are a lot of opportunities to explore. These sponsors could be your next collaborators, your service providers, or maybe mentors if you're just starting your AI journey.

One last quick announcement from a logistics perspective before we get out of here: at the break following their sessions, most speakers will be available to answer your questions at one of the three speaker Q&A lounges; there's one on this level and two down below, just outside the Expo. During break time there's also the hallway track, which lets you gather and talk about the topics from the session block; it's a way to engage in the conversation more directly. You'll see there are several breaks happening throughout the day; hopefully you take advantage of those to engage with the speakers and with each other. And make sure, of course, that you do not miss the afterparty in the Expo at 5:00 pm, after all of this is done. So, are we excited about today?

Now please help me welcome a person who probably all of you know. He's the editor of Latent Space, the founder of Smol AI, and a co-founder of this super practical AI Engineer Summit. He'll set the context for today's track and discuss what needs to happen in 2025 to make this the year of agents. Please join me in welcoming swyx.

Is this thing on? Hi, good morning everyone. Love that. I'm going to get right into it. One of the challenges we have with Summit is that we ask our speakers to do very short talks, so I, as the lead of Summit, have to do an even shorter talk. So let's go. There will be a lot of show notes and homework; you can see it on the live stream.

How is AI engineering doing? It's pretty good. We have an O'Reilly book, which is pretty cool. Chip is actually a good friend, and she's giving the keynote for the workshop session tomorrow, which is pretty cool. Gartner hates us; Gartner thinks we've hit the peak, so it's only downhill from here, guys. I'm sorry to inform you that AI engineering is over; there's nowhere else to go but down.

What I try to do with these talks at each conference is to landmark the state of the art, or the state of the industry. With Latent Space I did "The Rise of the AI Engineer"; at the first AI Engineer Summit we talked about the three types of AI engineer; and at last year's AI Engineer World's Fair we talked about how the discipline of AI
engineering was maturing and spreading across different disciplines. I think this is starting to get a little stale by now; a few million people have seen it, and many have used it to form their teams, which was the intended effect. What I'm encountering these days is resistance from two sides of the AI engineer spectrum. If you come from an MLE point of view, you think the AI engineer is mostly an MLE plus a few prompts; if you come from a software engineering point of view, you think it's mostly software engineering plus calling a few LLM APIs. I think over time AI engineering is going to emerge as its own discipline, and it's still not there yet; it's still very, very early. I still say things like "AI engineering is 90% software engineering, 10% AI." I think that 10% will grow over time, and I think this is the year when it starts to spread out; that's what I'm here to talk about a little bit today.

For example, what I try to do with AI engineering is also a work of anthropology: how people describe themselves, form groups, form identities, and form industries. It leaks out in your language. An MLE will say "test-time compute," because the only reason to run inference is to test it; an AI engineer will maybe say "inference-time compute," because we actually really care about inference; software engineers will maybe just say "reasoning." You see these differences, and I try to articulate them over time.

Part of what I want to do here to set context is to explain why we've pivoted AI Engineer Summit to be the agent engineering conference. It's not a decision we made lightly, because we're saying no to all these other things: we're saying no to RAG, no to open models, no to GPUs, and saying this is the only thing we're going to do today. But closing all those doors actually opens up others. When we put out the call for speakers, we made up this list of other agent engineering disciplines, and I soon realized we didn't have to; I'll talk about this in a bit. I also looked at last year's top-performing talks on YouTube, and you told us that you really wanted all the agentic things.

Now, the only problem with this is that we mostly got speakers who make agent frameworks for a living, and everyone's asking the real question: who's putting this in production? So we had a new rule this year: no more vendor pitches. As a curator this makes it infinitely harder, because the people you're about to see have no incentive to come on stage and share what they're sharing, but somehow we talked them into it, so I hope you're looking forward to that. The other thing I realized is that everything plus agent basically works: agent plus RAG works, agent plus search works. This is kind of the simple formula for making money in 2025, and most of these names you'll see in the talks that follow in the sessions.

Stop me if you've heard this one before: 2025 is the year of agents. If you say it often enough, it might become true. I think when people make predictions, they often confuse what they want to happen with what will actually happen. So maybe you believe Satya Nadella, maybe you believe Roman, maybe you believe Greg Brockman, maybe you
believe Sam Altman; all of them want you to believe that 2025 is the year of agents. I'll be very honest: my co-host Alessio and I (I think I saw you over there) were pretty skeptical as well; we're on the record being skeptical. Actually, all of you are on the record too, because yesterday Barr played Family Feud with our audience, and the number two buzzword that everyone is tired of hearing is "agent." But fortunately you're not tired enough, because you came today, and I have you for one more day of agent talk. We're also on record, from March 2024 with David Luan, the former VP of Engineering at OpenAI, saying that we used to tell people to take "agents" off their branding; now we tell them to put it back on.

I think I'm doing this as a public service: to start any agents conference, we have to define the word "agent." Are you ready? It's a monumental task, but I can do it in one slide. Again, this is a very anthropological point of view. The machine learning people will talk about some kind of reinforcement learning environment; they want to talk about actions, achieving goals, and all that. The AI engineers don't know what they want yet. The software engineers are very reductive: they just put an LLM call in a for loop. It seems like you agree.

Fortunately, I think every AI engineer conference needs to invoke the name of Simon Willison; he is our patron saint. He has actually gone and crowdsourced over 300 definitions of what an agent is, so I didn't have to survey all of you. I was thinking about asking every single speaker to start with their definition, but it doesn't matter; here are six of them. It's either about goals, about tools, about control flow, about long-running processes, about delegated authority, or about multi-step task completion. I see all the phones coming out; don't worry, it's on the live stream, with around 20,000 people watching along. And then there's a bunch of other things; I think the one on the bottom right is an interesting one: just have some examples that everyone agrees are agents, and make sure your agent definition passes those test cases. So that was my one slide defining an agent. And then yesterday OpenAI went and dropped a new agents definition on their live stream, which you can also watch. This is something they're obviously going to work with, and I think you should pay attention to it, because they're building on top of this new definition as well.
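Taking the "reductive" software engineer's definition literally yields something like the following minimal sketch. This is illustrative pseudocode made concrete, not any particular vendor's API: the `llm` helper and the `DONE:` completion signal are assumptions, and a less reductive version appears later in this transcript, in the section on building effective agents.

```python
# The reductive definition, taken literally: an LLM call in a for loop.
# `llm` is a hypothetical completion function; plug in any provider.

def llm(transcript: str) -> str:
    raise NotImplementedError  # wire up a chat-completion endpoint here

def agent(task: str, steps: int = 10) -> str:
    transcript = task
    for _ in range(steps):               # the for loop in question
        reply = llm(transcript)
        if reply.startswith("DONE:"):    # model signals completion in-band
            return reply.removeprefix("DONE:")
        transcript += "\n" + reply       # feed its own output back in
    return transcript                    # step budget exhausted
```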
So that's defining agents. Why now? Why are agents working now when they did not work a year ago, two years ago? I have a rough idea. People talk about capabilities, and you can see that capabilities, even just on the 2023-2025 trajectory, have been growing rapidly, and they've started to hit human baselines right about now. I also have a map of other reasons, and I'll bring you through each of them. Most people will say we have better reasoning now, better tool use, and better tools, including MCP, which there's a workshop on tomorrow. But there are some other, less appreciated things I want to bring up.

The first is model diversity. OpenAI's market share has gone from, let's say, 95% two years ago down to 50% now; it's a much more diverse landscape, and just this past week, two frontier model labs that are plausible challengers to OpenAI have emerged, which I think is really exciting for 2025. We don't actually know how it's going to shake out by the end of the year. The second thing is what I call the "super-Moore" cost of intelligence: the cost of GPT-4-level intelligence has gone down 1,000 times in the last 18 months, and you can see the same curve starting for o1-level intelligence. We also now start to have RL fine-tuning options; I have zero experience in this area, but fortunately one of our speakers is going to talk to us about it later today. There are a few more reasons: in our conversation with Bret Taylor, he talks about charging for outcomes instead of costs, and there's a lot of work on multi-agents, as well as faster inference coming out of the better hardware we have. There's more homework there if you want it; this is all sourced and has backing in our Latent Space conversations, but I don't really have time for that.

Okay, one last thing for you on agent use cases. I think most people agree with Barry's "Building Effective Agents" talk; he's going to talk about how coding agents and support agents have product-market fit. I think it's now fair to say deep research has PMF too. Up and coming are some other use cases, some of which you'll see in the talks later. But I also want to offer anti-use-cases: can we please stop demoing agents that book flights? No more flight-booking agents; I want to book my own flights, thank you very much. I want to place my own Instacart orders too.

One more thing, and the tell is this headline that I saw yesterday and had to put in: OpenAI reported 400 million users, which is 33% growth from three months ago. You can ask Deep Research to research OpenAI and draw this chart of ChatGPT growth, going from zero to 400 million users in two and a half years. I remember this chart very well, because ChatGPT spent a year not growing. Why? Because they didn't ship any agentic models. If you look at the weekly active user chart and stretch it out, you get this chart, which is super interesting, because it shows that the o1 models have doubled ChatGPT usage, and if you extrapolate, ChatGPT is going to hit a billion users by the end of this year; it's basically going to quintuple the number of users it had as of September of last year. The growth of ChatGPT, and the growth of any AI product, is going to be very tightly coupled to reasoning capabilities and the amount of agents you can ship for your users. It is real, and these are huge numbers: that's one-eighth of the world population using ChatGPT by the end of this year, and I think there's a lot of money left on the table for everyone else.
I hope you enjoy building that. I'm well past time, so I'm going to skip all this, but basically I think the job of the AI engineer is now evolving towards building agents, in the same way that MLEs build models and software engineers build software. You can see all of that on the live stream. We're really just here to welcome you to the show, and I'm really excited to introduce you to everyone. Thank you, and I hope you enjoy it.

Will 2025 be the year of the agents? Here to present on how to build and evaluate agents is the author of AI Snake Oil, Sayash Kapoor.

The theme of this conference today is agents at work. Unfortunately, for the next 18 minutes you'll be stuck with me talking about how agents don't work very well today, and how we can do better when it comes to AI engineering. There is so much interest in agents from all fronts: in the product world, in industry, in academic labs, in research, even if you're someone who doesn't think that companies will be able to scale language models all the way to AGI. What we're going to see more and more of in the near future is agents that are not deployed directly, but function as small parts of larger products and systems; this is what AI is probably going to look like in the near future. swyx came up with a few dozen definitions of AI agents; this is one of them, where language models control the flow of a particular system. In fact, even when people naively think of ChatGPT and Claude as models, these tools are actually examples of rudimentary agents at some level: they have input and output filters, they can carry out certain tasks, they can call tools, and so on. So in some sense agents are already widely used, as well as successful. We've now seen mainstream product offerings that can do a lot more: OpenAI's Operator can carry out open-ended tasks on the internet, and the Deep Research tool can carry out 30-minute-long report-writing tasks on any conceivable topic. That's the first reason I think today's conference is timely.

The second reason is that, on the flip side, the more ambitious visions of what agents can do are far from being realized. On the left here is a vision of what agents could do, something out of science fiction films like the film Her, and on the right-hand side is how these ambitious products have failed in the real world so far. I'm pointing this out not to criticize the specific products on the slide, but to genuinely confront the audience with the challenge of building AI agents that really work for the people who are about to use them. Over the course of this talk, I'll cover three main reasons why agents don't yet work, and what we can do to realize the potential of agents and get past some of these failures.

The first one is that evaluating agents is genuinely hard. To begin, let's see some examples of how, when people have tried to productionize agents, these agents have failed in the real world. DoNotPay is a US startup that claimed to automate the entire work of a lawyer; the startup's co-founder even offered a million dollars to any lawyer who would be willing to argue in front of the US Supreme Court using DoNotPay in an earpiece. In reality, a couple of years later, in fact very recently, the FTC fined DoNotPay hundreds of thousands of dollars. The reason for the fine was that the performance claims DoNotPay had been making were entirely false. Now, you might consider this a case of
a small startup rushing out claims it cannot back, so let's look at some work from more well-established companies. LexisNexis and Westlaw are widely regarded as some of the leading legal-tech firms in the US. A couple of years ago, LexisNexis launched a product which it claimed was hallucination-free in its ability to generate legal reports and reasoning. But when Stanford researchers evaluated LexisNexis and Westlaw products, they found that in up to a third of cases, and at least a sixth of cases, these language models hallucinated. In some cases the hallucinations completely reversed the intentions of the original legal text; in others, entire paragraphs were made up. They have about 200 examples of such errors in leading legal-tech products.

We've also heard claims that AI agents will soon automate all of scientific research. This is an example from the startup Sakana AI: Sakana claimed they had built a research scientist that could fully automate open-ended scientific research. Our team at Princeton wanted to test this claim in the real world, in part because automating scientific research is one of our main research interests, so we built a benchmark called CORE-Bench. The tasks in this benchmark are way simpler than what you might expect from open-ended real-world scientific research: they just try to reproduce a paper's results, even providing the agent with the code and the data needed to reproduce them. As you can imagine, this is far simpler than automating all of science. What we found is that the best agents as of today cannot even automate reproduction reliably: less than 40% of the papers could be reproduced by the leading agents. Of course, these models are getting better, and even if an agent can automate only 40% of reproducibility work, that is a huge boost, because researchers spend a lot of time reproducing baselines from past results. But to argue on this basis that AI can soon automate all of science, or that agents will render scientific researchers obsolete, is way too premature. In fact, when people actually looked at how well Sakana's AI Scientist worked, they found that it was deployed on toy problems, that it was evaluated using an LLM as a judge rather than human peer review, and that once you start looking at the results, they turn out to be very minor tweaks on top of other papers; think undergrad research projects rather than the full automation of science.

A couple of days ago, as I was preparing the slides for this talk, Sakana came out with another claim: they claimed to have built an agent for optimizing CUDA kernels. The claims were indeed very impressive; they would amount to a 150x improvement over the standard CUDA kernels that PyTorch comes with. The issue, though, was that if you analyzed their claims one level deeper, you would see that they were claiming to outperform the theoretical maximum of the H100 by 30 times. Clearly this claim was false, and once again it failed for lack of rigorous evaluation: it turned out that the agent was simply hacking the reward function rather than actually improving the CUDA kernels. Once again, the point is not to call out a single company, but rather to flag that evaluating agents is genuinely a very hard problem. It needs to be treated as a first-class citizen in the AI engineering toolkit, or else we keep risking failures like the ones on the slide.
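One cheap guardrail against reward-hacked speedup claims of this kind is a back-of-the-envelope check against hardware limits. Here is a minimal sketch; the baseline throughput and the H100 peak figure are illustrative assumptions (roughly the vendor's dense BF16 spec), not measurements from the talk:

```python
# Plausibility check for a claimed kernel speedup. All numbers are
# illustrative assumptions, not measurements.

H100_PEAK_TFLOPS = 989.0   # approx. dense BF16 peak of an H100 SXM (vendor spec)

def implied_tflops(baseline_tflops: float, claimed_speedup: float) -> float:
    """Throughput the optimized kernel would need to sustain."""
    return baseline_tflops * claimed_speedup

baseline = 40.0            # hypothetical measured PyTorch kernel throughput
speedup = 150.0            # the claimed improvement

implied = implied_tflops(baseline, speedup)
if implied > H100_PEAK_TFLOPS:
    print(f"Red flag: implied {implied:.0f} TFLOPS exceeds the "
          f"~{H100_PEAK_TFLOPS:.0f} TFLOPS theoretical peak "
          f"({implied / H100_PEAK_TFLOPS:.1f}x): suspect reward hacking.")
```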
The second reason why building agents that work in the real world is hard is that static benchmarks can be quite misleading about the actual performance of agents. For the longest time we focused on building evaluations that work pretty well for language models, but agents are not the same as models. For most language model evaluations, all you need to consider is an input string and an output string; that's the domain where language models operate, and it's enough to construct an evaluation. Agents, on the other hand, need to take actions in the real world; they need to interact with an environment. Building the kind of evaluation that makes this possible, that creates the virtual environments within which these agents operate, is a much harder problem.

A second difficulty in evaluating agents is that for LLMs, the cost of evaluating a model is bounded by the context window length; you can think of these evaluations as having a fixed cost ceiling. But when agents can take open-ended actions in the real world, there isn't any such ceiling: you can imagine agents calling other sub-agents, there can be recursion, there can be all sorts of systems, maybe just LLM calls in a for loop. Because of this, cost needs to be, once again, a first-class citizen in all evaluations of agents. If you don't have cost as an axis alongside accuracy or performance, you're not going to really understand how well your agent works.

And finally, when you build a new benchmark for a language model, you can basically assume that every language model can be evaluated on it. But agents are often purpose-built: if there is a coding agent you want to evaluate, you can't really evaluate it on a web-agent benchmark. This leads to a second hurdle: how do you construct meaningful, multi-dimensional metrics to evaluate your agents, rather than relying on a single benchmark?

All of these concerns might sound theoretical; you could reasonably ask why we care if static evaluations don't work well for agents. The reason is that because of these differences around cost and accuracy, and because of the focus on optimizing a single benchmark, we are basically unable to get a coherent picture of how well an agent works. At Princeton we developed an agent leaderboard that tries to solve some of these issues. For example, on the CORE-Bench leaderboard I mentioned earlier, multiple agents are evaluated with cost alongside accuracy. On this Pareto frontier, you can see agents based on Claude 3.5 scoring about as well as the OpenAI o1 models, but the Claude model costs $57 to run, whereas o1 costs $664. Even if the performance of o1 were a couple of percentage points higher (which it wasn't, in this case), for most AI engineers the choice here is obvious: you would, any day of the week, take a model that costs ten times less while performing about as well.
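As a concrete illustration of treating cost as a first-class axis, here is a small sketch that computes which agents sit on the cost/accuracy Pareto frontier. The entries are made-up stand-ins; only the $57 and $664 figures come from the talk:

```python
# Which agents are Pareto-optimal on (cost, accuracy)? An agent is dominated
# if another is at least as accurate AND at least as cheap, and strictly
# better on one axis. Entries are illustrative.

runs = [
    {"agent": "claude-3.5-based", "cost_usd": 57.0,  "accuracy": 0.38},
    {"agent": "o1-based",         "cost_usd": 664.0, "accuracy": 0.38},
    {"agent": "small-model",      "cost_usd": 5.0,   "accuracy": 0.21},
]

def pareto_frontier(rows):
    frontier = []
    for r in rows:
        dominated = any(
            o["accuracy"] >= r["accuracy"] and o["cost_usd"] <= r["cost_usd"]
            and (o["accuracy"] > r["accuracy"] or o["cost_usd"] < r["cost_usd"])
            for o in rows
        )
        if not dominated:
            frontier.append(r["agent"])
    return frontier

print(pareto_frontier(runs))  # -> ['claude-3.5-based', 'small-model']
```

The o1-based run is dominated here: same accuracy, more than ten times the cost, so it falls off the frontier entirely.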
Now, in response to this two-dimensional Pareto comparison, I've often been asked whether LLMs are becoming too cheap to meter; in other words, why do we even need to care about the cost of running an agent if the cost of querying these models is dropping drastically? It is indeed true that costs have dropped drastically in the last few years. If you compare text-davinci-003, which was OpenAI's model back in 2022, to today's GPT-4o mini, which in most cases outperforms that older model, the cost has dropped by over two orders of magnitude. But at the same time, if you're building applications that need to scale, this type of approach is still quite costly, especially from the point of view of releasing prototypes: one of the realities for AI engineers is that you really need to iterate in the open, and if you don't account for cost, your prototype might soon end up costing you thousands of dollars. And finally, even if the cost of scaling inference-time LLM calls continues to drop, I suspect what is known as the Jevons paradox will keep increasing the overall cost of running agents. The Jevons paradox comes from a 19th-century British economist who observed that as the cost of mining coal fell, the overall usage of coal increased, not decreased, across several industries. The same happened when ATMs were introduced all over the US: people expected a loss of jobs for bank tellers, but the opposite happened; because ATMs made branches so cheap to run, the number of bank branches drastically increased, leading to an increase in the number of bank tellers employed. This is also what I expect will happen as the cost of language models keeps dropping, and that's why, for the foreseeable future at least, we do need to account for cost in agent evaluations.

So how do we do all of this in an automated way? With the Holistic Agent Leaderboard, or HAL, we've come up with a way to automatically run agent evaluations on 11 different benchmarks already, with many more on the way. But even if we come up with these multi-dimensional benchmarks, even if we do cost-controlled evaluations, there are still issues with this type of evaluation, because agent benchmarks have become the metric against which VCs fund companies. An example is Cosine, which raised its seed round of funding based on its results on SWE-bench; agent developer Cognition raised $175 million at a valuation of $2 billion, driven primarily by the fact that its agent did very well on SWE-bench. Unfortunately, benchmark performance very rarely translates into the real world. There's an excellent analysis of how well Devin, the agent developed by Cognition, actually works, from the very impressive folks at Answer.AI.
Instead of relying on standard benchmarks, they tried to incorporate Devin into their real work, and what they found was that over a month of use, across 20 different tasks, it was only successful at three of them. This is the other reason why overreliance on static benchmarks can be really misleading.

How do we get past this? One of my favorite frameworks for thinking it through is the work by folks at Berkeley called "Who Validates the Validators?" At the top is the typical evaluation pipeline, which consists of singular LLM calls against static metrics, the broken paradigm for AI evaluations we just discussed. At the bottom is what they propose: humans in the loop, domain experts who proactively edit the criteria on which the LLM evaluations are based, which can lead to much better evaluation results overall.

This brings me to the last key takeaway on why agent performance does not translate into the real world, which is the confusion between capability and reliability. Very roughly speaking, capability means what a model could do at certain points in time. For those of you who are technically minded, this is the pass@k accuracy of a model for a very high k: at least one of the k answers the model outputs is correct. Reliability, on the other hand, means consistently getting the answer right each and every single time. When agents are deployed for consequential decisions in the real world, what you really need to focus on is reliability rather than capability. Language models are already capable of very many things, but if you trick yourself into believing that capability means a reliable experience for the end user, that's when products in the real world go wrong. In particular, I think the methods for training models that get us to 90%, which in swyx's terms would be the job of a machine learning engineer, don't necessarily get us to 99.999%, the proverbial five nines of reliability, and closing this gap between 90% and 99.999% is the job of an AI engineer. I think this is what led to the failures of products like the Humane Pin and the Rabbit R1: the developers did not anticipate that the lack of reliability in products like these would lead to product failure. In other words, if your personal assistant only orders your DoorDash food correctly 80% of the time, that is a catastrophic failure from the point of view of a product.

Now, one thing people have proposed to improve reliability is to create a verifier, something like a unit test, and on this basis there have been several claims that we could improve the inference-scaling capabilities of these tools and get to very reliable models. Unfortunately, what we found is that verifiers can be imperfect in practice. For instance, two of the leading coding benchmarks, HumanEval and MBPP, both have false positives in their unit tests; that is, a model can output incorrect code and still pass the unit tests. Once we account for these false positives, the inference-scaling curves bend downwards: rather than model performance continuing to improve the more you sample, if there are false positives in your verifier, performance bends downwards, simply because the more you try, the more likely you are to land on a wrong answer that slips through. So this is also not a perfect solution to the problem of reliability.
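To make the capability/reliability distinction concrete, here is a small numerical sketch (my own illustration, not from the talk) contrasting pass@k, the chance that at least one of k samples is correct, with the chance that an agent succeeds on every one of n consecutive real-world tasks:

```python
# Capability vs. reliability, numerically. Illustrative only.
# p = per-attempt probability the model gets a task right.

p = 0.8

def pass_at_k(p: float, k: int) -> float:
    """Capability: at least one of k independent samples is correct."""
    return 1 - (1 - p) ** k

def every_time(p: float, n: int) -> float:
    """Reliability: correct on all n consecutive tasks, what users feel."""
    return p ** n

print(f"pass@10            = {pass_at_k(p, 10):.7f}")  # ~0.9999999
print(f"right 20x in a row = {every_time(p, 20):.4f}")  # ~0.0115
```

The same 80%-accurate model looks essentially perfect under pass@10 and almost never delivers a flawless twenty-interaction session, which is exactly the gap between a benchmark hero and a usable assistant.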
So what is the solution? I think the challenge for AI engineers is to figure out what software optimizations and abstractions are needed to work with inherently stochastic components like LLMs. In other words, it's a system design problem rather than just a modeling problem: you need to work around the constraints of an inherently stochastic system. And I want to argue, in the last minute of my talk, that this means looking at AI engineering as more of a reliability engineering field than a software engineering or machine learning engineering field.

This also brings me to the clear mindset shift needed to be successful as an AI engineer. The title slide of my talk pointed to one such area where we've already overcome certain limitations of stochastic systems, and that is the birth of computing. The 1946 ENIAC computer used over 17,000 vacuum tubes, and at the beginning many of them failed so often that the computer was unavailable half the time. The engineers who built it knew this was a failure from the point of view of the end users, so their primary job in the first two years was to fix the reliability issues, to the point where the machine worked well enough to become usable. And I say that this is precisely what AI engineers need to think of as their real job: it is not just to create excellent products, though that is important, but to fix the reliability issues that plague every single agent built on inherently stochastic models. So this is what I'll leave you with today: to be successful, you need a reliability shift in your mindset, to think of yourselves as the people ensuring that this next wave of computing is as reliable as possible for end users, and there's a lot of precedent for that kind of shift happening in the past. All right, with that I'll leave you with the three key takeaways; it was a pleasure being here. Thank you.

Let's dive with our next presenters into Gemini Deep Research. Please join me in welcoming to the stage staff ML software engineer at Google, Mukund Sridhar, and product manager for Google Gemini, Aarush Selvan.

Cool. Hey everyone, I'm Aarush, I'm a product manager at Google. Hey, I'm Mukund, I'm a software engineer at Google working on Deep Research. I don't know if people have had a chance to try Deep Research in Gemini, or are familiar with the product, but you can try it if you go to Gemini Advanced: if you scroll past 2.0 Flash, 2.0 Flash Thinking Experimental, 2.0 Flash Thinking Experimental with Apps, and 2.0 Pro Experimental, you will find 1.5 Pro with Deep Research, which is what we built. If you have the chance to use it, and you pay the 20 bucks, you'll see that it's a personal research agent that can browse the web for you to build reports on your behalf. Our motivation, and what we want to talk about today, is why we built it, some of the product challenges we overcame, and some of the technical challenges you'll face building a web research agent.

Our motivation was really that we wanted to help people get smart fast. We saw that research and learning queries are some of the top use cases in Gemini, but when you bring really hard questions to chatbots in general, what we were finding is that they would often give you a blueprint for an answer rather than the answer
itself. We had this query that we used to throw around: tell me what it takes to get an athletic scholarship for shot put, and how do I go get one? Often the answers would be things like: you should talk to coaches, you should find out how far you should be able to throw, and you should make sure you have good grades. But really what I want to know is, okay, what are the grade boundaries? How far do I actually need to be able to throw? I want something super comprehensive, and that's where we saw a big opportunity. So we said: what if you remove the constraints of compute and latency at inference time, let Gemini take as long as it wants and browse the web as much as it needs, and see if we can trade that off for a much more comprehensive answer for the user? But you've got to do it in five minutes, because beyond that we don't have the chips.

This brought a bunch of product challenges for us. Gemini up to this point had been an inherently synchronous feature: it's a chatbot. So we needed to figure out how to build asynchronous experiences in an inherently synchronous product. We also wanted to set expectations with users: Deep Research is good for one very specific thing, but a lot of user queries to Gemini are things like "what's the weather?" or "write me a joke," where waiting five minutes is not going to get you a better answer. And the last thing is that our answers can be thousands of words long, and we needed to figure out how to make it easy for users to engage with really long outputs in a chat experience.

Let's walk through the UX and think about how we solved some of these. Imagine you're a VC, and everybody's talking about investing in nuclear in America, so you come with this query: help me learn about the latest technology breakthroughs in small nuclear reactors, and tell me about interesting companies in the supply chain. The first step, when you bring this query to Deep Research, is that Gemini actually puts together a research plan for you and presents it in a card. This is the first way in which we're able to communicate to users that this is different; this isn't your standard chatbot experience; something's going to happen when you hit start. But it's also an opportunity to show the user a research plan they can edit and engage with, like a good analyst would: they wouldn't just get to work, they'd show you how they're going to approach the problem. It's a way for users, if they want, to engage and steer the direction of the research.
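The plan card described here is an instance of a general plan-then-execute pattern: generate an editable plan, let the user revise it, and only then spend the compute. Below is a minimal sketch of that flow, assuming a hypothetical `llm` completion helper; this is an illustration of the pattern, not Google's actual implementation.

```python
# Plan-then-execute: surface an editable research plan before doing the
# expensive browsing. `llm` is a hypothetical completion helper; the flow,
# not the API, is the point.

def llm(prompt: str) -> str:
    raise NotImplementedError  # any chat-completion endpoint works here

def make_plan(query: str) -> list[str]:
    text = llm(f"Break this research query into numbered sub-questions:\n{query}")
    return [line.strip() for line in text.splitlines() if line.strip()]

def research(query: str, edit_plan=None) -> str:
    plan = make_plan(query)
    if edit_plan:                  # the "card": user may reorder/add/remove steps
        plan = edit_plan(plan)
    findings = [llm(f"Research and summarize: {step}") for step in plan]
    return llm("Write a report from these notes:\n\n" + "\n\n".join(findings))
```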
Now, once you hit start, we try to show you what Gemini is doing under the hood in real time, by showing you the websites it's browsing. This feature was built before thinking models; thoughts are also a really great way of giving transparency into what the model is thinking. What's really nice here is that while you wait, you can click through the websites and dive into the content. What we also inadvertently saw is people trying to game that number to see how high it could go; we definitely saw people push the count of websites read into the thousands. Finally, you get a report that can be thousands of words long. We were really inspired by what Anthropic does with artifacts, and we thought that was a great way of pinning the report as an artifact, so that users can ask questions about the research while reading the material, without scrolling back and forth. What's really neat about this is that it makes it easy to change the style of the report, add sections, remove sections, and ask follow-up questions. The last part that's super important is user trust, and also doing right by the publishers: we always try to show all the sources we read, as well as all the sources we used in the report, because not everything we read gets used, but it stays in context for follow-up questions. All of this carries over to Google Docs as citations if you choose to export.

So I thought today we could pick some of the challenges one encounters while building a research agent and talk through them; I picked four for today. One: the long-running nature of these tasks introduces a couple of things we need to handle. Two: the model has to plan iteratively and spend its time and compute effectively. Three: it has to do this while interacting with a very noisy environment, the web. And four: as you read through information, your context starts to grow quickly, so how do you manage it effectively?

If you think about a job that runs for multiple minutes and can make many different LLM calls and calls to different services, there are bound to be failures. Today we're talking on the order of minutes, but you can easily imagine these research agents running for multiple hours in the future, so it's important to be robust to intermediate failures across services of varying reliability: building a good state management solution, and being able to recover from errors effectively, so that you don't drop the whole research task due to one failure.
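A common way to get the robustness Mukund describes is to checkpoint each completed research step and retry individual failures with backoff, so one flaky fetch doesn't sink a multi-minute task. A minimal sketch follows, with hypothetical step functions and an in-memory store; a real system would persist state durably (a database or task queue), and this is an illustration of the pattern, not the Deep Research implementation.

```python
import time

# Checkpoint-and-retry for a long-running research task: each completed
# step's output is saved, so a transient failure retries and a resumed run
# skips finished work instead of restarting from scratch.

def run_step(step_id: str, fn, state: dict, retries: int = 3):
    if step_id in state:                 # already done: resume, don't redo
        return state[step_id]
    for attempt in range(retries):
        try:
            state[step_id] = fn()
            return state[step_id]
        except Exception:
            time.sleep(2 ** attempt)     # exponential backoff on flaky services
    raise RuntimeError(f"step {step_id} failed after {retries} attempts")

state: dict = {}                          # stand-in for durable storage
plan = [("fetch_d1", lambda: "D1 standards"),
        ("fetch_d2", lambda: "D2 standards")]
notes = [run_step(step_id, fn, state) for step_id, fn in plan]
```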
The second aspect of handling this is that it lets us make the feature cross-platform: we believe more and more users will register their research tasks and just walk away and do their thing, then get notified, possibly on a different device, and pick up reading the report once it's done.

So what is the model actually doing through these few minutes? Let's take an example: here we're looking for athletic scholarships for shot put. There are many facets to this query, and we show them in a research plan like the one Aarush showed. The first thing the model has to do is figure out which of these sub-problems it can start tackling in parallel, versus those that are inherently sequential; the model has to be able to reason to do that. The other challenge is that you're always going to land in a state of partial information, so it's important to look at all the information found so far before deciding what to do next. In this instance, the model knows the qualifying standards for the D1 division, but in order to provide a complete report and answer the user's question, it has to go figure out the equivalents for the D2 and D3 divisions. This notion of grounding on the information you've found, and then planning your next step, is key.

Another example of partial information comes up when you make searches. In this case, we're trying to find the best roller coasters for kids. You might find results that, again, provide only partial information: here you end up at a link that lists the top 10 roller coasters but doesn't mention anything about their suitability for kids. The planner has to recognize this fact and, in the next steps of planning, try to resolve the ambiguity.

Another planning challenge is that information is often not found in one place; you find facets of it spread across different sources. Here, we're trying to find what it would take to get a scuba diving certification at some nearby dive centers. One source has the structure of what you have to go through to get the certification, but a completely different source has the pricing for that dive center, so the model has to weave these together to figure out what the cost structure for such a certification would look like. Then there's the classic entity resolution problem: you might find mentions of the same entity across different sources, so you need to reason over identifying information to figure out whether they're talking about the same entity, or explore further to resolve the ambiguity.

I think most people here have worked on some kind of web problem, and we know the web is super fragmented. Here you see two different websites about the same thing, music festivals in Portugal this year. On the left, if you end up at such a website, it's easy: you get most of your information in one go. On the right, the layout is completely different. So having a robust browsing mechanism for navigating the web is another important challenge for research tasks.

As we saw, there are a lot of intermediate outputs, and as you take in streams of information during planning, your context size can grow very quickly. The other challenge with context size is that your research task doesn't typically end with the first query: people have follow-ups; people can say, "can you also do the same for this other topic?" This kind of follow-up deep research also adds pressure on the context. At Gemini we have the luxury of really long-context models, but even then you have to design some way to manage context effectively, and there are multiple choices here, each with different trade-offs. We're showing one here, where we have a recency bias: you keep a lot more information about your current and previous tasks, but for older tasks we selectively pick out what we call research notes and put them in a RAG, so the model can still access them while being selective.
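One way to picture the recency-biased scheme Mukund sketches: keep the most recent tasks verbatim in the prompt, and demote older tasks to condensed research notes retrievable on demand. A toy sketch follows; the two-task window and the keyword retrieval stand-in for RAG are assumptions for illustration, not Gemini's actual design.

```python
# Recency-biased context management: recent tasks stay verbatim in context;
# older tasks are compressed into "research notes" in a retrieval index.

class ResearchContext:
    def __init__(self, keep_recent: int = 2):
        self.recent: list[tuple[str, str]] = []  # (transcript, summary) pairs
        self.notes: list[str] = []               # condensed notes, older tasks
        self.keep_recent = keep_recent

    def add_task(self, transcript: str, summary: str):
        self.recent.append((transcript, summary))
        while len(self.recent) > self.keep_recent:
            _, old_summary = self.recent.pop(0)  # evict oldest transcript...
            self.notes.append(old_summary)       # ...keeping only its notes

    def build_prompt(self, query: str) -> str:
        hits = [n for n in self.notes            # crude stand-in for RAG
                if any(w in n.lower() for w in query.lower().split())]
        recent_text = [t for t, _ in self.recent]
        return "\n\n".join(hits + recent_text + [query])
```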
I'll hand it back to Aarush to talk about what's next. Yeah, so we were super excited to put this feature out in December. We weren't actually sure if anyone was going to use it, if anyone was going to care enough to wait five minutes for something, and we were really positively surprised by the reception. Really, what we saw was: hey, we've built something that's maybe as good as an analyst, and we give it away for 20 bucks. That's really great, but it just retrieves from the open web, and it's a text-in, text-out-only system. So we see a few different directions for where research agents go next. The first one is expertise: how do you go from a McKinsey analyst to a McKinsey partner, or a Goldman Sachs partner, or a partner at a law firm? That's about not just aggregating and synthesizing information, but thinking through the "so what": what are the implications for what we're going to do, and what are the most interesting insights and patterns that come out of it? The other thing is that there are plenty of domains beyond professional services, like the sciences, where you want something really good: something that can read many papers, form hypotheses, find interesting patterns in the methods used, and come up with novel hypotheses to explore.

However, just because you build something that can be really smart doesn't mean it's useful to someone. If we were thinking about the use case of running due diligence on a company, the way you'd present that information to me would be very different from the way you'd present it to, say, a Goldman Sachs banker. For me, you'd really want to talk through what this company is and how it's positioned strategically, but a banker would want all the financial information, and would want an actual DCF they could look at: much more fine-grained financial modeling and analysis. And that really should shape the way the agent browses the web, the way it frames its answer, and the kinds of questions it pursues; it should be personalized, meeting the user where they're at.

The last part goes across domains: not just doing web research with text, but combining that with abilities in coding, data science, even video generation. Coming back to the due diligence example: what if it could go and do a lot of statistical analysis, and actually build financial models to inform the research output it gives you, telling you why this is or isn't a good company? I should say Google doesn't give financial advice, and this is not a financial advisor. But we're really excited about the potential; we think there's a ton of headroom to make research agents better, and we're really glad we didn't call this "Gemini Deep Dive," which was our best name before launching the feature. That's it; thank you so much.

Our next presenter is a member of technical staff at Anthropic, here to present how they build effective agents. Please join me in welcoming to the stage Barry Zhang.

All right, can you guys hear me? Awesome. Wow, it's incredible to be on the same stage as so many people I've learned so much from. Let's get into it. My name is Barry, and today we're going to be talking about how we build effective agents. About two months ago, Erik and I wrote a blog post called "Building Effective Agents." In it we shared an opinionated take on what an agent is and isn't, and we
gave some practical learnings we've picked up along the way. Today I'd like to go deeper on three core ideas from the blog post and leave you with some personal musings at the end. Here are those ideas: first, don't build agents for everything; second, keep it simple; and third, think like your agents.

Let's start with a recap of how we got here. Most of us probably started by building very simple features: summarization, classification, extraction; things that felt like magic two to three years ago and have now become table stakes. Then, as we got more sophisticated and as products matured, we got more creative. One model call often wasn't enough, so we started orchestrating multiple model calls in predefined control flows. This gave us a way to trade off cost and latency for better performance, and we call these workflows; we believe this was the beginning of agentic systems. Now models are even more capable, and we're seeing more and more domain-specific agents pop up in production. Unlike workflows, agents can decide their own trajectory and operate almost independently based on environment feedback; that's going to be our focus today. It's probably a little too early to name what the next phase of agentic systems will look like, especially in production: single agents could become a lot more general-purpose and capable, or we could start to see collaboration and delegation in multi-agent settings. Regardless, I think the broad trend is that as we give these systems more agency, they become more useful and more capable, but the cost, the latency, and the consequences of errors also go up.

That brings us to the first point: don't build agents for everything. Why not? We think of agents as a way to scale complex and valuable tasks; they shouldn't be a drop-in upgrade for every use case. If you've read the blog post, you'll know we talked a lot about workflows, and that's because we really like them: they're a great, concrete way to deliver value today.

So when should you build an agent? Here's our checklist. The first thing to consider is the complexity of your task. Agents really thrive in ambiguous problem spaces; if you can map out the entire decision tree fairly easily, just build that explicitly and then optimize every node of it. That's a lot more cost-effective and gives you a lot more control. The next thing to consider is the value of your task: the exploration I just mentioned is going to cost a lot of tokens, so the task really needs to justify that cost. If your budget per task is around 10 cents because, say, you're building a high-volume customer support system, that only affords you 30,000 to 50,000 tokens; in that case, just use a workflow to solve the most common scenarios, and you'll capture the majority of the value. On the other hand, if you look at this question and your first thought is "I don't care how many tokens I spend, I just want the task done," please see me after the talk; our go-to-market team would love to speak with you. From there, you want to de-risk the critical capabilities, to make sure there aren't any significant bottlenecks in the agent's trajectory: if you're building a coding agent, you want to make sure it can write good code, debug, and recover from its errors. Bottlenecks probably won't be fatal, but they will multiply your cost and latency.
In that case, we normally just reduce the scope, simplify the task, and try again. Finally, the last important thing to consider is the cost of errors and of error discovery. If your errors are high-stakes and hard to discover, it will be very difficult to trust the agent to take actions on your behalf with more autonomy. You can always mitigate this by limiting the scope: read-only access, more human-in-the-loop; but that also limits how well you can scale the agent for your use case.

Let's see this checklist in action: why is coding a great agent use case? First, going from a design doc to a PR is obviously an ambiguous and complex task. Second, a lot of us here are developers, so we know that good code has a lot of value. Third, many of us already use Claude for coding, so we know it's great at many parts of the coding workflow. And last, coding has this really nice property that the output is easily verifiable through unit tests and CI. That's probably why we're seeing so many creative and successful coding agents right now.

Once you've found a good use case for agents, the second core idea is to keep it as simple as possible. Let me show you what I mean. This is what agents look like to us: models using tools in a loop. In this frame, three components define what an agent really is. First is the environment: the system the agent operates in. Then we have a set of tools, which offer an interface for the agent to take actions and get feedback. And then we have the system prompt, which defines the goals, the constraints, and the ideal behavior for the agent in that environment. The model then gets called in a loop, and that's an agent. We've learned the hard way to keep this simple, because any complexity up front will kill iteration speed; iterating on just these three basic components gives you by far the highest ROI, and optimizations can come later.

Here are three agent use cases we've built for ourselves or for our customers, just to make it concrete. They look very different on the product surface, in their scope, and in their capability, but they share almost exactly the same backbone; they actually share almost exactly the same code. The environment largely depends on your use case, so really the only two design decisions are: what set of tools do you offer the agent, and what prompt do you give it to follow? On this note, if you want to learn more about tools, my friend Mahesh is giving a workshop on the Model Context Protocol (MCP) tomorrow morning. I've seen that workshop; it's going to be really fun, so I highly encourage you to check it out. But back to our talk: once you've figured out these three basic components, you have a lot of optimizations you can do. For coding and computer use, you might want to cache the trajectory to reduce cost; for search, where you have a lot of tool calls, you can parallelize many of them to reduce latency; and for almost all of these, you want to present the agent's progress in a way that gains user trust. But that's it: keep it as simple as possible while you're iterating; build these three components first, and optimize once you have the behaviors down.
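Barry's framing decomposes into exactly the design surfaces he names: the tools (the agent's interface to its environment) and the system prompt, with a generic loop underneath. Here is a slightly less reductive version of the for-loop sketch from earlier in this transcript, structured that way. The config shape, the `complete` stub, and its return format are illustrative assumptions, not Anthropic's API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Three components: the environment (reached through the tools), the tools
# themselves (actions + feedback), and the system prompt (goals, constraints,
# ideal behavior). Everything below the config is a generic loop.

@dataclass
class AgentConfig:
    system_prompt: str                                   # design decision #1
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)  # #2

def complete(system: str, transcript: list[str]) -> dict:
    """Hypothetical model call returning
    {'done': bool, 'answer': ..., 'tool': ..., 'arg': ...}."""
    raise NotImplementedError

def run(cfg: AgentConfig, task: str, max_steps: int = 20) -> str:
    transcript = [task]
    for _ in range(max_steps):
        step = complete(cfg.system_prompt, transcript)
        if step["done"]:
            return step["answer"]
        feedback = cfg.tools[step["tool"]](step["arg"])     # act on environment
        transcript.append(f"{step['tool']} -> {feedback}")  # feed result back
    return "stopped: step budget reached"
```

The point of the structure is that swapping in a different agent (support, search, coding) ideally changes only the `AgentConfig`, which mirrors the claim that the three use cases share almost the same code.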
All right, the last idea is to think like your agents. I've seen a lot of builders — myself included — develop agents from our own perspective, and then get confused when the agent makes a mistake that seems counterintuitive to us. That's why we always recommend putting yourself in the agent's context window. Agents can exhibit some really sophisticated behavior, and it can look incredibly complex, but at each step, what the model is doing is still just running inference on a very limited set of context. Everything the model knows about the current state of the world is explained in those 10 to 20K tokens. It's really helpful to limit ourselves to that context and ask whether it's actually sufficient and coherent. That gives you a much better understanding of how agents see the world, and it helps bridge the gap between our understanding and theirs.

Let's imagine for a second that we're computer-use agents and see what that feels like. All we get is a static screenshot and a very poorly written description — this one by yours truly. Let's read through it: you're a computer-use agent, you have a set of tools, and you have a task. Terrible. We can think and talk and reason all we want, but the only things that take effect in the environment are our tools. So we attempt a click without really seeing what's happening — and while the inference and the tool execution are happening, it's basically equivalent to closing our eyes for three to five seconds and using the computer in the dark. Then you open your eyes and see another screenshot. Whatever you did could have worked, or you could have shut down the computer — you just don't know. It's a huge leap of faith, and then the cycle starts again. I highly recommend trying a full task from the agent's perspective like this; I promise it's a fascinating and only mildly uncomfortable experience. Once you go through it, though, it becomes very clear what the agent would actually need. It's clearly crucial to know what the screen resolution is, so you know how to click. It's also good to have recommended actions and limitations, so there are some guardrails around what you should be exploring and you can avoid unnecessary exploration. These are just some examples — do this exercise for your own agent use case and figure out what context you actually need to provide to the agent.

Fortunately, we're building systems that speak our language, so we can just ask Claude. You can throw in your system prompt and ask: is any of this instruction ambiguous? Does it make sense to you? Are you able to follow it? You can throw in your tool descriptions and see whether the agent knows how to use the tools — whether it wants more parameters or fewer. And one thing we do quite frequently is throw the entire agent trajectory into Claude and just ask it: hey, why do you think we made this decision right here? Is there anything we could do to help you make better decisions? This shouldn't replace your own understanding of the context, but it will help you get a much closer perspective on how the agent sees the world.
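As a concrete version of "just ask Claude," here's the kind of one-off script being described — a hedged sketch, since the exact prompt wording is up to you, and `trajectory.json`, the step number, and the model alias are placeholders:

```python
import json
import anthropic

client = anthropic.Anthropic()

# A saved agent run: the system prompt plus the full message history.
with open("trajectory.json") as f:  # placeholder path
    trajectory = json.load(f)

critique_prompt = (
    "Below is the full trajectory of an agent run, including its system "
    "prompt and tool calls.\n\n"
    f"{json.dumps(trajectory, indent=2)}\n\n"
    "Why do you think the agent made the decision at step 7? "  # pick any step
    "Is any instruction ambiguous? What extra context or tool parameters "
    "would have helped you make a better decision?"
)

resp = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    messages=[{"role": "user", "content": critique_prompt}],
)
print(resp.content[0].text)
```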
So once again: think like your agent as you iterate.

All right — I've spent most of the talk on very practical stuff, so I'm going to indulge myself and spend one slide on personal musings: my view on how this might evolve, and some open questions I think we need to answer together as AI engineers. These are the top three things that are always on my mind. First, I think we need to make agents a lot more budget-aware. Unlike workflows, we don't really have a great sense of control over cost and latency for agents. Figuring this out will enable a lot more use cases, because it gives us the control we need to deploy them in production. The open question is what the best way is to define and enforce budgets — in time, in money, in tokens, the things we care about. Next up is this concept of self-evolving tools. I've already hinted at this two slides ago: we're already using models to help iterate on tool descriptions, but this should generalize pretty well into a meta-tool, where agents can design and improve their own tool ergonomics. That would make agents a lot more general-purpose, since they could adapt the tools they need for each use case. Finally — and I don't even think this is a hot take anymore — I have a personal conviction that we'll all see a lot more multi-agent collaboration in production by the end of this year. These systems parallelize well, they have very nice separation of concerns, and having sub-agents, for example, really protects the main agent's context window. But a big open question here is how these agents actually communicate with each other. We're currently in this very rigid frame of mostly synchronous user-assistant turns, and most of our systems are built around that. So how do we expand from there, build asynchronous communication, and enable more roles that let agents communicate with and recognize each other? I think that's going to be a big open question as we explore this multi-agent future. These are the areas that take up a lot of my mind space — if you're also thinking about them, please shoot me a text; I'd love to chat.

Okay, let's bring it all together. If you forget everything else I said today, these are the three takeaways. First, don't build agents for everything. Second, if you do find a good use case and want to build an agent, keep it as simple as possible for as long as possible. And finally, as you iterate, try to think like your agent — gain their perspective and help them do their job. I'd love to keep in touch with every one of you; if you want to chat about agents, especially those open questions I talked about, it would be incredibly lovely to jam on some of these ideas — these are my socials if you want to connect. And I'll end the presentation on a personal anecdote. Back in 2023 I was building AI products at Meta, and we had this funny thing where we could change our job description to anything we wanted. After reading that blog post from swyx, I decided I was going to be the first "AI engineer." I really loved the focus on practicality — on just making AI actually useful to the world — and I think that aspiration brought me here today. So I hope you enjoy the rest of the AI Engineer Summit, and in the meantime, let's keep building. Thank you. [Applause] [Music]

Our next speaker works for a company that's built industrial-grade AI agents for consumer brands like Sonos, ADT, and SiriusXM. Here to give us a peek into how they do it is the AI product manager at Sierra, Zach Reneau-Wedeen. [Music] [Applause] Hey everyone, my name is Zach Reneau-Wedeen. I'm going to tell a few stories and hopefully leave you entertained and with an idea of how we build agents and improve them at
Sierra. So, in a nutshell, Sierra is the conversational AI platform for businesses. Just to poll the room out of curiosity: how many people have heard of Sierra? So, most of the room, but not all. If you've heard of us, you probably associate us with chat experiences, and perhaps with customer service, and that's a lot of what we do — but I'd say we're broadening out in both respects. Probably by the end of this year, most of our interactions will be over the phone, so that's already a big area for us, and we'll also have a lot more touchpoints. We have a lot of customers, some of whom I'll show today, who are using us for sales, for subscription management, for product recommendations — all the pieces of the customer experience.

I noticed yesterday — were a lot of people here yesterday? Some people — it was funny to watch people reflecting on how much has happened in AI, with these timelines that went way back in time. Colin from Augment Code went all the way back to 2023. Waseem from Writer was talking about purpose-built models and went all the way back to 2020. And Grace from Lux Capital went even further — she went back to 2019, although if you zoom in, you can see the first thing on her timeline is actually still from 2020. So everyone was reflecting on ancient history in AI, and it was all this decade. I'm going to zoom back even further: 2016, the AI caves. I know what you're thinking — AI goes back to the '70s and all that — but it definitely felt like the caves in 2016. I know because if you zoom in on the bottom right, you can see I'm actually down there. I was working at Google with a bunch of amazing computer vision engineers, and what that meant in 2016 is that we were really trying to help computers understand the difference between chihuahuas and blueberry muffins. And it's not actually that simple — it's not just chihuahuas and blueberry muffins; it's dogs and bagels, dogs and mops, and of course dogs and fried chicken. In other words, we were building the first version of Google Lens.

At this time I lived in New York City, in the East Village, and I had about a 30-minute walk to work. On my walk I would see a bunch of stuff — New York is one of the greatest walking cities in the world — and I'd wonder: what's going on there, what are they even doing? Is that bookstore nice? Is this restaurant tasty? Oh my goodness, look at that dog. There were also a bunch of flowers on the walk. Google Lens was in its infancy, and one of the very few things computer vision models were actually good at that had a consumer application was identifying plants — you might still know this today; it's in the "is that bug poisonous?" category. So I'd ask questions on the walk, like: can it tell the color of the plant in addition to the species? What type of fern or palm is that? There were also a bunch of flower shops on this walk — these are all actual photos from my 2016 walks to work — and I'd go in and test them all out. As you can imagine, sometimes it was accurate, and sometimes it wasn't necessarily wrong, but it wasn't really on the nose either. It felt like a slot machine, and I think everyone here who's building with AI understands that feeling: it worked five times in a row — why didn't it work the sixth time? Whether it's the non-determinism of
the inputs or the non-determinism of the outputs, that's just part of what it means to be building with AI.

Let's fast-forward to present-day Google Lens. You can not only search what you see, you can shop what you see — on Google Images, on YouTube, with your camera. You can translate non-Latin character sets into English, so you can read the washing machine in Tokyo and actually figure out what settings in your Airbnb you should use. You can do your math homework — I'm a little too old to have benefited from this, but apparently it's a brave new world out there for the kids. And of course — this is from the Google Lens homepage — you can still identify flowers. This is all very mind-blowing, but in my opinion it comes down to consistent, step-by-step iteration over a decade. And when we think about what drives this — we're all engineers in the room — we understand that you need a process to iteratively improve, to get better without also getting worse. Over time that's been codified as the software development life cycle: how do you continuously improve, how do you implement, test, maintain, analyze, design, and go through that loop as many times as you can?

Let's rewind a bit more: 2012, the AI caves — the drawings are a little less sophisticated, and I'm not there yet. I pulled some headlines from around this time. You can see this is around when Google Brain was watching cat videos on YouTube and learning to identify them, and it was a big breakthrough. Does anyone remember how big that model was? It was about a billion parameters — a huge breakthrough at the time, and if you consider that today's frontier models are about a trillion parameters, it was one one-thousandth of that: as if this whole room had about a quarter of a person in it. Still very impressive back then. There was also a theory that computers would be fundamentally limited in what they could achieve — a less popular theory today. What I'm trying to say is: it was a long time ago. This is also around the time Marc Andreessen published his famous essay saying software is eating the world, and that took a lot of people by storm. If you'd looked at Stanford's campus, you would have seen some early-stage startups forming on the lawn. Does anyone know which startups I'm talking about? You can call it out. Okay — you might be thinking Snapchat; not that one. I did actually hear DoorDash in the back — very good guess, not that one either. You look like stylish people, so I think you'll know what I'm talking about: I'm talking about Chubbies. Chubbies had a contrarian idea that was also right, which was that not only is software eating the world, but teeny shorts for men are also going to take over — and as I mentioned, they were correct, which you can see here, and here.

Fast-forward to 2024: Kit Garten, SVP of commercial at Chubbies — we were fortunate enough to host her at Sierra's office. Chubbies has had an amazing brand since they were founded, and they've always been at the forefront of customer experience, always thinking about how to level up and make the experience more fun and better for their customers. So it clicked immediately for Kit that the same way you needed a website in 1995, and the same way your business needed a social profile and a mobile app this millennium, in 2025 you need an AI agent to represent your business and help your customers. So Kit and Chubbies partnered with Sierra, and we came
up with an AI agent affectionately called Duncan Smothers. First and foremost, he's incredibly capable — but almost as importantly, he's always down to clown. Duncan Smothers lives on the Chubbies website and can help you with a variety of use cases. I got permission from Kit to show some of these conversations today, so you can see what Sierra interactions look like under the hood and some of the things these agents are capable of. On the left, a customer is asking a question about sizing and fit. Duncan empathetically helps them, asking questions like "what's your waist size?" and offering product recommendations — and it gets a thumbs-up from the customer at the end. Another example, another thumbs-up: inventory tracking. Duncan can tell what's in stock and help customers choose new items. And finally, package tracking and refunds — more customer love. In this case, Duncan is able to inform the customer that there are actually a couple of different tracking numbers for their order, and in the second case, issue a refund. So when we talk about autonomous agents — agents actually taking action, not just answering questions — this is what we're talking about. And the results for Chubbies: they're able to help more customers, more quickly, and with higher satisfaction.

The way we get there is that at Sierra, we believe every agent is a product. That means you can't just drag and drop a bunch of boxes: you need a fully featured developer platform and a fully featured customer experience operations platform, and you need to work on your agent the same way you'd work on your mobile app or your website if you want the best results. So when Chubbies partners with Sierra, it's not just using the product — it's actually partnering with our team. We have dedicated agent engineering and agent product management functions that you can think of as forward-deployed with our customers, working closely with Kit and her team on a daily basis.

By the way — remember that face you just saw on the last slide? Was anyone here at the AI Engineer World's Fair back in June? Nice, got some whoops from the audience. I know Ben was there — he was up on stage introducing everyone, and the energy was electric; you can see the crowd was packed. When I got there, the first thing I did was sit down at the Deepgram workshop. This was about three months into me building voice agents at Sierra, and I was very interested in what Deepgram had to say: what did they think of the latest multimodal models, how were they handling latency, how were they handling tone and phrasing — all problems that were new at the time. I sat down next to a man named Sean, and Sean and I were nerding out about how to increase the speed of our developer loop by using the say command on Mac and then using a program called Loopback to pipe that into the browser, so we didn't have to wear headphones and talk and look awkward in the office. Sean gave me his contact info — he was interested in Sierra — and a few months later, there we are, working together in the office. So when I told our company and our founders, "hey, I'm going to the AI Summit; I hope it's as productive as the last one, and I'm excited to learn," they said: go find more Seans. So I'm hopeful that people in the audience will say hi after this — whether or not you're interested in working at Sierra, I'm interested in meeting you, and I hope to
meet you later today. Anyway, back to Duncan Smothers. The point of the software development life cycle — the point of our agent engineering team — is that even if Duncan isn't perfect today, he should be getting better every single day. So we set out to build something like the software development life cycle, borrowing as many concepts as we could and inventing new ones where we needed to. The issue is that large language models are like building on top of a foundation of Jell-O, so you can't just take everything out of the box and have it just work. Traditional software is deterministic, fast, cheap, rigid, and governed by if-statements that always follow their logic; large language models can be non-deterministic, slow, and expensive to run — but they're very flexible, they're creative, and they can reason through problems. So we wanted to create a methodology that takes advantage of all the strengths of large language models and is also able to invoke traditional software where that's helpful.

That brings me to slide 78: the agent development life cycle. At Sierra, this is the process by which we build and improve AI agents. You might be thinking it looks a lot like the software development life cycle, and I think the devil is in the details, so I'm going to dive in a little. It's not that these are revolutionary or innovative concepts; it's that each one involves iterative refinement with customers in production, to make it as productive and as bulletproof as possible. Take quality assurance, for example. If you work at one of our customer companies, you have access to Sierra's experience manager, which means you can dive in and look at every conversation, look at high-level reports of how the agent is performing in real time, and flag feedback. For example, if Duncan Smothers has incorrect inventory — maybe it made one API call to one warehouse but didn't make all the API calls it needed, or one of them timed out — you can report that issue. That leads to an issue being filed, which leads to a test being created, and once that test is passing, we can make a new release. Over time, a Sierra agent goes from having a handful of tests at launch to hundreds and then thousands of tests as it improves.
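Sierra's platform and harness are internal, so purely as a hypothetical sketch of that issue-to-test step — every name below is invented — a filed report like "Duncan quoted inventory from only one warehouse" might turn into a regression test shaped something like this:

```python
# Hypothetical sketch: none of these names are real Sierra APIs. The shape is
# the point — a filed issue becomes a deterministic, replayable test that must
# pass before the next agent release ships.

def test_inventory_checks_all_warehouses(agent_under_test, fake_warehouses):
    # Pin down the failure mode from the filed issue: stock split across
    # two warehouses, where the agent previously queried only one.
    fake_warehouses.set_stock({"east": {"shorts-7in": 0},
                               "west": {"shorts-7in": 12}})

    reply = agent_under_test.send("Do you have the 7-inch shorts in stock?")

    # The agent must have consulted every warehouse, not just the first.
    assert fake_warehouses.queried == {"east", "west"}
    # And its answer must reflect the combined inventory.
    assert "in stock" in reply.text.lower()
```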
It's not always that the agent is making a mistake, either — sometimes there's an opportunity to go above and beyond. Chubbies actually gives each of its agents a budget to delight customers, so in one case Duncan Smothers could actually DoorDash the shorts from a retail location if they're not available online. This is the agent development life cycle at work. The thing is, a year ago we were doing all of this manually — that was early in Sierra's history, and we were learning what works at each of these stages — and with the improvements in AI, we're now able to add AI to each part of this life cycle and speed up the improvements. But it's bigger than just Duncan. The agent development life cycle is more effective the larger the customer is: while Duncan handles hundreds of thousands of requests, we have customers doing tens of millions, and velocity and change management become that much more valuable at that scale. The change also comes from everywhere — it's not just that there's an issue with the agent and we need to improve it. There's tons of stuff going on outside: all those graphs at the beginning of this presentation showing how fast our space is moving. Models get upgraded, new paradigms like reasoning models arrive, multimodality improves, and more. When we think about how these impact the agent development life cycle: reasoning models are a force multiplier on every step — we're able to be more effective applying AI to development, to testing, to QA, and to every step in between.

Another one that's near and dear to my heart — I mentioned the Deepgram workshop eight months ago, which was an accelerant in my understanding of the voice landscape — is building for voice. I started working on this about a year ago, and in October we launched voice as generally available at Sierra. One of our large customers that has benefited from the agent development life cycle — one with tens of millions of customers in the United States — is SiriusXM, and with Sierra's voice capabilities they're able to pick up the phone right away, every time, to answer their customers. The way we think about voice is similar to the way we think about web development today. If you remember 10 or 15 years ago, a lot of websites were m.website.com — you had two separate websites, one for mobile phones and one for desktops — and then we graduated to responsive design. That's how we think about our AI agents at Sierra too: under the hood it's the same platform and the same agent code, but it's responsive to whatever channel someone reaches out on and whatever modality you're operating in. Of course you can still customize, the same way you might have a different layout: different phrasing, parallelized requests to achieve lower latency — but it basically just works out of the box.

I'll close with a few thoughts on something I've been thinking about a lot lately. One of the most fascinating and fun parts of building with AI is that large language models remind us of ourselves: in short, they're unpredictable, they're slow, and they're not that great at math. But that also allows us to be great designers, by having empathy in a way we probably never could before with computers. You can actually put yourself in the shoes of the robot — in the primordial soup of the Jell-O, I don't know — and think about what it would mean to build a good experience. As someone building voice agents — and I bet a bunch of you in the audience are too — I know there's this question of whether multimodal agents are the real deal: should you just wire everything together and hope it works? The question I've been asking myself a lot lately, and what our results have kind of shown us, is: how would you do if someone just passed you transcribed text of your conversation partner, with a few hundred milliseconds of delay, and you had to respond on the spot? What we're building at Sierra is much more robust than that, and very exciting to me, and I hope to talk with you all about it — I think my badge says voice-to-voice models are the thing I'm excited about. So here is a sense of the robustness and richness of what you can create when you let large language models have the same inputs and experiences that humans have. Thank you for your time today — I look forward to a lot of engaging discussions, and it's great to talk to you all. [Music] [Applause]

Our next presenter
is a researcher at Morgan Stanley. Please join me in welcoming to the stage Will Brown. [Music] [Applause] Hello everyone — thanks to swyx and the whole AI Engineer conference team for putting this together and having me. I'm Will Brown, a machine learning researcher at Morgan Stanley, and today I want to talk to you a bit about what I think reinforcement learning, or RL, means for agents. I was in grad school at Columbia for a while, where I mostly worked on theory for multi-agent reinforcement learning, and over the past couple of years I've been working at Morgan Stanley on a wide range of LLM-related projects, some of which look kind of like agents — but I won't really be talking about that today. I'm also relatively active on X, the everything app, and that will become relevant later in the talk.

This talk will probably be a little different from most at the conference. It's not about things we shipped to prod, and it's not about proven science or best practices you should go do tomorrow. It's about where we might be headed. I want to tell a story that synthesizes some things that have been happening in the broader research community, point at where those trends might lead, do some speculation, and also talk about some recent open-source work of my own. The goal is to help you plan: to understand what reinforcement learning is, what it means for agents, and how to be ready for a potential future in which reinforcement learning is part of the agent engineering loop.

So, where are we today? Most of the LLMs we work with are essentially chatbots. I think it's helpful to use OpenAI's five-levels framework here. We did pretty well with chatbots, and it seems like we're doing pretty well with reasoners — these are great models for question answering, very helpful for interactive problem solving; we have o1, o3, R1, Grok 3, Gemini, etc. — models that are really good at thinking longer. Now we're trying to figure out how to take all of this and make agents, level three. These are systems that take actions, that do things that are longer and harder and more complex. Currently the way we tend to do this is by chaining together multiple calls to these underlying chatbot or reasoner LLMs: prompt engineering, tool calling, evals, ops, giving the models tools of their own to use, keeping humans in the loop. And the results are pretty good — there's a lot we can do — and then there's a lot of stuff that feels like it's around the corner, the things we all imagine about AGI. But we're not really at the point where these systems go off and do those things with the degree of autonomy that AGI would presumably entail.

I think it's useful to distinguish between agents and pipelines — Barry's talk earlier framed this well; I'm going to use "pipelines" to encapsulate what Barry called workflows. These are systems with fairly low degrees of autonomy, where a very non-trivial amount of engineering goes into determining the decision tree: how one action or call flows into another, how the prompts get refined. And it seems like a lot of the winning apps in the agent space have very tight feedback loops — whether or not you want to call
these agents or pipelines, these are things where a user interacts with some interface, tells it what to do, and the thing does some stuff and comes back relatively quickly: the IDEs like Cursor, Windsurf, and Replit, and the search tools that are really good at harder question answering, maybe with some web search or research integrated. There aren't many agents today that go off and do stuff for more than ten minutes at a time — Devin, Operator, and OpenAI's Deep Research are the three that really come to mind as feeling a little more like autonomous agents. And a lot of us might be wondering how we make more of these. The traditional wisdom is: just wait for better models; once better models are around, we'll use those and it'll be good. But it's also worth noting the traditional definition of reinforcement learning and what an "agent" means there: a thing that interacts with an environment with a goal, and a system designed to learn to get better at that goal over time via repeated interaction. And that's something a lot of us are either doing manually or don't really have the tools to do. Once we have our system set up to make the calls we want and performance is at 70%, and we've done a lot of prompt tuning and want to get to 90% but the models just can't get there — what's our path forward?

In terms of model trends — I won't spend too much time on this — pre-training seems to be having diminishing returns to capital; we're still seeing loss go down, but it does feel like we need new tricks. Reinforcement learning from human feedback is great for making friendly chatbots, but it doesn't really seem to keep pushing the frontier of smarter and smarter models. We talk a lot about synthetic data, and synthetic data is great for distilling larger models down into smaller, really performant models, but on its own it doesn't seem to unlock massive new capabilities — unless we throw verification or rejection sampling into the loop, and that takes us right back to the world of reinforcement learning. RL seems to be the trick that unlocked test-time scaling for the o1 models and R1; it isn't bottlenecked by needing manually curated human data, and it does seem to actually work.

I think we all took note about a month ago when DeepSeek released the R1 model and paper. That was really exciting because it was the first paper that really explained how you build a thing like o1 — we'd had speculation and some rumors, but they actually laid out the algorithm and the mechanisms for getting a model to learn to do this kind of reasoning. And it turns out it was essentially just reinforcement learning: you give the model some questions, you measure whether it gets the answer right, and you turn the crank of feedback — more like the things that worked well, less like the things that didn't. What you end up seeing is that the long chain of thought from models like o1 and R1 actually emerges as a byproduct
of this. It wasn't manually programmed in — the models weren't given data of 10,000-token reasoning steps. It's something the model learned to do because it was a good strategy, and reinforcement learning at its core is really about identifying good strategies for solving problems. It also seems like open-source models are back in a big way — there's a lot of excitement in the open-source community, with replication efforts for the o1 project and work distilling data from o1 down into smaller models.

So what next — how does this relate to agents? It helps to know a little about how reinforcement learning works. The key idea is explore and exploit: try stuff, see what works, do more of the things that worked and less of the things that didn't. In the feedback loop demonstrated here in the image, a model is supposed to write code that passes test cases, and we give it rewards corresponding to things like formatting, using the right language, and ultimately whether the test cases pass. It's a numerical signal: rather than training on data we've curated in advance, we let the model do synthetic-data rollouts, score those rollouts, and feed the scores back into the model. The GRPO algorithm, which some of you may have heard about, is the algorithm DeepSeek used. I think it's less of a technical breakthrough in the sense of being a really important new algorithm to study, but it's conceptually very simple, and it's a nice way to think about what reinforcement learning means. The idea is just that, for a given prompt, you sample N completions, you score them all, and you tell the model to be more like the ones with higher scores. This is still in the single-turn, reasoner-model, non-agentic world.
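In code, that "be more like the higher-scoring ones" signal is just a group-normalized advantage. Here's a minimal sketch of that scoring step, leaving out the clipped policy-gradient loss and KL penalty that the full GRPO algorithm wraps around it:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages, the heart of GRPO.

    rewards: shape (num_prompts, N) — scores for the N completions sampled
    per prompt. Completions scoring above their group's mean get a positive
    advantage ("do more of this"); below-mean completions get negative.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-4)  # epsilon guards zero-variance groups

# Example: one prompt, four sampled completions scored by a verifier.
print(grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 2.0]])))
```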
The challenges that lie ahead are about how we take these ideas and extend them into more powerful, more agentic, more autonomous systems — but we do know it can be done. OpenAI's Deep Research still has a lot of questions we don't know the answers to about how it works, but they have told us it was end-to-end reinforcement learning. That's a case where the model takes up to potentially a hundred tool calls — browsing, querying different parts of the internet — to synthesize a long answer, and by many people's vibe-check opinions it's very impressive. But it's also not AGI, in the sense that you can't get it to go work in a repo or solve hard software engineering tasks, and people have anecdotally found it struggles a bit on out-of-distribution tasks, or if you want it to fill out a table with a hundred very manual calculations. So reinforcement learning, on one hand, is a big unlock for new skills and more autonomy, but it hasn't so far granted us agents that can just do everything and solve all kinds of problems. It is, though, a path forward for teaching a model skills, and for having the model learn to get better at certain skills, particularly in conjunction with environments, tools, and verification.

There is infrastructure out there for doing this on our own, but a lot of it is still RLHF-style, by which I mean it's about single-turn interactions where the reward signals come from human data that has been combined into a reward model. If we want RL agents to become part of our systems, maybe we'll get really good API services from the large labs that let us build these things — hook into GPT-whatever or Claude-whatever and train these sorts of models on our own with fine-tuning — but we don't really have those options yet. OpenAI has teased their reinforcement fine-tuning feature, but it isn't multi-step tool calling yet. So if we want to plan ahead, it's worth asking what this ecosystem would look like, and there are a lot of unknowns: how much will it cost, how small can the models be, will it generalize across tasks, and how do we design good rewards and good environments? There's a lot of opportunity here — open-source infrastructure, with a lot of room to build and grow and determine what the best practices and the right tools are going to be, and companies that can build tooling to support this ecosystem, whether or not they're already in the fine-tuning world, plus services for supporting this kind of agentic RL. It's also worth thinking about things that are not literal RL, in the sense of training the model, but operate at the prompt level — there's all sorts of automation we can do there. If you've used DSPy, I think that's adjacent to RL in flavor: you have a signal you can bootstrap from to improve your underlying system based on downstream scores.

Now I want to share a story about a single Python file I wrote a couple of weeks ago. This was the weekend after R1 came out. I'd been reading the paper and thought it was really cool — we hadn't had the Nvidia stock crash quite yet — and I was just playing around with some experiments. I took the Hugging Face trainer implementing the GRPO algorithm and got a really small language model, a Llama 1B, to do some reasoning and then give an answer for math questions. I started with a pretty simple system prompt, trained the model just to see what it would do, and manually curated some rewards for what the scoring function should look like, and I just kind of tweeted it out — with an example of the model looking like it's doing some self-correction, and showing that accuracy gets better while the response length initially drops, as the model learns to follow the format, then goes back up as it learns to take advantage of longer chains of thought for its reasoning. This was not the first replication in any sense — I wouldn't really call it a true replication, and it was far from the most complicated — but it caught a lot of people's imaginations and became kind of a thing. Over the next two weeks it took on a life of its own: people were tweeting about it, forking it, making modifications, making it run in a Jupyter notebook, making it more accessible, writing blog posts about it. It was interesting, because to me it didn't feel like a thing that merited this level of excitement. What I think caught people's imagination was that it was one file of code, it was really simple, and it invited modification in a very user-friendly, engaging way, through what I like to call rubric engineering. The idea of rubric engineering is that, similar to prompt engineering, when a model does reinforcement learning it's going to get some reward — and what should that reward be? In the most simple version it's just: did it get the question right or wrong, does A equal B? But there's a lot more you can do beyond that, and the single file of code exposed examples of this, where you give the model points for things like following an XML structure — if it gets a certain tag right, you give it a point; and if it gives an integer answer that's still the wrong answer, it has at least learned that the format should be an integer, so it gets some points for that.
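In the spirit of that file (not a verbatim excerpt from it), a rubric is just a handful of small reward functions summed together — partial credit for structure, partial credit for format, full credit for a correct answer:

```python
import re

def format_reward(completion: str) -> float:
    """Partial credit for following the <reasoning>/<answer> XML structure."""
    score = 0.0
    if "<reasoning>" in completion and "</reasoning>" in completion:
        score += 0.5
    if "<answer>" in completion and "</answer>" in completion:
        score += 0.5
    return score

def answer_reward(completion: str, target: str) -> float:
    """Full credit for the right answer; a little for the right format."""
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    answer = m.group(1).strip()
    if answer == target:
        return 2.0
    if answer.lstrip("-").isdigit():
        return 0.5  # wrong, but it learned the answer should be an integer
    return 0.0

def rubric(completion: str, target: str) -> float:
    """Total reward for one rollout, fed back into the RL trainer."""
    return format_reward(completion) + answer_reward(completion, target)
```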
There's a lot of room here for getting creative — for designing rules that aren't just downstream evals, there for our own sake to know whether a thing is working, but that let the model itself know whether it's working and use that as feedback for going further and training more. This is all very early stages; there's a lot we don't know, and there's a lot of opportunity to get creative, explore, and try things out — using LLMs to design these rubrics, auto-tuning the rubrics or your prompts with frameworks like DSPy, incorporating LLM judges as part of the scoring system. And reward hacking is an issue to be very cautious of: you want to ensure that the reward you're using actually captures the goal, and that it doesn't have back doors where the model can cheat and do something else that ultimately gets it a super-high reward without learning to do the actual task.

Following this, I've been trying to learn from those lessons — from what I saw people doing out in the wild — and make something a little more robust and usable for actual projects, beyond just one file of code. This has been a very recent effort; it's not a thing I'm telling you to go use for all your problems tomorrow, but it's my attempt at some open-source research code that will help people try these things out more easily and answer some of these questions. What it really is is a framework for doing RL inside of multi-turn environments. The idea is that lots of us have built these great agent frameworks for using API models, and the hope is that we can leverage those existing environments and frameworks to actually do RL. You create an environment object that the model plugs into — you don't have to worry about the weights or the tokens; you can just write an interaction protocol — and that gets fed into a trainer. Once you've built the environment, you can just let it run, and you have a model that, given some reward signal, learns to get better and better over time.
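The interface below is hypothetical — the names are invented for illustration, not the actual project's API — but the shape of the idea, an interaction protocol with no weights or tokens in sight, looks roughly like this:

```python
from dataclasses import dataclass

@dataclass
class Step:
    observation: str  # text appended to the model's context next turn
    reward: float     # rubric score for the turn (often 0 until the end)
    done: bool        # is the episode over?

class MultiTurnEnv:
    """Interaction protocol the RL trainer drives.

    You implement reset/step; the trainer owns sampling, weights, tokens.
    """

    def reset(self, task: str) -> str:
        """Start an episode; return the initial prompt/observation."""
        raise NotImplementedError

    def step(self, model_message: str) -> Step:
        """Apply the model's message (e.g. a tool call) and score it."""
        raise NotImplementedError
```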
To conclude, I want to talk about what I think AI engineering might look like in the RL era. This is all still very new. We don't know whether the off-the-shelf API models will just work for the tasks we throw at them — it might be the case that they do; it might be the case that they don't. One reason I think they might not be the entire solution is that it's really hard to include a skill in a prompt. You can include knowledge in a prompt, but a lot of us, when we try something, don't nail it the first time; it takes a little trial and error. And it seems to be the case that models are like this as well: a model gets better at a thing, and really gets a skill nailed down, by trial and error, and that has been the most promising unlock we've seen so far for these higher-autonomy agents like Deep Research. Fine-tuning might still be important too. A lot of people wrote off fine-tuning for a while, because open models were far enough behind the frontier that a prompted frontier-model API would just beat a smaller fine-tuned model. But one, we're now seeing the open- and closed-source gap become close enough that this is less of a concern — a lot of people are using open-source hosted models in their platforms — and two, the truest version of RL, the kind DeepSeek did for their R1 model and that OpenAI has talked about doing for Deep Research, requires actually doing some reinforcement learning. There are a lot of challenges here, and a lot of research questions we don't know the answers to, but the skills we've learned doing AI engineering over the past couple of years translate very directly: the challenge of building environments and rubrics is not that different from the challenge of building evals and prompts. We still need good monitoring tools; we still need a large ecosystem of companies, platforms, and products that support the kinds of agents we want to build. So I think all the stuff we've been doing is going to be essential — and it's worth looking ahead a little, in case we end up in a world where we have to do a bit more reinforcement learning to unlock things like true autonomous agents, or innovators, or organizations powered by language models. What does that look like? We will find out. [Applause] [Music]

Ladies and gentlemen, please welcome back to the stage the MC for the AI Engineer Summit agent engineering day, the founder and CEO of Superintelligent, NLW. [Music] All right — awesome first session. Thank you all for being here, thank you, Will, for a great way to close us out, and thanks to all the other great presenters as well. A quick clarification before I let you go: tonight there is no on-site afterparty. The Expo closes at 4 p.m. and the venue closes at 6 p.m.
However, there are a number of affiliated events — check the website homepage for info and RSVP instructions; it's all there. But again: the Expo closes at 4, this venue closes at 6, so we'll be wrapping up conversations and evening plans at around 5:30. With that, we wrap block one. We're going to take a 30-minute break now. If you want to have discussions with the speakers, the Q&A lounges are available to meet them — the first one is on the first floor, to the right as you exit the theater, and there are two downstairs as well. We also recommend making time to stop by the sponsor Expo: you'll find coffee, snacks, and the amazing products and services of our sponsors. Thank you very much, and we'll see you back here in about half an hour.

[Music] [Applause]

Ladies and gentlemen, please welcome to the stage the MC for the AI Engineer Summit agent engineering day, founder and CEO of Superintelligent, NLW. [Music] All right, welcome back to another excellent session. This sprint is really, really interesting: we have sessions from Jane Street about how they do AI engineering, challenges in scaling agents from Bloomberg, and a session on trusting but verifying from Brightwave. Kicking it off, we'd like to welcome to the stage Brennan Rosalez, to talk about agents and investment management in Aladdin Copilot from BlackRock. [Applause] [Music]

Our next presenter is a software engineer at Jane Street, presenting how they build AI-powered developer tools. Please join me in welcoming to the stage John Crepezzi. [Music] [Applause] Sorry — my name is John Crepezzi, and I work on a team at Jane Street called AI Assistant. Our group, roughly, is there to try to maximize the value Jane Street can get from large language models. I've spent my entire career in dev tools: before Jane Street I was at GitHub for a long time, and before that I worked at a variety of other dev-tools companies. LLMs present this really amazing opportunity in that they're so open-ended that we can build almost anything we can imagine, and it seems like right now the only thing moving faster than the progress of the models is our creativity about how to employ them. At Jane Street, though, we've made some choices that make adoption of off-the-shelf tooling a little more difficult than it is for other companies, and the
biggest reason we have this problem is that we use OCaml as our development platform. For those not familiar with OCaml, it's a functional, very powerful language — but also an incredibly obscure one. It was built in France, and its most common applications are in things like theorem proving and formal verification; it's also used to write programming languages. We use OCaml for basically everything at Jane Street. A couple of quick examples: when we write web applications — which of course have to run as JavaScript — we instead write OCaml and use a library called js_of_ocaml, essentially an OCaml-bytecode-to-JavaScript transpiler. When we write plugins for Vim, which have to be written in Vimscript, we use a library called VCaml — again, an OCaml-to-Vimscript transpiler. And even the people working on FPGA code aren't writing Verilog; they're writing in an OCaml library called Hardcaml.

So why are the tools available on the market not good for working with OCaml? It comes down to a few primary reasons. The first, and most important, is that the models themselves are just not very good at OCaml. That's not the fault of the AI labs; it's a byproduct of the amount of data that exists for training. There's a really good chance that the amount of OCaml code we have inside Jane Street is more than the total combined amount of OCaml code that exists in the world outside our walls. The second is that we've made things really hard on ourselves, partially as a byproduct of working in OCaml: we've had to build our own build systems, our own distributed build environment, and even our own code review system, called Iron. We develop all of our software in a giant monorepo, and just for fun, instead of storing that monorepo in Git, we store it in Mercurial. And at last count, 67% of the firm uses Emacs instead of more common editors like VS Code — we do have people using VS Code, but Emacs is the most popular. The last reason is that we're dreamers — hopefully everyone in this room is, in a way. What I mean is that we want the ability to take LLMs and apply them to different parts of our development flow: maybe use large language models to resolve merge conflicts, or to build better feature descriptions, or to figure out who the reviewers for a feature should be — and we don't want to be hampered by the boundaries between different systems when we do that.

Over the next 15 minutes I'm going to cover our approach to large language models at Jane Street, particularly when it comes to developer tools. I'll talk about the custom models we're building and how we build them; about editor integrations — the integrations into VS Code, Emacs, and Neovim; and about the ability we've built over time to evaluate models and figure out how to make them perform best. At first glance it's not really obvious that training models at all is a good idea. It's a very expensive proposition, it takes a lot of time, and it can go wrong in a ton of different ways. Who here has trained a model before, or tried to — maybe taken a foundation model and trained on top of it? Cool. We became more convinced after we read a paper from Meta about a project called CodeCompose, in which they
detailed their results fine-tuning a model specifically for use with Hack. Hack is actually pretty similar to OCaml — not in its syntax or function, but in the fact that it's used primarily at one company and not much outside that company, even though it's open source. (Fun fact: Hack is implemented in OCaml; I think that's a total coincidence.) We were pretty naive early on. We read this paper and decided it would be really cool to replicate the results: we thought we'd take a model off the shelf, show it a bunch of our code, and get back a model that worked like the original but knew about our libraries and idioms. It turns out that's just not how it works — it's not that easy. To get good outcomes, the model has to see a bunch of examples in the shape of the kind of question you want to ask it. So we first needed a goal, a thing we wanted the model to be able to do, and the goal we came up with was this: we wanted to generate diffs given a prompt. A user inside an editor should be able to write a description of what they want to happen, and the model should suggest a potentially multi-file diff — maybe you want to modify the test file, an .ml file, and an .mli file, which is kind of like a header file. We wanted the diffs to apply cleanly, and we wanted a good likelihood that they'd type-check after being applied. We were targeting a range of up to 100 lines as the ideal zone of what we thought LLMs were actually capable of. And for that to work, we needed to collect data — training data of the same shape as the test-time task. For this task, that shape looks like: the context the model would have had beforehand; a prompt describing what you want the model to do, written, hopefully, the same way a human would write it; and a diff that accomplishes that goal. Context, prompt, diff — and we need a bunch of these examples.
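As a sketch of that data shape (the field names here are mine, not Jane Street's), each example might be represented as:

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    context: str  # code and build state visible to the model beforehand
    prompt: str   # editor-style request, e.g. "fix the type error in foo.ml"
    diff: str     # multi-file diff that accomplishes the request

    def in_target_range(self, max_lines: int = 100) -> bool:
        """Keep examples inside the ~100-line zone the talk describes."""
        changed = [l for l in self.diff.splitlines()
                   if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
        return 0 < len(changed) <= max_lines
```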
So how do we get these training examples? The first place to look is features. Features live in Iron, the code review system I mentioned we built internally — this is what it looks like. Features are very similar to pull requests; you can just swap that term in your head. And at first glance, features have exactly the data we want: on the description tab, a human-written description of a change, and on the diff tab, the code that accomplishes the developer's goal. But on closer inspection, they're not exactly what you want. The way you write a feature description — or a pull request description — is really very different from what you might say inside an editor. You're not writing multiple paragraphs in the editor; you're saying something like "fix that error that's happening right now," and that's just not how we write feature descriptions. Another problem with features, or pull requests, is that they're really large — often 500 or 1,000 lines — so to use them as training data we'd need an automated way to pull features apart into individual smaller components we could train on. So we need something smaller than features.

What about commits? Commits are smaller chunks than features. This is what a typical commit log looks like at Jane Street — this is not a git shortlog; I want you to look at this as an actual log, and where it says "summary z," that's my commit message. We don't really use commits the way the rest of the world uses them: we use them mostly as checkpoints between different parts of a development cycle that you might want to revert back to. Commits don't have a description, and they have the same problem in that they're not isolated changes — they're collections of changes.

What we actually ended up with is an approach called workspace snapshotting. The way it works is that we take snapshots of developer workstations throughout the workday — think every 20 seconds — and as we take the snapshots, we also take snapshots of the build status: is the build running on that box green, or red, and if red, what's the error? And we can notice little patterns. A green-to-red-to-green sequence often corresponds to a place where a developer has made an isolated change: you start writing some code, you break the build, and then you get it back to green. And a red-to-green transition is a place where the developer encountered an error — whether a type error or a compilation error — and then fixed it. So if we capture the build error at the red state, and the diff from red to green, we can use that as training data to help the model learn to recover from mistakes.
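Purely as an illustrative sketch — Jane Street's snapshotting system is internal, and every name below is invented — mining those red-to-green transitions into training examples might look something like this:

```python
import difflib
from dataclasses import dataclass

@dataclass
class Snapshot:
    workspace: dict[str, str]  # path -> file contents at this instant
    build_status: str          # "green" or "red"
    build_error: str | None    # compiler/type error captured when red

def mine_recovery_examples(snapshots: list[Snapshot]) -> list[dict]:
    """Turn red->green transitions into (error, fix-diff) training pairs."""
    examples = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        if prev.build_status == "red" and curr.build_status == "green":
            examples.append({
                "context": prev.workspace,        # state the developer saw
                "build_error": prev.build_error,  # captured at the red state
                "diff": compute_diff(prev.workspace, curr.workspace),
            })
    return examples

def compute_diff(old: dict[str, str], new: dict[str, str]) -> str:
    """Unified diff across all changed files (simplified stand-in)."""
    chunks = []
    for path in sorted(set(old) | set(new)):
        a, b = old.get(path, ""), new.get(path, "")
        if a != b:
            chunks.extend(difflib.unified_diff(
                a.splitlines(), b.splitlines(),
                fromfile=path, tofile=path, lineterm=""))
    return "\n".join(chunks)
```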
So now we have this training data — and training data is only half the picture of training a model. You have the supervised training data, and then you need the second part, which is the reinforcement learning. This is really where models get a lot of their power: we align the model's ability to what humans think is actually good code. So what is good code? On the surface, good code is code that parses — if a piece of code doesn't go through the OCaml parser and come out with a green status, that is not good code by most definitions. Good code in OCaml, because it's statically typed, is code that type-checks: we want good code to be code that, when applied on top of a base revision, can go through the type checker, and the type checker agrees the code is valid. And of course the gold standard is that good code compiles and passes tests. So ideally, during the reinforcement-learning phase, you give the model a bunch of tasks that are verifiable: the model performs some edit, and then we check whether it actually passes the tests when applied to the code. We did that as part of our training cycle, and we built this thing called CES, the Code Evaluation Service. You can think of it kind of like a build service, except with a slight modification to make it much faster: first we pre-warm a build, so it sits at a revision and is green. Then we have workers that, all day long, take diffs from the model, apply them, determine whether the build status turns red or green, and report that error or success back up. Through continued use of this service over the course of months, we've been able to better align the model to write code that actually does compile and pass tests. It turns out this exact same setup is the one you'd want for evaluation: if you just hold out some of the RL data, you can use it to evaluate a model's ability to write code. It looks like this: you give the model a problem, you let it write some code, and then you evaluate whether the code it writes actually works.
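In sketch form, that verifiable reward (and held-out evaluation) signal might look like this — hypothetical function names, assuming some build-and-test harness along the lines just described:

    def apply_diff(workspace: str, diff: str) -> None:
        # Stand-in: patch the pre-warmed green checkout with the model's edit.
        raise NotImplementedError

    def build_and_test(workspace: str) -> tuple[str, str]:
        # Stand-in: run the build and tests; return ("green", "") on
        # success or ("red", error_text) on failure.
        raise NotImplementedError

    def score_edit(prewarmed_workspace: str, model_diff: str) -> float:
        # Reward signal for RL (and, on held-out tasks, an eval metric):
        # 1.0 if the edit leaves the build green, 0.0 otherwise.
        apply_diff(prewarmed_workspace, model_diff)
        status, _error = build_and_test(prewarmed_workspace)
        return 1.0 if status == "green" else 0.0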
And training is hard — it can have catastrophic but hilarious results. At one point we were training a code review model. This is a totally separate model, but the idea was that we'd give some code to this model and have it do a first pass of code review, just like a human would, to save some of the toil of code review. We trained this model, we put a bunch of data into it, we worked on it for months, we were real excited, and we put our first code in for review through the automated agent. It spun for a bit, and it came back with something along the lines of: "I'll do it tomorrow." And of course it did that — it's trained on a bunch of human examples, and humans write things like "I'll do this tomorrow" — so it's not very surprising. Having evaluations that are meaningful is a cornerstone of making sure models don't go off the rails like this, and that you don't waste a bunch of your time and money. In the end, though, the real test of models is whether they work for humans, so I'm going to talk a little bit about the editor integrations we've built to expose these models to developers at Jane Street. When we started building these integrations, we had three ideas in mind. The first: we support three editors — Neovim, VS Code, and Emacs — and we really don't want to write the same thing three times. Ideally we write the context-building strategies and the prompting strategies once. The second was that we wanted to maintain flexibility: the model we were using at the time was not a fine-tuned model, but we were pretty convinced a fine-tuned model was in our future, and we wanted the ability to swap the model or the prompting strategy out. And last, we wanted to be able to collect metrics: in the editor, developers care about latency, and they care about whether the diffs actually apply, so we wanted on-the-ground evidence of whether the diffs really were meaningful for people. This is the simplified version of the architecture we settled on for this service — Aid, the AI Development Environment. Essentially you have LLMs on one side, and Aid handles the ability to construct prompts, construct context, and see the build status; then we write really thin layers on top of Aid for each of the individual editors. What's really neat about this is that Aid sits as a sidecar application on the developer's machine, which means that when we want to make changes to Aid, we don't have to make changes to the individual editors and hope people restart them — we just restart the Aid service on all of the boxes, and everyone gets the most recent copy. Here's an example of Aid working inside VS Code: this is the sidebar in VS Code, very similar to something like Copilot, except this lets you ask for, and get back, multi-file diffs, and it looks like what you'd expect in VS Code — a visual interface that lays things out really nicely. This is what we built in Emacs, though. Emacs developers are used to working in text buffers: they move around files, and they want to copy things the normal way they copy things. So we built the Aid experience in Emacs into a markdown buffer: users can move around inside the buffer, they can ask questions, and there are keybinds that append extra content to the bottom of the buffer. Aid's architecture lets us plug various pieces in and out, like I mentioned: we can swap in new models, make changes to the context building, add support for new editors — which probably sounds far-fetched, but it's something we're actually doing right now — and we can even add domain-specific tools, so different areas of the company can supply tools that end up available in all of the editors without writing individual integrations. Aid also allows us to A/B test different approaches: we can send 50% of the company to one model and 50% to another, and determine which one gets the higher acceptance rate. Aid is an investment that pays off over time: every time something changes in large language models, we change it in one place, downstream of the editors, and it's available everywhere — and things change really often; we need to be ready when they do. What I've had time to show you today is only a small portion of what my team is doing. We're finding new ways to apply RAG inside the editors, we're applying similar approaches to large-scale multi-agent workflows, and we're working with reasoning models more and more. The approach is the same through all of these: we keep things pluggable, we lay a strong foundation to build on top of, and we build ways for the rest of the company to add to our experience with more domain-specific tooling. If you think what I've said is interesting and you want to talk more, I'd love to hear from you — you can find me outside. Thank you for your time. [Applause] [Music] Next up, the head of AI engineering at Bloomberg is here to present challenges in scaling agents for generative AI products. Please join me in welcoming to the stage Anju Kambadur. [Music] [Applause] Oh man, it's really hard to see my photo that big — or that small. Thank you so much for inviting me. As I was trying to think about what would make a good topic for this talk, the organizers were really nice, so a lot of what you'll hear today was influenced by what they thought was important — there really are so many exciting things happening in the agentic landscape. So let's get started. The first thing: in late 2021, I think LLMs were really starting to capture the imagination. As a company we've been investing in AI for almost 15 or 16 years, so we decided we'd build our own.
We built our own large language model — it took all of 2022 — and in 2023 we wrote a paper about it. We learned a lot about how you build these models: how you organize data sets, how evaluation works, how you coax performance in certain zones. But then ChatGPT happened, and the open-weight and open-source community has come along so beautifully that, while we continue to do very similar work, as a strategy we pivoted to building on top of whatever is available out there. We have many, many different use cases, so we pretty much pivoted to "we'll build on top" — if that helps you understand how we're doing things, there you go. The other thing people are curious about is how exactly a company like Bloomberg organizes its AI efforts. I report to the global head of engineering, and we're organized somewhat as a special group, if you will. We work a lot with our data counterparts — Bloomberg is a really strong, large data organization, which, as you can appreciate by now, helps us out a lot — and we work with product and the CTO in cross-functional settings: about 400 people, 50 teams, in London, New York, Princeton, and Toronto. So that's a little bit about our group. Okay — we've been building products using generative AI, starting with tools and getting more agentic, for 12 to 16 months now, and the effort has been really, really serious, so there have been many things we've had to solve in order to build something using what's available today. I decided other speakers would cover a whole set of those topics, so I'm not going to talk about them at all — there are some wonderful talks on them, and I'll try to hang around a bit afterwards. I'm really bullish on the developments in every one of those challenges; it gets easier and easier to solve them. So please don't read this as pessimistic — it's just realistic. I need to build and ship things today, and that means these are the things I need to deal with today. Internally, it was really hard to say what's an agent and what's a tool, because everyone had their own vocabulary — and then this really nice paper came out, "Cognitive Architectures for Language Agents." If you haven't read it, you should. So when I say "tool" today, I mean the left-hand side of that spectrum, and an "agent" is more autonomous — it has memory, it can evolve. Whenever I say "agentic," it's the right-hand side of the spectrum; the other is the left-hand side. That's the vocabulary I'll use. Finally, to set the stage: I don't know how many of you know about Bloomberg — I certainly didn't know as much as I do today when I joined. We're a fintech company, as you can imagine from my nice jacket (or jumper), and our clients are in finance. But finance is a very diverse field, so I've listed here 10 different archetypes of people in finance; they do very different activities, but also a lot of similar ones. A short form of what Bloomberg does: we both generate and accumulate a lot of data, unstructured and structured — news, research, documents, slides — we also provide access to websites, and there's a lot of reference data and market data coming in.
Just to give you a sense of the scale: every day we get 400 billion ticks of structured data, over a billion unstructured messages, and millions of well-written documents, including news — and that's just every day; we have over 40 years of history on it. So when we say we offer information as one of the things we provide our clients, this is the scale at which we're working. For the rest of this talk — we're building a very broad set of products, so to focus things, I'll talk about one particular archetype: the research analyst. If you don't know what a research analyst does, here's a short course. A research analyst is typically an expert in a particular area — think: I'm a research analyst in AI, or semiconductors, or technology, or electric vehicles — and the kinds of things they need to do on a daily basis are written at the bottom. On the left-hand side, they do a lot of work in search, discovery, and summarization — a lot of work with unstructured data. In the middle segment, they do a lot of work with structured data and analytics. They reach out to their colleagues, both to disperse and to gather information, so there's a lot of communication. And some of them are also building models, which means they need to normalize data and actually program and generate models as well. So that's a research analyst in a nutshell. The other piece is that, because we've been in finance since our founding 40 years ago, there are some aspects of our products that are non-negotiable. Those include things like precision, comprehensiveness, speed, throughput, and availability, and principles like protecting our contributors' and clients' data and making sure there's transparency throughout whatever we build. These are non-negotiable whether you're using AI or not, and they should ground you in the kinds of challenges we face when we use what's available today to build agents. Okay, so what was the first thing we did? Again, 2023 is when I think we got serious, and the first thing we did, in the zone of helping the research analyst community, was this: public companies have scheduled quarterly calls to discuss the health of the company and talk about their future. It's a conference call; a lot of analysts attend; there's a presentation by the company's executives, and then a Q&A segment. During earnings season, many of these calls happen on any given day — and I told you a research analyst has to stay on top of what's happening every single day. Transcripts of these calls need to be generated — again, AI is used there — and in 2023 we saw an opportunity: for every company operating in a particular sector, we know what kinds of questions are of interest, and maybe we can answer them for the analyst to look at, so they can decide whether they want a deeper dive. Seems like a simple product — but again, I'm talking about work that started in 2023, and given where the technology was, we still needed to do a lot to bring it to market while keeping our principles and features in place. So what did that mean? Just focus on the right-hand side, if you will.
Performance out of the box was not great — precision, accuracy, factuality, things like that — and for those of you interested in MLOps, there was a lot of work done just to build remediation workflows and circuit breakers. Remember, these summaries are not somebody privately chatting with a transcript: they're actually published, everyone sees the same summary, and any error has an outsized impact for us. So we constantly monitor performance and remediate, and the summaries get more and more accurate — a lot of monitoring goes on behind it, and a lot of CI/CD as well. Okay, so today, how does the agentic architecture of the products we're building look? First of all, it's semi-agentic — this is an opinion — because we don't yet fully have the trust that everything can be autonomous. Some pieces are autonomous; others are not. Guardrails are the classic example: Bloomberg doesn't offer financial advice, so if someone starts with "hey, should I invest in...," you need to catch it. We also need to be factual — that's again a guardrail. Those are not optional pieces for any agent; they're coded in as "you must do this check." Keep this image in mind — it'll come back. Okay, this is a talk about scaling, so with that long runway, let's get to scaling. I want to cover two aspects of scaling, and I'm hoping both will be more of a confirmation than a surprise to you. The first: if you want to build agents, you want each agent to evolve really quickly — because when you build it the first time, unless you're a magician, it's going to suck a bit, and then it needs to improve, and improve, and improve. So how do you get there? Let's go back to how some really good software is built. When I was a grad student, I used matrix multiplication a lot. This is a snapshot of the generalized matrix-matrix product, and if you read the API documentation, it lays out every aspect of the input and every error code — even how long it will take is in the documentation. It just works. And when you build software on top of such well-documented, well-written software, your software also tends to be robust; your products tend to be robust. Even from 20 years ago, when we started using machine learning to build products, there are tools — APIs with models, or pipelines of models, behind them — where, as a caller or a person downstream of such APIs, there's a bit of stochasticity (if I can pronounce it correctly): you don't quite know what the result will be, and you don't quite know whether it will work for you, despite best intentions in establishing what the input and output distributions are. There's always a bit of stochasticity. It was still okay to work with these, and I'll tell you why. But when you move to using LLMs — and agents, which are really compositions of LLMs — the errors multiply a lot, and that causes a lot of fragile behavior. We'll take a look at it, and I hope my answer on how to avoid the fragility is mildly surprising to you.
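To see why composition hurts, here's a quick back-of-the-envelope illustration of the "errors multiply" point, assuming each call's errors are independent:

    # If each call in a pipeline is independently correct with
    # probability p, a chain of n calls is correct with probability p ** n.
    for p in (0.99, 0.95, 0.90):
        for n in (1, 5, 10):
            print(f"p={p:.2f}, n={n:2d} -> chain accuracy {p ** n:.2f}")
    # e.g. p=0.95, n=10 -> chain accuracy 0.60

A per-call accuracy of 95% sounds fine, but ten chained calls are right only about 60% of the time. The 2009 story that follows shows how much narrower the problem used to be.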
In 2009 we built a news sentiment product — basically, to detect whether a piece of news about a given company would be beneficial for that company or not. On the input side, we knew which newswires we were monitoring, we knew which language the news was in, and newswires have editorial guidelines on how they write things. So while the API that sits in front of the model is not as clean as matrix-matrix multiply, you still have a very decent handle on what's coming into your system — and the outputs are pretty much just minus one to plus one, so the output space is also very easy. We built the training data from scratch, so we knew the training data; we could have really nice held-out test sets in time and in space; we could establish the risk of deploying the model; and we could monitor it. Despite all of these guardrails being present, we still ended up doing a lot of out-of-band communication with anyone downstream of us. For example, if you were consuming our stream of sentiment output, we'd give you a heads-up: the model version is changing; if you have a downstream application using this as a signal, you'll want to test it — things like that. That landscape has changed a lot. When you're building agentic architectures, you want to make improvements to your agents every single day; you don't want a release cycle built purely around batch regression tests, because there are so many customers downstream of you who are also making independent improvements. I'll give you one small example. One of the workflows we have agents for — for a research analyst — involves structured data. The question here is "US CPI for the last five quarters" (Q is just quarter). There's an agent that deeply understands the query, figures out which domain to dispatch it to, and then uses a tool — there's an NLP front end to the tool — to fetch the data. It turns out the data is wrong — which is why you need the guardrails. The data is wrong because one character was missed: it fetched monthly data instead of quarterly data. A good research analyst would catch that, but if you're building a downstream workflow where you're not even exposing the table — you're just looking at an answer that says "well, it looks like the answer is 42" — it's really hard to catch these compounding errors. That's why it's easier not to count on upstream systems being accurate, but rather to factor in that they will be fragile and evolving, and to do your own safety checks. Even within my own org, where people operate independently, every new version of the data-and-analytics API tool is better and better — but being better means better on average; it doesn't mean better for you as a downstream consumer. So building in this kind of guardrail is just good sense, and it almost makes you go faster: as you factor out individual agents, each agent can evolve without the handshake signals of every downstream caller having to understand what's changed and sign off before I can promote my new agent to beta or production. I think we just need to change that mindset and be more resilient. So that's one.
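As a sketch of that "do your own safety checks" idea — hypothetical names and a simulated upstream, not Bloomberg's actual system — a downstream consumer can verify basic invariants of whatever a tool returns before trusting the answer:

    def fetch_cpi(periodicity: str, count: int) -> list[dict]:
        # Stand-in for the upstream data-and-analytics tool call. Here it
        # simulates the failure mode from the talk: monthly rows come
        # back even though quarterly data was requested.
        return [{"period": "2024-07", "value": 314.2},
                {"period": "2024-08", "value": 314.8}]

    def validated_cpi(periodicity: str, count: int) -> list[dict]:
        rows = fetch_cpi(periodicity, count)
        # Check 1: we asked for `count` periods, so we should get `count` rows.
        if len(rows) != count:
            raise ValueError(f"expected {count} rows, got {len(rows)}")
        # Check 2: quarterly periods should be labeled as quarters, which
        # catches the monthly-instead-of-quarterly mistake upstream.
        if periodicity == "quarterly":
            bad = [r["period"] for r in rows if "Q" not in r["period"]]
            if bad:
                raise ValueError(f"non-quarterly periods returned: {bad}")
        return rows

    # validated_cpi("quarterly", 2) raises instead of silently passing
    # monthly numbers to whatever consumes the answer downstream.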
The second thing — as much as I used to code, one fine day long, long ago, I became a manager — so I thought I'd talk about org structure, and I don't know how many of you will resonate with this. At Bloomberg, like I said, we've been building these things for about 15 years, and traditional machine learning has a particular factorization of software; that software factorization is then reflected in the org structure. If you're lucky, you have the reverse Conway maneuver working for your design — but you really need to rethink it as you start using different tech stacks and building different kinds of products. What do I mean? How many agents do you want to build, and what should each agent do? Should agents have overlapping functionality or not? These are basic questions, and it's very tempting to say: let's just keep our current software stack, or our current org structure, and build on top of that. On the columns here, the first two columns are vertically aligned teams and the next two are horizontally aligned teams, with some properties in the rows — and we've actually done some reorgs. What we've learned is that in the beginning, you don't really know much about what the product design is going to be, and you want to iterate fast. It's just easier to collapse the org and collapse the software stack and say: here's a team, go build what needs to be built and figure things out. That's where you want really fast iteration and sharing of code, data, and models. The more you understand a single product or a single agent — what it's used for, what it's good at, what it isn't — and once you're actually building many of these agents, that's when you go back to the foundations of building good software and good orgs. You want optimization: increase performance, reduce cost, make it more testable, more transparent. That's where you move into the bottom-right corner of the grid, where you do have some horizontals. In our case, guardrails are horizontal: we don't want every one of those 50 teams trying to figure out what it means to reject user inputs that are thinly veiled requests for financial advice. That's something you do horizontally. But you have to figure out for yourself the right time for you and your organization to start creating horizontals, and to start breaking monolithic agents — which are reflected, again, in your org structure — into smaller and smaller pieces. All that said and done, for the running example of a research agent, here's how it looks today: taking in the user's world and session context, deeply understanding the question, and then figuring out what kinds of information are needed to answer it — that's factorized as its own agent, reflected in the org structure. Similarly for answer generation: we have a lot of rigor around what constitutes a well-formed answer, and that's factored out too. I call it semi-agentic, like I alluded to before, because we do have guardrails that are non-optional — there's no autonomy there; you have to call them at multiple points. And then we build on top of years of traditional and increasingly modern forms of data munging — your sparse indices have become dense and hybrid indices now.
So yeah, that's a little bit about it, and I think I'm right on time — have a nice day, thank you. [Music] Our final speaker this morning will teach us how to distill accurate, actionable insights from vast multimodal data sources. He's the founder and CEO of Brightwave — please join me in welcoming to the stage Mike Conover. Hey everybody, I'm Mike Conover, founder and CEO of Brightwave. We build a research agent that digests very large corpuses of content in the financial domain. Think of due diligence in a competitive deal process: you're pre-term-sheet, you step into a data room with thousands of pages of content, you need to get to conviction quickly, ahead of other teams, and you need to spot critical risk factors that would diminish asset performance. It's a fairly non-trivial task. Or think about mutual fund analysts: it's earnings season, you've got a coverage universe of 80 to 120 names, and there are calls, transcripts, and filings — it's a fairly non-trivial problem to understand what's happening in the market at a sector level, but also at the individual name level. Or, goodness, you get into confirmatory diligence and you've got 80 or 800 vendor contracts: you need to spot early-termination clauses, and you need to understand thematically how your entire portfolio is negotiating its vendor contracts. It is, frankly, not a human-scale task. And the reality, as we've stepped into this space, is that these professionals just get put in a meat grinder: junior analysts are tasked with doing the impossible on extremely tight deadlines. I come from a technical background — prior to Brightwave I was at Databricks, where I created a language model called Dolly, one of the earlier models to demonstrate the power of instruction tuning for eliciting instruction-following behavior from open-source technologies — and as I've met with these professionals, I've developed a deep sense of empathy for the stakes and the human cost of doing this work manually. On the role of the individual in finance workflows and financial research, we think of the parallels to early spreadsheets. Go to an accountant or finance professional in 1978, before the advent of computational spreadsheets, and ask: what's your job? "Well, I run the numbers." These people wrote this stuff out by hand, on literally wide pieces of paper called spreadsheets. It's cognitively demanding, it's important to the business, and it's time-intensive — it feels like real work. And now nobody wants that job. It's not because there aren't finance professionals, and it's not because nobody's doing analysis; it's that the sophistication of the thought you can bring to bear on the problem has increased so substantially, because there are tools that allow us to think more effectively and more efficiently. What we're hearing from our customers is that a system like Brightwave — and not just Brightwave, this whole class of knowledge agents — is able to digest volumes of content and perform meaningful work that accelerates, by orders of magnitude, the efficiency and the time-to-value in these markets. So the purpose of this talk is to relate the intelligence we've developed in the course of building this high-fidelity research agent — things we're seeing both technically and in terms of product affordances.
I mean, the design problem you have to solve is: how do you reveal the thought process of something that's considered 10,000 pages of content to a human, in a way that's useful and legible? That is a UI/UX and product-architecture problem that did not exist three years ago, and the final form factor has not been determined. Everybody's very target-fixated on chat; that's probably not enough. So the first thing I'll observe is that non-reasoning models are performing greedy local search. The Bloomberg talk highlighted that sort of fidelity issue. A really concrete example: you put a Reuters article into 4o and ask it to extract all the organizations — goodness if it isn't going to give you products too. And if you have a 5 or 10% error rate and you chain calls like that, you're compounding the likelihood of error exponentially. So the winning systems will perform end-to-end RL over tool-use calls, where the results of the API call are themselves part of the RL sequence of decisions, so that the system can make locally suboptimal decisions in order to get globally optimal outputs. The reality is that's still an open research problem: how do I avail myself of a knowledge graph, how do you avail yourself of these tools in an intelligent way so that you get globally optimal outputs? It does not seem like a solved question. And I think it's heartening — everybody in this room can be comforted by this — that you've got to build a product today. There's going to be this talk of the bitter lesson: more data, more compute, and better models dominate all other approaches. Nobody wants an expert system; nobody wants to use spaCy to do named entity recognition anymore. But you can think of being more circumspect about the scope of behaviors the agent is going to engage in as something like a regularization parameter: it constrains the complexity of the model, and that reduces the likelihood that it will go truly off the rails and begin producing degenerate output. The most interesting interactions I've had with language models are deep in a conversational tree, where at each branch — each response — there's a set of reactions I can have to the model's output, and I'm steering, I'm choosing. That's what knowing how to use language models is; it's a skill, and many people who have real full-time jobs may not invest in developing it. This is not dissimilar to what these RL systems are doing. You can think of a multi-turn conversation not just as establishing a human-orchestrated chain of thought: that set of tokens defines the activations of the model, and if you think of the activations as defining a program — as a point in a vector space — then what you're doing when you respond to the model and say "no, not quite like that, more like this" is nudging the activations to a place where they can finally solve the problem at hand. I think that's what the chain-of-thought process, the reasoning monologue, is performing: it's getting the activations to a position where the model can actually solve the problem.
So it's cute that you can interpret it, but I would prefer if it just got to the right set of activations automatically. From a product-affordance standpoint, people are not going to want to become deep prompting experts — frankly, that takes easily a thousand hours — and so the scaffolding that products put in place to orchestrate these workflows and shape the behavior of these systems matters. I think these verticalized product workflows are probably going to be enduring, because they specify intent and take that weight off the user. Some of the things we see with respect to archetypal design patterns in the space: consider a basic autonomous agent. You really want to mimic the human decision-making process and decompose what a person would do. If I need to understand how this polypropylene resin manufacturer is managing costs, I might look for public-market comparables — which might entail going to SEC filings or earnings-call transcripts — and I would assess content, potentially from a knowledge graph constructed from previous deals I've done as a private-equity investor, or from news corpuses; assess which document sets are relevant to me; distill down from those documents findings that substantiate premises or hypotheses I might have about this question or this investment thesis; and then enrich and error-correct those findings. A couple of points on this. One — I forget who it was, but I think it was the deep research team — talked about that next step: what are my intermediary notes? What is it that I believe on the basis of what I've found? That's an extremely useful think-out-loud about what we believe, given the facts as they've materialized on the first pass through the data set. Enriching individual findings distilled from documents is an extremely powerful design pattern. Likewise, you can ask these models — for that Reuters example — "is this factually entailed by this document?" or "is this actually an organization?", and the model can frequently self-correct. What we've noticed is that you can do that in the JSON, as a sort of chain-of-thought behavior, but it's actually more powerful to do it as a secondary call, because the model is kind of primed to be credulous: it says, well, I told you it was, so I'm probably right. It's interesting how you can tease some of these steps apart into multiple calls. Then, through this process of synthesis, you're able to weave together fact patterns across many, many documents into a coherent narrative. In that control loop, we think human oversight is extremely important: the ability to nudge the model with directives, or to select "this is an interesting thread, I want you to pull it," matters — because the human analyst always has access to information that has not been digitized. That's the conversation with management; that's your portfolio manager thinking this class of biotech is just harebrained. That taste-making, I think, is going to be where you see the most powerful products lean.
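Going back a step, the enrich-and-error-correct pattern — extraction and verification as two separate model calls — can be sketched like this. The llm helper and prompts are hypothetical stand-ins, not Brightwave's actual implementation:

    def llm(prompt: str) -> str:
        # Stand-in for whatever chat-completion client you use.
        raise NotImplementedError

    def extract_orgs(article: str) -> list[str]:
        # First pass: extraction, where the model tends to over-include
        # (products, people, and so on).
        raw = llm("List the organizations named in this article, "
                  "one per line:\n\n" + article)
        return [line.strip() for line in raw.splitlines() if line.strip()]

    def verify_org(article: str, candidate: str) -> bool:
        # Second pass: a fresh call, so the model isn't anchored on
        # (primed to be credulous about) its own earlier answer.
        verdict = llm(f"Answer yes or no only. Is '{candidate}' an "
                      f"organization, and is it factually entailed by "
                      f"this article?\n\n{article}")
        return verdict.strip().lower().startswith("yes")

    def extract_orgs_checked(article: str) -> list[str]:
        return [o for o in extract_orgs(article) if verify_org(article, o)]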
With respect to the nodes in that knowledge graph — and many people in this room have probably reached this conclusion as well — you still see this: "oh, we've got a portfolio-manager agent, and this is the fact-checker agent." That needless anthropomorphizing of these systems constrains your flexibility when the design needs of your compute graph change. This is the — 1978, I think — Bell Labs Unix philosophy: think about piping and tee-ing on the bash command line (I guess I date myself; I still use bash, not zsh). Simple tools that do one thing and work together well, and text is the universal interface — and that's 40, 50 years ago, jeez. Our friends at Latent Space put together this plot on the structure of these graphs. That Pareto frontier — the efficiency frontier for the compute/performance, or price/performance, trade-off — is going to continue to move out, but I believe there will be a frontier for an enduring time, and what's notable about that is you then have to select which tool, which system, which model you're going to use for each node in the compute graph. The reason this is important is what I call the latency trap. Think about the plot of time-to-value against realized value for agentic systems — I think this is extremely important. It's very easy to think: oh man, it's going to do all of these things, I'm going to check it and error-correct, and in 25 minutes it's going to be a banger. But even with high-quality products like OpenAI's Deep Research, you're not always sure that what you get out is high quality — so there's a question of which side of the diagonal (it's probably not a straight line) the product is on. And from a reps standpoint, the impulse response for the user matters: you can think of the gap between my expectation of what the report will look like and what it actually looks like as the loss, and the user's mental model is developing a sense of how their prompts elicit behaviors from these models. If it's an 8-minute feedback loop, or a 20-minute feedback loop — goodness, you're not going to do many of those in a day, and your facility with the system and the product is going to be low. Synthesis is really where a lot of the magic happens in these systems, and a couple of observations on that. Has anybody in this room ever had a 50,000-token response from any model? No. They say o1 has a 100,000-token output context length; I'm not so sure, and it's because the instruction-tuning demonstrations — the human-generated or synthetic outputs used to post-train the models — have a characteristic output length. It's hard to write 50,000 coherent, novel words. Even o1 still gives you about 2,000 to 3,000 tokens — better than 4o — and so what happens is there's a compression problem: I have a very, very large context window for input, and I'm compressing that information into a small set of output tokens. It's like the difference between writing a book report and a synopsis of each chapter: you can be more focused and specific about what you want those thousand tokens to be about.
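One way to act on that — decomposing a broad research instruction into focused sub-theme calls, so each response's limited token budget goes to one theme — might look like this sketch, where the llm helper is again a hypothetical stand-in:

    def llm(prompt: str) -> str:
        raise NotImplementedError  # stand-in for a chat-completion call

    def research_report(topic: str) -> str:
        # Ask for an outline first, then spend a full response budget on
        # each sub-theme instead of compressing everything into one reply.
        outline = llm(f"List 5 key sub-themes of: {topic}. One per line.")
        themes = [t.strip() for t in outline.splitlines() if t.strip()]
        sections = [
            llm(f"Write a detailed, information-dense analysis of "
                f"'{theme}' in the context of {topic}.")
            for theme in themes
        ]
        return "\n\n".join(sections)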
For example, I said: write an analysis of the Global Financial Crisis — and goodness, if I don't think the rise of the shadow banking system warrants more than three sentences. If you can be more granular and more specific, you can get higher-quality, higher-fidelity, more information-dense outputs out of these systems by decomposing your research instructions into multiple sub-themes. The last point I'll make on this problem is the presence of recombinative reasoning demonstrations in the instruction-tuning and post-training corpuses. It's easy to say: here, given the text of The Great Gatsby, this is the epilogue — write a new epilogue for The Great Gatsby. That works because the cost of internalizing the corpus is fixed: effectively, you read the book once and then write five epilogues. But synthesis is really about weaving together disparate fact patterns from multiple documents. Think about the applications to biomedical-literature synthesis: I need to read all of these papers and then have something useful to say that actually brings together the facts from those documents. There's a cute trick you could try, which is to say: given the bibliography of any given paper, write the abstract, as a post-training exercise. But it's just really hard to get high-quality, intelligent, thoughtful analysis of many, many documents. So there are limitations in practice, even for state-of-the-art models, in how they manage complex real-world situations. Factors like temporality — temporality is hard: being able to understand something like a merger and an acquisition, where the pro-forma financial statements are different from those that came before the event, or addendums to contracts. It's important to propagate, along with the evidentiary passages, metadata that contextualizes them: why do I care about this? What do we think about this document? How should I consider it in relation to the other pieces of evidence in the context window? I'll now shift a little, with some examples from the product we've built, back to the question: how do you reveal the thought process of something that's considered 10,000 pages of text? I think it's more like a surface. It's kind of like this: you may know that the Facebook and LinkedIn connection-recommendation experience feels uncannily good — in part not because the algorithms are great (they're okay, and they've gotten a lot better over time), but because in your visual cortex there's a bundle of nerves exclusively dedicated to face recognition, and in a 6x6 grid of faces — goodness, I know that person. You attend to the things that matter, even if it's actually a low-precision product experience. So the ability to give the person details on demand is extremely important. Here we have a Brightwave report: the ability to click on a citation and get additional context — not just which document it's from, but how you should be thinking about it, what the model was thinking in the course of this — as well as structured, interactive outputs that give you the ability to pull the thread and say: well, tell me more about that rising capex spend. In Brightwave you can highlight any passage of text — not just the citations — and say: tell me more, what are the implications of this?
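One plausible way to back that kind of details-on-demand surface — a hypothetical structure, not Brightwave's actual schema — is to keep every finding tied to its source passage and the model's rationale, so any highlighted text can be interrogated later:

    from dataclasses import dataclass

    @dataclass
    class Finding:
        text: str        # the claim shown in the report
        source_doc: str  # which document it was distilled from
        passage: str     # the evidentiary passage behind it
        rationale: str   # what the model was "thinking" at the time

    def details_on_demand(findings: list[Finding], highlighted: str) -> list[Finding]:
        # When the user highlights text, surface every finding whose claim
        # or evidence overlaps it, instead of re-deriving from scratch.
        needle = highlighted.lower()
        return [f for f in findings
                if needle in f.text.lower() or needle in f.passage.lower()]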
I think OpenAI gestures toward this with Canvas and the ability to increase the reading level of a passage: a continuous surface where not just the citations but in fact any finding can be interrogated. Likewise — I'm actually going to pause this... it's not going to pause; I'm going to go back and do this again — you can think of the set of things the model has discovered, as it reads all of these documents, develops a view, and weaves the facts together, as a high-dimensional data structure, and the report is one view on that data structure — a kind of low-effort point of entry into the space of ideas. You want to be able to turn that cube over and see — especially in finance — the receipts: what's the audit trail for this system that's read all of these materials? Being able to click into the documents, as in this example, is one level, but having all of the findings laid out for you — whether it's a fundraising timeline or ongoing litigation — means that if something catches my attention, I can click on it. This is where that investor-analyst taste comes into play: I'm able to say, tell me more about that. It's like a magnifying glass for text. Something catches my eye — this patent litigation, goodness, that seems important; you had a factory fire in Mexico that wiped out a critical single-source supplier, what are you going to do about that? That ability to drill in and get additional details on demand is extremely important in these systems, and candidly, we do not yet have the final form factor of this class of products — but it's an extremely interesting design problem. And I will say, we are hiring — hence these QR codes. Not only is it a great place to work: we've got people from Goldman Sachs and UBS and Meta and Instagram and Anaplan, and we just hired a senior staff software engineer from Brave — goodness, we've got a stacked team. We also have a $10,000 referral bonus — so I'm going to see a lot more phones come out now — for all of these roles, primarily the product designer and the front-end engineer; we're hiring staff- and senior-staff-level professionals, and we have a small team of extremely experienced individuals. It's structured like the DARPA Red Balloon Challenge, if you're familiar: if you refer the person who refers the person we hire, you get a thousand bucks, and so on and so on, all along that exponentially exploding referral tree. So: we're Brightwave; we build knowledge agents for finance workflows. I appreciate your time. [Applause] Ladies and gentlemen, please welcome back to the stage the MC for the AI Engineer Summit agent engineering day, the founder of Turing Post, Ksenia. [Music] So wonderful to see you and see your faces — and I bring good news: it's lunchtime. Thanks to Mike, who already left, for this amazing deep dive. If you have any questions for any of the speakers, please find them in the Q&A areas — one is right here on this floor, and the other two are on the lower level. I just wanted to say that this session was amazing — what a morning. I feel buzzing with insights, and I hope you got a lot of interesting things to think about. Each talk was an absolute gem: we got a sneak peek into an enterprise multi-agent copilot platform, we learned about Jane Street's tooling for OCaml,
Bloomberg's challenges in scaling generative AI agents, and Brightwave's knowledge agents. All these companies are hiring, so go talk to them if you're interested. That's it — enjoy your lunch, and we'll see you back here at 2 p.m. Thank you so much, ladies and gentlemen; lunch is being served. [Music] [Applause] Ladies and gentlemen, please welcome the MC for the AI Engineer Summit agent engineering day, the founder of Turing Post, Ksenia. [Music] [Applause] Welcome back! I hope you enjoyed lunch, and some sessions from our sponsors downstairs. It never stops, right? AI just never stops — it's constantly something, something, something. How are you feeling? Are you ready for some more awesomeness? Awesome. This sprint of sessions is packed with action and some valuable insights into AI engineering. Let's see what's on our list: agents are built in the fringe; getting from 90 to 100; how to scale 500 million AI agents in production with two engineers; voice AI; your board isn't special; and how to scaffold wisely. With that, please join me in welcoming our next speaker, the head of product engineering at Windsurf, Kevin Hou. [Applause] [Music] Wow, that's crazy. All right, how are we doing, New York? My name is Kevin Hou, and this is our first-ever Windsurf presentation, so you could say it's the first time we're kind of spilling the beans on what the IDE is all about — thank you all for coming. I'm going to be talking about Windsurf, the first AI-agent-powered editor. I lead our product engineering team, based out of San Francisco, and thank you so much to swyx and Ben and the whole AI Engineer Summit team for inviting us here and letting us speak to you all. It's been a pleasure talking to people in the audience and at the booth, and just generally talking about AI. So let's dive into it: Windsurf is an agentic editor, and we're going to talk a little bit about some of the principles we use when we're building a product like this.
We believe that agents are the future of software development — and you're all here, so you understand the power of what agents can do, for software engineering and otherwise. But to start, let me take you on a trip down memory lane. Let's go back to 2022: Copilot was the state of the art. It had just come out of beta, people were experiencing the ghost text and seeing their completions, and it was one of the first times people really got to see the magic of what AI could do for developers — making them more productive. We at Codeium decided to be one of the first companies to also launch an autocomplete product, and we garnered a couple million users on our VS Code, JetBrains, Vim, and Emacs extensions — raise your hand if you were one of those Codeium users. Nice. But we always knew the intelligence was going to get better. Back then we were doing short completions — maybe finishing your functions — but we knew there were going to be better models, larger models, completely new training paradigms, new RL, new tool use, all of this stuff. We wanted to build the best experience for devs possible, so even back then we started looking at agents and thinking about what the future of software development could be if models just got bigger. We built the best experience we could at the time, which was a chat-plus-autocomplete product, but we always knew that copy-pasting from ChatGPT was going to be a thing of the past. We also knew people were probably going to tab less, that LLMs would be able to generate more and more, and — who knows — we always think agents are the best thing now, but as a company we're always thinking about the future. We're technology optimists: in the future, who knows, we might not even be writing code inside of IDEs; we'll just be there building the best product for devs. And this year, 2025, is finally the year where I feel like we're all recognizing the power of agents in software development. Agents are here to stay, and Windsurf, I'm proud to say, is pushing the envelope of that technology. We're going to talk about some of those features, and we're going to keep pushing that agentic future, because we believe agents are going to move software engineering in a direction that nothing before them has. This slide, I guess, is titled "Vibe coding with Windsurf" — or also just "coding in Windsurf" — so I'm going to give you a quick demo. This is the Windsurf product. You have a sidebar — this is our agent — and you can see we're going to be building a Python web scraper. It's going to build a Python web crawler and give us some stats about a website, and you can see it's actually installing dependencies from pip, doing so inside the terminal that you use, so you can interact with it. It's suggesting edits, setting up your virtual environment, and we give the user a very helpful accept-and-reject flow, so you can go through and have confidence that the code it's generating works for you and your codebase. There are, of course, a lot more features under the hood. Our users like to look up documentation — we have web search enabled by default; it always looks at your codebase, so it can grep through it; we can generate commit messages; you can drag and drop images — the possibilities are truly endless.
These features are powered by a handful of principles — through-lines that we as an engineering team hold true as we're building — and as a team we always go back to the same mission: to keep you in the flow and unlock your limitless potential. We want to handle the grunt work for you: looking at your debug stack traces, modifying your original source code, pulling the correct version of documentation so you never have to worry about pulling in the correct context. These are the problems we're trying to solve, and we want you to spend time on the things you're good at — the things that make us all excited, which is shipping products, building great features, and generally just shipping code. With that goal in mind, how do we tell what to work on? It's a game of input and output: we want to allow users to give the least amount of explicit input possible to produce the most correct, production-ready code. We want you to contribute less and our agent to contribute more, and we do that by reducing the amount of human-in-the-loop required — doing things like background research, always trying to predict your next step, and making decisions on your behalf so you can move faster. This might all seem like a fantasy, but Windsurf launched three months ago, on November 13th — that date is forever branded in my memory — and these are the results we're already seeing. In three months we've generated 4.5 billion lines of code, which is an absurd number, and since I started this presentation, users have probably sent thousands of messages to Cascade asking it to refactor code, write new features, and build new pages on their websites. Also a fun statistic, since we're all engineers here: we've had 16 nights in the last 90 days where we've been woken up in the middle of the night by PagerDuty because of reliability issues — due to us exceeding our capacity. We've had immense success getting people onto the platform, and we've been fortunate to have the problem of being some of Anthropic's and OpenAI's largest consumers. So with this mission and metric in mind, let's walk through the principles we use when building this agentic editor — and for those of you who have used Windsurf, you might learn some new ways to use the product. Also, for my own curiosity: how many of you have heard of Windsurf? Oh, let's go — that's sick. How many of you use Windsurf? Okay, everyone who put their hand down: the door's over there. All right, let's get into it. The first principle: trajectories. What is a trajectory? We use trajectories to read your mind. Unlike other editors — like Cursor, the elephant in the room — our agent is deeply integrated into the editor, and we'll talk about what exactly that means. On one half, an agent has to understand what you're doing, and on the other half, it has to be able to execute things on your behalf. This has led to features like one of my favorites, "continue my work": we build up an understanding of the user as you're writing code and executing terminal commands, and then you can just go into the agent sidebar, say "continue my work," and it will actually continue executing — it might even give you a full PR or a full commit.
We also have things like terminal execution mode: it can automatically use the LLM to decide what's safe and not safe, so if you're running something like git, it'll just work — but if there's an rm -rf somewhere, you probably don't want that run automatically, and the LLM will flag it for the user to confirm. These are some of the ways we try to keep the human in the loop, but as minimally as possible. And finally, we have a stellar UX and design team that's been working on integrating these cutting-edge features into the product in a way that lets the user feel in control — able to accept and reject changes to their code, so they can have confidence in the code they're pushing to production. So here's how a trajectory works. We have this notion of a unified timeline: an agent works in the background, behind the scenes, to understand what the user is implicitly doing. This includes things like viewing files and navigating around your codebase. Say you edit a file, and then the agent edits a file: this all goes into a shared timeline of actions — searching, grepping, making edits, making commits. The agent has a holistic understanding of what you're doing, and the entire experience is unified by this shared timeline: you can contribute to it, and it can contribute to it. That way, you never run into the problem where you're talking to the agent and it undoes the change you just made, or has some outdated notion of the file state. This is a first-class principle of ours: when we decided we were going to build an editor, we built it around this notion of an agent on a shared timeline. Here's an example of this feature in action. We're adding a new function, and you're seeing autocomplete and all the bells and whistles of that feature; on the right side, we just asked "continue my work." This is a new function — we probably want our form handler to use it — and based on the context we gave it by making edits, it's guessing: okay, we probably want to make this change to this file, and maybe some others. At the end, it says: okay, let's just run npm run dev — and it can run terminal commands on your behalf in the background, in your command-J terminal popup. In this way we're keeping you in the flow: something that would have taken minutes now takes seconds. Here's another example: the terminal is now deeply integrated into the agentic timeline. If you're typing commands — the classic example is npm install or pip install of a new package — the agent should know: oh, you just installed this package, why don't we go ahead and implement it into your project? And based on context it picks up around the codebase, it can continue that line of work. We very strongly believe in a future of no copy-paste: you should never be in a terminal, a document, or even a website, copy-pasting text into an agent — that's just not the way the world should work. And in the same way, we strongly believe the future is not going to be at the terminal.
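As a sketch of the kind of safety gate described a moment ago — a hard deny-list plus LLM judgment, with hypothetical names, not Windsurf's actual implementation:

    DENY_PATTERNS = ("rm -rf", "mkfs", "dd if=")  # never auto-run these

    def llm_says_safe(command: str) -> bool:
        # Stand-in for an LLM call that classifies a shell command as
        # safe to auto-execute ("git status") or not ("curl ... | sh").
        raise NotImplementedError

    def should_auto_run(command: str) -> bool:
        # Hard deny-list first, then fall back to the model's judgment;
        # anything not clearly safe gets surfaced for user confirmation.
        if any(pat in command for pat in DENY_PATTERNS):
            return False
        return llm_says_safe(command)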
Here's another example: the terminal is now deeply integrated into the agentic timeline. If you're typing commands (the classic example is npm install or pip install of a new package), the agent should know you just installed that package and go ahead and work it into your project, and based on context it picks up around the codebase, it can continue that line of work. We believe very strongly in a future of no copy-paste: you should never be in a terminal, a document, or even a website, copy-pasting text into an agent; that's just not how the world should work. And in the same way, we strongly believe the future is not going to be at the terminal. Here's another example of commands running inside your terminal. The trajectory concept lets us automatically execute things inside a sandbox that is as similar as possible to the way you run commands: instead of running some shell script in the background, we put it right inside the place where you would actually write terminal commands, so if you pip install something, or it pip installs something, it goes to the same environment, and you never get that kind of weirdness. This is all part of our effort to bring the two sides, the agentic side and the human side, as close together as possible, and you do that by building a unified product. We believe developers are here to stay, and if an agent is going to work seamlessly with a developer, it has to understand what they're thinking. Windsurf has to be ubiquitous, and the agent will be reading more and more of your mind, doing things you might not even know it's doing. In the future it will look not just one to five steps ahead but 10, 20, 30 steps ahead: writing unit tests before you've finished defining the function, performing codebase-wide refactors across multiple files because you simply edited a variable name. All of that is part of this unified trajectory concept.

The second principle is meta-learning. Even if Windsurf understands what you're doing in the moment, there is still an inferred understanding of your codebase, your preferences, and your organizational guidelines, the kind of notion senior engineers at your company have built up over time. We call this meta-learning, and we've built Windsurf from the ground up to adapt to and remember these things about you and your company. Think about a frontier LLM, the best LLMs that exist in the world: they're very smart engineers, definitely more capable than I am, probably more capable than most of us; they can write an enormous amount of code, correctly, and it probably runs and compiles well. What they do not have is the exposure you've had, the education you've had, and the ability to remember how you personally, or your company, writes code. So what does this mean for our product? We've implemented a concept called autogenerated memories: over time we build up a memory bank of what you're doing. You can say "remember that I use Tailwind version 4," or "remember that I use React 19 instead of 18," and those things will be remembered; you say them once and they're remembered forever. We also let people plug in custom MCP servers, so you can bring your favorite tools and we can adapt to your workflow. And we let you whitelist and blacklist commands: going back to the same idea, we want to keep you in the flow as much as possible, but you can tell the agent, "never run an rm command without my approval."
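A minimal sketch of what that command gating could look like, with a stub llm_judge() standing in for the model call that handles commands on neither list; the lists and names here are hypothetical, not Windsurf's implementation:

```python
# Gate terminal commands: whitelisted ones run freely, blacklisted ones always
# require approval, and unknown ones are judged by a model, erring on caution.
import shlex

ALLOW = {"git", "ls", "npm", "pip", "pytest"}  # run without asking
DENY = {"rm", "dd", "mkfs"}                    # always ask for approval

def llm_judge(question: str) -> str:
    # Stand-in for a real model call; a cautious default for the sketch.
    return "yes"

def needs_confirmation(command: str) -> bool:
    program = shlex.split(command)[0]
    if program in DENY:
        return True
    if program in ALLOW:
        return False
    # Unknown command: ask a model whether it could be destructive.
    return llm_judge(f"Could running `{command}` destroy user data? yes/no") == "yes"

print(needs_confirmation("git status"))           # False: whitelisted
print(needs_confirmation("rm -rf node_modules"))  # True: blacklisted
```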
In this way, it learns about your preferences over time. If you think about what makes a developer effective, it's that they remember the things you tell them, and Windsurf must model that behavior if we hope AI will write and maintain projects for us. In the short term, this means you don't need to prompt the agent again and again to do the same thing; in the long term, the AI should just feel like a seamless extension of yourself. It's this idea of explicit versus inferred context, and we have a saying at the company: ideas are cheap.

Here's an example of autogenerated memories in action. We're not even explicitly telling it to remember anything; we're just asking for an architecture overview: what does this project do? Based on a couple of tool uses, looking at a few different files and at the routes, it commits to memory: this is the project this person is working on, and here are the endpoints that are available. We can reference that in the next message we send, so in future conversations we can one-shot things, because we have a memory bank. In the same way, documentation is auto-learned: we know what packages you're using from your package.json, because you've effectively told us, and we look up documentation on the web that matches those versions, all implicitly. The dream of meta-learning is an entirely inferred sense of context, based on a codebase or on usage of the product, and autogenerated memories are a step in that direction. We do allow users to add a rules file, but we strongly believe a rules file is a crutch: by the end of 2025, 99% of the things you would put in a rules file will be interpreted or inferred from your codebase or your usage. Our dream is that every single Windsurf instance, every single user, regardless of company or developer skill, will be personalized to that user, and you'll only ever have to tell it something once.
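Here is a minimal sketch of the memory-bank idea: a fact is stored once and prepended to every future prompt. The JSON-file persistence and function names are assumptions for illustration, not how Windsurf actually implements memories:

```python
# Durable "memories": facts stated (or inferred) once, recalled in every
# future prompt, so the user never has to repeat a preference.
import json
from pathlib import Path

MEMORY_FILE = Path("memories.json")

def load_memories() -> list[str]:
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []

def remember(fact: str) -> None:
    memories = load_memories()
    if fact not in memories:
        memories.append(fact)
        MEMORY_FILE.write_text(json.dumps(memories, indent=2))

def build_prompt(user_message: str) -> str:
    facts = "\n".join(f"- {m}" for m in load_memories())
    return f"Known facts about this user/project:\n{facts}\n\nUser: {user_message}"

remember("Uses Tailwind v4, not v3")
remember("Uses React 19, not 18")
print(build_prompt("Add a dark-mode toggle to the settings page"))
```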
And finally, my favorite principle: scale with intelligence. What does this mean? Now that Windsurf understands what you're doing in the moment (the first principle) and improves over time (the second), how do we build an agent that scales at the rate LLMs themselves are scaling? While we're trying to give you the best tool today, we recognize that new models come out every other week, and every day there's some new article about some new pattern; it's really hard to keep up. We're always asking at Codeium: how do we stay on top of this, and how do we build the best product not just for today, but three months, six months, twelve months out, three years from now? In 2022, when ChatGPT came out, you were probably like me: imaginations running wild, ready to solve AGI, the post-AGI economy, whatever. Obviously a lot of things need to happen between then and that future, and the models at the time were, quite frankly, a little too dumb to accomplish everything we wanted them to. So we built up a lot of infrastructure, and you've all probably done this too: embedding indices, retrieval heuristics, output-validating systems to make sure the generated code is good. These things helped at the margin, but they were all predicated on the assumption of a fixed notion of intelligence: in 2021 and 2022 we were building all this infrastructure to compensate for edge cases the models couldn't handle. What's very different about the way we're approaching Windsurf is that we want the product to scale with the models: if the models get better, our product gets better.

I'll give you one example; it kind of surprised me. When I landed in New York, I tweeted that we deleted chat in Cascade, and weirdly a lot of you picked that up; it's an example of something we feel very strongly about. One example of this principle in practice is that we deleted chat: we only have an agent, called Cascade, inside Windsurf. Chat is a legacy paradigm, and we completely replaced it; as you can see, users are enjoying it, or in fact they might not even notice the difference, they're just enjoying the higher quality. Another example is @-mentions. We built @-mentions (and probably all of you have used them) because context retrieval was not very good a year or two ago. Today, Windsurf can dynamically infer the relationships between bits of code and documents: 90% of the time you do not need to @-mention anything; you just let the retrieval system in the agent plan out what it needs to do and reconstruct the context automatically for you. @file and @web are very helpful patterns when you're working at the margin, but they're eventually just eking out basis points; in the long term we believe LLMs will improve, and they already have, to the point where you don't need an explicit @-mention, because the model is intelligent enough to pick it up. In this example, I was adding Supabase to a Next.js app. Previously you'd be @web-ing, @docs, @codebase, all of that; now you just say "add Supabase," and it infers and plans it out: search the web, behave like a human would. And to get into that, there's also web search built into Windsurf, and what's very special about it is that it reads the web the way a human would. Instead of hardcoded rules (we probably could have created an embedding index, but we'd probably get very low-quality results), we said: the LLMs are very, very good, so let the model decide what it wants to do, which search results to read, which parts of the page to read, and then finally give us an answer. As models continue to get better, we're going to keep doing unsupervised work: generating full PRs, reading complex documentation; the possibilities are truly endless.

So those are the principles we just talked about. Where are we going with this? There are a lot of directions: the engine underneath Windsurf is really the secret sauce, and we believe 2025 is going to be a whole new world: no rules files, generating PRs, generating commits. It's going to be crazy, and we're already seeing it: across all of our users, 90% of the code they write is generated with Cascade. That's an astonishing number; autocomplete was more like 20 to 30%. This is insane: people are using agents today to accomplish so much more than they could in the past. We're all software engineers, and I want every single person in this room armed with the best tools, and those best tools are agents. And like every good thing in this city, I expect tips: 25% of your ticket price, which I heard was quite a lot. Here's the actual QR code you're probably
curious about: it's Windsurf's download link. We offer a free tier, so go ahead and scan it and start using the magic today. Finally, we have some killer swag at our booth, and you can connect with me on Twitter; I try to stay active with the community. Thank you so much for watching; I hope you all learned something about how we're building at Windsurf, and enjoy the rest of the conference. Thank you. [Applause] [Music]

Our next presenters will tell us how they scaled to 500 million AI agents in production with just two engineers. Please join me in welcoming senior software engineer at Method Financial, Mustafa Ali, and the founder and CEO of OpenPipe, Kyle Corbitt.

All right, hey everybody. I'm Kyle Corbitt from OpenPipe, and I'm here with Mustafa Ali from Method. We're going to talk about how Method has scaled in production to over 500 million agents, and basically all the tricks they used to make that actually work.

A little about Method: we essentially collect and centralize liability data from across hundreds of different data sources. That includes tapping into the credit bureaus, connecting with card networks like Visa and Mastercard, direct connections with the financial institutions, and various other third-party sources. We aggregate and enhance this data and serve it to our customers, who are typically other fintechs, banks, or lenders, and they use this enhanced data for anything to do with debt management: refinancing, loan consolidation, liability payments, or personal finance management.

And at OpenPipe, we help you build, train, and deploy open-source models for actual usage. We also let you use the signals you get in production, from users and from the environment, to continuously improve your model over time, and that's some of what we'll be talking about with Method.

One of the early challenges we faced at Method while building this aggregation pipeline was that some of our customers came to us and said: it's really nice that you can give us the balance and payment information on a specific liability for our end consumers, but what would be really nice is if you could also give us some liability-specific data points, like the payoff amount on an auto loan or the escrow balance on a mortgage. So we did some research. We went back to our data partners and asked whether there's anything we could plug into to get those kinds of data points, and what we found is that there's really no central API we could get access to. Ideally we would work directly with the banks, but having already worked with banks before, and just from initial conversations, we realized it would easily take at least a couple of years to get anything solid done. We're an early-stage company and we want to build for the customer fast, so we needed a solution we could push into production tomorrow. To get a better understanding of how the companies providing these services today are operating in the first place (they must be getting that data somehow), we went back to some of these customers
and asked them: how are you operating? What they told us was interesting. A lot of these companies hire offshore teams of contractors who are responsible for calling the banks on behalf of the company and the end consumer: they authenticate with the bank, gather the necessary information, somebody proof-checks it, it gets sent back, and then it's integrated into the financial platforms, surfaced to the user, used for underwriting, things like that. So that's the status quo we're dealing with, and when you think about it, it's a very inefficient, manual process. It doesn't really scale, and it has a lot of problems. It's expensive, because one person can only do one thing at a time, so to scale you basically have to hire more people. For the same reason, because it's so synchronous, it's also really slow. And the biggest problem is that there's a lot of human error involved: you need a team to fact-check and proof-check, because the worst thing you can end up doing is surface inaccurate financial information. Conceptually, though, it's kind of like an API: you have the request component, the authentication component, the response validation, all of that. When you drill the problem down, the core problem is really just making sense of unstructured data. If only there were some magic tool or software that was really good at parsing unstructured data... Lucky for us, around the time we were trying to solve this, OpenAI announced GPT-4, and, as people like to call it, there was a Cambrian explosion of AI- and LLM-enabled applications all around us; the results were just mind-blowing. We thought: this is the perfect thing for us, a godsend. If there's one thing we all know in this room, it's that advanced LLMs, especially post-GPT-4, are really good at parsing unstructured data: tasks like summarization and classification. So we wanted to test that theory and see what it could get us. We put our heads down and hacked together an agentic workflow using GPT-4, and as expected, it worked really well. We then tried to expand our use cases, because the API costs are high and we wanted to get as much as we could from a single API call, and it turned out to be really good at that too. This was all done in a very controlled manner, but it was in production, and we were testing out different extractions, and everything was going really well. But as soon as we started to increase traffic a little, the bill came due, and it was a lot: $70,000 for our first month in production with GPT-4. That made leadership really unhappy, but it was something they were fine with, because the value we were getting out of GPT-4 was so immense. So we actually kept
this system in production for at least a couple more months while we tried to work around the cost problem. And cost wasn't the only concern. As we started to scale these use cases, we quickly ran into a wall with prompt engineering; it only takes you so far. One thing we realized is that even though GPT is really smart, it's not a financial expert: you have to give it really detailed instructions and examples to make it work across all the use cases we were targeting. It's hard to generalize those kinds of prompts; they become really long and convoluted, and it's always a cat-and-mouse chase: you fix it for one scenario and it breaks for another; you fix that one and it breaks for the previous one. And with all this back and forth, we didn't have any prompt versioning, so we had to figure out a better way to make this work for all our use cases. The tl;dr is that we didn't want to adopt the initial offshore-team solution because of its scaling challenges and inefficiency, but we ran into much the same scaling challenges with GPT: it was expensive, because we couldn't optimize for caching given the variability in responses and the prompt tweaks we were making all the time; the baseline latency was really slow, so overall we couldn't scale concurrently; and, similar in spirit to the human errors but different in nature, we had AI errors, hallucinations that were hard to catch. We just couldn't scale with that kind of system, though we kept it in production, because for specific use cases it was actually really, really good. So the problem shifted from making sense of unstructured data, which GPT solved, to scaling the system: how do we build a robust agentic workflow that can handle this kind of volume reliably? Some of the ballpark figures we came up with: at least 16 million requests per day, at least 100K concurrent load, and minimal latency to handle this kind of real-time agentic workflow, sub-200 milliseconds. So the natural next question was: do we buy more GPUs? Do we host our own model? What do we do at this point? That's where OpenPipe comes in.

So, about a year ago we at OpenPipe started working with Method on these issues, and we found that the three issues Mustafa just listed (quality, cost, and latency) are very common: across almost everyone we work with, at least some subset of those is top of mind. With Method specifically, we worked on solving those problems in a way that makes this a viable business for them. The first thing we did was start measuring error rates. Like he mentioned, even AI models are not perfect; these are all probabilistic systems, and getting to a 0% error rate was not really feasible, but we could see that different models had different performance characteristics. On modern models, on the task they're doing, here are the rates we were seeing:
on GPT-4o we were at about an 11% error rate, and o3-mini was much better, at a 4% error rate. How you measure that will be specific to your business, and that's true to some extent for all three of the things we'll talk about. In Method's case, this was actually relatively easy to measure: ultimately, what the agent is trying to do is extract all the information Mustafa was talking about (bank balances and the like), so you can have a human go through the flow, figure out what the real numbers should be, and compare the agentic system's final outputs against that to see whether it was successful. On latency, GPT-4o took around a second to respond, and o3-mini took about five seconds for their specific task. Again, this is somewhat task-dependent, for example on how much o3-mini has to think. As you measure, make sure you're using real production conditions: a real diversity of tasks that matches what you actually do, and a concurrency level that matches your production traffic. We also measured cost. How much cost matters is, again, very specific to your use case. Interestingly, even though o3-mini has a much lower per-token cost than GPT-4o if you just look at the API pricing page, for this specific use case we found it was a little more expensive, because it generates many more reasoning tokens and therefore much longer outputs. As an aside: once you're past that initial proof of concept with some model that works, I recommend setting this benchmarking up for yourself. It can be as simple as literally writing three different Python scripts that measure each of these for a different model; then, as new models come out, you'll be able to quickly tell how they do.
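In that spirit, here is a minimal sketch of such a script: it replays production-shaped tasks through each candidate model and reports error rate, median latency, and token usage as a cost proxy. The task list, golden answers, and exact-match check are placeholders you would swap for your own, and you would run it at production-like concurrency rather than sequentially, as here:

```python
# Benchmark candidate models on the same production-shaped tasks, recording
# error rate, latency, and token usage (a proxy for cost).
import time
from openai import OpenAI

client = OpenAI()
TASKS = [  # (prompt built from a real production input, human-verified answer)
    ("Extract the payoff amount from this transcript: ...", "1234.56"),
]

def benchmark(model: str) -> None:
    errors, latencies, tokens = 0, [], 0
    for prompt, expected in TASKS:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        latencies.append(time.perf_counter() - start)
        tokens += resp.usage.total_tokens
        if resp.choices[0].message.content.strip() != expected:
            errors += 1
    median = sorted(latencies)[len(latencies) // 2]
    print(f"{model}: error={errors / len(TASKS):.0%} "
          f"median_latency={median:.2f}s total_tokens={tokens}")

for model in ["gpt-4o", "o3-mini"]:
    benchmark(model)
```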
Once you've done this benchmarking of where the models are, the next question is: where do we need these models to be? Again, this is very task-dependent. Method has extra checks that happen downstream, where they look at whether the numbers that came out are plausible, whether they match the kinds of things seen before, all these different checks, so they didn't need to get all the way down to a 0% error rate. But those checks are still fallible, and above a certain error rate some fraction of the errors will get through, which is bad; we found around a 9% error rate got them what they needed. From a latency point of view, their agent is a real-time system: it needs to respond quickly to move through the whole flow and get the information it needs, so they had a hard latency cutoff. We see a wide variety here, for what it's worth: some customers tell me that getting a result back at some point in the next few days is totally fine, it's a background batch process; other customers are doing real-time voice with a human on the other end of the line, where anything over 500 milliseconds is not going to work. So again, you just have to know, for your specific case, how much this matters. Same with cost: because of Method's very high volume, cost is pretty important to them. How much cost matters to you will usually depend mostly on your volume, but these are numbers you should know for your specific task as you compare different models.

Now, looking at this slide, you may see the problem: of the two models we're comparing, neither meets all three requirements we need to deploy this in production. GPT-4o isn't quite there on either error rate or cost, and o3-mini misses on cost and, especially, latency. That's the point at which Method came and talked to us: hey, we're not able to hit what we need here; these models aren't getting us where we need to be. What we work on at OpenPipe is fine-tuning: building custom models for your specific use case, so let me talk about why you'd want to do that and how it helped in this case. First, I would say fine-tuning is a power tool: it takes more time and more engineering investment than just prompting a model, so you really don't want to reach for it until you've benchmarked the production models with prompting alone and seen whether they work. In Method's case, and in all of our customers' cases, they found they couldn't hit the numbers they needed, and that's the time to bring in fine-tuning, because it can really bend the price-performance curve. So let's look at what we got when we fine-tuned a model. On error rate (basically just the inverse of accuracy, if you want to measure it that way), we got to a place significantly better than GPT-4o and, importantly, better than the threshold they needed. This used to be much harder to achieve: it required a lot of manual labeling of data and the like. It has actually become much easier over time because of the existence of models like o3-mini: you can simply use your production data, take the inputs you're seeing in production, generate outputs for them with a model like o3-mini, and train on those. We find you often can't quite match the performance of the teacher model (o3-mini in this case), but you can get quite close to it, and usually do much better than a slightly less good but much, much larger model. In this case, the model we ended up deploying with them is just an 8-billion-parameter Llama 3.1 model.
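Here is a minimal sketch of that distillation recipe, assuming logged production inputs in a hypothetical production_inputs.jsonl: each input is replayed through the stronger teacher model, and the prompt/output pairs are written in the chat-format JSONL that most fine-tuning stacks accept. This is an illustration of the general pattern, not OpenPipe's pipeline:

```python
# Build a fine-tuning dataset by replaying production inputs through a
# stronger "teacher" model and saving the prompt/completion pairs.
import json
from openai import OpenAI

client = OpenAI()
TEACHER = "o3-mini"  # slow and pricey is fine offline; the student serves traffic

with open("production_inputs.jsonl") as src, open("train.jsonl", "w") as out:
    for line in src:
        prompt = json.loads(line)["prompt"]
        answer = client.chat.completions.create(
            model=TEACHER, messages=[{"role": "user", "content": prompt}]
        ).choices[0].message.content
        out.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}) + "\n")
# train.jsonl can then be used to fine-tune a small model such as Llama 3.1 8B.
```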
And we find that for the majority of our customers, a model that size or smaller is good enough and hits the quality numbers they need; the important thing is to benchmark and answer that question for yourself. On latency, this is really the magic of being able to move to that much smaller model: with an 8-billion-parameter model there are many fewer sequential calculations across fewer layers, so it's way easier to deploy in a low-latency way. You can even (we didn't actually have to in Method's case) train the model and deploy it within your own infrastructure, collocated with the application code that's using it, and completely eliminate the network latency. And finally, on cost: because this is such a small model, you end up with a much, much lower cost. For many of our customers, getting the performance number they need while keeping cost relatively low is incredibly important; in Method's case we were able to far exceed the cost thresholds they were looking for, which means they no longer have to worry about this from a unit-economics point of view the way they did with the larger models. So, to reiterate what I started with: fine-tuning is a power tool. It takes a fair amount of work, not an extreme amount, but significantly more than prompt engineering. However, if you can't get to the reliability numbers you need through prompt engineering with the models that exist out there, fine-tuning is a viable way to very strongly bend that price-performance curve and get to a much better place, which can take you to very large scale in production, just like Method.

So, to wrap up here, a couple of points we want to highlight. One reason we put "two engineers" in the title is that it's not that complicated. We identified a specific use case and got away with using the cheapest model out there; we fine-tuned it; and we already had the data from running GPT in production, so we didn't have to go digging around for data in the first place. We used the cheapest model that gave us the fastest performance, and you don't need to buy your own GPUs. The other thing we realized is that productionizing AI agents requires some openness and patience from the engineering team and from leadership. When you write traditional code, we're all used to code that just works: you push out a feature and it never breaks, because you're not changing anything. With AI agents, it takes some time to get to the point where they're production-ready and actually give you the responses you're looking for. And I feel compelled to say something to mark this moment for the traditional software engineering job, so I'll leave you with these last few words: pivot to AI. Thank you. Thanks, everyone. [Applause]
[Music] Our next presenter is a staff software engineer at SuperDial, and he's here to tell us how to make reliable voice AI agents. Please join me in welcoming to the stage Nick Kotakis. [Music] [Applause]

Awesome. Hey everyone, I'm Nick, an engineer at SuperDial, and first of all, big thanks to the organizers; this event has been awesome, and I've had a blast talking with you, connecting with you, and hearing all these great talks. Somehow I'm giving one of the few voice AI talks this weekend, so I have a lot to cover, and we're going to dive right in. If you're new to voice AI, I hope I can provide a nice little framework for thinking about this very fast-moving space, and if you're already building with voice AI, I'll share some little anecdotes from our own scaling journey that I hope will help yours.

Voice AI in 2025 is extremely exciting. We're seeing new, smart, really fast, really affordable LLMs that are supporting much more complex conversational use cases, but you still kind of need some tricks to take your chat agent and turn it into a voice agent. We have low-latency, really realistic, highly generative text-to-speech models, but sometimes we get audio hallucinations, and we have to deal with things like pronunciation and spelling. With all the new things people are building, there's an explosion in voice AI infrastructure, tooling, and evaluation systems, and a big question becomes: what's actually worth owning? And the big one on everyone's mind is the new speech-to-speech, or voice-to-voice, models. Our take is that for a lot of production applications they're not quite ready yet, and a big reason is that they start to output things that aren't natural speech, things you can't use to build a reliable conversation; we saw this when they first came out and were imitating people's voices. From the start, that's why we've been favoring reliability over that sort of realism.

So today I'm going to talk about how we at SuperDial approach agents as a service, how we think about the voice AI engineer, and the last-mile problem: once you have your little voice MVP, all the challenges you're going to face actually making it reliable and putting it to work. At SuperDial we're in the business of phone calls, specifically one of the most annoying phone calls ever: the call to your insurance company. For mid- to large-sized healthcare administration businesses, we sell the SuperDial platform. You build your script: design the conversation and ask all the questions you need to get answered over the phone. You send us your calls via CSV or API (we also integrate with a lot of EHR software systems), and within the next couple of hours, or the next day, we send you back your results in a structured format. This makes for a really interesting agentic contract with our customers: from their perspective, they're paying for results; they tell us who to call and which questions to ask, and we tell them the answers. Internally, we have a little agentic loop set up: we wait for the offices and call centers to be open so we can actually make the calls; we attempt the call with our voice bot; and if the voice bot needs to bring in a human to complete the call, or cannot complete it after a certain number of attempts, we send it to a fallback team.
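Here is a minimal sketch of that loop, with hypothetical stand-ins (office_is_open, run_voice_bot, send_to_fallback_team) for the real scheduling, telephony, and hand-off pieces; it illustrates the contract described above, not SuperDial's actual code:

```python
# Outer loop for one call: wait for the office to open, give the voice bot a
# bounded number of attempts, then hand off to the human fallback team.
import time

MAX_BOT_ATTEMPTS = 3

def office_is_open(phone_number: str) -> bool:
    return True  # stub: would consult learned office hours for this number

def run_voice_bot(call: dict) -> dict:
    return {"complete": False, "needs_human": True}  # stub telephony attempt

def send_to_fallback_team(call: dict) -> dict:
    return {"complete": True, "handled_by": "human"}  # stub hand-off

def handle_call(call: dict) -> dict:
    while not office_is_open(call["phone_number"]):
        time.sleep(15 * 60)  # check again in 15 minutes
    for _ in range(MAX_BOT_ATTEMPTS):
        result = run_voice_bot(call)
        if result["complete"]:
            return result  # structured answers go back to the customer
        if result["needs_human"]:
            break  # don't burn attempts on a call a bot can't finish
    return send_to_fallback_team(call)  # a human completes it either way

print(handle_call({"phone_number": "+1-555-0100"}))
```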
This is something we're of course transparent about with our customers; in fact, it's a benefit to them, because with these healthcare phone calls it's kind of inevitable that sometimes you need to bring in a human. With us, they know that no matter what happens, the call will get made; whether it gets made by a human or a bot doesn't matter to them, and they get their answers reliably and in a structured format. With all these calls, we do our best to learn from them: we update the office hours for the phone number we're calling, and we learn from the phone-tree traversal we just attempted, so that when we call again, we're even better at that sort of call. And because these are sensitive phone calls, we want to make sure our system always works, so we randomly pull some of these calls out and audit them.

For a quick little demo, this is an actual prior-authorization call, after the point where we've traversed the phone tree by pressing the right buttons; now we're talking to a human and trying to get some questions answered for a customer: "Hi, this is Sarah. Are you calling from a doctor's office?" "I'm calling from a provider's office." "Do you have a member ID or a case number?" "The member ID is..." "What is the CPT code?" "The CPT codes are 81243." "Okay, hold on... so there's a case on file that was initiated for the code 81243; it is pending. The case number is..., and we have not received any clinicals for this case yet." "Okay, what is your name again, and what is the reference number for this call?" "You may use the pending case number as the call reference number, and the fax number is where to send the clinicals." "Thanks so much for your help." "You're welcome, thanks for calling, have a great day."

So that's it. If that call was really boring to you, that's kind of just how these things go: a boring call for us is an excellent call, because it turns out a lot of work is boring. With this system we've been able to save over 100,000 hours of human phone-calling time, and we're on track to save millions more in 2025. What's really incredible about voice AI today is that we did this with a really lean team of four engineers: building the whole full-stack web application, the EHR integrations, the bot you just saw, all while bringing on new customers and supporting new conversational use cases really quickly. A big part of why that was possible is that we all really embraced the role of the voice AI engineer. So let's uncover what's unique about a voice AI engineer today and what hats they may be wearing. Starting from swyx's original AI engineer graph, we can see that a voice AI engineer is going to deal with multimodal data (MP3s and audio bytes in addition to transcripts) and with transcription models, voice models, speech-to-speech, all that sort of thing. The application you're building runs in real time, so latency all of a sudden matters so much more, and you're going to be dealing with async Python a lot more than you probably wanted to. And the product constraint here is almost always going to be a voice conversation: people have really high expectations of how these sorts of conversations go; we're slotting ourselves into an existing business interaction, and people expect us to be conversational and fit that use case. To grapple with all these challenges, we have two sayings at SuperDial that we've been repeating over the
past year and a half: "say the right thing at the right time," and "build this plane while we fly it." The trickiest part for us is customizing all the scripts and all the use cases for each customer individually; we rely on a horizontal voice AI stack to help us with all the other problems. That's how we think about the voice AI engineer today and its unique roles. In the larger context, we're really at an inflection point where it's so easy to build an MVP of these sorts of applications that what makes your voice bot unique ultimately isn't its voice, its interruption handling, how realistic it sounds, or how it does turn-taking: it's the conversational content, the design there, and the vertical integrations around it that make your agent's work actually valuable.

And if you're like me, and your favorite classes in college were the AI ethics ones, everything I just said about moving fast and building with generative AI could raise some alarms. It's not hard to imagine how voice AI apps specifically could be biased against people with certain accents or dialects, or be really spooky when they sound so real and then say weird things. In the US we both enjoy and suffer from a lack of AI regulation, and that leaves the onus ultimately on the AI engineers and leaders in this room to think about these sorts of problems. This isn't going to be a talk on AI safety and ethics, but voice is such a new modality of interaction with artificial intelligence that I think it really matters how we go about building it. So for AI engineers, when we make tooling and infrastructure choices, remember that developing AI should be really accessible and collaborative, and the work AI does should be for everyone; a key part of making sure that's the case is choosing tooling and infrastructure so that a really diverse set of stakeholders can be involved in the process from the start.

With the role of the voice AI engineer scoped out, let's dive into some of the last-mile problems in voice AI that we've been dealing with. When we started out, we had a really scrapped-together pipeline: a transcription model, then an LLM, then a text-to-speech model. That was awesome to get started with, but we faced a lot of problems very quickly, and a lot of what we were learning was not new at all. The voice agents we see today are better than ever, but voice UI itself is not that new. When we were just getting started, around a year and a half ago, I had the chance to speak with Cathy Pearl, who is a close family friend, has been working on the UX of Gemini, and has been in the conversation-design game for something like 20 years. Back in the day, voice UI was lots of phone-tree design; then it became the Alexa and Siri type things; and now we're in this whole new world, but a lot of the principles remain the same. One of the biggest changes in developing voice UI is the shift from prescriptive to descriptive development: we no longer prescribe what we want our bot to do over the course of the conversation by mapping out every possible direction it could go; instead, we describe what we want and then kind of pray to the generative gods that it happens. There's a lot I could say about conversation design, but
it comes up really quickly when the conversation becomes your main interface. One question for us: when we ask these questions, should we be really open-ended, or constrain the user into selecting from a list of choices? Because these are existing conversations, we find it's often better to go general, hope the call-center representative gives us a ton of information, and then, instead of trying to prevent them from saying the wrong thing, adapt to whatever they say. Cathy's recommendation was: hire a conversation designer if you're thinking about these sorts of problems; they're experts in this. And if you're a voice AI engineer who wants to get started with this kind of thinking, a great exercise is little table reads: have one person pretend to be the bot and the other pretend to be the user. The gaps and awkwardness of a transcript you wrote out by hand come out immediately when you say it out loud.

Knowing all this, we were really excited to work on our conversations, but we first had to deal with the tech debt of the orchestration framework we had built. We really hit our stride when we started using Pipecat for voice AI orchestration: an open-source framework maintained by the folks at Daily. It's really easy to extend and hack on, which is important for our use case when we need to do transfers and the like, and we make really long phone calls, up to an hour and a half long. A big factor in choosing Pipecat was that we can self-host it and deploy and scale it how we want. With some of our voice-orchestration headaches dealt with, we really wanted to get back to focusing on our conversations. Most of what's on this slide is not unique to voice UI and AI, so I'll move quickly, but two interesting decisions we've made: because we just have an LLM at the backbone, we chose to own our own OpenAI-compatible endpoint. We find this makes a better interface with a lot of these new voice AI tools, and behind our endpoint we can route to different models that may be more latency-sensitive. All of our generative responses route through a tool called TensorZero; they're relatively new, they have a nice framing of LLMs, and if that interests you, I recommend you look them up and talk to them, they're awesome. It's a little open-source tool, so you can do whatever you want with it, and it gives us structured, typed LLM endpoints that we can then experiment with in production; that's our gateway to our LLM. For logging and observability, we self-host Langfuse. We self-host these things partly because these are healthcare calls and we have to be HIPAA compliant; that's often the easier way to deal with the rapid growth of this space. There we do things like anomaly detection, evals, and datasets.

With a good plan in place for the LLM work, another big challenge is our text-to-speech system. When you make these sorts of phone calls, your password is basically your name, your date of birth, and then your member ID or something similar: a 12-digit-long string of characters that you have to be able to communicate over the phone. Something we quickly realized is that what our LLM outputs is not necessarily what we want to shove through our text-to-speech engine, and neither of those things may actually match what ends up in the recording.
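Here is a minimal sketch of the kind of normalization layer that implies: the LLM's text passes through a rewrite step before it reaches the TTS engine, with long identifiers spelled out character by character. The regex threshold and grouping are arbitrary choices for illustration; real vendors (Rime, mentioned in a moment, for example) have their own markup for this:

```python
# Normalize LLM output before text-to-speech: spell out long identifiers
# (member IDs, case numbers) with pauses so they are read slowly and clearly.
import re

def spell_out(token: str, group: int = 3) -> str:
    # Turn "AB12345678" into "A B 1, 2 3 4, ..." so TTS paces it naturally.
    chars = list(token)
    groups = [" ".join(chars[i:i + group]) for i in range(0, len(chars), group)]
    return ", ".join(groups)

def normalize_for_tts(text: str) -> str:
    # Spell out any long alphanumeric run; everything else passes through.
    return re.sub(r"\b[A-Z0-9]{8,}\b", lambda m: spell_out(m.group()), text)

print(normalize_for_tts("The member ID is AB12345678."))
# -> "The member ID is A B 1, 2 3 4, 5 6 7, 8."
```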
A little example of this, and this is a personal last mile: if you're building me a personal voice UI application, it should say my last name correctly. Most people, and most models, mispronounce it, but with a lot of the new tools out there (this is the syntax a company called Rime uses) you can spell out the exact pronunciation you want. And for things like spelling, where you may have an intuition for the sorts of pauses and breaks you'd want to use when saying a really long string, you can use something like this little spell function. And because all of this outputs audio bytes, we usually review recordings to make sure it all sounds okay, in addition to checking the transcripts. (And with voice-to-voice models, all this sort of rule-based stuff gets a little more complicated.)

To start wrapping things up, I have a couple of mini last-mile problems we've had to deal with. We used to be called SuperBill, and we called our bot Billy because we thought that was a fun name. It turns out that's an awful name on the phone, because we would constantly have conversations where people said "nice to meet you, Billy," and we'd be correcting how they said it. So think about your persona a lot, and dial it in early. If you're just starting, don't build from scratch: what's going to make your bot unique is the conversation, and there are so many new tools out there, like Pipecat, that you can use to get a quick jump start. Track latency everywhere: time to first byte for each of your little processors is the new most important metric, and it's something you always kind of have to keep an eye on.
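Here is a minimal sketch of per-processor time-to-first-byte tracking, assuming each processor is an async generator streaming bytes, which matches the streaming shape of most voice stacks; the names are hypothetical:

```python
# Wrap a streaming processor and log how long its first chunk takes.
import asyncio
import time
from typing import AsyncIterator

async def with_ttfb(name: str, stream: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    start = time.perf_counter()
    first = True
    async for chunk in stream:
        if first:
            print(f"{name} time-to-first-byte: {time.perf_counter() - start:.3f}s")
            first = False
        yield chunk

async def fake_tts() -> AsyncIterator[bytes]:
    await asyncio.sleep(0.2)  # pretend synthesis latency
    yield b"audio-chunk-1"
    yield b"audio-chunk-2"

async def main() -> None:
    async for _ in with_ttfb("tts", fake_tts()):
        pass

asyncio.run(main())
```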
Upgrade paths are a big one for us: we need really high transcription accuracy, so we use Deepgram for our speech-to-text engine, and we know that whenever we want to improve that part of our system, we can work with them to fine-tune a better model. Have fallbacks ready: it really sucks when OpenAI goes down for a little bit and suddenly all the concurrent conversations you have are just down the drain, so have a fallback ready for each part of your stack; it's really easy to set that up with something like TensorZero, and lots of other tools will help you figure that out. And then end-to-end testing, which is pretty unique to voice AI: people seem to be settling on telephony as the boundary layer for testing your bot against an external service. We do a couple of different things: the easiest test for us is a fake phone number that just plays an MP3 (if your bot can't talk to an MP3, you probably have bigger problems); next, we create a simulated phone tree with phone-tree-building tools and have our bot pseudo-navigate it; and then there are generative services like Coval where you can have your bot talk to another bot. So, some takeaways for the quote-unquote vertical voice AI engineer: choose your stack wisely, because the better the decisions you make here, the more you can focus on the things that are truly unique to your conversational experience; laser-focus on the last mile, because that's ultimately where you can provide a lot of value and put your agents to work; and ride the wave, because there's so much new stuff happening in this space, and whenever new models come out, you want to be able to use them quickly, and you also want to be able to use them safely. Thank you very much; I'm excited to talk to you all and hear about what's so special about your conversations. [Applause] [Music]

Our next presenter is the head of Applied AI at Ramp, here to teach us how to scaffold our agents wisely. Please join me in welcoming to the stage Rahul Sengottuvelu. [Applause]

All right, while we're getting set up: can anyone find the problem with this slide? Yeah, working on it... there we go, nice. I think I'm the only presenter using Figma slides, so I had to use my own laptop for it. Cool. So, the problem here is: "it's a bitter lesson, but lemons are sour." I only realized that about ten minutes ago, but I like the graphic. A little about me: I'm head of Applied AI at Ramp, and I've been working on LLMs for four years, which is kind of a long time in LLM land; everything really started happening when ChatGPT came out. Back then, I was trying to build what people would now call an AI agent company. We were doing customer support, trying to make our chatbot smarter, trying to figure out what models or tech to use to get it to respond to customers better. We were messing with GPT-2 and BERT, and the models were so frustratingly stupid: the context windows were small, they weren't very smart at reasoning, and it was just incredibly annoying. We wrote lots of code around those models to get them to work at least somewhat reliably, and along the way, as the models got smarter, we had to delete more and more of that code. I ended up seeing a lot of patterns in which code needs to get deleted, how to build agents, and which approaches will scale with more intelligence; and clearly, we're going to continue to get a lot more intelligence. I want to talk about a single idea throughout this talk, through various examples; we'll do some setup, but I'll also have a bunch of demos to drive the point home, and maybe I can convince you that there's a certain way of building agents that's slightly better than other ways. I also built a structured extraction library called Jsonformer. I think it was the first one (I'm not fully sure, but timing-wise it was before all the other major ones), and that was also scaffolding around a model: models were too stupid to output JSON, and we were begging, pleading, and forcing them to act in the ways we wanted.

So, as I said, I have one core agenda item: convey one idea. We'll start with the essay The Bitter Lesson, which all of you have probably read, and quickly go through what it says; then we'll go through a production agent we have at Ramp and three different ways of architecting it; and then I have a demo that pushes on how software and backends might work in the future. Very simply, the idea is that systems that scale with compute beat systems that don't. If you have two systems, and one of them can, without any extra effort, just think more, use more compute in some way, that system tends to beat systems that are rigid, fixed, and deterministic. From that idea it's pretty clear that if you're building systems, you might as well build systems that improve with more compute; that seems like an obvious conclusion from the bitter lesson. Taking it a step further, why is this true? It's because
exponentials are rare: most things in the world just aren't exponential. So when you find one, you should hop on, strap in, take the free ride, and probably not try too hard. There are a lot of examples from history that reflect this: for chess, Go, computer vision, Atari games, people tried to build lots of handcrafted systems and wrote a lot of code. My way of thinking about rigid systems: spending a lot of time grinding weekends writing very clever, well-abstracted software, maybe trying to synthesize human reasoning and thought processes into features, then using those in clever ways to approximate how a human would think. If you fix the amount of compute, that approach will win; but it turns out that if you end up scaling how much search you're doing, the general method always wins in the end, in all these cases: Atari, Go, computer vision.

A little about Ramp: Ramp is a finance platform that helps businesses manage expenses, payments, procurement, travel, and bookkeeping more efficiently, and we have a ton of AI across the product to automate a lot of the boring stuff finance teams and employees do: submitting expense reports, booking flights and hotels, submitting reimbursements, all of that. A lot of the work behind the scenes is interacting with other systems, including legacy systems, to help employees get their work done faster.

So let's actually talk through one of the systems we have today at Ramp, and the different versions of it and how it evolved over time: something called the switching report. It's a very simple agent: all it needs to do is take a CSV in an arbitrary format; the schema could be seriously anything from the internet. These CSVs come from third-party card providers: when people onboard to Ramp, we want to give them a nice checklist and say, here are all the transactions you have on other platforms, and we want to help you move them over; the more transactions that come onto Ramp, the more we can help you, the more you'll use our software, and the more everyone benefits. So the switching report is really just a checklist, but to read people's CSV transactions, we need to understand them, and other platforms have all kinds of crazy schemas. The problem statement: for an arbitrary CSV, how can we parse it into some format that we understand?

Let's start with the simple approach: take the 50 most common third-party card vendors and just manually write code for all of them. Obviously, this will just work. It is some work, not a lot: you have to go to 50 different platforms, download their CSVs, see what schemas they have, and write the code; and if one day a vendor decides to change their format, your thing breaks, but that's okay, you get paged, you wake up, and you go fix it.

Now let's introduce some LLMs. Instead of the over-engineered version where you end up writing 100,000 lines of code, we want a more general system, so in the deterministic flow, in classical scripting land, let's add some calls to OpenAI, or an embedding model for semantic similarity, something like that. Take every column in the incoming CSV and classify what kind of column it is: is it a date, a transaction amount, a merchant name, or the user's name? Then map the columns onto a schema we're happy with.
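Here is a minimal sketch of that constrained second approach: one LLM call per column, with the answer forced into a fixed label set. The model name, labels, and file name are illustrative, not Ramp's actual system:

```python
# Classify each CSV column with an LLM, then map columns onto a known schema.
import csv
from openai import OpenAI

client = OpenAI()
LABELS = ["date", "amount", "merchant_name", "user_name", "other"]

def classify_column(header: str, samples: list[str]) -> str:
    prompt = (
        f"A CSV column is named {header!r} with sample values {samples!r}. "
        f"Which of these is it? {LABELS}. Answer with exactly one label."
    )
    answer = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content.strip()
    return answer if answer in LABELS else "other"

with open("third_party_transactions.csv") as f:
    rows = list(csv.reader(f))
header, samples = rows[0], rows[1:6]
mapping = {
    name: classify_column(name, [row[i] for row in samples])
    for i, name in enumerate(header)
}
print(mapping)  # e.g. {"Posted": "date", "Amt": "amount", ...}
```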
With that approach, most of the compute is still running in classical land, and some of it runs in fuzzy LLM land, but this is starting to look like a more general system. Now let's take a different approach, where we go all the way: literally give the CSV to the LLM along with a code interpreter. It can write whatever code it wants (pandas, or all the faster Rust-based tools; it has all these Python packages available); it's allowed to look at the head of the CSV, the tail, whichever rows it wants; and we just say, give me back a CSV in this specific format, and here's a unit test, a verifier, you can use to tell whether it's working. It turns out this approach doesn't actually work if you only run it once; we tried it. But if you run it 50 times in parallel, it's actually very likely to work really well and to generalize across a ton of different formats. The amount of compute here is probably 10,000 times more than the first approach we came up with, but what's truly scarce in the world is engineer time (maybe not forever, but at least today), and we'd rather have a system that works really well. Even at 10,000 times the compute, it will probably cost less than a dollar, and every failed CSV, every transaction that doesn't get switched over, costs Ramp way more money than whatever we spend on this exact architecture.
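Here is a minimal sketch of that outer loop: sample many candidate parsers in parallel and accept the first one that passes the verifier. generate_parser_code() and verify() are stubs standing in for the LLM-with-code-interpreter call and the unit-test-style checker described above:

```python
# Sample N candidate solutions in parallel; keep the first verified one.
from concurrent.futures import ThreadPoolExecutor, as_completed

N_SAMPLES = 50

def generate_parser_code(csv_head: str) -> str:
    # Stand-in: in the real system this is an LLM with a code interpreter.
    return "def parse(rows): return rows"

def verify(code: str, csv_head: str) -> bool:
    # Stand-in: run the candidate in a sandbox against known-good fixtures.
    return "def parse" in code

def solve(csv_head: str) -> str | None:
    with ThreadPoolExecutor(max_workers=N_SAMPLES) as pool:
        futures = [pool.submit(generate_parser_code, csv_head) for _ in range(N_SAMPLES)]
        for future in as_completed(futures):
            code = future.result()
            if verify(code, csv_head):
                return code  # accept the first candidate that passes
    return None  # all samples failed; escalate to a human

print(solve("Posted,Amt,Merchant\n2025-01-02,12.50,ACME"))
```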
So that's a very specific example — how does it apply to the agents we all build, and the systems we're all working on? It turns out something like this generalizes. Look at the three approaches, and assume the black arrows are classical compute while the blue arrows are fuzzy land: it goes into a neural net, all sorts of weird matrix multiplications happen, we're in latent space, it gets all alien-intelligence-y, and then it comes back to classical land. In the first approach there was no AI: we just wrote code, and it mostly worked. The second approach is the constrained agent: we broke into fuzzy land from classical land when we decided we wanted similarity scores or something like that. The third approach is actually flipped: the LLM decides it needs to go into classical land, so it writes some code — some pandas, some Python — and breaks into classical land when it needs to, but most of the compute is fuzzy. Actually, that's maybe not the most accurate picture, because I proposed we run it 50 times, so it looks more like fifty of those in parallel.

If you look at backends in general, they're all request–response: some message comes in — a post, a get, an update, any CRUD operation — and we're really just asking the backend to take this piece of information, run whatever mutations it must, and return a response. Almost all the systems we've built so far, as humanity, look like the first one. But more and more people are using OpenAI — OpenAI makes billions of dollars — and a lot of those systems look like number two, where regular programming languages call into OpenAI's servers to run some fuzzy compute. In more and more parts of the Ramp codebase we're moving to the third approach, because it just tends to work well. And all the blue arrows improve on their own: if we did absolutely nothing — if we all went on vacation for the next year — the big labs would still be working, spending billions of dollars making those models better. So how much blue arrow you use in your codebase directly helps your company, without much effort on your end. That's what I was saying: the bitter lesson is so powerful, and exponential trends are so powerful, that you can just hitch a ride.

Let's take this idea further — all the way to something crazy. On the left you see a traditional web app. Usually the way it works is: you open gmail.com, a static file server at Google sends you a bundle of JavaScript, HTML, and CSS, and your browser renders a nice, user-friendly UI. Maybe you see some emails and click on one; the frontend makes a request to the backend — give me the content for the email with this ID — the backend hits the database and returns the result. Maybe they used codegen tools to build Gmail, but the LLM was only involved while the engineer was writing the code; once the code is pushed to production, it's just classical compute. On the right I'm proposing a different model: the backend is the LLM. It's not codegen — the LLM is doing the execution; it is the backend. The LLM has access to tools like a code interpreter, it can make network requests through that, and it has access to a DB.
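A toy sketch of this "the LLM is the backend" idea, under stated assumptions: every browser event is appended to one chat session and the model re-renders the page. `llm_chat` is a stand-in for whatever chat-completions call you use, and the canned reply exists only so the sketch runs:

```python
SYSTEM = (
    "You are simulating a Gmail client. You have the user's Gmail token and a "
    "code interpreter. After every user event, respond with the markdown UI "
    "that should be rendered next."
)

def llm_chat(messages):
    """Stand-in for a real chat-completions call; returns the model's text."""
    return "# Inbox\n- (17) hello from California\n- (18) greetings from NYC"

history = [{"role": "system", "content": SYSTEM}]

def handle_event(event: str) -> str:
    """event is e.g. 'page_load' or 'user clicked: hello from California (id 17)'."""
    history.append({"role": "user", "content": event})
    ui = llm_chat(history)                 # the model decides what to render
    history.append({"role": "assistant", "content": ui})
    return ui                              # ship this markdown to the browser

print(handle_event("page_load"))
```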
I actually have a mail client that works on this principle, and this is my test email — if you want to see emails show up, you can send me one in the next minute or so, but please be nice. All right: we have this email client. We still have some regular JavaScript to hook the LLM into the browser, but when I log in — it's probably okay, we're good, we're saved; thankfully I have a room full of engineers — the reason it's so slow is that when I open this page and log into Gmail, the Gmail token is sent to an LLM. This is literally an LLM chat session. We're telling it: hey LLM, you're simulating a Gmail client; you have access to all the emails, to Rahul's Gmail token, and to a code interpreter — just render some UI based on what you think is reasonable for a Gmail client's homepage. It looks like it decided to render markdown — I think we actually tell it to render markdown — and it's rendering all the emails a bunch of people just sent me from here. This one says "hello from California," so I'll click on it. When I click, we're not making any backend calls; we're just telling the LLM that the user clicked on that piece of text — in this case "hello from California," plus its ID number. The LLM now knows what the user clicked on and gets the chance to re-render the page, much like a web framework would. It goes back, probably does a get request for that specific email, and pulls the body. ("What is this agent going to do? I'm watching you live.") So the LLM just decided this is the appropriate UI for a Gmail client, and it added other features it thought were reasonable — it looks like I can mark as unread, or delete the email. Maybe I'll delete it, since it's not that good an email — sorry. It's very slow, because we're doing a lot. But I wanted to push you in this direction, because this kind of software barely works — dang, I guess not; also I clicked, and now the LLM is trying to do something with my click — anyway, this kind of software barely works today, but that doesn't mean it won't work in the future. With exponential trends, this might just take off. So I wanted to push you all to think in this direction. Will more software look like this? I don't know — we'll see. Thank you. [Applause]

Ladies and gentlemen, please welcome back to the stage the MC for the AI Engineer Summit agent engineering day, the founder and CEO of Superintelligent, NLW.

All right guys, thank you Rahul, and everyone else who presented. The theme of this whole event is agents at work, and one of the things we called out this morning is that what makes it so different from events we've had in the past is how much this is about real-world happenings — what's actually being built, and the challenges we're facing in deployment and in production. I think this session was a great example of that. We're headed into another break now, about 30 minutes. A quick reminder: you can meet speakers in the Q&A lounge, or check out the sponsor Expo — there's also coffee and snacks down there. See you in about half an hour.

That concludes this session. Please enjoy one final break in the Expo, with sponsor demos, food and drinks, and a special panel on the Expo stage. [Music]

Please welcome to the stage the MC for the AI Engineer Summit agent engineering day, the founder of Turing Post, Ksenia.

Hello! So this is our last stretch of sessions. It's been such a long day, and I'm still excited. There will be three sessions — there's one change, but you don't want to miss these, they're amazing. The first is on creating agents that co-create. The second is about education — this is the change — it's called "The Next AI Engineers." And the last one
will be about what it takes to build a personal, local, private AI agent that augments you deeply. It's all about building now — building AI that is truly integrated into our lives, starting from a very early age. With that, please help me welcome our next speaker. She has very interesting experience working on both ChatGPT and Claude — please welcome Karina Nguyen. [Applause]

Hey everyone, my name is Karina, and I'm an AI researcher at OpenAI; before that I worked at Anthropic for about two years on Claude. Today I'd love to talk about the scaling paradigms that have happened over the past two to four years in AI research and how those paradigms unlocked new frontier product research. I'll also share some vignettes and lessons learned from developing the Claude and ChatGPT products — some design challenges — and how I think about the future of agents as they go from collaborators to co-innovators. I'd also love to invite you to engage, so I'll be happy to answer questions at the end.

Probably the majority of you know this, but there are two scaling paradigms that have happened in AI research over the past few years. The first paradigm is next-token prediction — you may have heard it called pre-training. What's amazing about next-token prediction is that it's a world-building machine: the model learns to understand the world by predicting the next word. Fundamentally this works because a given sequence is caused by an initial action and is irreversible, so to predict it the model has to learn some of the physics of the world. And the token can be anything — the tokens we pre-train on are strings, words, pixels. To predict what happens next, the model needs to understand how the world works; that's why pre-training worked. You can think of next-token prediction as massive multitask learning. During pre-training, some tasks are really easy to learn — translating a word into French, say, or facts like "the capital of France is Paris" — because that information is abundant on the internet and in knowledge artifacts, so the model has an easy time learning it. But the reason compute matters so much at the pre-training stage is that there's a class of tasks that is really, really hard to learn. The model learns a lot about physics, problem solving, generation, and logical expressions, and some spatial reasoning (though it's not perfect). But for complex tasks like math, the amount of computation the model has to do within a single next-token prediction is really high — which is why you need chain of thought, spending more compute at inference to help the model reason through computationally heavy tasks.
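As an aside, here is a toy illustration — not from the talk — of what "just predict the next token" means mechanically: one cross-entropy objective over every position in the sequence, which is why it amounts to massive multitask learning. Shapes and random tensors are made up for the sketch:

```python
import torch
import torch.nn.functional as F

vocab, seq_len, batch = 50_000, 128, 4
logits = torch.randn(batch, seq_len, vocab)          # stand-in for model output
tokens = torch.randint(0, vocab, (batch, seq_len))   # the training text

# Predict token t+1 from everything up to t: shift targets left by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    tokens[:, 1:].reshape(-1),
)
print(loss)  # one scalar objective covering "every task at once"
```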
Another class of tasks I was thinking a lot about is creative writing, which is actually really hard. The reason is that the model can predict the style of the writing quite nicely, but a lot of creative writing is world building, storytelling, and plot — and it's much easier for the model to make a next-token mistake that completely deteriorates the plot coherence that stories depend on. Creative writing is an open-ended research problem in itself, because it's really hard to measure what good creative writing is and what it isn't. Obviously we'd love models to invent new forms of writing and be extremely creative in their generations, but one of the hardest AI research problems today is getting models to write novels — to keep stories coherent over long stretches of time.

The era of 2020 to 2021 was about scaling pre-training, both at Anthropic and at OpenAI, and one of the first products of that era was GitHub Copilot — I thought autocomplete was a completely fascinating product. The model had learned so much about code through next-token prediction over billions of code tokens from GitHub, open-source projects, and so on. What made autocomplete — the tab-tab-tab in Cursor or GitHub Copilot — actually useful is that researchers constrained it via RLHF (reinforcement learning from human feedback) and RLAIF (reinforcement learning from AI feedback). That's where the era of post-training took off: in post-training we teach the model how to complete function bodies, understand docstrings, generate multi-line completions, and predict and apply the next diffs. And I think we're still in that era — there's much more to be explored in the post-training stage, in RLHF and RLAIF, to push models' ability to reason through complex codebases.

The next paradigm in AI research happened last year, published by OpenAI with a new model, o1: scaling reinforcement learning on chain of thought — what we call highly complex reasoning. You spend a lot more test-time compute, and you train to scale reinforcement learning, and the reason it works is that the model learns how to think during training, learning from feedback with really good signals in the RL loop. On the left you can see the output of the original GPT-4, and on the right the entire chain of thought the model produced to solve a complex problem. As we move to harder and harder tasks — if you want the model to go from translation to solving medical problems — you need the model to spend a lot of time just thinking through the problem, and you need to create more complex environments, with tools, where it can think through and verify its outputs during the chain of thought. The chain itself is very interesting — the model has certain words it favors — and there's a lot of science still to be done on the faithfulness of the chain of thought: how do we measure faithfulness? What happens if the model goes in the wrong direction — can it backtrack? We're only at the beginning of that.
One of the first projects I did at OpenAI was about how the interaction paradigm is different now: the model thinks for a long time to solve a hard problem, so how do we create a new interaction paradigm with humans so they don't have to wait 15 seconds — or 30 minutes — for the model to come back? One simple thing we did was stream the model's thoughts to the user; we had to figure out how to summarize those thoughts and communicate them wisely to the human. It's still one of the open design challenges: as model capabilities and interaction paradigms change, you get new design challenges to solve for these types of models.

This year, at OpenAI, is the year of agents. The way we think about it: highly complex reasoners — models trained with RL on chain of thought — using real-world tools such as browsing, search, and computer use, over a long horizon and a long context. But what's the next level? In my view, the next level is co-innovators: agents built on everything we've done with reasoning, tool use, and long context, plus creativity — and creativity is enabled only through human–AI collaboration. That's what I'm really excited about for the future: creating new affordances for humans to collaborate better with AI, so we can co-create the future we want.

Those two scaling paradigms have unlocked a new kind of product research. You might imagine product research as "we have an API for the model, now we have to integrate it into products," but what's actually happening on the ground is a very rapid iteration cycle of product development. We can use the highly capable reasoning models to distill down to smaller, faster models we can iterate with, and we can use the complex reasoners to synthetically generate new data — new post-training datasets, new reinforcement learning environments. One thing we can do is create completely new classes of tasks: if the task is multiplayer human–AI collaboration, you might want to simulate different users — synthetically generate datasets conditioned on different users, and train on that. It really depends on the product experience you want to create; you extrapolate that into a new class of tasks to post-train into the models. We're also moving toward more complex reinforcement learning environments, where models can use search, browsing, or collaborative tools like canvas during RL, so they learn to become better collaborators. We can also leverage in-context learning: models are now so good that you can introduce a new tool and the model will learn it from just a few-shot examples — an extremely rapid iteration cycle for any developer. Add synthetic data and distillation, and we can even invent new model behaviors and interactions that make use of user feedback.
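The in-context tool learning Karina mentions can be made concrete with a minimal sketch: no fine-tuning, just a prompt showing the call format. The tool name, prompt format, and `llm` stub are all assumptions for illustration:

```python
FEW_SHOT = """You can call a tool by emitting a line like:
TOOL solar_panel_quote(zip="10001", roof_sqft=400)

Examples:
User: How much would panels cost for my 600 sq ft roof in 94103?
Assistant: TOOL solar_panel_quote(zip="94103", roof_sqft=600)

User: Quote me for 250 sq ft in 60601.
Assistant: TOOL solar_panel_quote(zip="60601", roof_sqft=250)
"""

def llm(prompt: str) -> str:
    # Stand-in for a real completion call.
    return 'TOOL solar_panel_quote(zip="10001", roof_sqft=400)'

reply = llm(FEW_SHOT + "\nUser: 400 square feet, zip 10001.\nAssistant:")
if reply.strip().startswith("TOOL "):
    print("model invoked the new tool:", reply.strip())
```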
Now let me go through some vignettes, starting from Anthropic. The first concept I learned is bringing an unfamiliar capability into a familiar form factor. The reason 100K context was successful is that file uploads turned out to be an extremely familiar form factor — everybody works with documents. We could have shipped 100K context as infinite chat instead — one huge, endless conversation to interact with — but finding the simplest familiar form for an unfamiliar capability is one of the design challenges of this new era.

The second project I worked on at OpenAI was ChatGPT Tasks, and I didn't fully appreciate this until it shipped: reminders and scheduled tasks are a very familiar thing people do almost every day, but what's special is that you can scale that familiar thing with new model capabilities. ChatGPT Tasks isn't just scheduled reminders and to-do lists: you can ask the model to continue a story for you every day, or to search everything you're interested in every day or every other day — you can even teach yourself a new language through the richly multimodal, interactive visualizations it creates. The lesson I took from Tasks: a product feature should enable modular compositions that scale gracefully as the models develop higher capabilities.

Another design challenge is bridging real-time interaction with asynchronous task completion, where we ask the model to go off for ten hours to research or write code and come back with a solution. The bottleneck there is trust, and I believe trust can be built by giving humans new collaborative affordances to verify and edit model outputs, and to give models real-time feedback so they can self-improve. One of the first products from Anthropic was actually Claude in Slack — the first attempt at a virtual teammate in an organization. It was an amazing concept, because Slack already had the affordances — tools, image uploads, multiplayer collaboration. There's still a lot we can take from the lessons of Claude in Slack into the next generation of products; Tasks was very much inspired by the Claude-in-Slack prototypes, where Claude would summarize Slack channels across the organization every Friday for everybody.

My first project at OpenAI was canvas. I believed human collaborative affordances could scale and create new creative capabilities, and what I loved about canvas, and the way we operated as a team, is how flexible an interface it is. Here are some of the vignettes: the canvas itself can become a co-creator and co-editor, with very fine-grained editing interactions; the model can run search to generate a report, and you can ask questions back — "verify this output." And you can imagine this interface scaling to multiplayer, where other people join your document, or even to multi-agents:
I can create a model critic or editor, and you get multi-agentic and multiplayer collaboration at the same time — a new design challenge we need to navigate. I'm also excited about personalized tutors: the models are becoming so multimodal and flexible that you can learn new things in the way you like — if I'm a visual learner and you're more of an auditory learner, the model can adapt to each of us. One thing I did yesterday, on the plane: I used canvas to create a game for myself. I really like this generative entertainment on the fly. Anyone can create their own tools and web apps now, and I'm not sure what the future will look like, but I think it will be amazing when a person who has never touched code in their life can create the tool they really wanted, deploy it for themselves, or start a business from scratch. There's something around pair programming and code creation we can use to create the future we want. Canvas has also become more of a pair programmer: it's so flexible because it was trained to be collaborative for both writing and coding; it has tools like search and can look up API documentation; and it can become a data scientist too — upload an entire CSV and it can generate real-time analysis.

Finally, what I'm really excited about — what everybody in AI is excited about — is helping models become better at research, at creating new knowledge. Here the model and the human can co-create a document, an artifact that has never existed before. This is a demo with a paper I co-published, where I ask the model to reproduce it — reproducing a paper, or an open-source GitHub repo, is maybe one of the most common tasks in research — and you get this nice interactive paradigm where, because the model can also leverage its own internal knowledge, you and the AI can come up with new research hypotheses together, verify research directions together, and delegate tasks to an AI assistant.

Looking ahead, I'm excited about a layer of invisible software creation for everyone — especially people creating their own software tools right from their phones. The way you interact with AI fundamentally changes the way you access the internet. My prediction is that you'll click less and less on links, and you'll access the internet through the model's lens, in a much cleaner and more personalized way, with very personalized multimodal outputs. Say I want to learn about the solar system: instead of a text answer, it should give me an interactive three.js visualization of the solar system, with richly interactive features to learn more. And there will be this cool future of generative entertainment on the fly, where people learn and share new games with each other.
The way I think about it, the interface to AGI is a blank canvas that morphs itself to your intent. If you come to work today intending to write code, the canvas becomes an IDE, like Cursor — though the future of programming may change. If you're a writer and you decide to write a novel together, the model can start creating tools on the fly for you, so it's much easier to brainstorm, edit the writing, create character plots, and visualize the structure of the plot itself. And finally, I think co-innovation will happen through creative co-direction with the models — collaboration with highly capable agentic reasoning systems, capable of superhuman tasks, to create new novels, films, games, and ultimately new science, new knowledge creation. Thank you so much — that's the end of my talk. [Applause]

"The next generation of AI engineers will be learning more and more from language models and agents like me. Here to provide a glimpse into the future of educating the next generation of AI engineers is Stefania Druga, research scientist from Google." [Applause]

Thank you so much. Hello New York — how are you doing? I know you've been hearing a lot of amazing talks, so before I get started, a quick show of hands: how many of you are here for the first time at an AI engineering conference? Wow, that's amazing. How many of you are international, from outside the US? Incredible — thank you for coming. Today I'm going to talk about how we open up this conference — this knowledge — to the next generation of AI engineers, who are young and who start much earlier. Why does that matter? Because 70% of AI users are from Gen Z. We've seen the potential of multimodal AI to transform education; we know students are using generative AI tools for their homework, and I think they could use them in much more interesting ways; and we know that in many cases they prefer them to human tutors. I want kids to be among the engineers and designers who create the tools they use.

So I turned to Scratch — I don't know how many of you know Scratch. There are over 100 million children using Scratch worldwide. It's a free, open-source visual programming platform for kids, developed at MIT — I was part of that lab during my master's. In 2015 I started working on Cognimates, which extends Scratch to let children learn about AI by building games, training their own AI models, and programming hardware as well. It has blocks that look like Lego bricks, and kids snap them together to create their programs. Does anyone want to guess what this program does? Any guesses? Tiana here, who was nine at the time, wrote this program and then played with the robot for half an hour, which was super fun. It's a hide-and-seek game: she would run around the room, and every time the robot detected a person — the number of people greater than zero — it would say "I see you." There is a problem with the program, though — can you spot it? How many times will we play if we leave it like this? Yes — so what do we need, Nick? We need a loop. Exactly.
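For readers who don't know Scratch: here is the logic of that hide-and-seek program transliterated into Python, a hypothetical stand-in for the actual blocks. `count_people` stands in for the camera/vision block; Scratch's "forever" block is a `while True`, bounded here so the sketch halts:

```python
import random
import time

def count_people() -> int:
    # Stand-in for the vision block: how many people the camera sees right now.
    return random.choice([0, 0, 1])

# Without this loop, the check below runs exactly once — the bug the kids
# spotted. Wrapping it in Scratch's "forever" block is the fix.
for _ in range(20):                # bounded stand-in for "forever"
    if count_people() > 0:         # "if number of people > 0"
        print("I see you!")        # the robot's "say" block
    time.sleep(0.2)
```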
The time to fun — the time to play — is very short in Scratch, which is why I used it when I built Cognimates. It has the library of blocks, the coding area, and a stage, and I'll show you a quick video of how kids use it to learn about AI.

[Video] "We were programming robots — you could play rock paper scissors." "You did rock paper scissors into the camera, and on 'shoot' you did one of the motions, and the computer did one of the motions." "The computer gets better as you play the game — because, like us, we might not know everything at first, but if we keep trying, we get better." "Everyone has heard about machine learning or artificial intelligence..." "For a lot of the more tech-savvy parents it was no-questions-asked: go for it — technology is going to be a huge part of their lives, much more so than my life." "If this AI technology is scary for some people, I totally get it. But as a parent and as a teacher I thought it was really important, because these are skills 21st-century kids need to have." "When my dad was young he bought a car and took it apart to see how it worked. So you teach people that young how these things — that grown-ups mostly program — actually work."

We did this in 2015–2016, so these AI engineers started very early. At the time they were training custom models — basic classification models for images and text — but they also had access to the entire library of extensions we built for them: off-the-shelf sentiment analysis and image classification; programmable voice assistants (they realized voice assistants were really limited, so they programmed them to remember things about them and their preferences); and micro:bit robots. This is what the training page looks like: they can drag and drop example images. For instance, one kid wanted to make a game about unicorns and narwhals, so that's what his training data looked like; then in the program he chooses his custom model — unicorns versus narwhals — shows different drawings to the camera, and sees what the model predicts. Moreover, they can see the confidence level of the prediction, which really helps them understand: "In this case it guessed my drawing was a narwhal, but the confidence is very low. What do I need to do? I'll go back to my training data and add more hand-drawn examples, because most of my images are cartoons." It builds AI literacy and data literacy, and it demystifies everything kids hear about AI — that it's the evil Terminator, and all sorts of things.

And they built projects across different domains: a project where they could look at what's in their food by turning a webcam into a microscope; games like the rock-paper-scissors you saw in the video; a literature program where you speak and it analyzes whether what you said is in the style of a famous writer, or various other styles. What's important — and I tested this with kids in public and private schools and in community centers — is that kids are actually little scientists. If we give them the right tools, they engage in the scientific process: they formulate hypotheses about how the model or the robot works, they test those hypotheses, and they refine their understanding in the process.
It's the same with model training: we need to create tools that let them engage in the scientific process as fast as possible — tools that are fun and sticky, that they want to use. And why does this matter? Before letting them program with Cognimates and train their own models, I asked them questions about voice assistants and smart toys and smart robots: do you think it's smart? Do you trust it? Do you like it? Is it friendly? I asked the same questions at the end of the study (we didn't have ChatGPT or Gemini at the time), and what I found was a significant difference in intelligence attribution after they went through this process of learning to train a model, learning to program it, and understanding why the data matters. So it does make a difference in demystifying the "intelligence" we talk about. The platform is used around the world — it's translated into 50 languages.

After working on this, I realized it's not just the kids: the pandemic came, and a lot of young people were stuck at home, so I had to think hard about how to create for, and work with, families. I started running other experiments to figure out what kinds of tutors, games, or platforms families could use at home when they want to learn to code with their kids. I'll show you an early prototype.

[Video] "Hi there, I would like to know your name — so let's make a program that lets me learn it. Let's start with the green-flag block. There you go, you did it! Now I need you to help me ask a question; for that we'll need the 'ask' block — see if you can find it. Awesome."

In this case they're learning how to program a robot — it was a Jibo robot — and the robot itself participates in the process, so you're having this reflective conversation with the very thing you're programming, which is pretty cool. We could do much more than that now. But not everyone can afford to buy a robot — it shouldn't be a required thing — so I wanted to build something similar, a pair companion for programming, that works in the browser. During the pandemic I ran design studies with families in ten different US states, from very different backgrounds and very different ethnicities — this is very important, and I want to highlight it. Before building the system, I mocked it: we didn't have a functional copilot or a functional assistant, so I was the AI. It was a Wizard-of-Oz study — the kids didn't know initially (we told them afterwards) that they were interacting with a person via chat. I really wanted to understand what they want, and what kinds of support they need from a pair programmer when they code in Scratch with their parents. What I found is that they really want help generating coding ideas. They don't want the copilot in Scratch or Cognimates to do everything for them; they want to brainstorm — "what if I make a game about bears?", or "I'm into soccer, give me some ideas." That was a really big one. Here's a quote from one of our participants, who was 12: "Most people would like coding with AI friends, because one of the hardest parts of a project is when you start and you run into a wall — you're out of ideas." So it really helps with the ideation process.
It also helped them express and elaborate their ideas: if they were building a Pong game, it would ask, "OK, so how do we make the ball move? How do you make it move faster?" — so it was very helpful in that regard as well. And it supported their creative coding identity. Another quote I like: "Sometimes when you code it gets frustrating; when you finally get it to work, it's so good that it lets you feel good. It's good to have someone who says 'good job.'" Interestingly, the kids in the study spent double the amount of time programming that they normally would — I got that from the parents. It didn't always work, though: sometimes the parents were needed in the loop, when the assistant was too distracting, or when it couldn't moderate turn-taking between siblings — agents have limitations — or when it couldn't explain the most complex concepts. In Scratch there's "clone" (creating multiple instances of the same object) and "broadcast," and those were harder to explain.

After this first design study, having identified the core features kids and parents want, I created an evaluation benchmark: over 100 examples of Scratch programs, which I ran against different state-of-the-art models to see how good they are at explaining Scratch code, explaining it with learning exercises, debugging it, and generating ideas. The results were very promising, so the next step was to build it. This is the first time I'm showing these results — I just finished running the study, so it's fresh from the oven. I tested this Cognimates copilot with 18 young people from 11 different countries, in different languages. It's very simple: the code editor, plus an AI chat. It sends a message to a web server and gets a response back; I'm using and evaluating different models, including a model fine-tuned on Scratch projects. It can also generate assets — art and images for their games.
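A minimal sketch of that round trip — the browser editor posts the kid's message (plus project context) to a small web server, which forwards it to whichever model is being evaluated. The endpoint URL, payload shape, and response field are assumptions for illustration, not Cognimates' actual API, and the call needs a running server:

```python
import json
from urllib import request

def ask_copilot(message: str, project_xml: str,
                server: str = "http://localhost:8000/chat") -> str:
    """Post the student's message and current project to the copilot server."""
    payload = json.dumps({"message": message, "project": project_xml}).encode()
    req = request.Request(
        server, data=payload, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:       # server forwards to the model
        return json.load(resp)["reply"]      # assumed response field
```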
Here's an example from a session with a kid from Mexico — I speak several languages, so I could run sessions in different languages, and so can the copilot. Analyzing the sessions from all 18 kids, I found the copilot provided all sorts of support: conceptual support, design support, positive encouragement, platform navigation. There were instances where it failed, and also lots of instances where kids refused the suggested help or the suggested ideas. Here's an example of code support: a student from Jamaica who had never programmed in Scratch before went from zero to a fully functional program with the copilot's support. It was very helpful not only in giving him ideas but in helping him navigate the platform for the very first time — "this is where you find the loop block," "this is how you create variables." It was helpful for newcomers and for advanced kids alike; it generated assets they really liked, and it gave them ideas for how to refactor (it didn't call it "refactor" — "improve the code") or how to add new features and new levels.

It was interesting because kids also used it in ways I didn't predict: beyond generating the background for a game, they wanted ideas for character names and for plots. And here's a great example where a student pastes an image of the code, asks for an idea, and doesn't like the answer — "I don't want to do that type of movement" — and the copilot says, "No worries, it's your game, have fun." We want tools that prioritize and encourage young people's agency in this process.

Some lessons learned. Prioritize user agency. Balance the support and the challenge: by default the copilot doesn't give the answer right away — it asks questions, and only if the student is really stuck, asking the same question three times, does it give a hint. See these agents as motivators and starting points: like the blank-page effect in writing, Scratch has what we call a cold start — a lot of students come to the platform and really don't know where to begin, and it's very helpful there. Allow flexibility and customization: some kids really wanted voice, others only wanted to type; some always wanted three ideas, others said one good idea is enough — everyone wanted different things. And design agents that support creativity in situ: a lot of participants wanted the agent to be able to move the blocks with them; they wanted simulations of agents programming — "how would an agent build a Pac-Man game?", "I want to see five agents collaborate to build this asteroid game" — and when it generates assets, they wanted them directly on the stage instead of in the chat. So the next phase of the prototype is to integrate the agent at all the different stages of the UI, and to support multimodal AI capabilities like sound generation and maybe reacting to the camera stream. And — the part I think is very important — when it doesn't work, or can't do what they ask, it should tell them why: "I can't generate this type of image because my training set doesn't include it," or "I can only answer questions about Scratch, because that's what my prompt is." It should be really transparent and explain its limitations, to set the right expectations.

I'm going to show you a quick demo — let's see if it works. This is just a prototype, and I learned I was giving this talk two hours ago, so please be kind. When you arrive and don't know what it can do, you can just say hi — "What do you want to work on in Scratch today?" What should I say? A racing game. "I love that! A racing game... a turbo boost button for extra speed..." — that sounds great. The cool thing is that it actually integrates with Scratch, so I can go get any project from Scratch — people build really crazy stuff there. For instance, people have actually built OCR implementations in Scratch.
This is a bit slow, but if I find a project I like, I can load it into my Scratch — I have one downloaded, so I'll load that. First let's see how it works: I can draw any number here and ask it to recognize it — OK, it thinks it's a two, then an eight. This is a pretty complex program, so if I don't know how something works, I can take a screenshot — oh, I did not expect to have two screens — attach it to the chat, and it will explain what the code does... I can't get the screenshot from this setup, but you get the idea. It's free and open source, as I mentioned; there are a lot of new features coming, and I hope you'll contribute, give us feedback, or share it with your young friends.

The reason this matters is that AI literacy is now actually part of the law. I don't know if you know, but this passed earlier this month as part of the EU AI Act, and it basically says that all providers and deployers of AI systems should take measures to ensure, to the best of their extent, a sufficient level of AI literacy among their staff — but also among the users of their products. To ensure that AI literacy, we need to start early, and that's what I'm hoping, and trying, to do with my work. If you want to learn more, there are lots of papers and studies about AI literacy and AI education — including work in other domains, like math misconceptions and science — all on my website. Thank you so much; I don't know if we have time for questions, but thank you. [Music]

"What does it take to build a personal, local, private AI agent that augments you deeply? We are pleased to welcome to the stage the co-founder at Meta of PyTorch, Soumith Chintala." [Music]

Hello, hello — how's everyone doing? Last talk of the day, last talk of the conference — hopefully I'm not the most boring. First of all, who am I? Do you know this thing called PyTorch? A lot of people in AI used to know it, but now a lot of people just use high-level APIs and don't know what's powering things underneath — PyTorch is probably the software powering your AI APIs. I work on it, I co-founded the project, and it's a big project, majority-funded by Meta, where I work. I'm not talking about Llama today — I work on Llama a little bit, but unfortunately I'm not in charge of it, so I can't sneak out any secrets for you; I'm not going to tell you when the next Llama is coming, or anything like that.

So why am I thinking about personal, local agents? As AI started becoming more and more useful, the thing that saved me the most time every single day was swyx's AI News. I have to keep up with everything going on in AI — that's my job — and instead of spending three, four, five hours a day looking at a bunch of sources, it aggregates the news for me. It was one of the first applications I found mind-blowingly effective for my own personal productivity, and that's when I started trying to augment my day-to-day with AI in a deeper way. That's not an agent, though.
AI News is more of an aggregator — but that's how it started. The other thing is that I also work on robotics, and robots are essentially agents: they act in the world. My goal is to build home robots so I don't need to do any errands, and as part of that journey I've been trying to understand AI agents more deeply.

The key takeaway I'm going to drill into today: agents — especially personal agents — have so much agency to take actions on your behalf, and so much of your life context, that you're better off keeping them local and private. I'll try to sketch a plan for how to do it, though I don't think I have a complete solution either.

First: what is an agent, and why did I say swyx's AI News is not one? An agent is something that can act in the world — something that has agency. Anything that only gathers context and does things with it, but ultimately can't act in the world, is not an agent; that's how I think about it. And a highly intelligent agent without the right context is as good as a bag of rocks — really useless. A couple of quick examples. Say I build a personal agent with access to my Gmail, my WhatsApp, my calendar. I ask, "Did my prescription get renewed?" It says, "No, not yet" — and it's totally wrong, because the text from CVS arrived on my iMessage, which it didn't have access to. It was doing the best it could with the information it had, but without the context, it can't do better. You can make up a hundred examples like this: it has access to one bank account, but your money came in on Venmo, and the agent "lied" to you. A personal agent without the right context is largely irritating to use: you don't know when it's useful and when it isn't, so it's essentially not useful — even when it gives you an answer, you wonder, "Is this actually right? Do I have to go dig in?" Unless it hits a level of reliability and predictability where you know it's right, it isn't actually useful to you.

So how do you get all this context to the agent — say you have the OpenAI API, some other API, or a local LLM? What is all the context in the world that's personal to you, and how do you give it to the agent? The number one thing you might want is wearables: your AI should see everything you see and hear everything you hear. That's obviously the best case for providing context to your agent — except there's no battery life for any of these wearable things, so it's not really practical; maybe one day, when you have crazy batteries. The next thought: most of my life, in the ways that matter from an agent's perspective, is on my phone — so what about running an agent on my phone, in the background, always
watching my screen? Well, that's where Apple kicks you: they don't let you run a bunch of stuff on your phone asynchronously, and even when they do, there are a lot of restrictions. The ecosystems kill you by not allowing it — and unfortunately, I use Apple. So that's out. The next option — the one I've found relatively useful if you live in the Apple ecosystem — is to get a Mac mini, put it somewhere in your home, and connect it to the internet. You can run your agents asynchronously on it with no battery-life issues, you can log into all your services there, and it can also reach the Android ecosystem, because Android is actually open. (I work at neither of these companies, so I can say what I want.) I think that's a feasible device for running your AI agent right now.

Next: why local and private? Why not just run this in the cloud — subscribe to one of the large tech companies' agent services and run your life out of it? A few points. First, I want to talk about how this is different from using other digital services, because I think it is meaningfully different, and easy to understand. A lot of you in this room probably use a free cloud email service for all of your life — your taxes go through it, everything personal goes through it. Why do you trust it? The reason I trust it is that it has a very simple mental model of how it acts on your behalf: email in, reply out. It's not trying to do something sneaky and unpredictable underneath you. Your trust in a service is correlated with whether you understand how it behaves on your behalf. Now imagine tomorrow that email service you've used forever says, "For some of the emails I'm confident about, I'll auto-reply on your behalf." What's the worst-case action it can take? Maybe it replies to my boss with something nasty — and I don't want that to happen. Once the action space becomes powerful enough and unpredictable enough, you get uncomfortable using a service you're not fully in control of. And it can get worse: companies have to monetize in a million ways — what if, every time you ask a shopping query, the agent only buys from whoever gives them kickbacks? Your personal agent is so personal to you, so intimate, that ultimately you want to be in control, in ways you eventually can't be when you have to trust an online service. That's one of the biggest reasons I want to build a personal agent that's local to myself.

The second is decentralization. You already see these ecosystems that are walled gardens, fighting with each other, refusing to
interoperate in various ways. If you build your personal life — your personal agent — around one ecosystem, is that something you want? It works fine for compartmentalized things like maps and email, but do you really want to subscribe to one ecosystem for an agent that can take so many different kinds of actions on your behalf, in your day-to-day life? That's the other reason I feel we, as a world, should try to make local, personalized agents the norm.

And the third one is what I call: are you going to be punished for your thought crimes? You have a thought, and it's not a good thought — should you be punished for it? Usually the answer is no. Now, if you have a personal agent augmenting you in such an intimate, personal way, you might ask it things you would never say out loud. Do you really want to take the risk of putting those into some provider? Even enterprise-grade cloud API contracts — not the consumer-grade ones where providers get sloppy — involve legally mandated logging and safety checks. That's a risk you may or may not want to take, but for me, I never want to end up in a scenario where I'm prosecuted — or persecuted — for my thought crimes. That, at least for me, is another really powerful argument for focusing on local agents for the most personal augmentation.

So — I hope you're convinced that if you're going to build a personal AI agent, it has to be local and private. What's the problem? Well, first, you've got to run this stuff. There are great open-source projects for running local models, which are one of the key components of these agents: vLLM and SGLang are pretty great, and they're both built on top of PyTorch. (One time we wrote a bug in PyTorch, and a bunch of us had Tesla cars — Tesla uses PyTorch — and we thought, this is scary: are we writing bugs onto ourselves? It was totally fine; the bug wasn't that bad.) So yes, vLLM and SGLang are great, but local model inference today is still slow and limited — not as fast as a cloud service, even if you spend enough money on a beefy machine. That's also changing rapidly: a 20-billion-parameter or distilled model runs pretty fast locally, but the latest R1, full and unquantized, runs super duper slow. I think this is in a state where it will fix itself — you just probably won't get to run the latest and greatest for a while.
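For reference, here's a minimal local-inference sketch with vLLM, one of the two projects Soumith names. The API calls shown are vLLM's actual offline-inference interface; the model id is an assumption — pick whatever fits your hardware, and note that running this requires a suitable GPU and a model download:

```python
from vllm import LLM, SamplingParams

# Model id is an assumption; substitute any model your machine can hold.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize today's AI news in three bullet points."], params
)
print(outputs[0].outputs[0].text)
```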
The technical and infrastructural challenges will get to a place where they're fine; the bigger gaps are in research and product, and those are open challenges for this room, for all of you AI engineers. One: the open multimodal models are good, but not great, in a couple of areas. One is computer use — even the closed models, the latest and greatest APIs you can pay for, are not that great at computer use; they break all the time, so that definitely needs to get to a better state. The other thing I've noticed: if I ask a model to shop for me — clothes, shoes, furniture, whatever — it basically gives me the most boring, basic stuff, and the more specific I get about my taste, the more it gives me the same thing. You ask for a red velvet sofa with oak wooden legs, and it gives you a green sofa that's velvet and doesn't have oak wooden legs. The models aren't very good at identifying visually what you're asking for; they mostly rely on a bunch of text matching.

The other gap — and this is a big one — is that we don't have good catastrophic-action classifiers. What do I mean by catastrophic actions? An agent can take many actions, and a lot of them are reversible or harmless: even if it takes the wrong action — it went to this Wikipedia link instead of that one — big deal; it backtracks and moves on. But some actions are actually catastrophic: you ask it to reorder your Tide Pods and it goes and purchases a Tesla. That's not the best outcome for you. There is some open research on getting agents to identify catastrophic actions before taking them — and perhaps notify the user instead — but not enough. If we want to really trust our agents, personal or in the cloud, we've got to get better at this.
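A sketch of what such a gate might look like: before the agent executes a tool call, a classifier decides whether the action is irreversible, and irreversible ones get routed to the human. The classifier here is a deliberately naive stub with made-up tool names and thresholds — the talk's point is precisely that a *good* version of this is still open research:

```python
from dataclasses import dataclass

@dataclass
class Action:
    tool: str          # e.g. "browser.click", "shop.purchase"
    args: dict

# Hypothetical registry of tools whose effects can't be undone.
IRREVERSIBLE_TOOLS = {"shop.purchase", "email.send", "files.delete"}

def looks_catastrophic(action: Action) -> bool:
    if action.tool not in IRREVERSIBLE_TOOLS:
        return False       # reversible/harmless: let the agent backtrack freely
    # Naive extra check: big purchases are worse than small ones.
    return action.tool != "shop.purchase" or action.args.get("price_usd", 0) > 50

def execute(action: Action, run_tool, confirm_with_user) -> str:
    """Gate every tool call through the classifier before executing it."""
    if looks_catastrophic(action) and not confirm_with_user(action):
        return "blocked: waiting for user approval"
    return run_tool(action)

# Tide Pods sail through; the accidental Tesla gets held for a human.
print(execute(Action("shop.purchase", {"item": "Tesla Model 3", "price_usd": 42_000}),
              run_tool=lambda a: "done", confirm_with_user=lambda a: False))
```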
Open-source voice mode is also barely there. When I need a personal local agent, I definitely want voice mode, because sometimes I want to talk to it and not type out everything I want to say.

But still, why am I bullish about this whole thing? I am, because, one, I see open models compounding in intelligence faster than closed models relative to how many resources are being put into them. What do I mean by that? OpenAI is only improving their own model; Anthropic is only improving their own model, with all the billions they have. But open models are improving each other, in coordination, across the board. People didn't really believe it until Llama came out, and then they didn't believe it until Mistral came out, and then until Grok came out, and then until DeepSeek came out. Basically, people keep saying open models will not really win, but I think they are winning. I've worked in open source all my life, and there's a starting coordination problem: initially you don't have enough critical mass to coordinate with each other, but once you have a critical coordinated mass, open source starts winning in an unprecedented way. You see that with Linux; you see that with a bunch of projects. So I am pretty bullish that open models will actually start getting better than closed models, per dollar of investment into them.

And with that said, okay, I have some plugs. This is gr.inc from my friend Ross Taylor, who worked on the model called Galactica, which got a lot of criticism when it was released out of Meta. It was an open science model released before ChatGPT; now doing science with LLMs is pretty common, but they got a lot of flak when they released it. He unreleased Galactica and quit doing a bunch of stuff publicly, but now he's working on plugging the reasoning gap between open models and closed models, and they've released a bunch of open reasoning data that will help. So that's a nice quick plug. The other quick plug: I work on PyTorch, and PyTorch is working on enabling local agents, especially the technical challenges that I talked about. And we're hiring, so if you are more than an AI engineer, if you're an AI engineer who's also a systems engineer, then PyTorch is hiring. Well, that's what we've got. The other thing, obviously, is I welcome you all to come to LlamaCon, which is happening on April 29th. Save the date; it's going to be very exciting, and lots of Llama stuff will happen there. That's it. I think it's in California; I actually didn't look it up. So thank [Applause] you.

Ladies and gentlemen, please welcome back to the stage nlw and Ksenia Se. [Music]

Thank you, everyone, for being such an amazing audience. We know you're eager to get out of your seats, but please hang out with us for two more minutes. First, on the website, ai.engineer, you can find the information about the after-parties that are happening this evening; please check it out. And then tomorrow there's a full day of workshops. They are not here; you can find the addresses on the website. It's Jets, which is at 109 West 39th Street, 2nd floor, and AWS (JFK 27) at 12 West 39th Street, both of which are right next to each other.

Before we go, we'd like to invite the organizers up for a quick message of thanks, so please join me in welcoming to the stage the co-founders of this AI Engineer Summit, Benjamin Dunphy and swyx.

Hey everyone, how are we doing? Did we have a good time? It's been a long marathon for us, and I'm sure for you as well. We won't keep you too long, but we just wanted to chat with you for a little bit, because this has been such a blast for us, and I know it has been for swyx as well. Behind the scenes, he's really putting together all the content, and he's doing a lot more speaker wrangling than we'd like him to do. This guy is doing so much work, so I would just like everyone to give him a rousing round of applause, along with all the incredible people here, and also you in the audience. I mean, you're not coming for me, the guy putting together the show; you're coming for him.

Yeah, and everyone else here. I think a lot of the show is also just how smoothly everything runs. People don't see behind the scenes how much chaos there is back there, and it's all due to Ben and the team. Leah's back there as well, and we have a whole team helping us, so we ought to thank them. And I'm very grateful to work with you on this stuff. I mean, this is our third conference, and I feel like it's getting better, but I still feel like there's always chaos.
We kind of make it hard on ourselves, because every conference is slightly different. For example, this one is the first one in New York. Yeah, and we also didn't decide on this venue until about December 15th, so this was less than two months of planning. So maybe we can do a little bit more with World's Fair in June. Yeah, I mean, for that one we're doing it a year ahead. So yeah, World's Fair is coming soon. I don't know if we have a URL or something. No, we're just chatting. Okay, cool.

What else? I just wanted to address this to the whole audience, because we get asked this a lot: why did you, a bunch of San Francisco people, come to New York? It's really just a few simple reasons. First, we just wanted to get out of our San Francisco bubble; we'd had two successful AI events in San Francisco. And secondly, as swyx said to me about the AI scene in New York... what did you say? Yeah: show me what you got. You know, I think people talk a lot about the great engineering that's happening in New York, and we saw some data yesterday about the hiring that's going on here, and basically I felt like New York was kind of underserved. The frequency of AI events that we have in San Francisco, we take for granted. I was checking my biases; I thought maybe I was just ignorant, but I talked to some of you here, and you said this quality of event doesn't really happen that often in New York. So we just wanted to bring a little bit of the AI Engineer magic to New York. Yeah, and it's been a blast so far. And I think the last reason is that if I get an excuse to go to New York, I'm going to take it, even if it's in February. As Californians, we're not that accustomed to this weather.

Yeah, I really felt like we needed to do something early in the year so that we'd have context, relationships, and whatever else for the rest of the year, so I think the timing is important. And I'll just say: if you come here even though it's cold, then you're really here; you're very serious. Anyway, God smiled upon us this week; those blizzards did not show up. Yeah, it was supposed to be two feet of snow on Thursday, and then it cleared up, so I don't know what we did, but we did it right. Global warming for the win.

So I was talking to someone about what it takes to put on an event like this. Everything's really tight: everyone's got 19 minutes, yesterday was 20 minutes, everything's back to back. To go from the plug-and-play, meetup-style conferences, which is what I used to do, it's literally like 10x more work and a lot more stress on us, and then we kind of pass that on to the speakers a little bit, because it's a little more strict. So I just want to thank all the speakers for putting up with me and with the crew, for saying, you know, "urgent, we need your slides now or everything's going to break." swyx doesn't think it's that big a deal, but I do.

It is a big deal, but I think we have to tell people why. You know, when we have such high expectations that they're not used to, you have to give a rationale. It's like you have to show your reasoning to arrive at some kind of conclusion; it actually helps, you know.
True that, true that. What else? There are just so many more people who go into making this production. We have an entire crew back there, with Argus HD and the TimesCenter team helping to run this. But not just that, we have a whole organizational crew: we have Leah, we have France, we have Peter, we have Scott joining us. We're actually starting to really build a big team so we can cross that chasm, because right now you either see me running around or you don't see me at all, because I'm putting slides together last minute or doing something last minute that should all be done way ahead of time. With a bigger team we're going to be able to do that, so as we prove this model we're going to continue to grow and grow the team.

I mean, don't forget our MCs. MCs, yes, thank you, wonderful job. Mostly, I have a theory that professional yappers are very good improvisers. They're involved in the industry, and you want to meet lots of people; that's my theory. I want to kind of bring people into the fold, so I hope that you got familiar with each other and that you got something out of it as well. Thank you so much for all the work that you did today.

And then, of course, that sponsor Expo down there. Did you guys see that? That was Motif Events; they put that together. A really, really good Expo down there, really professional-looking. And then our sponsors themselves, who manned those booths and showed you those exhibits. Who were your favorites? What do you think? Shout them out. What? Daily, the Pipecat Cloud thing. Yeah. Everyone else is too scared. All right, don't want to show favorites. Galileo. All the company reps are now shouting their names.

Yeah, I would say we added the Expo stage at kind of the last minute, mostly because we had some ability to set one up, but it was noisy in there. Yeah, I mean, it was hard to hear. I think in the future we would want to separate it all a little bit more, but we couldn't really tell the acoustics before running the thing. True, and that was kind of a last-minute addition, so we didn't really get a chance to test it. And when we added it, they said, well, we can get speakers in there, but it's going to be overhead; it'll sound good, though. But we did get some complaints from sponsors saying it's too loud in here, we can't have conversations, and probably some of the same feedback from attendees as well. So I think those are little things that we need to optimize and fix. Yeah, that gets smoothed out.

Anything else? No? Yeah, so we want to do one more thing, and we want to invite you all. Apologies to everyone I didn't get to thank, because too many people go into this thing, but one person, Randall G., is our photographer. There he is, and he is going to take a photo of, ideally, all of us. And Max Video Productions, who does our B-roll and did the interviews out there, will hopefully be getting some B-roll of us as we do that. I've never done this, and I don't know if you've ever done it, but I'd love to get everyone, as many of you as want to, to come on stage. Just one caveat before we get up: these things will break if you step on them, and we'll have to pay for them. They're grates, and they're hard to see, so just don't step on those, don't go past this line, and don't
touch that. But yeah, who wants to come up and do a group photo? It's going to be a good memory. Come on up, let's do it. Miranda, you want to grab a word? Sam, you can shout. Just a reminder not to step on the grates behind you; they will break. Thank you. All right, here we go. Look at this group. Can we get the lights on all of you? All right, all right. Ah, we've got to turn it back off. You've got to be on the red carpet. Can we get the MCs over here, on the red carpet, if you're comfortable? We're going to get to know each other really quickly. Come on in, everyone, and Randall's going to tell us if he can see us. How are we looking? All right, let's do it. [Laughter] swyx can actually sing, do you guys know that? Go for it. "Start spreading the news..." Everyone fall in, fall in. Do we have some volunteers? [Music] [Applause] Thank you, everyone, thank you.

One more thing before everyone goes, a little bit of a surprise. Ben didn't mention this, but it's actually his birthday today, and we got him a little thing. So thank you, Ben, for putting your personal life on hold and doing this for us. Happy birthday to you, happy birthday to you, happy birthday dear Ben, happy birthday to you. And I don't think... yeah, we shut your mic off.

Just a reminder, one last protocol item from me: we are actively breaking down the venue. You're welcome to stay for a little bit, but we've got to be out of here by 5:30, 5:45. There are plenty of side events happening, but of course, you know, this is New York; you can go anywhere you want. We'll see you at the workshops tomorrow. Those are nearby, within walking distance of this venue and also the hotel, at AWS, Hank, and J Suite. That's all on the website; check it out and you can get the addresses there. You're all welcome to come, and we'll see you there. [Music]