AI Engineer Summit 2025 - AI Leadership (Day 1)
Channel: aiDotEngineer
Published at: 2025-02-20
YouTube video id: L89GzWEILkM
Source: https://www.youtube.com/watch?v=L89GzWEILkM
[Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Music] [Applause] [Music] [Applause] [Applause] [Music] [Music] [Applause] [Music] [Music] [Music] I [Music] [Music] [Music] [Music] he [Applause] [Music] [Applause] [Music] [Music] [Music] [Applause] for [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] in [Music] n for [Music] a [Music] welcome welcome everyone and thank you for coming welcome appreciate you being here and welcome to the AI engineering Summit 2025 okay so I just want to start by saying even though I've often personally been told I work like a machine I just want to take this moment to reassure everyone your event staff your event curators your event people like myself we're all real humans none of us are Bots that are being launched at this conference well this year anyway I mean next year who knows right but uh this year is very special to us uh this year is the first year we're in New York and we could not be more pleased to be here I can already feel and have already felt the energy that's available for us today here in this room and at this conference we're building on the soldout success of our AI engineering Summit in 2023 and our soldout AI engineer World's Fair in 2024 in San Francisco and that's what's allowing us to bring to you here today this curated exclusive in Industry Insider event here in New York City uh and where we can all get together and talk about and learn from each other's real world pragmatic experiences uh and at a gorgeous Stellar venue uh okay so why are we all here well like the Industrial Revolution before AI is a new future and it's going to change everything that comes after it I mean frankly just a few years ago even the idea that the entire internet and assembled Collective knowledge of the human race could be put at your fingertips is just amazing breathtaking and we're here of course also because Aon is by no means done evolving right um things that used to take five years and a research team in 2013 now take API docs at a spare afternoon in 2025 it's just incredible okay so let's set expectations about the event for just a moment um we have a really exciting conference lined up uh today is the AI engineering track uh for executives and VPS and then we're going to be followed tomorrow by the agent engineering track and then we'll conclude the conference with the Hands-On workshops on Saturday so today's leadership track uh will equip you with the Strategic insights you need that are at the intersection of AI and business leadership uh exec leaders senior folks from Lux Capital signal fire uh anthropic open AI LinkedIn data dog many many others are going to be here today to share their real world experiences uh and hard one uh scars and lessons with you um we will touch on everything from Trends to hiring to security to tools to infrastructure and technology and a lot more uh and of course we are blessed and very fortunate to have sponsors like salana uh if you're not familiar with the blockchain world or you're not familiar with salana salana is the permissionless infrastructure that lets your agents create wealth they've got a large Booth downstairs with three demo stations so please stop by to learn how they can help you uh the Expo area is just uh downstairs right uh on the in the hallway in the lower level um it opens after the morning Keynotes and it's open all day uh so please do take the time to go visit our sponsors because they're a integral part of how Gatherings like this happen right uh and more importantly they will help you on your journey okay this event however isn't just made possible by salana it's made possible by all of our sponsors um who are innovating at the edge of AI engineering and they represent a really fascinating mix of companies uh they've sent their top you know sea level Executives their heads of product their senior technical staff to this event so please make sure to visit them at the breaks and have a chat all the breaks are in the schedule and lined up and who knows you might find your next service provider your next partner maybe even your next customer um so in a moment you're going to hear from these Founders these executives these AI leaders who have all prepared tailored talks just for you okay and then at the break following uh each block of talks um there's going to be the speakers are going to be available to answer your questions there's one of three Q&A and discussion areas uh throughout the the conference venue um there's one right here on this level uh and then there's two downstairs one at the uh base at the landing of the stairs and another tucked underneath okay um we're going to use these areas during the break time to facilitate the hallway track so you can gather kind of birds of a feather F birds of a feather style um to talk about the topics as well from all the sessions before in that block before the break okay as well as meet the speakers um these breaks again several of them throughout the day they're all listed in the schedule so you won't have to to guess or worry um and then of course all food and drinks are going to be served downstairs in the Expo area and please be sure don't miss the Afterparty tonight in the Expo area so we can all have some drinks have some fun listen to music and get to know each other because that's a lot of I think why we're all here okay with that thank you for your time and it is my honor and privilege to call and invite our first Speaker up to the stage please put your hands together and join me in welcoming partner from Lux Capital Grace isord [Applause] [Music] thank you so much Peter to swix to all of the AI engineer Summit for having me I am so thrilled to be here I'm Grace again a partner at Lux capital and it's a pleasure to really kick off this conference and tackle a pretty tough but exciting task the state of the AI Frontier and how we navigate that in 2025 so a little bit about Lux as we get started here right Lux likes to say we believe before others understand we invest in Frontier Tech ideas that seemed crazy right uh and we really like to bring sci-fi to scif fact in fact we've been lucky to partner at the earliest stages with some top AI companies hugging face I'm sure several folks know it the GitHub for machine learning together AI the open source AI Cloud physical intelligence that's like a robotics software brain and saon AI That's a research lab actually in Tokyo Japan doing really cool evolutionary nature inspired algorithms they launched a pretty cool AI Cuda scientist last night so go check it out moving a little bit forward as we think about New York City if I get my clicker working here there we go Lux is really excited to double down in New York City and AI Lux was founded in New York City our first AI investment was here in 2013 and a majority of the Lux AI portfolio is actually headquartered here or as a major Hub as you can kind of see in the graph behind me it's also home to many of you right state-of-the-art research and Engineering leaders and many four to 500 companies several of whom you're going to hear from over the next few days we are really bullish on the New York City opportunity and we're really excited you guys all came to share this opportunity with us so when I was creating this presentation I went back and looked at the last few years of AI right and all the way back really to stable diffusion August 2022 and wow I mean look at this hockey St right the last 2 and a half years have been crazy the last 18 months have been even more exponential the progress is getting more aggressive it's getting more impressive and really it's getting more spread right it's not just open Ai and anthropic publishing these models it's xai we just saw the grock launch this past week it's mistal it's deep seek it's many many more and the models are getting more performant they're also getting more compute efficient and as we zoom in to the current state of the world in 2025 it's off to an even Wilder start right if you thought the last few years were crazy 2025 is even Wilder we saw the 500 billion Stargate project announced between the US government open AI soft bank and Oracle we saw 03 open eyes 03 right before the start of the Year where they actually exceeded Human Performance and the arc AGI challenge we saw the Deep seek Mania right with deep seeks R1 model launching earlier this year sending you know Nvidia shares tumbling down we also saw deep SE go to number one uh in the app store and then of course just last week we saw the France AI Summit where macron actually launched a whole new AI initiative with France and Europe back in the game so you may be saying and I think a lot of us are thinking right this is the AI agent moment in 2025 and I'd go as far as say this is the perfect storm for for AI agents and Frank frankly it's easy to see why right uh several reasoning models starting with open eyes 01 then 03 deeps R1 grock's latest reasoning model this past week are outperforming human ability and in some cases even having more capabilities that we've never even seen before we' seen the rise of test time compute right that's more compute applied at inference instead of at training that's increasing this model performance as well we've seen further engineering and harder optimizations right whatever you think it cost to actually train that deep seek model you cannot deny it was a feat of engineering and Hardware efficiency inference is getting cheaper Hardware is getting cheaper the open source close Source Gap is getting closed between deep seek and llama models getting more and more performant and of course billions of infrastructure powering all this data center and compute we just talked about the US Stargate we talked about macron and Europe and also Japan with soft and Nvidia has been doubling down on their set of efforts so all this is setting this exciting groundwork for the aonomus name of our conference agents at work and it really does feel like an exciting moment but in reality these AI agents aren't really working just yet right people are saying it's a perfect storm and I've seen a lot of Thunder I've seen a lot of great momentum but we haven't seen that lightning strike uh and everyone I know has different definition of Agents so for the purposes of this presentation I'm going to Define my definition as an AI agent that is a fully autonomous system where llms direct their own actions so let's give a little example of what I mean when I say an AI agent isn't working just yet here's a seemingly simple query on open AI operator I'm sure everyone here knows what it is I asked it to book a flight for me to New York to San Francisco on Monday I'm sure it's also a route and something that many people in this room are familiar with and in reality it's actually kind of a complex problem right I need to leave after three on Monday but I want to avoid rush hour traffic I want to fly United JetBlue or American to maximize my chance of an upgrade from economy I want to keep it under $500 to keep under my work expense policy I also want an aisle seat that's not too close to the bathroom um and I want to get there you know before midnight so I put this in to open AI operator and the first thing it did with all this information is go to kayak which if anyone has booked a flight before that's a pretty frustrating experience and unfortunately it did not find a flight uh it wasn't able um it couldn't find one it didn't even seem to look for United or American second try try it again uh this is Skys scanner this time which is slightly better um and it did actually find a flight but it found one that had a lot of traffic uh 5:30 JFK for those who live in New York that is a tough traffic time um and ultimately I also couldn't even pick my seat so didn't really work out based on my prompts uh earlier so what does this all mean right why these AI agents not work I think we so often talk about hallucinations and fabrications and AI models kind of going sideways we don't talk enough about these tiny cumulative errors that add up right uh there's a lot of little errors that we see with this old model and I'm going to go through a few it's not an exhaustive list but it's a sense of some of the things you might run into as you're building these AI agents first decision error it chooses the wrong fact right I may book a flight with AI but it may book it to s Isco Peru instead of San Francisco California the model could overthink or exaggerate and do a few other things as well second implementation error the wrong access or integration on the prior slide with my Skys scanner I actually had to enter capture and that messed up a little bit of the flow you also could get locked out of a critical access to an important database and ultimately that AI agent isn't going to work anymore third heuristic error the wrong criteria unfortunately the model didn't acknowledge best practice of allowing enough time for JFK in fact I didn't even ask where I was coming from Manhattan Brooklyn or Beyond and that could really affect the traffic you're going to get and ultimately if I even make that flight at 5:30 p.m. and fourth taste error the wrong personal preferences for those who know me well I'm actually a pretty spooked flyer and I do not like flying Boeing 737 Maxes if AI booked that you know I did not put it in the prompt earlier but if it did book that I will be very unhappy and I would not get on that plane and then there's kind of a fifth more nebulous error right it's a little bit of this Perfection Paradox right we are doing things so magical with AI right now yet we're getting frustrated when A1 thinks too long or when operator moves at the speed of a human even if the agent gets it right on the first try often they're inconsistent and unreliable leading to really underwhelming our human expectations about the whole EXP experience here's another visual of kind of these complex systems where each of these cumulative errors really compound right two simple agents one that's 99% accuracy one that's 95% accuracy to start pretty impressive agents at the beginning but over 50 consecutive steps you actually realize a pretty big disparity here there's actually a 50% difference after 50 tasks between the 95 and the 99 and that 99% agents actually down to 60% the point here is that something simple like booking a flight is actually really complex in nature when all these tiny cumulative errors add up and they get even more Amplified in a complex multi-agent system with multi-step tasks so how do you as all these amazing VPS of AI these leaders of AI in the room optimize a complex agent taking into account all of these really difficult queries to consistently and reliably make the right decision the truth is it's hard but that hasn't stopped us before and there is hope so I thought I would run through some of the best practices that we're seeing building AI agents today and five strategies we can all think about to help mitigate a lot of these cumulative errors let's Dive In First Data curation how do we make sure an AI agent has the information it needs data is messy it's unstructured it's in silos it's everywhere it's not just web and text data now too it's design data it's image data it's video data it's audio data it's a data in your sensors and your Warehouse if you're in the manufacturing world it's even the agent data that your data your agent is producing in real time think about curating proprietary data the data the AI agent generates and ultimately even the data you're using in your model workflow for Quality Control Data is your best asset and curation is key to making it more effective data also isn't static anymore how do you design an agent data flywheel from day one so every time a user uses the product it automatically improves in real time and at scale a simple example back to our flight example is getting a curated data set of all of Grace's travel preferences including the 737 Max and all my Airline preferences or even say we run that agent over time and book many flights how do we recycle back that content and adapt to my own preferences in real time second the importance of evals how do we collect and measure a model's response how do we choose the correct answer this is long been important in machine learning and Ai and really understanding what's right versus wrong you know it's pretty simple in verifiable domains where there's a clear yes or no answer like math like science here's actually the grock three benchmarks just up here where you saw they did all verifiable benchmarks in MA math and sciences but how do we set up evaluations for non-verifiable systems where there aren't clear yes or no answers like well Grace Like This Plane seat based on her preferences and how do we collect those signals too we also saw other examples of an eval debate over the weekend with deep research right we have an openi deep research product one from perplexity one from Gemini as well and there are multiple versions of the same product the evals here really depend on the eye of the beholder right which one is better for everyday research versus VC market research versus scientific or academic research we have to keep an eye on collecting those signals we need to know and collect human preferences and we need to build evals in a way that is truly personal Sometimes the best eval is just trying out the agent yourself and Vibes based on your needs with no number or leaderboard telling you what to do third Scaffolding Systems how do we ensure when one error occurs it doesn't have a cascading effect throughout the organization ramp a Le portfolio company has done a great job with this and I know Rahul is speaking tomorrow so go check him out when ramp launches a new applied AI feature and it fails there's infrastructure logic to ensure that doesn't have a cascading effect across the agentic system and also across all of ramp production infrastructure we can mitigate scaffolding by building a complex compound system of how all these things work together and sometimes even bringing a human back in the loop for reasoning models get this gets even more interesting and important how do we adapt the scaffold to Stronger agents that self-heal and grow an agent that realizes they're wrong and actually tries to correct their own path or an agent that's not sure and then need to break execution to get it back on track back to our travel example again could we add a checkpoint for this AI agent to verify the Trine for traffic or maybe steer it back in the right direction fourth user experience or ux is the mo that matters and that's how our AI agents become better co-pilots AI apps today are all using the same models Foundation models are the fastest depreciating asset class on the market right now gbt rappers are cool ux really does make a difference for those who reimagine product experiences and really deeply understand the user workflow and really promote that beautiful elegant human machine collaboration right here's a few concrete examples back to the Deep research right asking clarifying questions to make sure it fully got the picture of what I'm trying to accomplish like Wier from codium understanding the ux or the psyche of that developer really on a more fundamental level to predict their next step like Harvey in the legal World integrating seamlessly with the Legacy systems to really create real Roi for a practicing lawyer if you think about all the major AI apps today and categories like coding like customer support like sales these all are using the same models again right and it's truly the ux and the product quality that makes any one company stand out at Lux we're really excited about the new AI Frontier companies who have proprietary data sources and who know the workflow of their user really well like robotics like Hardware like defens and Manufacturing like the life sciences you know how do we take a company where they take their proprietary data source they know the workflow of a biologist or a defense contractor or a chemist and truly create a magical experience for that end user Fifth and finally how do we build multimodally you know we're not just multimodal anymore we're multimodal there's new modalities where we can truly reimagine and create a 10x user personalized experience I am so sick and tired of the chatbot as an interface and I know there's so many more exciting things we can do with our AI agents to really make them more human right how do we make AI more human how do we add eyes and ears nose a voice we've seen really incredible improvements in voice over the last year it's getting pretty scary good Lux actually has an investment in the smell space called osmo that's digitizing the sense of smell and what about touch right how do we instill a more human feeling and sense of embodiment with robotics I'll go as far to even talk about memories right how do we make AI truly personal and know you on a much deeper level than it does today doing all of this reframes what Perfection is to a human and even if that agent is inconsistent it's unreliable the Visionary nature of the product exceeds all expectations it's something new and on the slide behind me you'll see TLD draw that's an amazing Lux portfolio company and I think they've done a great job really reimagining the visual canvas right implementing AI through brush Strokes they have a cool thing called T draw computer or you can actually combine a bunch of these cool AI models in Tandem and not even know you're working with a large language model in the background so really strive to build multimodally so in summary we tackled a lot today but we're at the perfect storm today for AI agents but unfortunately that lightning hasn't struck yet and AI agents are not going to happen overnight cumulative errors add up we see wrong answers wrong preferences wrong criteria all these wrong human expectations that abound when you're building these systems data curation evals and Scaffolding are all tools you can use to help mitigate a lot of these challenges and really please think bigger ux multimodality Innovative product experience that truly set the workflow and the vision apart and I'm so excited to see what all of you build and really excited to continue this conversation over the next few days thank you guys so much and look forward to talking to you throughout the conference thank you [Music] [Applause] [Music] our next presenters will teach you how to build an AI strategy that fails please join me in welcoming haml Hussein founder of parlament labs and Greg sarelli co-founder of spec [Music] story all right everyone welcome Hamma and I are absolutely thrilled to be here with you to basically teach you how to build the definitive guide to completely utterly and spectacularly messing up your AI strategy I actually couldn't have really wished for a better foil Grace's leaden than this because we're not just talking about Minor setbacks here we're going to take you through a way to create full-blown company crippling career ending failure Grace talked about best practices but we're here to embrace worst practices in fact we're going to make sure you know how to completely torpedo your AI projects and ensure you alienate everyone that you work with how does that sound it sounds great to me um so before we begin it might as well start with some introductions um we Have No Agenda here it's just a sequence of steps but I'm Greg I'm an executive leader and I've spent years in the SE Suite crafting AI strategies I'm now a co-founder of an AI startup but previously I was the chief product officer at plural site and executive leader at other companies um and I've had a front row seat to how executive teams can transform clear strategic opportunities into labyrinthine disasters and I'm haml I'm a machine learning engineer and independent consultant who has worked with many companies on AI I've witnessed every conceivable way AI strategies can fail and I found it really fascinating how creative people can get with their failures that's right so you could say that together we're kind of like the dream team of disaster we've advised or maybe we've you know interacted with representatives from numerous companies we hav even have this fancy website um but for today's presentation we've decided to live and breathe the great words of the late Charlie Munger who said invert always invert so let's get started all right the first step to failure is to make sure that you begin to divide and conquer your own company this is key if you're destined to fail you got to embrace the disconnect between willingness to pay price and cost the keys to creating value by contemplating unreasonable goals and you should especially everyone here in the audience know by now you you have to make sure you go and attend every AI industry conference right but never go back and talk about what you learned with your team the point just like Moses here parting the Red Sea is to create impenetrable silos and incentivize secrecy between your teams so let's get into it so I talked about the value stick and willingness to pay in the the prior slide but here it's really important for us to adhere to the anti value stick you got to embrace it because it's the opposite of everything good and useful when it comes to Value creation and being strategic and today that's our guiding principle you might be thinking that wtp means willingness to pay but here it's wishful thinking promises you got to tell your customers that AI is going to do absolutely everything for them your new systems are going to write their emails block their dog solve climate change and achieve world peace but don't really worry about the details just promise the moon you know about price right well for us that's another acronym particularly ridiculous in infrastructure costs everywhere I mean that was a mouthful sorry uh buy the most expensive gpus don't bother with any cost benefit analysis just max out the company credit card think of this as an investment in something and cost well it's that Cascade of spectacular technical debt you're about to run headlong into um you need to think about Building Systems so convoluted so intertwined that even you as an executive can barely understand it you know about job security right this is a key to guaranteeing it think about it this when it inevitably breaks no one's there except for you and finally if you know about value you know about WTS or willingness to sell for us it's why this system well the answer and I mean always is because AI there's never any further explanation needed no board is ever going to question you it's like magic but much more expensive and less reliable so step two here is when you start to Define your strategy right here's the first key fake any diagnosis you might be thinking of grab last year's annual report or operating plan and just start highlighting random paragraphs preferably the ones you understand the least and declare he I must fix this don't bother talking to anyone who actually does the work in your guiding policy should be both incredibly ambiguous and vague something like become the global AI leader in everything except don't Define what everything means that's someone else's problem totally and your action plan simple you need an AI powered SEO tool that guarantees top Google search results even if you sell garden gnomes right and a generative art plug-in that creates nfts of your CEO's cat and of course an AI drone lunch delivery service because Synergy uh announc solve this at your next company All Hands meeting and you get bonus points if you wear a shiny suit and use the word disruptive at least a dozen times and the last point on this slide is about timelines but timelines are for companies that intend to finish projects what we recommend is you Embrace Perpetual beta just create a massive backlog in GitHub and stick all those highlighted Financial reports in that Greg was mentioning earlier great strategy but you know what strategy really works just create a 4,000 page document that you post in all your slack channels and just erode people's willpower to engage with the material and with these tidal wave of documents um you know in words Greg isn't there a strategy that you have about jargon there certainly is the point is to communicate in such a way that nobody understands drown everyone in a tsunami of jargon say things like our multimodal agentic Transformer based system leverages few shot learning and Chain of Thought reasoning to optimize the synergistic potential of our Dynamic hyper hyperparameter space if you say it with confidence you probably will have absolutely no idea what you just said Remember the goal is to look incredibly smart even if nobody understands a word you're saying the key is alisation yeah you might be tempted to do something like defining a very cogent clear business on a page approach like in the in in the advantage but never be too tempted one of the most effective ways to cause dysfunction in your organization is to use jargon everywhere and use jargon strategically to hide the jobs to be done for example I had a mental health client instead of saying we need to write a prompt we would just say hey we're building agents and what that did is it made sure that the mental health experts were not in the room and didn't know how to participate and that's exactly the result you want that's right just like Hamill reduced those mental health experts mental health I like to do that as well so instead of saying hey let's make sure the AI has the right context I just talk about Rags and don't say make sure users can trick the AI into doing something bad just say prompt injections yeah and the key here is to encourage Engineers not the people who might best understand your customers to write prompts because what could possibly go wrong look we know that translating everyday English language into jargon can be really difficult so we made this guide for you and this guide will help you divide your organizations just like Greg was talking about earlier just like Moses the link is right here but remember making everything even writing prompts seems super technical and Out Of Reach for everyone is what you want to go for all right just a brief recap we talked about how to seed your your division how to start to Define your strategy how to communicate it now we're on to mobilization right because you got to do something with that giant backlog well some of you might know about Jeffrey Moore but I've never heard of him today we're pioneering a very new revolutionary framework which is about zoning to lose it's designed specifically for failure just randomly assign AI tasks to people with absolutely no relevant experience for example Outsource your data review to Offshore Q&A teams who have very little context about your business yeah and most importantly you might be tempted to use the incubation Zone to bootstrap new AI ideas but the goal is to launch from here completely untested bug written AI chat Bots directly to your customers as Hamill mentioned never worry about beta testing dis disre regard quality assurance just ship it straight to production because what's the Worst That Could Happen outside of a potentially career- ending PR disaster so if you do it sort of right should feel something like South Park um you're going to yank all your best Engineers from potentially supporting your revenu producing products wait a while and then profit no actually it's going to feel more like total collapse and because you're so disorganized we can now transition to look at this point your organization is in complete disarray but it's time to do the deed and burn it all to the ground so the most effective way to start doing this is to focus on tools not processes and those problems that you created earlier and other ones that may exist don't analyze them don't try to understand them just throw tools at them so if your rag system isn't retrieving the right documents just a new more expensive Vector database yeah and if you need to measure progress just use every off-the-shelf evaluation metric you can possibly find never bother customizing them to your business needs just blindly trust the numbers even if they make no sense oh and like you know we're talking a lot about agents today if they're not working just pick a new framework and vendor find tune without any measurement or evaluation just assume it's going to be better because it's kind of like Al me with a lot more electricity exactly you don't need to look at our design metrics evals that's a vendor problem just plug in a tool and it will solve all your problems Greg I really love how you demonstrated exactly what we're going for here with wack-a-mole every time you see a problem hammer it with a tool if another problem comes up Hammer that with a tool the same problem comes again hammer it with a different tool you get the point yeah him I really appreciate being meme fodder to help you get your point across look I want to emphasize you should adopt a mindset that evals are a vendor problem just realize that there should be a oniz fits all solution let the vendors figure it out you're too busy being an executive and if you really want to do this properly you need to create a dashboard that looks like this with every off-the-shelf metric that you can gather the more metrics the better it doesn't matter if the metrics track with outcomes or real failure modes make sure the numbers are unintelligible so you don't know the difference between a 3.5 and a 4.5 keep hoarding random metrics until you find one that's going up and to the right then you can claim success and look maybe you might have a hard time figuring out where you can come up with these generic metrics but we got you just adopt the ones from eval Frameworks in fact adopt all of them let your eval metrics guide you blindly and never ask whether they actually measure success again the more numbers you have the better yeah I personally like to optimize for cosine similarity BL and Rouge ignoring actual user experience and I said it once and I'm going to say it again never CR check with domain experts or your users because if an LM says it's accurate who are we to argue we are their humble servants after all amen now it's time to unveil the most potent Technique we have in our toolbox and it's avoid looking at data seriously just avoid it keep a blindfold next to you at all times you have to bump into Data by accident put that blindfold on yeah data that sounds really messy let's let a tool handle it because you can absolutely 100% trust the ai's output without ever looking at it yourself looking at data that's an engineering problem you're a leader you have more important strategic things to do like having meetings about meetings besides developers they always have more domain expertise than your business teams yeah and we know that ultimately by this point your customers are really your best best Q&A and hopefully you have lots of them and they'll complain if something is wrong maybe eventually but more importantly you got to trust your gut it got you this far in life right feelings are always reliable substitute for data especially when you're making million-dollar decisions if you have trouble trusting your gut just put the blindfold on it'll get you right back in touch with those feelings so now we know by we know by now that engineers are all coding Wizards and they're going to handle everything it doesn't really matter if they haven't spoken to a customer in years because you can quickly forget about the fact that there might be simpler options like using spreadsheets to annotate and look at data say after me remember this is beyond me great advice and just it's not enough for you not to look at the data you have to make sure no one else is looking at data and the best way to do that is to put your data in complex systems that only Engineers can access and it's not available to domain experts right so like instead of using a simple spreadsheet or perhaps an air table like up on the screen as an executive you should insist on buying a custom data analysis platform that requires perhaps a team of phds to operate and understand remember those bonus points you get more of them if it takes six months to load this thing and errors incessantly so so there you have it the ultimate foolproof guide in under 20 minutes to achieving total AI failure if you follow this advice that we've given you here meticulously it's guaranteed that you're going to waste time resources and alienate all the people you work with and as far as I'm concerned that's the ultimate success that you can have sure is so for more advice it's actually real do visit ai- execs U we also have an O'Reilly book the same material coming out February 27th uh and so while this talk was inverted you know our lived experience really isn't and we're always very eager to help you on your journey so find us after this presentation the Q&A speaker boo thanks so [Applause] much our next presenter is the co-founder and CTO of privacera and paage please join me in welcoming to the stage Don Bosco [Music] Dury hi everyone I'm Don B I'm the co-founder and CTO for private sah uh very recently we open sourced our solution for Safety and Security for J and AI agent um I'm also the Creator and PMC member of the open open source project Apache Ranger uh it does uh data governance for Big Data is also used by most of the uh CW providers like AWS gcp as well as um Azure uh so today I'll be mostly talking about how you can build a safe and reliable AI agent so before I get started let's get some of the terminologies standardized um from my perspective AI agents autonomous systems uh they can do their own reasoning they can come with their own workflow and they can uh call task for doing some actions that they can use tools to get um make API calls so task are more specific actions uh they may be able to use llms or they may also call Rags or uh tools while tools are functions which can be used to get data from the internet uh if you have databases it can going get data from the database if you have uh service apis it can call those things also and memories are context which are shared within the uh agents the task and the tools to give a visual representation um there could be multiple agents and agent may have access to multiple tasks that could be multiple tools and as these tools can talk with apis and DBS so one thing that you need to know out here is um most of the the uh agent framework today they are run as a single process what that really means is the agent the task the tools they are in the same process that means if the tool needs access to database that mean needs need to have the credentials or if they want to make API calls it needs share tokens so those credentials are generally a service user credentials that means they are have super admin privileges and since they all in the same process uh one tool can technically access some other credentials which is in the same process similarly if you have task or agents which has uh prompts all the things that's running within the process any third party Library they can also access it so those sort of makes this entire environment a little bit unsecure right so there's a little bit of a zero trust uh issue out here uh the agents the task uh they talk to LM that if you don't have a secure llm then that is another area where these things can get exploited um if you take agent on his own by definition is autonomous that means it will call their own make make up their own workflow depending upon the task so so that actually brings in another set of challenges which we call in security is unknown unknown so you really don't know like what the agent is going to do so it's very non-deterministic so because of this the attack vectors in a typical agent is pretty high considering from some of the traditional software so what are the challenges because of this so there are multiple challenges so if you look from the security perspective if the agent is not designed or implemented properly that can lead to unauthorized acces also data leakages of your sensitive information confidential information right safety trust is also the biggest challenge uh if you are using models which are not reliable or if your environment is not safe enough if someone goes and change the prompts that can also give you wrong results compliance and governance is an interesting thing most of us are so much busy even just getting the agents working we are not even worried about lot of the other things that are necessary for making your agent Enterprise ready so interestingly I was just talking to one of our c customer this Tuesday they one of the top three um credit buau so they built a lot of Agents but the biggest challenge right now is to take it to production for them they consider a AI agent as similar as to a human user and when they onboard a human us they go through a training and they have lot of regulations they need to adhere to right they have data from California residents so they there to make sure anyone who is accessing uh California resident data they should not be using for if the user is not given consent they should not be used for marketing purpose they have international so if they are Europe data so who can access those data there's a regulations around it and also there are a lot of regional regulations so when they consider even a AI agent similar to a human so they have onboarding process they have a training process and they want to make sure the agents are also following the regulations right so without that they can't go into production and we as air Engineers still in the early stage so this one of the things which is of our radar right now so now how do we really address this thing right so those who are in security associated with security compliance there's no Silver Bullet uh the best way to do have multiple layers of solutions so these are some of the things that I have in my mind like so you can split it into three different layers the first layer is what is the criteria to even put your agent into production like what are you need to do right uh we talk about evals but most of them we only talking about evals for how good your models how good your responses is alternating but you also need to have evales which are more security and safety focused so we'll go through some of those things but the the goal of this eval out here is to come with a risk score and depending upon the risk score you can decide whether you can even promote this agent to the production and the agent may not be necessarily you writing it it could be a third party agent so it has to go to the same criteria the second is enforcement um eval tells you how good is your agent built and enforcement is the one who actually doing the enforcement or implementation so you have to make sure you have a pretty strong implementation if your implementation is not good your ual is going to fail essentially you can't go to production and third is observability uh particularly in the world of agent is a lot more important because there's so many variables involved out here like you cannot really catch all of them during development or initial testing so you have to keep track of how how it is used in real world and how you can react on it so I will go through some of those things in a little bit more detail so uh let's start with the evals itself right um if you look into traditional software development there is already a process there is there are gating factors that tells you how you can promote your application into production right so if you start with basic things like uh when you're writing your code you have to make sure you have the right test coverage right when if you're building uh Docker containers you have to do the vulnerability scanning if you're using third party software you need to make sure you're scanning for CVS if you find higher medium risk or critical risk you try to remediate that before you can get into production right uh you do pen testing so make sure there's no cross-side uh scripting and other um vulnerabilities the same thing applies for AI agents also right you need to come with the right use cases you need to make sure you have the the right ground through so that when you are doing any changes you're changing the prompt or you are uh bringing a new library or new framework or new llm you want to make sure your base line doesn't change right uh if you're using third party llms make sure they are not poison they they have been also scanned for vulnerability uh if you're using thirdparty libraries which almost everyone is using it make sure they also meet your minimum criteria for vulnerability right and similarly to pen testing uh you should also do testing for your prompt injection make sure you are um your your application has the right controls so it can block them and most of the llms already doing it but not necessarily all the LMS are doing it the other evaluat about on data leakage uh this also is pretty important particular in the Enterprise world because when you building Enterprises You're Building agents which does generally what a human would do right if you're building a agent for HR have certain functionality if I am an employee I can request for um uh to get my salary benefits but I can't do the same I can't get for someone else but if I HR admin there's a possibility I may be able to access someone else's salary benefit right how do you make sure your agent is not leaking data there's no malicious user who can exploit some of the loopholes you have so you would have to do this eval up front before even you can put your agents in the production uh similar to data leakages unauthorized actions um most of the agents even though uh a read only there also now agents coming which are trying to change things they'll do some actions how do you make sure those are also done by the right person with the right person and Runway agencies um those are work on agents already know like the agents can go in a tight Loop and for various different reasons it could be a bad use on prompt or just the prompt for the task or the agents are not cannot address those things so you have to make sure you test for such scenarios before you you put your agent into production so the goal of this is to come with the risk score at the end of the day so that it gives a confidence that can you put this into production and the next one is going to be around enforcement as I said your risk Cod is going to be depending on how good is your enforcement and particularly in agents um you're working almost like a zero trust environment because you are libraries which can access anything right uh if you are accessing certain of your backend systems which have sensitive data how do you make sure the wrong user is not accessing it so uh from the security control there are a lot of other things which I'm not going to talk today like uh detecting projections and moderation but focusing on Enterprise level thing uh you have to make sure you have the right authentication authorization uh this is pretty important because when you look at the environment when a user makes a request to agent it goes to task and eventually goes to tools and makes the API call to a service or a database if you don't have a right authenication someone can impersonate someone else and may be able to steal confidential sensory information and the second is the authorization if you have the authentication done properly then you have to make sure the access control is applied properly and this is also important because agents have their own roles and as a they can do certain things so you have to make sure they are not going beyond what they're supposed to and at the same time if you have agents which are trying to do something behalf of another user then you have to make sure that the user the that person's role is enforced so if you're accessing a database you shouldn't access anything which the user does not have permission to or making APA calls so so that's why authentication authorization are super important but that um obviously there going to be a lot of other the issues approvals is interesting because um in the traditional world we already have workflows uh if I request for a leave my manager will approve it it's already built into the system but in the case of Agents you don't need to have a human all the time your agents can do most of the things automatically right so there's a if you do it design it properly you could have another agent which all it does is just looks for approvals and making sure that results are right and you can also put thresholds how much this agents can approve automatically and you can put the proper guard ra to make sure if it goes above a certain uh limit it can automatically get a human in the loop so uh just to reiterate this one because it's pretty important is when it comes to authentication authorization is not just about doing the authentication at the point of entry at the where you're making a request it you have to make sure the user identity is propagated across everywhere if you're making a calling a task the task is calling a tool you have to make sure the user identity is passed on to the the the last point where it's actually making a data access or making APA calls and that point you have to make sure you're able to enforce the right policies and access control and the third is um observability so observability is is pretty important in the agent world because as I mentioned the traditional software once you build it it gener just works you just had to make sure it is uh there's no new vulnerabilities coming in because of some Library update or something like that but in the world of Agents um there are many different variables involved one is the models change very rapidly um you if you're using a agent framework that is also keep evolving right you're using third party library that start behaving differently um the another important thing in an agent is is very subjective to what the user is entering like you may have tested with a certain assumption mostly sunny day scenario I want to apply for a my leave but the end user may use entire different um uh um text to ask the same question so how do your model is going to behave with so you have to keep monitoring to see if the user inputs are anything that changes how the responses are coming in and also to make sure how much Pi data and other confidential data is been sent out because if you see some anomaly you to be able to really to act upon it um the other thing is obviously you can't monitor each and every request right as a number of request increases it's just not possible so you have to start uh putting uh defining thresholds and metrics so what that really means is and uh can start calculating counting how many failure rates are out there once you know you have a certain failure rat which is within your tolerance is fine but if it goes above that you can automatically create alert and look into it and the failure rates could be because of uh Mis Bing agent it could be a malicious users trying to um um compromise the system then anomal detection and is another interesting thing I don't think we are anywhere close to it yet uh but this very common in uh the regular traditional software in the security side there always something like user Behavior analytics where they look at the user and see whether they have within the um uh um standard operating thing with agents coming in so there'll be more and more of anomal detection whether the agent is behaving within the uh accepted boundaries and all those things will end up with a security pro cure so that will give you near real time saying how good your agent is actually performing in life so that gives you to a bit of a confidence so to recap uh as I said there are three things one is preemptive have a vulnerability eval to make sure that you get a right risk score which gives you the confidence whether you can promote the uh um agent to production if you're using third party agent whether you can use it in your environment um second is proactive enforcement make sure you have the right guard rails you have the right enforcement you have the right sandbox so that you are able to run the agent in a secure way um make sure you have the right observability so that you know at real time or near real time how good your agent is performing and if there are some anomalies you can go and quickly find tune it so um just I said we open sourced our um Safety and Security Solutions um it's called page. a uh security and compliance is a pretty vast field I don't think any single company can do it uh so we are looking for design partners and contributors who can help us in our journey so if you're interested please reach out to uh me at boscat page. or connect me in LinkedIn thank you [Music] our next speaker will teach you how to build AI coding agents that build themselves please join me in welcoming to the stage founding researcher of augment code Colin [Music] [Applause] flarey hi everyone thanks for coming today so I want to talk to you about something that sounds like science fiction but very much is reality an AI coding agent that helped build itself my name is Colin I'm an AI researcher at augment code a company building AI power Dev tools for software engineering Orcs And I want to share with you a little bit about my our journey working on AI coding agents so zooming out AI Dev tools is a fast changing space everyone remembers in 2023 we're all talking about autocomplete models GitHub co-pilot being the one that probably really comes to mind in 2024 chat models really started to penetrate software engineering Orcs in 2025 though we think AI agents are going to dominate the conversation about how software engineering is changing so naturally a few months ago we started building our own agent at augment I want to show you a sneak peek of what we built and share some hard-learned lessons about how this Tech works and I just want to you know reiterate R I've been really amazed to see the extent to which this agent has helped build itself uh I'll not a kind of fun statistic so we have about 20,000 lines of code in our uh agent code base and over 90% of that was written by our agent with with human supervision so what does it mean for the agent to write itself implementing core features so one of the first things we had we had to add was third party integration so our agent you know if it's going to work like a software engineer it needs to interact with slack linear jira notion search Google um muck around in your code base and so we wanted to have the agent help us build these features uh we had it we found after we added the first few ourselves when we add asked to you know gave it an instruction like add a Google search integration it was able to go look in our code base for the right file to add it in uh figure out the right interface to use and go add it uh one kind of fun an anecdote is when we were adding the linear integration uh it didn't know the linear API docs the foundation model we're using uh didn't have those memorized and so it used the Google search integration which it had written previously to go look up the linear API docs and then was able to add that uh we used it to write tests so we found if we asked it something like add unit tests for the Google search integration it was able to go add those uh in order to do this we just had to give it some basic Process Management tools things like running a subprocess interacting with it uh not hanging if there's an infinite Loop in some test it wrote and reading output um I think this is super interesting so everyone's seen the Twitter demos of these agents writing features and writing tests but I haven't yet seen a compelling example of them performing some kind of optimization well over the course of our project we noticed the agent was pretty slow and we weren't sure why so we asked it to profile itself and what it ended up doing using all these tools we'd given it was add some print statements to its own code base run essentially sub copies of itself look through these print statements and it figured out there was a part of our code base where we were loading up all the files and the users repository uh synchronously and hashing them synchronously and then it added a process pool for these to speed it up and uh stress test to confirm it was all working and by the end of this we reached about 20,000 lines of code and again over 90% of that was written by the agent with with our help in supervision so let's walk through a quick examples a couple quick examples to see how the agent works I focus on simple examples where it's reliable so you can uh follow along easily so here I asked the agent are you able to search Google and then it notes that it found a tool called Google search for those who aren't familiar with the notion of tools I'm sure most of you are but I'll just kind of quickly reiterate the idea is we have this kind of Master Level agent that's doing all the planning and it has access to certain tools that it can use to interact with it its environment whether that's the third party Integrations I talked about like Google or it's editing a file in the user's repository and then it wants to confirm this that this Google Search tool is working so it sends a query to it of tests and the agent uh uh responds to us yes I can search Google and I see the first 10 results let's try something a little bit more complicated I ask it instrument agent's Google Search tool with logs and then generate an example then it uses our retrieval tool which is you know allows to search uh the local uh codebase and it's looking for a file related to Google search Integrations it finds this file deep in our directory hierarchy at Services integration thirdparty gooogle search to.py and then it calls its file editing tool to quickly and performant edit that file to add those print statements uh this is a continuation of the last example so it added those print statements and now it wants to run a a sub copy of itself so it can look at the outut this print statements uh because we asked it for example logs uh but in doing so it finds that we don't have Google uh credentials authorized so it uses its clarify tool to ask for clarification from the user it asks I don't see Google credentials would you like me to one add stub for Google API or to guide you through setting up credentials I note that the credentials are actually stored in augment gooogle api. Json it had just missed this and then here's a a really cool extra feature we have which is we want the agent to continuously learn as it interacts with humans and so here it thought well it's probably a good idea to remember where the Google credentials are stored so it called this memory tool to create a memory of the where where the Google credentials are stored to save that for later this is another example if you have that really good context engine uh it's really critical to getting the agent to to work well and so now we get our output so it prints out these logs that it searched with an example string Python programming language and it gives some uh uh example URLs that were returned by Google python.org and Wikipedia.org so we have the agent add logs to itself run itself learn from user feedback and it use all kinds of tools Google search codebase retrieval file editing clarification from the user and and memorizing uh useful learnings so let's fast forward and talk through some of our lessons building this uh I just want to knowe you know we've been working on AI coding tools for a couple years now and we didn't set out to build agents we've worked on things like completion models and Chad and so forth but our Focus the whole time was around building a super powerful scalable Enterprise ready context engine because we knew no matter what no matter how good these llms get you're going to need that context and we also thought a lot about how do you build great UI ux so AI can seamlessly um interoperate with humans it turns out this context ENT and all these thoughts around design provided a great foundation for us to quickly build this agent in just a couple months the three most important things were that access to context that context Engine with all those different types of context sources whether it's slack or the codebase the reasoning capabilities from a best-in-class uh Foundation model and that code execution environment so you can safely run uh commands in a uh customers environment so let's talk through a couple assumptions that we frequently fall into we we've have frequently fallen into and remedied and and some of you might encounter as well uh so the first one is that you know L5 agents are here is the senior soft agents are at senior software engineering level if you look at the Twitter demos it oftentimes can seem like this you have an agent write an entire website all on its own in reality professional software engineering is rarely zero to one and the environments that we're coding in are are a lot messier than what those demos uh show you as a result these tools you know aren't quite there yet but they're still super useful um the way PE one framework I've seen people think through when they're trying to figure out you know how to use these agents and how to build them is they think agents will take over entire categories of tasks so first you build an agent that will uh solve backend programming and then you build an agent focus on front end and maybe one focus on testing in reality this technology is very general purpose and so instead of thinking about categories of tasks we found it more helpful to Think Through levels of complexity so our agents you know kind of good decently good at tasks across front end backend security and so forth and we're we're improving the capability level along all those friends at once because again it's a very general purpose technology um we've we' also seen people anthropomorphize agents so they think they're just like human software engineers and they map the characteristics of a weak software engineer to what they think a weak agent would look like and vice versa for strengths as well in reality agents have different strengths and weaknesses in humans and so you may have an agent that can't do math but it can Implement a whole front end feature way faster than any human could and it's important that we keep this in mind uh let's talk through a couple Reflections and lessons uh so here I asked aaman can you create a stack of two PRS for the new reasoning module using graphite unfortunately so graphite is a Version Control tool for working with Git you can like stack PRS it makes it a lot easier to review unfortunately Foundation models have not memorized how graphite works so our general agent responds I don't know what graphite is so I'll use git and then it calls our terminal tool to run a command run and get checkout but what do we do here we wanted it to use graphite we can't necessarily go tell open AI or anthropic to retrain their model understand graphite overnight so what we came up with was this notion of a knowledge base which is essentially a set of a sort a set of information that we want the agent to understand that it currently doesn't we can kind of patch holes um one thing we wanted to add to it was this graphite knowledge so we created this markdown file describing graphite how to you know run common commands things like how to create a PR use GT create some things not to to do um we created other files in our knowledge base for things like details on our tool stack how to run tests the style guide and then we added this into the context for the agent so it can dynamically go search in this um knowledge base when it doesn't understand something and uh once we added this then you know we go ask it can you create a stack of two PRS for the new reasoning module using graphite and it calls that Knowledge Graph reads about graphi and then can run the GT create command so what learning here well onboarding the agent to your organization is crucial the analogy I like to think about is if you just hired a new a new hire software engineer you wouldn't go tell them to just stare at the code base for three days to figure out how your Tech stack Works you'd let them ask you questions maybe they're some things they didn't understand and you add some additional documents to your notion uh we should think similarly about agents uh recall I was talking about how we had all these uh third party Integrations we added whether it's linear tools or slack tools and so forth well when we were working on these we weren't really sure of which ones to prioritize and start with on our product road map in a normal World we'd make some educated guesses we'd Implement a couple of them and go from there but with the agents we were able to iterate them um uh build them all at once and so this starts to change the calculus around how product management works uh if you can build everything at once well then maybe um maybe uh engineering hours aren't the bottleneck on what we build and it starts to uh we start to be bottleneck a little bit more on good product insights and good design so when code is cheap you know you can explore more ideas uh also recall earlier we were talking through this example of you know instrumenting the agent's Google Search tool with logs uh and it was able to go find the file to edit notice here how we didn't have to give a very precise instruction to the model we just told it in natural language like how we' talk to another engineer to instrument the agent's Google Search tool and I was able to go figure out the file to edit this only worked because we had that really good uh codebase awareness um we can also use the agent for tasks outside of writing code but still within the software development life cycle so here we asked it to look at the latest PRS uh in our codebase and generated an an announcement on them and then we posted it to slack and so uh was titled new tools for this uh CLI agent and we talked about some things around slack notifications and linear uh linear Integrations this only works because we had that slack integration and and understood our code base well this uh figure may look familiar from the beginning of the talk um we actually had the agent make this as well so we asked it make me a plot of the interactive agents line of code as a function of the date um and so good context is critical in all three of these tasks we needed to pull in some different context from some different sources and it's not just the codebase context comes in many forms and also note that it's multiplicative so having access to the code base and having access to slack is forx is useful as just having access to one of those finally I want to uh switch over and talk about uh testing so uh here's a really a hard to test Edge case in our code the agent actually wrote this um and we only caught it because of some unexpected runtime Behavior so we have these caches that the agents store relevant information for their runs in uh we can run multiple agents in parallel and they all write to the same cache location and the agent wrote this save function to save to that location and I had this lock around the Json dump so there were no raise conditions that would explicitly fail if you had multiple agents all right into this cache at the same time but notice here how there's no read before writing to the cache and as a result you could hit a race condition where multiple agents are running in parallel they're all overwriting each other's caches and when the agent wrote this save function why did it Miss this issue well these agents make mistakes and this is a hardto test situation there's some parallel programming there's a cache involved and so we didn't have a test and because we didn't have a test the agent messed up my learning here is we need to be very careful about having sufficient tests um we have this pretty incredible statistic so we have a internal bug fix bug fixing Benchmark uh we found when we upgraded our foundation model by about six months our scores in this Benchmark improved by 4% but when we added H the uh ability to run tests so the agent could suggest a fix for the bugs run tests look at the feedback suggest another fix run test and do that four times that led to a 20% gain on this Benchmark so what's the lesson well better tests enable more autonom you can trust these agents more and it just makes them smarter what does software engineering look like in a world of Agents well agents didn't work last year but now are pretty good if you'd asked me two years ago if we'd be working on this Tech I frankly wouldn't have guessed it there's a compounding effect where these agents are staring starting to help build themselves and that's only going to accelerate the pace at which they improve code isn't going away way because it's a spec of our systems but our relationships to it is changing good test harnesses are becoming more important than ever and we need to be especially careful about those parts of our code bases that tend to be less well tested and the calculus of product development is changing if code becomes super cheap to write then the focus our focus is more on good product work Gathering customer feedback quickly building insights we're really excited for how this Tech's going to positively transform our industry and we'll be releasing our agents soon so I'm really excited to share that with you uh find me after the talk if you want to discuss any more thanks [Applause] [Music] ladies and Gentlemen please welcome back to the stage your MC for the leadership track session day Peter [Music] Humphrey all right folks thank you uh Colin thanks for calling for I mean pretty amazing to see AI software building itself I think he was right just uh sounds like science fiction doesn't it um all right well everyone we are truly off and running here uh We've kicked off the day with AI market trends setting your AI strategy we've heard about AI security and AI safety even just now ai creating itself okay so now we're going to take a 30 minute break um if you want to discuss anything talk to the speakers have some question and answer U go to the one of the three QA lounges the speakers have spread across those three uh there's not too many to to choose from so you'll be able to find them pretty easily and then again also for some birds of a feather style uh discussion if you want to talk about any of the topics you've heard uh this morning just be friendly go introduce yourself say hi to folks uh and uh and and have some time interacting um also please do make some time to uh the stop by the sponsor Expo which is open now uh coffee and snacks are being served there uh our sponsors again are a huge part of making Gatherings like these happen and they have amazing products and technology and services to help you with your journey okay we'll see you back here for the resumption of leadership day at 11 o'clock sharp thank you very much [Applause] [Music] [Applause] [Music] [Applause] [Music] he [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Applause] [Applause] [Music] [Applause] [Music] [Music] [Applause] [Music] [Applause] [Music] [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] a [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Applause] [Music] n [Music] [Applause] [Music] [Applause] [Music] [Applause] [Music] [Applause] [Music] [Applause] [Applause] [Music] [Applause] [Music] n [Music] [Music] [Applause] [Music] [Applause] [Music] [Applause] [Music] n [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Applause] [Music] [Music] [Applause] [Music] [Applause] [Music] [Applause] [Music] [Music] [Music] [Applause] [Music] [Music] [Music] [Applause] [Music] to New York ladies and Gentlemen please join me in welcoming to the stage your MC for the leadership track session day Peter [Music] [Applause] Humphrey all right welcome back everyone uh hope you got a round of coffee and uh strap in grab a helmet our next Sprint of sessions is pretty action-packed okay we're going to talk about retrieval augmented generation and data pipelines we're going to talk about exactly uh always a popular one we're going to talk about AI in the software development life cycle this is a this is one I'm pretty excited about how does it impact what's going on with the traditional software development life cycle that's something I definitely am here to learn myself uh and then a little bit about AI productivity internal agents and then of course we're going to have some speakers from no no one less than open AI uh so with that please join me put your hands together in welcoming our next speakers Steph chin who's the VP of developer relations for Neo forj and Jonathan low senior director of operations and insights from fizer thank [Applause] [Music] you hey so it's so great to be back in New York City actually I grew up nearby here and pleased to be co- speaking with Jonathan thank you Stephen good to be here and you know we're here to kind of talk about leadership talk about how you can actually do a bunch of the things you've been hearing in practice we're going to talk about strategy we're going to talk about technology but let's start with analysts who who here trusts Gartner when Gartner says something they're predicting the AI wave they're predict okay nobody does nobody no hands sweat up in the room for the record but when they are predicting failure and catastrophes I I I try to trust that right so last year they predicted 30% of generative AI projects are going to be abandoned by the end of 2025 now anybody in the room this is a really real honest check has anyone been on a failing gen project okay now Brave SS amazing give give those guys around that took a lot of courage now to make them feel a little better who who hasn't yet got to production on their gen app okay so the rest of the hands went up right so this this is the challenge so we all want to be successful gen we all want to do amazing things we're getting asked to do amazing things but we we need to have the right way of approaching this in our organizations with leadership to sell this internally to to build it on Technologies which they can understand and the the vision it's it's hard to get a vision that's technically achievable when the guy at the head of the table is is this is this guy he's he's the executive who's heard about J now his kids are using it for their School courses and he's like oh yeah yeah just it solves all the problems insert here success I wanted in production in two months now um I think the great thing about having having Jonathan as my co-presenter is that he's actually done this and a big Life Sciences company and he's had to navigate all of this um leadership challenges organizational challenges silos to build a system which actually is something we can take to production so tell us a little bit more about that Jonathan thanks Stephen now as I've been introduced Jonathan low you may know me as Jonathan when we're out in the hallway or when I give you a bit more information about my experience launching Jen AI based capabilities in business you may think of me as Debbie Downer AI is so exciting until the singularity so that's how actually approached the problem I'm about to explain to you but it it actually worked so the business case was technology transfer which means in biofarma scaling up from Lab Ben think beers and human scale drug development to Industrial scale making a million doses a day and to get from that lab bench level to multiple factories around the world making lots and lots of product very quickly takes years because the industrial people that build the factories and build the equipment need to sift through hundreds of thousands of documents and notes and test outcomes that were created at the science level another challenge with doing that is I'll go to a statistic in 2019 a study said that the average tenure of manufacturing workers tenure being how many years they had spent in their companies was about 20 20 years of average tenure what do you think the average tenure is in manufacturing companies today the study said three years so we've gone from 20 down to three and all that expertise has has or will soon be retiring because the boomers are growing old so we really need generative AI we need a machine to take a lot of the intelligence that's captured in documents or even in tacet people's heads and get it to the new people showing up to do this technology transfer so we take all these millions of documents and we've loaded them into a graph now we haven't necessarily loaded the document into the graph we've loaded the chunks into the graph and one of the things that we really liked using the graph to accomplish was we structured the chunks the document the block the paragraph the line the because we wanted to understand when we searched for those chunks with similarity search which ones really returned the results that people wanted the most we wanted to really refine how we stored and managed the chunks so at this at this point it was it was a totally new space and because we were able to structure in the graph that level of chunking we were able to eventually learn and get better and better at how we chunked the documents in the first place yeah so I think what's really really amazing for me about this is um we were talking about business challenges and like like projects failing and in the study that Gartner did that the biggest failure mode was not having a business use case which would actually solve real problems and then be monetizable and um this is not only like a great business use case but it's also something which potentially is Saving Lives because you're you're getting life-saving drugs to folks faster you're able to accomplish this quicker but the problem is always the humans in the middle right so the teams you work with probably have a little bit of gen not invented here syndrome where you come along with this great solution like I'm going to use graph rag am I going to load all these documents in the my my you know my my big um store and they're like no no no I we've SE This research paper we watched this talk there's some other platform we want to use there's a mother framework um or maybe it's maybe it's too expensive I mean compared to Classic Computing and cloud computing gen architectures have the potential be much more expensive if they're not well architected and in general are going to re increase the cost of the organization so how do you convince people to go from a system which is is working but not working well enough to a much more expensive system which is R&D investment Redevelopment to go towards a gen architecture so what are what are some of the challenges you hit internally and how did you address that viser great so for this one it's more of an entrepreneurial use case within a big organization I wonder how many of you have worked in organizations with 50,000 or more people a lot of hands going up uh my current organization has over a 100,000 people I've also worked at IBM deoe big organizations and if you are like me in these organizations you'll be that little red guy going like this with the with the light bulb over his head saying I have an idea that might help the company and I have a team of X number of data scientists and developers and sres and we can bring that value that capability to the company if you're like me if you're that red guy who's the first group of people on this slide that you're most interested in connecting with well you go for the top you're better than I am I I joined this whole profession because I love building applications that Delight the people that use them so my instinct has always been to go to the bottom first and say to those users hey do you really want this tool and what are those users going to tell you the like your tool if that's right what makes it good it takes away boring stuff that they don't want to do but it can't just take away boring stuff it also has to give them accurate results it also has to work in a performant way they can't push the button go get coffee and come back and I feel like that's the easy part right more and more these days you can build accurate fast applications quickly so where's the real challenge so somebody said you go to the top first what's the likelihood in a company of 50,000 to 100,000 people that you're going to meet the CEO if you're the guy with the idea at the level four of the hierarchy the likelihood is pretty small did anyone here ever see the movie Dirty Dancing Dirty Dancing maybe do you remember the part in Dirty Dancing when baby the lead leading woman in the in the movie meets Johnny the amazing dancer for the first time and she she's uh she's unable to speak she's so flustered and finally she blurts out I carried a watermelon and then off he goes and she goes I carried a watermelon two weeks ago I stepped into the elevator on the seventh floor of of the headquarters of of my company and there was my CEO in the elevator and I felt like baby in Dirty Dancing I couldn't think of what to say I locked up for he's a good guy he broke the ice I'm just back from vacation rolling up my sleeves can't wait to get to work what are you up to and then thank God ding we got to his floor the doors opened and out he went and as he went out the door I blurted out not I carried a watermelon but I'm working with llms off he went so when you're trying to promote your work within a big company like this it would help to know what that executive is trying to accomplish and the way that he gets to that point is he talks to Consultants who say let us tell you how to be a leader in your industry and not fall behind the competition so an example of something that an executive at that level might do is create a purpose blueprint or something like that name and the number one message has to be a few words and convey something that the whole company can follow so an example of that might be change a billion lives a year in life sciences a big aspiration now why do you have to care about that in the elevator maybe you'll reference it I'm changing A Billion Lives a year with the most amazing AI search engine Bing and off he goes but that that message that he gives trickles down to the next level the chief digital officer the chief scientific officer the chief Supply officer what do you think they're going to say they're going to try to take his message and turn it into their specific flavor so the digital officer will say I want to lead the industry in Ai and the scientific officer will say I want to take on the world's biggest diseases and the supply officer will say I want to accelerate Supply still very high level and you probably won't meet these people either who will you meet though you'll meet their level twos and their level threes and what are they going to say at this point they don't really say tagline instead they say I want cost savings I want cost avoidance I want earlier realized revenue or I want more balanced headcount so when you're talking to these people your slides have to have numbers and times and your promises about how your tool or your capability or report or whatever is going to me meet those times and those numbers now you may not get to meet them either if your big company has a a form of um a role called the client partner where your digital people talk to the client partner and the client partner talks to the business then that's the other person you have to convince and the problem with this is that the client Partners tend to stay within their particular departments there might be a client partner who works exclusively in R&D or one who works exclusively in Supply what would they say sometimes they don't say the same thing one of them might say R&D already has five or six or 10 search engines why build another or they might say search engine is a great idea why don't you incorporate that capability into every tool in the supply organization so either your scope goes to nothing or it goes to everything and you need to be able to negotiate and navigate that are you done if you can satisfy all those people and cross through all those gauntlets well no because as you're starting to build the vendor comes to you and says why build in house when you can buy our tools and they've been talking to the chief digital officer about build versus buy and which one is more economically realistic and appropriate well maybe you get through that and then you're done right who else would possibly stand in the way of your incredible AI Search tool Friendly Fire is the answer your own colleagues either a level above or at the same level may say dude I was here first AI search is my terve or they might just say hey that client partner over in Supply is right can you please integrate with the stuff I've built so I guess my message is we've heard a lot of talks about failure and Challenge and Garten are not liking this it's an incredible time to be in this in this amazing industry in this amazing change in both for me life sciences and more generally for the information technology industry and I love that we're hearing all this concern about failure because it just means we're at the beginning of a really exciting time but as representatives of that my advice to you is know your audience personalize for all of them and get your human wetwear chatbot speaking the right language at the right level now that's that's amazing um we've chatted out a bunch of these challenges right so we've chatted about getting a good business use case that can actually provide values the organization how to navigate like like peoples and and different failure modes um within the organization where the organization has a huge quantity of people who can be your allies or can work against you depending upon how you work with them but it's also a technology problem you have to have the right technology to solve your use case now one of the the biggest challenges I think a lot of us who have been building Rag and Enterprise applications has been the LMS themselves fighting us with with hallucinations this is getting better with newer models um it's getting easier to feed the right sort of information in with Vector databases but I think that you've chosen a rather unique approach using graph databases why did why did you choose to use a graph database for your implementation at fizer well um there are a lot of things that graphs aren't good at things like genealogic sequences of recipes or social networks or hierarchies or time series and all of those applications were prevalent opportunities within fizer so that was the that was the original impetus for using a graph but I also discovered that the more data we Consolidated in the graph the faster my data scientists and engineers and developers and sres were able to understand the data landscape what used to take three months to consolidate understand clean up took three weeks or less for for a new project so I know the reason a lot of people take on graph is because traversal becomes so much easier in terms of data search and uh and and performance gets better but I've found that team performance also took a really big Boost from using that Tech cool and for folks who aren't familiar with with knowledge graphs and LMS or or graph rag um this isn't a new idea although I I would put you guys on the early adopter where you're actually in production now with something that uses this but U Microsoft kind of wrote the seminal paper on graph Rag and used it basically taking existing documents llms to chunk it into a graph and then showed Superior results coming out of it it on the spectrum of Technologies using LMS directly you can get good results but it lacks that context it lacks that Enterprise knowledge using a vector database or or Baseline rag you you can get better results where now it's actually pulling in organizational knowledge but the answers tend to be a little bit generic there's a lot of hallucinations um graph rag kind of pulls us to the end of the spectrum where now you're you're getting answers from that that Knowledge Graph you built you can evolve over time and much more precise answers which actually get to the heart of of real problems in in life sciences and Manufacturing and business critical Industries where you can't afford to be wrong and also where in in industries that are complicated if there are a lot of connections that might not appear in a relational database because no one bothered to make the joins permanent whereas in a graph those joins are there to begin with so if you search for one thing suddenly the neighborhood of related stuff becomes available to you to share with an llm for better contextual knowledge yeah and you know I think just if folks are implementing this or folks are thinking about how to how to think about architectures for graph rag um this is a really simple way of thinking about it so basically what you're doing is you're taking your gen application and you're doing both a vector and the knowledge graph representation of the data so you're both ask asking the vector for the answer you're getting relationally close nodes from the graph database where you're getting additional context and passing that into the LM and then this gives you more contextually relevant results coming out of your your expert system so I think this is a great way to use a Knowledge Graph either that you built up over time or that you have the LM construct to kind of get those Superior results where you could do better governance you can put controls and property on the graph nodes to control who has access to the information you can get better explainability now because when you're getting an answer from the LM you're no longer looking at statistical probabilities in the vector space you're actually looking at graphs and nodes and edges which we can reason about and we can start to understand the relationship between understand like what what things are related to manufacturing which things are unrelated to that they're just you know general terms and for the right application maybe we're saving lives getting drugs to people more quickly and using gen for a good cause so thanks so much for joining us for our presentation at AI engineering Summit and um appreciate everybody thank you [Music] our next speakers integrated AI coding agents into the largest travel site in the world here to tell us how is Bruno [Music] pasos how's everyone doing all right it's been it's been a fun morning so far it's great to see like a a huge uh range of Faces in in the audience everyone building software from Big to small uh so my name is biang uh I'm the CTO and co-founder of a company called Source graph we build Dev tools for uh big messy Co bases yeah and I'm Bruno Bruno pasos and I lead uh the product site of developer experience at booking.com and um yeah over the past year I've been overseeing the um the geni Innovation side of booking as well cool and today we're here to talk about uh how we're partnering to build software development agents uh that made a bunch of toil inside booking that are actually having real Roi and impact so how many people have heard this before you know you're the you're working inside a large company the CEO comes in and says like hey we need to adopt AI uh and then folks are like okay uh what does that mean you know how do we measure it you know maybe like fomo purchase co-pilot or something like that uh and then 6 months later uh someone else maybe the CFO is asking you hey so what's the r I of of that AI tool we just adopted or you know what's the measurable impact of the agents that that we're building um this is a question that I think a lot of people aren't quite sure how to answer right now but Bruno and booking have been sort of on the Leading Edge of answering this question uh and very ProActive at uh acquiring and building the best tools and also following through to demonstrate uh how they're actually impacting their org it's very kind of you to say we are we are leading this I think we are we are right at the beginning and I feel couldn't be couldn't feel further from from actually the Forefront of it uh but let me let me start by talking a little bit about booking um I am sure uh the most of you would have heard about this company uh our goal is to make easier for everyone to experience the world and my team's goal is to make sure that our developers have their path cleared so that they can do their best work now are we close to that in some parts of the company yes other parts we couldn't be uh farther away from it to get to set a little bit of context uh we are one of the largest uh online travel agencies in the planet um and we serve about 1.5 million room uh nights uh um with more than 3,000 developers uh can you raise your hands uh who work in a company that has more than a thousand developers quick show of hands good number of people okay uh on the more on the on the dev side or on the technical side um we serve over 250 50 merge requests uh at a given year uh with 2.5 million CI jobs running at a given uh year as well and we are extremely data driven um our company has gotten to where it got to over experimentation and being obsessed about data and the reason I'm going into this is because as we experiment and in the form of primarily AB tests we start adding those experiments and and feature Flags into the code base and as we push forward to bring new features to our users uh most likely those experiment Flags or dead code will stay in the code base and now fast forward decades our code base became extremely bloated uh fun fact I was uh my kids were looking at me uh editing this slide and they said what are feature flags and I said well um you know they stayed the code base and they sto polluting the code base and they were like like code farts and I'm like now you're going into code smells it's a different topic but let's uh let's move forward um and so as the the code Bas starts to blow tap and become bigger and bigger cycle times also become uh larger and longer and they the time that developers spend to debug and to work on that code base just becomes over 90% toil right who here is familiar with this that's even more HS than than than than a thousand developers and so we survey our developers at least a quarter uh on how they're feeling how they're how they're they're feeling about working on that particular code base and it's it just becomes harder and harder for them to do anything and so we had to do something about it so I've seen the best developer Minds My Generation destroyed by decade long dead feature flag migrations it's crazy L Channon actually say that but I mean seriously though like they're probably like Geniuses out there like I was talking to someone from PWC the other night and described the system that they're building to like update uh all all the kind of like Legacy code in their system and it was amazing like the guy was really smart really brilliant uh really like interesting Tech but wouldn't it be great if you know those sorts of Minds were unlocked to actually work on like you know new features and thinking about like user problems rather than all this kind of like Legacy craft and so in a nutshell that's why Source graph exists uh as a company so our mission is to Mak building software at scale tractable uh and so you might be familiar with a couple of the the products and tools we built uh along the years uh code search it's kind of like a Google for your code allows any human developer to find in and and build a working understanding what's going on we have a tool for large scale refactoring and code migrations uh you might have heard of our AI coding assistant Cody it's a context aware uh code generator that's tuned to work well in large messy code bases uh and the topic of this talk is really about the agents that we're building to automate toil out of the software development life cycle so a bunch of different products that we built over the years the unifying theme really is to uh accelerate things in the developer in Loop augment human creativity there and then to automate as much of the BS out of the outer loop as possible all right so um SB young talked about uh Source Graph Search just over two years ago we started using their product and there was a big success within our uh our community because they were able to search that bloated code base much much easier and find small pieces of context lying here and there I I totally encourage you to have a look at this uh particular product is awesome um and so about a year ago uh January last year we started experimenting with Cody and why because Cody also has has Source Graph Search as context and so it became extremely useful for us to use a uh tool that had that context to be able to experiment with the the Gen topic and now we are hoping to reach the path of uh uh building agents with Cody and and Source Graph Search uh uh built in all right so um if I summarize very quickly and hopefully this illustrates how fast things are moving uh uh forward in January we started um with Cody we gave everyone one the ability to start using the tool in the company so all our 3,000 developers uh um had the the opportunity to use it some started using it some uh uh uh used it didn't see any value with it and then stopped using it and that started intriguing us and so back then right in the beginning of the year we had the choice of one llm to use across the entire company and some token limit uh uh um uh limiting what we could do with it and so the first thing that that we started pairing with Source graph and we appreciate the partnership on that was to remove all the the guard rails that we had in order to be able to really give it a go and so Source graph was very quickly to be able to give us multiple llms per developers we could choose that and why that was important it's because we found the llms had expertise right and so if we were going to excavate our code base our bloated code base a particular llm would do better than someone that was working on a completely new uh uh piece of service and and developing features there and so fast forward to July um we started training developers and that became incredibly important because the people that started using and didn't see the value when they started getting trained they started using it and falling in love and becoming what we call that now daily users and I'll explain how why that's important um and then we started looking into more metrics back in January the main metric was I was saved and um I mentioned that we are a data driven company and I was saved wasn't the most statistic relevant metric that we could use it was based on Research only over a couple of developers a few developers and uh that wasn't cut um raise your hand here if you uh heard folks out there in the beginning of the hype talking about thousands or or 80 100,000 hours they saved with Gen has anybody ever heard that and then you go back to your company and say why are we not doing this I call that semi BS uh uh uh and so we had to start going into other metrics something that were more statistically relevant and so we started brainstorming with that come October October we Define new kpis which I'll go deeper into it and metrics to measure uh to measure gen and fast forward to November end of last year we then started finding traces that developers were 30% plus faster if they were using Codi on a daily basis and that's 12 plus day in a month to take away weekends and the times that they they were not coding and most importantly we were able to partner with Source graph to be able to create an API layer in front of cod so we could be creative in using with some of the tooling that we use like slack jira and and being able to extract some of that away from the ID all right so as we as we we finish around October we started looking into those kpis and what was important for me is that we defined something that we could measure within a year why because things are moving so fast and if we it was really helpful to ground as to what can we measure within the next year and so we defined four kpis the lead time for change quality code basing sites that would then go into how we could modernize some of our bloated code base and so some of the metrics uh uh when I say short mid and long term these were metrics that we could see results in the short term in the midterm and in the long term and that longterm is precisely a year and so we started seeing results with time to review am Mars developers that were using Cod on a daily basis would ship 30% more M than the ones that that didn't and one very interesting piece is that their theirr were lighter they had less code in it which I still don't know what to make out of it but we are we are working on it and then on the quality side of things we are hoping to go into the vulnerability can we show some of the vulnerabilities we've had in the past give the context the code bases context and try to see where we can predict whether new vulnerabilities will uh will appear or if they're still lingering our code base and then we started the obvious one is test coverage can we increase test coverage can we create test coverage on the Legacy so the new stuff when we rep platform passes that that particular set of tests and then we went into coding sites which is more related to like can we track what parts of our codebase are not being used some feature flags that are still lingering but shouldn't be there and the code that is not performance enough and all of this is to feeding to our ultimate goal which is can we bring the time to rep platform our codebase from from yeah years to months right okay so while all this is going on one of the things we noticed is that the same Engineers that were using the the like coding assistant to generate code were also playing around with the underlying apis and so what we realized is that like asking people to customize prompts leads to them wanting to build and compose those calls into longer chain automations that we now call agents um there are a lot of pitfalls that we encountered uh uh you know in in the early stages of this like helping people understand what the expectations were with respect to what the the LM can do and what it can't do but the long story short is at some point we basically said F this it's not really working let's just like put our brains together you know fly out to Amsterdam we'll do like a weeklong joint hackathon and build some agents together and so the first thing to come out of that hackathon was this thing that uh generates graphql so booking has a a huge graphql API play the video it uh it's seriously like more than a million tokens long and so it does not fit into the context window of any of the existing uh llms even if you could shove it inside context it's not going to do a good job of of integrating that context into something that's coherent a ton of hallucinations and so what we did is we built this system that basically searches this very very long graph P schema finds the relevant like nodes where wherever they are in this like schema tree uh and then uh agentically figures out which ones are relevant and then walks up that tree to pull in the relevant parent nodes and so on the right hand side you can kind of see it's like inner dialogue this is like it's thought process for uh reasoning about which nodes of the schema to pull in and then uh after it's done that reasoning it generates a response and so if you do this naively you know the UI looks very similar but you just end up getting garbage which is what we were seeing you know before we ran this hackathon after we sat down and and and actually worked through like the specific prompts and stuff uh to make this work well we saw far better results all right so um a pretty interesting one that uh uh that we started uh uh working through in terms of Agents were the automated code migration could we go into that Legacy piece functions with over 10,000 lines to give you context and uh uh and speed up that rep platforming efforts and so uh code search structure it structured met meta prompts uh and then the the concept of dividing that particular code base to conquer the small bits were were really uh really interesting um one of the things that I totally recommend if you started to embark on on on a journey like this is pairing with some experts to bring that expertise into into your offices was incredibly valuable to us and we started seeing uh back to when I mentioned that the developers were using Cod and stopping and feeding back doesn't doesn't add any value was pure lack of knowledge folks didn't know how to work llms out they didn't know how to pass the right prompt in the right context and this was uh a pretty important piece for us to be able to uh to work on this particular uh agent and so when we go into this we had developers working for months at this point to try to figure out the size of the problem that we had to then be able to divide and conquer and then we came within two days within a hackathon we were able to really Define uh and understand where the the call sites were coming from and then being able to Define how big the problem is was important for us to be able to have a start point and then collect the low hunging fruits that were available for us so U all of this is still in experimentation uh mode uh but we' seen a lot of values and a lot of uh uh sort of like fire in that smoke in going from mon uh of of understanding the code Base today cool and so the the last agent that really came out of this joint effort uh was targeted at code review so this is something that we found is is pretty Universal across many different Enterprises like everyone who does not do code review here one hand okay uh I'll talk to you later sir um so like everyone does code review and what we found like originally we didn't think this was a very interesting space because there's like two dozen startups now that are popping up that do AI code review but when we talked to booking we talked to other Enterprises what we found is that like code review is kind of like very specific to your organization there's a long tale of like rules and guidelines and other things that uh you want to bake into your review process and a lot of the tools that are off the shelf there aren't super customizable and so what we built is this interface uh where we're going through and productizing the the process of building a review agent that's tailored to your team and your organization so the the basic idea is that you define a set of rules that you want to hold in the code and then those are defined in kind of like a simple flat file format and then the agent will go and consume those rules apply the relevant ones to the specific files that are modified in any given uh PR and then uh very selectively post up comments uh that are tuned to those rules so it's very not noisy uh we're trying to optimize for you know Precision uh over recall here in in the feedback that we're we're giving uh the the the developer all right so knowing what we know in the beginning of this year we've been working on this for a year together uh uh with Source graph then a few ideas started uh popping into our minds of how we could go forward here and uh one of the things that I'd love to leave you with is the concept of declaring what are the rules of your service right so think think of your CI pipelines today uh when they give you errors could we anticipate and shift this left to the IDE so those errors appear there and they appear in the form of here is an error and here is a fix so hopefully the service gets to a point where it's self-healing right and we started seeing that we could do that that there are there are there are areas that we can start using giving the all the context all the prompt that the developers started creating via the prompt uh library that we created and uh asking those questions Auto automate those questions to the server to see what comes out in terms of knowledge uh um uh out of that code base and so we think this is ultimately what we we are trying to achieve within I would say a short as short as the end of this year in terms of uh agents um but lots um Lots here to to go can I sorry can I just say one more thing about that last uh slide um I think that we have the potential here to solve one of the problems that has plagued software development since it Inception so you know who here has read the mythical man month before so yeah basically everyone so like it's this problem of like any software that becomes successful eventually becomes a victim of its own success because if you have Revenue if you have users that's going to generate feature requests bug reports any business that's prioritizing that is going to take on Tech debt to in order to compete quite frankly and over time as you add contributors to the code base you lose this cohesion of vision you lose the set of standards that you want to maintain and hold uh with declarative coding now you can have like the senior engineers The Architects the the the people in charge of the organization Define constraints and rules that must hold through the codebase and enforce those rules both at review time as well as inside the editor for you know the code that's written by human or AI yeah for bigger organization all your compliance rules all the things that that the developers need to work on but it's not necessarily feeding new features to your to your end users I think those could are perfect examples of um um yeah being declared into your service but anyway the main important thing so far in this past year that we've been uh um you know pairing to be able to figure this out has been education the more we educated the developer and hand holding entire business units to be able to show them the value but then have them experiment within two days of like workshops and hackathons have them experiment with the tool they were coming out the other side incredibly passionate about what it can do but also becoming that daily users that we are trying to transform them so hopefully to defend that 30% plus increase on speed so educate your folks if you take one thing from this is education and if you want to dive deeply into any of this we got a booth downstairs feel free to stop by we'll talk shop or uh also tomorrow I'm given an expo talk that covers some of the more nitty-gritty details of how some of those agents were implemented so thank you thank you all [Applause] [Music] our next presentation is about building trust in Enterprise AI please welcome to the stage the co-founder and CTO of writer Wasim [Music] Alik hello everyone my name is Wasim I'm one of the co-founder on C a trer today I'm going to just tell you a quick story about actually why we building a trer what we doing but before we dive in I would love just to give you quick history of writer so writers we start the company in 202 we love to say the story of writer is the story of the Transformer we stting building those decoder encoder model in the early days and we start we kept building those model and build a lot of them today we have a family of models I believe around 16 we published we have another 20 coming in the way and we keep building those models and you're going to see from this list those model com in two categories General model like p x p 3 4 if you have a b 5 coming soon and we have a lot of what's called domain specific model Creative Financial Services B Medical now early 2024 basically last year almost we start seeing this trend with all the LM basically get very high accuracy in general with the Ed punish Mark we're see the accuracy moving and just the growing and I believe everyone noticing this accuracy today average accuracy for a good General models between 80 maybe close to 90 so that basically make a bring a question inside the company saying is it worth it for us to start building and keep building domain specific models if the accuracy today with General model achieving around 90% And we have domain specific model should we just keep building General models F tuneit maybe go Direction with what you call reasoning or thinking models and that would be more than enough actually and we don't need those financial or what call domain specific model now to answer questions we need data so whatever we going to present next actually could be applicable to financial services domain specific model sorry to Medical specific model customer support domain specific model and all different domain specific model today I'm going to talk specifically about the financial spec uh called the financial punchmark for domain specific model uh we have something similar for medical but we believe we are but we start seeing similar result now let me dive in just to remind you we're trying to answer these questions General model theic model should we keep build them where we actually going from here we start actually saying great we don't know the answer let's actually do the evaluation let's create the data and we created something called fail the idea behind it let's create real word scenario to evaluate those model and let's see actually of those new model can really give you the accuracy that we a promise or the accuracy that we see today from the punch marking on domain specific we created two type of categories in this evaluation something called query failure in query failure basically we introduce three type of subcategories something called misspelling queries you know when you go ask the llm questions but you do some spelling error segment error you do some com comment typo issues we introduce that to the eval set we introduce something in like what called incomplete queries you're missing some keyword some stuff not clear we introduce what's called out of domain queries if you're not expert in the field or you decide to copy paste some general answer try to answer about something very specific specific and also in the second category is what we call the context failure and the context failure basically and this get very interesting we introduce three subcategories what called basically messing context we basically ask the llm question about context not exist in the the quest itself in the BR we introduce what you called OCR error today when we do any kind of OCR or convert physical do document to text we introduce a lot of Errors like you know character issues distance between them the word between when you do the OCR could be merged together so we introduce that type of errors and also we did what call unrelevant context let's say you want to ask question about specific document and you end up basically uploading completely wrong document does the LM going to still answer is the LM just actually figure out you have a completely irrelevant context now when you put all this data together in domain in financial specific Financial Service specific you need some kind of diversity just a quick screenshot just tell you about amount of data how much token something worth mentioning the white paper the data the evaluation set the leaderboard all actually open source today available in GitHub and hugen face so anyone please check it out and we introduce very simple what call it evaluation key Matrix basically we need to look to two things and the model give the correct answer can the model actually give good follow to the grounding or context grounding or basically what you call it here the context this is quick or high level way of how we do the calculation so to evaluate we selected a group of models today we can see a lot of chat model and also thinking models this is basically the two list we have here I'm sure you familiar with this list and then we on the evaluation and we start seeing very interesting results I'm going to dive in directly to the result and basically we start getting something Fancy with all this color let me switch to the what basically see what's start getting very interesting we're saying really good behavior in all thinking models actually they don't refuse to answer this sound good most of the time but in reality when you give something those llms wrong context when you give them wrong data when you have a complete different grounding those model actually fail fail to follow this part and they still give you an answer and that basically get you way higher hallucination if you start focusing on the answer itself can the model give me answer or not you can see basically almost every model from the domain specific to General model they give you some kind of answer all them close to each other actually reasoning or thinking model they get to even higher score a little bit from there but when get to the grounding and con grounding this is when stuff get more interesting you can see specifically in task like text generation or question answering it's just not performing well now all does the chart look great what I prefer is the numbers this is the same data we use to generate the chart we can go through this really quick and if you look at this number here especially for example like the o1 or o03 or B fan you can start noticing the stuff those model doing amazingly and basically when you ask was it misspelled when you got stuff un complete out of domain the numbers look amazing the model can take a query with mess spelling wrong grammars or even out of domain and still can give you the answer but when you start going to grounding this is going to stop get very interesting I'm going to hold this slide for a second here if did you notice something different yep and also those bigger more thinking give you the worst result you're getting almost 70 50% to 60% uh worse in the grounding meaning the model is just not following you attaching context you ask the questions and the aners exist outside the context completely same thing coming stuff around it anal context so you can look at the data see smaller model actually performing better than all this model over thinking at that side and this is basically will get us about is this thinking or just a Chain of Thought you know this could be a lot of argument at least from the data we have in domain specific task those model not thinking at that stage meaning Hallucination is really high causing a lot of a lot of issues especially in this fin in this uh punchmark we run here in fin Financial use cases also we can see there's a huge gap between what you call robustness and the hallucination and getting the answer correct so definitely we still have a lot work to do to build those model and better performance but also that get me to you know to the main idea if you go back real quick here even with the best model between all the slide we're still not getting between robustness and cont surrounding more than 81% sounds a great number if you think in the reality you saying every hundred request 20 of them it's just completely wrong so that basically what we start seeing believe at least today with the technology we have with the current model we have until we have something completely different we're seeing you need full stack you need the arck system you need the grounding you need everything from guard rails and the build around the system itself to actually have something reable utiliz today in the same time I would love to go back and answer the first questions and our first question here do you still need to build the models at least today from the data we have from R those punchmark the answer simply yes we still need to build and continue to make a specific model at least with the today implementation even accuracy is keep growing but the grounding the context following all the context correctly it's still way way way behind from everything we see today in the market thank you so much guys [Music] our next presenters are here to share real world case studies from open AI please welcome to the stage member of technical staff of open Ai prant matal and Toki sherbakov head of solutions architecture of open AI hello uh thanks for having us here and today we're going to talk a bit about building and scaling use case with open Ai and what this means in terms of Enterprises working with open to bring use cases to production and a little sneak peek into agents and how we've seen some of our experience building these use cases in now agentic workflows uh in the field so uh on our side um just a quick introduction into open AI I'm sure folks have probably heard of open AI but just in terms of how we operate we have two core engineering teams we have our research team which is 1,200 researchers that are inventing these models right we they build and deploy these foundational models these kind of come down from the heavens our apply team our second engineering team take this and build it into product so this is where you see things like Chad GPT you see things like the API where GPT models are available and that's where we actually deploy this finally in the go to market sense where we take these products and put it in end user hands that's kind of where our team comes into play with go to market where we actually help get this in the hands of your Workforce in the hands of your product and really start to automate these internal operations and once we finally deploy these there's kind of this iterative Loop where we take feedback from the field to improve our product directly and then also improve our core models through this research flywheel so that's kind of the last step of getting it back to research so this is typically how open AI operates um in terms of the enterprise we see the kind of AI customer Journey happen typically in three phases it doesn't have to happen in sequence in this way but this is what we usually see is first and foremost building an AI enabled Workforce this is getting AI in the hands of your employees to become AI literate to use AI every day in their day-to-day work that's the first and foremost that first step typically that we see then from there you typically graduate to towards automating your AI operations this is actually more of internal use cases to build in automation or maybe some co-pilot type use cases into the workforce then the last step here is actually infusing AI into end product this is end user facing so when it comes to open ai's product specifically enabling your Workforce typically starts with something like chat GPT so this is our you know first party product to put in the hands of users to use day in and day out then when you talk about automating operations internally you can do this partially with chat GPT for the more complex use cases or more more customization is needed that's where something like the API comes in and then finally infusing this into your end user products is where it's primarily API use cases but just to give a flavor of how these products come into play when actually executing this across your AI customer Journey so in terms of how we see Enterprises actually craft this strategy in practice it kind of happens in a few different ways I'd say first and foremost you determine a little bit from the top down level of what should the strategy be and one core thing that we acknowledge here it's not actually what's your AI strategy it's actually what's your broader business strategy and what open AI does is help figure out where does a technology meet that broader business strategy first and foremost so that kind of top- down Str strategic guidance is really important to start with and then once you start with that top down guidance you then move to use cases like let's identify one or two mey use cases that are high impact to start with and scope those out to really just deliver on um kind of that scoped scale so once you have the strategy you execute upon those two use one to two use cases and then think about how to build divisional capability across your Enterprise this is where you start to enable the team and to fuse AI throughout the organization and this happens in many ways this comes through enablement this comes through building centers of excellence this comes with building maybe a centralized technological platform that other people in the Enterprise can build on and I feel like that's typically the journey we see is again set the strategy pick those one to two use cases and then build that capability across your organization through enablement so that's usually the the type of Journey we see and just to illustrate this a little bit with an example is this is how we've seen the use case Journey play out so um this is illustrative of a three-month type of example of a use case but when you've identified that one to two use cases that you want to tackle first and foremost you have to ideate upon that do some initial scoping do some architecture review to understand how does AI going to fit into your current stack and then really clearly Define what the success metrics in kpi are once you have that established the bulk of the time is really spent in development this is where you iterate this is where you are iterating prompting strategies incorporating rag whatever it may be to constantly improve the uh use case that you're tackling when it comes to engaging with open AI this is where our team like Brant myself really interact closely with your engineering team through things like workshops things like office hours paired programming sessions webinars whatever it kind of takes to accelerate the use case forward once we doe that development phase we kind of move to testing and evaluation which is with the evals we've typically defined up front we're able to actually now do some AB testing do some beta roll out to understand how this actually works in practice and then finally we go to production this is where you just do some launch roll out do some uh scale optimization testing to make sure it's going to work once you deploy it to many end users and then we have kind of constant maintenance that that's ongoing so that's like the typical phase you'll see and again the bulk of the time especially in partnership with open AI will be around development um in this we bring a dedicated team we ask you bring us also a dedicated team to make this work in practice and the things that we deploy also to enable you are things like early access to do models and features that's one of the key things of working closely with open AI is that we can see a little bit into the future not much like our road map I I don't see beyond much maybe like six months people ask what's your 18-month road map I cannot tell you I can tell you basically what's going to happen the next two quarters but that like purview into the future is really important to bring forward to these use cases and enable customers to build and innovate for what's coming next so that's a really critical part of our partnership um also we bring in you know internal experts from our research engineering team our product team to help kind of accelerate you on this path and then lastly just kind of do joint roadmap sessions to make sure that we're on track for what your future road map is as well so that's hopefully an illustration of how we partner together and then one example on this is something we did with Morgan Stanley so Morgan Stanley here based in New York uh was building a internal knowledge assistant so what this was was giving their wealth managers the ability to qu ask questions of their large corp Corpus of uh knowledge which was research reports like live views on stock ticker data whatever it may be and they wanted to get highly accurate information back to be able to respond to their and clients right and accuracy was pretty bad to start right it was 40 45% typically what they saw so interacting with us we introduced new methods throughout the use case development things like hide retrieval we did some fine tune embeddings different chunking strategies which improved performance and then once we kept introducing more and more methods we saw accuracy go up we introduced things like reranking and classification step that got it 85% and ultimately their goal was 90% we got to 98% accuracy through other things like prompt engineering query expansion so more of just an example of how we introduced methods throughout this use case journey to uh improve their core metric for more conly in this case um so this is hopefully one illustration of how open eyes partner with customers and one common use case we're seeing more and more of is now building in this agent space you maybe hear that 2025 is the year of Agents agentic workflows has been a buzzword for a long time I think we're seeing that actually come to reality this year and um I think with that we've seen uh some B we have some Battle Scars and some best practices of what we've seen in the field and I'll hand it off to pant to talk about what we've seen on the agent side thanks DOI so at openi we are lucky to work alongside customers who are building state-of-the-art agents and working alongside team members who are building our own agentic products like deep research and operator like Doki said we expect 2025 to be the year of Agents the year gen gen truly graduates from being an assistant to being a coworker and to help usher in this era we've been hard at work identifying the patterns and anti patterns prevalent in agent development I'm excited to share four of those with you today before we can go further I'd like to quickly Define uh what we mean by the term agent so we think of an agent as an AI application that consists of a model that has some instructions usually in the form of a prompt access to some tools for retrieving information and interacting with external systems all encapsulated in in an execution Loop whose termination is controlled by the model itself so one way of thinking about this is that in each execution cycle the agent can be thought of as an entity that's receiving instructions in natural language determining whether or not to issue any tool calls running those tools synthesizing a response with the tool return values and then providing an answer to the user Additionally the user may determine sorry the agent May determine that it's met its objective and therefore terminate the execution Loop so with that definition let's move on to some of the lessons that we've learned uh building these agents in the field so for the first Insight imagine you're designing an AI agent you need to orchestrate multiple models you need to retrieve data reason over it and generate an output you have two choices you can start with Primitives making raw API calls logging results yourself um and logging outputs and failures or you can start with a framework you can pick an abstraction you can wire it up and you can let it handle a lot of the details and I have to say starting with a framework is pretty enticing it's how I got started building agents it's really easy to get started have a proof of concept Concepts stood up in no time but the problem is that if you start with a framework you often don't actually know how your system behaves or what Primitives it uses you've deferred design design decisions before you've understood your constraints and if you don't know your constraints you can't optimize your solution so we believe a better approach is to First build with Primitives understand how your task decomposes where the failures happen and what actually needs Improvement then introduce abstraction when you find that you're Reinventing the wheel for example by re-implementing an embedding strategy or re-implementing model graders that may be a good time to bring in some abstractions many teams today are spending a lot of time picking the right framework um we actually believe that developing agents in a scalable way isn't so much about choosing the right abstraction it's really about understanding your data understanding your failure points and your constraints so in summary the first lesson is to start simple optimize where needed and Abstract only when it makes your system better which leads us straight to our second Insight starting simple so too often teams are jumping straight into designing multi-agent systems agents calling agents coordinating tasks dynamically reasoning over long trajectories it all sounds really powerful but when it's done too soon it creates a lot of unknowns and it doesn't give you all that much Insight we like a different approach we generally recommend starting with a single agent that's purpose built for a single task put that into production with a limited set of users and observe how it performs doing this allows you to identify the real bottlenecks hallucinations over conversation trajectories low adoption due to high latency or maybe inaccuracy due to poor retrieval performance then knowing how the system underperforms and knowing what's important to your users we can work to incrementally improve it in a nutshell we should think of complexity as something which increases as we discover more intense failure cases and constraints because the goal isn't really to build a complicated system it's just to build a system that works so starting simple sounds great uh but we all know that complexity is where True Value is realized so how should we handle more complex tasks this is where a network of agents and the concept of handoffs comes in so you can think of handoffs sorry let's start with the network of Agents so a network of Agents is a collaborative system where multiple agents work in concert to resolve complex requests or perform a series of interrelated tasks you can think of this as a series of specialized agents handling subflows within a large agentic workflow on the topic of handoffs you can think of these as the process by which one agent transfers control of a active conversation to another agent it's pretty similar to how you get transferred to someone else on a phone call except in this case you can preserve your entire conversation history and the new agent just magically knows everything you've talked about already so let's see an example of this in this sample architecture we are showing how a fully automated customer service flow may be implemented with a network of agents and handoffs this approach is allowing us to BU bring the right tools to the right job so for example on the left hand side we're using a GPD 40 mini call to perform triage on the incoming request we're then using GPD 40 on the dispute agent to actually manage the conversation with the user and finally we are using a o03 mini reasoning model to perform accuracy sensitive tasks like checking whether the customer is eligible for refund it turns out that handoffs work really well and keeping the entire conversation history and context while swapping out the model The Prompt the tool definitions provide sufficient flexibility to solve a wide range of scenarios so our final lesson pertains to guardrails and just a level set guardrails is a catchall term today for any mechanism that enforces safety security and reliability within your application and it's generally used to prevent misuse and ensure that your system maintains Integrity so keeping the model instructions simple and focused on the target task ensures maximum interoperability of your system and also ensures that we are able to H on on accuracy and performance most uh predictably guardrails should not necessarily be made part of your main prompts but should instead be run in parallel and the proliferation of faster and cheaper models like GPD 40 mini is making that making this more accessible than ever tool calls and user responses that are high stakes for example issuing a refund or showing a user what information uh some information from their personal account these can be deferred until all of the guard rails have returned in this example we see that we're running a single input guard rail to prevent prompt injection and then a couple of output guard rails uh on the use on the agent's response so to recap we have four lessons from our time building agents use abstractions minimally start with a single agent graduate to a network of Agents when you have more intents and finally keep your prompts simp simple and focused on the happy path and use guardrails to handle edge cases thank [Applause] [Music] you ladies and Gentlemen please welcome back to the stage your MC for the leadership track session day Peter Humphrey all right folks thanks uh and thank you to pant and Toki uh it was really wonderful I think to have them here today to hear about everything that's going on with open AI uh it's been a pretty exciting morning folks uh We've dived into topics like knowledge graphs how agents fit into the existing software development life cycle uh domain specific llms and of course just now open AI um so if you want to just a quick reminder if you want to discuss any of these topics meet the speakers have some question and answer time uh just go to one of the three Q&A lounges that I mentioned before uh there's one on this level there's one right at the bottom of the stairs on the Expo level and there's another underneath the stairs tucked under uh and the speakers will be in one of those three areas um also please take this time during lunch uh to go stop into the sponsor Expo um that's again where lunch is being served and uh our sponsors have um pretty amazing products Technologies and services to help you on your journey um so also if um during lunch a little special uh late ad which I'm we're we're kind of excited and pleased to tell you about uh we're going to bring a little Family Feud back does anyone remember Family Feud that game show TV game show can yeah all right a couple folks um we're gonna have a teams from leading AI Frontier Labs basically in a head-to-head Family Feud style battle of wits uh and this is going to be hosted by bar euron a partner at amplify uh Partners um and we're going to feature family feers like Mahir Patel uh shest the Malik John ma Tina Zoo Petra grutzik Colin flarity Steven roller Paige Bailey just to name a few um so if you that's going to be right here if you want at 1:15 if you decide you want to kind of exit launch early and come back and enjoy that just for some funsies um all otherwise we will see you back here for the continuation of leadership sessions at 1:45 p.m. in the theater please enjoy your lunch and thank you for being here [Music] [Music] [Music] [Applause] [Music] [Music] [Applause] [Music] oh [Music] [Applause] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] e [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] welcome back everyone as an AI language model I cannot taste food but I am certain you enjoyed your meal but enough about me please join me in welcoming back to the stage your MC for the AI engineer Summit leadership track session day Peter Humphrey welcome back thank you hope everyone had a bite um and uh just a quick reminder uh not going to be up here long um thanks for coming back from lunch a little early uh we are going to have some fun with a family feud Style game show just again just curious who remembers Family Feud anybody okay so a couple more hands than than last time uh it's a game show Q&A Q&A Style game show where the contestants are under a little pressure so this should be fun um I would like to call up Baron uh a partner at amplify uh partners and she will introduce the rest of our family feers have some fun enjoy thanks for coming back [Applause] [Music] that we're all [Applause] coming okay welcome to Frontier Feud so I'm your host uh bar yourone I'm a partner at amplify we're the First Investors in technical founders and I invest in data and AI companies I'm very excited to be here Frontier feuding in my favorite City New York City and today on stage we have some amazing folks competing for prizes and eternal glory and all that good stuff so we're going to introduce our teams uh do not let these smiling faces deceive you we are here to compete uh before we do intros for the teams just an audience feeler who here in the audience including those on stage have watched Family Feud before quick show of hands okay most people for those of you uh who haven't the premise is pretty simple the nuances you'll learn with the game so we surveyed 100 AI Engineers on a series of questions folks on stage are going to guess the answers to those questions and the most popular answers from the survey are going to get the most points so feel free to follow along and guess yourself in the audience as we go along uh if you don't like the answers you can look to your left and look to your right and blame your neighbors no I'm just kidding in all seriousness thank you so much to everyone who filled out that survey before and so we're going to do some quick intros um we have on my left M you can kick it off tell us who you are what you do and your Tech hot take we're going to do quick intros cool uh hey everyone my name is m here uh I work at anthropic um my tech hot take is I think uh let's say the five is people training big models today at least one of them will no longer be training AI models by the end of the year mic drop but really mic pass so John hi guys I'm John I work at anthropic and my tech hot take is I think AIS make really good therapists hi I'm Tina um I work at reflection AI it's a startup building coding agents and my tech hot take is that everybody working in our industry should spend like 20 minutes a day with just a piece of paper and a pencil um and think like non AI interrupted thoughts thank you hi I'm sha I'm the product lead for the Gemini developer API working with Paige uh I have two my first tech hard take is a followup from Tina's I think it'll be important for people to do what Tina said because eventually when you have ai models represent you they'll need some good content to be trained on I think I'll save my next one for later I love it excellent paig you want to start your side heck yeah so just be aware that uh we did not pregame it no fraud no fraud um uh my name is Paige I work at Google deep mine leading uh engineering for our AI deell team um uh and I guess my tech hot take is that uh speed forward ahead a year and a half I think that the majority of deployed models will be on device um and we'll be seeing a trend more towards smaller models um that are uh kind of orchestrated together maybe uh hypers specialized for a specific task um as opposed to relying just on larger models um and sending data elsewhere to do interesting things uh my name is Colin uh you might remember me from my talk earlier today I'm a researcher at augment code building AI coding tools before that I was at Fair Facebook AI research here in New York on working on AI for board games I I just had to get out both hot takes since I had two two ones so one is that in the future dating apps will be our AI dating each other um two is I don't think Transformers of the final architecture because eventually we'll build these models using biological materials oh snap hi my name is Petra I'm the product manager of factuality for the AI answer at the top of the Google search page my hot take is that um chatting with uh one bot one-on-one is boring and in the future most conversations will have at least two other bots in the in the room with you I love it hello uh I'm Stephen uh I'm joining thinking machines soon um and my H take is that if you think we're hitting a token wall that sounds like a skill issue to [Laughter] me I love it amazing so with that let's get started I'm gonna have Paige and Mahir join me not that you're so far from me excellent great so the objective is like hit this as fast as possible right yes okay so so um so whoever buzzes first has the first chance to answer this question so I'm going to ask a question and we'll take it from here so we asked 100 AI Engineers name the most influential AI researcher God dang it I looked at the screen okay what's your what's your guess uh Ilia satar okay we have Ilia n shazir Noom shazir very influential but not on the board not in our not in our top eight answers so your team decides if to play or pass all right we're going to play You're going to play You're going to play okay okay awesome so John it's it's your turn to guess someone I'll guess another person but you can stay there we're working out the Kinks cheating frud Onre car answer good answer all right Tina how about you Jeff Hinton oh an excellent one but just oh sorry did you say Jeff Hinton yeah oh that's on me um I thought you said something else great I the the the lead auor on the Transformer paper the attention all you need paper we need we need a specific name sorry um sorry I'm having a literal like I know the name but I uh I'm having a literal like uh we're gonna have yeah I think but we you already had a strike that doesn't count so we're gonna we keep it we're netting at one go here I'm GNA say yan laon yan laon we do have Yan laon okay why do we keep going we just keep going because you're winning oh until we lose yeah until you lose oh I didn't I if you want to lose here's your opportunity uh no so not on the board um what was the question again good question so name the most influential AI researcher um I'm gonna say sam Alman Sam Alman the good one it's not on the board and so we are at three strikes so your team needs to deliberate and if you pick an answer you actually steal the points we have four options on the board and for those who are joining the question is we asked 100 AI Engineers name the most influential AI researcher all right I'm giving you a five four you're good all right I'll do Jeff Dean um Jeff Dean is not on the board which is what I had registered earlier um okay so that goes to uh Roo's basilisk and we're going to go yeah I'll show you yeah I'll show you the rest of them so these were the ones that were guest we also have Andrew in we haveu we have F Fe Lee and we have yosua Benjo there is a longtail very influential researchers too but they they're only eight spots so all great guesses okay next all right uh I think John and Colin what is it yeah you want to know the question yeah patience so name the top considerations when choosing a model oh I think your's brok I first more passion uh come aggressively this team but we'll yes name the top considerations when choosing a model intelligence I'm going to be very literal very smart answer very smart answer you have to to guess uh safety safety uh is actually not on here and so why don't we get uh Tina and Petra to guess am I next yeah oh um price so cost is the number one answer which actually surprised me I don't know if like maybe raise of hands if you would have guessed price now no so hindsight's 2020 that wouldn't have been obvious to me but this means that you guys will continue with this answer and we'll we'll we'll continue from there latency latency uh yes uh eval benchmark scores some of these things are ambiguous but I'm gonna put it under don't don't kill me accuracy performance [Applause] okay that's like [Laughter] everything okay um uh the CEO the CEO good answer good answer good answer good moving on Oh I thought because we miss oh no we your team got it um like where the the model is being served like if it's on Prem or you know uh no we're not GNA count that one but it's a good answer okay this team has an opportunity to steal these points I hear some folks in the audience think they're ready to steal so I love that about you I love the confidence oh my God we can make a guess that was no it wasn't uh open source versus closed Source okay so with that you guys get 63 points cuz you stole the stole the points but [Music] so it's a pretty close game going into the third and final question yeah all right sorry I know everyone everyone wants the answers who has like very strong upper body strength so so here here are the [Music] answers yeah these are the answers great thanks for keeping me honest everyone in the audience and everyone here so all right next so who do we have coming up Petra and Tina incredible you can do it so both in the front Okay this requires uh Power is what we've learned so um are you ready ready yeah okay we asked 100 AI Engineers name a buzzword sorry it's okay I'm not offended but name a buzz word everyone in AI is tired of hearing agents agent what do you all think yes okay but you still have an opportunity to guess because you could get the number one answer oh but you have to guess now I'm giving you three two two one sorry um okay so it's going to go to this team and um uh sha it goes to you we asked 100 AI Engineers name a buzzword everyone in AI is tired of hearing I'm going to start my five four multimodality multimodality uh we do not have multimodality but I don't know where this strike came from so it's first strike uh go with co-pilot co-pilot that's a good guess what do we think yes or no to co-pilot that's a very good guess I think it's a good okay good answer good answer uh but no AGI AGI okay what what does the audience think AGI yes all right it seems like we think get the good answer and no just kidding yes the number one answer AI all right Tina oh um deep seek too I no no deep so this team has an opportunity to steal but you're right maybe maybe for like the week yeah you're back you're so back safety safety so unfortunately safety is not one of the [Applause] answers so the these were the additional answers people here are sick of rag and prompt engineer um but what this means is that we have a winner which is Roo's basil but amazing amazing work by paig attention and the mixture of experts and so we're going to move to the fast money round do you guys have two Representatives all right you two are going and that means John leave the stage I'll call you when you're ready all right we're moving into fast money and uh so some of you have seen Family Feud but we're playing fast money the goal is for the two of them together to get to 200 points uh you have Mah 20 seconds to answer five questions uh if you can't think of anything just say pass and we can come back to it at the end or never um and if you hear a buzzer sound let me check that it works beautiful your answer wasn't one of the surveyed questions you can keep asking or or moving move on all right are you ready y okay uh name an AI tool that Engineers love curser uh name the sorry I didn't start the clock name the job most at risk of AI disruption software Engineers name the most influential AI paper in history attention is all you need name the biggest nightmare for AI engineer at 2 a.m. uh Hardware failure okay so we got through four but they were good ones so name an AI tool that Engineers love you said cursor cursor is the number one answer we asked name the job most at risk of AI disruption you said software engineering that is the number three answer we asked name the most influential AI paper in history what do you think demolished attention is all you need is by far the number one answer uh name the biggest nightmare for AI at 2 am. Hardware failure Hardware failure you know what we'll give it we'll give it as infar problem okay okay yeah um and so you're going into the round 140 points you want to tap tap John in John ran away he got scared no there we go amazing incredible so John big things you have 25 seconds and you got to stand back there you have 25 seconds to answer five questions um and if you can't think of anything just say pass and we'll come back to it if we have time at the end uh if you hear a buzzer sound it means that either your answer wasn't isn't on the board or he has already answered it you ready y um okay and I'm going to start your timer name an AI tool Engineers love sir oh um ch um name the job most at risk of disruption um artists uh name the most influential AI paper in history uh Transformers is all attention is all you need okay so we have model apis which is uh the third answer on the board we have name the job mode you said you said artist so I'm going to give you content creationwriting it's a little ambiguous and attention is all you need was already uh selected by Mahir but we'll I'll let you pick the next question which is name the biggest nightmare for AI engineer at 2 a.m. someone wrote In Cold email from a VC so I will talk to you after uh uh Cuda SS Cuda SS okay well you know what your teammates because you think alike and you think you you think of great answers so let me just show you the number one answers here which are a cursor data entry for the job most at risk of AI disruption attention is all you need the the top paper so that was a good one just already taken the biggest nightmare an outage of the model or otherwise and an industry that would benefit the most from AI Healthcare uh so with that we have well you lost the 200 points but I think that you're still winners and you're also winners and we have a few prizes outside that I think someone is bringing so we have a massive llama for this team we have rainbow lamba Bean babies for everyone we have engineering books and gift cards to your favorite restaurants in New York and maybe we can bring them out and leave uh thank you so much for joining us for Frontier Feud and we're going to see you next year we're running a massive survey on the state of AI engineering going to be presenting it at the June AI engineer World Fair so if you liked some of the teaser questions this will get much more in depth into the tools folks are using uh the workflows that AI Engineers have and it's a way for the industry to be more more transparent uh so thank you uh you can find the QR code or the link here if you want to participate in that survey on the state of AI engineering amazing babies are but they'll come for for for for [Music] ladies and Gentlemen please welcome back to the stage your MC for the leadership track session day Peter [Music] Humphrey that was fun thank you um yeah that the uh I'll try to do my best uh Steve Harvey impression which is not going to work I'm just going to say it in advance so the feud that was really fun um well I hope you enjoyed that and lunch uh and hopefully talking to our sponsors for a minute in the Expo um all right so settle in hope you had some coffee buckle your seat Bel grab a helmet our next Sprint of sessions is going to be pretty awesome we are going to talk we're going to have an AI case study we're going to talk about AI evals Hot Topic of course AI observability AI infrastructure and of course a talk from none other than anthropic so uh with that please join me and welcoming our next speaker to Thea stage Shira chadri from Thompson [Applause] [Music] Reuters hello good afternoon I have before me the ominous task of making this presentation really interesting with a topic which is going to sound like a crib what are those missing pieces for workflow automation to happen with AI and I'm going to tell you really an Enterprise story is it dry is it just going to be about now I'm going to find out who took my lunch sandwich we'll see um and you know as I Was preparing for this talk and I realized in the schedule that this is going to be just after lunch I thought I should start off with a joke and since for all our daily needs we go to AI tools I try to go to a AI tool for a joke and they really suck I couldn't find one decent joke if you can tell me a good joke about you know using AI for your real world Enterprise needs I'd be happy to squeeze it in right now uh like it doesn't work can someone help me oh okay yep so the graph looks a little different from all the graphs that I've been seeing this morning we took this journey in our worlds in our Enterprise worlds um as we explored but before I dive into this I think I should do an introduction I'm shisha I've come all the way from Bangalore to tell you the story that I see unfolding around me not just in Thompson Reuters where I work trying to bring AI to my um to my um you know teams and um different business processes but also the same story that I hear at meetups and you know um different community events where where I meet AI practitioners um everyone started off trying to democratize the use of generative AI back in 2023 we have something called as open arena in uh Thompson Reuters very similar to your you know um playground where you can try different large language models this is where it truly came home to um almost everybody in an Enterprise to start using generative AI for their workflows further along we got on to the rag um and you know prompt engineering World pretty quickly we looked at um automating various knowledge driven tasks with the use of rag and very soon we were answering questions um at the Enterprise level on what is the ROI further along in 2024 we started to play with tools and Frameworks and we heralded the rise of the agents we are now here where we're looking at automating entire workflows with the use of AI SL agents and not just one task at a time right we are looking at a future where we want to reimagine business processes because just automating a task seems redundant okay so what what do we mean by workflow Automation and what I've got here is a very typical workflow for almost any company that's putting software out there customer calls you calls your service desk report a billing issue an invoice issue or a product feature are not working as expected your customer support um is going to take the call they are probably using rag to sort of answer that question already or they may be looking at you know internal tickets or connecting with um you know uh their uh hierarchy to see if the answer can if the answer can be given and if no answers are found they then you know report a ticket to the it Ops teams right the internal it Ops teams do the level two support and they're looking at you know launching investigations to support this further along if this doesn't work you've got your engineering teams doing the L3 L4 support and a fix is likely going to be identified with the use of various observability tools scripts are launched to you know create that new build tests regression tests integration tests are launched and finally you've got either the bug fixed or the billing question answered you your tickets are updated and SLA is met needless to say all of you can spot so many of these tasks that can be automated with the use of Agents but you know is automating each task the way we want to do this is there something that can be done differently in reimagining this workflow we are there we're trying to reimagine this workflow um here's a slightly different take let's look at um a workflow where content is getting created right it starts off with authors or content specialist perhaps identifying an alert or a trigger that's going to launch that content workflow you then have maybe approvals to say yes go ahead do your research find out what we want to write about this and then you've got you know content getting created with research being done and subsequently your editors and your um associate editors and reviewers reviewing that content if it's you know very critical content you you will probably have several rounds of this reviews and eventually the finalized content goes to the publisher which then you know the the publishing teams then launch their own formatting styling related workflows and eventually the content is published here too you will realize that so many of these St tasks can be done by AI can be done by agents and of course humans being in the approval flow but here again something seems a Miss should we stick to the same design of the workflow or should we be doing this a little differently okay so that's that's where we are we want to be able to remagine these workflows because it's a new world because we have new capabilities with these Technologies um and not just plug in capabilities into an existing business process right but we stuck we're stuck we are um missing certain parts of that reimagination so what are we missing the first thing that we missing is connectors um I I spoke to few of the um you know the Stalls uh yesterday and a common theme was how around providing a good AI agentic solution you always needed that layer which connected to your current it systems right and and connectors are a very very much a missing part of reimagining these business processes um I also want to say that you know I come from a world where um you know technology is not altogether new we've been you know we've been doing AI we've been doing NLP for several decades as Thompson Reuters and even even in the different companies that um uh that developers come from in the different meetups and communities that I attend they are also supporting it systems of some of the you know different technology companies of our world Believe It or Not 71% of Fortune 500 companies still use Mainframe 68% of the world's it production workloads still run on Mainframe right and some of your major credit card um transactions still happen on the main frame which means we are that distant right like that technology Spectrum if you were to measure from the main frame to to an agentic workflow how do we connect these worlds right and so that's one of one of our major stumbling blocks I believe where you know how do you connect the worlds of the technology um stable technology Stacks that exist with um with with with the power of AI agentic workflows the second thing is something that um I struggle with as I take new ideas to to different stakeholders and I see you know several startups whom I meet on a regular basis them struggling as well it comes back to the question of Roi it comes to the question of you know um reliability right how will I be sure that my agent will be able to perform and often with stakeholders and you know from from a business impact standpoint it's a zero or one call right am I going to continue to need to have to pay manual hours or can I consider that not needed anymore if I'm going to pay for the agent AI agent right and so reliability becomes a big factor and a stumbling block for us the third thing that we're finding missing as practitioners is to have Visionaries who are able to re-imagine this world with us it's um you know as a practitioner as somebody who's deeply entrenched in AI you can only go that far in reimagining this world you need the subject matter experts you need the Specialists from that specific domain to sort of do this together with them to be able to you know reimagine your business processes the the fourth thing that you know we need and I'm sure many of you will agree based on conversations that I've had is we need a certain level of standardization right we need to be able to say this is how agents will be built this is how they are packaged this is how they deployed it's too nent yet um you know in a in a established Tech ecosystem to say uh this we are going to replace these these bits with agents data and systems for an agent to truly you know um get its full power we need to give it access to context the context is today distributed across different it systems Business Systems it is probably um partially located in logs it is probably there in um chat messages or you know it tickets and you know different um you know different styo systems right sometimes which are spread across different parts of the organization and so bringing them together and even identifying which of these systems will have what and how do you correlate a single transaction across these systems becomes often a stumbling part of you know getting the AI in the sixth thing is one that you know I've I personally feel very strongly about is creating a collaborative ux agents are going to be assistant so what is the role of the human defining that and creating systems in which humans can support the work of the agents and vice versa is I think a very important part of um creating creating those workflows and so what makes sense from a collaborative ux is something that you know I'm waiting to hear for from any of you fresh ideas on right AI governance we we saw um in one of the talks about how different parts of different aspects of security testing go down into parts of your agentic workflows and and so you know aspects of your AI governance which we established all this while around ethics and responsibility how do you translate that into different levels of your agent architecture right the next thing is control we still want to give the human control we want to have certain steps which are deterministic and certain steps which the agent can control on its own or you know how how do you balance that um need for control uh between between the agent and the human and give the human the right um you know right time to act and finally what is the life cycle for the agent All of You Have You Know spoken about that exponential growth of um evolution in our space how do we bring the capability the latest capability into what we've already got deployed and that one that's ever changing so that that's what I had to share um we are just at the start a lot of good work from all of you I'm waiting to bridge from the world that I'm seeing around me to the world that I come from and so happy to have your questions and ideas suggestions feedback thank you [Applause] [Music] our next presenter will share strategies for turning AI agents into reliable production ready tools that deliver tangible business results please join me in welcoming to the stage the founder and CPO of arise apara dinkin [Music] [Applause] [Music] hey hey y'all how's it going all right well can you all hear me cool well thank you so much for being here I'm gonna start off by saying apologize my voice a little bit it's a little horse today but you guys are going to hang in there with me today we're going to talk about a really important topic which is um slideshow mode awesome we're going to talk about a really important topic which is about evaluating AI agents and assistance this will load just to set a little context before we we jump in a lot of you have probably heard today about different agents that are being built how to build it what are the Cool Tools out there to go build agents and today we're going to actually talk about when you put those agents into production it's important to actually know how they're doing and evaluate them it's super important to making sure that they actually work in the real world World we're probably going to get a little technical in this talk maybe a little bit more than some of the other talks but hang in there I think this is important even at the leadership level to understand how to make sure what you're putting out actually works in the real world um so a little bit about me my name is aparta one of the founders of arise uh fun update on us actually today we announced our series C Ray um so um have a lot of fun folks who are using us to evaluate agents so with that let's jump in okay well everyone here's probably talk to you about text based agents so you have this chat bot whatever it's making an action and it's it's figuring out all these things to do the cool next Frontier is actually voice AI is already taking over call centers there are over 1 billion calls made in call centers all around the world with voice assistant with with voice apis and the the realtime voice API if any of you guys have played around with it we're actually already seeing these types of um agents start to take over and revolutionized call centers this is actually a real production application of a travel agent this is the price line pennybot you can go in and actually handsfree no text book an entire vacation using Priceline Penny today so we're not just talking about tech based agents anymore we're talking about multimodal agents and it's important to to address these because the way that you evaluate these types of Agents it's not just evaluate an agent but also if it's on voice there's specific types of evaluations you're going to need to do if it's multimodal there's additional types of evaluations you need to consider so we're going to break all that down and hang in there with me for a fun one today so before I jump in and talk about how to evaluate an agent let's talk about what are the components of an agent you probably have heard different versions of this today but I'll tell you the language we're going to use one um there's something typically called a router uh which is essentially what's deciding what the next step an agent will take there's skills which is the actual logical chains that do the work and then there's something that stores the memory this is important because there might be different architectures of how you're seeing people build these agents out there doesn't matter if you're using L graph or qai or llama index workflows there's all sorts of agent Frameworks they all have slightly different ways of building an agent you might not even use a framework but what you're going to see is these common patterns of okay that's a router that's a skill and that's a memory and these different components are going to have different ways of how you actually evaluate it so let's first talk about the first one what the heck's a router so you can think about a router almost like the boss it's kind of deciding hey well it's very common to have e-commerce agents in you probably are all talking to e-commerce agents today to purchase things Amazon has one all these e-commerce companies have one when you type in a question like I want to make a return give me an idea of what to go buy are there any discounts on this that user query funnels into something called a router and that router's goal is to is really to determine do I call this skill about hitting up a customer service agent do I call this skill um to suggest all the discounts we have or suggest products the router is really kind of the boss deciding who do I tap on to go actually execute the the ask that the user made and the router might not always get it right but you want it to get it right because then it goes down the pathway of a specific skill within an agent so in this case it will call a skill um so if I asked hey uh tell me the best um I don't know leggings to go by so it'll go in it'll do a product search and then this is actually the entire skill flow of execution that the agent needs to go through to execute you know whatever the user asked for some of these might be llm calls some of these might just be API calls it just really depends on how people actually Implement them and then lastly this is an important piece is there's always something storing the memory because these are usually not just single turn conversations they're multi- turn conversations multi- turn interactions and so you don't want to be talking to an agent that forgets what you previously said so there's really memory which is storing what it previously asked for and keeping all of this in some sort of um in in some some sort of semblance of state so with that we're going to get a little fun here I'm going to show you um an actual example of what this could all look like a router skills and memory so this is an open source project um that actually looks at the inner workings of an agent these are called traces for folks who may not be familiar if you're you know in leadership or this is really what your engineers are looking at when they're actually building and troubleshooting your agent they're actually understanding what the heck went on under the scenes so this is actually an example of a code-based agent somebody asked a question like what trends do you see in my Trace latency AKA what's making my application slow this is the router call that we were talking about earlier where it actually decides well how do I then go ask how do I then go tackle that question so first you can see here there's multiple router calls there's not just one router call this is pretty common as your application grows you can have multiple times where it comes back and has to decide what do I need to go do so the first time it calls the router what it does is it actually so the router then makes a tool call um which is essentially the skill that you need the first time it actually makes a tool call to then go run a SQL query go collect all of my traces of my application and go go run a SQL query then it goes back up to the router and then it calls the second skill which is actually the data analyzer skill which takes all of the traces and the application data and then it passes it to something that actually analyzes that data so in this case you can actually see there was a router there was tool calls we actually have memory that's actually storing everything that's happening under the scenes and so really just shows all three of the different components that I actually just walked through so now that we have an example of a of an agent with a router and skills in memory let's talk about how to actually evaluate these agents every single step that I just walked through here actually is an area where the agent can go wrong for routers typically what teams end up caring about is did it call the right skill because if it didn't call the right skill you know user asks for I asked for leggings but then it sent me over to customer service or it sent me over to um you know uh something about discounts and Deals so you actually want to make sure that the router within an agent is correctly doing the right skill and calling the right skill so that's the first piece that you'll want to make sure that your teams are evaluating so if your teams are building agents when ask well hey what's the ultimate control flow what's the control flow and are do we have something like a router and are we evaluating it to make sure that it's correctly calling the right skill between ABC and is it calling the right skill with the right parameters so not just um a calling product search but actually making sure that whatever way you've designed that skill you're actually passing in the correct things like um you know I want this type of material I want this type of whatever cost range you're actually passing in all the right parameters into what the user actually is is asking for can I get a raise of hands have any of you guys heard of do any of you guys evaluate your agents today actually is that something you know your teams are doing okay awesome are any of you guys evaluating this router level internally okay awesome wow this is a great group okay this is impressive um okay let's next go to the next one which is actually evaluating a skill this is actually the part where it gets really interesting and tricky because there's many different components in a skill there might be in this case I have a rag type of skill so I want to look at things like evaluating the actual relevance of the chunks that were pulled I want to look at the actual correctness of the answer that was generated but this skill itself can have many different llm as a judge evals or can also have code based evals that you might want to run to actually evaluate the skills the skills of the agent and then lastly this is kind of a really important one that we're seeing teams probably have the most trouble evaluating which is actually the path that the agent took because well ideally you want it to converge you call the same skill hundreds of times and it always takes about five steps or six steps to actually query what the user asked for put in the right parameters call XYZ components of the skill and then ultimately um take the right you know generate the right answer but sometimes this can be a little longer we've seen sometimes where the same skill and I don't know if you all have done this experiment but you can put the same skill and build it with open the eye and you can also build it with anthropic and sometimes they have wildly different number of steps that the path actually takes and so so the goal here is how do you be succinct and how do you also make sure there's reliability in the number of steps that your agent takes to actually consistently complete a task so we call this convergence um but probably one of the hardest to actually evaluate is anyone evaluating convergence today or at least counting the number of steps awesome okay you're awesome dude cool well with that I'm going to go maybe two more minutes then I'll hop into one more demo here so if any of you guys have watched the movie her this is from her um uh this is where you know the the main character asks like who else are you talking to and you know the Samantha says something like 8,000 other people are in a conversation with me right now and so the future of voice applications is that these are probably some of the most complex type of applications that have ever been deployed ever been built it's going to require one more additional pieces to actually evaluate voice applications and the interesting part about these is that it's not just the text that needs to be evaluated or the transcript but it's also the audio chunk that needs to be evaluated um in a lot of these Voice Assistant apis you have the generated transcript that happens actually after the audio chunk is really sent and so that's a whole another dimension around is the user how what's the user sent to is the speech to text transcription actually okay is the tone consistent throughout the entire conversation and so you actually need to evaluate not just the audio piece and the flow of the conversation and everything else you're doing for all your other you know other parts of your agent but also make sure that the audio chunks are getting their own evals defined on um you know intent or speech quality or speech detects accuracy um so this is important for voice so with that um I'm going to actually show you guys how we evaluate our own agent so that you can get a little bit of a example of of what some agent in the wild actually does um this is our own agent so let me actually show you what it looks like um you can actually go in our product today and there's a little co-pilot and our co-pilot does something similar to what other co-pilots do where as people are spending time in our product we actually help them do things like hey help me debug this help me summarize this help me look at this um can I search with natural language there's kind of this co-pilot integrated throughout our entire product but we're an eval company so what do we do we actually dog food our own tool and we decide to what you're looking at here is actually the traces of our entire co-pilot actually in the wild and every single step of this co-pilot we actually run evaluations of so in this case we have an eval at the very top actually evaluating something around was the overall response that was generated this was actually a search question is the overall search question actually correct or incorrect and then we also have one around once it actually called the search router did it pick the right router and then did it pass in the correct arguments into the router and then finally ultimately did it complete the task or the skill correctly in the execution of this this entire skill and so evals aren't just at one layer of your entire Trace if you take anything away from this conversation the goal here is really how do you make sure that you have evals throughout your application so that when something goes wrong I can debug if it actually happened at the router level if it happened at the skill level or if it happened somewhere else along the flow um and I think that's it from me any [Applause] questions yeah so what teams actually end up doing is they have um call like the input into a skill so the input into I'm not sure you know if you're a building agent but like if there's some input into your skill that's typically what the input would be and then it would repeatedly call that same skill with the same input like one time two times three times all the way to 100 times and then you're tracking for every single run of that skill how many steps did it actually take so ideally you want to MIM mimic it with the same input and the same skill we do have some teams though where as part of testing they'll slightly modify the input so sometimes they'll ask the question a little bit differently a little bit more wordy see if that takes more steps um so you can do that as well but it's you know flavors of the same input is typically how we recommend testing for convergence um there's a lot I didn't cover here today which is around like for example guardrails is a big component of you know I call it slightly more proactive type of EV valves because you're you're running EV vales and then you're actually making an action or blocking what the output of the the agent could because of it um if I'm honest with you though I think that it's really mostly useful for external facing applications and highrisk environments because of course it's going to add some sort of latency to your actual call um so you want to be sure you need it and sorry I think there's a question there yeah oh yeah yeah okay good point so she uh first she asked both great questions first she asked well how do you think about um evals in the context of multimodal applications so if you have voice if you have video if you have images how do you actually think about that piece in the evaluations let me tackle that part first so we we do see that a lot I think voice is probably more common than video is is what we're seeing and to be honest image and I think there's just a lot of applications like the call center one I was telling you about I think Enterprises have just been like the ability to just talk to something is a lot easier than video and image um so we've been seeing a lot more voice assistance and then with voice the things that they end up caring about is actually um a fun story uh one of our customers the tone started out really like nice and sweet in the beginning and then it got rough near the end and it just and so that wasn't really something that they expected they wanted the tone to be consistent throughout the entire conversation so stuff like that is things that the speech is not going to detect so if you're just evaluating the transcript you're not going to detect those things that the underlying tone has changed um with image and video uh what we've typically seen people do is summarize what's in the picture or summarize what's happening in the video and then try to evaluate as a common one and then of course all the common like quality of image quality of video ends up being common things that people end up evaluating uh and then I think the second question you asked was like how how do people actually use you know I won't talk specifically about ours but just a general observability platforms so people do um plug in and connect their actual data because at the end of the day today these are probably some of the most complex software that has ever been built and they're also non-deterministic so they do want some level of visibility into what's happening the engineers use it for troubleshooting we're seeing aipm really make a rise into really understanding The Experience um so they they are um if you're concerned about data security there's a lot of ways not just our platform but other platforms out there too where you can always decide to deploy it in your own VPC um so that's kind of how they address the security yeah go ahead thanks y yeah yeah um I'm so excited for the call center transformation that's about to happen right now um because I think we're going to see a lot more of what you're saying so one thing we have been seeing teams do is I I mean it just comes down to what can you fit in the context window and the context Windows aren't big to handle a lot of back and forth especially of the voice conversations with both the audio file as well as the transcription what we see some teams do is that uh they I think this is B pretty common of like summarizing the previous historical State the memory this is where it that starts to get really important they decide to then uh summarize the audio chunk itself so that you're not obviously passing that back and forth over the wire um and so typically how do you keep trying to cond condense the historical State um and then of course you're kind of hoping that there's some recency so things that were asked more recently are more important than things that were like earlier in the conversation and so that's how we've seen a lot of people manage State overall um and and just growing size of these conversations but great question yeah thanks everyone I think that's time [Applause] [Music] bye would you like more time to go to more epic conferences like this one our next presenter is the director of engineering AI at data dog where they've built AI agents that are always ready to handle issues on your behalf please join me in welcoming to the stage Diamond [Music] [Applause] Bishop hey I'm Diamond I hope everyone's you know feeling the AGI today um I'll be sharing our AI agents at data dog and what we've learned building the devops engineer who never sleeps I came all the way from the New York Times building uh right across here uh to see all of you so I hope it's worth it if this works uh okay here we go hey little about me um I've been working for my entire career about 15 years or so in AI trying to build more AI friends and co-workers um I wouldn't read too much into that I have human ones too um I promise um throughout the AI Winters and lws of the last 15 years or so I've managed to keep doing just that at Microsoft Cortana um building out Alexa at Amazon working on pytorch at meta and building my own AI startup that was working on a devop assistant now at data dog we're building out bits AI which is the AI assistant who's there to help all of you with your devops problems so today I little I'll talk a little about that talk a little about the history of AI at data dog a little bit about how we think about AI agents today and where we think things are going for the future day dog is the observability and security platform for cloud applications there's a lot that we do um but it kind of all boils down to being able to observe what's happening in your system and take action on that make it easier to understand make it easier for us to uh simply understand and build out things to have a safer and more devops friendly system we've been shipping AI for quite a while actually um it's not always inyour face it's not always out there saying here's a big AI product but things like proactive alerting really understanding things like root cause analysis impact analysis and change tracking and much more has been happening since 2015 or so but things are changing this is a clear era shift I think of this kind of similar terms to the microprocessor or the shift to SAS um bigger smarter models reasoning and multimodal coming uh Foundation model Wars happening this General shift where intelligence becomes too Shi too cheap to meter and what this means is products like cursor are growing you know terribly fast um and really people are expecting more and more from AI every day um with these advancements at data dog we're really trying to rise to meet the shift as well the future is uncertain this kind of ambiguity creates opportunity but there's a lot of potential for us that's kind of the dawning of this intelligence age we're working to move up the stack to leverage these advancements and give even more to our customers by making it so that you don't use data dog as just the devops platform but also as AI agents that use that platform for you this requires work in a few key areas that I'll talk about developing the actual agents doing eval you just heard a lot about eval we' think about that every day for better or worse um and building new types of observability there's a few agents that we're working on right now in private beta the first is the AI software engineer this kind of looks at problems for you looks at errors tries to recommend code uh that we can generate to help you improve your system the second is the AI on call engineer this wakes up for you in the middle of the night does your work hopefully makes it so you have to get page less frequently and then we have a lot more on the way so I'm going to talk a little bit about the AI on call engineer first this is the one that you know everyone wants to save them from that 2 am. alert you don't want to have to wake up in the middle of the night go and look through your runbook go and figure out what's going on if you can help it our on call engineer is there to really make it so you can keep sleeping this agent proactively kicks off when an alert occurs and works to First situationally Orient read things like your run books grab context of the alert and then goes and you know figures out the kind of common stuff that each of you would do on data dog already look through logs look through metrics look through traces and kind of act in this Loop to figure out what's going on the en call agent's great for both automatically running investigations for me but also you know being able to look through and find summaries and find information for me before I even get to my computer so if I want to get insights into why an alert just occurred or figure out why a trace might uh be showing an error this agent can jump ahead pull information for me and show it to me we also have added a new page that makes it easy so that you can have human AI collaboration this is still something I'm thinking about a lot is like what what kind of collaboration do we expect we want our agents to act as humans but we also need to be able to verify what they did and be able to kind of look over what they're doing and really learn from it it also helps you to kind of earn trust along the way I can see the reason why uh this hypothesis for example was generated I can see what the agent found and I can make decisions about whether or not I agree along the way it also tells you things like what steps did it actually take out of your runbook and kind of like a junior engineer who does this work I can go ask follow-up questions find out why I did a certain thing a little more insight into how we're making this happen much like a human s or devops engineer our agent Works to put together hypothesis on what might be happening and reason over them coming up with ways to test them use tools in the tool forer sense to try out ideas run queries against logs metrics Etc and work to validate or invalidate each hypothesis in the case that it does find a solid root cause our agent cause uh can suggest remediations along the way again just like a human might might say hey we should page in that other team that's involved here or it might offer to scale up or down your infrastructure over time we plan to add more built-in actions and eventually discover new types of workflows based on what your team has done but if you already have certain workflows that youve set up in data dog um we can tie directly into them and make it so that our agent can understand those workflows and how they might map to helping you remediate a problem and if it's a real incident the enal engineer is not usually done once an issue is remediated you usually go and write a postmortem you go try to learn from it you share it with your team our agent can do the same write out your postmortem for you look at what occurred during the entire time what it did what humans did and put that together so that you have something ready in the morning so that was the on call engineer um that's the one that is you know trying to help you in the middle of the night trying to help you every time alerts come on um we also have this AI software engineer I think of this as the proactive developer the devops or software engineering agent who observes and acts on things like errors coming through this is kind of the error tracking assistant it automatically analyzes these errors identifies causes and proposes Solutions those Solutions can include generating a code fix and working to reduce the number of on call incidents you have in the first first place so they can work in coner to make a better system over time in this case the assistant has caught a recursion issue proposes a fix and even creates a recursion test so that we can catch it if it happens again in the future we have the option to create a PR in GitHub or open the diff in vs code for editing this workflow significantly reduces the time spent by an engineer manually writing and testing code and greatly reduces human time spent overall so what have we learned building out these agents and some of the new ones that we're working on today well we've learned quite a lot um there's a lot of things that we started with that we kind of went back and and redid um but a few areas I'll touch on that I hope help you as you develop your own first is scoping tasks for evaluation it's very easy you know to build out demos quickly much harder sometimes to scope and eval what's occurring second is building the right team who's ready to move fast and deal with the ambiguity that comes with these kind of problems third is that you know the ux of is changing um that's something that everyone needs to be comfortable with and fourth is observability matters you know uh I'm surprising for data do to say that I'm sure but observability is terribly important even in this new era so scoping the problems scoping the work to be done I like to think about this as defining jobs to be done and really kind of trying to clearly understand step by step what you'd like to do think about it from the human angle first and think about how another human might go and evaluate it um this is why we build out vertical task specific agents rather than building out generalized agents we also want where possible this to be measurable and verifiable and at each step this has honestly been one of the biggest pain points for us and I think this is true for many people working in agents where you can quickly build out a demo you can quickly build something that looks like it works but then it's very hard to actually verify that over time and improve it um use your domain experts but use them more like design Partners or task verifiers don't use them as the people who will go and kind of write the code rules for it because there is a big difference in how these kind of stochastic models work versus how experts work you know everyone kind of knows gnome and his uh anti- NLP um rants but that kind of stuff happens pretty frequently at domain experts eval eval eval I can't stress this enough um start by thinking deeply about your eval the number of mistakes we made by not thinking about eval first is uh frustrating and something that I think everyone should think about it's very easy to build these demos as I said um but everything in this fuzzy stochastic World requires good ev even something small to start this means offline online and kind of living evl have endtoend uh uh uh tasks have endtoend measurements uh make it so you also instrument appropriately the way to know if humans are using your product right and giving you feedback and then make this a living breathing test set building the team um you don't have to have a bunch of ml experts there aren't that many to go around right now um what you really need is you want to seat it with one or two and then have a bunch of optimistic generalists who are very good at writing code and very willing to try things out fast um I'll also note that ux and front end matters more than I'd like as a backend engineer myself um but it's terribly important as you collaborate with these with these agents and the assistance um and then you want teammates and people who are excited to be AI augmented themselves this is day-to-day AI use this is Explorer types who want to learn this is field that's changing fast um and if you don't have people like that you're going to kind of get stuck you want folks to kind of you know yeah you're in for the vast and endless AI capabilities right um it's a big world out there and there's a lot going on Ye Old ux um this is one of those things that I still you know we think about we go back and forth every day um it's an area that I didn't realize was quite so important initially when I started working in this field um despite my engineering sensibilities and lack of ux it's terribly important um this is such an early space of work this is kind of one of the more important things here as you collaborate and work together but the old ux patterns are changing be comfortable with that um and so far I'm partial to agents that work more and more like human teammates instead of building out a bunch of new pages or buttons so who watches the Watchman right um You have these agents running around um observability is actually really important and don't make it an afterthought um these are complex workflows you really need situational awareness to debug problems and this has saved us time a lot as we start to work with um a new view that we're calling LM observability in the data dog product um data dog in general has a full observability stack as many of you know we can look at gpus um we can look at LM monitoring we can look at really your system end to end but tying in the llm observability has been very helpful because you have a wide variety of interactions and calls out to models you're hosting models you're running maybe models you're using through an API and we can make them all uh kind of group together in the same paint of glass so you can look at them and debug what's occurring I will note though that this can get messy fast with agents our agent for example has very complex multi-step calls you're not going to look at this and figure out what's going on right away Um this can be hundreds of calls this can be uh you know uh tons of different places where it's making decisions about tools looping time and time again and if you just look through a full list of these things you'll never really figure out what's going on so here's a Qui you know sneak peek into a more agent view of what's occurring inside of our observability tools this is our agent graph um really what this means is that I can kind of look at it just like our agent did and looking at workflows that are occurring you can see in this even though it's a big graph uh there's a bright red node here if we zoom into that we can actually see where errors were occurring this is very human readable something that makes it super easy to figure out what's going on when your complex workflow is running as an aside though I do also want to note what I think of as kind of like the agent or application layer bitter lesson uh General methods that can leverage new off-the-shelf models are ultimately the most effective um by a large margin um I hate to say it but like you sit there you fine-tune you do all this work on like this specific uh you know project a specific task and then all of a sudden you know open AI or someone comes out with a new model and it handles all this you know kind of quickly a lot of the reasoning is solved for you um we're not quite there where it handles all of it very quickly but you should be at a point where you can EAS try out any of these models um and don't feel stuck to a particular model that you're you've been working on for a while you know Rising tide lifts all boats here um I also think a lot about not just building agents but what it might mean for other agents to be users of data dog and other SAS products um there's a good chance that agents surpass humans as users in the next five years um I'm probably somewhere in the middle on my estimate there you know there people who will tell you that'll happen in the next year there'll people who will tell you you it'll happen in 10 years I think we're somewhere around the fiveyear Mark um but this means that you shouldn't just be building for humans or building your own agents you should really think about agents that might use your product as well an example of this is like third party agents like Claude might use you know data dog directly I set this up with mCP relatively quickly um but any type of agent that might be coming in and using your platform you should think of the context you want to provide them the information you want to provide about your apis that agents would use more than humans so looking ahead um the future is going to be weird it'll be fun uh and AI accelerating is accelerating each and every day I strongly believe that we'll be able to offer a team of Dev secop agents for hire to each of you soon you don't have to go and use our platform directly and integrate directly ideally our agents will do that for you and our agents will handle your on call and everything like that for you um I also do think that AI agents will be customers many of you building out SRE agents and other types of Agents coding agents should use our platform should use our tools um just like a human would and uh we can't wait to see that and generally I think that small companies out there are going to be building built by someone who can use automated developers like cursor or Devon to get their ideas out into the real world and then agents like ours to handle operations and Security in a way that lets you know an order of magnitude more ideas make it out into the real world thank you so much um please reach out if you're building any agents that want to use us um or if you'd like to check out our agents as well um there's a lot to build here and if you want to work in the space we are hiring more AI engineers and people who are just excited about it but thank you very [Applause] [Music] much our next presentation is about building self-managed AI networks please join me in welcoming to the stage technical lead for Arista networks Paul [Music] Gilbert ah oops how do I go back there you go yep so my name is p Gil I'm a tech lead for Arista networks I have an accident but I'm actually based here in New York City and I build or design or help build and design uh Enterprise networks uh but what we do is a Plumb in uh so I'm not going to talk about agents but more kind of how you train uh models what the infrastructure looks like and how you do inflence in on on the infrastructure uh I I I normally teach people uh the very basic stuff so I I you guys probably know this already but these are new terms for us when when we built computer networks people will come to us and say job completion time uh barrier I'm I'm pretty sure you guys know that the the inference and the question I get all the time is you know we we can build a network to train a model there's a an algorithm maybe you can use to to look at what you what you need uh but then you know what's inference and you know it's changed a lot now because of chain of fa and reasoning models the inference is a lot different it used to be X but now it's y uh I'm pretty sure you guys have seen this slide but I use these just to talk to Enterprises around kind of what they might be thinking in GPU size uh this uh wed Sosa came up Dr wed Sosa came up with this on the left there is the training and on the right there is the inference and it's kind of you know on one you have times 18 times the other times two again I think that changes now with Chain of Thought and reasoning not too sure kind of which way it's going to go and at the bottom there was a really interesting one again which I showed customers cuz I most of the Enterprises I talk to kind of don't understand models and how they work and training I know little but not a lot but you know the the the the the model they trained down here was 248 gpus for one to two months and then when you go to inference after fine Jun in alignment it's four h100s for inference so we talk to people about building different types of networks which I'll speak about but uh you know kind of I always start at the beginning and you know this is I got this slide and I think it's really interesting you know llms were kind of just tiny little bit of inference but now with the the next generation models it's a lot so this is what we build or I I build uh and these are new terminologies for us from the networking world uh backend Network so this is where you connect gpus to when we build these networks uh they're completely isolated because gpus are really really expensive they take a lot of power uh and they're really hard to get old of so when people build AI networks in the Enterprise uh we don't connect nothing else to these networks the the bottom part of that the back end Network there's eight gpus per pool on those servers and they can be Nvidia they can be super micro they can be be whatever they will go into a high-speed switch at the bottom there uh you have a leaf switch and a spine switch and nothing else attaches to that Network and then on the front end network is where you get sto storage from the Train the model obviously you know it the gpus synchronize they do something they calculate they produce an algorithm and they call for more dat and that's kind of the cycle the front end network is not as intense as the back end the backend Network depending on the model that you train uh they the gpus will actually work at 400 gabt and for for us in the Enterprise you know and I've built some big big data centers but I've never seen anything like that so this is in the networking world this is a completely new world to us and we we make the networks as simple as possible because again these are really expensive and people want to get their money's worth they want these running 24 by7 uh so we do IB ibgp or ebgp just a really simple protocols uh I'm sure most of you have seen this but again I I kind of teach this I uh the the this was an infrastructure presentation but you know that's kind of the back of a h100 probably the most popular it is actually the most popular uh uh AI server out there right now you can see in the middle there there's four ports but those four ports are broken out into two so there's eight ports those are the GPU ports there and then kind of over to the left there there's the ethernet ports so that's what we connect to we've never seen anything like this before you know when you first speak to people about this you know I've seen servers with 400 gig you know and I do a lot of the big Financial networks but never before have we seen servers that can put this type of traffic onto a network uh you we always they always ask me about this and you know I got this from an Nvidia slide it's stand there but you know there's this thing called scale up and scale out I'm not really sure scale up you know when when you buy when my customers buy these servers that always have hgp using it you can't add anything to an Nvidia server you get the djx the if you go with the Outsource model it's a hgx so it's a third party you don't really add things to it but so I don't see scale up but scale out you know obviously you can build we we build a network so that you can add more gpus we can start very small and we can go up to you know hundreds of thousands of gpus not in the Enterprise but the cloud scale guys do so so what's different uh you know for us again it's there's it's hardware and software the hardware are those gpus we're not used to them uh the first time I tried to configure one it took me hours and hours but I'd never seen them before whereas other stuff I've seen pretty quickly you know you have software so Cuda and nickel are probably you know two of the biggest protocols and you you guys know more about that than me but we had to kind of understand not Cuda but nickel because it has a collective so we had to understand kind of how the collective works because that will put uh traffic onto the network in a certain way uh the hardware was completely different again you know we had the eight uh 400 gig PS and the four 400 gig PS facing the the the front end Network totally new to us the other thing was Data Center applications kind of web app database they're really easy uh they go from one to the other and in different parts of the network and if one fails you you have some kind of low balance in or fa and it fails over uh this is AI networks not like that the gpus all speak they'll talk to each other they'll get stuff they'll send stuff and if one fails the the job might fail it might recover but it's a different concept to us so it's it's hard to imagine uh and traffic is bursty because all of these gpus if you have th000 gpus of 400 gig they will all burst at the same time and if you if they can they will burst up 400 gig so there a lot of traffic on a network and I've never seen anything like that so when we build these networks we don't build them over subscribe we build them one to one uh might in the data center world we used to do 1 to 10 it went down to probably 1 to three but never one to to one because it's just really expensive to to to build that kind of bandwidth but with AI networks we need to so we have no over subscription in the network and from from our point of view if you look at what one of these servers can put on the network you know just a h100 is 8 400 gig gpus and 4 400 gig is 4.8 terabyt which is and that's just one server the storage size the the the front end probably nowhere near that but the back end is always wire rate and then 800 gig is probably you know the bees are just around the corner I think in March they'll be released I think there's some people to have them and those are 800 gig we support 800 gig today on the network but each one of those servers in there's a possible 9.6 terabytes per server and you most people in my world in the in the Enterprise world come from servers at maybe 1 two three or four 100 Gig ethernet but nothing like uh 9.6 terab whes per server so the other problem we have is the traffic patterns when we low balance from kind of leaf to spine we use a thing called entropy which is the five tle IP address port and Mac address and we do pretty good low balancing but with gpus it's just one IP address and it can sometimes match to a single Uplink and over subscribe it which would be really bad because you'll start dropping an awful lot of packets so we have to take a lot of Care on how we low balance within the AI Network or how we build the back end and the front end so we have some pretty cool tools where we don't now look at the five topples we actually low balance on the percent of bandwidth that's being used on the Uplink and we can get up to about 93% utilization on all the uplinks to the down links which is pretty good uh you know and again one thing that's really new to us is a single GPU can you or a set of gpus if they fail uh sometimes the model will stop I know checkpoints but uh a single fit GPU fail is a problem for us and if one of the big problems that's we've always add is Optics and transceivers and Doms which is the rates and the loss between them and the cables Etc and when you start building these networks with thousands of gpus you will have a lot of cable problems and you will have a lot of GPU problems so it's it's really hard for us because we again this world is is new to us the last year or so uh Power I you know power is you could you know you've read the newspapers you know everyone's trying to buy buy nuclear power stations to power these things the the average rack in the data center today is about 7kw to 15kw and you can put like you know 10 one one IU rexs into those and you'll be fine and when was come to me say yeah we finally got gpus whatever and I say to them you know what kind of racks have you got and they said well we're going to put them in and then you could only put one of these servers in one of those racks because they they actually draw with 8 gpus 10.2 KW so you need new racks uh most Enterprises now waking up to this and they're building racks between 100 and 200 KW and they're water called there's no way you could air call them in a data center so that's a whole new concept to people as well is water called rxs uh traffic is is both ways which again is new to us so north south you know in a regular data center you have users coming in database app web whatever and it comes in it goes out but in the AI world when the gpus speak that traffic is east west because it's speaking amongst each other uh and then when they ask for more data from the storage Network it's north south so you have both traffic patterns the East West is really bad that's kind of where they run wi rate the front end to the storage is much more more calmer because most storage vendors can't put that kind of traffic on on the on the network right now I'm pretty sure they will one day but they're more around 100 200 gig uh and you know in a network there's a certain amount of buffering on these switches and buffering is bad because it means it can't send traffic somewhere because something else is is not receiving the traffic trffic so you need a congestion control and feedback uh and right now we use something called Rocky V2 which is two parts of Rocky V2 there's a PFC and an ecn uh if you were building an AI Network your engineers your network Engineers will definitely know about this ecn is an endtoend flow control where if traffic if traffic is if there's congestion somewhere in network packets are marks they go to the receiver the receiver sends back to the sender you need to slow down because there's congestion and it goes for an algorithm it pauses for a while it slows down and if it doesn't get any more ECM packets it speeds up again uh and PFC is basically stop uh my buffers are full I can't take anymore so it kind of is a dead stop so you have kind of a a slow uh feedback mechanism with ecn and the kind of an emergency stop with PFC the networks we build are really simple we don't uh have things like in regular data centers we have DMZ with firewalls low balances Etc we have connections to the internet we have L4 through 7 service whole bunch of stuff when we build these networks they're totally isolated uh the GPU the the back end is completely isolated the front end possibly could have connections to something but even then it's so expensive to build you you don't want to take the chance uh on demand the applic ations that we're used to you know if it fouls or something fouls something will recover and you know you you may get a a little skip or a jump but if you've done the right thing it's not going to be that bad in this world if something fails the model May foul and the call that you get into your operation Center is different cool than you get that if you're you're at kind of restarted and everything's good again uh the other thing is collectives you know obviously nickel will go out there and work out kind of where to GPU gpus are and what to do with it but there's kind of different design so I tell my customers that speak to your data scientist and your your uh your programmers developers and find out kind of what they're doing and what kind of models they're building because it can can affect the network on kind of how you build it and how you design it so networks totally isolated things are moving fast we're at 800 gig right now uh which is you know we have been for probably a year we will see 1.6 terabytes on the network probably end of this year uh early 20127 and it will just keep grind and grind and grind and these models will get bigger and bigger and consume more and more and more I'm pretty sure uh visibility and Telemetry I you know all my customers the core that they get when a model fails because the network is the problem is a different core than they used to so we put different uh to lry and visibility in there to make sure that if things are going wrong on the network that you know they know about it hopefully before they get that call so yeah I work for Arista our operating system is called EOS and we have a whole bunch of features there so if you were building an AI Network I'm not sure that you guys speak to the engineers but this is the type of things we talk about uh lossless ethernet everyone thinks that you you know when you train them model you can't drop packets I I've seen it and you can I think drop packets are okay consistent latency is okay but if you drop so many packets obviously it's a problem so flow control losses e foret is really key ecn and PFC are part of that as I said before they're flow control mechanisms one is a slow down please and the other one is a stop and as you know because G gpus are synchronized if something slows down you slow down one port one GPU everything slows down so you really got to be on top of kind of the over subscription and if you are getting Qin where is it uh we we have really good buffer uh we can adjust buffers we have different kinds of switches for different places in the network but we found that models send and receive a particular size packet and what we do is we adjust those buffers to accept those types of packets buffering is a really expensive commodity and switches in networking and if you can find a way to allocate the buffers exactly tuned to the packet sizes it's a win-win uh and we we've worked out how to do that which is good uh yeah monitoring is really key for us uh I tell my customers there's probably five things you want to do one of them is RDMA you know uh these networks train using RDMA you know uh which is memory to memory rights rather than going CPU to memory and RDMA is a a complex protocol and it has 10 or 10 or 12 maybe more kind of error codes so if the network starts seeing problems uh and starts dropping packets rather than just drop the packet on the floor we can actually copy that packet to a buffer or send it somewhere or just just the headers and and why we drop that packet and if you think about it's really cool like most networks will in congestion your buffer fills up is you're going to drop the packets we'll drop the packet but we'll actually take snapshot of the packet and the headers any rme information in it and we'll tell you why we dropped it uh another thing we have which is really good we have an AI agent uh you know from the networking point of view we can look at what's going on but we don't really have any visibility into the GPU so now we have an agent which is an API uh and some code that we load on the the gpus in Nvidia and they will speak to the switch so that agent will say to the switch how are you configured so PFC and ecn those slow control mechanisms have to be configured correctly because if they're not it will be a disaster so the the GP will speak to the switch and say this is how I'm configured the switch will say yeah you're good we understand each other and the second thing it does it gives you a whole bunch of statistics about packets received packets sent RDMA errors RDMA issues in there so you can can correlate now if the problem is a GPU or if it's the network which is a huge step forward for for us uh another really cool feature we have is uh smart system upgrade you know if you used the routers and switches you know you have to upgrade the software sometimes uh sometimes to get new features sometimes to fix pts which are security vulnerabilities on that switch uh we've worked out a way now that we can do that you can upgrade code without actually taking the switch offline so you know if you have 1,24 gpus with 64 switches in your network you actually can upgrade those and the gpus can keep working so it's a real big step forward for us so you for us again I I don't know but no over subscription on the back end you can't because the gpus use everything you give them address wise is really important for us it's a it's a Point to-point connection so it's sl30 31s you could use IPv6 if you have ipv4 problems address space problems all my customers I told bgp because it's the best protocol out there it's really simple and it's really quick uh evpn VXL if you have M tency if you have a lot of different business units lines of business uh using the network you need things like Advanced low balancing we have a couple of different we we actually look at the collective that you're running low balance on that Collective Now which we call cluster low balancing you could deploy It Rocky I tell all my customers do it because if you don't your network going to melt down you're not going to know why these things will give you an early warning system that you need to do something with your network so they're really key to have and visibility in Telemetry is is really good at all times because in the network knock or the Operation Center you always want to be aware of the problem before you get the call from the developers and the people that have paid a lot of money for that Network I I'm running out of time here but this is kind of 1,400 gig cluster what it would look like spine and leaf uh again no over subscription 800 gig links between the leaf and spine 400 gig down to the gpus this is a 4,000 cluster these ones these are the bigger boxes these are 16 slot one of these boxes can take 576 800 gig gpus so 1152 400 gig gpus so if you're building clusters with thousands of gpus then this would be the box for you the 7800 series and and putting it together this is kind of what we would build there's three networks here there's a backend Network where your gpus live there's a front end Network where the storage live and then there's the the the inference that you take the model you put it somewhere else I I'm out of time I do not come in so I'm guess I'm not so so so the other thing is you know there's Ultra eanet Consortium you I don't know if it's interest you ethernet hasn't changed the way it's built probably 30 years uh there's some things it could do better around congestion control around packet spraying around the Nicks talking to each other so there's this thing called Ultra eite Consortium version 10 will be ratified probably q1 2025 and it's a kind of different way of building networks and you probably won't see them until Q3 Q4 but most of the cloud scale guys were kind of really keen on this because it it puts a lot more into the ncks and takes a lot more out out of the network so we just get we can do what we're good at which is forward in packets so summary you know for us we have the front end which is storage the back end which is the really important part for us uh that part is really bursty uh the gpus are all synced so they send and receive at the same time and if you have a slow GPU that's a barrier because it stops everyone else job completion time is what matters to us if you know we get the call that you know my job completion time was 1 hour yesterday it's 4 days today you know it's probably our problem uh you know models can checkpoint but they're really expensive you guys know that and I'm done and they're still not coming yeah uh anyone got any question I take a question if you want if they're not [Music] [Applause] [Music] coming our final presentation for this block is anthropic for VPS of AI please join me in welcoming to the stage Alexander bricken member of technical staff at anthropic and Joe Bailey GTM Enterprise at [Music] anthropic here we are brighter than I expected good to see you all today I'm Alexander bricken I'm on the applied AI team at anthropic so I work very closely with customers to do technical implementation work and I also bring that advice back to product research and model research um I'm going to pass it over to Joe hey everyone it's great to be here my name is Joe Bailey I work on the go to market team in anthropic I joined anthropic over a year ago now so I've seen our models evolve from a 2.1 to today's capabilities and I think day-to-day what's really exciting is we're working with AI leaders who are solving real business problems um that just seemed impossible a year ago so really excited about how quickly everything is moving okay for today we will do um a quick overview uh you know who we are our mission and then we'll focus a lot on implementing Ai and best practices and common mistakes uh Alex and I actually didn't just take this from our own experience but we uh talked to a number of our colleagues so this is all based on hundreds and hundreds of customer interactions um so we hope there's some actionable insights to take out of this awesome so what is anthropic so we are an AI safety and research company building the world's best and safest uh large language models we were founded a few years ago by some of the leading experts in Ai and since our Inception we've not only released uh multiple iterations of our Frontier models we've done so while being at the bleeding edge of safety techniques of research and policy I'm going to pass it over to Alex to talk a little B about our Marquee model awesome and so some of you are probably familiar but the most recent model we launched was Sonet 3.5 new in late October of last year um you might be familiar with it because if you're a developer uh Sonet is actually one of the leading models in the code space so if you're familiar with evaluations like sbench which is an agentic coding eval uh Sona is still at the top of the leaderboard for that um I won't go too much into the details on the eval side so let's keep moving um so yeah in addition to what Joe mentioned we have a lot of different research directions that we're focused on um and these are really distributed but have overlap between you know model capabilities product research and AI safety the one that differentiates us I would say is the interpretability and this realistically is reverse engineering the models and trying to figure out actually how they're thinking and maybe why they're thinking and then an additional capability in terms of steering them in the right direction depending on a use case so let's dive into that a little bit more we're still very early in interpretability research it's worth mentioning as you can see there's kind of like a longer timeline and we're really only at the the first half of that maybe even the first 25% um but we're we're really approaching it in these stages that build upon each other so these include things like understanding so grasping AI decision-making detection so actually being able to understand specific behaviors and put labels on those steering so influencing the AI input in some some way shape or form and I'll get to an example of that in a second and then finally explainability and that's really where you unlock business value associated with interpretability methods and so while we see interpretability in the long term providing you know a lot of significant improvements in AI safety reliability and usability specifically our interpretability team uses methods to understand feature activations at the model level and then has published research on these uh in towards Mon semanticity and scaling mon semanticity which are two papers I highly recommend um and then as the technology improves into kind of detection landscapes for example you can imagine having a much better grasp at uh the actual thinking and behavior of the model or even discovering sleeper agents for safety reasons that might be very deep within uh model capabilities so a good example of that is imagine you ask the model what were the scores of the NBA matches today right and let's say it knows the answer and it says oh Steph Curry you know scored 30 points this would lead to a feature activation of for example feature number 304 famous NBA players realistically that's a group of neurons activating in recognizable pattern that we've identified across all mentions of famous basketball players when model is answering a question not just Steph Curry um and you also might have heard of Golden Gate Claude that was an example of us steering the model uh basically amping up the activation in the Golden Gate Direction and thus whenever you'd ask a question like what should I paint my bedroom Claude would respond oh you should paint it red like the Golden Gate Bridge and maybe it should have some like you know pillars in it or something I'm going to pass over to Joe to talk a little bit about some of the customers we work with yeah so I'm going to frame this in two ways one is uh sort of early on discussions and the other would be just examples of customers that are doing really cool things so in conversations there's obviously a lot of noise and Buzz and everything and that's fantastic but we often encourage our customers uh to sort of get back to the basics and how can they use AI to solve the core problem that your product is trying trying to solve we also get to work with a ton of uh of AI native or AI startups and this is how they're thinking about their product and I think you want to move Beyond uh chat Bots and summarization these can be great options but I'd be thinking more like where do you want to place bigger bets and to give an example um if you just click one more time fancy slide uh imagine you're an onboarding and upskilling platform the problem that you solve for customers is you help them get ramp really quickly and then you help them get to the next phase of their career by equipping them with skills so for instance you might be public speaking you want to get good at or you might want to become a manager and so it would be easy to say okay let's summarize course content or let's um uh let's have a Q&A chat bot that answers questions along the way and they could be helpful but I would actually think about it differently so what about if you could hyper personalize uh course content based on each indiv individual employees context or if someone is like breezing through all the course cont content could you adapt it dynamically to make it more challenging uh so they're actually getting more value out of it and the last one that I particularly like would be what if you could uh uh dynamically update uh course material based on people learning about the customer so if someone was a visual learner great let's make visual content for them and having the AI having the ml sorry the the large language model just do that automatically and you have to think does that solve the problem more than summarization or or a Q&A chat B um so really good food for thought and to sort of talk about some of the customers uh that we see achieving really industry leading results by combining uh their own domain expertise and our um our model so I won't read off each but just a couple of call outs one is uh AI impacting different Industries we have uh taxes we have legal we have uh project management they're using AI to uh drastically enhance their customer experience they make it more like easier to use they make it more trustworthy um and so it's really improving the experience versus just being like a nice to have and then they're achieving a real a real high quality um of output right you can't be giving you can't be hallucinating when you're doing your taxes uh it could just you know that could lead to all sorts of things so we're thrilled that they're seeing these these sort of like business critical workflows powered by AI driving really positive outcomes for them and also their customers awesome I can do this one yeah so um getting started I just quickly uh there's two two key points here so if you go to the next slide um what are our products we have our API we have Claude for work our API uh is for businesses that want to embed AI in their product and services and then clae for work empowers your entire organization to take advantage of AI and the day-to-day work we also have uh next one we also have a partnership with AWS and gcp and you can kind of get the Best of Both Worlds here you can access our Frontier models on Bedrock or in vertex you can deploy these applications in your existing environment um and you but you so you don't have to manage any new infrastructure so it really sort of breaks down like any barriers to entry so you're getting the best of both World here we talk a little bit about support throughout this talk it to us it doesn't matter if you're accessing us through a third party or our first party so I just wanted to call that out awesome so now that we've talked a little bit about some of the customers how do we actually set customers up for Success when working with them out anthropic so just a Preface on kind of what my team does as I mentioned it's at this intersection of product research uh customer facing interaction and then also just actual research within the org um and we support the technical aspects of the use cases so helping to design architectures evals tweak Cloud prompts to get the best out of our models Etc and then we also bring whatever we see back into anthropic and we try to build some the best products we can for our customers so some examples of projects we worked on or things that we published include the building effective uh research uh paper that uh my colleague Barry published he's going to be speaking tomorrow and then as well as that we launched model context protocol which is a open source protocol for language models to interact with data sources and uh Mah is going to be leading a workshop on that uh on Saturday I believe so anthropic as a whole we try to effectively support our custom customers but where we really start to embed at least my team in particular is um we work closely with customers that are using clot a lot and they're facing really Niche challenges in specific use case domains and they need support from our team to try to apply some of the newest kind of latest and greatest research or get the most out of the models from a prompting standpoint Etc and so this approach is pretty additive we often kick off a a Sprint once the customer is facing those tricky challenges that could be L llm Ops architectures or evals we help them to find certain metrics that they deem to be important when they're evaluating the model against the use case and then finally um we help them deploy that kind of iterative loop the result of that into an AB test environment um and then hopefully into production and so a part of that is the importance of evals and I'll get onto that in a second um but I'm going to pass it over first to Joe to talk about some stuff that we did for intercom yeah so sort of I think this is a good segue in what Alex was describing so for those of you don't who don't know intercom is an AI customer service platform they have an AI agent called Finn by many measures it's the best in the market and it's a pretty competitive market so they had their product uh out for I think about a year or so and when we spoke to them they shared where they wanted to go where they saw the futures of like customer support and agents and based on some of the capabilities of our model we felt that we could have a pretty good impact on these metrics and so what we started with was the the applied AI uh lead met with their data science team and we ran a quick twoe Sprint we took their hardest prompt for Finn and we compared it uh against a prompt that we helped them sort of figure out with uh with Claude and they saw really good results after the first two weeks so much so we went on this sort of Sprint of about two months where we were basically um fine-tuning and optim optimizing all of their uh prompts to get the best performance out of Claude at the end of this they're ble to look at all their benchmarks and see that anthropic was outperforming the current llm it's also worth noting that they do a resolution based pricing model so there's an incentive for everyone for the model to be really helpful and help customers solve problems and not be like a deflection machine where it's like you know we're probably all experienced them before and so at the end of these two months they decided to move forward with anthropic they launched it you can read about it it's called fin 2 and I think just some of the metrics are really like mind-blowing like can solve up to 86% of customer support volume 51% out the Box our support team thought about lots of different options and they actually adopted Finn as well and they saw very similar resolution rates but also making it more human so they I think with our model we can there's much more of a human element to it so they could do like uh adjustment of the tone uh answer length and then was also really good at doing policy awareness so like refund policy for instance so unlocking some new capabilities and we're thrilled to be partnering with them as they sort of I think March forward as a leader in in this space yeah on a kind of separate note one of the things I've seen recently is uh Claude on Twitter acting as some sort of therapist for a lot of people and I always find that an entertaining example of like its character being expressed yeah um cool so let's get on to some best practices and mistakes that we see in the field uh on the goto Market team so firstly testing and evaluation I'm sure this those two words have been mentioned a lot today and probably tomorrow too um there's some typical common common mistakes that we see uh customers strugg with so the first one is they build a really robust workflow they spent like a bunch of time building some architecture out and then they're like okay now we need to evaluate it like let's build some evals that's not really how it should work in practice because your evales are actually the thing that directs you towards a perfect outcome right you can't build a whole workflow without evals probably from the get-go or very shortly after and so you know sometimes customers as a result of struggling with data problems might not be able to design their evals you could use Claude to clean that up do data reconciliation um or they're just you know trusting The Vibes too much maybe they run a couple queries they're like hey it looks good right are they really testing that on a representative sample though like do you have enough samples to say that the thing that you're looking at is statistically significant and like or are you going to you know run a hundred things when it actually goes into prod and then there's going to be like loads of outliar uh because you didn't actually predict correctly what that the customer is going to ask of the model for example so I challenge you to think about your use cases as this sort of latent space right let's take this kind of chart here on the leftand side of the slide right as you explore the latent space with different functions that you can apply to the model let's say prompt engineering prompt caching stuff like that you're you're basically moving your kind of position in that Laten space around between attractor States you know and eventually you want to find an optimized point but you don't really know where that is right like if you're changing an instruction don't know how the attention mechanism of the Transformer is going to eventually result in some different outcome that might not be performant and so the only way you can truly know that is empirically and that's through evaluations and so I think that's why evaluations are so important and a lot of people just don't understand that soon enough in many ways I actually tell customers evals are your intellectual intellectual property like if you want to be competitive in a space you need to be able to out compete people by navigating that l l in space and finding the attractor state state faster than anyone else um and so part of you know how you do that is well firstly setting up some sort of telemetry uh to back test ideally that architecture set up in advance but you know you should invest in it um designing representative text cases so let's say you're working on that customer support agent eval you know you might have a kid come on your website let's you're building it for and might ask some crazy question like how do I you know kill a zombie in Minecraft like it totally unrelated to your product that's still probable and so you should probably include silly examples like that in your eval set to make sure that your models actually approaching the response in an appropriate way or rerouting the question Etc cool moving on to the next one um identifying metrics so a lot of the time you know there's this intelligence cost latency triangle of tradeoffs that people are trying to move in between and most or organizations can optimize for one or two of those things but it's very difficult to meet three at least right now but realistically that balance should be defined in advance and you should know that for your specific use case you're going to make a trade-off between those things so let's say a customer support use case again you care about your customer getting response within 10 seconds right more than 10 seconds I think there's been research done on this uh the customer is likely going to just log off the page and then they won't get the response and then they'll probably complain about your product to their friends right whereas if you're looking at a financial research analyst agent you probably don't care that it works for 10 minutes to come up with the actual response to your question because the decision being made after that is very important it's an allocation of capital for example and so the stakes and time sensity sensitivity of the decision should really drive your optimization choices and you know maybe more instruction sets lead to longer latency but higher performance Etc the other thing is ux could be important right so again on that customer support agent because we spoke about intercom you could have different ways of circumventing that 10c to 15 seconds specifically you could add like a little thinking box that bounces around you could send the customer to another web page in the meantime have them read something right like there's loads of ways to distract and kind of push on those boundaries but you still need to know what that important indicator is and you need to optimize accordingly finally fine tuning so a lot of people you know I go into these calls and they're like oh we want to do fine tuning I'm like oh here we go again um fine tuning is not a silver bullet so it comes at a cost and most people aren't aware of that cost um the cost is generally you're doing brain surgery on the model and thus there can be kind of limitations to its reasoning in other fields outside of the thing you're fine tuning tools um so my encouragement is try other approaches first right most people they don't even have their eval set when they're trying to do fine tuning right they need to have a clear success criteria in advance and it's like only if we can't get that in our specific intelligence domain do we then do fine tuning don't try to boil the ocean in advance the difference in fine-tuned capabilities and the wide variance of which you know failure versus success looks like in fine tuning land means that you should be able to justify the cost of fine tuning and the effort of doing it right getting a team fine tuning working with us for example uh you should be able to justify that difference and so in in terms of best practices you you know don't want let to let fine tuning slow you down right you don't want to say oh I'm only going to convert this language model use case if we can actually finally fine-tune our model it's like no no no pursue it and then realize that you need to do fine-tuning and then you can just sub in the fine-tuned model and then explore other methods first and there are loads of different methods that you know are anthropic as well as other companies are working on these days and I just wanted to like flash up a few of them as we wrap things up here and so I'm not going to go through all of these but alongside just Bas prompt engineering which granted is very important there are loads of different features or architectures that will change the success of your use case drastically so for example you might not need to sacrifice on intelligence of your model if in order to speed it up by like removing instructions if you can just leverage P caching and have a 90% reduction in cost and a 50% increase in speed right or contextual retrieval will drastically improve the performance of your retrieval retrieval mechanisms and thus you feed the information to the model more effectively and thus it has less of a Time processing all the instruction set that you've given it so there are quite a few things that you can apply here and some of them are even out of the box like citations and then there are also architectural decisions like agentic architectures you know Barry my colleague who's speaking tomorrow we'll have a lot to say on that um but that pretty much does it um thank you so much for uh for your time we'll be in the theater level Lounge after this chat for followup questions um anything else from you Joe no thank you so much [Applause] [Music] cool ladies and Gentlemen please welcome back to the stage your MC for the leadership track session day Peter Humphrey all right folks uh thank you again to Alexander and Joe for all their insights into applied AI with Claude uh anthropic is truly up to a lot of amazing things who's using Claude on a pretty frequent basis I'll speak for myself okay great we got a lot of them so it's been a pretty exciting afternoon uh we've had case studies we've talked about evaluation Frameworks we've talked about observability with agents and we talked about self-managed AI infrastructure that one I was particularly kind of interested in and of course anthropic thanks again to Alexander and Joe so we're going to take a 30 minute break before sessions continue at right here at 4M uh and again as before if you want to discuss meet the speakers or talk about uh birds of a feather you know like-minded topics from the last block of sessions just go find the speakers at the one of the three Q&A areas again one here on the theater level one at the very bottom of the stairs right at the at the landing and then another one tucked underneath uh hidden way behind the stairs um take some of this time during the break to stop in at the sponsor Expo there are more coffee and snacks again our sponsors have some pretty amazing products technology and services to help you on your journey see you back here at four o'clock thank you very much [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] he [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Music] [Applause] [Music] [Applause] [Music] [Applause] [Music] [Applause] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Music] [Music] [Music] [Music] [Applause] [Music] [Applause] [Music] [Applause] [Music] w [Music] [Applause] [Music] [Applause] [Music] [Applause] [Music] [Applause] ladies and Gentlemen please join me in welcomeing to the stage your MC for the leadership track session day Peter Humphrey all right folks thanks welcome back again uh you only have to see me one more time you've made it to the home stretch uh it's getting real we are uh almost there uh that said our last Sprint of sessions is a little shorter uh just to to take it easy on you um but it's no less exciting we're going to have people speaking about Ai and hiring one of my favorite topics I'm really excited for this talk uh from Heath he's got some some yummy data that I think you're going to like uh and then of course uh we have other speakers speaking about uh building AI platform teams and org structure and then we're going to finish off with retrieval augmented generation and data pipelines from a very special distinguished speaker with that please join me put your hands together and welcome Heath black managing director at signal fire thank you [Applause] [Music] hi everyone I'm Heath black I'm the managing director of product at signal fire and let's actually take a quick step back and ask uh why am I here uh so before I got involved in Tech I actually went and got a master's in Irish literature of all things uh both of my sons are named after Irish writers and uh you're going to see how this actually weaves into the presentation a little bit later when you get a degree like Irish literature you have to get creative in how you actually use it in 2009 I got involved in uh some startups and I helped ship the first ever conversational chat bot at a company called chirpify uh I then went and worked at a company called MZ where we were trying to build a Reddit competitor but a lot nicer and then I actually went and joined Reddit where I worked on experimental business lines and trust and safety tools and I followed that experience up by going to meta where I shipped meta's first assistant it was called M it lived Within Me messenger and then I was the uh first product manager on the AI assistant for their rayb bang glasses but now I serve as the managing director of product at signal fire so what does signal fire do I like to say the signal fire is actually the first VC built like a tech company and what I mean by that is the same way that you all go out and interview customers to figure out whether you're building the right thing for them before you ship your product signal fire interviewed 500 Founders to understand the things that keep them up at night and make them bang their head against the wall all day we then built AI ml tools and our portfolio success teams entirely around those problems going to Market recruiting building your leadership skills and the ability to launch your product but today we're actually here to talk about uh some things that we've learned from our proprietary AIML platform beacon beacon tracks over 650 million employees 20 or 80 million companies and 200 million open- Source projects and with all of that information we build a variety of proprietary ranking systems and Market insights that we can then use to power our firm so that we can move at startup speed but then also to support the companies that we invest in today's Focus we're going to be using some of the data from Beacon to figure out how to filter the right people to find them in the right locations to nail the right timing and then finally to close them with the right narrative so let's first start with filters when I think about recruiting I think about it in terms of like what filters would I apply to the people that are on my team and that I want on my team Beacon gives people the tools to apply these filters as they search but the reality is if you don't know what filters you need to apply you're not going to find the right people so here's some interesting trends that we've seen that change how we filter over the past you know decade or so we've seen a Stark deeden isation in ai ai startups are hiring more Engineers without phds or prestigious School than ever before in 2015 27% of engineer hires were from top schools and 16% had phds in 2023 those numbers were 15% and 7% this is about a 50% decline for both of these numbers over that period of time and you're probably saying that doesn't sound right what about for things like research scientists they've got to have phds right and you know you're not entirely wrong about 40% of research scientists have advanced degrees now this is not phds is simply Advanced degrees and even still it makes up less than half of the people that serve in research scientist roles today from my standpoint this isn't too surprising because there's been a slight shift in the market since 2015 in 2015 we were really focused on the the kind of ml research side of things the foundational side of things whereas today a lot of the work is about applying that to the real world usage of that model it's ml Ops Pro like product like software experience is understanding how users interact with the thing that you're building and with the shift from credentials we've also seen this really interesting people Mobility over this period of time if you look on the left side here historically a lot of the AI Talent was centered on these companies the Googles the Ubers the the metas the apples and over this period of time they've shifted to nine companies that we call the aiv league the companies on the right here have seen a massive concentration of talent over this period of time it's really interesting because we generated this last year shortly after that inflection was acquired so one of the key things that I want us to take away from here is that the market is constantly moving so we have to constantly be assessing where that market is Shifting and the interesting thing as well is that all the companies on the left side of the screen are now fighting viciously to get people from the right side of the screen rather than the other way around but one of the key things is not just knowing where people are going it's about where they're coming from this graph shows you net employee movement between different aiv League companies so to speak as you can see at the top open AI has a positive flow of people from Deep Mind whereas coh here actually has an a negative Trend knowing where people come from and where they're going is essential in ensuring that you are filtering for the right people as you look for people like you know building out your teams so the takeaway here is that work experience has always been important but it now far surpasses education in terms of the main aspect that you should be looking at here don't just rely on the credentials that someone has you should instead look at the body of work that they've compiled for new workers you can still look at their body of work what are their open- Source contributions what have they built outside of class the reality as experience and what you're building matters more than where you get a degree secondly you should be asking yourself do I need a PhD researcher for the role that I'm hiring or will a really awesome engineer with experience suffice and then the third is that you should actually consider removing academic requirements from your job postings or maybe making them soft so to speak because this will ensure that your top of funnel is getting the people that have the experience you need more so than the education now let's talk about the next aspect which is location I'm sure many of you have seen the debates on Twitter that San Francisco is clearly dead and so we wanted to know is it interestingly the answer is no is not dead San Francisco makes up about 29% of all startup engineers years now this is slightly down from highs of 2013 when it was at 33% but it's up ticking again since 2021 New York and Seattle have also been pretty impressive as they've both doubled the market share of Engineers that they have over that period of time if we were to zoom out and look at big Tech 50% of big Tech Engineers still reside in the San Francisco Bay area but what about AI specific specifically well San Francisco is still leading the pack 20 or 35% of all engineers in AI reside within San Francisco Seattle makes up about 22% New York makes up about 10% so San Francisco makes up more than both of those cities combined but if you actually look at this slide and compare it with the data that I showed in the previous slide you'll see that these these markets are all punched in well above their weight in terms of AI hiring and AI Talent where they had a smaller number in the previous slide they have a much larger number in this one so the talent is concentrating in these threey key markets today now this isn't terribly surprising for me because San Francisco makes up nearly 38% of all early stage funding into AI startups and the interesting thing about this is that San Francisco only has 26% of all early stage funding in the United States total so not only is San Francisco punching above its weight in terms of AI Talent is punching above his weight in terms of the funding that is going to AI companies today so the takeaway here is that Twitter doesn't determine whether a market is dead data does location still matters even in a highly distributed world that we live in today San Francisco Seattle and New York York are the premier locations for AI Talent today and so your job is to watch the location and funding markets to see where talent and capital is Flowing as another way to filter and find the right people now let's talk about timing finding the right person has as much to do with time as it does Talent at the airport on the way here it was a mere matter of minutes that separated me catching my flight or sitting in the be sad like Charlie Brown a fraction of a second is the difference between a home run to right field or a foul ball in the bleachers for me timing means two different things first it means finding people when they are most likely to leave secondly it means finding people that really are going to have a pinion to join a company at your stage and uplevel your team for where you are today so let's talk about timing we analyze some of those aiv League companies to see their retention rates if you're on this slide I'm sorry if the if the information uh offends you at all but anthropic is leading the pack with about a 66% 4year retention rate while perplexity hovers around 44% or so 43% this is just a small slice of the world as is constantly changing but the reality is understanding retention helps you know when you're likely to be able to get someone to answer that message that you send to them it effectively creates a poach ability score so to speak of your ability to land that person but in addition to retention we actually study the behavior of different Generations they act differently in 2023 nearly 27% of all gen Z left their job if you compare this with Gen X that's actually more than two times as much as genx and if you were to look at the fact that within four years after graduating gen Z has about 2.2 jobs whereas Gen X has 1.1 and so some of this has to do with the fact that gen Z is actually getting promoted at a slower rate some of it might be causing those slower promotions some of it has to do with the layoff Market that took place over that period of time but if you ask me a lot of it has to come down to the pension for people to take risk and bet on themselves j z likes that risk but it's not just retention and it's not just the generation you were born into you need to know when people work at different companies being on the New York Knicks in 197 three is very different from being on the New York Knicks in 2018 one of those teams held up a championship trophy and the other team had the worst record in their franchises history so as signal fire we built this cool tool we call it historical composition and it actually shows all of the startups that we invest in a snapshot of the companies that they admire at different points in time what did their org structure look like at that point in time who are the sales leaders that took them from 1 million to 10 million who were the first three Engineers on their team when they shipped that key product that I'm trying to beat now these things are going to help you identify the risk profile people have are they going to join a company at your stage it's going to Mo like help you understand the motivations that they have but it's also going to help you understand whether they're a potential tenx higher which you need to make in order to take your company to the next stage so the away here is that you have to understand timing both from an Outreach and an impact standpoint you should know when your competitors or the companies that you admire most are likely to lose other people you should track the people that work at those companies profiles to see whether there are changes made to it over that period of time studying the patterns of different Generations or segments of the population will help you understand how they change jobs and finally you need to know when people join and leave companies because that will help you identify your 10xers and help you identify people that are likely to join a company at your stage now this is where I finally get to use that literature degree narrative one of my favorite writers Kurt vonet has this awesome visualization for the shape of stories if you look at the bottom left here on the xais you have beginning and end and on the Y AIS you have ill Fortune leading up to Good Fortune so the bottom here is fron kafka's metamorphosis Gregor Sam so wakes up a bug and everything goes downhill from there on the top right you have a man walking down the street and he falls into a pothole but then he works him uh works his way out but my favorite Cinderella she's on the bottom right over here things start pretty crummy for her sisters are evil she has to do a bunch of work then uh some magic happens and she gets invited to this ball meets a beautiful man they fall in love and then what happens the clock strikes 12 she falls off a cliff but then through a series of fortunate events things lead to Eternal bliss now I'm not telling you that you need to preach the depths of Despair that your company is in or has gone through but you do need to understand the triumphs that you've had why you are where you are today where your Arc is going the reason for that is because historically pay and Equity were the two components that we use for narrative but we can't rely on those solely anymore why from November 2022 to November 2024 we saw a 1.6% increase in the average tech salary and a precipitous decline in the amount of equity granted but I have some really bad news for the folks in this room it's even worse for ai ai Engineers are the hot ticket for this year they command a 5% salary premium and 10 to 20% Equity premium over other engineering roles so what was already expensive is getting even more expensive for us so if we rely our entire narrative on that we're relying on things that we might not be able to afford so salary as the sole selling point has got to go Equity was that other thing that we used to dangle to get people to buy into what we were doing as a company but we've seen a precipitous decline in the amount of people that are exercising investment shares in Q2 of 20124 33% of people exercised the shares that they had vested this is down from 55% a couple years earlier a lot of this is driven by concerns over uh valuations that might be a little bit too high concerns about the cost of liquid Capital to exercise these shares and concerns about the market shifting which it does every 3 weeks in AI Equity can't be the only other thing that we're relying on we have to get to a point where we're not just focusing on money and Equity we have to have things like a close-knit environment with working with the founders collaborative teams speed and the lack of friction to actually get stuff done a big mission the ability to grow your mind and your career opportunities markets that are exploding and solving complex problems you need to understand what all these things are for your company in order to not rely wholeheartedly on salary and Equity as a narrative so to summarize in a world where so many companies are fishing in the same engineering Pond recruiting data can give you an edge these are just a few examples but in the same way that you use data to build your product both your models and the kind of analysis of that product you should be assessing data to build your team your team is your most valuable product that you have what we've seen is that de credential isation is happening so you need to filter accordingly location still matters so watch where people move data can help you identify the right time to reach out to people and it can help you identify the right time that people have been at different companies and all of this is going to help you craft a better narrative so if you can filter if you can time if you can find the right location and you can have a good narrative you're going to do a much better job if you're on the other side of the coin and you're actually looking for work you should know where the people that you admire go not just the companies but the space you should watch how long they stay there this will help you know how treated whether they think the space is going to be fruitful and then finally you should know what you want in that Arc of your career I'll be out in the lobby a little bit later thanks for your [Applause] [Music] time our next presenter is engineering manager for generative AI at LinkedIn presenting insights from building their gen AI platform please join me in welcoming to the stage Xiao Fang [Music] Wang all right good afternoon everyone uh it's my pleasure here to share our journey on building out linkedin's ji platform uh my name is Al ad manager of J Foundation uh let me try it one more time cool uh in today's talk I'd like to First share our journey on building out this platform especially on why we're building it how we build it and what we're building it after that we will talk about uh some thought process on why this platform is critical for today's agent word um hopefully after that you agree with me this is critical component in your component uh uh in your company and you also want to build this team I want to share some tips on how to build such a team how to hire for such a team towards end we will share some K takeways and Lessons Learned before we dive into this application uh platform journey I think it's important to first talk about the ji product experience because that's essentially what our platform is supporting for back in 2023 LinkedIn launched the first formal gii feature called collaborative articles this is a kind straightforward uh GI feature if we are thinking in this standard because it's a very simple prompt in string out type of application uh we leverage chat GPT uh I mean we leverage GPT 4 model uh to create the long content uh articles on the platform and they invite our members to comment on it at this stage our team helped to build some key component behind the scene including the gateway to centralize the access to the model uh some Pyon notebook for the prom engineering uh but at this time we actually have two different taex Stacks uh to serve the uh experience in the online phase we use Java and in the back end we use Python uh we wouldn't call this as a platform at this time very soon we realize uh there are some limitation for this simple approach especially it lacks the capability to inject our reach data into the product experience then in the mid 2023 we started to develop the second generation of the ji product uh internally we called co-pilot or coach here we're showing one popular such experience on LinkedIn right now uh basically it looks at uh your profile and the job description and then uh use uh some rack process to give you personalized uh recommendation on if you're good fit to the job at this time we started to build uh some platform capability uh specifically in the center of our platform we build the uh python SDK on top of the popular uh lunching framework to orchestrate ourm calls and it also provides the key value to integrate with our large scale infrastructure uh in this SDK so our developers can easily assemble a application we started to unify the text stack at this stage because we realize it's really costly to transfer the python prompt into the Java world not to mention the arrow during this process we started to invest on the prompt management or prompt source of Truth this is a sub module at this stage uh to help developers to version their prompt and to provide some structure around their meta prompt uh the most important piece I'd like to call out here is conversational memory uh this is uh infrastructure to help to keep track of the llm interactions and retrieval content and then inject those content into the final product it will help us to build this kind of conversational uh bo now uh zooming to this year actually in the last year we launched our first ever uh real multi-agent system called uh LinkedIn hirer assistant uh this is uh multi-agent systems to help our recruiters to do their work uh efficiently especially it automate several Teeters task uh normally recruiter need to do manually like um a post the job uh um and evaluate hundreds of candidates then reach out to them our platform also start to involve into the agent platform uh from the framework side we extend the sport of the Python SDK into a more large scaled U distributed agent orchestration layer it will handle the distributed agent execution and also handle the more complicated scenarios like retry logic and traffic shift uh for folks who build agent uh I think you probably know the skills or apis are one key aspect of the agent because we expect uh this agent to perform some action one investment we did at this uh time is around the skill registry basically we have a set of tools uh to help our developers to publish their API into this centralized skill r this skill rry can handle the skill Discovery problem skill invocation problem so in your application it's actually very easy to call the API to perform some task another key component uh we invest at this stage is on the memory in addition to the conversational memory we extend uh it capability into the experential memory essentially it's a memory storage to uh extract and analyze and infer the contextual Knowledge from the interaction between the agent and our user we also organize this memory into different layers including the um uh working memory long-term memory Collective memories uh this can help our agent to be aware of the surrounding content uh lastly at this uh time we also realize the operability is super important because agent uh one key aspect to Define agent is autonomous right because agent can decide what API they can call what L LM they need to call so it's actually very hard to predict Its Behavior so we started to invest on the operability uh particularly we build our in-house Solution on top of the hotel to keep track very low level granularity of the uh telary data so we can use this data to replay the uh agent call and we also add uh actual layer of the analytics on top of it so we can use that to guide the future optimization of our agent systems let's put together all the components we build for this platform uh we can classify them into four layers basically including the artion prom engineering tools and skills in location content and the memory uh management uh of course that not everything in the LinkedIn J ecosystem uh in addition we have our sister teams to build out the modeling layer like fine tune the open source models responsible AI layers to make sure the agent is behave according to our policy and standard and also the uh AI platform or machine learning infrastructure team to host those models the key value propriation for this uh ji platform is actually to uh be the unified interface for this complex ecosystem so our developers don't need to necessarily understand all those individual box when they build uh their application instead they can leverage our platform to quickly access to this entire ecosystem uh for example uh in our SDK the developer can just the switch one parameter in the one of the code to switch from the open a model to our on model of course they still need to do the prompt engineering but that reduce a lot of the complexity on the infrastructure integration phase uh last but the most importantly is because of this is a centralized the platform uh it provide a place to enforce the best practice and governance so we can make sure our developers are building the applications efficiently but also responsibly as as you can see from our journey we actually start to build this uh platform piece by piece and then this platform start to emerge if we take one step back and think uh do we really need this platform at this time especially there are lots of uh uh the vendor product on this space shall we buy it build it and why do we need to buy it or build it uh here are some thoughts um the short answer is yes the reason behind it is uh we feel like ji is totally different new AI systems compared to the traditional AI systems so in the traditional AI systems there's a clear cut off between the uh AI model optimization phase and the model serving phase so AI engineers and product Engineers can operate in two different tax stack uh they usually don't uh need to uh work on the same code base but in the ji systems what we're seeing is this line between the optimization phase and the serving phase disappear basically everyone is a engineer who can optimize the overall system performance this actually create the new challenge of the tooling and the best practice in the company essentially we think these ji systems or agent systems is a compound AI system here we borrow the definition from Berkeley AI research lab uh compound AI system can be defined as a system which tackles AI tasks using multiple interacting components including multiple costs to model retrievers or external tools as you can see this is actually skill across AI engineer and product engineer and I believe this uh J app platform is trying to bridge this Gap to summarize uh we believe this platform is critical for your success m because it can Bridge the skill gaps between those two group of Engineers okay let's say if you want to build this uh platform in your company and how to hire it is a frequent question uh I heard um I basically look into uh my great engineering team and uh extract all the key qualifiers from those top engineers and uh I put all the qualifications here uh the ideal candidate in this team is a strong software engineer uh who can build infrastructure integration they have a good developer uh PM skills to design the interface uh ideally they have the AI and the data science background to understand the latest techniques they are the people who can learn from the latest techniques but at the same time they are Hands-On unfortunately it's really hard to guide those candidates if you get them uh it's probably worth more than unicor realistically we are making multiple tradeoff in the hiring uh here are some principle uh we follow and it's actually working pretty well on to share here in terms of the core skills uh we usually prioritize the stronger software engineer skills over the AI expertise this might be controversial but uh uh uh we can discuss if you're interested second is instead of hariring for experience or degrees we hire for the potential because this field is involving so fast most of the experience might be outdated in case you won't be able to find a single engineer with all the qualifications we're showing here uh the way we are solving this problem is to hire a diversify the team so so for example uh in our team we have some full stack soft Engineers we have data scientist we have ai engineers and data Engineers we also have a fresh grads uh from the top research University and also uh some people from the stub background and then we put them together uh into the project what we've seen is based on those collaboration those strong Engineers start to pick up new new skills in the project and very soon they started to grow into these ideal candidates uh lastly is uh want to emphasize is uh the critical thinking uh one constant topic uh in our team meeting is uh no matter what we're building right now it will be outdated within a year or even less than six month so we consistently evaluate the latest open source package talking with vendors and deprecate our solution more proactively cool let's talk about say some uh K takeways uh especially on the text stack choice if possible we strongly recommend python we started with Java and python uh there are some back and forth of the debate internally but finally we pick Python and I think that's a right choice especially most research and open source uh are in this space based on our experience it's also scalable in terms of the uh key components you want to build in this platform the first one is a prompt source of Truth prompt in some way is like a traditional model parameters you want to have a really robust systems to Version Control your prompt this is really really critical for the operational stability you don't want accidentally added your prompt in production and uh see some really side effect second key component is on the memory I think in today's uh meeting uh I mean today's talk someone already talked about it memory is a really key component to inject your Rich data into the agent experience lastly in the agent era uh one key new component we are building is on the uplifting our apis into skills which can be called from the agent easily so you can uh build some surrounding tooling and infrastructure to support this need all right let's talk about how to uh scale this solution and got gued adopted uh from our experience instead of trying to build this full-fledged the platform at the beginning try to solve immediate need for example we started with a simple python library to support orchestration then we start to grow into all the components we're seeing here second is uh focus on the infrastructure and the scalable solution and Linkedin we actually have a pretty good success story by leveraging our uh messaging infrastructure uh to be as a memory layer uh it's both cost efficient and scalable last day is uh focus on the developer experience by the end of the day this platform is trying to help developer to be as productive as possible their adoption is a key for the success if you can design this platform please focus on uh how to align your technology with their existing uh workflow so it will ease adoption and uh be more successful uh we actually have lots of lowle details on the technical side uh if you are interested please check out our engineering blog post on LinkedIn by Cake s and myself uh with that uh thank you for your attention and uh if if you are having more questions happy to answer that after the talk thank [Applause] [Music] you ladies and Gentlemen please welcome to the stage the CEO and co-founder of contextual AI da Kila [Music] [Applause] hi folks uh thanks for being here uh I'm the last talk for today my name is Da Kila I'm the CEO at contextual Ai and I'm here to talk to you about rag in production rag agents specifically um and I'll I'll share some of the lessons that I've learned so my background is in AI research uh but after that I became the CEO of a AI company focused on Enterprise uh so I thought I would share some of my learnings with you uh in the hope that that's useful so if you uh look at Enterprise AI uh if you work in this space you'll probably notice that there's a huge opportunity uh ahead of us right everybody wants to grab that opportunity there's there's these huge numbers flying around $4.4 trillion dollar is is the estimated added value to the global economy according to McKenzie so we have this giant opportunity but at the same time if you actually look at what's happening in Enterprises you see a lot of frustration it's probably even true for some of the people in the audience right here if you're a VP of AI then you're probably under some pressure right now it's like where's the ROI we're investing all this money in AI but where is it actually leading us to are we getting something out of this so uh Forbes has this interesting study where they showed that only one in four businesses actually get value from AI so why is that happening right it feels a bit like a paradox uh so to to explain it we can look uh at a paradox that might be familiar to you it's it's something called morx Paradox it's from Robotics and in robotics they were very surprised when they found out that it's actually much easier to beat humans at chess than to have a robot that can vacuum clean your house or have a self-driving car um so the the the Paradox here is really that things that seem hard are actually much easier for computers than you would expect and things that seem easy actually turn out to be much harder right so there's something very similar happening right now in in Enterprise AI specifically and this is around context so on the one hand we have these amazing language models right you you've that's why we're all here basically because we see this revolution happening right in front of our eyes so language models can generate code much better than most humans they can solve mathematical problems much better than than most of us here can do uh and we're pretty smart um so it's really amazing what they can do but one of the things that they really still struggle with and that's one of the things that as humans we are very good at sort of without effort is putting things in the right context right so as humans we build on our expertise we build on our intuition that we've developed over the years especially if we're a specialist this is something that is very easy for us to do is to put something in the right context and uh and in the right situation so that you can make sense of the information or the problem that you're solving so I would argue that this is really the key observation this this context Paradox um for unlocking Roi with AI and the reason for that is that where we are right now here is is in the bottom left right so we're we're mostly focused on convenience we have general purpose assistance they're very useful mostly if you're lazy they help you sort of solve your problems faster but where you really want to get to is differentiated value if you're an Enterprise it's nice that you can make things more convenient venient you probably can make people more efficient and more productive that's great but where you want to get to is this business transformation ideal right that's what all the CEOs are probably telling you as a VP of AI like I want to change my entire business how am I going to do that so getting to that differentiated value that's where you want to get to but the problem is that the higher you go on that axis the further you go on the context AIS so that the better you need to be at handling the context uh uh that exists Within your Enterprise um so what should we do about that um so that observation is really why I started the company that I'm currently the the CEO of contextual AI um and we started this two years ago to try to help bridge this Gap um and we've learned some lessons along the way that I thought I would share with you uh in the hope that they're also useful for you so the first observation is really that language models are awesome but often there are only 20 % of a much bigger system um so if you have an Enterprise AI deployment usually that means it's a rag system uh so I I think everybody here probably has heard of rag uh rag is something that I uh originally pioneered with my team at Facebook a research when I was there uh so rag is is really kind of the standard way that you get gen to work on your data so what happens very often these days is a new language model comes out everybody goes whoa new language model it's great everybody starts to think just about the language model but very few people actually think about the system around the language model and that system needs to solve the problem right so you can have a relatively mediocre language model but an amazing rag pipeline around it and that's going to be much better than an amazing language model with a terrible rag pipeline around it so the basic observation here or the lesson is that you should be thinking about systems not about models the model is only a small part of the system and the system is the thing that solves the problem the next observation is that if you're in an Enterprise expertise is really your fuel right so uh one of the the things that you want to be able to do as an Enterprise is unlock all of that expertise so you have all of this institutional knowledge in your company how do you get it out um so one way to try to do that is is using these generalist kind of general purpose assistance but it's very hard to get them to to uh to match the expertise of people in your company so ideally what you want to do is to specialize so that you can capture that expertise much better so uh at my company we call this specialization over AGI AGI is great there are lots of use cases for it if you really want to solve a very difficult problem that is very domain specific where you understand the use case you want to specialize for it and you'll get much further so that's I guess uh pretty counterintuitive if you look at the sort of broader uh interests right most people are much more excited about AGI but solving real problems is much easier with specialization the next lesson is uh at an Enterprise scale is your remot so if you think about what a company really is is a company maybe its people probably a little bit right but over time what the company really is or what makes a company a company is its data because even people are transient right so the data that a company owns that is the company in the long term so now as an Enterprise you need to think how you can unlock all of that potential right and so uh one of the the big issues that we see a lot is that enterprises think that uh you need to scrub the data and clean it and invest a lot of time in in uh making your data accessible with AI but what you really want to do is make sure that AI can work on your noisy data at scale and doing that is incredibly difficult but if you succeed in doing that that's how you get to differentiated value right that's how you get that mode because the data makes makes your company your company and so data is really your remot uh one observation uh and this is really a hard truth that that we've learned and I I think that many of you might have learned already or that you're about to find out uh if you're earlier in in your journey is that Pilots are very easy uh building a demo not very difficult these days right if you want to build a rag system you take one of the Frameworks you put in some documents you have a working solution it's great you give it to your 10 users they all tell you it's fantastic and and then you show it to the CEO and he saysay we're going to fire half the customer support team and we're going to replace them with AI and we're going to do that in three months and now you're on the hook for productionizing something that is actually much much harder right so getting this to work at tens of thousands or hundreds of thousands or millions of documents you can't do that with any existing tools uh that are out there on the open source market it's very very difficult to do that making this skill to thousands of users is very hard um making it work for lots of different different use cases if you're an Enterprise maybe you have 20,000 different use cases that you want to cover so how do you scale if that's the problem that you're solving and then there's of course Enterprise requirements around security and and compliance so bridging that Gap is much harder than you think and and the the right way to deal with that is to really focus on production from day one so don't design for the pilot design for production uh and that can save you a lot of time and that brings me to the next observation is that speed is really much more important than infection what we see um in terms of production rollouts of uh rag agents it's all about speed um and and what that means is uh you need to give it to your users relatively early real users not not sort of uh testers who are are kind of friendly you want to give it to real users to get their feedback you want to do that early it doesn't have to be perfect it just needs to be barely functional and if you do that then you can heal climb to actually get to this level where it's good enough if you don't do that and you wait too long and then you try to design something that is perfect it's going to be very hard to to bridge that Gap from Pilot to production so iteration is really the key to a lot of uh successful uh production AI deployments in in Enterprises next observation is is related to this too which is that uh if you want your engineers to be fast and if you want to follow that that speed Maxim I just talked about then you don't want them work working on boring stuff uh sounds kind of obvious but it turns out that Engineers are working on a lot of very boring stuff um and so one of one of the the things that they have to worry about for example is what is the optimal chunking strategy for my rag system and it's different for every use case and is different for every framework and then they have to think about what the right prompt is or really basic things that ideally they don't have to think about too much because you really want your engineers to think about how are am I going to deliver business value right how how do I make sure I have this differentiated value and that I'm actually better than my competitors um so make sure that your engineers spend time on the things that matter and not on the chunking strategy or or things that that can be abstracted away uh very well these days by by state of theork platforms for for rag agents next one is is about making AI easy to consume so what what I mean by that is we actually see uh this happen quite often where companies have gen AI running in production and then the next question I often ask them is okay how many people are actually using it and and surprisingly often the answer is zero almost nobody's actually using it they did all this work but they had to make sure it came through uh sort of model risk and and teams like that so it was really like kneecapped almost and now it's barely useful uh so that's one scenario or or very often people just don't actually know how to use the technology so it it really a journey that you are on and the easier you can make your solutions to consume the better it is and what that what that means for most Enterprises is not just thinking about your Enterprise data and how you make AI work on it but also how you integrate it into their workflows so the closer you can integrate it into a workflow that already exists in your Enterprise the more successful you're are going to be with real production usage uh next one is is related uh to to previous one as well uh where it's really about getting usage it's about sort of being sticky and and so this sounds maybe a little obvious but the the quicker you can wow users or get this sort of spark where they they suddenly get it like this for for me as a CEO of a AI company that's really the special moment when people suddenly go like wow I didn't know that it could do this um so you can try to design your experiences for onboarding users around this observation too right so so where where they get to the wow as quickly as possible so for us we had this really nice example with someone at Qualcomm so we're we're running in production globally with Qualcomm with thousands of customer engineers and one of them became so happy when they found this document it was seven years old it was hidden away somewhere they didn't know it existed they had all these questions and they just never knew what the answers were and suddenly because they asked our system they got these answers and like their their world was never the same again after that um so these are the small the small winds sort of uh that that really matter for for uh evangelizing uh production in AI uh so that brings me to the the penultimate learning which is that it's not even really about accuracy anymore so accuracy is almost table Stakes right uh so I I think as AI practitioners we probably know that getting 100% accuracy is very hard if not impossible getting 95% accuracy maybe you can get there or 90% but what Enterprise are are thinking about much more these days is what about the missing 5% or what about the missing 10% how do I deal with the things that might go wrong right um so there's a minimum requirement for accuracy but beyond that is really about inaccuracy and the way to deal with that is through observability so you want to be very careful with how you evaluate these systems you want to be very careful with making sure you have proper audit Trails especially if you work in a regulated industry this is incredibly important right making sure that you have an audit dril that says this is why I generated this answer it's because I found it here in this document basic things like that so attribution essentially in a rag system actually becomes very very important for dealing with the inaccuracies and similarly what you can do is you can check the claims that your system generates so do a lot of post-processing to ensure that you have proper attributions uh that that you can really back up uh as evidence struggling with the clicker a bit so uh fi final one uh that I want to end on and this sounds maybe a little bit cheesy but it it really is is true is be ambitious we we actually see a lot of projects fail not because people are aiming too high but because people are aiming too low where where folks are going like I have gen running in production and then what does it do it answers basic questions about who your 401k provider is or how many days of vacation I get like that's not really where the ROI of AI is right so you want to aim for really ambitious things where if you solve them you actually have have Roi and you don't just have a gimmick that people don't really uh use anyway so try to be be ambitious because we really live in special times we have the the astronaut here on the slide um so so I I think it was a pretty special time to be alive during the the moon landing and when all of that happened right we're in a in a similar moment right now where AI is is really going to change everything is going to change our entire Society in the next couple of years um and so you have an opportunity being in the role that you're in to to Really uh uh affect that change in society yourself uh so so be ambitious when you do that uh and don't aim for the the lwh hanging easy fruit uh aim for for the sky so uh that's really uh what what my lessons were for you here uh this context Paradox is not going away um but by understanding these lessons that I I shared with you hope you can turn some of these challenges uh that we see everywhere in Enterprise AI into opportunities for yourself um so so it's really build better systems think about systems not models focus on your expertise and specialize for it don't settle for General Solutions specialize for for the expertise that you have in your company and be ambitious and then you'll be very successful thank you [Applause] [Music] ladies and Gentlemen please welcome back to the stage your MC for the leadership track session day Peter [Music] Humphrey we did it thank you everyone and thank you dawe for your insights into retrieval augmented generation I mean it's such an important part of this ongoing Quest that we're all on for AI accuracy and relevance um I think in particular I don't know if everyone saw the Google announcements in the last couple days about like one billion input tokens um so I a really really great session I was I was uh pretty riveted on that one um it's been a pretty Whirlwind afternoon folks uh we did topics like AI hiring I thought uh Heath session was great so much good data there um team building case studies at LinkedIn and of course just now retrieval augmented generation with uh if you want to meet up with the speakers again same thing as before um there are those three Q&A lounges one's here on the theater level two downstairs one at the bottom of the stairs the other one tucked under um so that's a good place to go to chat with birds of a feather people that want to talk if you want to talk about uh topics from this afternoon um and uh uh that'll be kind of going on for the first the speakers will be there for like the first 30 minutes uh of the reception and then you know uh enjoy some some social time some drinks and have some fun uh quick reminder uh regarding brunch tomorrow um for those of you that don't have a bundle pass uh so for security reasons we are not reprinting badges sorry uh so if you have a bundle pass and you are joining us tomorrow um please remember put your badge in your bag don't toss it keep it with you tonight hanging on your front door knob you know whatever it is that's going to help you remember it as you walk out the hotel room uh tomorrow um so that's it you did it thank you so much for sticking with us today that is a wrap it's onto the Afterparty at the Expo like I said we'll have some drinks and of course products technology and services from our sponsors so please stop by and chat with them have a wonderful evening thank you for being here and we'll see you tomorrow for the engineering track [Applause] [Music] hold [Music] [Music]