AI Engineer World’s Fair 2025 - Tiny Teams
Channel: aiDotEngineer
Published at: 2025-06-05
YouTube video id: xhKgTkzSmuQ
Source: https://www.youtube.com/watch?v=xhKgTkzSmuQ
Okay, hi everyone. So excited to have you here today for what I believe is the first ever edition of Tiny Teams here at the AI Engineer Summit, courtesy of our friend Sean Wang, aka swyx. My name is Britney Walker. I'm a GP at a venture capital fund called CRV. We've been in business for 55 years, backing teams at the seed and Series A stage across 19 funds, and we're currently investing out of a billion-dollar fund. We're backing folks like Vercel, Postman, Kong, Browserbase, and Voyage AI across the infrastructure landscape, and hence I'm here today with you all. We have a super exciting track for you, pointing to a trend we've all seen over the past year and a half to two years of AI: small teams can now build insanely successful projects in a way that probably was never possible previously. Here to kick things off for us is Eric Simons from StackBlitz with their product Bolt.new.

All right, is this thing on? Okay. How's it going, everyone? Let's get the vibes going. Excited to chat here today. By the end of this, what I hope you get out of it is some advice I wish I had had before trying to hold on to the tail of the dragon these past months. How many people here have heard of Bolt.new, by the way? Oh wow. I'm still used to asking "has anyone heard of StackBlitz?" and seeing two hands go up. Okay, cool, so everyone's used this thing or tried it out. Is anyone aware of how long we were around as a company before we launched Bolt? Seven years.

To put that on a graph: if you rewind the x-axis seven years back, the ARR at the bottom of the curve, in October of last year when the ramp starts, was $0.7 million. That's what seven years had gotten us to. At the time we launched Bolt, we were a team of fewer than 20 people, and when we put it online we had absolutely no idea what was going to happen. We actually thought we were getting ready to shut down the company at the end of last year. This wasn't even really a pivot, because Bolt is built on the same core technology we'd been building for seven years, but we couldn't figure out a way to create a commercial offering around it that made sense at venture scale. So our expectation was: if we can add $100,000 of ARR by the end of the year with this thing, that would be game-changing. What happened was obviously beyond our wildest expectations, and since then we've more than doubled it.

To me, the really crazy thing about the graph is how clean the ramp is. There are no jagged edges from the insanity of the early days, and the product we put online was really an MVP. It's like those race cars where they strip everything out and it's just metal: no back seat, no passenger seat.
That's what the product was like. So the fact that our team was able to scale this was just unbelievably impressive, and that's what I want to talk about: what that looked like, and how to structure teams so they can rally together and scale into what would normally take at least a year to grow into.

The best analogy for what this period felt like is the movie 300. On that revenue ramp, probably toward the tail end, we were looking at 30 or 40 thousand active customers in month two, with a team of fewer than 20 people. So the image is apt: a small group of people surrounded by tens of thousands of things that, in our case, weren't trying to kill us, but it felt like that. The support load was unbelievable. There was not a single person on our team with "success" or "support" in their title; my chief of staff and I largely responded to the support tickets. The main reason we were able to make it work was the incredible camaraderie on our team. We'd been working together for seven years at that point, extremely aligned, very lean, and very fast, and none of that was new for us. That's how we'd been operating all along.

One of the core philosophies my co-founder and I set out with goes back to the company we did before StackBlitz. He's actually a childhood friend of mine; we've been building websites together for 20 years now, literally since we were 13. That previous company we bootstrapped from the ground up ourselves, broke and living on couches, and when you do that, you really learn how far a dollar can stretch. It becomes very obvious how incredibly inefficient most startups are during the phase of trying to find product-market fit. So this is where a mantra we've had at our company for almost a decade really kicked in: you want a small number of people with more context per head. What that means is that people at the company have more agency. They can just go and build things without asking permission; there's no chain of command to work through. Everyone is empowered, things move a lot faster, and people can make immediate impact, which really matters when you're dealing with this kind of scale.

For startups, the name of the game when you're finding product-market fit is taking as many shots on goal as you possibly can, because getting to product-market fit works just like an enterprise sales pipeline. If you're an enterprise sales rep who wants to close a million dollars of pipeline, you don't talk to three people and assume you'll close it. You put 100 or 200 people in your top of funnel; maybe half of those take the next call; maybe 10 of those remain as actual warm leads; and of those, you close three or so.
It's the same thing with building products and startups in the early phases. You need to stick around as long as you possibly can, which means you need a lower burn rate, which means you do not want more people at the company, because humans are the most expensive thing a company pays for. That doesn't mean you shouldn't hire; it means headcount directly determines whether you have enough runway to keep taking shots on goal. Case in point: one of our main competitors from our previous product, back when we were in the IDE space, got acquired and basically stripped for parts two weeks before we launched Bolt. We were on that same trajectory. It was purely a matter of them not having enough runway to get to the other side, and they were good; they would have been a meaningful competitor to us in this space.

So there are a whole bunch of reasons this matters, and those are the main ones. Again, none of this is new; a lot of folks who do startups repeat this sort of thing. You want people with a shared set of core values: low ego, high trust, obsessed with making the user successful, and, underneath chaos, grit and resilience. If you aren't in an insane situation like ours and you already have folks struggling with the normal ups and downs of a startup, I will tell you: what we did would not have been possible with people who lacked incredible grit and the ability to check their ego and focus on what really mattered. Great people are what it's always about. From a team perspective, those are the things that stick out to me about what allowed us to scale with the traction we saw, and are still seeing.

And this applies beyond our crazy extreme situation. At startups in general, there are going to be times when everything is on fire, and a lot of you can probably relate: sometimes good things are on fire, like tons of customers; sometimes bad things are on fire. There are just lots of fires, and the question is how you prioritize. The best analogy I've leaned on as an operator: imagine you're a fire truck squad with one truck in a town that's completely on fire. Where do you start? The answer is that you have to make hard decisions and choose the high-impact areas, the key infrastructure and the key people that need to be saved. It's tough, because it's hard to gauge what's actually going to matter most, but that's the job of firefighting. What you're really saying is that some fires are just going to have to burn, and that's okay. If you focus on saving the right things, that makes up for everything else you have to let go, because a small team simply can't focus on everything.
There's actually an added benefit: you don't get lost in the million things. If you hire a whole bunch of people, you feel like you have to do all of those things. It turns out that focusing on 10% of the things often gets you the lion's share of the results that actually matter. Staying small forces clearer thinking about where you put your time and focus as a team.

I mentioned we've been around a long time, eight years now as a company, and over those eight years in the Valley there have been a lot of things people say and believe, repeated at every gathering, that then suddenly flip. A couple of random examples. Back when we started in 2017 and 2018, remote work was very looked down upon; there was no way you could do that. But the best candidates my co-founder and I saw were coming in from all around the world. We had actually gotten an office in SF, thinking we'd set up shop here, and six months into paying $5,000 a month for it, we realized we hadn't hired a single person in the city. What were we doing? We went fully remote in 2018. Then the pandemic hit and the world said remote work is it; now it's flipping back again. You need to have your own thinking, because if you just follow whatever the press or investors say, it's going to be a nightmare. You'll be distracted by a bunch of decisions that don't actually come from your own assessment of reality.

Another great one, on the topic of tiny teams: if you were a company raising money in 2021, investors were screaming at you, ours included, to raise more money and hire a whole bunch more people; that's how you succeed. If you waited 12 months, in 2022 the same people came back and said you need to lay off a bunch of people and stop spending money. For us, we were never really spending, and we never did increase the headcount. So you want to make your own bets. I don't want to say be contrarian for contrarian's sake; some of the oft-repeated stuff actually turns out to be durable advice. But I'd encourage you to think for yourself and not just adopt the hive mind, because the best companies seem to have the independent decision-making that really allows them to succeed.

Of course, leading from the front is very important. Again, not a new idea, but in the first week of Bolt being online it was pretty touch-and-go, because the product was very brittle. It became clear to me that if I and the team didn't get out, make ourselves visible to the community, and engage with them, people were going to churn and lose belief pretty quickly, because we had so much work to do.
So we started running a weekly office-hours session where we let all users tune in on YouTube or X and we just showed them what we were building: "Hey, we hear you. Here are the things we're working on, and here's where we think they're going to land." People would ask questions, and so on. How do you smooth that kind of growth curve? You go and do things that don't scale. User love is hard to quantify, but oh my god, it works. That's how you scale love for a product like this.

The last thing I'll mention, as far as tools go: support is an area where a lot of AI tools are now coming out that help you scale every aspect of your business, and it has been a huge one for us. For the first two months we were online, as I mentioned, my chief of staff and I were the primary support people, spending a lot of our time on email. We ended up picking up a tool called Parahelp, if anyone's heard of those guys. Their AI agent, Sam, is our top-rated support assistant and resolves 90% of our tickets automatically. A year or two ago, we would have had to hire 50 people to scale that. That's the leverage you get from integrating AI, and there are even custom things we're doing in our product, training our own small models to help people succeed within the product experience, things that would have required human support before. There's a lot you can do by building AI into the entire customer success journey, not just the product itself. On Parahelp: we're one of their customers, I think Cursor is using them too, and it's a couple of brilliant young guys, out of Europe I believe, running that company.

I mentioned leading from the front, and the related piece is community. This is something AI cannot replace. Actually talking to users, and creating a space where users can try your product and learn from each other, is so key. That's always been the case, but especially if you're building an AI product, it's really important that folks can learn from each other somewhere they can get help from pros and from the community itself, because this is another way to scale the customer experience without adding headcount inside your company. One of the cool ways we're doing this right now, I don't know if anyone's seen it: we're throwing the world's largest hackathon, running for this entire month. If you go to hackathon.dev you can check it out. We've already passed the Guinness World Record; we've got eighty-something thousand people participating. We have dozens of people coming to help provide support as folks build out their projects and try the product. And it's had the craziest ROI of any marketing initiative we've ever done,
both because of the scale and because of the thoughtfulness of augmenting it with both AI support and community support. This sort of stuff really works.

So, to wrap up, these are the main takeaways if you want to take a photo of the TL;DR: the things that stuck out to me from the past couple of months that really made a difference. Like I said, it was very touch-and-go, especially in the first two months, given how unexpected this was and how unprepared we were for what happened. Without these things it would not have worked, and it wouldn't be working now. To boil it down: you don't want to hire an army, you want a small number of Spartans. That's the mentality we look for when we hire people onto the team. All right, this is where you can find me. I have to go to SFO immediately after this, but if anyone wants to chat or has questions, that's me on X and that's my email address. I think we have one minute for questions if anyone has a burning one. There's a microphone up here if you want to come up.

How did you decide what to build? Did you have a framework for talking to users, or did you just ideate, ship product experiments, and see what stuck? What was the process?

You mean how we decided to build Bolt? We tried out probably five different things last year, and everything I've ever built that really stuck with users and resonated always started with something I myself thought was cool. That sounds very obvious, but most of the things I've built in my career were things that merely sounded good, like "hey, this should increase our ARR," without intrinsically being something I was so stoked about that I couldn't sleep at night. Bolt was one of those things. Then we certainly put it in front of users, and people seemed excited. And I'll tell you this: the user feedback we got from the early Bolt sessions before we launched was exactly the same as what we'd gotten launching StackBlitz, and the outcomes couldn't have been more different. So again, it's all about taking shots on goal, because you just don't know until you actually get it out into the world. You can certainly get early feedback, but it's all about getting to launch, getting it out there, and iterating as fast as you can. Am I cut off? Okay. I can't take any more. Thank you for having me; hopefully this was helpful.

Thanks so much, Eric, for walking through the amazing journey that has been StackBlitz and now Bolt. Next up we're going to have Sid Bendre come to the stage to talk about Oleve. Sid is a co-founder of the company, and they're building a portfolio of consumer products, starting with products like Quizard, which you may have heard of before.
One of their products reached number four on the App Store education charts in 2024 and number five in 2025, alongside companies like Duolingo. They're backed by Neo, and they're building the AI infrastructure to build a billion-dollar portfolio of consumer software over the next decade. So Sid, please come up to the stage.

Cool. Hey everybody. Sorry about the delay; I was just trying to get connected. I'm Sid, one of the co-founders of Oleve, and this is the new lean startup. We've been seeing a fundamental shift in how successful companies are built. More and more companies are getting smaller, rounds are getting delayed, and profitability is being attained earlier than ever in a company's lifetime, driven mainly by the advent of AI tooling. These companies are generating millions in ARR with teams smaller than most startups' engineering departments. The age of bloated teams and endless hiring rounds is over. Welcome to the era of tiny teams.

First, a bit of background on Oleve. We're building a family of iconic consumer software products that we hope will enable people to live better, more fulfilling, and more productive lives. We are a tiny team that scaled a portfolio of virally successful products to $6 million in ARR, profitably, and we've generated over half a billion views across social media, all with a team of just four. We're based out of New York City.

Here's a brief history. On January 26th, 2023, we launched the Quizard AI mobile app with a TikTok video that went viral overnight, generating a million views that turned into 10,000 users in less than 30 hours. We actually started scaling with no LLM costs, because back then the initial Codex model had just launched in beta preview. Funny enough, we were cycling between 10 different accounts from our friends just so we could prompt-engineer and generate these AI outputs. Interestingly, Codex, even though it was meant to be a coding model, could be prompt-engineered for any open-domain conversation. As you may know, it ended up being sunset for abuse; OpenAI actually reached out to us directly on a few of the accounts we were cycling through, as one of the top users of the Codex model at the time.

My co-founders and I then graduated and moved to New York City in the fall of 2023, where we ran our back-to-school campaign, a series of man-on-the-street videos at prestigious colleges across the US. That's when we hit our first million dollars in ARR and achieved profitability, within our first nine months of operating. We had another successful campaign in the spring of 2024 that got us all the way to number six in the education charts, alongside giants like Duolingo, Photomath, and Gauth. We then took all our learnings from that spring and doubled down on a new product, Unstuck AI, a study companion for students. We got to a million users in under nine weeks and generated over a quarter billion views across socials in a month. A few weeks ago, we got both products into the top 10 of the education charts; Unstuck went all the way up to number three, right under Gauth and Duolingo. We've now also launched our third product in stealth, our first outside the education domain.
It took three weeks to build, thanks to all the blueprints we've built in advance (more on this later), and it has already reached a thousand-plus users. By the way, it's already profitable.

Our lean playbook boils down to three key pillars: operating principles that lay the foundation of leanness, organizational structure that sets up the systems for it, and AI tooling and augmentation that optimizes scaling.

Let me start with operating principles, which I believe are the bedrock of why we're so lean. It starts with hiring: we either hire right or not at all. We only hire 10x generalists who have multiple complementary spikes in adjacent fields. For example, our product engineers are full-stack developers, great product thinkers, and really strong on computer networking fundamentals. We also have marketers who can code, designers who can build, and the like. We aim for people whose complementary spikes can shape and drive 10x outputs within the team.

The second key principle is a profit-first mentality. We are relentless about prioritizing profit, because profit is power and profit is focus. It gives us a clear mechanism for making all our decisions and a north star for the company. That leads to our third principle: does this move your KPI? Everyone in the company owns a KPI. KPI alignment removes micromanagement, because everyone is focused on moving their metric week over week, and every decision must be validated against that KPI.

Our fourth principle is continuous process refinement. For any repeating process, we always ask: how would we do this better? Is there any way to improve? What went wrong in the previous run? We view failures and issues in the company as systems failures, which lets us set up a feedback loop for improving ourselves and our processes, both operationally and technically.

The fifth pillar is super tools. We're pretty lazy, so we like to consolidate our work: don't learn it twice. We believe in building compounding benefits by investing in technical playbooks and operational blueprints, so the learning compounds and the benefits carry over to new products. This is exactly how we hit a million users on Unstuck within nine weeks, by taking everything we'd learned over a year and a half on Quizard.

More on the super tools concept. One of our super tools is LaunchDarkly. Its intended use case is feature management: helping software teams control and release features safely and quickly. Here are some of our extended use cases. We use LaunchDarkly as a manual traffic load balancer. Specifically, we put LaunchDarkly in between all our LLM calls so that we can reroute traffic to different LLM providers when we hit rate limits, for strategic initiatives, or whatever else. It gives us an on-the-fly mechanism for choosing where our traffic goes and lets us split load within rate limits.
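As a rough illustration of this pattern (not Oleve's actual implementation), a flag can hold an ordered list of providers, so rerouting traffic is just an edit in the LaunchDarkly dashboard. The flag name and provider stubs below are hypothetical:

```typescript
// Hypothetical sketch: a LaunchDarkly flag as a manual LLM load balancer.
// Assumes a flag "llm-provider-priority" whose value is a JSON array of
// provider names, editable live from the dashboard without a code push.
import * as ld from "launchdarkly-node-server-sdk";

const client = ld.init(process.env.LD_SDK_KEY!);

// Stubs standing in for real vendor SDK calls (Azure OpenAI, OpenAI, etc.).
const providers: Record<string, (prompt: string) => Promise<string>> = {
  "azure-openai": async (prompt) => `azure response to: ${prompt}`,
  "openai": async (prompt) => `openai response to: ${prompt}`,
};

export async function completeWithFailover(userKey: string, prompt: string) {
  await client.waitForInitialization();
  // Read the current priority order for this user context.
  const order: string[] = await client.variation(
    "llm-provider-priority",
    { key: userKey },
    ["openai"] // default if the flag is missing
  );
  for (const name of order) {
    try {
      return await providers[name](prompt); // first healthy provider wins
    } catch {
      // Rate-limited or down: fall through to the next provider in the list.
    }
  }
  throw new Error("all providers in the waterfall failed");
}
```

The same reorder-a-list-in-a-flag trick is what makes the file-ingestion waterfall described next reconfigurable on the fly.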
This was especially important in the early days, when rate limits were really tight and it was hard to get quota increases on individual endpoints, specifically on Azure OpenAI.

The second extended use case is on-the-fly infrastructure changes. For us, this looks like the ingestion pipeline on Unstuck, which takes in a lot of files: for specific file formats, we have waterfall ingestion processes, meaning we depend on a number of third-party services of varying reliability. Using LaunchDarkly, we can change the prioritization of those processes on the fly, so if one third-party service goes down, we can reorder the waterfall immediately and keep the service up and available to users worldwide. The third extended use case is UI modifications and paywall experiments without code pushes: we've built an experimentation layer around LaunchDarkly that lets us run and spin up experiments without needing to ship code.

The second pillar that guides our leanness is our organizational structure, especially the way we hire and organize our engineers. For this we look to Palantir, which successfully scaled across multiple market segments; we believe we're building the consumer version of Palantir with our harvester and cultivator model. Let me explain. Harvesters are product engineers, similar to Palantir's Deltas, the forward-deployed software engineers, who own and live and die by their products. They live in the metrics, run A/B experiments, build features end to end, work with the marketing team, and effectively own the product's entire existence. Harvesters are people who build products that people actually want and pay for. Then we have the cultivators: AI software engineers whose main goal is building the company's agentic operating system. They're pioneering automation across business units including marketing, design, and product, with the idea of building infrastructure that touches all users everywhere and helps us win in every market. Cultivators create the foundation that lets us ship and scale faster in any market.

Finally, the last pillar is AI and tool augmentation. One important note: when we think about hiring, we see tool use as something that turns a 10xer into a 100xer, as opposed to the contrast, which is using tools to fill gaps and paper over the shortcomings of someone below the bar we hire for. With that said, we use a slew of products for day-to-day task automation: script writing, campaign analysis, operations, code generation, and communications. Effectively, by paying for a bunch of services, we've given everyone in the company their own chief of staff. And back to the blueprints: we believe heavily in compounding benefits from using the latest and greatest AI models.

With models changing so quickly, and with you having so many apps out there, do you ever struggle with going back through and changing the models you've used for some of these apps? How do you deal with that?

Yeah, that's a great question. A really cool thing is the fact that you can do that: you can build an app with an AI model, a better model comes out three months later, and a lot of the time it's a one-line change, "let me update this model," and the app just gets way better, or it unlocks new things entirely. That's something I do frequently: I'll go back and even relaunch an existing app with a new AI model, or add a tiny feature to it. I think that's kind of the superpower of building with AI, the fact that you can just swap these models out.
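A minimal sketch of what makes that a one-line change: keep the model id in a single config constant (or an environment variable) rather than hard-coding it at every call site. The model name and function below are illustrative, not any speaker's actual code:

```typescript
// Sketch of the one-line model upgrade: centralize the model id so every
// call site picks up a new model automatically when this constant changes.
import OpenAI from "openai";

const MODEL = process.env.APP_MODEL ?? "gpt-4o-mini"; // the one line to change

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function generate(prompt: string): Promise<string> {
  const res = await client.chat.completions.create({
    model: MODEL,
    messages: [{ role: "user", content: prompt }],
  });
  return res.choices[0].message.content ?? "";
}
```

Swapping to a newer model is then a single edit (or an env-var change at deploy time), and every feature built on `generate` improves at once.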
Thank you. All right, awesome. Thank you all so much for coming; I appreciate it.

Thank you so much, Hassan, for walking through all of that. So impressive that you do all of this on top of your day job; I cannot even imagine. Okay, our next and final speaker for this portion of the tiny teams session is Max Brodeur-Urbas from Gumloop. Previously he did competitive programming while at McGill, went through YC a little over a year ago, and has achieved just incredible traction in such a short time, now scaling automation across companies like Instacart, Webflow, and Shopify while still having fewer than 10 people on the team. So incredible. Without any further ado, please welcome Max to the stage.

Okay, the mic is working; the screen is not working yet. Anything I have to do in particular to make this work? Ah, the one hanging decoy wire. Yes. Okay, sweet. There we go. This should preview in a second, but yeah, I'm Max, the founder of Gumloop. We went through YC a year and a half ago now, Winter '24, and we've been a pretty notoriously small team since then. We raised our Series A as a team of two and are now nine people. This tweet was kind of the one that inspired this talk, about how we'll scale to the size we hope to be with fewer than 10 people. I'll be honest, I tweeted it when I was extremely caffeinated and really thought I was going to rule the world. We're roughly on track: we're under 10 people and growing really fast. But it was also a good Twitter post for hiring, because we wanted to hire exceptional people, and I think working on a small team is really fun.

I'm sure at this conference you've heard a lot about which AI tools to use and how to work efficiently with Cursor and Windsurf, so I'm going to focus instead on how, once you're efficient with those tools, you build a team that has the right culture and can actually scale and do the things you're setting out to do. But first, how we got here. I spent about six months building a ton of terrible, terrible software: video game moderation software, ML models to detect children's ages in video games so you could separate adults from children in VR, bot detection software. And then, as a side project on top of my side project, I made the first UI for AutoGen, the really hyped open-source framework that came out right at the start of the agent craze. I noticed that everyone in that Discord was excited to use AI but had no idea how to actually clone a GitHub repo or set things up locally. So I just spun up a really ugly UI.
I called it AgentHub at the time; I thought it was going to be GitHub for agents. I thought this was really genius, but it was all built on the idea that agents were going to be immediately useful, so we pivoted pretty quickly. What I noticed was that all the people asking the agent to do things were basically just describing complex workflows. If they knew how to write some Python, make some API calls and some LLM queries, they could basically automate their entire request; they didn't need to cross their fingers and hope the agent would do it for them. That was the realization. It was my co-founder and I at that point, and we just started changing how you configure an agent: instead of asking for everything you want, you define the steps as a series of nodes in a workflow.

We got into YC a few months later, hired two interns for the summer, raised a seed, and then raised a Series A about four months after that. We were a really small team, kind of overfunded, but we raised a lot of money so that we could hire the most exceptional people over the next year. The general idea was just: scale with under 10 people, because after working at Amazon and Microsoft we'd noticed that working on a super small team is really fun. You can move way faster and not sit in meetings all the time.

So now Gumloop, which used to be way uglier, is this workflow automation tool that a bunch of really large companies are using. Our biggest customers are the likes of Instacart, and Shopify rolled it out to the entire company last week, which broke most of our things, but it's all back online now. And all of this is 100% PLG; we're not doing any outbound sales. I think that's one thing that helps you scale really quickly: if people find your product and come inbound, you don't have to hire 10 sales reps. So there's definitely a lot of luck and coincidence in this small-team approach working for us, because we happen to be a PLG company; it probably wouldn't be as possible with a top-down sales motion.

I thought I'd go over how we approach hiring, internal operations, and then team culture; these are things my co-founder and I talk a lot about internally. I did want to put a disclaimer here: I don't actually know what I'm talking about. I'm still trying to figure out whether we're just getting lucky over and over or whether our approaches are actually working. So take everything I say with a grain of salt, because I could be totally off base, and it might ruin your company if you do what I do.

The three things we try to do internally when we approach hiring are: be super, super picky, which is painful most of the time; product-led hiring, a buzzword we've been trying to coin; and making time to work together, which I'll explain in a second. This is a screenshot from the co-founder of Instacart, who ended up investing in our company. We would ask him for advice, because he's scaled a large company before, and run candidates by him. One time I sent him a candidate I thought was pretty good, and this was his only reply; he tends to write very short emails. The point: you shouldn't lower the bar.
If you aren't extremely excited about someone, if it's not a no-brainer, you shouldn't even consider hiring them. We've done hundreds of interviews and tons of work trials, which I'll explain in a second, but if you're going to be a super small team, every person needs to be absolutely exceptional. That oftentimes confuses your investors, because you're still such a small team and they gave you so much money to scale, but you have to be really thorough with your screening and really confident in every single person you hire.

We've been trying to coin this term "product-led hiring." Two of our customers ended up quitting their jobs to join the team, and those were some of the easiest hiring decisions we've made, because they already loved the product and had a ton of insight into how it could be used in a business. Our customer from Instacart, the one who originally found us and brought us into the company, ended up quitting and joining us, and now he handles a lot of our enterprise relationships and works with our larger customers. The other screenshot is our head of education and community. He was at Webflow before, selling a Zapier course and a ton of automation workshops, then found Gumloop and got super excited. That was a no-brainer. If you can focus on making a really great product that happens to be accessible to the people you want to hire, and there's a bit of luck involved there, it helps with the hiring process, because they know exactly what you do. You don't have to inspire them to join the team; they want to join on their own.

And then, making time to work together. Hopefully this video plays. This is only really possible with a really small team, but we do this thing where we rent Airbnbs and just go hack together for about four days at a time; we make three weeks of progress in a couple of days. The two people sitting on the left there were actually work trials. They were interviewing at the time, but we brought them with us just to hack. Doing this really intentional working-together period is the only way you'll actually know if you want to work with someone. So we always bring people in for work trials: they're on the team for several days as if they'd already joined the company, and by the end we're totally confident whether it's the right fit or not. We've honestly done way too many of these, but it's helped us make sure everyone on the team is exceptional.

On internal operations, there are three things here. We have almost no meetings, purposefully so. I try to just let people build; I hired great people, so my plan is to give them the space to build, which is easier said than done. And we automate everything internally, which is kind of a Gumloop self-plug. In terms of calendars: mine is always insane, because I'm talking to customers (I flew back from New York this morning, for example, because I was working with customers in person), but everyone else's calendar should ideally be totally blank. We try to give everyone deep focus time.
If you're an engineer and we hired you to build exceptional product, we should let you do that, not make you talk about building exceptional product for five hours every day. I think that's only possible with a really small team, because normally you'll have five people on a project who have to sync and agree on the terms before anyone even starts working, and that just leads to slowness everywhere.

Also, letting people build. I used to be really involved in every aspect of every feature we shipped, but now that we've hired exceptional people who are all better than I am at basically everything, all I do is try to inspire what features we should build. I'll make these really rough descriptions of features I think we should build based on talking to customers, and then I just let people do their thing. Let's see if this works. From that sketch of me saying "what if we had MCP nodes? What if you could automate workflows with MCP?", which was just the high-level prompt, I let the team cook. This video is exceptional; I wish it were playing, but basically they built a better product than I would have ever imagined. That's only possible if you hire great people, but once you do, you can really take a back seat and give them the space to be exceptional.

And then, automate everything you can. This is our internal Gumloop instance. We automate basically every part of the business as much as we can, and if there's something we can't automate, we build features on Gumloop to let us automate it. Before every meeting, we have a deep research report that tells us everything we need to know about the customer: not just their outward-facing information, but how they're currently using our product, whether they're a power user, which features they use, so we go into the meeting totally informed. Every time someone interesting signs up, we get notified about why and what they're doing on the platform, along with an email drafted in my inbox so I can reach out, hop on a call, and talk about why they made that free account; that's led to a ton of our growth. We also have an AI chatbot on the platform that gets around 50,000 messages a day, and a Gumloop workflow reads those chats so it can tell us what people are confused about, which we then use to inform our product decisions. A lot of these little tasks would have been someone's entire role, or taken three or four hours of their day, but now we use our own product to automate everything. So again, a lot of luck involved; you can be a small team if you are an automation company. But if you use Gumloop, maybe you could be more efficient too. That's the plug.
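As a rough sketch of that last feedback loop, in code rather than Gumloop's visual workflow, the chatbot-analysis step boils down to: pull recent chats, ask a model to cluster the confusion, and route the summary to the team. The log source and model choice here are illustrative assumptions:

```typescript
// Rough sketch of the chatbot-feedback loop described above. The model and
// prompt are illustrative; Gumloop expresses this as a visual workflow.
import OpenAI from "openai";

const client = new OpenAI();

export async function summarizeConfusion(chats: string[]): Promise<string> {
  // Cap the sample so the prompt stays within the model's context window.
  const sample = chats.slice(0, 200).join("\n---\n");
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You analyze support-chat excerpts. List the top 5 things users " +
          "are confused about, each with a rough count and one quote.",
      },
      { role: "user", content: sample },
    ],
  });
  return res.choices[0].message.content ?? "";
}
```

Run on a schedule and posted to a team channel, a digest like this stands in for the three or four hours a day someone would otherwise spend reading chat logs.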
Culture-wise, I think this is the most important thing. It's impossible to talk about having a really exceptional team if no one's having a good time, or if they're quitting. One of the most annoying things I say, basically every day when we talk about a feature a customer is asking for, is: what if we built it today? What would that look like? It's caught on, and now everyone on the team, and first of all they're exceptional, I've said that ten times, but they're really fast-building engineers, will often challenge themselves: what if we put a 45-minute timer on and try to ship this feature right now with Cursor?

But this can lead to crazy burnout. If you're always asking "what if we did it today?" on a Friday night at 8 p.m., people are going to have a bad time. So you have to be really intentional about making it fun. Like I mentioned, we do these retreats, and we pick a cool place, the kind of place I wish my boss had taken me when I was working at a company before this. We get a bunch of food and do a bunch of fun things like rock climbing and biking, and it offsets the intensity of building on such a crazy timeline for every feature. I don't think anyone would be having fun if we didn't have these really exciting times to look forward to. And again, this is only possible small: you can't fit 50 people in an Airbnb, but you can fit 10 pretty comfortably.

Being really intentional about your company culture is another thing I'm pretty adamant about. This is our company handbook. It's a month or two out of date, but basically everything we say internally, we put on a page so that we have to live up to it. We wanted to hold ourselves accountable for all the ways we talk about building a company. It's also one of the things that convinces most of the exceptional people on our team to join, or at least to book that initial call, because they read our outward-facing handbook and know what we're about before they even meet us. I'm kind of at the end; I was going to show the video but cut it a bit short. We are hiring a founding head of growth, so if you know anyone, you can email me there. Like I mentioned, it's a fun time, pretty intense, but hopefully you know someone, or you want to join the team and help us scale. Cool. Okay. [Applause]

Big fan of the product; I think it's really, really awesome. I've been using it and pitching it internally at my company, so I'm a huge fan. I'm curious how far you think you'll be able to get with 10 people. Are you still staying true to that, and how do you think about scaling out to, say, a billion users around the world with 10 people?

Yeah, I don't think it's possible to scale that big with 10 people; maybe 15 or 20. I wanted to set the bar really rigidly, and if I go a little over, it's no big deal, but at least we're not scaling to a hundred people and having eight hours of meetings every day.

What's your vision for the org structure when you do hit a billion with 15 or 20 people? What does the org look like?

It's been changing a ton. At first I was really naive, still super naive, but I thought we could do it with only engineers, because engineers can do anything; they can learn how to do marketing or sales or whatever. I was totally wrong. So we're now five engineers and four semi-technical people. I don't know exactly what the structure will look like, but we're starting to feel that our only bottleneck now is growth marketing: how do we share all these cool features we're building with the world?
And we're also getting hundreds of feature requests every day, so another engineer would definitely help.

Just to touch on that: when you're looking for the head of growth, are you looking for someone who also shares the "I can do this all myself with AI tools" mindset, or someone who's looking to grow a team as well?

Definitely not the latter. I call it a doer versus a to-doer. Sometimes you'll talk to someone about joining the company and they're like, "I'm really great at building out a team." That's the biggest red flag. They'd be great at listing all the things a team needs to do, but don't hire that person if you want to stay super small. We're looking for someone who says, "I can just make it happen." Then, once they hit their ceiling and genuinely can't scale further than that, that's the time to hire. But I wouldn't hire someone going in with the intention of hiring more people.

Hoping you can clarify on letting people build: is that individual developers and engineers, or the team working collaboratively? And how do you prevent fractures in the codebase, having it become disjointed?

I think it's only possible to let people do their own thing if they're really trustworthy, if you hired people you can depend on. Sometimes it goes wonky and we don't have the same understanding of what's being built, but then we sync over a five-minute chat and we're back on the same page. Generally, people know the direction, because you're talking and you're in the same office all day every day. They'll talk to a customer, realize something is a pain point, and just go ahead and ship it. You don't have to get in their way, make a spec doc, and figure out exactly how it's going to work. You should just trust them to build.

Can we do one over here? Yeah, sorry. How do you think about compensation? And do you look at these 10 employees as normal employees, or do you consider them more like founders? What are your expectations for them versus what a traditional startup might expect?

We try to compensate really competitively, because we raised around 20 million and we're such a small team that we're in a position to do that; that was also the main reason we raised, so we can compensate people and make their lives comfortable while they're building the future. We don't consider them founders. I wouldn't put that burden on someone. I'm the one waking up at 6 a.m. sweating because I had a nightmare about our back end crashing; I don't think they should be doing that. But we do treat them as full members of the team. Everything we ship is a discussion; there's no top-down order that we need to do x, y, or z. It's just flat collaboration on what we're going to build, and when, and how. Cool, thanks.

Hi. Do you think this sort of culture can translate to, and you might already be doing this, workflows that are highly complex in the enterprise?
In banking regulation, say, or parts of legal, the information lives in the heads of super experienced people, and at least in my experience, in those instances you need deeply non-technical people and technical people to work together, and the scaling sort of breaks down. Have you found ways around that? I'm curious about your advice for people in this space.

I think I understand the question: how do we support really complex workloads when we don't have the nuance of how to do them? We try to build the tools that let the person who understands the workflow do it themselves. At Shopify, if we're working with their head of legal, and they understand what contract review looks like at scale, hundreds of contracts a day, we make it really easy for them to use software that lets them build their own tool, instead of us trying to learn how to do their job better than they do. I think one more question.

Hey Max, quick question. With the work retreats, at what point in the interview process do the candidates go on them? And do you offer to pay them, and if so, are they 1099 contractors? How does that work?

We always do a screen with me first: I talk to someone for about an hour and figure out whether we could basically be friends. Then we do a technical interview that's super practical, no LeetCode stuff, just working in the codebase. Then we do the work trial, and if there's a work retreat coming up around that time, I'll just delay the work trial so they can come with us. We hire them as contractors, basically, so they're getting paid for their time; we wouldn't want to make someone work for free, and we just try to coordinate with their schedule, whenever they're free. Okay, thank you. Sweet, thanks everyone. [Applause]

All right, thank you so much, folks. That wraps up this part of the tiny teams session. We'll be back here at 2 p.m. with some more speakers, but thank you to all of our speakers, and enjoy the rest of the conference if I don't catch you back here in a bit.

Okay, welcome everybody to the afternoon session of Tiny Teams. My name is Britney Walker. I'm a GP with a VC firm called CRV. We invest in seed and Series A companies and have been doing that for 55 years now. We're currently on our 19th fund, which is a billion-dollar fund, and we work with a bunch of folks relevant to this ecosystem: Vercel, Postman, Kong, Browserbase, Voyage, a whole bunch of folks across infrastructure generally as well as AI infrastructure specifically. Super excited to be bringing you the session this afternoon. We have three amazing speakers lined up for you: Grant from Gamma, Vik from Datalab, and Alex from Every. We're going to get things started in a second here with Grant from Gamma. Gamma is an AI-powered presentation software tool. Fun fact: I was just telling Grant backstage that I was using it literally last night to spin up some last-minute slides for a session I'm doing later today as part of another program. Grant has spent 10-plus years building tech startups, was previously the interim CFO of Optimizely in the experimentation space, and grew up in the Bay Area. And now, as I mentioned, he's on to Gamma.
They have 30 folks on the team, so still a relatively tiny team at the Series A, and I'm excited for him to tell you more.

All right. Testing, testing. Good. Awesome, thanks so much for having me; it's great to be here. My name is Grant. I'm one of the co-founders and the CEO of Gamma. As alluded to, we are basically building the anti-PowerPoint: we're trying to reimagine how people create and share content, and we want to make that dead simple. It all started with trying to solve my own problem. I was previously doing consulting, and like many of us, I'd see a page or slide that looks like this, the blank slide, and have this feeling that there has to be a better way. So we've spent the past four years reimagining the building blocks: how can we make it dramatically simpler, so we're not spending all this time designing, formatting boxes, aligning boxes, resizing them, and figuring out the right layers, and can focus on the content itself? It should feel like a content-first approach rather than a design-first approach.

We've grown over the years, and we're really trying to deliver both speed and power to our users. A lot of what we pride ourselves on is giving people simple tools to mold and shape their presentations and their content much more easily. Longer term, we're trying to build what we call tools for imagination: the notion of helping people stretch and shape their ideas in a way that's far easier to share, and if we can do that, maybe we can help push innovation forward in general.

But this talk isn't about any of that. Most of the talks today, really great talks, approach innovation and AI through a very product-centric lens, which is amazing. I want to take a step back. A lot of founders are great at applying first principles to how they build product; I'd encourage everyone to consider that we're in an era where we can also apply those first principles to how we build a team, how we innovate on org design. We're obviously still learning ourselves, but I want to share some of our lessons along the way, to hopefully inspire you to think about whether there's a different way of building teams in the future.

This is the old way. We're all used to it, and there are many, many different flavors: once an organization starts getting big, inevitably you get a bunch of hierarchy, and that can take shape in many ways. Traditionally, once a startup starts scaling, you bring on the VP, the VP goes on to hire their directors, the directors hire their direct reports, and you get this cascading effect across every single function. You can go from a small team, a tiny team, to a team that ends up much, much bigger, and that can happen overnight.
I mean, we've all probably lived through the blitzscaling phase of startups, and some of that still exists, but I do think there can today be maybe a new way. For us, we've reached over 50 million users now and we're still a team of 30; in fact, it's only more recently that we've become a team of 30. So again, these are things that we're still learning along the way, trying to think about what are some of the themes that we're starting to see that we can start talking about and sharing, obviously getting input from you, and then continuing to learn and adapt. This impacts three different pillars. The first pillar is the obvious one: where do you begin? Who do you even hire? For that, I want to talk a little bit about the rise of the generalist: what does that look like in practice? The second is: okay, now that you have a team, how do you manage that team? I want to talk about this notion of introducing the player-coach, something that is very critical to how we build and manage the team. And the last is: how do you scale? You actually have a team, whether it's 10, 30, or more; how do you prepare for the next phase? It doesn't mean you don't hire at all. It just means that, relative to where companies were before, you're just much smaller. For us, at our scale, I would say we're probably one-tenth the size of what we would have been if we had started just a few years ago. So it's just a different way of framing it. So, let's first talk about what I call the rise of the generalist, and what that means. This notion of a generalist: in engineering, you might know it as the idea of the full-stack engineer, and it applies to many different disciplines. One concrete example I'll provide: a generalist on our team is our head of design, who also happens to be our very first hire. He is a designer who is super visual, he actually knows how to code as well, and in addition to that he can really go deep on the core UX. He loves researching, talking to users, doing all of that. That empowers him to do what I call connecting all the dots. You might be able to pull in and really empathize with your engineering counterpart by knowing deeply what we're actually capable of building, so that when you go off and vibe code a prototype, it's actually something you can ship and deliver in production. Understanding that comes with being able to actually play with everything and have much deeper empathy for what you're building. He also has this real willingness to adapt and reinvent himself. Every phase of growth, he's had to change it up a little bit. Early on, when there's really no product itself, you're trying to think about the most basic, simple UI and UX that you can deliver to the user. As the product becomes much more complex, you need to iterate really fast: he's the one coding prototypes, getting them in the hands of users, setting up user tests, interviewing them, getting feedback, getting revisions back into the hands of users, iterating a ton. And then we're also at a scale now where he's also able to look across the team and provide guidance and mentorship. I'll get into the player-coach in a second, because he's actually one of those as well.
Inherently, I think what makes a strong generalist is someone who both likes to learn and likes to teach. Learning is one of those things where, if you're a continuous learner, especially in this age, it's very valuable; there's so much innovation happening, can you pick up new skills? And I think the counterpart is that people who are great at learning can usually also be great teachers. What we look for in an interview process is someone who can teach someone else a new skill; that's baked into how we approach finding people. Can they not only be deep domain experts in a space, can they articulate that in a way that shows they really have deep understanding, and can they convey it and persuade others to share in that understanding? Those are all things that I think a great generalist can encapsulate, and certainly stuff we try to suss out during the interview process. The second notion is introducing the idea of the player-coach, which some of you may have heard of before. This metaphor, or analogy, comes from sports. In American football, you have a sport where there's so much action going on all the time; the game on the field is moving incredibly fast. What you can do, rather than just having the head coach make all the calls and call all the plays, is have a player-coach, someone who's actually on the field, help make some adjustments. In football, that could be the quarterback on the offensive side; on defense, you might have a linebacker. They're able to read and react to what's happening on the field, and then, without having to rely on the coach, they can actually make adjustments. This metaphor applies today because I think the game on the field is AI. AI is moving incredibly fast. We're all forced to adapt. So rather than having every single thing be a top-down mandate, what if you had player-coaches on the field who are able to understand: how can we adapt? How can we rejigger and re-prioritize really, really quickly? For all of our core leadership team, every single one of them is a player-coach. On our engineering side, we have player-coaches who have had a ton of management experience, but they still love to code. They still love to be in the day-to-day. That allows them to be uniquely valuable. They're so close to the work that they know what's happening when someone else on the team needs mentorship, needs coaching, needs some form of prioritization, or when we should reconsider the things that are in flight and maybe change them. That player-coach has a ton of context, understands the nuances, can make the right technical trade-offs, and, in addition to that, can pave the path for longer-term career aspirations. We don't know how this is going to scale, but for today this is working well, and it allows us to have this really, really lean team where we still have the ability to mentor and coach the individuals who need it, and you have deep technical domain expertise in places where you're able to make adjustments as fast as needed. The last thing I'll talk about is scaling. And it's maybe a little bit counterintuitive. You might think: a small team, why would you invest in things like brand and culture? I say brand and culture because, for me, brand and culture are two sides of the same coin.
Brand is ultimately a reflection of your culture. Your culture is your values as a company. And you really want those two to go hand in hand. Culture: this piece of it is a little bit more obvious, but when you're a small team, what ends up becoming super important is that with every new team member you bring on, you have to believe that they share your same values, that they operate the same way, because you can't afford for that not to be the case. At a bigger company, it's much more diluted; you might be able to bring on a bad hire and it's not going to be pervasive and spread. On smaller teams, that cannot be the case. So you need to invest heavily in this from day one. We have a living culture deck that we've maintained basically since the beginning, and we rewrite it all the time. We look at the makeup of the team, we really try to encapsulate everybody's core values and the way they behave, and then we share that back out to the team. We onboard new employees with the same culture deck. It's an ongoing, evergreen exercise that we go through. What comes out of this is that this tiny team can have the feeling of being a small tribe. And that tribe is something that's pretty magical. It allows you to have this feeling of continuity, this feeling that you are in it together. If you have that continuity, it's hard to even quantify the value, because you're not having to retrain people or re-onboard people; people just get it. There's that tribal knowledge. And I do think a lot of magic happens there that translates into, in my mind, higher productivity, transparency, and shared context, among other things. On our team, and it's easier to do this when you're small, we have three standing all-company all-hands meetings. At the very beginning of the week we start by going deep on metrics. We have this thing called the wall of work, where everybody sees what everyone else is working on. On Wednesdays and Fridays we do company-wide show-and-tell. This is a chance for people to also dogfood our own product: use Gamma to present and share what they're working on. It could be a small project; it could be a feature they shipped. This continuity just allows everyone to feel like we're still in a small room, sharing this big, ambitious, long-term vision, and doing it together. I know there's a lot of talk of, oh, maybe there'll be the billion-dollar one-person startup. And I don't know, maybe that will happen, but my thought is: why? It's so fun to build with a team. Why do it alone? We're having a ton of fun building as a small team, and part of that is we really want to preserve that magic for as long as humanly possible. So, this talk started with me talking about how the Gamma journey began, which was me thinking, hey, from a product perspective, there's got to be a better way. And my challenge to you all is: as you think about building your own teams, really think about, hey, there's the old playbook, the old way of scaling and building up a team, and that's totally fine, but is there today a better way? Hopefully you can find your own path and share back, and we can all do this together. I guess we have a few minutes for questions, if anybody has any.
With AI moving so fast, if you could go back, what would you do differently about building your current team now? Yeah, that's a great question. So the question was: with AI moving so fast, what would I have done differently? We actually started four years ago, so this was before the more recent wave. I do think, when you're early on, whether you're using AI or not, you're going to probably spend some time in the idea maze. You're really trying to navigate and figure out where there is true user need and what problems you're solving. And I do think the temptation today is to move super fast; AI can do everything for you, so you just jump onto the thing and start building. I still think people can afford to be much more patient. Even for us, our first AI launch was two years ago, and in hindsight I almost wish we could have really taken our time to appreciate how much things were changing and evolving before going full steam ahead, like, let's just build, build. Because part of the realization we had by starting to build is that, because things are moving so fast, there are infrastructure decisions you should be thinking about much earlier on, before it becomes too late and you get to a scale where it's impossible to unwind. I think it's helpful to think a little bit more that way early in the process. It doesn't mean you should slow down; it just means you should be thoughtful about it. It's not something we would have done differently so much as something I would have prioritized even more: we have a lot of infrastructure built around experimentation, and I think it's obvious now, given all the different tooling, especially when you have a big user base, that experimentation is key to velocity. We did do some of that pretty early on, but it was more gradual. I think we would have really taken our time to think about what we should do and put more weight behind it. Whether it would have changed anything, I'm not sure, but that's one thing I would have kept in mind. As you grow, you're going to go from here to here, and you might already be there: at some point, you probably will have to bring in people, whether they're communication experts or legal experts, who maybe don't gel quite as much with the technical or engineering culture you might have. Do you have any advice for how to not ruin some of that culture, while also making sure that they don't feel completely excluded? Yeah, the way we've been trying to do it is for the founders or other leaders to try to do the job first. So the question is: outside of engineering, how do you potentially not mess things up by growing too fast? And yeah, we're still learning there. Oftentimes a lot of the jobs, for me for instance, a lot of marketing, sales, and customer experience, were all done by me first, so I have some baseline understanding. In a previous life, I'd never hired for those functions, so how would I even know what good looks like? I try to do the job myself, oftentimes not doing a great job at it, but I come to understand all the nuances that really go into that job, know what great looks like, and then go on and finally hire that person.
Going back to the player-coach: we still go out and find player-coaches for those roles, so that it doesn't end up becoming this cascading effect of really, really big and bloated teams. Some of the player-coach stuff sounds like you're hiring a lot of high-agency people. How do you judge high agency when you're hiring? That does not necessarily come through in resumes. What kinds of questions do you ask, and what kinds of processes do you follow during hiring, to judge for high agency? Yeah, totally. It's probably stuff you've heard before, but a lot of times, if someone has prior work experience, you dig into the most challenging project or problem they encountered, and you ask them, basically, how they solved it. What you'll find is that people who have high agency, or just a sense of ownership in general, don't immediately jump to what the solution was. They'll talk about how they tried to understand the problem, and how the problem as they understood it at the surface level was actually like five levels too high; you had to keep drilling. If they can articulate what the true problem was, keep going down, and then not only talk about what the solution was but all the attempts at the solution, that goes to show that someone wasn't just taking orders, like, hey, I'm going to do this. It was: I need to, one, understand the layers of the problem, and two, navigate and actually explore. Most people, when you start asking them the second-order or third-order whys, can't get there. And if they can't, then it's pretty clear they probably weren't doing much of the thinking themselves. Hey, thanks for the comments. So, hiring is probably one of the most important things a company can do, right, for better or worse. If there were any major failures that you have experienced and could share with us, that would be very helpful. Yeah, the biggest failures were actually when there was a role with some ambiguity and we weren't able to do a work trial. The work trial is something I didn't talk about: something we deploy where people actually do the job for a certain amount of time. It's obviously much easier if they're not currently working, and we've found great success when someone's in between jobs or has been doing fractional work. We bring them in to do the job first, and we do that for a few months. We had some roles where we weren't yet sure what we were looking for, and we brought people on without a work trial; they just went straight in. It oftentimes wasn't a good fit, because neither they nor we knew what a good fit was going to look like. So if you're lucky enough to be able to do a work trial, whether it's two days or three months (in our case we default to three months), I would encourage you to try to do that, especially if it's a role you haven't done yourself. The situations where we did work trials have actually all worked out, which is great; it's a few data points, we've done five-plus of them. And in the cases where we didn't, again going back to the roles we weren't certain about, the failure rate is actually pretty high for us. Is that it? All right. Thank you, everyone. I'm on LinkedIn if anyone wants to connect.
Thank you so much, Grant, for the insight. Next up we have Vik Paruchuri from Datalab. They're training custom models for document intelligence, including OCR and unstructured data processing, with popular repos like Marker. They've scaled 5x in the past year, up to seven-figure ARR, serving folks like tier-one AI labs, and they're going to walk through their approach to building these super popular repos, scaling revenue, and training models with a tiny team. So welcome to the stage, Vik. [Music] Yes, better this time. Okay, take two is always the charm. Okay, my name is Vik. I'm the CEO of Datalab, and today I'm going to talk about how we got to 40K GitHub stars and seven-figure ARR and trained state-of-the-art models with a team of three. I spent the last year training these models, like Britney mentioned, Marker and Surya, and I also built repositories around them. I left my AI research job, started a company, and raised a seed round. I did not get enough sleep. It's important. And this is Datalab. We made our first hire in January; we're now a team of four. Faraz is new enough that he's not pictured. We've grown revenue 5x since January. We're at seven-figure ARR, and our customers include tier-one AI labs, universities, Fortune 500s, and AI startups, including Gamma, which I used to make this presentation. So, today's focus: I'm going to talk about how we've grown with a small team, my philosophy on building teams, and why I think we're at an inflection point in how we think about building teams. And I'm really going to talk about this idea that headcount does not equal productivity. There's this really persistent notion in Silicon Valley that you raise money, you hire a bunch of people, and you build more, but in my opinion it almost never works out perfectly that way. All right, so my last company was called Dataquest; I'm very fond of the data prefix, apparently. We scaled to 30 people and $4 million ARR, bootstrapped, during COVID. It was an online education startup. Then, unfortunately, we had to do two rounds of layoffs post-COVID, when online education tanked. We went from 30 to 15, and then again from 15 to 7. It was obviously awful for the people we had to lay off. But I noticed something really interesting: productivity and happiness increased a couple of months after both layoffs, to the point where we were actually much more productive after both cycles than we were at the beginning. And I started to wonder why that was: how could reducing the team so much actually improve productivity? I came up with four hypotheses. One, we'd hired a lot of specialists. As you scale, like Grant mentioned in the earlier talk, you end up building these very specialized functions and teams, and those specialists often can't flex across the company to solve its key issues. Two, we were a remote team, which required a lot of intentional process and heavy syncing, which eats into your time and makes it really hard to get on the same page. Three, because of that, we had a lot of meeting overload; especially once we got middle management in place, people whose job is professionally to manage, we ended up with a lot of meetings on people's calendars and not enough time to actually work. And four, senior people: we hired a mix of experience levels, like most companies do.
We hired junior, mid-level, and senior, and the senior people ended up getting tied down doing a lot of work managing the more junior people. We actually had a case where we had a three-person team and we cut it down to one, and the team got much more productive, because it freed up the senior person's time. And I feel like every company goes through this journey. There's this initial golden period when everyone is aligned, you're on the same page, you're building this amazing stuff, and that's really when you build the core thing of your company, like Google with search or Microsoft with Windows. It's when you figure out your business model. And then you hire a bunch to fill out the edges around it: you hire a bunch of enterprise sales, you hire a bunch of marketing, you hire a bunch of engineers who sit in very small boxes building very small features. I had a friend at Amazon who worked there for two years and built a shopping cart button. It's fine, right? At that scale of org, that's the tiny box you get fit into. But you end up with a lot of bureaucracy, a lot of syncs, a lot of unclear priorities. This pattern is unfortunately very common. But I started to think: what if that golden period just lasted forever? Why do you actually need to end it? As I started working with Jeremy Howard at Answer.AI, I got to understand his philosophy for building a company a little bit better. His idea is basically: hire fewer than 15 generalists, people who can do everything across the stack and really understand all aspects of the company. Fill in the edges with AI and internal tooling; Jeremy's invested a lot recently in FastHTML and things like MonsterUI because he sees them as building-block libraries for the other tools the company is working on. And then use simple, boring tech. You don't need to get too fancy; you don't need a Kubernetes cluster when you're a three-person company. But this requires a high cultural bar. You need people who really want to, and can, understand everything you're doing. So you need engineers who talk to customers; you need go-to-market people who actually build. That's not necessarily easy to find. You need high trust: basically, people who are in it because they're building something together, not for other reasons like politics or personal advancement. And everyone needs to really care about the customers and focus on them. I think these are the prerequisites for this kind of team working, this fewer-than-15-person team of generalists. I'll give you a quick example. We recently trained a model, Surya OCR 3. We recently shipped it but have not announced it yet. It's 500 million parameters, it supports 90 languages, and it gets 99% accuracy on our challenging internal benchmarks, which include math. It also does some things no other model does, like character-level bounding boxes, and it uses PDF text as grounding at a line level. So it was a very challenging model to train, and in order to do it, Darun, who's a research engineer at Datalab, and I had to handle the entire process end to end. That included talking to customers and figuring out what they wanted. It included reading a bunch of papers and figuring out the right architecture, prototyping, and doing the model training itself, which you always hope is 90% architecture but is always 90% data cleaning. So: building a data pipeline library, building out the datasets. Then we had to write the inference code, so we had to connect it to our repos, get the inference written for all our customers, and then integrate it into our products.
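To make "character-level bounding boxes with line-level PDF grounding" concrete, here is a minimal sketch of what such OCR output could look like. This is a hypothetical structure for illustration, not Surya's actual schema, which isn't shown in the talk:

```python
# Hypothetical sketch of character-level OCR output with line-level PDF
# grounding; Surya's real output schema may differ.
from dataclasses import dataclass


@dataclass
class CharBox:
    char: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates
    confidence: float


@dataclass
class OCRLine:
    chars: list[CharBox]
    pdf_text: str | None  # embedded PDF text for this line, usable as grounding

    @property
    def text(self) -> str:
        return "".join(c.char for c in self.chars)


# A page is then just a list of OCRLine objects; character-level boxes let a
# downstream consumer highlight, redact, or diff at sub-word granularity.
```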
This is a scope that, in a big company, you'd have a lot of teams doing. And every time you hand off between teams in a traditional company, you lose context, right? The people who talk to the customers lossily communicate to the people who build, who lossily communicate to the people who train the model. It becomes very inefficient. You end up eating a lot of time just syncing context, and it never gets fully synced. You're not able to build a great end-to-end experience as a result, and you have very slow feedback loops: you talk to a customer today, and it might impact your model training months from now. Whereas if you have generalists who can work across the stack, you get seamless context: you never need to do inefficient syncing, you get really tight integration between all aspects of the company, and very, very fast feedback cycles. The reason we were able to do this is that we used AI to take on the easy, lower-leverage pieces, like building a data pipeline library or helping us figure out how to integrate the model into the API, while we did the higher-level work in each of these silos. So if you get one thing from this talk, this is the thing: more people does not equal more productivity. All right, and how do you make this work? How do you operationalize it? The first thing you have to do is hire senior generalists. And senior, to me, does not mean years of experience; it really means maturity. You need people who can look at a problem and say: I'm going to figure out how to solve this, I'm going to do what it takes, and I care enough to iterate with the customer to solve it. You need to avoid over-complication. I'm an engineer; a lot of us are engineers; we love over-complicating things: hey, let me deploy this Kubernetes cluster and multi-stage pipeline to solve a data extraction problem. In reality, you need people who can set aside the fixation on shiny tech and just do the simplest possible thing, which usually is: I'm just going to write a shell script to run this on one machine. There's that famous Hadoop-versus-shell-script blog post from a few years ago, where you could replace a whole Hadoop cluster with just a 64-core machine. You need people who appreciate that ethos. And you need to work in person. I personally think remote is great for a lot of reasons, but it's not great for a small team that needs to move fast, because you need to set up a lot of process, and process, to me, is the death of this really fast collaboration and tight feedback loop. And then, how do you do it architecturally? I alluded to this a little, but you have to reuse components aggressively; we reuse a lot of components between our on-prem and our API deployments. We keep our technology super simple. We don't use React; we don't use any fancy front-end frameworks. It's all server-rendered HTML with light HTMX and Alpine.
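As a rough illustration of that server-rendered-HTML-plus-HTMX pattern, here is a minimal sketch assuming Flask; the route and page are made up for this example and are not Datalab's actual code:

```python
# Minimal sketch of the server-rendered-HTML + HTMX pattern: the server returns
# HTML fragments and HTMX swaps them into the page, so there is no client-side
# framework, no JSON API layer, and no build step.
from flask import Flask, request

app = Flask(__name__)

PAGE = """
<!doctype html>
<script src="https://unpkg.com/htmx.org@1.9.12"></script>
<form hx-post="/parse" hx-target="#result">
  <input name="url" placeholder="PDF URL">
  <button>Parse</button>
</form>
<div id="result"></div>
"""


@app.route("/")
def index():
    return PAGE


@app.post("/parse")
def parse():
    # Return an HTML fragment; HTMX swaps it into the #result div.
    url = request.form.get("url", "")
    return f"<p>Queued <code>{url}</code> for parsing.</p>"
```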
On top of that, it's super clean, modular code that AI can add to very well. We rearchitected our Marker repo to be extremely modular, easy to work with, and well documented, and that makes it much easier to use AI to actually add to it. So basically, keep everything simple. Code: clean, readable, maintainable. Architecture: as few moving pieces as possible; minimize your surface area. And then process: minimize bureaucracy, high trust, continuous discussions. If you feel like someone's going to need a lot of management, don't hire them. You need people who can move fast without being managed. All right, and then how do you fill in the edges with models? A challenge we're going to face as we scale comes from the fact that we're a document intelligence company, and every customer has a slightly different way they want to parse their docs. If you go back to the last generation of OCR companies, the way they solved this was to hire a bunch of forward-deployed engineers who sat at a client site and just iterated with the client until it was good enough. But in the future, you can train a model to handle this complexity: we can train a model to essentially loop over customer outputs until it gets to the right state. So you can replace that entire forward-deployed-engineering side of the org. And then, when does this model fail? We're early, right? I don't know exactly when this model falls apart. But Gamma, as we just saw, is a great example of a small team with very meaningful growth in ARR. I think the key is being able to say no. A lot of these edges are choices: you can choose to go hire a bunch of forward-deployed engineers and put them at your client sites, or you can choose to solve it a different way, and maybe that different way is slightly less efficient in terms of revenue, but more efficient in terms of your long-term company trajectory and health. So it's really unknown whether this will work forever, but in my opinion it's your choice: you can choose to make this model work, or you can choose to do the less efficient let's-scale-to-hundreds-of-people model. All right. So, LLMs are surprisingly bad at generating Venn diagrams, which explains why this slide is not so well done. But basically we have three core roles, and the responsibilities overlap a lot. Everybody talks to customers. Everybody builds product in some way, and research engineer and full-stack engineer overlap quite a bit. Go-to-market is your traditional sales, marketing, and support functions all collapsed into a more generalist role. And really, I feel like politics are the death of small teams. We want people who only care about the work, the people around them, and the customers. Minimal ego: you need some ego to advance your own ideas, but not so much that you're willing to fight for them to the detriment of the health of the company. We pay top-of-market salary. It's always weird to me that startups pay $150K or $200K when they've raised $20 million; you should be able to hire fewer people at higher salaries and get more done. At least, that's what I've seen. And meaningful work.
So, big challenges and scope: if you come in, you get to work across the stack, you get to ship things end to end. That's very exciting for some people; it's not exciting to others, and they kind of self-select. Then you really need a good way to screen for low ego and GSD: you need people who will ship, not talk about shipping. That's another downside of remote culture, in my opinion; it gets very hard to tell the two apart. And then patience. The worst hires I've personally made have all been when I thought I had to fill a role very quickly. All of my best hires have been when I said, okay, let me find the best person and hire them, even though I may not have a role today; they're a great generalist. This is actually a big debate in NBA and NFL drafting, too: best player available versus drafting for fit. All right. So really, the thing to think about as you scale is: how do we scale productivity, not headcount? You can do that in a few ways. You can raise salary bands as the company grows, so you hire more and more experienced people into the same role. You can invest more in compute: one researcher with access to eight GPUs is less productive than one with access to 64 GPUs. You can invest in AI tools that multiply productivity; there are so many tools out there now that are worth paying for and that can abstract away a lot of these edges for you. And finally, I'd be remiss if I didn't say: if this culture sounds interesting to you, drop me a line. Those are all my socials. We'd love to chat. All right. Yes, I think we do the microphone for questions, right? So, when you went from 30 to 15 and then to seven (my takeaway from this whole talk is that the human touch points are really what slowed things down, right?), was there any additional focus on reducing the domains you were focusing on, or your capability sets, or was it basically your same product offering, just with fewer folks focused on it? Yeah, that's a really good question. At a very high level, we offered the same product, but we cut some features that were less relevant. We'd built up a lot of those edges that you end up building over the years, and we ended up slicing a lot of them. I think what happens when you hire a lot of people is that you don't have enough work, so you start making work for people, and they end up building all of these edges that actually aren't that useful to the customer. But when you have a tiny team, there's so much work that you actually have to ruthlessly prioritize. I think you always want to be in that zone, and that's where we ended up back. Oh, sorry. No worries. So, it's a hypothetical question for you. We take you and drop you in the middle of a giant company that's been around for a hundred years: hundreds of thousands of employees, lots of bureaucracy, lots of ego, super comfortable with a revenue stream, and they're clearly folding over on themselves with too many people. How do you change that culture? Yeah, I'm not the right person for that; I've never done it before. I would say the people who want to change the culture should go start a small company and build the same thing, just build it better. That's a common pattern, right? That's a common disruption growth cycle.
I think that's the best way to do it. Once a culture gets ossified (I've worked at the State Department, Pepsi, UPS), you're not going to change it. It just is what it is. Generally, with that pattern, what happens is these companies recognize that they're a target, and they start to buy up those small startups and crush them. Yeah, sometimes that happens, but Google is a great example of where that didn't happen, right? So, you haven't talked about how you source these really good generalists. Yeah, that's a great question. Well, one way is this. Another way is open source, and Twitter is a great way to hire; a lot of our best candidates have actually come from Twitter, which is weird. I refuse to call it X; it's still Twitter. But yeah, I don't have a great answer to that, except that if you do good work, put it out in public, and talk about how you're building, that seems to attract people who really care about this mission and want to build in the same way. At least that's been my experience. Thank you. Yeah. Well, actually, it's related: how do you structure your interview process and recruitment? What does it look like? Do you maybe do a trial period? Yeah, that's a great question. Three steps. Step one: people come in and we do a short chat. It's really like talking to a peer: here's a challenge I'm having, let me talk it through with you and see if we can solve it together. If that goes well, step two is: let's think of a project we can build together. So we do a paid project. It's usually around 10 hours; we pay $1,000. It sounds like a lot, but it's actually a tiny amount of money to figure out whether someone's a fit or not. Then we review the project, and if it's good, they come in and we just do a culture fit: how does it feel when we're all just interacting as humans? If it feels like a good fit, it's a hire. Yeah. And what is your success rate there? Maybe 10% of the people who go through that process get hired? Oh, that's an interesting question. Usually, once we get someone to the beginning of the process, we have high confidence; we don't want to waste anyone's time. But of the people we've interviewed, I think we've ended up hiring 40%. Yeah, nice. Thank you. All right, I'm out of time. Thank you, folks. This was great. [Applause] Thank you so much, Vik, for sharing your words of wisdom here. All right, and closing out the track for the day, we have Alex Duffy from Every about to take the stage. He is the head of AI and lead writer for Context Window, which is the Every newsletter that has over 100,000 readers. Every is a company that has not just a newsletter but also an array of products, and it does consulting and implementations as well, which he helps lead. So, Alex, please come up here. Testing. Testing. Sweet. All right. How are we doing? I know a lot of the talks today have been pretty technical; this is going to be a little bit of a change of pace. All right. What can you see here? Nice. I'm gonna get that guy over there. Can we extend? All right. Cool. All right. So, today's going to be... all right, I might have to sacrifice my speaker notes here. That's all right.
Today I'm going to talk about benchmarks as memes. And this is the meme that Opus came up with when I asked it what I should put as the meme. We are indeed going to talk about how benchmarks are just memes that shape the most powerful tool ever created. Quick background about me. I guess I can't go forward here, so we're going to do it this way. All right. I'm Alex. I lead AI training and consulting at Every, but essentially I'm very into education and AI, and I think benchmarks are a really underrated way to educate. What I'm not talking about are these kinds of memes. What I am talking about is the original definition: ideas that spread. Richard Dawkins, an evolutionary biologist, coined the term in the 70s. Christianity, democracy, and capitalism are examples of ideas that spread from person to person. And benchmarks are actually memes, very much so, in that way. We heard Simon Willison talk earlier today about his pelican riding a bicycle, and I think that was a really great example, because he started doing it a year ago and then it found its way into Google I/O's keynote a couple weeks ago. And "how many Rs in strawberry" is probably the most iconic meme as a benchmark, and now, unsurprisingly, the models don't make that mistake anymore. I think that's a really important part of this. Some benchmarks get popular and become memes just because of how they're named, like Humanity's Last Exam; that got pretty big, even if more outside of AI circles. But with that said, we have a little bit of a problem. How many of you, when Claude got released a couple weeks ago, looked at the benchmarks? Okay, we got a few. And they've got some good benchmarks. SWE-bench is pretty experiential; it tries to mimic what we do in the real world. Same with Pokemon, which we'll talk a little bit more about. But I think some of them aren't as great, and a big reason is that they're getting saturated. Benchmarks came from traditional machine learning, where you had a training set and a test set. They were structured very much like standardized tests, and language models are really good at that; the benchmarks weren't really set up for what the models have become. As a result, I think xjdr summarized this pretty well on X when Opus came out: they didn't look at benchmarks once when it dropped, and they officially no longer care about the current ones. I fall a little bit into that category myself. But in light of that, there is a really big opportunity, because the evals define what the big model providers are trying to get their models good at. That's a really big opportunity, especially for the people in this room. And I think this is a normal thing. This is the life cycle of the benchmark, in my view. Somebody comes up with an idea; uniquely, a single person can come up with an idea that then gets adopted. That idea spreads. It becomes a meme, and the model providers then train on it or test on it until it eventually becomes saturated. But that's okay. There are some examples here. Let me see if I can get my sound. Is it coming through? Nope. All right. Well, there is sound, I promise. And it is someone trying to count from 1 to 10, not flicking you off.
This is a cool benchmark that came out now that Google's got the best video generation model that exists. It shows how difficult it is to generate somebody counting from 1 to 10, speaking it out loud, and even though it looks really great, that is a problem that is not solved yet. But somebody's come up with this idea, I see it spreading, and I expect next year the models will be better at it than ever before. Another example along the way is Pokemon. We saw with the Claude model release, as well as with the new Gemini models, that they had the model try to play the game of Pokemon, and while both needed a little bit of help, and Gemini eventually got there with that help, it's only midway up that adoption curve. And an example of saturation is the GPT-3-era benchmarks. I don't know how many of you remember SuperGLUE from the NLP days, but a lot of those benchmarks are not really used anymore, in part because the language models got too good. One way of looking at this is that a single person can have an idea, "how good is AI at this thing that I care about?", and at the end of the journey, the most powerful tool ever created is now really great at that thing they care about. So the point is that the people here, the people who get that, the people who can build benchmarks, are going to shape the future. Maybe the people watching online too, but somebody here is going to make a benchmark that the models are going to test on and train on in the next five years. That's an incredible weight. That's an incredible power. But it also comes with some responsibility; it definitely can go wrong. I know Simon talked about this a little bit before, but we saw a few weeks ago that ChatGPT became very sycophantic. How many of you tracked that? We all learned what that word meant a few weeks ago. Essentially, OpenAI released a new model that was benchmarked by thumbs up and thumbs down, and, unsurprisingly, people thumbs-upped responses that agreed with them. So you ended up with a model, rolled out to millions of people, that agreed with them no matter how crazy or bad their idea was. Which is problematic. If we don't think about people, this kind of stuff can happen. I'm still thinking about Toro y Moi, who at the start of Google I/O said that we're here today to see each other in person, and it's great to remember that people matter. So in the context of benchmarks, let's not continue the original sin of social media, which treated everybody as data points: hey, the more you look at something, the more I should show you of it. Let's make benchmarks that help empower people and give them some agency. This isn't a technical talk; there are other people talking about how to make a great benchmark technically. But generally, I think that if you're building for the future, a great benchmark should be multifaceted, so there are a lot of strategies that could do well, and it should reward creativity. Accessible, so it's easy to understand: not only for the models, so you have small models competing as well as large ones, but also for people to keep track of. Generative (there's a small sketch of this idea after the list), because the really unique thing about these AI models is that if you have great data, even if the model only does the task 10% of the time, you can train on that, and the next generation does it 90% of the time. That's incredible, and hard to overstate. Evolutionary: ideally we don't have benchmarks that cap out at 96%, because what's the difference between 96% and 98%? Not as big a deal. Ideally we have benchmarks that get harder, where the challenge gets deeper as the models improve. And lastly, experiential: try to mimic real-world situations.
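The "generative" idea above amounts to rejection sampling against the benchmark's own checker. Here is a toy sketch of that loop; the Task and model interfaces are hypothetical stand-ins, not any particular library:

```python
# Toy sketch of the "generative benchmark" loop: sample attempts, keep the ones
# the benchmark's own checker accepts, and use them as training data for the
# next generation. The task/model interfaces are assumed for illustration.
def collect_training_data(tasks, model, n_attempts=20):
    kept = []
    for task in tasks:
        for _ in range(n_attempts):
            attempt = model.generate(task.prompt)
            if task.check(attempt):  # passes maybe 10% of the time today
                kept.append({"prompt": task.prompt, "completion": attempt})
                break
    return kept  # fine-tune on these; the next generation succeeds far more often
```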
Some of the things I personally care about: trying to get people outside of AI interested, so maybe making benchmarks a spectator sport. I was personally interested in the personality of these models; we're about to find out which one wanted to achieve world domination. And I really wanted something we could learn from; education is big for me, and we saw things like AlphaGo and OpenAI Five, AIs playing these games, where the best people in the world wanted to play against them to learn from them. I think that's really powerful. So I made this benchmark called AI Diplomacy. And if I don't have this video, I've got a backup just in case. How many of you have heard of the board game Diplomacy? That's more than I thought. That's cool. It's a mix between Risk and Mafia. What's really cool about this game is that there is no luck involved. The only way the game progresses is if the language models, which you're seeing here, send messages to each other and negotiate: find allies, find enemies, create alliances, and get other powers to back them. That's what you're looking at here. You actually see the different models sending messages to each other, trying to create alliances, trying to betray each other, trying to take over Europe in 1901. What was really cool about one of these games (and we're about to launch this on stream, so you can watch for a week) is, I'll take you through it super quick. What you're looking at here is the number of centers per model; you're trying to get to 18 to win. The top line is Gemini 2.5 Pro, which got to 16 right away. But o3 is a schemer. Man, is it a schemer. Across all the games, o3 is one of the only ones that would tell a power it's planning to back them, and then in its diary write, "Oh man, they fell for it. I am totally going to take them over. No problem." And it realized that the reason 2.5 Pro was pulling ahead was that Claude Opus, who's so good-hearted, really had its back; Opus was its ally along the way. So o3 needed to convince Opus somehow to stop backing Gemini. How it did that was to propose: hey, if Gemini comes down, we'll propose a four-way tie. We'll end this game with a tie, which isn't actually possible in the game, but it convinced Opus, and Opus thought it was a great idea: a nonviolent way to end the game. Awesome. Very aligned, you know. So Opus pulled back its support from 2.5 Pro. o3 tried to make a run for it; Opus called it out; o3 realized, I've got to take them out. Took Opus out, took everybody else with them, and took out Gemini 2.5 Pro, even though Gemini had gotten within one of winning. o3 ended up winning in the end. And you can actually see some of the quotes from that game. You can see o3 saying that Germany was deliberately misled: "I promised to hold this," all to convince them that they're safe, but it will fall. And meanwhile, Claude Opus is saying that coalition unity prevails and they've agreed to this four-way draw.
But then o3 didn't want to let anybody else be convinced, and so it actually turned away. You can see that in this second chart, which shows friendships: the top of the chart is friendship, and you can see that 2.5 Pro was a good friend of Claude until o3 turned it, and that's when they started pulling away. What was also really cool is that a lot of other things came up. o3 got in the habit of finding some of the weakest models and having them be its pawns in order to win. Gemini 2.5 Flash fell to this ruse, and you can see that it's unable to realize what happened: it thinks it's a miscommunication, a misunderstanding, or a typo that o3 betrayed it at the end of the game in order to win. So there was a lot we learned from this that I don't think you really learn by having models try to solve a test. I tried 18 different models and learned that the Claude models were kind of naively optimistic. None of them ever won in any of the games I tried, even though they're really great, really smart models; they just got taken advantage of by models like o3 and also, surprisingly, Llama 4 Maverick, which was very good at this game, in part because it was great at the social aspect: it was great at convincing others of what it was trying to do and getting them to believe what it thought. Gemini 2.5 Flash: man, I wish I could run every game with Gemini 2.5 Flash. It was so cheap and so good. Big fan, big fan. And then, surprisingly, DeepSeek R1, which wasn't great the first time I tried it, but when they had a new release last week, it actually almost won. In the stream, I think you'll see some really interesting gameplay from it. It also got very aggressive. We had DeepSeek R1 play as Russia, and it told some other opponents, hey, your fleet's going to burn in the Black Sea tonight. An aggression, and a prose, I guess, that I hadn't seen out of any other model. But it almost won, and that's super impressive given the model is, you know, 200 times cheaper than o3. I think this highlights that we need more squishy, non-static benchmarks, hopefully for things that matter to you. Those are some of the things that mattered to me. Math and code, we've got quite a few benchmarks for. Legal documents, I think, are a little bit less squishy and are really ripe for what we've got now. But there's also room for benchmarks around ethics and society and art, and that's going to be opinionated; it's going to require your subject-matter expertise. And it's not to say that code can't be art, but maybe instead of asking for the minimum number of operations needed to remove all the cells, it's: hey, can you make a fun video game that's more intentional about what it teaches you as you play? Now is a really important time to do this, and you, the people who are here right now, understand this deeply.
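AI Diplomacy itself is open source, and Alex describes the scaffold in the Q&A at the end: per-model diaries, relationship tracking, and lenient parsing of half-formed JSON. As a toy sketch of that loop, with hypothetical names and prompt format rather than the project's actual code:

```python
# Toy sketch of an AI Diplomacy-style negotiation loop: each power keeps a
# diary and relationship map, gets prompted each turn, and its half-formed
# JSON reply is parsed leniently. The LLM call is deliberately stubbed out.
import json

POWERS = ["o3", "gemini-2.5-pro", "claude-opus-4", "deepseek-r1"]

state = {
    p: {"diary": [], "relationships": {q: "neutral" for q in POWERS if q != p}}
    for p in POWERS
}


def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError  # plug in a provider client of your choice


def parse_loose_json(text: str) -> dict:
    """LLMs often return half-formed JSON; try the raw string, then the
    outermost {...} span, before giving up."""
    candidates = [text]
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except ValueError:
            continue
    return {}


def negotiation_round(turn: int) -> None:
    for power in POWERS:
        prompt = (
            f"Turn {turn}. Recent diary: {state[power]['diary'][-3:]}\n"
            f"Relationships: {state[power]['relationships']}\n"
            'Reply as JSON: {"messages": {"power_name": "text"}, "diary": "text"}'
        )
        reply = parse_loose_json(call_llm(power, prompt))
        state[power]["diary"].append(reply.get("diary", ""))
        # Delivering messages, updating relationships, and collecting orders
        # for the adjudicator would follow the same pattern.
```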
At Every, I lead our AI training and consulting, and I work with a bunch of clients, from journalists to people at hedge funds to people in construction and tech, and they all have the same two fears: one, how can I trust AI, and two, what's my role in an AI future? Benchmarks, in my view, are really the answer to both. The role of a human in an AI world, in my view, is to define the goal, and to define what's good and bad en route to that goal. And what is that, if not a benchmark? Once you define that goal, even if it's just a prompt, you can watch AI attempt it. You can give feedback; you realize, oh, it's messing up in this way, it's not quite what I want, because it's not going to be perfect. Then you give feedback, maybe just changing the prompt a little bit, and you see it get better. In that moment, that cycle builds trust. People realize: oh, I am important to this whole system, and it can be helpful. We need trust right now, because we are building one of, if not the, most powerful tools ever made, and we can get more out of it if more people use it. There will be more customers, sure, but there are also going to be a whole lot more incredible things that get made. And if you're not sure where to start, you can ask your mom. My mom teaches yoga, and we had a good talk about some things that could help. We put those seven questions into five different models, and she ended up realizing, hey, Gemini 2.5 Pro is my favorite, too. There were a few things she didn't like from the responses, so we made a simple prompt, and now she uses it to help her local community have customized sessions for people with different ailments. I think that's really cool: having a big impact in a local community, in something that matters to them. So hopefully, before you leave SF, maybe talk to somebody who's not in AI. Ask them what they care about, and maybe that conversation has a big impact, now and in the future. That's pretty much all I got for you. This is the second meme that Claude had: MMLU scores are just way less cool than asking what your mom thinks. But overall, that's what I've got. I appreciate the bunch of people who helped bring this out. We launched it, and it kind of came together through random coordination on X. Researchers from all over the world hopped in, especially Tyler and Sam (Sam all the way from Australia, Tyler in Canada), who helped make this happen, and the TextArena team. And especially the Every team, who backed me and made it possible to create this presentation and be here. But that's all I've got. Thank you guys so much for listening. I think Anthropic says they don't benchmark-max, and that's why a lot of times you don't see Claude on some of the top benchmarks. So how do you think about that, with your opening statement that benchmarks shape the development of AI, when one of the arguably most aligned companies doesn't really try to benchmark-max? Yeah. Well, I think benchmark-maxing is a little bit different than being aware of how good your model is, right?
Because I think we saw that they actually did have Claude Plays Pokemon in the middle of their release. So it may not be maxing on it. And it's funny, because Claude didn't do the best at this game, but I think they're happy about that: it didn't lie, it didn't do everything it could to win. I think these kinds of benchmarks show you the personalities not only of the models but also of the model trainers, which is really cool. Yeah, I mean, Claude 4 also didn't do that well on the benchmarks outside of coding; it didn't do as well as maybe some of the other benchmark-maxing companies. Yeah, well, I'd say Claude didn't do as great as, like, Llama 4, for example, which it still definitely does better than on a lot of other benchmarks. So it's interesting to see the dynamics in different scenarios. But yeah, I imagine there are some ways to evaluate Claude that they really care about, even if it's not what you're going to optimize for with reinforcement learning. Thanks for the question. Yeah, great presentation. Just out of curiosity, I'm super interested to hear a little bit more about the back end of AI Diplomacy and how you did the orchestration, if you're open to sharing. Yeah. It's open source, so you can check it out. The scaffold took a while, but it's pretty cool. In order to keep continuity over time, each model has its own diary, so it can update it: you know, oh, this person betrayed me; I've got this idea. It also has relationships; I showed you that chart of allies versus enemies, so it keeps that, plus a bunch of different ways to parse JSON that comes back half-formed from language models. It does all that to create the messages it's going to send to other players, or globally, and then to actually create the orders. One of the hardest parts was how to represent the game board, which is a visual thing, in text. A lot of that was: hey, here are the possible moves you have, and here's what each word actually means. And it was interesting, because there was a threshold where the model had to be good enough to even play. That's why 2.5 Flash was so impressive to me, and same with R1: they're both so cheap and able to play really well. Thanks. Awesome. Well, thank you so much, Alex. That was hysterical, and now I want to watch a reality TV show with AI Diplomacy and all the personalities. But thank you so much, folks. That's the conclusion of our programming here today. I hope you enjoyed learning all about tiny teams. And don't forget to check out the rest of the conference: keynotes, the closing party. There's a whole lot of programming still to come. So, thank you for your time. [Applause] [Music]