Shipping Products When You Don't Know What they Can Do — Ben Stein, Teammates
Channel: aiDotEngineer
Published at: 2025-07-28
YouTube video id: PthmdT92qNg
Source: https://www.youtube.com/watch?v=PthmdT92qNg
Yeah, I mean, the actual title has curse words in it. I will probably be cursing a lot. I didn't know if I would get into the track if I actually published the curse words. I'm one of the founders of Teammates. I'm going to wear my product manager hat today. I'm assuming this room is mostly product folks, probably product-minded engineers as well, but I'm going to just wear the product hat.

A little bit about Teammates very quickly. We make a platform for designing and managing an entire digital workforce. So in AI engineer parlance, we're building agents. But I would think of it as two ticks up from that, because what we really believe in is the experience, the interaction patterns of humans and computers working together.

So I want to talk to you about my favorite teammate. This is Stacy. Stacy hand. She actually got promoted since this slide. She's an L3 engineer right now on our team. She's awesome. She looks like a hamster. All of our customers get to design whatever teammates and avatars they want. They give them personalities. It's all really fun. And Stacy lives inside all of our collaboration tools. She has a Google Workspace account for Gmail. She has a Slack account. We truly leaned into giving all of our teammates identity. She sends emails, or I forward her emails, and she hangs out in Slack in the public channels. And she's Gen Alpha, which, I don't know, I feel really old. I don't know what she's talking about. She's constantly like "67" and I'm like, what are you talking about? And I can tell from this room that none of you have 12-year-olds. No? Okay, there you go. So yeah, you're rolling your eyes as well. But anyway, this is Stacy, and this is sort of how my sales pitch goes, right?
It's a little more formal than this, but this is generally the pitch. And I got asked a question at some point recently. Oh yeah, more of the pitch: she shares Google Docs, Google Sheets. And a customer said, "Hey, can I tag my teammate in a Google Doc comment?" And this gave me pause, because I had never actually thought about that before. In the back of my mind I'm like, well, of course you can. The question is, what's going to happen? So I'm doing math in my head. I'm like, okay, we don't have webhooks, so she probably won't get a webhook from the comment. But she's going to get the email notification that comes from Google. Does it have the comment and the context, or maybe a link? I have no idea, right? I actually don't know what's going to happen.

And this was the impetus for this talk. How do I ship a product? How do I develop a product? How do I talk to customers? How do I instill trust when I don't know what my own product can do? It's really weird. Sometimes I'm like, well, is this just because I'm an idiot? And since it's my talk, I'm going to say no. And sometimes I'm like, well, is this because what we're building is so far out there, right? These are truly autonomous agents that can use any... And I don't think that's it either. I think what's happening is that the product management discipline is going to undergo a transformation, a shift, an evolution, whatever you call it, that is super profound. And we may or may not totally realize it yet. Because I think in the engineering world we're like, oh, we have tools in our IDEs and we have codegen, and we're starting to squint at understanding how the discipline is changing.
I don't think we really understand how product development is changing and evolving: what are the new tools and practices, and how do we forget everything we've learned in the past?

Why is this true? If the answer is not "Ben's an idiot" and the answer is not that we're way out there, it's two reasons. Number one, if all our products are built on top of LLMs, and plus or minus they are, we don't know and we can never know what the LLMs know. So inherent in what we're building is that we don't know what the foundation is. You don't have to know how your database works, but you generally know the surface area, the interface that's exposed. We don't understand this for the models. The other thing is that the expectations from customers are just boundless. We're just like, hey, here's a text box. That's probably not a good interface, but essentially we're saying here's a free text box, and if it's anything other than a "help me write" button, you're inviting customers and users to do whatever they want. So we have this boundless surface area built on top of a product that we don't understand. And the question now is, how do we adapt?

So let me pick on this Google Doc comment thing for a second. If I was wearing my traditional PM hat, I'm like, okay, I need to make a feature that's going to read and respond to Google Doc comments. And in my head I'm like: does Stacy have access to the Google Doc? If she gets tagged in the comment, should she reply directly in the comment? Should she reply at all? What happens if somebody else comments in the thread? What if someone comments in the thread and it's not addressed to her? What if it's someone else?
What if it's her doc and someone else commented to someone else, but she gets the notification? There's just so much to think about and reason about. And I'm like, okay, I'm not building a Google Doc commenting product, so I'm not going to spec all of those things out. What's worse is you also probably want to tag her in Linear tickets, right? What's the book, "If You Give a Mouse a Cookie"? If you give a mouse a cookie, well, you probably want to tag her in Figma as well, and you probably want to tag her in LinkedIn posts. And we're not a team that's building a generic commenting reply agent system. So then the question is, what are we supposed to do as a product manager who realizes, okay, I have this boundless surface area? How does the practice need to change? This is the core of what I want to talk about today. I'll do three highfalutin ivory tower ideas, and then I'll talk through some practical ways to make this real.

The first one is the mindset shift to think in affordances and not in specific requirements. So it's not "as a user, if Stacy replies in the comment thread and she has..." That's not how we would think about it anymore. It's the affordance: she has affordances to comment, or to communicate, or to email, or to collaborate. We're going to trust the LLMs. We're going to trust the agentic workflow, the work planning, all of the things inside of our beautiful 12-factor agent. We're going to assume that it will understand. It's the affordances that we need to think about, not the individual features, which is really weird, and it's not how product people have typically thought before. And I would say this goes even further, which is that behavior is emergent.
And this was the other thing that I did not expect at all starting in this space: not only do we not know what things work, sometimes they do work, and they work in ways we didn't expect. So I feel like our job as product people is to discover functionality. What are the right building blocks, the right Lego bricks, that we give our engineering team, our product, our customers, and let them compose? And can we discover emergent behavior? That is one of the reasons this is the most exciting time I've ever built: we're actually building things and then discovering what they can do themselves. That became the new job in a sense, discovering what's possible.

Because if you asked me, I could not sit down in front of a Google Doc and type out what this thing should... I can't. I don't know how to do it. And even if I could, how do I then communicate it? How do we communicate to a development team, to a backlog, exactly what should be happening? Figma doesn't have the affordances for this. My PRD doesn't have the affordance for "you should probably talk a little bit less Gen Alpha because you're making Ben feel old," or "hey, you should be really..." How do we communicate and express these concepts? So I think these are the three high-level ways that our practice needs to change, but let's make it a little more concrete.

Okay, so evals. Let's talk about evals. It's really hard to make a slide with graphics of evals. I feel bad for the evals. How do you illustrate an eval? So I'm going to make you just look at pictures of various teammates from across all of our customers. Okay, who hates raising their hand at conferences when the speaker asks them to? Okay, awesome.
So here's my question. For the engineers here: who, legit, don't lie, writes and runs their evals? Good number. And of the product people, who has visibility into the evals? That's not bad. And do you look at them, just because you have the visibility? All right. One, one and a half, two. Okay, great.

So I would posit that evals... let me back up. We all talk about evals, and we're all going to be embarrassed to say that we don't really know what they are. Evals are a testing framework for probabilistic AI, for agents. Think about deterministic code: I withdraw $100 from the ATM, my bank account should have $100 less. Great. I can test that, and I can write code to test that. When the test is "was she snarky in Slack?", well, how do you test that? How do you write that test? So we come up with this whole new discipline of evals, which is: she should be a little bit snarky and a little bit funny, but not mean. And then we hand it off to another LLM to say, okay, did that reply meet that criteria, and how often did it? It doesn't have to be 100%. She should be pretty snarky but not mean 80% of the time, or whatever the business logic is that you want. Right?
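To make the pass-rate idea concrete, here is a minimal sketch in Python of an eval as a threshold over judged replies. It is an illustration, not the speaker's implementation: the names `run_eval`, `EvalResult`, and `keyword_judge` are hypothetical, the sample replies are invented, and the keyword check is only a stand-in for the second LLM that would normally act as the judge.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalResult:
    passed: int
    total: int
    threshold: float  # required fraction of passing replies, e.g. 0.8

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0

    @property
    def ok(self) -> bool:
        # The eval passes when enough replies meet the criterion, not all of them.
        return self.pass_rate >= self.threshold


def run_eval(
    replies: Sequence[str],
    judge: Callable[[str], bool],
    threshold: float,
) -> EvalResult:
    """Judge each reply and compare the overall pass rate to the threshold."""
    passed = sum(1 for reply in replies if judge(reply))
    return EvalResult(passed=passed, total=len(replies), threshold=threshold)


def keyword_judge(reply: str) -> bool:
    """Hypothetical stand-in for an LLM judge: 'snarky but not mean'.

    A real judge would send the reply and the criterion to another model;
    this keyword check only keeps the sketch self-contained and deterministic.
    """
    text = reply.lower()
    snarky = any(marker in text for marker in ("obviously", "bold move", "sure, boss"))
    mean = any(marker in text for marker in ("idiot", "stupid"))
    return snarky and not mean


# Invented sample replies, standing in for logged agent output.
sample_replies = [
    "Bold move deploying on a Friday. I'll watch the logs.",
    "Obviously I already fixed it, you're welcome.",
    "Sure, boss. Merging now.",
    "Obviously the tests pass. I wrote them.",
    "Only an idiot would ship this.",
]

result = run_eval(sample_replies, keyword_judge, threshold=0.8)
print(f"{result.passed}/{result.total} passed, rate={result.pass_rate:.0%}, ok={result.ok}")
```

With these sample replies, four of five meet the criterion, so the eval passes at a threshold of 0.8. A rule that must always hold would use `threshold=1.0`, and the same data would then fail.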
So these are evals, but here's what I would posit: it is the only way that we know what our software can do. Which is why I love the idea of product people looking at the evals, because they become the new specification for the product. As we're watching, you know, if you're downstairs in the expo gallery seeing new software, it's like, hey, bring the team in. This reminds me a little bit, for the old-timers here, of behavior-driven development. There was this period of time when it was, oh, the business people are going to write the tests, and that will get converted to code, and then the code will run. And the truth is, no one ever wanted to do that. I don't even know who a business person is, but they weren't going to do that. But I actually think this is different, and I think this is a pretty meaningful way to understand what the product can do and to begin to specify what it can do.

Okay, vibe coding for a second, which we all do and all talk about. I want to talk about vibe coding in a way that's really constructive. How do I say this? I was kind of like, oh, you can't do it in Figma, you can't do it in a PRD. What do I really mean? It's very hard to sit down in front of a blank piece of paper and write what the teammate, the agent experience, should be. It's hard to imagine it, and it's not until you feel it. So much of what we're doing in this human-computer interface is visceral. It's feel. It's like, did they ask too many questions? How many questions is too many? Oh, wouldn't it be great if they clarified exactly what you meant? Well, turns out that's really annoying. But when I wrote the first spec, I'm like, then the teammate should ask a lot of clarifying questions.
And we gave it to users and they were like, "This sucks." And I was like, how would I have ever known that? And the answer is that it's so easy to prototype and vibe code something and get the feels. So this is the next thing that I'm pretty excited about as a new product management tool: being able to feel and experience what it's like to interact with a computer without just writing it up or hoping that you have a clickable prototype that will work.

I will also mention that we have to be careful with vibe coding, because I do not mean sit in the meeting and say to the engineering team, "How come this is taking two weeks? I finished the feature during the meeting." That doesn't win you any points. So no, this is never going to production, but what it does is give you the feel, the experience. This is the only way I know to actually test and feel it out. Do you remember the Claude "certainly" issue? There was this period when every time you asked Claude something, it was like, "Certainly!" That probably seemed really good when you were testing it for the very first time. And then the fourth time, when you're like, "Hey, can you do my taxes?" "Certainly!" "Can you write my acceptance speech?" "Certainly!" Oh, this is actually really annoying. But you don't realize that until you experience it. That's why I like the vibe coding.

Okay, so great, we did all this development, and then the question is, hey, we pushed to prod, does it work? I'm like, I told you, I don't know. So the question is, how do you test? How do you know that it's going to do the things that you said it was going to do? I sort of alluded to this, so I'll go through it quickly: just really discover the functionality. And there's an old joke. I'll tell the joke.
QA engineer walks into a bar, orders a beer, orders two beers, orders zero beers, orders negative one beers, orders a lizard, orders a beer with an emoji. Great, this bar is good to open. And the first customer walks in, asks where the bathroom is, and the bar blows up. Great old joke. It's kind of how I feel these days. I just sit and I'm like, "Oh, you know what would be cool? If they were to start posting comments on LinkedIn. And what if every time I added a track to my Spotify account, they..." Just crazy ideas, but this is where the emergent behavior comes from. So it's this mindset of let's just try, let's just experiment. It's a kind of growth mindset shift from "I'm going to write the features and the requirements" to "no, we're going to figure it out."

This next one was a little bit unexpected for me: how do you report things to engineering and have them fixed by engineering, and what counts as a bug in this world? That is really, really strange. I don't know if it's just a product role or maybe a support role, but how do you know what is appropriate to escalate, to put onto the backlog, to flag as a bug? I'll keep picking on Stacy. She gives me a really hard time, so it's fine. It's like, "Hey, she used too many emojis. Put it in Linear." Well, it's not really a bug. Show me in the spec where you told me not to use too many emojis. In our tickets it's like, "closed, done," "closed, duplicate." We need something like "closed, LLMs be crazy, yo," as in, I don't know how to fix this, because it's probabilistically generated. So how do we know if it's right or wrong? How do you know if it's a feature or a bug? Right.
And I think there's this element of credibility that we need to build up. It's like, hey, we understand that for some use cases 80% is good enough. This eval, if it's passing 90% of the time, that's a go. If it falls below 90%, that's red and we're not going to ship it. So let me come back to evals for a second. If the eval becomes the spec, then we can say, hey, we said at 100%, even though this is probabilistic, you should never give a refund if a customer can't prove that they bought the thing, or whatever it is. Great, that is our metric, and we can say, yeah, this is a bug. But if it's just a feel, it becomes really difficult. Again, this was totally unexpected, that debugging and assigning bugs would become controversial.

Okay, customers. I found this part really weird. If I think about not wearing my founder hat, but wearing my typical product manager hat, I go into a customer meeting, usually with a salesperson, and I'm going to play a role. What's the role? Well, I'm either going to play visionary: here's our vision for the product, here's our roadmap for the future, let me help you understand, customer, how you're going to come along on this journey with us. Or sometimes I'll play the role of honest broker: listen, sales is selling you a bunch of vaporware, let me tell you what's real, let me tell you exactly what you can expect. And that's a role you play. I usually preface this with the sales team beforehand: yeah, I'm going to be the honest broker, and we'll give the customer confidence. Today I'm like, okay, I told you our vision for the future, our roadmap, and the customer's like, "You're full of ... like, none of this actually works."
I'm like, "Right, I can't really paint the vision, because no one actually believes it. It sounds like witchcraft." And then I'm like, "Oh, well then I'll be the honest broker and I'll tell you how things work, but I just told you I have no idea how it works." So this became very strange, because I can't play either of the roles that I'm supposed to be playing. The future sounds like witchcraft. The present is literally "I don't know."

So how do we do this? I'll tell you how I've been doing it now. I don't know if this is a 2025 answer or a durable answer. If we believe that all of our products are going to be probabilistic for all time, then we probably have to figure out how this world works. What I've been doing now is really saying: look, we're inventing the future together. We're pulling the future forward. The reason you are talking to a crazy startup like this, and thinking truly about how AI and agents are going to transform your business, is that you are a future thinker, and we are going to do it together. It's a little bit "let's compliment the customer," but it's not just blowing smoke. No, truly, we need to figure this out together. And for 2025, I think that's actually the thing that is working the best for me. It's like, no, no, no, we have to do it together. And honestly, if you are expecting something different, it's not time for you to embrace this world, because this is the way this world is going to work.

So I'll conclude with this: I've never had more fun building. I've never felt both more inept and more excited about what I'm doing. Just the experience of throwing something out in the world and then having my jaw drop, like, I can't believe this happened.
And not only that, when we upgrade the models underneath them, they just suddenly get smarter. And that's really weird too. All of a sudden they start checking their work. They're like, "Oh yeah, I just did a query to make sure that the row was properly inserted." And I was like, "Hm, who told you to do that?" And they're like, "I don't know. It just seemed like a good idea." I'm like, "That is a good idea. I wish I'd thought of that."

But anyway, I think this is the new world that we're working in. The product discipline, I think, is going to change for everyone, and it's going to change faster than we expect. We all need to adapt to operating in this world and forget so much of what we used to know. A lot of the core ideas, listen to customers, solve real problems, all of that obviously still applies. But the tools and the techniques that we've relied on forever, I think, are all getting upended. So anyway, glad you're all at the AI Engineer conference. It's awesome to have product people here working together, because we all have to build awesome products together. So, thank you very much.