Shipping Products When You Don't Know What they Can Do — Ben Stein, Teammates

Channel: aiDotEngineer
Published at: 2025-07-28
YouTube video id: PthmdT92qNg
Source: https://www.youtube.com/watch?v=PthmdT92qNg
Yeah, I mean, the actual title has curse words in it. I will probably be cursing a lot. I didn't know if I would get into the track if I actually published the curse words. I'm one of the founders of Teammates. I'm going to wear my product manager hat today. I'm assuming this room is mostly product folks, probably product-minded engineers as well, but I'm going to just wear the product hat.

A little bit about Teammates, very quickly. We make a platform for designing and managing an entire digital workforce. So in AI engineer parlance, right? We're building agents. But I would think of it as two ticks up from that, because what we really believe it's about is the experience: the interaction patterns of humans and computers working together.

So I want to talk to you about my favorite teammate. This is Stacy. Stacy Hand. She actually got promoted since this slide; she's an L3 engineer on our team right now. She's awesome. She looks like a hamster. All of our customers get to design whatever teammates and avatars they want. They give them personalities. It's all really fun. And Stacy lives inside all of our collaboration tools. She has a Google Workspace account, for Gmail. She has a Slack account. We truly leaned into giving all of our teammates identity.
She sends emails, or I forward her emails, and she hangs out in Slack in the public channels. And she's Gen Alpha, which, I don't know, makes me feel really old. I don't know what she's talking about. She's constantly saying "67," and I'm like, what are you talking about? And I can tell from this room that none of you have 12-year-olds. No? Okay, there you go. So yeah, you're rolling your eyes as well. But anyway, this is Stacy, and this is roughly how my sales pitch goes. It's a little more formal than this, but this is generally the pitch.

I got asked a question recently. Oh yeah, more of the pitch: she shares Google Docs and Google Sheets. And a customer said, "Hey, can I tag my teammate in a Google Doc comment?" And this gave me pause, because I had never actually thought about that before. In the back of my mind I'm like, well, of course you can; your real question is what's going to happen. So I'm doing math in my head. I'm like, "Okay, well, we don't have webhooks, so she probably won't get a webhook from the comment. Okay, but she's going to get the email notification from Google. Does that email have the comment and the context, or maybe a link?" And I'm like, I have no idea, right? I actually don't know what's going to happen.

And this was the impetus for this talk. I was like: how do I ship a product? How do I develop a product? How do I talk to customers? How do I instill trust when I don't know what my own product can do? It's really weird. And sometimes I'm like, well, is this just because I'm an idiot? And since it's my talk, I'm gonna say no. And sometimes I'm like, well, is this because what we're building is so far out there? These are truly autonomous agents that can use any... And I don't think that's it either. I think what's happening is that the product management discipline is going to undergo a transformation, a shift, an evolution, whatever you call it, that is super profound. And we may or may not totally realize it yet. In the engineering world we have tools in our IDEs, we have codegen, and we're sort of starting to squint at how the discipline might be changing. I don't think we really understand how product development is changing and evolving: what the new tools and practices are, and how we forget everything we've learned in the past.

Why is this true? If the answer is not "Ben's an idiot," and the answer is not that we're way out there, then it comes down to two reasons. Number one: all our products are built on top of LLMs, plus or minus, and we don't know, and can never know, what the LLMs know. So inherent in what we're building is that we don't know what the foundation is. You don't have to know how your database works, but you generally know its surface area, the interface that's exposed. We don't understand that for the models. And the other thing is that the expectations from customers are just boundless. We're just like, hey, here's a text box. That's probably not a good interface, but essentially we're saying, here's a free-text box, and if it's anything other than a "help me write" button, you're essentially inviting customers and users to do whatever they want. So we have this boundless surface area built on top of a foundation that we don't understand. And so the question now is: how do we adapt?

Let me actually pick on this Google Doc comment thing for a second. If I was wearing my traditional PM hat, I'd say, "Okay, I need to make a feature that's going to read and respond to Google Doc comments." And so in my head I'm like: does Stacy have access to the Google Doc? If she gets tagged in the comment, should she reply directly in the comment? Should she reply at all? What happens if somebody else comments in the thread? What if someone comments in the thread and it's not addressed to her? What if it's her doc, and someone commented to someone else, but she gets the notification? There's just so much to think about and reason about. And I'm like, okay, I'm not building a Google Doc commenting product, so I'm not going to spec all of those things out. And what's worse is that you also probably want to tag her in Linear tickets. And what's the book? If You Give a Mouse a Cookie. If you give a mouse a cookie, well, you probably want to tag her in Figma as well, and you probably want to tag her in LinkedIn posts. And we're not a team that's building a generic comment-reply agent system. So then the question is: what are we supposed to do, as a product manager who realizes, okay, I have this boundless surface area? How does the practice need to change? That's the core of what I want to talk about today.

So I'll do three high-falutin, ivory-tower ideas, and then I'll talk through some practical ways to make this real.

The first one is a mindset shift: think in affordances, not specific requirements. So it's not "if, as a user, Stacy replies in the comment thread and she has..." That's not how we'd think about it anymore. It's the affordance: she has affordances to comment, or affordances to communicate, to email, to collaborate. We're going to trust the LLMs. We're going to trust the agentic workflow, the work planning, all of the things inside our beautiful 12-factor agent. We're going to assume that it will understand. It's the affordances that we need to think about, not the individual features, which is really weird, and it's not how product people have typically thought before.

And I would say this actually goes even further: behavior is emergent. This was the other thing I did not expect at all when starting in this space. Not only do we not know what things work; sometimes they do work, and they work in ways we didn't expect. So I feel like our job as product people is to discover functionality. What are the right building blocks? What are the right Lego bricks that we give our engineering team, our product, our customers, to let them compose, and can we discover emergent behavior? That is one of the reasons this is the most exciting time I've ever had building: we're actually building things and then discovering what they can do. That sort of became the new job, in a sense: discovering what's possible. Because if you asked me, I could not sit down in front of a Google Doc and type out what this thing should do. I can't; I don't know how to do it. And even if I could, how do I then communicate it? How do we communicate to a development team, to a backlog, exactly what should be happening? Figma doesn't have the affordances for this. My PRD doesn't have the affordance for "Well, you should probably talk a little less Gen Alpha, because you're making Ben feel old." How do we communicate and express these concepts? So I think these are the three high-level ways that our practice needs to change. But let's make it a little more concrete.

Okay, so, let's talk about evals. It's really hard to make a slide with graphics of evals. I feel bad for evals; how do you illustrate an eval? So I'm going to make you just look at pictures of various teammates from across all of our customers. Okay, who hates raising their hand at conferences when the speaker asks them to? Okay, awesome. So here's my question. For the engineers here: who legitimately, don't lie, writes and runs their evals? Good number. And of the product people, who has visibility into the evals? That's not bad. And do you look at them just because you have the visibility? All right: one, one and a half, two. Okay, great.

So, let me back up. We all talk about evals, and we're all going to be embarrassed to say that we don't really know what they are. Evals are a testing framework for probabilistic AI, for agents. If we think about deterministic code: I withdraw $100 from the ATM, and my bank account should have $100 less. Great. I can test that, and I can write code to test that. When the test is "Was she snarky in Slack?", well, how do you test that? How do you write that test? So we come up with this whole new discipline of evals, which is: she should be a little bit snarky and a little bit funny, but not mean. And then we hand it off to another LLM to ask: did that reply meet those criteria, and how often did it? It doesn't have to be 100%. She should be pretty snarky but not mean 80% of the time, or whatever business logic you want. So these are evals.

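To make the contrast concrete, here is a minimal Python sketch, not Teammates' implementation: a deterministic unit test for the ATM example next to a rubric-plus-threshold eval. The names `Eval`, `run_eval`, and `toy_judge` are hypothetical, and `toy_judge` is only a keyword stand-in for what the talk describes as a second LLM grading each reply against the rubric.

```python
from dataclasses import dataclass
from typing import Callable


# Deterministic code: one input, one right answer, so a plain assertion works.
def withdraw(balance: int, amount: int) -> int:
    return balance - amount


assert withdraw(500, 100) == 400  # the account has exactly $100 less


# Probabilistic behavior: a rubric, a judge, and a pass-rate bar.
@dataclass
class Eval:
    name: str
    rubric: str
    threshold: float  # fraction of sampled replies that must pass, e.g. 0.8


# (rubric, reply) -> did the reply meet the rubric?
Judge = Callable[[str, str], bool]


def toy_judge(rubric: str, reply: str) -> bool:
    # Hypothetical stand-in for an LLM-as-judge call: it ignores the rubric
    # and simply fails replies containing a word from a small "mean" list.
    mean_words = ("idiot", "stupid")
    return not any(word in reply.lower() for word in mean_words)


def run_eval(ev: Eval, replies: list[str], judge: Judge) -> tuple[float, bool]:
    """Grade every sampled reply and compare the pass rate to the threshold."""
    passed = sum(1 for reply in replies if judge(ev.rubric, reply))
    rate = passed / len(replies)
    return rate, rate >= ev.threshold


snark = Eval(
    name="snarky_but_not_mean",
    rubric="A little bit snarky and a little bit funny, but not mean.",
    threshold=0.8,
)
replies = [
    "on it, no cap",
    "done. you're welcome",
    "that's a stupid question",
    "shipped it fr",
    "67",
]
rate, ok = run_eval(snark, replies, toy_judge)
print(rate, ok)  # 0.8 True
```

The 80% bar is the talk's own example. In a real setup the judge would be another model prompted with the rubric and the replies would be sampled from the agent; the point of the sketch is only that the assertion becomes a pass rate compared against a threshold.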
But here's what I would posit: evals are the only way that we know what our software can do. Which is why I love the idea of product people looking at the evals, because they become the new specification for the product. And as we're watching, if you're downstairs in the expo gallery, you're seeing new software that's like, "Hey, bring the team in." This reminds me a little bit of, for the old-timers here, behavior-driven development. There was a period when it was, "Oh, the business people are going to write the tests, those will get converted to code, and then the code will run." And the truth is, no one ever wanted to do that. No business person (I don't even know who a business person is) ever wanted to do that. But I actually think this is different, and I think it's a pretty meaningful way to actually understand what the product can do, and to begin to specify what it can do.

Okay, so vibe coding for a second, which we all do and we all talk about. I want to talk about vibe coding in a way that's really constructive. How do I say this? It's very, very hard. I said earlier that you can't do it in Figma and you can't do it in a PRD. What do I really mean? Well, it's very hard to sit down in front of a blank piece of paper and write what the teammate, the agent experience, should be. It's just really hard to imagine it, and you don't get it until you feel it. So much of what we're doing in this human-computer interface is visceral. It's feel. It's like: did they ask too many questions? How many questions is too many? Oh, wouldn't it be great if they clarified exactly what you meant? Well, turns out that's really annoying. But when I wrote the first spec, I wrote, "The teammate should ask a lot of clarifying questions." And we gave it to users and they were like, "This sucks." And I was like, how would I ever have known that? And the answer is: by prototyping, because it's so easy to vibe code something and get the feels. So this is the next thing that I'm pretty excited about as a new product management tool: being able to feel and experience what it's like to interact with the computer, rather than just writing it down or hoping you have a clickable prototype that will work.

I will also mention that we have to be careful with vibe coding, because I do not mean sitting in the meeting and saying to the engineering team, "How come this is taking two weeks? I finished the feature during the meeting." That doesn't win you any points. No, this is never going to production. But what it does is give you the feel, the experience. And this is the only way I know to actually test it and feel it out.

Do you remember the Claude "certainly" issue? Certainly. I mean, there was this period when every time you asked Claude something, it said "Certainly," and that probably seemed really good when you were testing it for the very first time. And then the fourth time, when you're like, "Hey, can you do my taxes?" "Certainly." "Can you write my acceptance speech?" "Certainly." Like, oh, this is actually really annoying. But you don't realize that until you experience it. That's why I like vibe coding.

Okay, so great, we did all this development, and then the question is: hey, we pushed to prod, does it work? And I'm like, I told you, I don't know. So the question is: how do you test? How do you know that it's going to do the things you said it was going to do? I sort of alluded to this, so I'll go through it quickly: really, discover the functionality.

There's an old joke; I'll tell the joke. A QA engineer walks into a bar. Orders a beer. Orders two beers. Orders zero beers. Orders negative one beers. Orders a lizard. Orders a beer with an emoji. Great, this bar is good to open. And the first customer walks in, asks where the bathroom is, and the bar blows up. Great old joke. It's kind of how I feel these days. I just sit there like, "Oh, you know what would be cool? If they started posting comments on LinkedIn. And what if, every time I added a track to my Spotify account, they..." These are just crazy ideas, but this is where the emergent behavior comes from. So it's this mindset of: let's just try, let's just experiment. It's a growth-mindset shift from "I'm going to write the features and the requirements" to "No, we're going to figure it out."

This was a little bit unexpected for me: how do you report things to engineering and then have them fixed by engineering, and what counts as a bug in this world? That is really, really strange. I don't know if it's just a product role or maybe a support role, but how do you know what is appropriate to escalate, to put onto the backlog, to flag as a bug? I'll keep picking on Stacy; she gives me a really hard time, so it's fine. It's like, "Hey, she used too many emojis. Put it in Linear." Well, it's not really a bug. Show me in the spec where you told me not to use too many emojis. In our tickets we have "closed: done" and "closed: duplicate." We need a "closed: LLMs be crazy, yo, I don't know how to fix this," just because it's probabilistically generated. So how do we know if it's right or wrong? How do you know if it's a feature or a bug?

And I think there's an element of credibility that we need to build up. It's like, hey, we understand that for some use cases 80% is good enough. If this eval is passing 90% of the time, that's a go. If it falls below 90%, that's red and we're not going to ship it. So let me come back to evals for a second, because if the eval becomes the spec, we can say: hey, we set it at 100%, even though this is probabilistic: you should never give a refund if a customer can't prove that they bought the thing. Great, that is our metric, and we can say yes, this is a bug. But if it's just a feel, it becomes really difficult. Again, this was totally unexpected: that debugging and assigning bugs would become controversial.

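The "eval as spec" idea can also be sketched as a release gate plus a triage rule. This is a hypothetical illustration, not something shown in the talk: `EvalResult`, `ship_gate`, and `triage` are invented names and the pass rates are made-up sample numbers, while the 90% and 100% bars mirror the examples above. A behavior with no eval behind it, like the emoji complaint, comes back as "no eval" rather than "bug".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    pass_rate: float  # measured fraction of sampled replies that met the rubric
    threshold: float  # the agreed bar; this number is the spec


def is_red(result: EvalResult) -> bool:
    """An eval is red when its measured pass rate falls below its bar."""
    return result.pass_rate < result.threshold


def ship_gate(results: dict[str, EvalResult]) -> list[str]:
    """Return the names of red evals; an empty list means the release can ship."""
    return sorted(name for name, r in results.items() if is_red(r))


def triage(behavior: str, results: dict[str, EvalResult]) -> str:
    """Decide whether a reported behavior is a bug, using the evals as the spec."""
    if behavior not in results:
        # Nothing in the spec covers it ("show me where you told me not to").
        return "no eval: not a bug yet, candidate for a new eval"
    return "bug" if is_red(results[behavior]) else "within spec"


# Hypothetical sample numbers for illustration only.
results = {
    "snarky_but_not_mean": EvalResult(pass_rate=0.92, threshold=0.90),
    "no_refund_without_proof": EvalResult(pass_rate=0.99, threshold=1.00),
}

print(ship_gate(results))  # ['no_refund_without_proof']
print(triage("no_refund_without_proof", results))  # bug
print(triage("too_many_emojis", results))  # no eval: not a bug yet, ...
```

The design choice here is that a bug is defined only relative to a threshold someone agreed on, which is one way to turn "it's just a feel" into something a ticket can be closed against.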
Okay, customers. I found this really weird. If I think about not wearing my founder hat but my typical product manager hat: I go into a customer meeting, usually with a salesperson, and I'm going to play a role. So what's the role? Well, I'm either going to play the visionary: "Hey, here's our vision for the product. Here's our roadmap for the future. Let me help you, the customer, understand how you're going to come along on this journey with us."

Or sometimes I'll play the role of honest broker: "Listen, sales is selling you a bunch of vaporware. Let me tell you what's real. Let me tell you exactly what you can expect." And that's a role you play. I usually set this up with the sales team beforehand: "Yeah, I'm going to be the honest broker, and we'll give the customer confidence." Today, I'm like, "Okay, I told you our vision for the future, our roadmap," and the customer's like, "You're full of [expletive]. None of this actually works." Right, I can't really paint the vision, because no one actually believes it. It sounds like witchcraft. And then I'm like, "Oh, well, then I'll be the honest broker and tell you how things work," but I just told you I have no idea how it works. So this became very strange, because I can't play either of the roles that I'm supposed to be playing. The future sounds like witchcraft. The present is literally "I don't know." So how do we do this?

I'll tell you how I've been doing it now. I don't know if this is a 2025 answer or a durable answer. If we believe that all of our products are going to be probabilistic for all time, then we probably have to figure out how this world works. What I've been doing is really saying: look, we're inventing the future together. We're pulling the future forward. The reason you are talking to a crazy startup like this, and thinking truly about how AI and agents are going to transform your business, is because you are a future thinker, and we are going to do it together. And it's a little bit like, hey, let's compliment the customer. But it's not just blowing smoke. Truly, we need to figure this out together. And for 2025, I think that's actually the thing that is working best for me: no, no, no, we have to do it together. And honestly, if you are expecting something different, it's not time for you to embrace this world yet, because this is the way this world is going to work.

And so, I don't know, I'll conclude with this: I've never had more fun building. I've never felt both more inept and more excited about what I'm doing. Just the experience of throwing something out into the world and then having my jaw drop: I can't believe this happened. And not only that, when we upgrade the models underneath them, they just suddenly get smarter. That's really weird too. All of a sudden they start checking their work. They're like, "Oh yeah, I just did a query to make sure that the row was properly inserted." And I was like, "Hmm, who told you to do that?" "I don't know. It just seemed like a good idea." I'm like, "That is a good idea. I wish I'd thought of that."

But anyway, I think this is the new world that we're working in. The product discipline, I think, is going to change for everyone, and it's going to change faster than we expect. And we all need to adapt to operating in this world and forget so much of what we used to know. The core ideas, like listen to customers and solve real problems, obviously still apply, but the tools and techniques that we've relied on forever are, I think, all getting upended. And so, anyway, glad you're all at the AI Engineer conference. It's awesome to have product people here, because we all have to build awesome products together. So, thank you very much.