Why Getting Data Right Could Be The Key To Effective AI Projects — With Charles Sansbury

Channel: Alex Kantrowitz

Published at: 2025-11-05

YouTube video id: riVo7VI1BTY

Source: https://www.youtube.com/watch?v=riVo7VI1BTY

What does AI need to do to
deliver real economic value? Let's talk
about it with Charles Sansbury, the CEO
of Cloudera, who is here with us in
studio for a video brought to you by
Cloudera. Charles, great to see you. How
are you?
>> Great to see you and thanks for having
me.
>> Thanks for being here. We've been talking on the show so much about the economic value of artificial intelligence, um, whether or not there will be an ROI on this technology.
>> I'm so happy to be speaking with you
today because you have a fascinating
background. You were a principal in technology investment banking at Morgan Stanley
>> 1996 to 2000. Yes. So you've seen the euphoria around technology in the past, around the dot-com moment. How does it compare to this moment?
>> Um, well, even better. I left in 2000 to actually join a company that was one of the definitive dot-com darlings. So I lived it both as an adviser and as a principal.
>> What company?
>> Uh, a company called Vignette Corporation, original maker of web content management systems. So if you were building a complex website, you went to Vignette and bought the software. Fastest-growing software company in history up to that point, uh, because of the demand: went from $18 million of revenue in '98 to about $96 million in '99 to almost $400 million in 2000. And then the trajectory changed.
>> So talk a little bit about what you saw
then and today. Is it the same thing?
>> People are saying it's another bubble.
>> It's hard for me to say. You know, when we were in the moment, there was euphoria around the opportunity, but what we actually said at that point was: every idea could get funded, and the market wasn't differentiating between an idea and a good idea for funding. And so the financial investors were being rewarded early on for betting on everything. So their solution, if I'm doing the math out loud, would be: we're playing roulette, the payoff is 100 to 1, and there are 33 spots. So let's just put money in every spot. And we'd have losers, but the winners would pay for that. As time went on, it became
more important to discern. And at that
point, you saw some spectacular failures
of businesses. Uh, there's a company called Webvan that was going to automate the shopping experience. They spent $600 or $700 million building out an automated shopping experience. Turned out the way to shop was to go and take stuff off the rack and put it in a cart. Now, 15 years later, we have the home delivery that's kind of gotten there, with humans actually doing the work, not robots. But you look back and you say it should kind of seem clear. So in the moment there were certain ideas where you could say, that doesn't make sense. But
generally most of the businesses had
business value attached to them. Um, and
so I think probably the same is true right now, except that the financial dollars have gotten so much bigger. So if you think about a venture capital round to fund a software startup in 1998, $20 million was the first round, maybe $40 or $50 million to build out your sales force, and those were astronomical dollars.
What's different now is the dollars
required to be relevant in some of these
markets are beyond anything we've ever
seen. And so I read an article recently that said that, um, the AI leaders and the funding they require (the article was somewhat provocative) are a systemic risk to our financial system. But if you think about it, we are spending so much money right now to, uh, buy hardware, electricity, electricity generation. Um, and we just don't know. What I will say, though, is my sense is the end goals and outcomes are more tangible than, you know, selling beach balls on the internet, beachballs.com, or socks or whatever. And so I do think they're tangible. Certainly there's a lot of money that's going to be lost in these investments, um, but I think what's going to happen is actually different, in that I think there are going to be, you know, a lot of companies that turn out not to be successful, but a very small number will be massively successful, and the gains will outpace the losses. But the gains will be concentrated in a very, very small group of winners. That's kind of what I see as the end outcome here.
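The "bet every spot" math he describes can be made concrete. The sketch below is a hypothetical illustration only; the function name and the simplifying assumption that the one winning spot pays the full multiple on its unit stake are mine, not from the interview:

```python
# Hedged illustration of the "cover the whole board" logic: with a 100-to-1
# payoff across 33 spots, betting every spot still roughly triples the stake,
# which is why early indiscriminate funding could be rewarded.

def cover_the_board(spots: int, payoff_multiple: int, stake_per_spot: float = 1.0) -> float:
    """Net profit from betting every spot when exactly one spot wins."""
    total_staked = spots * stake_per_spot
    winnings = payoff_multiple * stake_per_spot  # the single winning spot pays out
    return winnings - total_staked

# 33 spots, 100-to-1 payoff: stake 33 units, collect 100, net +67.
profit = cover_the_board(spots=33, payoff_multiple=100)
print(profit)  # 67.0
```

Once discernment mattered (fewer winners, or smaller payoffs), the same arithmetic flips against covering the board, which matches the "spectacular failures" he describes next.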
>> And as I'm hearing you talk, there are some patterns emerging, right? So, yeah, um, some comparisons and contrasts. So, for instance, in the dot-com boom, the money was spread over many companies, and if you had an idea, then you would get money. Now it's much more concentrated. But maybe the difference is that the companies getting these $40 billion investments, like OpenAI, or the hundred billion they just announced with Nvidia, people are actually using their products. It's almost as if you're in the dot-com era but everybody's ordering from Webvan.
>> From one person. Yes.
>> You have a fascinating study that just came out. Cloudera does the State of Enterprise AI and Data Architecture, and you asked IT leaders: are you using generative AI, or are you using AI? I'm talking about AI first, then generative AI. 96% said yes.
>> Yes.
>> What's the significance of that?
>> Well, there are two pieces to it. One, I think there's a rush to show that you're using AI within corporate IT, because there's been so much focus on very high-profile successes: we improved our code quality, we improved our ability to respond to customer calls.
Those gains are real. But the other thing that's happening is, if 96% of companies are trying AI, only about 30% of IT organizations have approved what they're doing. And so the business is running ahead of IT governance, IT rules, IT structures, which means we have this kind of wild west going on right now, where the business user, the non-IT person, is pushing an AI initiative because, you know, he or she is being pushed by their boss to show me some value for this technology. And the early use cases that were successful, around code completion or content generation or customer-support-driven applications, they're very real and easy to get. I
think the question is not that, but what happens next, and what are the killer apps that evolve for AI. And, you know, we talk about this breathlessly, and I get excited about it, but it's only been here for about two and a half years. And so if you think about the pace it's running at, um, I still think we're in the very early stages of which applications will be the game changers for companies. And what we've said to our team, uh, internally, is we want you to think about using AI in every function across the business. So, from a legal perspective, it's pretty straightforward. For marketing, it's pretty straightforward. Um, finance, uh, operations, and then obviously software development. Um, but the software developers were actually already ahead of us. They'd already been using a handful of code-completion tools in our environments before the IT folks had approved, um, a specific set of tools to use, which points to the fact that the business is moving faster than IT: massive adoption, real business value, but the technology is moving faster than the people who are trying to control the adoption of the technology. That's the tension that we're hearing at our event in New York here today.
>> So let me ask you: do all those divisions need to run through IT? So, for instance, marketing, which is using AI for content generation, does it need to run through IT? Because I imagine you're going to be more successful if IT is involved in some circumstances, but where's the balance?
>> So
the concern, especially as you move from generative to agentic AI, where you have autonomous agents moving through your systems doing stuff without checking back in with a human, uh, I think it creates risks that we haven't gotten our hands around yet. So, um, I was having a conversation today with one of the folks who was presenting at our conference, and that person runs IT governance for a large global financial services organization. And that person said that right now the business is pushing back on IT-based governance and regulation, but that actually, once it's in place, governance will be an accelerant, not a detractor, because if you have a set of approved tools and processes, you can then allow the business to go deploy them. But I am very uncomfortable, personally and from a governance-model perspective, with not having some oversight over technologies that are going to be deployed inside an enterprise infrastructure. So that is my concern, and I know that there are people in the organization who think I'm being overly cautious, um, but that's how I'm thinking about it.
>> But it's also an effectiveness question. I mean, I think, and I'm going to run this by you: so many organizations are getting some value when people are using ChatGPT in their roles. Yes. But when it comes to making change in the way they work,
>> that's when you sort of need that governance, you need that buy-in. And your study is fascinating. Uh, and something really struck me from it. This is right from the study: just 9% of respondents said, uh, that all of their data was available, right? And only 38% said, uh, that most of the organization's data was available. So you have effectively this technology that runs on data, but you only have 9% of people who are able to access it. What's going on there?
>> Well, I would say, putting in a Cloudera commercial, that was the value proposition we've talked about that our new products are designed to address. But taking a big step back: what AI needs to run is, um, accelerated compute and high-fidelity data. The accelerated compute and the models are being deployed, but large corporations' data estates are like a dusty old closet with things shoved in drawers, and in forums where people get together and talk about it, people are kind of embarrassed, but it turns out everybody has the same issue. You know, we don't have a clean and pristine set of data across our various enterprise applications. Um, but the AI initiatives are rolling out, and so it's a race to maintain or improve the quality of the data. Our perspective has been that the answer can't be that you take all that data and move it to the cloud so it can run very neatly on these cloud-based models, because then you lose control over that enterprise context that you've built over years: the transactions with your customers, the unique insights that you have. Um, but you also can't wait a year or two to get the data in shape and put it into, uh, one place so that you can bring, basically, the models to that data. So what we're trying to do, what we are doing, actually, through a combination of research and development and, basically, the new iteration of our product, is give customers the ability to create an orchestration layer that overlays both their on-premises data stores, whether in Cloudera or other applications, and also overlays cloud-based applications and hyperscaler-based data. So you can basically create, effectively, an open data lake based on those component parts without having to move everything to one place. And we're also able to do what we call the data wrangling, right? The data cleansing. So you can have a pool of data in one place without having to basically rebuild from the ground up. And what that means is you can get up and running on a high-fidelity set of data much more quickly. And it addresses the issue we've talked about, which is: AI is built on good data, and if your data is not good, you're not going to get quality answers from the models you deploy.
>> So, can you give me an example of what
happens when all this goes right?
>> Um, I actually have a really interesting example that one of our customers gave us a couple of days ago. Um, a large global financial services institution, non-US-based. Um, and they have transactions that happen around the world, for their customers, that are flagged as being suspicious, and as a regulated bank, with global know-your-customer and anti-fraud and anti-money-laundering rules, they have to evaluate each one of those. And it's a bank that's come together through acquisition: they have a business they bought in geography A, and they're on different systems with different repositories, and the systems they have could be securities trading systems, cash machine systems, and all of these are not tightly integrated, because they've all kind of been separate over time. So you basically have this issue that's flagged, and you have to have a human go investigate it. You know, what were their credit card transactions? Did they happen to buy a plane ticket and go to this place? And it would take a thousand people, full-time, to basically go through these types of issues on a daily basis. And now they've created an agent where, basically, when an incident comes up, they score it based on this agent going looking: oh my gosh, there's been, and I'm making this up, but there was a cash deposit in Malta. That's very odd. Have they been in Malta? Well, yes, actually: they bought a plane ticket to Malta, and they were actually at a Starbucks in Malta and bought a coffee. So, okay, that makes sense. Whereas if it doesn't make sense, it gets a different score. So then they create, basically, a scoring system for these incidents, and they found that they can draw a waterline. Let's say, if the score is 40 out of 100: everything at 40 and below, that's no issue. And then, above 40, they know that the top 10% are highly likely fraudulent. So they put their human investigators on those, already armed with the data file that's been created by the agent, so they can more quickly resolve these cases. And it's allowed them to basically take a team of a thousand and repurpose the majority of those people to other functions within the bank. So that's a savings of tens of millions of dollars, a more efficient process, and better for the customer, who would otherwise get his or her credit card cut off in Malta because they happened to buy a Starbucks coffee and then make a cash deposit. And so it's one of those things where, when they walked me through it, I'm like, that makes total sense. But it wasn't really possible without the advent not just of generative but of agentic technology. Uh, and so I think that's an early use case, but we see lots of instances of people rethinking business processes and overlaying the technology on those business processes, which is going to give you outcomes that are kind of orders of magnitude better. I think that's a pretty exciting use case.
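The triage flow described there, an agent-assigned risk score with a waterline below which nothing reaches a human, can be sketched roughly as follows. All names, fields, thresholds, and example records here are hypothetical; the bank's actual system is not described in that detail:

```python
# Hypothetical sketch of agent-scored fraud triage. The agent assigns each
# flagged transaction a 0-100 risk score plus the evidence it gathered;
# everything at or below the waterline is auto-cleared, and everything above
# goes to a human investigator together with the agent's evidence file.
from dataclasses import dataclass, field

WATERLINE = 40  # example threshold from the interview: 40 and below is "no issue"

@dataclass
class FlaggedTransaction:
    txn_id: str
    score: int                                          # agent-assigned risk, 0-100
    evidence: list[str] = field(default_factory=list)   # agent's supporting findings

def triage(transactions, waterline=WATERLINE):
    """Split flagged transactions into auto-cleared and human-review queues."""
    cleared = [t for t in transactions if t.score <= waterline]
    # Investigators work highest-risk first, starting from the agent's evidence.
    review = sorted((t for t in transactions if t.score > waterline),
                    key=lambda t: t.score, reverse=True)
    return cleared, review

flags = [
    FlaggedTransaction("tx1", 12, ["plane ticket to Malta", "coffee at Starbucks in Malta"]),
    FlaggedTransaction("tx2", 87, ["cash deposit in Malta", "no matching travel history"]),
]
cleared, review = triage(flags)
print(len(cleared), review[0].txn_id)  # 1 tx2
```

The design point the interview raises, where to set the waterline, is exactly the single constant in this sketch: moving it down pulls humans in earlier, moving it up automates more of the queue.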
>> So just a couple questions about that.
They can do that today.
>> They've been doing it in production for a couple of months now
>> and they trust the bot.
>> Uh, they trust the bot. Um, obviously you had a bunch of iterations of testing, but the different question, the magic, is setting the waterline right: at what point do you get the human involved? And I think that's one of the things that a lot of the AI strategists we have talked to have said: there are different situations where you get a human involved at a different point. If there are medical-related things, uh, the reading of an X-ray, humans get involved pretty soon. Um, think about the automated trading that happens today on most of our exchanges: humans don't get involved in that. So somewhere in that hierarchy, from automated trading to healthcare, um, there's a different level of involvement for the human. And I think the art in what I've just said is that they decided where to draw the waterline by deciding the point at which to get the human involved. And I think that's really one of the complex issues we're going to face: at some point, for many of these processes, we're going to want to have a human involved, and somehow a human responsible for the outcome, so someone worries about it, as opposed to releasing these agents into the wild to see what happens.
>> Yeah. And it's kind of interesting, because, if I'm getting it right, it's using the large language model. Yes. To take all these inputs and make sense of them, but because it's operating on good data,
>> right?
>> it's actually able to make logical conclusions,
>> Right. Well, and the other thing that's important there is, we talk about the concept of private AI. And private AI is basically: you buy a pre-trained model, you do some fine-tuning, um, but you do the fine-tuning on your unique data. The concept of private is using, basically, your unique data and your enterprise context in a way that informs your models to optimum effect, without allowing that data to escape outside of your security perimeter. So you don't want your most proprietary data outside. And so the other point this makes is that they've done this internally, on their systems, because they wouldn't necessarily want to have, um, all of that information about specific transactions and customers in some repository, in some vehicle, where they didn't have full command and control. And that's an operational concern, but this is also a place where they have huge data sovereignty concerns, and those are increasingly becoming impactful in terms of how customers think about data security, data privacy, and ultimately data sovereignty.
>> Do you think that line of where it makes sense for a human to get involved will just keep going up over time, or is this kind of as good as it's going to get?
>> Uh, I will tell you, if the improvement in the large language models that I use is an indication, the line's going to keep going up. The technology is iterating very fast and getting better much more quickly than I would have thought.
>> I'm sure you've seen the studies, the MIT study, for instance, that's cited everywhere, that we've talked about a lot on the show, that
>> the 95% study.
>> That's right. 95% of companies see no ROI on AI. This seems like a pretty clear-cut case of a company getting ROI on AI.
>> Yes.
>> Do you believe those numbers? How should we read the study?
>> Um, maybe. It's also, so, I think that a lot of things are tried and fail quickly. Um, and I think that it's also very hard for the business user right now, um, to identify the use case that matters. And maybe an example of that is: it requires not just facility with the technology but also a real, deep understanding of the business. And historically, um, a lot of our IT folks have not been experts in the business, and our business folks have not been experts in IT. So I think it takes a pretty unique individual right now to put those two things together, and a lot of the use cases, I believe, are being driven by, you know, business users who don't have as much technology experience, or IT users who don't have as much business experience, driven by the urgency of, oh my gosh, we've got to do something. And I think that's what we're seeing right now: the we've-got-to-do-something, so we're going to run a prototype. And even if the report up to management is, we ran three prototypes and they failed, that's still better than we didn't do anything.
>> Yeah. You can't go to your CEO and say, "I don't have an AI strategy."
>> Doing nothing is not an option.
>> Exactly.
>> Yes.
>> All right. Speaking of CEOs, I'm going to ask you a CEO question.
>> Well, if we had one in the room, we could ask him or her.
>> Well, I was just speaking with the people running, uh, search at Google and some of the products there. And I said, look, AI is a product with a lot of potential, but it's really hard to know where to invest, because some days it's working well, some days it's not working well; for some use cases it makes sense, for some use cases it doesn't. So, from where you sit, I'd love to hear your perspective on how Cloudera decides where to go forward and where not to, because you have a pretty good business that's not generative-AI related. So you're in this always-day-one situation. It's like, do you want to reinvent, or do you stay with the flagship? How do you think that through as a CEO?
>> Um, it's
a really tough question, um, using that day-one analogy, because a pure day-one perspective would say: ignore the history; from this point going forward, what's the right decision? This is the first day we're in business. And that's theoretically true. But we have more than a billion dollars of revenue and 700 or 800 of the world's largest customers who depend on us to manage their enterprise data and make it safe and secure and accessible for all their analytics initiatives. So I can't ignore that. So the answer right now is we kind of have to do both. So we are investing in that core data platform, the Cloudera data platform, that is the data foundation for eight or nine of the top 10 companies in basically every global industry. But those same customers are saying: also, because you are the repository within my organization that has the most data gravity, what are you doing in terms of AI tools and capabilities, so I can use that data in a data lakehouse fashion, so that I can basically use it to train and build, um, models that my business users want to deploy? And so I think, from our perspective, we can kind of straddle both worlds. Um, it wouldn't be prudent for me to shift all of our resources to our new initiatives, because we have large organizations saying, look, I need to make sure that this product that runs in my data center will run for the next 20 years. On the other hand, I have innovation requests being driven by every one of those same customers. And so, right now, we're doing both. Um, we continue to invest in and evolve our platform, but we're also spending, both in terms of internal R&D, and also we've done three acquisitions recently, around adding functionality to our data management capabilities, um, and also adding containerization capabilities, which allow us to basically deliver a cloud-like user experience regardless of computing platform, and that's more forward-innovation driven. Uh, and so right now I don't have, I'd say, the luxury of shifting everything to the innovation side, but I've also got a foundation of knowledge and expertise, built up over the last 15 years of being probably the unquestioned leader in the early stages of big data, that we can then leverage to help make better decisions as we think about how to solve people's AI needs. So that's a CEO answer, where I settled and kind of filibustered.
>> That was not a filibuster. That was a legitimate answer. I mean, you talked about the "we had to do both."
>> We had to do both.
>> So, I want to hear about your journey, actually, because, well, I can't even imagine what it's been like for you to see ChatGPT come out, and then, hearing what you've been talking about today: I think corporations were like, we need to put this into place in our companies, uh, probably using off-the-shelf models, and then realizing, maybe, that, you know, they're not going to be as valuable if they're just using public data, that they'll be valuable when they're able to connect it with their internal data. And then your phone probably started ringing. So talk a little bit about what it's been like for you.
you. So, so in the early stages look the
the the the AI was born in the cloud and
all the models were trained on all the
publicly available data and as they got
better being they got better and better
but at that point all the models have
been trained on the same data that's not
differentiated and then we had customers
coming and saying look I've got all this
you know customer interaction data or
transaction data and and how do I feed
that in and it became clear to us about
two or two and a half years ago that we
had to also incorporate that into
fine-tuning models and fine-tuning
wasn't even a term back then, right?
>> To basically get better outcomes. And the example I use is, for my legal department, we have thousands of contracts that we've signed over time. And there are certain terms that we've agreed to, and certain terms that we negotiate, and certain terms that we don't negotiate. And we can, you know, feed that into a model to understand: here's a contract that comes in; let's apply our rules; here's our markup based on what we've agreed to in the past. And that's fantastic, but, A, it's only valuable if it's trained on our own internal context, and, B, that model does not have to have read War and Peace to deliver that to us. It doesn't have to be generally trained on everything that's available; it has to be trained on our content. And so the point that I make there is, people are getting better about large and small language models and where each is appropriate, and being more targeted in training models on the right data, which gives you the better outcomes. Whereas the early days were boil-the-ocean, right?
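The contract-review idea he describes amounts, in part, to a data-preparation step: turning past (incoming contract, agreed markup) pairs into fine-tuning records. A minimal sketch, assuming a generic prompt/completion record shape; every field name and example term below is invented for illustration and is not Cloudera's product:

```python
# Hypothetical sketch: turning a legal team's historical contract markups into
# supervised fine-tuning records, so a model learns "incoming terms -> our redlines".

def to_finetune_records(history):
    """history: list of (incoming_contract_text, agreed_markup_text) pairs.
    Returns records in a common prompt/completion fine-tuning shape."""
    return [
        {
            "prompt": "Apply our standard negotiating rules to this contract:\n" + contract,
            "completion": markup,
        }
        for contract, markup in history
    ]

# Two made-up historical negotiations serving as training signal.
history = [
    ("Liability is uncapped.", "Cap liability at 12 months of fees."),
    ("Auto-renews for 3 years.", "Reduce renewal term to 1 year with 60-day notice."),
]
records = to_finetune_records(history)
print(records[0]["completion"])  # Cap liability at 12 months of fees.
```

The point from the interview carries over directly: the value comes from the firm's own markup history, not from a model generally trained on everything.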
>> And so I think the progress we've seen in the large language models, and the progress we've seen in the sophistication of customers, in understanding which models to use for which use cases, is dramatically different over the last 12 or 18 months, and we're still in the very early innings of this adoption. And so I think we have, I'd say, a ringside seat. Arguably, we're in the ring for a lot of this, and it's pretty exciting. Um, but I think the answer is: don't get too attached to what you're doing today, because something better is coming around the corner.
>> I think in the early days of all this, when companies were rushing to roll it out, we, the public, saw some hilarious examples of them getting it wrong. For instance, car dealerships offering discounts because the bot said, "I read the policy," and it just hallucinated the discount. And there's a question of whether they should be held to it or not. And this is, I think, an important thing when we think about the ROI question and where this is going: until that gets fixed, um, those questions will follow generative AI. And we also talk a lot about whether the models are getting better. Well, the model can get good up until a certain point. Uh, but once you start bringing in clean data and not subjecting the models to these types of hallucinations, that's when you start being able to put it into production as a product that works.
>> Yeah, I think, um, models are getting better, but maybe they're approaching this kind of asymptotic barrier, where, are 13 billion parameters really that much better than 12 billion parameters, right? Maybe nine is good, maybe five. Someone mathematically has done the work, but my guess is that we're kind of approaching a point where the models themselves aren't getting better, which means the quality of the data you're trained on has become increasingly important. And so we're hearing customers talk a lot more about data fidelity, data quality, and then, you know, data lineage: understanding where the data comes from that trains your models. And so I believe that what's happening is that enterprise data is being revalued as a corporate asset, and people are now willing to spend more money to get it right. You know, whereas historically, if I'm, uh, you know, an IT leader, and we had projects that came up, and one was cybersecurity: we're funding that, right. And one was analytics: well, we're funding that. And one is, um, data governance and kind of data fidelity: I don't even know what that means; that goes to the bottom of the list. And so I think a lot of companies did neglect their data stores. Uh, as recently as 18 months ago that was the case, but now it's a very high priority for customers, because, again, the components are accelerated compute and high-fidelity data, and if the models are nearly optimized, data is your next chance to improve your quality of outcome.
>> All right, Charles, if people want to learn more, where can they go?
>> Uh, cloudera.com. We have a tremendous group of both written and video materials online. We've just introduced a bunch of new products that help us get to this vision. So I'd encourage people to learn more about the company, and obviously we're excited about what's going on.
>> Awesome. Charles, thank you for the conversation.
>> Thank you. Appreciate it.
>> All right, everybody. Thank you for watching. We'll be back on the feed soon.