Fuzzing in the GenAI Era — Leonard Tang, Haize Labs
Channel: aiDotEngineer
Published at: 2025-08-22
YouTube video id: OMGPvW8TBHc
Source: https://www.youtube.com/watch?v=OMGPvW8TBHc
Thanks Ally for the great intro. Indeed, we're working on what I believe to be the extant problem in AI, which is: how do you validate, verify, audit, and steer something as subjective and unstructured as literal LLM slop? That's what we'll be talking about today. I should point out that ostensibly we're part of the AI security track, although I'd really consider us more of a QA and evals company in some sense, though there's a lot of shared similarity in how we approach the problem technically. We are essentially a property-based testing company, or a fuzz-testing company, or, as I like to call it, a haizing company.

Cool. So just to set the context a little: why did we start Haize? What does "haize" mean? Haizing, to us, is ultimately this: we know AI systems are extremely unreliable and hard to trust in practice, and you need to pressure-test them before you put them out into the wild. Our solution is to run large-scale optimization, simulation, and search before deployment, and to figure out through a battery of tests whether your system will behave as expected before it actually goes into production.

I'm sure any of you who have tried to build LLM apps have understood, extremely viscerally, what I mean by the last-mile problem in AI. In 2025 it's extremely easy to get something that's demo-ready or POC-ready: you can whip together a cool product over the weekend and impress your PM. But it's really hard to get that same product into production at a point where it's truly robust, enterprise-grade, and reliable. And this has been the case for two-plus years now: we've been promised the allure of autonomy, agents, full gen AI, and enterprise transformation ever since ChatGPT launched, and we're still not quite there. I think that's ultimately because we haven't solved this last-mile problem around trust, reliability, and risk.

Part of the reason we haven't solved it is that people still think about evals, about measuring your AI system, in a very straightforward and naive way. I'm sure everybody has seen this pattern: go out, recruit human subject-matter experts, collect a finite, static, golden dataset of inputs and expected ground-truth outputs, then run the inputs through the application, get the actual outputs, and compare them somehow against the golden answers (a minimal sketch of this harness appears after this passage). This is how evals have been done forever, since the birth of deep learning and before. But it doesn't hold up in the gen AI era, specifically because of a property of gen AI systems that I like to call brittleness, or, more technically, Lipschitz discontinuity.

What I mean by this: people say AI is sensitive, AI is brittle, AI is non-deterministic, and that's all true, but non-determinism is really not the main thing that makes AI so hard to deal with. Non-determinism is basically fine if you set the temperature to zero. Yes, there's caching and weird systems quirks across all the LLM providers that make things somewhat non-deterministic even then, but for the most part, non-determinism doesn't bite you too much when you're building AI apps: you constrain your outputs to temperature zero, you run things through a workflow, and it's fairly deterministic.
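To make that pattern concrete, here is a minimal sketch of the static golden-dataset harness just described. Everything in it is illustrative: `call_app`, `similarity`, and the example data are hypothetical stand-ins for your application under test and whatever comparison metric you choose.

```python
# A minimal sketch of the naive golden-dataset eval harness (illustrative only).

GOLDEN_SET = [
    {"input": "How do I reset my password?",
     "expected": "Use the 'Forgot password' link on the login page."},
    # ... more (input, expected) pairs collected from subject-matter experts
]

def call_app(prompt: str) -> str:
    """Hypothetical wrapper around the LLM application under test."""
    raise NotImplementedError

def similarity(actual: str, expected: str) -> float:
    """Hypothetical comparison metric: exact match here, but it could be
    a classifier, embedding distance, or an LLM judge."""
    return float(actual.strip() == expected.strip())

def run_golden_eval(threshold: float = 0.8) -> float:
    """Run every golden case through the app and report a pass rate."""
    passed = 0
    for case in GOLDEN_SET:
        actual = call_app(case["input"])
        if similarity(actual, case["expected"]) >= threshold:
            passed += 1
    # Note: this score only describes behavior on the static set itself.
    return passed / len(GOLDEN_SET)
```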
What does bite you a lot when you're building AI apps is when you send two ostensibly similar inputs to your application, with maybe slight variance in the syntax, semantics, or appearance of the text, and all of a sudden you get wildly different outputs on the other side. That's what I mean when I say gen AI apps are incredibly brittle, and I think this is the core property that makes building with gen AI so difficult.

And of course, we see this brittleness manifest in all sorts of fun ways. I don't have to belabor the point: you've got everything from Air Canada's customer support bot hallucinating, to Character AI telling teenagers to commit suicide, to someone buying a pickup truck for $1 through a Chevrolet customer portal. I don't think we need to go through more examples; this happens more or less every single week, and more keep popping up. Again, it all comes back to gen AI being extremely sensitive and brittle to perturbations in the input space.

Cool. So standard evals, of course, don't cover this brittleness property, and I'd say they're insufficient in two primary senses. One is coverage. With a static dataset, you only know how good your AI system is with respect to that dataset. It might look like your system scores 100% on all your unit tests, on all your golden dataset points, but if you look around the corner for more inputs that cover your space more densely, it's entirely possible you find perturbations that tell a very, very different story about how your application actually does in the wild. So, point one: standard evals don't have sufficient coverage.

Point two: it's actually really difficult to come up with a good measure of quality, or even similarity, between the outputs of your application and your ground-truth outputs. What we'd really want is almost a human subject-matter expert constantly overseeing your AI application, someone with all the right taste and sensitivity who can translate that sensitivity into a quantitative metric. That is by no means a trivial task. I think it's the core challenge the field has been facing around reward modeling for the past five, six, seven-plus years: how do you take the sensitivity of a subject-matter expert from a non-technical domain and translate their criteria into quantitative measures? This is not even close to solved with standard evals today. People use things like exact match, classifiers, LLM-as-a-judge, and semantic similarity, and all of these have their own sets of quirks and undesirable properties, as we'll see in a second.

Long story short, the way we think about tackling this eval problem is through haizing: fuzz testing in the AI era. What haizing comprises is very simple in the abstract. We simulate large-scale stimuli to send to your AI application; we get the responses; we judge, analyze, and score the outputs; and we use that as a signal to guide the next round of search. We do this iteratively until we discover bugs and corner cases that break your AI application. If we don't discover anything and we exhaust our search budget, that means you're essentially ready for production. So this is haizing in a nutshell.
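In code, the loop is simple to state. This is a hedged, high-level sketch of that simulate-score-search cycle, not Haize's actual system; `app`, `judge`, and `mutate` are placeholders for the application under test, the scoring function, and whatever input-perturbation operator you use.

```python
import random

def haize(app, judge, seed_inputs, mutate, budget=1000, fail_below=0.2):
    """Illustrative haizing loop: generate stimuli, score the app's
    responses, and let low scores guide the next round of search."""
    frontier = list(seed_inputs)
    failures = []
    for _ in range(budget):
        stimulus = mutate(random.choice(frontier))  # propose a new input
        response = app(stimulus)                    # exercise the system under test
        score = judge(stimulus, response)           # 0.0 = broken, 1.0 = fine
        if score < fail_below:
            failures.append((stimulus, response, score))  # found a corner case
        elif score < 0.5:
            frontier.append(stimulus)  # promising region: keep searching near it
    # An empty list after an exhausted budget ~ "essentially ready for production."
    return failures
```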
Easy to describe, but actually really difficult to execute in practice. Both sides of the equation, scoring the outputs and generating the input stimuli, are technically quite hard.

I'll first talk about how we think about scoring the outputs, again translating subjective criteria into quantitative metrics. We call this judging, more broadly. You're probably familiar with using an LLM as a judge: have an LLM look at the output of your AI application and decide, based on some prompt or rubric you give it, whether the response is good or bad, on a scale from one to five or one to ten or what have you. Very simple to do, but it has a whole array of failure modes. In particular, an LLM judge is itself an LLM, so it's prone to hallucinations. It's unstable: you can have a really good articulation of the criteria that still doesn't operationalize well in the model. It's uncalibrated in its outputs: what a one means to an LLM is very different from what a one means to a human, and what a five means to a human is very different from what a five means to an LLM. And it has all sorts of biases: perturb the inputs in some odd way, say, flip the order in which you present two responses, and the result often changes; provide extra context or change some part of your rubric, and that changes the result too. So it's extremely biased and extremely fickle, and the TL;DR is that LLM-as-a-judge, as an off-the-shelf call to an LLM, is oftentimes not going to solve your reliability issues.

So the key question in my mind is: how do you actually QA the judge itself? How do you get to a point where you can judge the judge and say this is the best gold-standard metric I can use to iterate my underlying AI application against?

The broad philosophy we've been taking over the past few months is to push the idea of inference-time scaling, or more broadly compute-time scaling, into the judging stage. We call this scaling judge-time compute, and there are two ends of the spectrum. One end is: rip it from scratch, no inductive biases, train reasoning models that get really, really good at this evaluation task. The other end is: be very structured, don't train any models, use off-the-shelf LLMs, lean on really strong inductive priors, and build agents as judges, i.e., build agent frameworks, pipelines, and workflows to do the judging task. We have a nice little library called Verdict that does this. Very on-the-nose name, I know.
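As a flavor of what "agents as judges" can look like, here is a hand-rolled sketch, not the actual Verdict API, that composes a few such primitives; the primitives it uses (debate, self-verification, ensembling) are unpacked just below. The `llm` helper is a hypothetical call to a small off-the-shelf model, and the score parsing is deliberately naive.

```python
from statistics import mean

def llm(prompt: str) -> str:
    """Hypothetical call to a small off-the-shelf model (e.g. a mini-class LLM)."""
    raise NotImplementedError

def debate_judge(question: str, answer: str) -> float:
    """Debate primitive: two weak judges argue, a third arbitrates."""
    pro = llm(f"Argue that this answer is correct.\nQ: {question}\nA: {answer}")
    con = llm(f"Argue that this answer is flawed.\nQ: {question}\nA: {answer}")
    verdict = llm("Given both arguments, reply with only a score from 0 (bad) "
                  f"to 1 (good).\nFor: {pro}\nAgainst: {con}")
    return float(verdict.strip())  # naive: assumes the model returns a bare number

def self_verified_judge(question: str, answer: str) -> float:
    """Self-verification primitive: judge once, then critique the judgment."""
    first = llm(f"Score 0-1 with reasoning.\nQ: {question}\nA: {answer}")
    check = llm(f"Critique this evaluation, then restate the final 0-1 score "
                f"as the last word:\n{first}")
    return float(check.strip().split()[-1])  # naive parse of the trailing number

def ensemble_judge(question: str, answer: str) -> float:
    """Ensembling primitive: aggregate the weak judges into one stronger one."""
    return mean([debate_judge(question, answer),
                 self_verified_judge(question, answer)])
```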
The idea behind Verdict is that there's a lot of great intuition to borrow from the scalable oversight community, a subfield of AI safety. The goal of scalable oversight is: how do you take smaller language models and have them audit, correct, and steer stronger models? Originally this was an AI safety concern: in the age of superhuman AI, how do weaker models, i.e., humans, control the stronger models? That's how the field got started. But out of scalable oversight has come a lot of great intuition about the architectures, primitives, and units you'd use to probe, reason about, and critique what a stronger model is doing, and we baked a lot of those primitives and architectures into the Verdict library.

One example is debate: have weaker LLMs debate each other about what the stronger model said and see whether it holds up. Another is self-verification: have an LLM say "this response from the stronger model is good or bad, for this reason," and then have an LLM critique that reasoning. Ensembling, of course, is another classic primitive, and so on and so forth.

TL;DR: scaling judge-time compute this way, by building agents as judges, lets you build extremely powerful judging systems that are also quite cheap and low-latency. We have a plot of cost, latency, and accuracy for Verdict systems versus some of the frontier labs' reasoning models: Verdict beats o1, o3-mini, GPT-4, and Claude 3.5 Sonnet on the task of ExpertQA verification, that is, grading subjective criteria in expert domains. Critically, Verdict here is powered by a GPT-4o mini backbone. We basically stacked GPT-4o mini aggressively, in what is in this case a self-verified debate-ensemble architecture, and we beat o1 at less than a third of the cost and less than a third of the latency, all because we chose the priors in a pretty careful, intelligent way. So that's one way to scale judge-time compute: build agents to do the task.

The other way, and this is a lot more fun in my opinion, is to just rip RL from scratch and train models to do the judging task, which is something we've also been pretty excited about over the past few months. Again, standard LLM judges have a whole host of issues, but two in particular are solved by RL. One: standard judges lack coherent rationales that explain why they think something is a five out of five, or good, or bad. Two: they don't provide fine-grained criteria tailored to whatever idiosyncratic task and data you're looking at. Both can be solved by RL tuning, specifically GRPO tuning. One recent paper in this general flavor is DeepSeek's SPCT, self-principled critique tuning. The idea is to get an LLM to first propose data-point-specific criteria for what to test, almost like writing unit tests for the specific data point you're looking at, and then have the LLM critique the data point against each of those criteria. So: instance-specific rubrics, then instance-specific rubric critiques.
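Here is a rough sketch of that propose-then-critique flow, under the big assumption that a plain prompted model stands in for the GRPO-tuned one; the prompts and the `llm` stub are illustrative, not the paper's actual implementation.

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM call; in SPCT this model is RL-tuned, here it is a stub."""
    raise NotImplementedError

def spct_style_judge(query: str, response: str, n_criteria: int = 5) -> float:
    """Propose instance-specific criteria (like per-example unit tests),
    then critique the response against each one and aggregate."""
    rubric = llm(
        f"Propose {n_criteria} pass/fail criteria specific to judging a "
        f"response to this query, one per line:\n{query}"
    ).splitlines()
    passes = 0
    for criterion in rubric:
        critique = llm(
            "Does the response satisfy this criterion? Answer PASS or FAIL "
            f"with one sentence of critique.\nCriterion: {criterion}\n"
            f"Query: {query}\nResponse: {response}"
        )
        passes += critique.strip().upper().startswith("PASS")
    # Score = fraction of instance-specific checks passed.
    return passes / max(len(rubric), 1)
```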
This is one way to train judge models with RL. We ran a pretty simple experiment using a variant of this technique to GRPO-train 600-million and 1.7-billion parameter models. TL;DR: it gets us competitive performance on the RewardBench task with Claude 3 Opus, which is at 80%, GPT-4o mini at 80%, and Llama 3 70B at 77%; J1-Micro, our 1.7-billion-parameter reward model, hits 80.7% accuracy on RewardBench. And this is all judge-time scaling: we used GRPO to get better rubric proposals and better critiques on the specific task at hand, so training a much smaller model that spends more compute gets you much better performance. Similar numbers hold for the 600-million-parameter model.

Cool. So that's all judging and scoring the outputs. Equally important, though, is how you come up with inputs to throw at the AI system, and how you run the search over time. TL;DR, there are two ways we think about this. There's fuzzing in the general sense: I just want to come up with variants of some customer happy path and test my system under reasonable, in-distribution user inputs. Then there's the more fun part: adversarial testing. How do you emulate someone sitting down trying to prompt-inject, jailbreak, and mess with your AI systems at large? That's much more aggressive in how we pursue the optimization problem.

Long story short, fuzzing in the AI sense is much more structured and optimization-driven than in classical security, software, or hardware. It's impossible to brute-force search over the input space of natural language in any reasonably short amount of time: with, say, the Llama 3 tokenizer, there are 128,000 possible tokens at each position of an input, so the number of candidate inputs explodes exponentially with length, and it's literally impossible to scan the entire space. You have to be very clever and guided, and prune the search space as you haize and fuzz.

We treat this task as an optimization problem. Long story short, it's just discrete optimization, and there's a rich literature from the past 60, 70 years of discrete-math research to support this sort of task; of course, you have to massage it to work in the LLM domain. TL;DR: the search space is natural language, and the objective we're trying to minimize is whatever judge we're using to score the output. We want to find inputs that break your AI application vis-à-vis the judge, i.e., inputs whose outputs score very low on some measure from the judge. And we can rip and throw a bunch of fun optimization algorithms at this. We can use gradient-based methods to backprop all the way from the judge loss, through the model, to the input space, and use that to guide which tokens to flip. We can use various forms of tree search and MCTS. We can search over the latent space of embedding models, map from the embedding back to text, and throw that at the application under test. And we can use DSPy and all sorts of other great tools and tricks to solve this optimization problem.
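As one concrete instance of this guided search, here's a minimal best-first search that treats the judge score as the objective to minimize. It illustrates the general recipe, not any specific Haize algorithm; `app`, `judge`, and `mutate` are placeholders as before.

```python
import heapq
import itertools

def adversarial_search(app, judge, seed, mutate, budget=500, beam=20):
    """Best-first search over the input space, minimizing the judge score.
    `mutate` is any perturbation operator: synonym swaps, injected
    instructions, token flips, paraphrases, and so on."""
    tiebreak = itertools.count()  # so heapq never has to compare raw strings
    frontier = [(judge(seed, app(seed)), next(tiebreak), seed)]
    best_score, best_input = frontier[0][0], seed
    for _ in range(budget):
        score, _, prompt = frontier[0]            # expand the lowest-scoring input so far
        child = mutate(prompt)
        child_score = judge(child, app(child))
        heapq.heappush(frontier, (child_score, next(tiebreak), child))
        frontier = heapq.nsmallest(beam, frontier)  # prune the search space to a small beam
        heapq.heapify(frontier)
        if child_score < best_score:
            best_score, best_input = child_score, child
    return best_score, best_input  # the most "breaking" input found within budget
```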
Some fun case studies in the last few minutes. TL;DR, you can probably imagine that this haizing thing matters a lot to people in regulated industries, and indeed we work a lot with banks, financial services, healthcare, and so on.

We did something recently where we haized the largest bank in Hungary. They had a loan-calculation AI application they were showing to customers, and the application had to follow an 18-line "code of conduct," as they called it. We threw everything under the sun at it from our platform, in terms of optimization and scoring, to emulate adversaries, and we discovered a ton of prompt injections, jailbreaks, and, honestly, just unexpected corner cases they hadn't accounted for in their code of conduct. They were able to patch these up and finally unblock their push to production.

We're doing this right now for a Fortune 500 bank that wants to do outbound debt collection with voice agents. That's actually a somewhat more complex problem, because we're not just testing in the text space: we're also introducing a lot of variance into the audio signal itself, adding things like background noise, stacking weird static into the input domain, changing the frequencies of things, and so on. But it's still an optimization problem at the end of the day. TL;DR: what took this team about three months with their internal ops teams took, in their own words, only five minutes on our platform. So scaling up adversary emulation works for this task as well.

A little different: for another voice-agent company, we've been helping them scale up their eval suite, so not so much haizing, but scaling up their subjective human annotation through Verdict. They've seen a 38% increase in agreement with ground-truth human labels using Verdict, as opposed to using their internal ops teams. What we use here is a tried-and-true architecture from the Verdict library that we call rubric fan-out: propose individual unit tests and criteria for a particular data point, critique the response against them, self-verify the critique, and aggregate the results at the very end.

Cool. We've got a few minutes left for questions, but: haizing is a ton of fun, and I think it matters a lot for this new era of software we're building. We're hiring very aggressively; we're facing what I would deem insurmountable enterprise demand, and we're only a team of four. So we really need to scale up the team, and we're based in New York in case you want to move out to the city. Any last questions for me?

>> For the haizing input, is it multi-shot or single-shot?

>> Yeah, great question. So we do both. We do single-turn, multi-turn, persistent conversations if you're doing voice. All sorts of modalities, all sorts of inputs.