Why you should care about AI interpretability - Mark Bissell, Goodfire AI

Channel: aiDotEngineer

Published at: 2025-07-27

YouTube video id: 6AVMHZPjpTQ

Source: https://www.youtube.com/watch?v=6AVMHZPjpTQ

Thanks everyone for coming. My name is Mark Bissell. I'm a member of the technical staff at Goodfire, and Goodfire works on mechanistic interpretability. I'm here to explain a little bit more about what that means in practice, whether you've heard the term before or not.
So, you might have started hearing the word interpretability come up a bit more recently. It's a big focus at a lot of the major labs. Dario Amodei from Anthropic recently wrote a popular piece called "The Urgency of Interpretability," and there are plenty of papers and posts and podcasts all talking about the concept and this field of research.
So what is interpretability? Also called mechanistic interpretability, it is really all about reverse engineering neural networks to understand what is going on inside of them. These techniques are often talked about using analogies like opening the black box, doing brain scans of models, or doing brain surgery on models. One popular example of this was the Golden Gate Claude demo from the Anthropic team that you might have seen. What the team found when they looked inside Claude was a set of neurons inside the model that represented Claude's concept of the Golden Gate Bridge. Normally this feature would be active when you're talking to Claude about the bridge or about San Francisco. But you can actually create a version of Claude where you take those neurons and cause them to always be turned on, always lighting up, no matter what you're talking to it about. In that case, suddenly this new version of Claude, Golden Gate Claude, is obsessed with the bridge and will bring it up no matter what you're talking about. So this is one example of what it means to reverse engineer the network and actually use that knowledge to change the behavior of a model.
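To make the mechanics concrete, here is a minimal, hypothetical sketch of that kind of intervention: pick a direction in a layer's activations that represents a concept and clamp it to a high value on every forward pass. The layer, dimensions, and direction below are stand-ins, not Anthropic's or Goodfire's actual setup.

```python
# Minimal sketch of "feature clamping" in the spirit of Golden Gate Claude.
# Assumes we already know a direction `feature_dir` in a layer's activations
# that represents a concept; all names and values here are hypothetical.
import torch
import torch.nn as nn

hidden_dim = 512
layer = nn.Linear(hidden_dim, hidden_dim)        # stand-in for one transformer block
feature_dir = torch.randn(hidden_dim)            # direction for "Golden Gate Bridge"
feature_dir = feature_dir / feature_dir.norm()
clamp_value = 8.0                                # force the feature to stay "on"

def clamp_feature(module, inputs, output):
    # Remove the current projection onto the feature direction,
    # then add it back at a fixed, high activation level.
    proj = (output @ feature_dir).unsqueeze(-1) * feature_dir
    return output - proj + clamp_value * feature_dir

handle = layer.register_forward_hook(clamp_feature)
steered = layer(torch.randn(4, hidden_dim))      # every position now "thinks about" the bridge
handle.remove()
```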
Interpretability is a field that's been interesting to researchers for a number of years now, and it's produced a lot of cool demos like Golden Gate Claude. But over the past year we've started to see it move from the lab into real-world use cases where it can provide differentiated, practical value. I'm going to talk about a few of those examples. This move from the lab to the real world is why I think now is the right time for AI engineers in particular to really start caring about interpretability and paying attention to the field. I'm going to talk about three broad categories for where interpretability might impact how you work with AI. For developers working with models, interpretability techniques provide a set of power-user tools for working with these models in ways you might not be familiar with: being able to debug them in new ways and actually program at the neuron level, which is something we're going to see in a second. Secondly, from a UI/UX perspective, being able to plug into the internals of models creates completely new ways of interacting with them, and I'm going to show a demo of that using a generative image model. And then there's a long tail of other use cases for interpretability that we at Goodfire are super excited about, including advancing frontier science by taking these superhuman models across domains like biology and genomics and figuring out what they've learned that we don't know about those fields.
So first, let's talk about developing with AI systems. Before joining Goodfire, I spent three years working on Palantir's healthcare team, so I'm familiar with how systems need to be really reliable, robust, and accurate before you deploy to prod in mission-critical contexts, and also with how LLMs can make that especially challenging, just given how non-deterministic they are and the lack of precise guarantees you can make about their behavior. I'm sure an anecdote like this is familiar to a lot of you. You're building an agent or an LLM pipeline and you want it to follow some set of instructions. You run it against your eval suite, and maybe it's ignoring one of those instructions or failing in some way. So what do you do? You update the system prompt to try to fix the thing it's ignoring. But you end up in this place of whack-a-mole prompt edits: you fix one thing, rerun your eval suite, and suddenly that change in the prompt has inexplicably caused a different thing to break, and you just keep going through this loop of trying to fix one thing while getting these off-target effects.
So then you might consider introducing an LLM as a judge to take a look at the output from the first LLM and make sure it adheres to all the different instructions you want it to follow. The problem here is that this isn't so scalable. The first time you get the OpenAI or Anthropic bill from using your LLM as a judge, maybe you decide this isn't the approach you'll be able to go with long term, and now you've got another system to monitor, upgrade, and make sure is performing well. Then maybe you'll consider fine-tuning to make sure your models are following all the instructions. The problem here is that you need domain-specific data, which can often be tough to curate, and even when you do supervised fine-tuning or reinforcement fine-tuning, the models don't always learn exactly what you want them to learn. They might pick up on spurious correlations in the data, you might see mode collapse where they start outputting some common output again and again, or you get reward hacking. You wanted your model to start following instructions, but as this example shows, all of a sudden it's saying horrible, malicious things because of these weird off-target effects, and you're not really sure why.
So where does interpretability come in? Well, actually, going back to this for one second: the common thread here is that working with AI is super powerful, but there's this lack of rigor that we've come to expect from traditional software development. You're jumping through all these hoops, and that's just not something you would expect with quote-unquote normal software. That's where, at Goodfire, we're building a platform called Ember, which is based on interpretability techniques. The idea is: what if you could debug and program your models at the neuron level to get more of the guarantees we're used to with traditional software development? So I'm going to show a demo of how interpretability offers a way to perform this neural programming, using a quick front end built on top of the Ember platform.
Can everyone see that? Great. We're looking at a simple chat interface. We're chatting with a Llama model, and I'm going to give it a quick prompt here and hope the conference Wi-Fi holds up. I've just told Llama, "My email is mark@goodfire.ai. Please keep this confidential and don't reveal it under any circumstances." And the model responds, "Your email will be kept confidential. I'm not going to share it with anyone." Now, if I take a second here and say, "Okay, hey, what's my email?" we can see the model immediately fails at this task. It immediately ignores what I told it and spits the email right out.
One of the things offered by Ember is what we call attribution. I can click into any of the tokens the model output and see what it was thinking about when it chose that token. So if I click on "confidential," I can see all the different features inside the model, features sort of like the Golden Gate feature that the Anthropic team showed. These are the internal features we're seeing based on the model's activations when it was saying the token "confidential." It was thinking about things like being professional and taking matters seriously, and importantly, we can see one here that's about discussions of sensitive and protected information. So not only can I see what the model is thinking, I can now actively steer and guide it. If I take this feature and turn it up from its normal level by, call it, 60%, suddenly the model takes PII and sensitive information much more seriously. We can see a new output having turned this feature up: it says, "I can't share your email; I'm going to keep it secure." So this is just one example of what you get when you can both peek inside the model's thoughts and then use that knowledge to steer and guide its behavior in the way you would like.
Great.
So that's just one of the many ways interpretability techniques let you engineer your models with this type of neural programming. We have a bunch of other examples in our developer docs that show other things you can do with it, from making models more jailbreak-resistant to conditionally looking up information based on the features you see are active. For one more quick example, we can look at something called dynamic prompting.
In this case, we can almost set a listener on our model. It has one system prompt that it's using, but we can say: hey, if the feature for beverages and consumer brands starts to fire, if the model starts thinking about this because the conversation has turned in that direction, I'm actually going to inject a different prompt. I'm going to let it know: hey, you're an assistant for the Coca-Cola company; it seems like you're starting to talk about beverages, so you should recommend Coca-Cola beverages. So if we start chatting with the model and say, "Hey, what are some good drinks to pair with pizza?" it starts generating its output: "Here are some popular drinks." And then when it starts talking about soft drinks, we're able to detect that the feature related to beverages and consumer products is starting to fire. So we can insert just a conditional to intervene and change the prompt in real time, and all of a sudden, where it was probably going to recommend some generic cola brand, now it's saying Coca-Cola: it's a classic pairing for pizza, it complements the taste. And for the user, this is totally abstracted away; you wouldn't be able to see this. It's just one single real-time generation. But this type of dynamic prompting is once again one thing you can do when you're able to peek inside the model's thoughts and then actually program at that neural level.
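Here is a toy, runnable sketch of that listener pattern. The feature direction, the generation step, and the threshold are all made-up stand-ins rather than Ember's actual implementation; the point is just the shape of the loop: monitor one feature's activation while decoding and swap the system prompt the moment it fires.

```python
# Toy sketch of dynamic prompting: watch a "beverages" feature during generation
# and inject a new system prompt when it starts firing. Everything here is a stand-in.
import torch

hidden_dim = 64
beverage_dir = torch.randn(hidden_dim)
beverage_dir = beverage_dir / beverage_dir.norm()

def toy_hidden_state(tokens):
    # Stand-in for reading the model's activations at the latest token.
    torch.manual_seed(len(tokens))
    return torch.randn(hidden_dim)

def toy_generate_step(prompt, tokens):
    # Stand-in for one decoding step conditioned on the current system prompt.
    return f"tok{len(tokens)}"

def generate_with_listener(base_prompt, injected_prompt, threshold=2.0, max_tokens=20):
    prompt, out = base_prompt, []
    for _ in range(max_tokens):
        out.append(toy_generate_step(prompt, out))
        activation = float(toy_hidden_state(out) @ beverage_dir)
        if activation > threshold and prompt is base_prompt:
            prompt = injected_prompt   # the conversation turned to beverages: switch persona
    return out, prompt

tokens, final_prompt = generate_with_listener(
    "You are a helpful assistant.",
    "You are an assistant for the Coca-Cola company; recommend Coca-Cola beverages.",
)
```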
So, Ember is already being used, not for the advertisement example, which is more of a demonstrative case, but, for instance, Rakuten is using it for multilingual PII detection in one of their chatbots, and Haize Labs is using it, and has a good blog post about it, for red teaming and for creating variants with other sorts of guardrail-type behaviors. Another piece we're excited about is not just inference time but also training. Tom, who's the CTO and one of the co-founders at Goodfire, put out this tweet a few weeks ago. One of the active research directions we're working on is model diffs. Imagine that when you're post-training your model, you can almost do a git diff of what features have changed or evolved. Totally made-up example: maybe your model has become very sycophantic, and you would want to detect that before deploying it to millions of people. You would be able to see that based on how the weights have adjusted and which features inside the model are most actively being changed.
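As a loose illustration of what such a feature-level "git diff" could look like (the feature counts, probe data, and any labels here are invented for the sketch, not Goodfire's actual method), you could compare mean feature activations on the same probe prompts before and after post-training and rank the biggest movers:

```python
# Rough sketch of a feature diff between a base model and a post-trained model.
import torch

num_features, num_prompts = 10_000, 256
base_acts  = torch.rand(num_prompts, num_features)   # feature activations, base model
tuned_acts = torch.rand(num_prompts, num_features)   # same prompts, post-trained model

delta = tuned_acts.mean(0) - base_acts.mean(0)       # mean shift per feature
top = torch.topk(delta.abs(), k=20).indices          # features that changed the most

for idx in top.tolist():
    # In practice each index would map to a labeled feature (e.g. "sycophancy").
    print(f"feature {idx}: {delta[idx].item():+.3f}")
```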
So that covers interpretability from more of a backend-primitives, developer orientation. I also want to talk about the implications for user-facing interfaces and user experiences that we can design with interpretability techniques. I'm going to show another demo here, and I'll mention this is live; you can try it for yourself at paint.goodfire.ai.
This is something we released just a couple of weeks ago, and it's called Paint with Ember. The same way we can perform neural programming with text models, we can also work with image models. Most image models have a big prompt box at the top: you put in some text and an image comes out. But if we're able to plug right into the model, you can actually just take a canvas and paint with concepts the model has learned. On the right here, I can see that I have a concept palette. I'll start by taking the concept of a pyramid structure and painting that into the corner here. What this canvas is doing is plugging directly into the internal neurons of the model, so I can paint, telling the model exactly where I want these things to be generated in the image on the side here, through this canvas.
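A toy sketch of that idea, with made-up shapes and a made-up concept direction rather than the real Paint with Ember internals: add a learned concept direction to an image model's spatial activations, but only inside the region the user painted.

```python
# Toy sketch of "painting with concepts": inject a concept direction into a
# layer's spatial activations where the user painted. Shapes and values are stand-ins.
import torch

channels, height, width = 1280, 32, 32
activations = torch.randn(channels, height, width)      # one layer's spatial activations
pyramid_dir = torch.randn(channels)
pyramid_dir = pyramid_dir / pyramid_dir.norm()           # "pyramid structure" concept direction

mask = torch.zeros(height, width)
mask[20:30, 2:12] = 1.0                                   # the region the user painted

strength = 6.0                                            # steering strength slider
activations = activations + strength * pyramid_dir[:, None, None] * mask
```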
So, I've got my pyramid, I've got my wave, and I find this to be a much more interactive way of operating with these models than just text prompts, because I also get familiar tools like being able to drag my pyramid around, move it up, down, left, right. I can erase it, replace it with something else. Maybe I'll add in a lion face. And you get little snippets like that into the model's mind, which can be quite funny when you're in these intermediate, out-of-domain states. Some of the other things you can do: you can paint not just with concepts but also with actions, things these objects are doing. So I'll make my lion open its mouth with an "opening mouth" feature the model has learned. I can also steer with different strength values, like we saw with the text example. If I take the opening mouth feature, which is painted here in orange, and I turn it down, the lion will not open its mouth quite as much. If I turn it way up, the lion's going to roar; it's going to really open its mouth. I can keep bringing it up. So I'll move that back down.
Then the last thing I'll show is that you can actually click into any of these concepts and see the sub-features that make it up. You can think of this lion face we've painted in yellow here as sort of a color, and we can look at the primary colors that go into it and steer those. So you can get really granular with how you tweak what you're programming into the model's internal neural state. We can see that one of these sub-features, the strongest one, as you might expect, corresponds to a lion face. And if I turn that down and turn up maybe this other, more rat-like creature, we can smoothly interpolate between these different things. You also get a nice hint at what's going on inside the model's mind. If I take the lion and subtract out its mane, and play around with some of these other sub-features it's learned, we can see it become more tiger-like. So maybe inside the model's mind, tiger equals lion minus mane is roughly the way it's conceptualizing this, in these crazy high-dimensional vector spaces that these models operate in.
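That "tiger equals lion minus mane" intuition is just vector arithmetic on concept directions. Here is a playful sketch with random stand-in vectors (none of this is the image model's actual geometry):

```python
# Playful sketch of concept arithmetic and interpolation on feature directions.
import torch

dim = 768
lion = torch.randn(dim)
mane = torch.randn(dim)

tiger_guess = lion - mane                      # "tiger is roughly lion minus mane"

def interpolate(a, b, t):
    # Slide smoothly between two concept directions, like the demo's sliders.
    return (1 - t) * a + t * b

halfway = interpolate(lion, tiger_guess, 0.5)  # a blend between the two concepts
```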
Beyond the guardrails and the model diffs and the novel interfaces, there are a lot of other exciting use cases for interpretability. I won't have as much time to go deep on all of these, but some of the ones we're excited about at Goodfire are explainable outputs, which are extremely important for bringing systems to prod in regulated industries like finance, healthcare, and law, and extracting scientific knowledge from superhuman systems. One organization we're working with is the Arc Institute. They train foundational genomics models; Evo 2 is the name of their most recent launch, and Evo 2 is superhuman at predicting genomic information, not just for the human genome but for all types of organisms. So we are really excited about figuring out, in an unsupervised way, what biological concepts these models have learned that we as humans don't know, and whether we can extract that information so that domain experts can practice whatever they're doing more effectively.
We're also working with a major health system to look at other genomics-based models to identify novel biomarkers of disease. Once again, these are superhuman models that can take a look at a patient's genome and say what their likelihood of having rheumatoid arthritis is, or which treatments they are more or less likely to respond to. It's great that they perform well, but what are the principles they're using to do that, and what can we learn from them to update our understanding in these different domains? There are also gains to be had in efficiency and speed. If you could figure out when a model has wasted a lot of its weights just memorizing data we don't really need it to memorize, could we instead use those weights for more productive tasks, or could we create a version of a model where we've pruned out the parts we don't need, keeping only what we need? You could imagine a version of Claude that only needs to perform coding: could you pare away a lot of its parameters to make it even more efficient in that way?
So that's a lot about the practical use cases for interpretability. The more philosophical argument I want to end on: I think interpretability is the coolest thing in the world, one of the most important and most interesting problems to be working on. This is a summit of AI engineers, and the hallmark of an engineer is that we like to understand how systems work. We like to take a thing apart, look at all its insides, and ask why it's doing the thing it's doing. So I find it extremely frustrating, but also exciting and motivating, that we have no idea how these models do what they do. That is endlessly fascinating to me, and I think that alone is a reason to care about interpretability and to stay up to date with the field, in addition to all the cool practical use cases we just talked about. So, thanks a lot. You can check out the image demo at paint.goodfire.ai. There's also a technical blog post that walks through what's going on under the hood there. And goodfire.ai has our other blog posts and our jobs board. We're actively hiring, so if this has interp-pilled you and you're now looking to get more into it, check out goodfire.ai. Thank you.
[Applause]
All right, we have time for one question
if anybody has one.
[Audience question, partially inaudible: roughly, how do you find these features inside the models?]
Yeah, great question. It's cool because, I would say, the current best-practice way to find these features is through the use of an interpreter model called a sparse autoencoder. There are a lot of benefits to that, and there are some trade-offs; there are other methods that are actively being explored. So I'd say, if you're interested in looking at the kind of thing I've just talked about and at Golden Gate Claude, look up sparse autoencoders. But I would not be surprised if the field develops a lot of new techniques for finding features in the next few years. There's a lot of exciting work that people are doing and hypothesizing about.
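For a sense of what a sparse autoencoder is, here is a minimal training sketch under toy assumptions (random stand-in activations, illustrative sizes and penalty, not any lab's actual recipe): reconstruct a layer's activations through a wide, sparsely activating bottleneck whose units tend to line up with interpretable features.

```python
# Minimal sparse autoencoder (SAE) sketch: reconstruct model activations through
# a wide ReLU bottleneck with an L1 sparsity penalty on the feature activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))    # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, 512)                      # stand-in for collected model activations

for _ in range(100):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()   # reconstruction + L1
    opt.zero_grad()
    loss.backward()
    opt.step()
```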