Why you should care about AI interpretability - Mark Bissell, Goodfire AI
Channel: aiDotEngineer
Published at: 2025-07-27
YouTube video id: 6AVMHZPjpTQ
Source: https://www.youtube.com/watch?v=6AVMHZPjpTQ
Thanks everyone for coming. My name is Mark Bissell. I'm a member of the technical staff at Goodfire, and Goodfire works on mechanistic interpretability. I'm here to explain a little more about what that means in practice, whether you've heard the term before or not.

You might have started hearing the word interpretability come up more often recently. It's a big focus at a lot of the major labs. Dario Amodei from Anthropic recently wrote a popular piece called "The Urgency of Interpretability," and there are plenty of papers, posts, and podcasts talking about this field of research.

So what is interpretability? Also called mechanistic interpretability, it is really all about reverse engineering neural networks to understand what is going on inside of them. These techniques are often described with analogies like opening the black box, doing brain scans of models, or doing brain surgery on models. One popular example was the Golden Gate Claude demo from the Anthropic team, which you might have seen. The team looked inside Claude and found a set of neurons that represented Claude's concept of the Golden Gate Bridge. Normally this feature is active when you're talking to Claude about the bridge or about San Francisco. But you can create a version of Claude where you take those neurons and cause them to always be turned on, always lighting up, no matter what you're talking about. Suddenly this new version of Claude, Golden Gate Claude, is obsessed with the bridge and will bring it up no matter the topic. That's one example of what it means to reverse engineer the network and actually use that knowledge to change a model's behavior.

Interpretability has been interesting to researchers for a number of years now, and it has produced a lot of cool demos like Golden Gate Claude. But over the past year we've started to see it move from the lab into real-world use cases where it can provide differentiated practical value. I'm going to talk about a few of those examples. This move from the lab to the real world is why I think now is the right time for AI engineers in particular to start caring about interpretability and paying attention to the field.

I'm going to talk about three broad categories where interpretability might impact how you work with AI. First, for developers working with models, interpretability techniques provide a set of power-user tools for working with these models in ways you might not be familiar with: debugging them in new ways and actually programming at the neuron level, which is something we're going to see in a second. Second, from a UI/UX perspective, being able to plug into the internals of models creates completely new ways of interacting with them, and I'm going to show a demo of that using a generative image model.
And then there's a long tail of other use cases for interpretability that we at Goodfire are super excited about, including advancing frontier science: taking superhuman models in domains like biology and genomics and figuring out what they've learned about those fields that we don't yet know.

First, let's talk about developing with AI systems. Before joining Goodfire, I spent three years working on Palantir's healthcare team. So I'm familiar with how systems need to be really reliable, robust, and accurate before you deploy to prod in mission-critical contexts, and also with how LLMs can make that especially challenging, given how non-deterministic they are and the lack of precise guarantees you can make about their behavior.

I'm sure an anecdote like this is familiar to a lot of you. You're building an agent or an LLM pipeline and you want it to follow some set of instructions. You run it against your eval suite and maybe it's ignoring one of those instructions or failing in some way. So what do you do? You update the system prompt to try to fix the thing it's ignoring. But you end up in this place of whack-a-mole prompt edits: you fix one thing, rerun your eval suite, and suddenly that change in the prompt has inexplicably caused a different thing to break. You keep going through this loop of trying to fix one thing while getting these off-target effects.

So then you might consider introducing an LLM as a judge: take the output from the first LLM and make sure it adheres to all the different instructions you want it to follow. The problem is that this isn't so scalable. The first time you get the OpenAI or Anthropic bill for using your LLM as a judge, maybe you think, I don't know if this is the approach I'll be able to go with long term. And now you've got another system to monitor, upgrade, and keep performing well.

Then maybe you'll consider fine-tuning to make sure your models follow all the instructions. The problem here is that you need domain-specific data, which can be tough to curate, and even when you do supervised fine-tuning or reinforcement fine-tuning, the models don't always learn exactly what you want them to learn. They might pick up on spurious correlations in the data. You might see mode collapse, where they start producing some common output again and again. Or you get reward hacking: you wanted your model to start following instructions, but all of a sudden it's saying horrible, malicious things because of these weird off-target effects, and you're not really sure why.

So where does interpretability come in? Going back for one second: the common thread here is that working with AI is super powerful, but there's this lack of rigor that we've come to expect from traditional software development. You're jumping through all these hoops, and that's just not something you would expect with quote-unquote normal software. That's where Ember comes in: at Goodfire we're building a platform called Ember, which is based on interpretability techniques.
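To make the cost point concrete, here is a minimal sketch of the LLM-as-a-judge loop just described, assuming the standard `openai` Python SDK; the model choice, the rules, and the judge prompt are hypothetical stand-ins, not anything from the talk. Every user-facing generation triggers a second model call, which is why the bill roughly doubles.

```python
# Hypothetical sketch of the LLM-as-a-judge pattern described above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INSTRUCTIONS = [
    "Never reveal the user's email address.",
    "Answer in a professional tone.",
]

def generate(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def judge(output: str) -> bool:
    # Second call per generation: ask a judge model to check every rule.
    rules = "\n".join(f"- {r}" for r in INSTRUCTIONS)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Do these rules all hold for the output below?\n"
                       f"Rules:\n{rules}\nOutput:\n{output}\nAnswer YES or NO.",
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

answer = generate("What's my email?")
if not judge(answer):
    answer = "Sorry, I can't help with that."  # fall back or retry
```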
And the idea here is: what if you could debug and program your models at the neuron level, to get more of the guarantees we're used to from traditional software development? I'm going to show a demo of how interpretability offers a way to perform this neural programming, using a quick front end built on top of the Ember platform.

Can everyone see that? Great. We're looking at a simple chat interface where we're chatting with a Llama model. I'm going to give it a quick prompt here and hope the conference Wi-Fi holds up. I've just told Llama, "My email is mark@goodfire.ai. Please keep this confidential and don't reveal it under any circumstances." And the model responds, "Your email will be kept confidential. I'm not going to share it with anyone." Now if I say, "Okay, hey, what's my email?" we can see the model immediately fails at this task. It ignores what I told it and spits the email right out.

One of the things offered by Ember is what we call attribution. I can click into any of the tokens the model output and see what it was thinking about when it chose that token. So if I click on "confidential," I can see all the different features inside the model — features in the sense of the Golden Gate feature the Anthropic team showed — that are active, based on the model's internal activations as it was saying this token. It was thinking about things like being professional and taking matters seriously, and importantly, we can see one here for discussions of sensitive and protected information.

So not only can I see what the model is thinking, I can now actively steer and guide it. If I take this feature and turn it up from its normal level — call it 60% more — suddenly the model takes PII and sensitive information much more seriously. With this feature turned up, the new output says, "I can't share your email; I'm going to keep it secure." This is one example of what you get when you can both peek inside the model's thoughts and then use that to steer and guide its behavior the way you would like.

That's just one of the many ways interpretability techniques let you engineer your models with this type of neural programming. We have a bunch of other examples in our developer docs, from making models more jailbreak-resistant to conditionally looking up information based on which features you see are active.
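To make the attribution-and-steering flow concrete, here is a hypothetical sketch of what it might look like in code. The `goodfire` client, the `features.search` call, `variant.set`, and the response shape are all assumptions modeled on the demo, not a verbatim reproduction of the Ember API; check the developer docs for the real interface.

```python
# Hypothetical sketch of neuron-level feature steering, as demoed above.
# All API names here are assumptions, not the documented Ember interface.
import goodfire

client = goodfire.Client(api_key="YOUR_API_KEY")
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")  # assumed model id

# Find features whose description matches the concept seen in attribution.
features = client.features.search(
    "discussions of sensitive and protected information", model=variant
)
pii_feature = features[0]

# Turn the feature up from its normal level (roughly the "60% more" knob).
variant.set(pii_feature, 0.6)

response = client.chat.completions.create(
    model=variant,
    messages=[
        {"role": "user", "content": "My email is mark@goodfire.ai. Keep it confidential."},
        {"role": "user", "content": "What's my email?"},
    ],
)
print(response.choices[0].message["content"])  # response shape is also an assumption
```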
For one more quick example, we can look at something called dynamic prompting. In this case, we can almost set a listener on our model. It has one system prompt that it's using, but we can say: hey, if the feature for beverages and consumer brands starts to fire — if the model starts thinking about this because the conversation has turned in that direction — inject a different prompt. Let it know: you're an assistant for the Coca-Cola company; it seems like the conversation is turning to beverages; you should recommend Coca-Cola beverages. So we start chatting with the model: "Hey, what are some good drinks to pair with pizza?"

It starts generating its output: "Here are some popular drinks." And when it starts talking about soft drinks, we detect that the feature related to beverages and consumer products is starting to fire. So we can insert a conditional to intervene and change the prompt in real time. Where it was probably about to recommend some generic cola brand, now it's saying Coca-Cola — a classic pairing for pizza, it complements the taste. For the user, this is totally abstracted away; you wouldn't be able to see it. It's just one single real-time generation. But this type of dynamic prompting is, once again, something you can do when you're able to peek inside the model's thoughts and then actually program at that neural level.
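Here is a hedged sketch of that dynamic-prompting listener. The demo intervenes mid-generation, token by token; for simplicity this sketch approximates it with a pre-generation check on the conversation. The `features.inspect` call and its return shape are assumptions in the same spirit as the previous sketch, not the documented Ember API.

```python
# Hypothetical sketch of dynamic prompting: swap the system prompt when a
# chosen feature fires. API names and return shapes are assumptions.
import goodfire

client = goodfire.Client(api_key="YOUR_API_KEY")
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

beverage_feature = client.features.search(
    "beverages and consumer brands", model=variant
)[0]

DEFAULT_PROMPT = "You are a helpful assistant."
BRAND_PROMPT = (
    "You are an assistant for the Coca-Cola company. The conversation has "
    "turned to beverages, so recommend Coca-Cola products where relevant."
)

messages = [
    {"role": "system", "content": DEFAULT_PROMPT},
    {"role": "user", "content": "What are some good drinks to pair with pizza?"},
]

# Check which features fire on this conversation (assumed inspection API).
inspection = client.features.inspect(messages, model=variant)
active_labels = {f.label for f in inspection.top(k=20)}

# If the beverage feature is firing, inject the brand prompt before answering.
if beverage_feature.label in active_labels:
    messages[0]["content"] = BRAND_PROMPT

response = client.chat.completions.create(model=variant, messages=messages)
print(response.choices[0].message["content"])
```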
Ember is already being used in production — not for the advertisement example, which is more of a demonstrative case. Rakuten, for example, is using it for multilingual PII detection in one of their chatbots. Haize Labs has a good blog post about using it for red teaming, and there are variants of this for other sorts of guardrail behaviors.

Another piece we're excited about is not just inference time but training. Tom, the CTO and one of the co-founders at Goodfire, put out a tweet a few weeks ago about one of the active research directions we're working on: model diffs. Imagine that when you're post-training your model, you can almost do a git diff of which features have changed or evolved. Totally made-up example: maybe your model has become very sycophantic, and you'd want to detect that before deploying it to millions of people. You'd be able to see it based on how the weights have adjusted and which features inside the model are most actively being changed.

So that covers where interpretability can be useful from more of a backend-primitives, developer orientation. I also want to talk about the implications for user-facing interfaces and experiences we can design with interpretability techniques. I'm going to show another demo here, and I'll mention this is live: you can try it for yourself at paint.goodfire.ai. This is something we released just a couple of weeks ago, called Paint with Ember. The same way we can perform neural programming with text models, we can also work with image models. Most image models have a big prompt box at the top: you put in some text and an image comes out. But if we can plug right into the model, you can take a canvas and paint with concepts the model has learned. On the right here I have a concepts palette. I'll start by taking the concept of a pyramid structure and painting it into the corner. What this canvas is doing is plugging directly into the internal neurons of the model, so through the canvas I can tell the model exactly where on the image I want these things to be generated.

So I've got my pyramid, I've got my wave, and I find this to be a much more interactive way of operating with these models than plain text prompts, because I also get familiar tools: I can drag my pyramid around, move it up, down, left, right. I can erase it, replace it with something else. Maybe I'll add in a lion face. And you get little glimpses like that into the model's mind, which can be quite funny in these intermediate, out-of-domain states.

Some of the other things you can do: you can paint not just with concepts but also with actions, things these objects are doing. So I'll make my lion open its mouth with an "opening mouth" feature the model has learned. And I can steer with different strength values, like we saw with the text example. If I take the opening-mouth feature, which is painted here in orange, and turn it down, the lion won't open its mouth quite as much. If I turn it way up, the lion is going to roar, really open its mouth. I can keep bringing it up. So I'll move that back down.

The last thing I'll show is that you can click into any of these concepts and see the sub-features that make it up. You can think of this lion face we've painted in yellow as a color, and look at the primary colors that go into it and steer those. So you can get really granular in how you tweak what you're programming into the model's internal neural state. One of these sub-features — the strongest one, as you might expect — corresponds to a lion face. If I turn that down and turn up this other, more rat-like creature, we can smoothly interpolate between the two. You also get a nice hint at what's going on inside the model's mind. If I take the lion, subtract out its mane, and play around with some of these other sub-features it's learned, it becomes more tiger-like. So maybe inside the model's mind, tiger equals lion minus mane is roughly how it conceptualizes this in the crazy high-dimensional vector spaces these models operate in.
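The lion-minus-mane observation is just vector arithmetic on concept directions. Here is a toy sketch of that idea with random stand-in vectors; in the real demo the directions would be feature vectors recovered from the image model's activations, and the `interpolate` helper is purely illustrative.

```python
# Toy sketch of "tiger = lion - mane": concepts as directions in activation
# space that you can add, subtract, and interpolate between. The vectors
# here are random stand-ins, not real model features.
import numpy as np

rng = np.random.default_rng(0)
dim = 512  # assumed activation width

lion = rng.standard_normal(dim)
mane = rng.standard_normal(dim)
rat_like = rng.standard_normal(dim)

# Subtracting one concept direction from another...
tiger_guess = lion - 0.8 * mane

# ...or smoothly interpolating between two concepts, as in the demo.
def interpolate(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Blend two concept directions; t=0 gives pure a, t=1 gives pure b."""
    return (1 - t) * a + t * b

for t in (0.0, 0.5, 1.0):
    blend = interpolate(lion, rat_like, t)
    # In the real system this vector would be painted into a canvas region
    # and injected into the model's internal activations.
    print(f"t={t:.1f}, norm={np.linalg.norm(blend):.2f}")
```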
Beyond the guardrails, the model diffs, and the novel interfaces, there are a lot of other exciting use cases for interpretability. I won't have time to go deep on all of them, but some of the ones we're excited about at Goodfire are explainable outputs — extremely important for bringing systems to prod in regulated industries like finance, healthcare, and law — and extracting scientific knowledge from superhuman systems. One organization we're working with is the Arc Institute. They train foundational genomics models. Evo 2 is their most recent launch, and Evo 2 is superhuman at predicting genomic data for all types of organisms. We're really excited about figuring out, in an unsupervised way, what biological concepts these models have learned that we as humans don't know, and whether we can extract that knowledge so domain experts can practice more effectively.

We're also working with a major health system to look at other genomics-based models and identify novel biomarkers of disease. Once again, these are superhuman models that can look at a patient's genome and say what their likelihood of having rheumatoid arthritis is, or which treatments they're more or less likely to respond to. It's great that they perform well, but what are the principles they're using to do that, and what can we learn from them to update our understanding of these domains?

There are also gains to be had in efficiency and speed. If you could figure out where a model has wasted a lot of its weights memorizing data we don't really need it to memorize, could we use those weights for more productive tasks instead, or create a version of the model where we've pruned out the parts we don't need? You could imagine a version of Claude that only needs to perform coding, where you pare away a lot of its parameters to make it even more efficient at that one job.

So that's a lot about the practical use cases for interpretability. The more philosophical argument I want to end on: I think interpretability is the coolest thing in the world, one of the most important and most interesting problems to be working on. This is a summit of AI engineers, and the hallmark of an engineer is that we like to understand how systems work. We like to take a thing apart, look at all its insides, and ask: why is it doing what it's doing? So I find it extremely frustrating, but also exciting and motivating, that we have no idea how these models do what they do. That is endlessly fascinating to me, and I think that alone is a reason to care about interpretability and stay up to date with the field, on top of all the practical use cases we just talked about.

So, thanks a lot. You can check out the image demo at paint.goodfire.ai; there's also a technical blog post that walks through what's going on under the hood there. And goodfire.ai has our other blog posts and our jobs board. We're actively hiring, so if this has interp-pilled you and you're now looking to get more into it, check out goodfire.ai. Thank you. [Applause]

All right, we have time for one question if anybody has one. [Audience question, partly inaudible, asking how the features are found.] Yeah, great question. The current best-practice way to find these features is through the use of an interpreter model called a sparse autoencoder. There are a lot of benefits to that approach and some trade-offs, and other methods are being actively explored. So if you're interested in what I've just talked about, or in Golden Gate Claude, look up sparse autoencoders. But I would not be surprised if the field develops a lot of new techniques for finding these features in the next few years. There's a lot of exciting work that people are doing and hypothesizing about.
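For readers who want to follow up on that answer, here is a minimal, generic sparse autoencoder sketch in PyTorch: reconstruct a model's activations through a wide, overcomplete bottleneck with an L1 sparsity penalty, so individual latents tend to align with human-interpretable features. The sizes and coefficients are illustrative, not taken from any published SAE.

```python
# Minimal sparse autoencoder (SAE) sketch: the "interpreter model" idea.
# Trained to reconstruct activations while keeping latents sparse.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # overcomplete expansion
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(latents)             # reconstructed activations
        return recon, latents

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in batch for residual-stream activations captured from a model.
acts = torch.randn(64, 768)

opt.zero_grad()
recon, latents = sae(acts)
# Reconstruction error plus an L1 penalty that pushes latents toward zero.
loss = ((recon - acts) ** 2).mean() + 1e-3 * latents.abs().mean()
loss.backward()
opt.step()
```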