MIB: A Mechanistic Interpretability Benchmark

Aaron Mueller*,1,2, Atticus Geiger*,3, Sarah Wiegreffe*,4, Dana Arad2, Iván Arcuschin5, Adam Belfki1, Yik Siu Chan6, Jaden Fiotto-Kaufman1, Tal Haklay2, Michael Hanna7, Jing Huang8, Rohan Gupta5, Yaniv Nikankin2, Hadas Orgad2, Nikhil Prakash1, Anja Reusch2, Aruna Sankaranarayanan9, Shun Shao10, Alessandro Stolfo11, Martin Tutek2, Amir Zur3, David Bau1, Yonatan Belinkov2
1Northeastern University    2Technion – IIT    3Pr(Ai)2R Group    4AI2    5Independent    6Brown University    7University of Amsterdam    8Stanford University    9Massachusetts Institute of Technology    10University of Cambridge    11ETH Zürich   

Paper Data Code Leaderboard

Abstract

How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components (and the connections between them) most important for performing a task, e.g., attribution patching or information flow routes. The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of MI methods, and increase our confidence that there has been real progress in the field.

Key Contributions

Motivation

Mechanistic interpretability (MI) methods allow us to understand why language models (LMs) behave the way they do. MI methods have been proliferating quickly, but it is difficult to compare their efficacy. How can we know whether new methods produce real advancements over prior work? We propose MIB as a stable evaluation standard.

Types of MI Methods

We view most MI methods as performing either localization or featurization (or both). We split these two functions into two tracks: the circuit localization track and the causal variable localization track.

Materials

Data

Both tracks evaluate across four tasks, selected to represent a variety of reasoning types, difficulty levels, and answer formats. Two of the tasks, indirect object identification (IOI) and arithmetic, were chosen because they have been extensively studied in prior MI work. The other two, multiple-choice question answering (MCQA) and ARC, were chosen because they have not.

Models

We include models of diverse capability levels and sizes: GPT-2 Small, Qwen-2.5 (0.5B), Gemma-2 (2B), and Llama-3.1 (8B).

Circuit Localization Track

Given a task, a circuit is the subgraph of the model's computation graph that is responsible for performing that task.

Metrics

Past circuit discovery work often evaluates circuits via faithfulness: roughly, how much of the model's task behavior is preserved when only the circuit is kept and the rest of the model is ablated. This is useful for measuring the quality of a single circuit, but how do we measure the quality of a circuit discovery method?

Furthermore, "the circuit for a task" is often used to mean one of two things: (i) the subgraph that is responsible for performing the task well, or (ii) the smallest subgraph that replicates the model's behavior (including its failures).

Thus, we propose two metrics: the integrated circuit performance ratio (CPR; higher is better) and the integrated circuit-model difference (CMD; 0 is best). CPR is the area under the faithfulness curve computed across many circuit sizes, so it rewards circuits that perform the task well. CMD is the area between the faithfulness curve and 1, where 1 indicates that the circuit and the model have exactly the same task behavior (with respect to what is being measured), so it rewards circuits that replicate the model's behavior.
Illustration of CPR (area under the faithfulness curve) and CMD (area between the faithfulness curve and 1).
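As a rough illustration, the sketch below numerically integrates a faithfulness curve to obtain CPR and CMD. It assumes faithfulness has already been evaluated at a grid of circuit sizes; the function names, the trapezoidal rule, and the example numbers are our own simplification and may differ from the benchmark's exact scoring.

```python
import numpy as np

def _trapezoid(y, x):
    # Trapezoidal rule: sum of interval widths times average segment heights.
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def cpr_and_cmd(circuit_sizes, faithfulness):
    """circuit_sizes: fractions of the model kept in the circuit (increasing).
    faithfulness:  faithfulness of the discovered circuit at each size;
                   1.0 means the circuit matches the full model's behavior."""
    sizes = np.asarray(circuit_sizes, dtype=float)
    faith = np.asarray(faithfulness, dtype=float)
    cpr = _trapezoid(faith, sizes)                # area under the curve (higher is better)
    cmd = _trapezoid(np.abs(faith - 1.0), sizes)  # area between the curve and 1 (0 is best)
    return cpr, cmd

# Example with made-up numbers (faithfulness can exceed 1):
print(cpr_and_cmd([0.01, 0.05, 0.1, 0.2, 0.5, 1.0],
                  [0.10, 0.60, 0.85, 0.95, 1.02, 1.00]))
```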


An issue with faithfulness-based metrics is that their lower and upper bounds are not known, which makes absolute scores hard to interpret. Thus, we include a fifth model for this track: an InterpBench model, which we train to contain a known ground-truth circuit. Because we know which edges belong to that circuit, we can compute the AUROC over edges across many circuit sizes.
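For intuition, here is a minimal sketch of the edge-level AUROC computation, assuming each method assigns an importance score to every edge and the ground-truth edge labels come from the InterpBench model; the numbers below are made up.

```python
from sklearn.metrics import roc_auc_score

# 1 if an edge is in the known ground-truth circuit, 0 otherwise.
true_edge_labels = [1, 0, 0, 1, 1, 0, 0, 0]

# Importance scores a circuit-discovery method assigned to the same edges;
# thresholding these at different values yields circuits of different sizes.
predicted_edge_scores = [0.9, 0.1, 0.3, 0.7, 0.2, 0.05, 0.4, 0.0]

# AUROC summarizes, over all thresholds (i.e., circuit sizes), how well the
# method ranks true circuit edges above non-circuit edges.
print(roc_auc_score(true_edge_labels, predicted_edge_scores))
```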

Baselines

We evaluate a variety of methods, including activation patching with various ablations (e.g., counterfactual activations), attribution patching (AP) and attribution patching with integrated gradients (AP-IG), information flow routes, and mask-optimization methods such as UGS, applied at both the node and edge level.
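To make the attribution-based methods concrete, here is a minimal node-level sketch of attribution patching in PyTorch. It assumes each hooked module returns a single tensor and that `metric_fn` maps model outputs to a scalar task metric (e.g., a logit difference); edge-level variants and MIB's actual implementations differ in detail.

```python
import torch

def attribution_patching_scores(model, hooked_modules, clean_inputs, corrupt_inputs, metric_fn):
    """Node-level attribution patching (a sketch).

    The effect of patching a component's clean activation with its corrupted
    activation is approximated to first order as
        (corrupt_act - clean_act) * d(metric)/d(clean_act), summed over dimensions.
    """
    clean_acts, corrupt_acts = {}, {}

    def cache_into(store):
        def hook(module, inputs, output):
            store[module] = output
        return hook

    # 1) Forward pass on corrupted inputs: cache activations, no gradients needed.
    handles = [m.register_forward_hook(cache_into(corrupt_acts)) for m in hooked_modules]
    with torch.no_grad():
        model(corrupt_inputs)
    for h in handles:
        h.remove()

    # 2) Forward pass on clean inputs: cache activations and retain their gradients.
    handles = [m.register_forward_hook(cache_into(clean_acts)) for m in hooked_modules]
    outputs = model(clean_inputs)
    for h in handles:
        h.remove()
    for m in hooked_modules:
        clean_acts[m].retain_grad()
    metric_fn(outputs).backward()

    # 3) First-order importance score per component.
    return {
        m: ((corrupt_acts[m] - clean_acts[m].detach()) * clean_acts[m].grad).sum().item()
        for m in hooked_modules
    }
```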

Results

CMD scores (closer to 0 is better) and AUROC scores (for InterpBench only; higher is better).
Attribution patching with integrated gradients (*AP-IG) outperforms attribution patching (*AP) and most other methods.
Edge-level circuits (E*) outperform node-level circuits (A*).
Patching with activations from counterfactual inputs (CF) outperforms other common patching methods.
UGS, a mask-learning method, performs well.

Causal Variable Localization Track

In this track, the goal is to align model representations with specific known causal variables.

Submissions

A submission aligns a causal variable from a high-level causal model that solves the task with features of a hidden vector in the LM. For each layer, a submission provides the hidden vector to intervene on, a featurizer that maps the vector into a feature space (together with its inverse), and the set of features to which the variable is aligned.
Example of an alignment that would be provided in a submission. In the arithmetic task, the high-level causal model has a carry-the-one variable that is aligned with features of a hidden vector.
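As a minimal sketch of what a submission might contain, assuming a rotation-based (DAS-style) linear featurizer; the class, dictionary keys, and layer/feature indices are illustrative, not the benchmark's actual submission format.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class LinearFeaturizer(nn.Module):
    """A rotation-based featurizer with an exact inverse (a sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Constrain the linear map to be orthogonal so the inverse is the transpose.
        self.rotation = orthogonal(nn.Linear(hidden_size, hidden_size, bias=False))

    def featurize(self, hidden_vec: torch.Tensor) -> torch.Tensor:
        return self.rotation(hidden_vec)        # features = h @ W^T

    def defeaturize(self, features: torch.Tensor) -> torch.Tensor:
        return features @ self.rotation.weight  # h = f @ W (W is orthogonal)


# A hypothetical alignment for the carry-the-one variable: intervene on the
# hidden vector at layer 7, and align the variable with features 0-15.
hidden_size = 2048
submission = {
    "layer": 7,
    "featurizer": LinearFeaturizer(hidden_size),
    "feature_indices": list(range(16)),
}
```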

Metrics

We want to evaluate the quality of a featurizer, a transformation of the activations that makes it easier to isolate the desired causal variable. For this, we typically use interchange intervention accuracy.
Interchange interventions are used to evaluate submissions. The causal model and the language model are both run on a base input; then a variable in the causal model and the aligned features in the language model are both set to the values they would take if a counterfactual input were used instead. The outputs of the causal model and the language model are compared; the more similar the outputs, the more accurately the causal model abstracts the language model.
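A minimal sketch of a single interchange intervention, assuming a PyTorch LM, the hypothetical `LinearFeaturizer` from above, and a hook point (`layer_module`) that returns the relevant hidden vector; this illustrates the procedure, not MIB's exact evaluation harness.

```python
import torch

def interchange_intervention(model, layer_module, featurizer, feature_indices,
                             base_inputs, counterfactual_inputs):
    """Patch the aligned features from a counterfactual run into a base run."""
    cached = {}

    def cache_hook(module, inputs, output):
        cached["source"] = output.detach()

    # 1) Run on the counterfactual input and cache the hidden vector.
    handle = layer_module.register_forward_hook(cache_hook)
    with torch.no_grad():
        model(counterfactual_inputs)
    handle.remove()

    def patch_hook(module, inputs, output):
        # Move both activations into feature space, swap the aligned features,
        # then map back into the model's hidden space.
        base_feats = featurizer.featurize(output)
        cf_feats = featurizer.featurize(cached["source"])
        base_feats[..., feature_indices] = cf_feats[..., feature_indices]
        return featurizer.defeaturize(base_feats)

    # 2) Run on the base input with the aligned features overwritten.
    handle = layer_module.register_forward_hook(patch_hook)
    with torch.no_grad():
        intervened_output = model(base_inputs)
    handle.remove()

    # The output is compared against what the high-level causal model predicts
    # under the same intervention; matches count toward interchange intervention accuracy.
    return intervened_output
```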

Baselines

We evaluate a mixture of supervised and unsupervised, as well as parametric and non-parametric, methods, including DAS, learned masks over basis-aligned dimensions and over principal components, and SAEs. As a naive baseline, we compare to using no featurizer at all (i.e., the full untransformed hidden vector).
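For concreteness, the sketch below shows what these baselines look like as featurize/defeaturize pairs. It is a simplified illustration under our own assumptions (random SAE weights shown only for shape, a random tensor standing in for cached activations), not the benchmark's implementation.

```python
import torch
import torch.nn as nn

hidden_size = 2048

# Neurons (the naive baseline): no featurizer, i.e., the identity map;
# features are simply the basis-aligned dimensions of the hidden vector.
neuron_featurize = lambda h: h
neuron_defeaturize = lambda f: f

# Principal components: project hidden vectors onto a PCA basis fit on a
# (hypothetical) cache of activations collected from the model on the task.
activations = torch.randn(10_000, hidden_size)             # stand-in for cached activations
mean = activations.mean(dim=0)
_, _, components = torch.pca_lowrank(activations, q=256)   # components: (hidden_size, 256)
pca_featurize = lambda h: (h - mean) @ components
pca_defeaturize = lambda f: f @ components.T + mean

# SAE: features are the hidden units of a pretrained sparse autoencoder;
# the random weights here only stand in for a loaded encoder/decoder.
sae_enc = nn.Linear(hidden_size, 8 * hidden_size)
sae_dec = nn.Linear(8 * hidden_size, hidden_size)
sae_featurize = lambda h: torch.relu(sae_enc(h))
sae_defeaturize = lambda f: sae_dec(f)   # approximate inverse via reconstruction

# DAS: a learned orthogonal rotation (cf. LinearFeaturizer above), trained with
# supervision so that interchange interventions on a chosen feature subspace
# reproduce the counterfactual behavior predicted by the causal model.
```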

Results

The supervised features learned by DAS generally perform best.
Learned masks over basis-aligned dimensions of hidden vectors and over principal components are also strong.
SAEs fail to provide a better unit of analysis than basis-aligned dimensions, except for the continent causal variable of the RAVEL task in the Gemma-2 model.
SAEs are high-variance: sometimes they approach the performance of the best methods, and sometimes that of the worst.

How to cite

Bibliography

Aaron Mueller*, Atticus Geiger*, Sarah Wiegreffe*, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov, “MIB: A Mechanistic Interpretability Benchmark”. CoRR, arXiv:2504.13151, 2025.

BibTeX

@article{mib-2025,
	title = {{MIB}: A Mechanistic Interpretability Benchmark},
	author = {Aaron Mueller and Atticus Geiger and Sarah Wiegreffe and Dana Arad and Iv{\'a}n Arcuschin and Adam Belfki and Yik Siu Chan and Jaden Fiotto-Kaufman and Tal Haklay and Michael Hanna and Jing Huang and Rohan Gupta and Yaniv Nikankin and Hadas Orgad and Nikhil Prakash and Anja Reusch and Aruna Sankaranarayanan and Shun Shao and Alessandro Stolfo and Martin Tutek and Amir Zur and David Bau and Yonatan Belinkov},
	year = {2025},
	journal = {CoRR},
	volume = {arXiv:2504.13151},
	url = {https://arxiv.org/abs/2504.13151v1}
}