One of the most vexing public concerns regarding powerful AI systems such as large language models (LLMs) is their lack of interpretability: How can people trust a model’s output when they cannot understand how it was formed?
In the new paper Finding Neurons in a Haystack: Case Studies with Sparse Probing, a research team from the Massachusetts Institute of Technology, Harvard University and Northeastern University proposes sparse probing, a technique that identifies the LLM neurons relevant to a specific feature or concept and helps explain how high-level, human-interpretable features are represented in such models’ neuron activations.
Probing is an established method that trains a classifier (probe) on a model’s internal activations to predict whether and where a neural network represents specific features. The team’s proposed sparse probing approach constrains the probing classifier to use no more than k neurons in its prediction, where k varies from 1 to 256, and probes for over 100 features to localize the relevant neurons.
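The idea can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's actual pipeline): it fabricates a layer of activations with a feature planted in a few neurons, ranks neurons by a simple class-separation score as a stand-in for the paper's optimal sparse feature selection, and fits a logistic-regression probe restricted to the top k neurons.

```python
import numpy as np

# Hypothetical setup: "activations" stands in for one layer's activations
# over a dataset; "labels" marks whether each input exhibits the probed
# feature (e.g. is_python_code). All numbers here are illustrative.
rng = np.random.default_rng(0)
n_samples, n_neurons, k = 2000, 512, 8

labels = rng.integers(0, 2, n_samples)
activations = rng.normal(size=(n_samples, n_neurons))
# Plant the feature in a few neurons so the probe has something to find.
informative = [3, 17, 101]
activations[:, informative] += 2.0 * labels[:, None]

# Step 1: rank neurons by how well each separates the two classes
# (a crude stand-in for the paper's optimal k-sparse feature selection).
mean_diff = np.abs(activations[labels == 1].mean(0)
                   - activations[labels == 0].mean(0))
top_k = np.argsort(mean_diff)[-k:]

# Step 2: fit a probe restricted to those k neurons (logistic regression
# via plain gradient descent, to keep the sketch dependency-free).
X = activations[:, top_k]
w, b = np.zeros(k), 0.0
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - labels) / n_samples)
    b -= 0.5 * (p - labels).mean()

acc = ((X @ w + b > 0) == labels).mean()
print(f"{k}-sparse probe accuracy: {acc:.2f}")
print("planted neurons recovered:",
      sorted(set(informative) & set(top_k.tolist())))
```

High probe accuracy at a small k suggests the feature is localized in a handful of neurons; in the paper, sweeping k from 1 to 256 traces how distributed each feature's representation is.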
This approach addresses several shortcomings of previous probing techniques and provides valuable new insights into the rich structure within LLMs.
The team leverages recent advances in optimal sparse prediction to solve the k-sparse feature selection subproblem to provable optimality for small values of k, avoiding the conflation of neuron-ranking quality with classification quality. They employ sparsity as an inductive bias so their probes maintain a strong simplicity prior and localize crucial neurons more precisely for fine-grained analysis. Further, because their limited capacity prevents the probes from memorizing correlation patterns associated with the features of interest, the approach yields a more reliable signal of whether a given feature is explicitly represented and used downstream.
In their empirical study, the team trained probes on autoregressive transformer LLMs across the range of k values and reported the resulting classification performance. They summarize their main findings as follows:
- There is a tremendous amount of interpretable structure within the neurons of LLMs, and sparse probing is an effective methodology to locate such neurons (even in superposition), but requires careful use and follow-up analysis to draw rigorous conclusions.
- Many early layer neurons are in superposition, where features are represented as sparse linear combinations of polysemantic neurons, each of which activates for a large collection of unrelated n-grams and local patterns. Moreover, based on weight statistics and insights from toy models, the team concludes that the first 25% of fully connected layers employ substantially more superposition than the rest.
- Higher-level contextual and linguistic features (e.g., is_python_code) are seemingly encoded by monosemantic neurons, predominantly in middle layers, though conclusive statements about monosemanticity remain methodologically out of reach.
- As models increase in size, representation sparsity increases on average, but different features obey different dynamics: some features with dedicated neurons emerge with scale, others split into finer-grained features with scale, and many remain unchanged or appear somewhat randomly.
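The superposition finding above is easier to grasp with a toy example. The following numpy sketch (an illustration loosely following the toy-model intuition the authors cite, not from the paper itself) stores 64 sparse features in a layer of only 32 neurons by assigning each feature a random direction: individual neurons then respond to many unrelated features (polysemanticity), yet each feature remains linearly readable as a combination of neurons.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_features, n_samples = 32, 64, 4000

# One (approximately unit-norm) random direction per feature.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Sparse occurrences: each feature is active on roughly 3% of inputs.
present = rng.random((n_samples, n_features)) < 0.03
activations = present.astype(float) @ directions

# Polysemanticity: many different features load noticeably on one neuron.
n_loading = int((np.abs(directions[:, 0]) > 0.25).sum())
print("features loading on neuron 0:", n_loading)

# Yet a linear readout along a feature's own direction still separates
# inputs where the feature is present from those where it is absent.
target = 7
score = activations @ directions[target]
print("mean score when present:", round(score[present[:, target]].mean(), 2))
print("mean score when absent: ", round(score[~present[:, target]].mean(), 2))
```

Because no single neuron corresponds to any one feature here, a probe restricted to one neuron would perform poorly, while a sparse linear combination over several neurons recovers the feature cleanly, which is exactly the regime sparse probing is designed to detect.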
This work introduces a novel sparse probing technique to reveal a tremendous amount of rich structure in LLMs that is understandable by humans. The team suggests this “ambitious interpretability” can be more productively accomplished with an empirical approach reminiscent of the natural sciences than with traditional machine learning experimental loops and encourages other researchers in the field to join in its exploration.
The paper Finding Neurons in a Haystack: Case Studies with Sparse Probing is on arXiv.
Author: Hecate He | Editor: Michael Sarazen