Interpretability has become a prominent research focus in AI system development and deployment. Although deep learning models have achieved remarkable success in recent years, how these models work remains poorly understood, mainly because there is no ground truth information about their complex internal mechanisms.
In the new paper Tracr: Compiled Transformers as a Laboratory for Interpretability, a research team from ETH Zurich and DeepMind presents Tracr, a compiler that addresses the absence of ground truth explanations in deep neural network models by “compiling” human-readable code into the weights of a transformer model.
The team summarizes their main contributions as follows:
- Describe a modified version of the RASP programming language better suited for being compiled to model weights and discuss some limitations of the RASP programming model.
- Introduce Tracr, a “compiler” for translating RASP programs into transformer model weights. To describe Tracr, we also introduce Craft, its intermediate representation for expressing linear algebra operations using named basis directions.
- Showcase several transformer models obtained by using Tracr.
- Propose an optimization procedure to “compress” the compiled models and make them more efficient and realistic. We analyze models compressed this way, demonstrating superposition.
- Discuss potential applications and limitations of Tracr and how compiled models can help to accelerate interpretability research.
- Provide an open-source implementation of Tracr.
This work builds on the Restricted Access Sequence Processing Language (RASP, Weiss et al., 2021), a domain-specific language for describing transformer computations. The team first maps RASP operations directly to the components of a transformer model, including embeddings, multi-layer perceptron (MLP) layers and multi-headed attention (MHA) layers. They then introduce a few modifications to the RASP language so that programs can be translated into model weights.
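To give a feel for the programming model, here is a minimal pure-Python sketch of RASP-style primitives. It is an illustration only: the helper names (`select`, `aggregate`, `selector_width`, `elementwise`) are hypothetical stand-ins rather than the Tracr API, and the code works on plain lists instead of producing weights. Roughly, select-and-aggregate plays the role of attention, while elementwise operations play the role of MLP layers.

```python
# Toy model of RASP-style primitives on plain Python lists.
# Illustrative only: these helpers are hypothetical, not the Tracr API.

def select(keys, queries, predicate):
    """Boolean matrix: row q marks the key positions that query q selects."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values, default=0.0):
    """Average the selected values for each query position (attention-like)."""
    out = []
    for row in selector:
        picked = [v for v, keep in zip(values, row) if keep]
        out.append(sum(picked) / len(picked) if picked else default)
    return out

def selector_width(selector):
    """Count how many key positions each query position selects."""
    return [sum(row) for row in selector]

def elementwise(fn, values):
    """Apply a function independently at every position (MLP-like)."""
    return [fn(v) for v in values]

tokens = list("hello")

# Token frequencies: each position counts the positions holding the same token.
same_token = select(tokens, tokens, lambda k, q: k == q)
frequencies = selector_width(same_token)

# Fraction of tokens equal to "l": mark matches, then average over all positions.
is_l = elementwise(lambda t: 1.0 if t == "l" else 0.0, tokens)
select_all = select(tokens, tokens, lambda k, q: True)
frac_l = aggregate(select_all, is_l)

print(frequencies)
print(frac_l)
```

Because every step is an explicit, named operation, a model compiled from such a program has a known intended mechanism, which is what makes it useful as ground truth.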
The team then uses Craft, their “assembly language” for transformer models, to represent vector spaces with labelled basis directions and the operations on them. This lets the researchers describe projections and other linear operations in terms of basis direction labels rather than raw indices.
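The idea of labelled basis directions can be sketched in a few lines of Python. This is an assumption-laden toy, not Craft's actual classes: vectors are dictionaries from label to coefficient, linear maps are nested dictionaries, and the labels (`tokens:a`, `position:2`, `is_a`) are made up for illustration.

```python
# Illustrative sketch of labelled basis directions; not Craft's real classes.

def apply_linear(mapping, vector):
    """Apply a linear map {input_label: {output_label: weight}}
    to a vector {label: coefficient}."""
    out = {}
    for in_label, coeff in vector.items():
        for out_label, weight in mapping.get(in_label, {}).items():
            out[out_label] = out.get(out_label, 0.0) + weight * coeff
    return out

def projection(labels):
    """Map that keeps the named directions and drops all others."""
    return {label: {label: 1.0} for label in labels}

# Hypothetical residual-stream vector: a one-hot token plus a one-hot position.
residual = {"tokens:a": 1.0, "position:2": 1.0}

# Project onto the token subspace by naming its directions.
keep_tokens = projection(["tokens:a", "tokens:b"])
token_part = apply_linear(keep_tokens, residual)

# Write an indicator for the token "a" into a new named direction.
is_a = {"tokens:a": {"is_a": 1.0}}
flag = apply_linear(is_a, residual)

print(token_part)
print(flag)
```

Working with names rather than indices means a component can be written without knowing where its inputs will end up in the final weight matrices; concrete indices only need to be fixed when the weights are assembled.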
The paper provides a detailed description of how Tracr translates RASP programs to transformer weights in a process comprising six steps: 1) Construct a computational graph, 2) Infer the input and output values of each s-op (sequence operation), 3) Independently translate s-ops to Craft components, 4) Assign components to layers, 5) Construct the Craft model, and 6) Assemble the transformer weights.
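The layer-assignment step can be illustrated with a simplified sketch. It assumes each transformer block is an attention sublayer followed by an MLP sublayer, and places every component in the earliest sublayer of the right kind that comes after all of its dependencies. The function, the graph encoding and the node names are hypothetical, and Tracr's own allocation logic may differ.

```python
# Simplified sketch of assigning components to layers; not Tracr's implementation.

def assign_sublayers(nodes):
    """nodes: {name: (kind, deps)} in topological order, kind is "attn" or "mlp".
    Returns {name: sublayer index}; even indices are attention, odd are MLP.
    Dependencies that are not in `nodes` are treated as model inputs."""
    slot = {}
    for name, (kind, deps) in nodes.items():
        # Must come strictly after every dependency that is itself a component.
        earliest = max((slot[d] + 1 for d in deps if d in slot), default=0)
        parity = 0 if kind == "attn" else 1
        if earliest % 2 != parity:
            earliest += 1  # skip ahead to the next sublayer of the right kind
        slot[name] = earliest
    return slot

# Hypothetical two-node program: an MLP feature feeding an attention head.
graph = {
    "is_x": ("mlp", ["tokens"]),
    "frac_prevs": ("attn", ["is_x"]),
}

slots = assign_sublayers(graph)
n_blocks = max(slots.values()) // 2 + 1

print(slots)
print(n_blocks)
```

In this example the MLP component lands in the first block and the attention head in the second, so the first block's attention sublayer and the second block's MLP sublayer are left unused.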
The team evaluated Tracr by compiling a range of ground-truth transformers that implement programs such as computing token frequencies, sorting, and Dyck-n parenthesis checking. The results support Tracr as an effective tool for advancing interpretability research on neural network models.
Author: Hecate He | Editor: Michael Sarazen