[Observations and arguments similar to, but approached differently from, the papers that followed Anthropic's Toy Model]
Contents
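1. Introduction
This study is an attempt to understand the internal mechanisms of large language models (LLMs), tracking how circuits form and change over the course of a model's continued training. In particular, it focuses on whether circuit analyses generalize beyond a single end-of-pre-training model, across further pre-training and across model scale. The study uses the Pythia model suite, analyzing training over 300 billion tokens at a range of model sizes (70 million to 2.8 billion parameters).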
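2. Methods
2.1 Definition and importance of circuits
A circuit is defined as a computational subgraph that explains the mechanism by which a model performs a task. A circuit is validated by checking whether it faithfully reproduces the model's behavior on a specific task. For example, in the indirect object identification (IOI) task, we evaluate whether the model produces the expected output for a given input.
2.2 Circuit finding
This section uses edge attribution patching with integrated gradients (EAP-IG) to identify the model's computational subgraph. The method assigns an importance score to every edge in the model, computed from the change in loss that occurs when that edge is corrupted. Prior work by Hanna et al. (2024) proposes that EAP-IG evaluates each edge's importance using a gradient-based approximation, and thereby finds more faithful circuits with fewer edges.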
The main formula used in EAP-IG is as follows.
\[(z'_u - z_u) \frac{1}{m} \sum_{k=1}^{m} \frac{\partial L(z' + \frac{k}{m} (z - z'))}{\partial z_v}\]
[Circuit definition and faithfulness]
After each edge is scored with this method, the circuit is built from the edges with the highest absolute scores. The circuit is required to achieve at least 80% of the full model's performance, and its size is determined by binary search. This high faithfulness threshold gives confidence that the circuit captures most of the model's mechanisms for the task.
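2.3 Models
The Pythia models used in this study are an open-source suite of language models at a range of scales. Each Pythia model has 154 training checkpoints, which makes it possible to analyze model performance and circuits at different stages of training. This makes the suite ideal for analyzing model behavior over a continuous training process. The models range in size from 70M to 12B parameters.
Note: judging from the Pythia repo, the suite is ideal because it is organized around per-checkpoint analysis and the models were trained under a variety of conditions. Anthropic carries out similar analysis and research in connection with Sornet, but Anthropic does not share checkpoints or model weights, so…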
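2.4 Tasks
The tasks analyzed are simple ones, including indirect object identification (IOI), gendered pronouns, and greater-than comparison. These tasks are simple enough that even small models can perform them, and prior work has already explained some of the underlying mechanisms. They are used to check whether the circuits the models use resemble those analyzed in earlier work; simple tasks also help in understanding how a model produces a particular output.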
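3. Circuit formation
3.1 Behavioral evaluation
By analyzing the model's task performance over time, we identify when and how specific circuits develop. Task ability emerges at similar points early in training across models, showing a consistent learning pattern across model sizes.
3.2 Emergence of components
We analyze how the key components associated with specific tasks appear during training. These components include induction heads, successor heads, and copy suppression heads, which play an important role in the model's performance on specific tasks.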
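4. Algorithmic stability and generalization
4.1 Model behavior and circuit components
As training proceeds, the components of each circuit can change, but the implemented algorithm remains consistent. This means that the circuit continues to provide a consistent task-solving algorithm.
4.2 Stability of circuit algorithms
Even when individual components change, the algorithm implemented by the circuit is stable. This suggests that circuit analysis can generalize across training states and model scales.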
As LLMs’ capabilities have grown, so has interest in characterizing their mechanisms. Recent work in mechanistic interpretability often seeks to do so via circuits: computational subgraphs that explain task-solving mechanisms (Wang et al., 2023; Hanna et al., 2023; Merullo et al., 2024; Lieberum et al., 2023). Circuits can be found and verified using a variety of methods (Conmy et al., 2023; Syed et al., 2023; Hanna et al., 2024; Kramár et al., 2024; Ferrando & Voita, 2024), with the aim of reverse-engineering models’ task-solving algorithms.
Though much circuits research is motivated by LLMs’ capabilities, the setting in which such research is performed often differs from that of currently deployed models. Crucially, while most LLM circuits work (Wang et al., 2023; Hanna et al., 2023; Merullo et al., 2024; Lieberum et al., 2023; Tigges et al., 2023) studies models at the end of pre-training, currently deployed models often undergo continuous training (OpenAI et al., 2024; Anthropic, 2024; Gemini Team et al., 2024) or are fine-tuned for specific tasks (Chung et al., 2022; Hu et al., 2021). Other subfields of interpretability have studied model development during training (Hu et al., 2023; Chang et al., 2023; Warstadt et al., 2020; Choshen et al., 2022; Chang & Bergen, 2022), but similar work on LLM mechanisms is scarce. Existing mechanistic work over training has studied syntactic attention structures and induction heads (Olsson et al., 2022; Chen et al., 2024; Singh et al., 2024), but has focused on small encoder or toy models. Prakash et al. (2024) examines circuits in 7-billion-parameter models post-finetuning, but the evolution of circuits during pre-training remains unexplored. This raises questions about whether circuit analyses will generalize if the model in question is further trained or fine-tuned.
We address this issue by exploring when and how circuits and their components emerge during training, and their consistency across training and different model scales. We study circuits in models from the Pythia suite (Biderman et al., 2023b) across 300 billion tokens, at scales from 70 million to 2.8 billion parameters. We supplement this with additional data from models ranging up to 12 billion parameters. Our results suggest remarkable consistency in circuits and their attributes across scale and training. We summarize our contributions as follows:
Performance acquisition and functional component emergence are similar across scale: Task ability acquisition rates tend to reach a maximum at similar token counts across different model sizes. Functional components like name mover heads, copy suppression heads, and successor heads also emerge consistently at similar points across scales, paralleling previous findings that induction heads emerge at roughly 2B-5B tokens across models of all scales (Olsson et al., 2022).
LLM Circuit Analyses Are Consistent Across Training and Scale
Circuit algorithms can remain stable despite component-level fluctuations: Analysis of the indirect object identification (IOI; Wang et al., 2023) circuit across training and scale reveals that even when individual components change, the overall algorithm remains consistent, indicating a degree of algorithmic stability. The algorithm also tends to be similar for dramatically different model scales, suggesting that some currently-identified circuits may generalize, at least on simple tasks.
Taken as a whole, our results suggest that circuit analysis can generalize well over both (pre-)training and scale even in the face of component and circuit size changes, and that circuits studied at the end of training in smaller models can sometimes be informative for larger models as well as for models with longer training runs. We hope to see this validated for other circuits, especially more complex ones, confirming our initial findings.
A circuit (Olah et al., 2020; Elhage et al., 2021; Wang et al., 2023) is the minimal computational subgraph of a model that is faithful to its behavior on a given task. At a high level, this means that circuits describe the components of a model (e.g., attention heads or multi-layer perceptrons, MLPs) that the model uses to perform the task. A task, within the circuits framework, is defined by inputs, expected outputs, and a (continuous) metric that measures model performance on the task. For example, in the indirect object identification (IOI; Wang et al., 2023) task, the LM receives inputs like “When John and Mary went to the store, John gave a drink to”, and is expected to output Mary, rather than John. We can measure the extent to which the LM fulfills our expectations by measuring the difference in logits assigned to Mary and John.
Circuits are useful objects of study because we can verify that they are faithful to LM behavior on the given task. We say that a circuit is faithful if we can corrupt all nodes and edges outside the circuit without changing model behavior on the task. Concretely, we test faithfulness by running the model on normal input, while replacing the activations corresponding to edges outside our circuit with activations from a corrupted input, which elicits very different model behavior. In the above case, our corrupted input could instead be “When John and Mary went to the store, Mary gave a drink to”, eliciting John over Mary. If the circuit still predicts Mary, rather than John, it is faithful. As circuits are often small, including less than 5% of model edges, this faithfulness test corrupts most of the model, thus guaranteeing that circuits capture a small set of task-relevant model mechanisms. For more details on the circuits framework, see prior work and surveys (Conmy et al., 2023; Hanna et al., 2024; Ferrando et al., 2024).
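The faithfulness test described above can be sketched on a toy model. The sketch assumes, purely for illustration, that the logit difference is an additive sum of per-edge contributions (real transformers are not additive); the function names, edge labels, and numbers are hypothetical and do not come from the paper.

```python
def patched_logit_diff(clean, corrupted, circuit):
    """Logit difference when edges outside `circuit` carry corrupted activations.

    clean / corrupted: dict mapping edge name -> contribution to
    logit(Mary) - logit(John) on the clean / corrupted prompt.
    Toy assumption: per-edge contributions simply add up.
    """
    return sum(clean[e] if e in circuit else corrupted[e] for e in clean)


def faithfulness(clean, corrupted, circuit):
    """Fraction of the full model's clean logit difference kept by the circuit."""
    full = sum(clean.values())
    return patched_logit_diff(clean, corrupted, circuit) / full


# Hypothetical per-edge contributions for one IOI prompt pair.
clean = {"name_mover": 3.0, "s_inhibition": 1.0, "other_a": 0.5, "other_b": 0.5}
corrupted = {"name_mover": -3.0, "s_inhibition": -1.0, "other_a": 0.0, "other_b": 0.0}
circuit = {"name_mover", "s_inhibition"}

score = faithfulness(clean, corrupted, circuit)
```

In this toy case the two-edge circuit keeps 4.0 of the full model's 5.0 logit difference, so it would just meet an 80% faithfulness threshold.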
Circuits have a number of advantages over other interpretability frameworks. As computational subgraphs of the model that flow from its inputs to its outputs, they provide complete explanations for a model’s mechanisms. Moreover, their faithfulness, verified using a causal test, makes them more reliable explanations. This stands in contrast to probing (Belinkov, 2022), which only offers layer-representation-level explanations, and can be unfaithful, capturing features unused by the model (Elazar et al., 2020). Similarly, input attributions (Shrikumar et al., 2017; Sundararajan et al., 2017a) only address which input tokens are used, and may be unreliable (Adebayo et al., 2018; Bilodeau et al., 2024).
In order to find faithful circuits at scale over many checkpoints, we use efficient, attribution-based circuit finding methods. Such methods score the importance of all edges in a model’s graph in a fixed number of forward and backward passes, independent of model size; though other patching-based circuit-finding methods (Conmy et al., 2023) are more accurate, they are too slow, requiring a number of forward passes that grows with model size. From the many existing attribution methods (Nanda, 2023; Ferrando & Voita, 2024; Kramár et al., 2024), we select edge attribution patching with integrated gradients (EAP-IG; Hanna et al., 2024) due to its faithful circuit-finding ability. Much like its predecessor, edge attribution patching (EAP; Nanda, 2023), EAP-IG assigns each edge an importance score using a gradient-based approximation of the change in loss that would occur if that edge were corrupted; however, EAP-IG yields more faithful circuits with fewer edges. Concretely, EAP-IG computes the score of an edge between nodes \(u\) and \(v\), with activations \(z_u, z_v\) as
\[(z'_u - z_u) \frac{1}{m} \sum_{k=1}^{m} \frac{\partial L(z' + \frac{k}{m} (z - z'))}{\partial z_v},\]
where \(m\) is the number of integrated gradient steps (Sundararajan et al., 2017b) to perform. This method requires \(O(m)\) forward and backward passes to score all model edges; we choose \(m = 5\) based on Hanna et al.’s (2024) recommendations.
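As a minimal sketch of how this score is computed for one edge, the snippet below evaluates the formula on a toy problem. It assumes a hypothetical setup in which \(v\) reads \(u\)'s output directly and the loss is \(L = \tfrac{1}{2}\lVert z_v \rVert^2\), so the gradient is available in closed form; in a real model the gradients would come from backward passes, and `eap_ig_score` and `toy_grad` are illustrative names, not the authors' code.

```python
def eap_ig_score(z_u_clean, z_u_corrupt, grad_at, m=5):
    """EAP-IG score for a single edge u -> v.

    z_u_clean, z_u_corrupt: activations of u on the clean / corrupted input.
    grad_at(alpha): returns dL/dz_v with the model input set to
        z' + alpha * (z - z'), so alpha = 0 is fully corrupted.
    m: number of integrated-gradient steps.
    """
    dim = len(z_u_clean)
    avg_grad = [0.0] * dim
    for k in range(1, m + 1):
        grad = grad_at(k / m)
        for i in range(dim):
            avg_grad[i] += grad[i] / m
    # (z'_u - z_u) dotted with the averaged gradient
    return sum((zc - z) * g for z, zc, g in zip(z_u_clean, z_u_corrupt, avg_grad))


# Toy setup (assumption): dL/dz_v equals the interpolated activation itself.
z_clean, z_corrupt = [1.0, 0.0], [0.0, 0.0]


def toy_grad(alpha):
    return [zc + alpha * (z - zc) for z, zc in zip(z_clean, z_corrupt)]


score = eap_ig_score(z_clean, z_corrupt, toy_grad, m=5)
```

Here corrupting the edge changes the toy loss by exactly -0.5; the score with \(m = 5\) is about -0.6 and approaches -0.5 as \(m\) grows, which illustrates the trade-off between the number of passes and the quality of the approximation.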
After running EAP-IG to score each edge, we define our circuit by greedily searching for the edges with the highest absolute score. We search for the minimal circuit that achieves at least 80% of the whole model’s performance on the task. We do this using binary search over circuit sizes; the initial search space ranges from 1 edge to 5% of the model’s edges. The high faithfulness threshold we set gives us confidence that our circuits capture most model mechanisms used on the given task. However, ensuring that a circuit is entirely complete, containing all relevant model nodes and edges, is challenging, and no definitive method of verifying this has emerged.
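The search described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it takes the top-\(k\) edges by absolute score, assumes faithfulness does not decrease as edges are added (which binary search needs), and uses a hypothetical `faithfulness_of` callback in place of actually running the patched model.

```python
def find_minimal_circuit(scores, faithfulness_of, threshold=0.8, max_frac=0.05):
    """Return the smallest top-k edge set whose faithfulness meets `threshold`.

    scores: list of per-edge attribution scores (index = edge id).
    faithfulness_of(edges): fraction of full-model performance that the
        circuit made of `edges` recovers (hypothetical callback).
    max_frac: upper end of the search space as a fraction of all edges.
    """
    ranked = sorted(range(len(scores)), key=lambda e: abs(scores[e]), reverse=True)
    lo, hi = 1, max(1, int(max_frac * len(scores)))
    if faithfulness_of(ranked[:hi]) < threshold:
        return None  # no circuit in the search space is faithful enough
    while lo < hi:
        mid = (lo + hi) // 2
        if faithfulness_of(ranked[:mid]) >= threshold:
            hi = mid          # mid edges suffice; try fewer
        else:
            lo = mid + 1      # mid edges are not enough; need more
    return ranked[:lo]


# Toy example: 200 edges, four of which matter.
scores = [0.01] * 200
for edge, s in {3: -5.0, 7: 4.0, 9: 3.0, 20: -2.0}.items():
    scores[edge] = s
important = {3, 7, 9, 20}


def toy_faithfulness(edges):
    return len(important & set(edges)) / len(important)


circuit = find_minimal_circuit(scores, toy_faithfulness)
```

With three of the four important edges the toy faithfulness is 0.75, below the threshold, so the search settles on all four.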
We use the circuits we identify through this method to identify key nodes and structures, but we do not limit our study of functional heads to components found through this method alone. Discussion of the size- and similarity-based metrics for these circuit graphs can be found in Appendix C.
We study Biderman et al.’s (2023b) Pythia model suite, a collection of open-source autoregressive language models that includes intermediate training checkpoints. Though we could train our own language models or use another model suite with intermediate checkpoints (Sellam et al., 2022; Liu et al., 2023; Groeneveld et al., 2024), Pythia is particularly useful in providing a thorough set of checkpoints for models at a variety of scales, all with identical training data. Each model in the Pythia suite has 154 checkpoints: 11 of these correspond to the model after 0, 1, 2, 4, . . . , and 512 steps of training; the remaining 143 correspond to 1000, 2000, . . . , and 143,000 steps. We find circuits at each of these checkpoints. As Pythia uses a uniform batch size of 2.1 million tokens, these models are trained on far more tokens (300 billion) than those in existing studies of model internals over time. We study models of varying sizes, from 70 million to 12 billion parameters.
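The checkpoint schedule above is easy to reproduce and convert into token counts. The sketch below assumes the batch size is exactly 2,097,152 tokens (1024 sequences of 2048 tokens), which rounds to the 2.1 million stated in the text; the variable and function names are illustrative.

```python
TOKENS_PER_STEP = 2_097_152  # assumption: 1024 sequences x 2048 tokens

early_steps = [0] + [2 ** i for i in range(10)]    # 0, 1, 2, 4, ..., 512
regular_steps = list(range(1000, 143_001, 1000))   # 1000, 2000, ..., 143000
checkpoint_steps = early_steps + regular_steps


def tokens_seen(step):
    """Number of training tokens consumed after `step` optimizer steps."""
    return step * TOKENS_PER_STEP


final_tokens = tokens_seen(checkpoint_steps[-1])
```

Under this assumption the 11 early and 143 regular checkpoints give 154 in total, and the final checkpoint corresponds to roughly \(3.0 \times 10^{11}\) tokens, in line with the 300 billion figure.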
We examine the mechanisms behind four different tasks taken from the (mechanistic) interpretability literature. We choose simple tasks explicitly because they are feasible for even the smaller models we study to perform, and also because these tasks are simple enough that existing work has already provided clues and sometimes detailed descriptions of how models perform them. By contrast, we do not yet have circuit-level representations of more complex tasks and do not yet understand how models perform them. To verify that our models use similar circuits as heretofore-studied models to perform the simple tasks we selected, we briefly analyze our models’ indirect object identification and greater-than circuits in Appendix A. The other tasks are MLP-dominant and do not involve much attention head activity; for these circuits, we verify that this is still the case in Pythia models.
The indirect object identification (IOI; Wang et al., 2023) task feeds models inputs such as “When John and Mary went to the store, John gave a drink to”; models should prefer Mary over John. Corrupted inputs, like “When John and Mary went to the store, Mary gave a drink to”, reverse model preferences. We measure model behavior via the difference in logits assigned to the two names (Mary and John). We use a small dataset of 70 IOI examples created with Wang et al.’s (2023) generator, as larger datasets did not provide significantly better results in our experiments and this size fit into GPU memory more easily.
The Gendered-Pronoun task (Vig et al., 2020; Mathwin et al., 2023; Chintam et al., 2023) measures the gender of the pronouns that models produce to refer to a previously mentioned entity. Prior work has shown that, given inputs such as “So Paul is such a good cook, isn’t”, models prefer the continuation “he” to “she”; we measure the degree to which this occurs via the difference in the pronouns’ logits. In the corrupted case, we replace “Paul” with “Mary”; we include opposite-bias examples as well. We craft 70 examples as in Mathwin et al. (2023).
The Greater-Than task (Hanna et al., 2023) measures a model’s ability to complete inputs such as \(s = \text{“The war lasted from the year 1732 to the year 17”}\) with a valid year (i.e. a year \(> 32\)). Task performance is measured via probability difference (prob diff); in this example, the prob diff is \(\sum_{y=33}^{99} p(y \mid s) - \sum_{y=00}^{32} p(y \mid s)\). In corrupted inputs, the last two digits of the start year are replaced by “01”, pushing the model to output early (invalid) years that decrease the prob diff. We create 200 Greater-Than examples with Hanna et al.’s (2023) generator.
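The prob diff metric can be written out directly. The sketch below assumes the model's output has already been reduced to a list of 100 probabilities, one per two-digit completion from 00 to 99; `greater_than_prob_diff` is a hypothetical helper name.

```python
def greater_than_prob_diff(probs, start_yy):
    """Probability mass on valid years minus mass on invalid years.

    probs: list of 100 probabilities, probs[y] = p(y | s) for y in 00..99.
    start_yy: last two digits of the start year (32 in the example above).
    """
    valid = sum(probs[start_yy + 1:])     # years strictly greater than start_yy
    invalid = sum(probs[:start_yy + 1])   # years less than or equal to start_yy
    return valid - invalid


# A model that spreads probability uniformly over all 100 completions.
uniform = [0.01] * 100
baseline = greater_than_prob_diff(uniform, 32)
```

A uniform guesser scores about 0.34 on the 1732 example (67 valid years against 33 invalid ones), so prob diff values are best read relative to such a baseline rather than to zero.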
Subject-verb agreement (SVA), widely studied within the NLP interpretability literature (Linzen et al., 2016; Newman et al., 2021; Lasri et al., 2022), tasks models with predicting verb forms that match a sentence’s subject. Given input such as “The keys on the cabinet”, models must predict “are” over “is”; a corrupted input, “The key on the cabinet”, pushes models toward the opposite response. We measure model performance via prob diff, taking the difference between the probability assigned to verbs that agree with the subject and the probability assigned to those that do not. We use 200 synthetic SVA example sentences from Newman et al. (2021).
We begin our analysis of LLMs’ task mechanisms over time by analyzing LLM behavior on these tasks; without understanding their task behaviors, we cannot understand their task mechanisms. We test these by running each model (Section 2.3) on each task (Section 2.4). Our results (Figure 1) display three trends across all tasks. First, all models but the weakest (Pythia-70m) tend to arrive at similar task performance at the end of training. This is consistent with our choice of tasks: they are simple, learnable even by small models, and scaling does not significantly improve performance. Second, once models begin learning a task, their overall performance is generally non-decreasing, though there are minor fluctuations; Pythia-2.8b’s logit difference on Gendered Pronouns dips slightly after it learns the task. In general, though, models tend not to undergo significant unlearning. The only marked downward trend (Pythia-70m at the end of SVA) comes from a weak model.
Figure 1. Task behavior across models and time (higher indicates a better match with expected behavior). Across tasks and scales, model abilities tend to develop at the same number of tokens. We use logit difference (the difference between the logits for the “correct” and “incorrect” names in the task) and probability difference (average probability for the correct and incorrect answer groups) as metrics, as these were used in the original works that examined these tasks. Often, models will show negative performance on tasks immediately prior to developing the ability to do them; we leave to future work why this is the case.
Finally, for each task we examined, we observed that there was a model size beyond which additional scale did not improve the rate of learning, and sometimes even decreased it; task acquisition appeared to approach an asymptote. We found this surprising due to the existence of findings showing the opposite trend for some tasks (Kaplan et al., 2020; Rae et al., 2022). On some tasks (Gendered Pronouns and Greater-Than), all models above a certain size (70M parameters for Gendered Pronouns and 160M for Greater-Than) learn tasks at roughly the same rate. On IOI, models from 410M to 2.8B parameters learn the task the fastest, but larger models (6.9B and 12B) have learning curves more like Pythia-160m. We obtain similar results on more difficult tasks like SciQ (Welbl et al., 2017); results in Appendix F.
What drives this last trend, limiting how fast large models learn tasks? To understand this, we delve into the internal model components that support these behaviors and trends.
Prior work (Olsson et al., 2022; Chen et al., 2024; Singh et al., 2024) has shown how a model’s ability to perform a specific task can hinge on the development of certain components, i.e. the emergence of attention heads or MLPs with specific, task-beneficial behaviors. Prior work has also thoroughly characterized the components underlying model abilities in two of our tasks, IOI and Greater-Than, at the end of training. We thus ask: is it the development of these components that causes the task learning trends we saw before? We focus on four main components, all of which are attention heads, which we briefly describe here:
Induction Heads (Olsson et al., 2022) activate on sequences of the form [A][B]…[A], attending to and upweighting [B]. This allows models to recreate patterns in their input, and supports IOI and Greater-Than.
Successor Heads (Gould et al., 2023) identify sequential values in the input (e.g. “11” or “Thursday”) and upweight their successor (e.g. “12” or “Friday”); this supports Greater-Than behavior.
Copy Suppression Heads (McDougall et al., 2023) attend to previous words in the input, lowering the output probability of repeated tokens that are highly predicted in the residual stream input to the head. In the original IOI circuit, copy suppression heads hurt performance, downweighting the correct name. In contrast, we find (Appendix E) that they contribute positively to the Pythia IOI circuit by downweighting the incorrect name; this is possible because both names are already highly predicted in these heads’ input, and they respond by downweighting the most repeated one.
Name-Mover Heads (Wang et al., 2023) perform the last step of the IOI task, by attending to and copying the correct name. Unlike the other heads described so far, this behavior is specific to IOI-type tasks; their behavior across the entire data distribution has not yet been characterized.
Because the importance of these components to IOI and Greater-Than has been established in other models, but not necessarily in those of the Pythia suite, we must first confirm their importance in these models. We do so by finding circuits for each model at each checkpoint using EAP-IG, as described in Section 2.2; we omit Pythia-6.9b and 12b from circuit finding for reasons of computational cost. We find that these component types indeed appear within Pythia models’ task circuits; see Appendix A and Appendix B for details on our methods and findings.
For each component, prior work has developed a metric to determine whether a model’s attention head is acting like that component type; see Appendix E for details on these. Using these metrics, we score each of our models’ heads for each of these behaviors at each checkpoint, evaluating the degree to which it acts like one of the four aforementioned heads. We then plot the earliest-emerging heads of each type, per model.
Figure 2. The development of components relevant to IOI and Greater-Than, across models and time. Each line indicates the strength of component behavior of the selected attention head from that model; higher values imply stronger component behavior. For each model and component, we plot the head in the relevant circuit (either IOI or Greater-Than) that displays the component behavior the earliest.
Our results (Figure 2) indicate that many of the hypothesized responsible components emerge at the same time as model performance increases. Most models’ induction heads emerge soon after they have seen \(2 \times 10^9\) tokens, replicating the findings of Olsson et al. (2022); immediately after this, Greater-Than behavior emerges. The successor heads, also involved in Greater-Than, emerge at a similar time.
For IOI, the name-mover heads emerge at similar timesteps (between \(2 \times 10^9\) and \(8 \times 10^9\) tokens) across models, with a very high strength, during or just before IOI behavior appears. Copy suppression heads emerge at the same timescale, but at varying speeds, and with varying strengths. Given that these heads are the main contributors to model performance in each task’s circuit, and they emerge as or just before models’ task performance increases, we can be reasonably sure that they are responsible for the emergence of performance. This said, we note an unusual trend: though model performance (Figure 1) does not decrease over time, the functional behavior of certain attention heads does. In the following section, we explain how this occurs.
Figure 3. The development over time of various components relevant to IOI and Greater-Than in Pythia-160m. Here, we show the top heads for each function in the model. Each line indicates the degree to which an attention head, denoted as (layer, head), exhibits a given function; higher values imply stronger functional behavior. Heads often lose their current function; as this occurs, other heads take their place (but not always to the same degree or in the same numbers).
We demonstrated in Section 3 that across a variety of tasks, models with differing sizes learn to perform the given task after the same amount of training; this appears to happen because each task relies on a set of components which develop after a similar count of training tokens across models. However, in Figure 2, we observed that attention heads that had a given function earlier in training can lose their function later in training. This raises questions: when the heads being used to solve a task change, does the algorithm implemented by the model change too? And how do these algorithms generalize across model scale?
To understand how model component behaviors change over time, we now zoom in on the components in one model, Pythia-160m, and study them over the course of training; where we earlier plotted only the top component (e.g. the top successor head) of each model, we now plot the top 5 of Pythia-160m’s heads that exhibit a given functional behavior (or fewer, if fewer than 5 exist). By evaluating components and algorithms over Pythia-160m’s 300B token training span, we go beyond previous work, which studies models trained on relatively few (≤ 50M) tokens (Chen et al., 2024; Singh et al., 2024); in such work, components and task behaviors appear constant after component formation.
By contrast, our results (Figure 3) show that over the longer training period of Pythia models, the identity of components in each circuit is not constant. For example, the name-mover head (4,6) suddenly stops exhibiting this behavior at \(3 \times 10^{10}\) tokens, having acquired it after \(4 \times 10^9\) tokens. Similarly, Pythia-160m’s main successor head (5,9) loses its successor behavior towards the end of training; however, (11,0) exhibits more successor behavior at precisely that time. Such balancing may lead to the model’s task performance remaining stable, as we observed in the prior section (Figure 1). It seems plausible that self-repair (McGrath et al., 2023; Rushing & Nanda, 2024) contributes to this behavioral stability, but we leave the question of the exact “load-balancing” mechanism to future work. Nevertheless, models can clearly compensate for losses of and changes in individual circuit components.
This instability of functional components raises an important question: when attention heads begin or cease to participate in a circuit, does the underlying algorithm change? To answer this, we examined the IOI circuit, as it is the most thoroughly characterized (Wang et al., 2023) circuit algorithm of our set of tasks. Our investigation follows a three-stage approach: first, we analyzed the IOI circuit at the end of training, reverse-engineering its algorithm; next, we developed a set of metrics to quantify whether the model was still performing that algorithm; finally, we applied these metrics across checkpoints, to determine if the algorithm was stable over training.
Figure 4. A: Pythia-160m’s IOI circuit at the end of training (300B tokens). The remaining plots show the percent of model IOI performance that is explained by the Copy Suppression and Name-Mover Heads (B), the S-Inhibition Heads’ edges to those heads (C), and the Induction / Duplicate Token Heads’ connections to the S-Inhibition heads (D); higher percentages indicate that the corresponding edge is indeed important. Each of plots B-D verifies the importance of an edge from diagram A. The set of components analyzed changes from checkpoint to checkpoint such that all heads performing a relevant function (like name-moving) at that checkpoint are considered.
The first stage of our analysis is to analyze the IOI circuit at the end of training. Here, we present only the results of our analysis, but see Appendix B for details of this process, which follows the original analysis (Wang et al., 2023). Figure 4A shows the circuit that results from our analysis; it involves three logical “steps,” each of which involves a different set of attention head types. Working backwards from the logit predictions, the direct contributors towards the logit difference are name-mover heads and copy suppression heads. The former attend to the indirect object in the prompt and copy it to the last position; the latter attend to and downweight tokens that appear earlier in the input. In the next step, the name-mover heads (but not the copy-suppression heads) use token and positional information output by the S-inhibition heads to attend to the correct token. Finally, S-inhibition heads rely on information from induction heads and duplicate-token heads (only the former of which is involved in the IOI circuit for Pythia-160m in particular).
Next, we quantify the extent to which the circuit depends on each of these three steps, via path patching (Goldowsky-Dill et al., 2023), a form of ablation where activations are swapped with those from counterfactual prompts (see Appendix B for details). If a step is important, ablating the connection between the components involved in that step (e.g. in step 2, between induction / duplicate-token heads and S-inhibition heads) should have a large direct effect, and cause a large drop in model performance. For each step, our metric measures this direct effect, divided by the sum of the direct effects of ablating each edge with the same endpoint. Our metrics thus range from 0 to 100%; higher is better.
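The normalization in this metric can be illustrated with a small sketch. It assumes the direct effects (drops in logit difference from path-patching each incoming edge of a shared endpoint) have already been measured and are non-negative; the head labels, numbers, and the `step_importance` helper are hypothetical.

```python
def step_importance(direct_effects, step_sources):
    """Percent of the total direct effect into one endpoint that comes
    from the heads belonging to the step under study.

    direct_effects: dict mapping source head -> direct effect of ablating
        its edge into the shared endpoint (assumed non-negative).
    step_sources: the source heads that make up the step.
    """
    total = sum(direct_effects.values())
    step_total = sum(direct_effects[s] for s in step_sources)
    return 100 * step_total / total


# Hypothetical direct effects of edges into the name-mover heads.
effects = {"s_inhibition_a": 1.5, "s_inhibition_b": 0.5, "other_head": 0.5}
pct = step_importance(effects, ["s_inhibition_a", "s_inhibition_b"])
```

In this made-up case the S-inhibition edges account for 80% of the direct effect into the endpoint, which would count as strong evidence that the step is still in use at that checkpoint.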
Finally, we compute each of these metrics for each model from 160M to 2.8B parameters in size; we omit Pythia-70m, as it does not learn the task, and Pythia-6.9b/12b due to computational constraints. We run them on each checkpoint post-circuit emergence (that is, when all component types appear in the circuit); for Pythia-160m, we test every checkpoint, and for the larger models we space out checkpoints to save compute, using approximately one third of the available checkpoints. We find (Figure 4B-D) that the behavior measured by these metrics is stable once the initial circuit has formed. Notably, in no model or metric are there dramatic shifts in algorithm corresponding to functional component shifts within the circuit. Moreover, all scores are relatively high, generally above 50%; the core solvers of the algorithm, copy suppression and name-mover heads, have scores above 70%. This suggests that analyses of circuits in fully pre-trained models may generalize well to other model states, rather than being contingent on the particular checkpoint selected.
We emphasize that these metrics show algorithmic stability even in the face of component shifts; that is, many components of a particular type (e.g. name mover heads) can cease playing their role without perturbing the nature of the algorithm. Other heads start assuming the role of the components that have shifted away from their task, but this seems unlikely to be the only way the model can adapt to these kinds of changes. To further quantify the degree to which the set of component nodes involved in these circuits changes, we present a series of metrics in Appendix D.
Generalization across model scales also seems promising, as IOI circuit metrics from Pythia-160m are also high in larger Pythia variants. However, there is variation: while the name-mover, copy-suppression, and S-inhibition heads are at work in all models’ circuits, the Pythia-160m circuit does not involve duplicate-token heads, while others do. So small differences exist amid big-picture similarity. Moreover, we stress that these algorithmic similarities might not hold for more complex tasks, for which a greater variety of algorithms could exist.
Implications for Interpretability Research. While our findings are based on a limited set of circuits, they hold significant implications for mechanistic interpretability research. Our study was motivated by the fact that most such research does not study models that vary over time, like currently deployed models. However, the stability of circuit algorithms over the course of training suggests that analyses performed on models at a given point during training may provide valuable insights into earlier and later phases of training as well. Moreover, the consistency in the emergence of critical components and the algorithmic structure of these circuits across different model scales suggests that studying smaller models can sometimes provide insights applicable to larger models. This dual stability across training and scale could reduce the computational burden of interpretability research and allow for more efficient study of model mechanisms. However, further research is needed to confirm these trends across a broader range of tasks and model architectures.
Limitations and Future Work. Our analysis was limited to a narrow range of tasks feasible for small models. This limits in turn the scope of the claims that we can make. We believe it to be very possible that more complex tasks, not solvable by small models, which permit a larger range of algorithmic solutions, may show different trends from those that we discuss here. Such work would be valuable, though computationally expensive due to the model sizes required. Our analysis also studied models only from one model family, Pythia. It is thus not possible to tell if our results are limited to the specific model family we have chosen, which shares both architecture and training setup across model scale. Such work is in part hampered by the lack of large-scale model suites such as Pythia; future work could provide these suites to enable this sort of analysis.
Our work additionally only studies circuits over the course of training; in contrast, open-source models are more often fine-tuned, which could lead to different changes in mechanisms, though previous small-scale studies suggest this is not the case (Prakash et al., 2024). Finally, future work would do well to explore more complex phenomena, such as the self-repair and load-balancing mechanisms of LLMs, which ensure consistent task performance despite component fluctuations.
Interpretability Over Time. LLMs’ development over the course of pre-training has been studied with various non-mechanistic interpretability techniques, particularly behavioral interpretability, which characterizes model behavior without making claims about its implementation. Such longitudinal analyses have studied LLM learning curves and shown that models of different sizes acquire capabilities in the same sequence (Xia et al., 2023; Chang et al., 2023), examined how LLMs learn linguistic information (Warstadt et al., 2020; Choshen et al., 2022; Chang & Bergen, 2022) and even predicted LLM behavior later in training (Hu et al., 2023; Biderman et al., 2023a). Nevertheless, behavioral studies alone cannot inform us about model internals. Prior work has studied the development of mechanisms in smaller models (Nanda et al., 2023; Olsson et al., 2022), and suggests that model mechanisms can change abruptly, even as models’ outward behavior stays the same. Other previous studies have examined the pre-training window where acquisition of extrinsic grammatical capabilities occurs (Chen et al., 2024).
Mechanistic Interpretability. We build on previous work in mechanistic interpretability, which aims to reverse engineer neural networks. Circuits are a significant paradigm of model analysis that has emerged from this field, originating with vision models (Olah et al., 2020) and continuing to transformer LMs (Meng et al., 2023; Wang et al., 2023; Hanna et al., 2023; Varma et al., 2023; Merullo et al., 2024; Lieberum et al., 2023; Tigges et al., 2023). Increasingly, research has tried to characterize the individual components at work within circuits, not only at the level of attention heads (Olsson et al., 2022; Chen et al., 2024; Singh et al., 2024; Gould et al., 2023; McDougall et al., 2023), but also neurons (Vig et al., 2020; Finlayson et al., 2021; Sajjad et al., 2022; Gurnee et al., 2023; Voita et al., 2023) and other sorts of features (Bricken et al., 2023; Huben et al., 2024; Marks et al., 2024). Recent work has also tried to accelerate mechanistic research via automated techniques (Conmy et al., 2023; Bills et al., 2023; Syed et al., 2023; Hanna et al., 2024). Though mechanistic interpretability is a diverse field, it is often tied together by a reliance on causal methods (Vig et al., 2020; Chan et al., 2022; Geiger et al., 2021; 2023; Meng et al., 2023; Wang et al., 2023; Chan et al., 2023; Cohen et al., 2023), which provide more faithful mechanistic explanations.
Impact Statement
This paper aims to advance the field of Mechanistic Interpretability. By studying the stability and generalizability of language model circuits and components, our research contributes to understanding the degree to which mechanisms remain relevant across training and model scale. This understanding can aid in developing tools to detect and analyze critical behaviors in language models, in the long term potentially helping to identify and mitigate harmful or deceptive patterns in AI systems, thus enhancing their safety and reliability.