[Index marking: Anthropic toy model and references]
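1. Introduction
Recent advances in large language models (LLMs) have substantially changed the field of natural language processing (NLP). Through pre-training on massive datasets, these models can understand and generate text across many languages. However, the mechanism by which they process different languages is not yet clearly understood. To address this gap, the paper queries LLMs in various languages, analyzes their hidden embeddings, and measures the ratio of tokens belonging to each language. The result: non-English queries initially produce non-English embeddings, shift to English-centric representations as they pass through the middle layers, and return to non-English embeddings in the final layers.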
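2. Multilingual Workflow (MWork): Hypothesis and Verification
Based on this observation, a three-stage workflow for multilingual processing is hypothesized: understanding, task-solving, and generation. In the first stage, the LLM understands queries by converting them from various languages into a unified representation. Next, in the task-solving stage, it thinks in English and integrates knowledge from multiple languages. Finally, the model generates the response in the original language. To verify this hypothesis, language-specific parameters were extracted and selectively deactivated to evaluate the function of each structure. In particular, the PLND (Parallel Language-specific Neuron Detection) method was used to detect language-specific neurons, and model performance was then evaluated on various benchmark tasks.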
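3. Experiments and Methods
3.1 Datasets and Benchmarks
Benchmarks including XQuAD, MGSM, X-CSQA, and XLSum were used to evaluate the model's understanding, reasoning, knowledge extraction, and generation abilities. Because these datasets cover multiple languages, they allow a comprehensive evaluation of an LLM's multilingual processing ability.
3.2 Main Method
To detect language-specific neurons, the PLND method was developed. It identifies the relevant neurons without labeled data or parameter adjustments. The importance of a specific neuron in each layer is defined mathematically as follows.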
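\[\text{Imp}(N(i)|c) = \|T_i\setminus N(i)(h_i) - T_i(h_i)\|^2\]
Here, $T_i\setminus N(i)(\cdot)$ denotes the parameters of the $i$-th layer with $N(i)$ deactivated. This makes it possible to identify the neurons that are activated specifically for each language and, by deactivating them, to evaluate their effect on the LLM's ability to process different languages.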
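Main Method: PLND (Parallel Language-specific Neuron Detection)
[Overview and Goal]
PLND uses unlabeled data to identify how important a specific neuron in a language model is for a given language. It is mainly used with Transformer-based models to detect the neurons activated for each language and to understand how those neurons contribute to processing that language.
[Theoretical Background]
A Transformer model takes an input sequence and processes it layer by layer to produce the final output. Each layer consists mainly of an attention mechanism and a feed-forward network. PLND analyzes the activation patterns of individual neurons to identify the neurons in each layer that matter for a given language.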
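[Mathematical Definition and Explanation]
The basic method PLND uses to identify important neurons is as follows.
Embeddings and neuron activation state
Given an input \(c\), let \(h_i\) be the embedding before the \(i\)-th layer and \(h_{i+1} = T_i(h_i)\) the embedding after it, where \(T_i\) denotes the parameters of the \(i\)-th layer.
Measuring neuron importance
The importance of a specific neuron \(N(i)\) is measured by the difference in output between when it is deactivated and when it is activated. As a formula:
\[\text{Imp}(N(i)|c) = \|T_i\setminus N(i)(h_i) - T_i(h_i)\|^2\]
Here, \(T_i\setminus N(i)\) is the \(i\)-th layer computed with neuron \(N(i)\) excluded. The expression measures how strongly neuron \(N(i)\) affects the processing of input \(c\), as a squared Euclidean distance.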
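The importance definition can be illustrated with a minimal sketch in plain Python. The toy layer (`ReLU(h @ w_in) @ w_out`), the helper names, and all numbers are assumptions made for illustration; they are not the layers or values used in the paper. A neuron is taken to be one column of `w_in`, and deactivating it means setting that column to zero.

```python
def layer(h, w_in, w_out):
    """Toy layer: ReLU(h @ w_in) @ w_out, written with plain lists."""
    hidden = [max(0.0, sum(h[d] * w_in[d][k] for d in range(len(h))))
              for k in range(len(w_in[0]))]
    return [sum(hidden[k] * w_out[k][j] for k in range(len(hidden)))
            for j in range(len(w_out[0]))]

def deactivate(w_in, k):
    """Return a copy of w_in with column k (the neuron) set to zero."""
    return [[0.0 if col == k else v for col, v in enumerate(row)] for row in w_in]

def importance(h, w_in, w_out, k):
    """Imp(N|c): squared L2 distance between layer outputs with and without neuron k."""
    full = layer(h, w_in, w_out)
    ablated = layer(h, deactivate(w_in, k), w_out)
    return sum((a - b) ** 2 for a, b in zip(ablated, full))

# Hypothetical toy input and weights.
h = [1.0, 2.0]
w_in = [[1.0, -1.0], [0.5, 0.0]]
w_out = [[1.0, 0.0], [0.0, 1.0]]
scores = [importance(h, w_in, w_out, k) for k in range(2)]
```

In this toy setting neuron 0 carries the whole output, while neuron 1 is already silenced by the ReLU, so removing it changes nothing.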
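Determining language-specific importance
For a corpus \(C = \{c_1, \dots, c_n\}\) in a specific language, the importance of each neuron is computed, and the neurons whose importance is at least a threshold \(\epsilon\) for every document in the corpus are identified as language-specific neurons.
\[\{N(i) \mid \text{Imp}(N(i)|c_l) \geq \epsilon, \forall c_l \in C\}\]
[Parallel Processing Method]
Because the traditional sequential neuron-by-neuron check is computationally expensive, PLND accelerates it through parallel processing, using matrix operations and GPU acceleration to optimize the importance computation. The importance of all neurons is computed simultaneously, so language-specific neurons can be identified efficiently.
This method makes it possible to understand the multilingual processing abilities of language models more deeply and lays a foundation for developing models optimized for specific languages. Such an approach plays an important role in improving language model performance and enables efficient NLP applications in multilingual settings.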
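[Index marking: multilingual models, language understanding ability]
4. Related Work
In the fast-moving field of natural language processing (NLP), large language models (LLMs) are at the forefront of multilingual capability. The complexity and diversity of these models have prompted a range of research efforts aimed at improving and understanding multilingual processing. This section reviews recent advances and methods for multilingual LLMs, organized around several main themes.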
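[Multilingual Benchmarks and Performance Improvement]
[Representation Alignment and Prompting]
[Cross-lingual Transfer and Continued Training]
[Model Interpretability and Knowledge Storage]
[Self-attention and Reasoning]
Self-attention mechanism: Hou et al. (2023) and Stolfo et al. (2023) showed how the self-attention mechanism plays an important role in carrying out complex reasoning tasks. By analyzing this mechanism, researchers can see which aspects of the input the model prioritizes during reasoning.
The field of multilingual LLMs is developing actively, and each new study contributes to broader understanding and improved capability. From improving benchmarks to refining cross-lingual transfer methods, progress in this area not only pushes the boundary of what these models can achieve but also opens new avenues for practical applications across diverse linguistic and cultural contexts.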
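5. Conclusion and Future Work
This work gives a better understanding of how LLMs process multiple languages and presents a way to improve a model's multilingual ability by fine-tuning language-specific neurons. The authors note that future work should include more languages and more complex multilingual tasks to validate the approach more broadly.
To verify the hypothesis that an LLM handles multilingual input by first translating the query into English, processing it in English with the help of multilingual knowledge, and then translating the response back into the original language, the authors built Parallel Language-specific Neuron Detection (PLND) and used Vicuna and Mistral to study the activity of language-specific neurons for En, De, Fr, Zh, Es, and Ru.
In both the self-attention and feed-forward structures, language-specific neurons overlapped heavily with those for English. Comparing the activity of the various neurons, the authors conclude that the model already contains regions that assist translation, and they propose the hypothesis that multilingual queries are converted to English, solved, and then returned in the original language.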
Recent advancements in large language models (LLMs) (OpenAI, 2023; Touvron et al., 2023; Team et al., 2023) have dramatically transformed the field of natural language processing (NLP). Thanks to the extensive pretraining on massive corpora mixed with different languages, these models demonstrate remarkable capabilities in understanding and generating text across multiple languages (Huang et al., 2023; Zhu et al., 2023; Zhang et al., 2023a; Zhao et al., 2024a). Despite these advancements, the intricate mechanism of their multilingual processing behavior remains largely unclear, which leads to an important research question: How do large language models handle multilingualism?
To understand the working mechanism of LLMs, existing studies mainly focus on the relationship between model architectures and certain capabilities, with some investigating reasoning abilities with self-attention layers (Hou et al., 2023; Stolfo et al., 2023; Friedman et al., 2023), and others interpreting feed-forward layers as key-value memories for storing factual knowledge (Geva et al., 2021; Dai et al., 2022a; Meng et al., 2022). However, these works solely center on English and neglect the multilingual features of LLMs in their interpretations.
To gain an initial understanding of the multilingual mechanism of LLMs, we test LLMs with various non-English queries and decode the hidden embeddings of each layer to tokens within the LLM’s vocabulary. Subsequently, we classify these decoded tokens into either English or non-English, and analyze the ratio. Figure 1 illustrates the ratio of English and non-English tokens for each layer of two LLMs. We observe that non-English queries initially generate non-English embeddings as expected. However, as queries progress through the middle layers, the representations become English-centric. In the final layers, there is a reversion to predominantly non-English embeddings, matching the non-English queries.
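The per-layer ratio described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how tokens are classified, so `is_english` uses a crude ASCII-letter heuristic (which would misclassify, for example, unaccented French words), and `decoded_by_layer` holds made-up tokens rather than real model output.

```python
def is_english(token):
    """Heuristic (an assumption): English means the token has letters and all are ASCII."""
    letters = [ch for ch in token if ch.isalpha()]
    return bool(letters) and all(ch.isascii() for ch in letters)

def non_english_ratio(tokens):
    """Share of non-English tokens among tokens that contain at least one letter."""
    worded = [t for t in tokens if any(ch.isalpha() for ch in t)]
    if not worded:
        return 0.0
    return sum(not is_english(t) for t in worded) / len(worded)

# Hypothetical decoded tokens for three layers of a Chinese query (toy data).
decoded_by_layer = {
    0: ["你", "好", "世界", "!"],
    16: ["hello", "world", "the", "好"],
    31: ["你", "好", "world", "世界"],
}
ratios = {layer: non_english_ratio(toks) for layer, toks in decoded_by_layer.items()}
```

The toy data is arranged to mimic the reported pattern: mostly non-English at the first layer, mostly English in the middle, and mostly non-English again at the end.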
Motivated by the observed transformation above, we hypothesize a three-stage multilingual workflow: understanding, task-solving, and generating. This involves understanding the original non-English queries and interpreting them in English, solving tasks in English, and reverting outputs back to the original language. Furthermore, building upon previous studies that link self-attention structures to reasoning and feed-forward structures to factual knowledge storage (Hou et al., 2023; Geva et al., 2021), we further decouple the task-solving stage into reasoning with self-attention structures and extracting multilingual knowledge with feed-forward structures. Therefore, our hypothesized Multilingual Workflow (MWork) illustrated in Figure 2 outlines the three operational stages of LLMs in processing multilingual queries: Initially, LLMs understand queries by converting diverse linguistic features into a unified representation. In the task-solving phase, LLMs think in English and incorporate multilingual knowledge to obtain factual content, using self-attention and feed-forward structures, respectively. Finally, models generate responses in the same language as the original query.
Figure 2: Our hypothesized multilingual workflow MWork.
To verify the proposed MWork workflow, we extract language-specific parameters and selectively deactivate them within different structures, thereby assessing the functionality of corresponding structures and validating our hypothesis. To identify the parameters to be deactivated, we develop a novel approach called Parallel Language-specific Neuron Detection (PLND). Unlike existing methods that rely on fine-tuning (Frankle and Carbin, 2018; Aghajanyan et al., 2021; Zhang et al., 2023b), labeled data (Tang et al., 2024; Liu et al., 2024), or parallel corpora (Libovický et al., 2020; Tanti et al., 2021; Zhang et al., 2024) to detect activated parameters, PLND measures the significance of individual neurons with respect to the input in both attention and feed-forward structures without any labeled data or parameter adjustments. Using PLND, we identify language-specific neurons with a free-text corpus of that language and isolate consistently activated neurons. We find that by deactivating language-specific neurons, which account for only 0.13% of all neurons, LLMs’ performance on a summarization task could drop by 99%.
We then extensively verify the hypothesized MWork framework using the proposed PLND method. Employing various benchmark tasks, including XQuAD (Artetxe et al., 2020) for understanding, MGSM (Shi et al., 2022) for reasoning, X-CSQA (Lin et al., 2021) for knowledge extraction, and XLSum for generation (Hasan et al., 2021), we selectively deactivate language-specific neurons in each component and verify the functionality of the component by observing a significant decline in performance on the corresponding task. For example, when deactivating the language-specific neurons in the understanding layer, the performance on the multilingual understanding task XQuAD remains stable in English, while experiencing a decrease of 14% in non-English languages. Other tasks exhibit the same characteristics when deactivating corresponding neurons. More importantly, with the verified MWork framework, enhancing the multilingual capabilities of LLMs can thus be achieved through the fine-tuning of language-specific neurons for certain capabilities. With a remarkable reduction in the training corpus size to a mere few hundred documents, this fine-tuning procedure enhances the multilingual capabilities of LLMs for both high-resource and low-resource languages by an average of 3.6% and 2.3% across all tasks, respectively. Notably, even without an English training corpus, there is a noticeable improvement in English performance, as the enhancement of language-specific neurons yields greater accuracy in enhancing specific languages, while simultaneously ensuring a clear division of parameters among different languages. In summary, the verified MWork reveals how LLMs handle multilingual tasks and offers an effective approach for conducting language-specific enhancements without compromising performance in other languages.
To verify the hypothesized workflow, we propose PLND that effectively detects language-specific neurons without relying on any labeled data.
We define a neuron as a single row or column of a parameter matrix of a language model. To identify neurons responsible for a specific language, it is crucial to discern the significance of a neuron with respect to the inference of a given input. Specifically, when processing the input \(c\) in the model, we denote the hidden embedding before the \(i\)-th layer in Transformer (Vaswani et al., 2017) as \(h_i\), and the hidden embedding after the \(i\)-th layer as \(h_{i+1} = T_i(h_i)\), where \(T_i\) represents the parameters of the \(i\)-th layer. For a specific neuron within the \(i\)-th layer, denoted as \(N(i)\), either located in the attention or feed-forward network, we quantify its importance in processing the input \(c\) by measuring the difference in the hidden embedding after the \(i\)-th layer, i.e., \(h_{i+1}\), when \(N(i)\) is activated or deactivated. Formally, the impact of neuron \(N(i)\) for input \(c\) is defined as
\[\text{Imp}(N(i)|c) = \|T_i\setminus N(i)(h_i) - T_i(h_i)\|^2,\]
where \(T_i\setminus N(i)(\cdot)\) denotes deactivating \(N(i)\) in \(T_i\), i.e., setting all parameters of the neuron \(N(i)\) to zero. With a set of \(n\) corpora in a specific language, denoted as \(C = \{c_1, \cdots, c_l, \cdots, c_n\}\), we calculate the importance of each neuron in each layer to each corpus. Furthermore, we can obtain language-specific neurons that are important to all corpora in that language, i.e.,
\[\{N(i) \mid \text{Imp}(N(i)|c_l) \geq \epsilon, \forall c_l \in C\},\]
where \(\epsilon\) is the pre-defined threshold.
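The selection rule amounts to keeping the neurons that clear the threshold on every corpus. A minimal sketch, assuming the importance scores have already been computed and stored as one dictionary per corpus keyed by a hypothetical `(layer, neuron)` pair; the function name and all scores are invented for illustration.

```python
def language_specific(importance_by_doc, epsilon):
    """Neurons whose importance is >= epsilon for every corpus c_l in C."""
    neuron_ids = set(importance_by_doc[0])
    return sorted(
        n for n in neuron_ids
        if all(doc[n] >= epsilon for doc in importance_by_doc)
    )

# Hypothetical importance scores: one dict per corpus, keyed by (layer, neuron).
scores_per_doc = [
    {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.7},
    {(0, 0): 0.8, (0, 1): 0.6, (1, 0): 0.2},
    {(0, 0): 0.5, (0, 1): 0.4, (1, 0): 0.9},
]
selected = language_specific(scores_per_doc, epsilon=0.5)
```

Because the condition must hold for all corpora, a single low score is enough to exclude a neuron, which is what makes the resulting set "consistently activated".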
The sequential neuron detection is time-consuming, requiring traversal of all neurons and inputs sequentially. To address this, we further propose a parallel algorithm for accelerating the process.
In the latest open-source models, when processing input \(c\), the feed-forward network in a certain layer is defined as
\[\text{FFN}(x) = \left(\text{SiLU}(W_{\text{gate}}(x)) \cdot W_{\text{up}}(x)\right) W_{\text{down}}(x),\]
where \(x \in \mathbb{R}^{l \times d_{\text{model}}}\) is the embedding fed into the FFN, \(W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{inter}}}\), and \(W_{\text{down}} \in \mathbb{R}^{d_{\text{inter}} \times d_{\text{model}}}\). The calculation of the importance of the \(k\)-th neuron in \(W_{\text{up}}\), when processing the input \(c\), as presented in Equation 1, can be equivalently transformed to
\[\text{Imp}(W_{\text{up}}[:, k]|c) = \| \hat{\text{FFN}}(x) - \text{FFN}(x)\|^2 = \| (h_{\text{ffn}} \cdot \text{Mask}[k]) W_{\text{down}}(x)\|^2,\]
where \(h_{\text{ffn}} \in \mathbb{R}^{d_{\text{inter}}}\) represents the embedding before \(W_{\text{down}}\), and \(\text{Mask}[k] \in \mathbb{R}^{d_{\text{inter}}}\) is a vector with the \(k\)-th element equal to 1 and the rest equal to 0. To calculate \(\text{Imp}(W_{\text{up}}[:, k] | c)\) for \(k \in d_{\text{inter}}\) in parallel, we introduce a diagonal mask matrix of size \((d_{\text{inter}}, d_{\text{inter}})\), denoted as \(\text{Mask}\). Therefore,
\[\text{Imp}(W_{\text{up}}|c) = \| (h_{\text{ffn}} \cdot \text{Mask}) W_{\text{down}}(x)\|^2.\]
Furthermore, we observe that deactivating the \(k\)-th neuron of \(W_{\text{down}}\) is equivalent to deactivating the \(k\)-th neuron in \(W_{\text{up}}\), as they both result in \(h_{\text{ffn}}[k] = 0\). Hence, we can also derive \(\text{Imp}(W_{\text{down}} | c)\) by employing Equation (5).
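The equivalence behind the masked form can be checked numerically. The sketch below is a plain-Python illustration for a single token, assuming a gated FFN whose output is `h_ffn` multiplied by `w_down`; the helper names and weights are invented. It compares the brute-force score (zero one column of `w_up` and rerun the FFN) with the shortcut that reads every neuron's score off a single forward pass, where row `k` of `diag(h_ffn) @ w_down` is the output change caused by neuron `k`.

```python
import math

def matvec(x, w):
    """Row vector x (length d) times matrix w (d x m)."""
    return [sum(x[d] * w[d][j] for d in range(len(x))) for j in range(len(w[0]))]

def silu(v):
    return v / (1.0 + math.exp(-v))

def h_ffn(x, w_gate, w_up):
    """Embedding before W_down: SiLU(x W_gate) * (x W_up), elementwise."""
    return [silu(g) * u for g, u in zip(matvec(x, w_gate), matvec(x, w_up))]

def ffn(x, w_gate, w_up, w_down):
    return matvec(h_ffn(x, w_gate, w_up), w_down)

def importance_sequential(x, w_gate, w_up, w_down, k):
    """Brute force: zero column k of W_up, rerun the FFN, compare outputs."""
    ablated_up = [[0.0 if j == k else v for j, v in enumerate(row)] for row in w_up]
    full = ffn(x, w_gate, w_up, w_down)
    ablated = ffn(x, w_gate, ablated_up, w_down)
    return sum((a - b) ** 2 for a, b in zip(ablated, full))

def importance_all(x, w_gate, w_up, w_down):
    """Shortcut: one forward pass, then score neuron k as ||h_ffn[k] * W_down[k, :]||^2."""
    h = h_ffn(x, w_gate, w_up)
    return [sum((h[k] * w) ** 2 for w in w_down[k]) for k in range(len(h))]

# Hypothetical toy input and weights (d_model = 2, d_inter = 3).
x = [1.0, -2.0]
w_gate = [[0.5, 1.0, -1.0], [0.2, -0.3, 0.4]]
w_up = [[1.0, 0.0, 2.0], [-1.0, 0.5, 0.3]]
w_down = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
fast = importance_all(x, w_gate, w_up, w_down)
slow = [importance_sequential(x, w_gate, w_up, w_down, k) for k in range(3)]
```

The two lists agree up to floating-point rounding. In a tensor library the list comprehension in `importance_all` would become one batched matrix operation, which is where the speed-up comes from.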
When processing input \(c\), the self-attention network in a certain layer is
\[\text{Attention}(x) = \text{Softmax} \left( \frac{W_Q(x)W_K^T(x)}{\sqrt{d}} \right) W_V(x),\]
where \(W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{mid}}}\). Since \(W_V(x)\) is not in the non-linear softmax calculation, we can calculate \(\text{Imp}(W_V | c)\) by applying Equation (5). For \(W_Q\), we obtain \(\text{Imp}(W_Q[:, k] | c)\) by deactivating its \(k\)-th neuron, specifically, \(\hat{W_Q} \leftarrow W_Q[:, k] = 0\). Firstly, we calculate the difference in attention weight before and after deactivation, prior to scaling and softmax,
\[\Delta_k(x) = W_Q(x) W_K^T(x) - \hat{W_Q}(x) W_K^T(x) = W_Q(x)[:, k]\, W_K(x)[:, k]^T \in \mathbb{R}^{l \times l}.\]
Next, as the changes in attention exhibit a positive correlation with the changes in the output of this layer, the importance of \(W_Q[:, k]\) in processing \(c\), as defined in Equation 1, can be approximated as
\[\text{Imp}(W_Q[:, k]|c) \approx \| \hat{\text{attention}}(x) - \text{attention}(x) \|^2 = \| \text{softmax} \left( \frac{W_Q(x) W_K^T(x)}{\sqrt{d}} \right) - \text{softmax} \left( \frac{W_Q(x) W_K^T(x) - \Delta_k(x)}{\sqrt{d}} \right) \|^2.\]
This process can also be calculated in parallel, specifically,
\[\Delta(x) = W_Q(x) W_K^T(x) - \hat{W_Q}(x) W_K^T(x) = W_Q(x). \text{resize}(l, 1, d_{\text{mid}}) \times W_K(x). \text{resize}(1, l, d_{\text{mid}}) \in \mathbb{R}^{l \times l \times d_{\text{mid}}}.\]
Therefore, the importance of \(W_Q\) in processing input \(c\) is calculated by
\[\text{Imp}(W_Q|c) \approx \| \text{softmax} \left( \frac{W_Q(x) W_K^T(x)}{\sqrt{d}} \right) - \text{softmax} \left( \frac{W_Q(x) W_K^T(x) - \Delta(x)}{\sqrt{d}} \right) \|^2.\]
Similarly, since \(W_K\) is symmetrical to \(W_Q\), \(\text{Imp}(W_K | c)\) can be calculated in the same way.
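The attention case can be checked the same way. The sketch below is a plain-Python illustration with invented helper names and weights, and it assumes \(d = d_{\text{mid}}\) for the scaling factor. Both functions compare attention weights (the approximation used in the text), not the full layer output. The shortcut reuses the projections of `x` once and subtracts the rank-one slice for neuron `k`, while the brute-force version zeroes column `k` of `w_q` and recomputes everything.

```python
import math

def matmul(a, b):
    """Plain-list matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def attn_weights(scores, d):
    """Row-wise softmax of scores / sqrt(d)."""
    out = []
    for row in scores:
        scaled = [v / math.sqrt(d) for v in row]
        top = max(scaled)
        exps = [math.exp(v - top) for v in scaled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def sq_dist(a, b):
    """Squared Frobenius distance between two equally shaped matrices."""
    return sum((u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb))

def qk_scores(q, k_mat):
    """Q K^T for Q and K of shape (l, d_mid)."""
    return [[sum(qi[t] * kj[t] for t in range(len(qi))) for kj in k_mat] for qi in q]

def importance_wq_shortcut(x, w_q, w_k):
    """Score every column of W_Q from one projection of x."""
    q, k_mat = matmul(x, w_q), matmul(x, w_k)
    d = len(w_q[0])
    scores = qk_scores(q, k_mat)
    base = attn_weights(scores, d)
    result = []
    for k in range(d):
        # Slice k of Delta(x): Delta_k[i][j] = Q[i][k] * K[j][k].
        shifted = [[scores[i][j] - q[i][k] * k_mat[j][k] for j in range(len(q))]
                   for i in range(len(q))]
        result.append(sq_dist(base, attn_weights(shifted, d)))
    return result

def importance_wq_sequential(x, w_q, w_k, k):
    """Brute force: zero column k of W_Q and recompute the attention weights."""
    ablated = [[0.0 if j == k else v for j, v in enumerate(row)] for row in w_q]
    d = len(w_q[0])
    k_mat = matmul(x, w_k)
    full = attn_weights(qk_scores(matmul(x, w_q), k_mat), d)
    cut = attn_weights(qk_scores(matmul(x, ablated), k_mat), d)
    return sq_dist(full, cut)

# Hypothetical toy input (l = 3 tokens, d_model = d_mid = 2).
x = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
w_q = [[1.0, -0.5], [0.3, 0.8]]
w_k = [[0.2, 1.0], [-1.0, 0.4]]
fast = importance_wq_shortcut(x, w_q, w_k)
slow = [importance_wq_sequential(x, w_q, w_k, k) for k in range(2)]
```

The Python loop over `k` stands in for the batched \(l \times l \times d_{\text{mid}}\) tensor in the text; with a tensor library the same subtraction and softmax would run for all `k` at once.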
We then apply PLND to selected languages and models to validate its effectiveness in detecting language-specific neurons and to further investigate the relationships between languages.
Experimental Setup. We test two open-source models that perform well on multilingual tasks, including Vicuna-7b-v1.5[6] (Chiang et al., 2023) and Mistral-7b-Instruct-v0.2 (Jiang et al., 2023). For simplicity, we abbreviate them as Vicuna and Mistral hereafter to represent the two models respectively. We select the text summarization task with the XLSum (Hasan et al., 2021) dataset as the reference task to evaluate multilingual performance as it requires the model to comprehend the input text and generate a coherent fragment. We adopt 4 high-resource languages including French (Fr), Chinese (Zh), Spanish (Es), and Russian (Ru), as their initial performance on those languages is already quite reasonable for observing the multilingual processing mechanism. Furthermore, we utilize OSCAR (Caswell et al., 2020) corpus which contains web crawling texts for each language to compile a language-specific corpus without task-specific considerations, and details are presented in Appendix B.
Existence of Language-Specific Neurons. Using PLND, we feed a corpus in a specific language to LLMs and identify neurons that are consistently activated, which are responsible for processing queries in that language. To ascertain whether these neurons are genuinely language-specific, we assess the performance of LLMs in corresponding languages when these neurons are deactivated versus when the same number of randomly sampled neurons are deactivated.
[5] In some models like Vicuna and Mistral, \(d_{\text{model}} = d_{\text{mid}}\), but we use different notations to avoid ambiguity.
[6] We do not directly utilize Llama2-chat as it does not follow multilingual instructions, consistently responding in English regardless of the language of the query.
Table 1: Multilingual performance on XLSum when deactivating language-specific neurons (“Lang-Spec”) and an equivalent number of randomly selected neurons (“Random”).
Table 1 demonstrates the decline of multilingual capabilities when deactivating language-specific neurons. Although only around 0.13% of neurons are deactivated, LLMs lose their multilingual capabilities and fail to generate meaningful content. In contrast, deactivating the same number of randomly selected neurons does not yield any difference. Therefore, the detected neurons are language-specific and related to handling corresponding multilingual inputs. Furthermore, we investigate the degree of overlap among their language-specific neurons in Appendix C.
By classifying the hidden representations of each layer in LLMs into English or non-English (as shown in Figure 1), we can observe the shift from non-English to English-centric, and back to non-English with the progression through the layers. This motivates us to hypothesize a three-stage multilingual workflow: understanding the original non-English queries and interpreting them in English, task-solving in English, and generating back to the original language. Nevertheless, the presence of certain non-English tokens during the English-centric task-solving stage inspires us to further investigate this stage.
Figure 3: Number of language-specific neurons when processing multilingual queries.
With the proposed PLND method, we extract language-specific neurons from attention and feed-forward structures when processing various multilingual queries. We plot the average number of activated language-specific neurons when processing each query in Figure 3. Notably, the number of language-specific neurons decreases within the self-attention structure in the task-solving layer but remains consistent across the layers of the feed-forward structure. This decline implies a reliance on the English language for thinking while extracting multilingual knowledge to support query processing, which is also consistent with (Geva et al., 2021)’s interpretation of the feed-forward structure as key-value memories for knowledge extraction. Therefore, we further decompose the task-solving layer into two parts: thinking in English and extracting knowledge in a multilingual context.
Considering the above insights, we propose the MWork hypothesis for explaining LLM’s multilingual workflow: LLMs first understand user input by unifying diverse linguistic features. They then engage in the task-solving phase, employing English for thinking and leveraging multilingual knowledge through self-attention and feed-forward structures, respectively. Finally, the models generate responses aligned with the query’s original language.
To verify MWork, we selectively deactivate language-specific neurons from each component. Then its functionality can be verified if this deactivation results in minimal impact on English performance while exhibiting a notable decline in multilingual performance for the corresponding task.
Table 2: Results of the understanding task, where ‘✗’ indicates that chosen neurons in the corresponding layer are deactivated, and ‘✓’ signifies they are activated. ∆ is defined as the difference between the reduction in performance in English, denoted as ∆Eng, and the reduction in performance in non-English languages, denoted as ∆n-Eng.
Dataset. To comprehensively understand how LLMs work with different abilities, we employ four kinds of tasks including MGSM (Shi et al., 2022) for the reasoning task, XQuAD (Artetxe et al., 2020) for the understanding task, X-CSQA (Lin et al., 2021) for the knowledge question answering task, and XLSum (Hasan et al., 2021) for the generation task. Detailed information regarding these datasets and the testing prompts can be found in Appendix D. We adopt 6 languages including English (En), German (De), French (Fr), Chinese (Zh), Spanish (Es), and Russian (Ru), as their initial performance on those languages is already quite reasonable for observing the multilingual processing mechanism. For XLSum, we randomly sample 500 data points from the whole test set for each language taking into consideration its long inference time, while for other tasks, we employ the entire test set. We evaluate the vanilla performance of Vicuna and Mistral on these datasets for later comparison as presented in Appendix E. For reasoning, understanding, and knowledge question answering tasks, we adopt accuracy as the metric. As for the generation tasks, we adopt ROUGE-L as the metric.
Deactivation Strategy. We primarily consider two aspects when selecting the deactivation settings: (1) language-specific neurons versus randomly chosen neurons, and (2) the position of neurons, which encompasses four structures. More detailed settings are explained from Section 3.3 to Section 3.6. For the concrete numbers of different layers, we tune hyperparameters by XQuAD in Chinese. Details are explained in Appendix F.
Notations. Tables 2 to 5 present the results of deactivating certain neurons, where “Under” denotes the understanding layers, “S-ATTN” and “S-FFN” correspond to the self-attention and the feed-forward structures within the task-solving layers respectively, and “Gen” refers to the generation layers. The term “Random” is used to describe deactivating randomly chosen neurons, whereas “Lang-Spec” refers to the deactivation of language-specific neurons. We also present the gap between the original performance (as shown in Table 9) and performance after deactivation (as shown in Table 12 to Table 15) for English (∆Eng) and averaged non-English languages (∆n-Eng), respectively. A single metric ∆ is then introduced as ∆Eng − ∆n-Eng, where a high value indicates that the deactivation has little impact on English performance but leads to a performance drop in non-English languages. Therefore, this provides evidence that the deactivated neurons are language-specific and hold a significant responsibility in executing the corresponding task.
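The ∆ metric can be made concrete with a small sketch. The sign convention is an assumption: each gap is taken as the score after deactivation minus the original score, so that a large positive ∆ matches the interpretation given above (English barely moves while non-English drops). The function name and all scores are hypothetical, not values from the paper's tables.

```python
def delta_metric(original, deactivated, english="En"):
    """Return (delta_eng, delta_non_eng, delta) from per-language scores.

    Assumed convention: gap = score after deactivation - original score.
    """
    change = {lang: deactivated[lang] - original[lang] for lang in original}
    delta_eng = change[english]
    others = [v for lang, v in change.items() if lang != english]
    delta_non_eng = sum(others) / len(others)
    return delta_eng, delta_non_eng, delta_eng - delta_non_eng

# Hypothetical accuracy scores before and after deactivation.
original = {"En": 60, "De": 50, "Zh": 40}
deactivated = {"En": 59, "De": 40, "Zh": 32}
result = delta_metric(original, deactivated)
```

Here English loses one point while the non-English average loses nine, giving ∆ = 8; deactivating neurons that hurt all languages equally would give a ∆ near zero.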
Deactivating Method. Table 2 shows the results of the understanding task following the deactivation of five distinct sets of neurons: (i) neurons randomly selected from the understanding layers; (ii) neurons randomly chosen across all layers; (iii) language-specific neurons within the task-solving layers; (iv) language-specific neurons in the generation layers; (v) language-specific neurons in the understanding layers. For a fair comparison, we ensure the numbers of deactivated neurons in all settings are the same. As mentioned above, in order to verify the functionality of the understanding layer (setting v), we compare it with deactivating other types of layers, specifically setting iii for the task-solving layer and setting iv for the generation layer. Full results are listed in Appendix G.
Findings. We find that by deactivating randomly sampled neurons, no matter in the understanding layer or all layers, the performance of LLMs in both English and non-English languages is almost unaffected compared to other deactivating methods. Note that in some cases, deactivating randomly sampled neurons may even increase the performance because irrelevant neurons are removed, which also aligns with the finding from Sharma et al. (2023). When assessing the differential impact on English and non-English language performance after the deactivation, specifically the difference calculated as ∆Eng − ∆n-Eng, it is evident that the deactivation of random neurons within the understanding layer amplifies this effect. This observation lends partial support to the hypothesized role of the understanding layer in language processing.
Furthermore, we find that deactivating language-specific neurons in the understanding layer influences the performance in English a little while significantly decreasing the performance in non-English languages. When deactivating language-specific neurons in the task-solving layer, both English and non-English languages are significantly reduced while deactivating language-specific neurons in the generation layer influences a little for both English and non-English languages. Therefore, we prove that the first several layers are responsible for understanding because deactivated neurons just disable LLMs on the NLU task in non-English languages. Furthermore, disabling language-specific neurons in the task-solving layer shows that LLMs rely on English, as performance drops across all languages.
Deactivating Method. Table 3 shows the result of the reasoning task, where we deactivate 6 sets of neurons. We adhere to the previous logic of selecting deactivation settings, with the exception that we do not conduct an independent experiment on deactivating neurons in the understanding layer, as its functionality has already been verified. Details are listed in Appendix G.
Findings. We find that deactivating randomly sampled neurons in task-solving layers disables the capabilities of LLMs in reasoning to a greater extent than deactivating randomly sampled neurons in all layers, which verifies the function of the task-solving layer. Furthermore, comparing three deactivating language-specific neuron methods, we find that deactivating the task-solving layer decreases performance in both English and non-English. On the contrary, when we only deactivate language-specific neurons not in the task-solving layer, non-English is influenced more seriously than English. Moreover, eliminating interference from the feed-forward layer achieves better results, which verifies the function of attention structure in the task-solving layer.
Deactivating Method. Table 4 shows the result of the knowledge question answering task, where we deactivate 5 sets of neurons. Similarly, we exclude the deactivation of neurons in layers that have already been verified and instead concentrate on the self-attention structure and feed-forward structure in the task-solving layer. Details are listed in Appendix G.
Findings. Likewise, targeted deactivation of language-specific neurons within the feed-forward structure of the task-solving layer predominantly affects non-English languages. This implies that processing multilingual queries necessitates accessing the multilingual information embedded within the relevant structures. However, disabling the self-attention structure compromises the ability to solve tasks across all languages.
Table 4: Results of the knowledge question answering task. The highest performance reduction difference (∆) is achieved by disabling all language-specific neurons in the feed-forward structure within the task-solving layer.
Table 5: Results of the generation task. The highest performance reduction difference (∆) is achieved by disabling all language-specific neurons in the generation layer.
Deactivating Method. Table 5 shows the result of the generation task, where we deactivate 3 sets of neurons. Since all previous layers have been verified, we solely deactivate neurons in the generation layer and compare them with randomly selected neurons. Details are listed in Appendix G.
Findings. Similar to other tasks, the disabling of language-specific neurons within the generation layer diminishes their capacity to generate content in the respective languages. By selectively deactivating neurons that are not associated with English, we do not completely eliminate the models’ multilingual generation abilities. However, as demonstrated in Table 1, the complete deactivation of all language-specific neurons results in the total loss of the LLMs’ multilingual generation capabilities.
We have verified MWork for explaining the multilingual working mechanism of LLMs in the above section via deactivating certain neurons. While opposite to employing deactivation, we can also enhance their multilingual ability, especially the understanding and generating ability, by fine-tuning these language-specific neurons. With language-specific neurons comprising only around 0.1% of all parameters, the need for training documents to improve multilingual capabilities can be significantly reduced to just a few hundred. Additionally, fine-tuning only the language-specific neurons for a particular language does not impact performance in other languages, allowing us to enhance specific languages while preserving performance in others.
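One way to restrict training to language-specific neurons is to mask the update so that only the selected columns of a weight matrix change. The text does not state how the restriction is implemented, so the sketch below is an assumption: a plain SGD step on nested lists, with a hypothetical function name and toy values.

```python
def masked_sgd_step(weights, grads, neuron_columns, lr):
    """Apply w <- w - lr * g only to the listed columns; leave all others untouched."""
    return [
        [w - lr * g if j in neuron_columns else w
         for j, (w, g) in enumerate(zip(w_row, g_row))]
        for w_row, g_row in zip(weights, grads)
    ]

# Hypothetical 2 x 3 weight matrix where only column 1 is language-specific.
weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
grads = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
updated = masked_sgd_step(weights, grads, neuron_columns={1}, lr=0.5)
```

Since every parameter outside the selected columns keeps its exact value, neurons assigned to other languages are left unchanged by construction, which is consistent with the claim that performance in other languages is preserved.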
MWork helps with enhancing multilingual ability by hundreds of documents. We employ Mistral-7b-v0.1 for enhancement to eliminate the interference of instruction fine-tuning, and select causal language modeling as our training task. We create a dataset comprising {100, 200, 400, 800} randomly selected documents for each language, extracted from the Wikipedia corpus.[7] Figure 4 shows the results of enhancement on high-resource languages (De, Fr, Zh, Es, Ru). The numbers represent the sizes of the training corpus when fine-tuning language-specific neurons, while “Random” represents the fine-tuning of an equivalent number of randomly chosen neurons using a corpus of 400. Our findings reveal that fine-tuning with a few hundred documents yields significant performance improvements on multilingual tasks: 3.4% on MGSM, 4.4% on XQuAD, 4.3% on X-CSQA, and 2.3% on XLSum. Moreover, English performance is enhanced by an average of 3.7% across all tasks. These results further confirm the effectiveness of MWork in interpreting structure functionality for LLM’s multilingual query handling, offering precise and independent methods for multilingual enhancement. When fine-tuning with 800 documents, the performance deteriorates compared to using 400 documents. This drop can be attributed to the incorporation of additional knowledge, which disrupts the original knowledge distribution and leads to overfitting of the model to Wikipedia. This can be addressed by mixing data from more sources such as textbooks or websites.
In addition, we verify the effectiveness of this enhancement method on low-resource languages, given that low-resource performance is relatively low with the original model. We select four languages including Vietnamese (Vi), Thai (Th), Arabic (Ar), and Swahili (Sw), covering languages with both Latin and non-Latin scripts and having corresponding test sets in our considered benchmarks. The model was then evaluated on four benchmarks, and the result shown in Table 6 is the average among tasks. It is evident that the fine-tuning method using language-specific neurons enhances the model’s multilingual performance in low-resource languages by an average of 2.2%. Notably, the improvement of 3.5% in English performance is observed even without an English training corpus, indicating the effectiveness of the distinct language responsibilities assigned to neurons.
Table 6: Enhancement is achieved by fine-tuning Mistral-7b-v0.1 model utilizing 400 documents from each language correspondingly. The results are averaged across four tasks. Performance on English (“En”) is obtained by averaging the results from three fine-tuned models.
Figure 4: Enhancement results on high-resource languages, where each number is the average across languages.
In the era of large language models (LLMs), numerous studies have been conducted to develop multilingual benchmarks (Zhang et al., 2023a) and to enhance multilingual performance without parameter adjustments through translation (Liang et al., 2023; Huang et al., 2023), representation alignment (Nguyen et al., 2023a; Salesky et al., 2023), and prompting (Li et al., 2023b; Tanwar et al., 2023). Furthermore, certain works focus on improving multilingual abilities for a single task via cross-lingual transfer (Kim et al., 2017; Lin et al., 2019; Pfeiffer et al., 2020; Zhao et al., 2024b), while others aim to enhance multilingual proficiency by continuous training in one language to obtain monolingual LLMs (Cui et al., 2023a), or in multiple domain languages to obtain domain-lingual LLMs (Nguyen et al., 2023b). Additionally, some works obtain multilingual LLMs by training from scratch (Muennighoff et al., 2023). However, these studies are limited to specific task types or require substantial training corpora, owing to a lack of comprehensive understanding of the multilingual mechanisms of LLMs.
Conventional interpretability research investigates the significance of input features with respect to their corresponding outputs (Vig, 2019; Hewitt and Liang, 2019; Qiu et al., 2020). In the era of LLMs, one branch of work includes efforts to understand knowledge storage, with Geva et al. (2021) initiating the study of the feed-forward layer as a knowledge base. Subsequent work has furthered this by altering neuron values (Dai et al., 2022b), mapping embeddings to words (Geva et al., 2022), modifying inputs to recover embeddings (Meng et al., 2022), and analyzing attention heads (Li et al., 2023a). Another line of research centers on the self-attention layer, examining its connection to reasoning capability (Hou et al., 2023; Stolfo et al., 2023; Friedman et al., 2023) by contrasting the reasoning tree based on attention weights.
In this work, we examine how LLMs handle multilingualism. The proposed multilingual workflow (MWork) suggests that LLMs initially understand queries by converting multilingual inputs into English, think in English in intermediate layers while incorporating multilingual knowledge, and generate responses aligned with the original language in the final layers. The validity of MWork is verified using Parallel Language-specific Neuron Detection (PLND), which identifies activated neurons for different languages without labeled data. By detecting language-specific neurons and fine-tuning them with a small training corpus, MWork enhances multilingual abilities in specific languages without compromising others, resulting in significant average improvements across tasks.
Limitation and Impact Statements
Our experiments are mainly conducted on models with approximately 7 billion parameters (i.e., vicuna-7b-v1.5 and mistral-7b-v0.1), primarily due to computational resource constraints. Testing our methods on larger models could potentially yield additional insights, particularly in understanding the scalability of our proposed framework. Furthermore, our exploration into enhancing the multilingual abilities of language models through our framework was preliminary. Expanding our experiments to include a broader array of languages, especially those considered low-resource, could better demonstrate the effectiveness of our proposed framework. Moreover, extending the scope of our experiments to evaluate other capabilities such as reasoning and multilingual knowledge extraction with specific datasets could provide a more comprehensive picture of the potential benefits of our approach.
Our paper has the potential to significantly enhance the multilingual ability of LLMs. By effectively improving their performance across all languages, it can enable the development of real multilingual models that excel in various applications, promoting better communication and understanding among diverse linguistic communities.