
Art of Balancing

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-12-18

The Art of Balancing: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment

  • url: https://arxiv.org/abs/2312.09979
  • pdf: https://arxiv.org/pdf/2312.09979
  • abstract: Supervised fine-tuning (SFT) is a crucial step for large language models (LLMs), enabling them to align with human instructions and enhance their capabilities in downstream tasks. When the models are required to align with a broader range of downstream tasks, or there is a desire to notably improve the performance on a specific task, a substantial increase in fine-tuning data often emerges as the solution. However, we find that large-scale increases in instruction data can disrupt the world knowledge previously stored in the LLMs, i.e., world knowledge forgetting. In this paper, we introduce LoRAMoE to address the above challenge. LoRAMoE is a plugin version of Mixture of Experts (MoE). The plugin form ensures the integrity of world knowledge by freezing the backbone model during the training phase. We also propose the use of localized balancing constraints to coordinate parts of the experts for task utilization, meanwhile enabling other experts to fully leverage the world knowledge stored in the models. Experimental results demonstrate that LoRAMoE can reasonably coordinate experts based on data type during inference, and even dramatically increasing instruction data does not result in knowledge forgetting. Moreover, LoRAMoE provides additional benefits for the performance of downstream tasks, indicating the potential of our approach for multi-task learning.


TL;DR


Fine-Tuning Large Language Models with Care: Preserving World Knowledge While Expanding Task Capabilities

  1. Supervised fine-tuning (SFT) of large language models (LLMs) on large-scale data causes a knowledge forgetting problem.
  2. LoRAMoE is proposed to solve this problem with multiple plugin experts while also improving performance across diverse tasks.
  3. A localized balancing constraint keeps the expert groups balanced, and experiments demonstrate the effectiveness of the proposed method.

1. Introduction

Large language models (LLMs) have shown remarkable performance across a wide range of tasks (Touvron et al., 2023; Muennighoff et al., 2022). Supervised fine-tuning (SFT), which aligns these models with human instructions, is a crucial step in unlocking their full potential (Chung et al., 2022; Ouyang et al., 2022). Although several studies have shown that models can follow human instructions well with only a small amount of fine-tuning data (Zhou et al., 2023; Cao et al., 2023), increasing the amount of data remains the common solution when the variety of tasks grows or when higher performance on a specific task is required.

However, a large-scale increase in fine-tuning data introduces a new problem. In particular, a sharp drop in performance is observed on Closed-Book Question Answering (CBQA) datasets (see Figure 1). The authors hypothesize that this drop is related to the collapse of the world knowledge stored in the model and verify it in two steps. First, they confirm that CBQA datasets rely on the world knowledge stored in the model for inference. Second, they show that large-scale fine-tuning substantially changes the model's parameters and thereby destroys that world knowledge. These observations reveal an inherent conflict between large-scale supervised fine-tuning and the retention of world knowledge in LLMs.

Key Findings

  1. SFT on large-scale data makes the LLM forget the parametric knowledge stored in its parameters (the forgetting problem).
  2. An ideal solution would dedicate a specific region of the model to storing world knowledge and freeze those parameters.
  3. Plugin-based fine-tuning resolves the conflict by freezing the backbone parameters and binding parameter changes to an additional network (see the sketch below).
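A minimal PyTorch sketch of this plugin-style setup may help make the freezing step concrete. It assumes a hypothetical naming convention in which the added plugin modules carry "lora_expert" or "router" in their parameter names; it is an illustration, not the paper's code.

```python
import torch
import torch.nn as nn

def freeze_backbone(model: nn.Module) -> None:
    """Freeze every backbone parameter; keep only plugin parameters trainable.

    Assumes plugin modules were registered under names containing
    "lora_expert" or "router" (a hypothetical naming convention).
    """
    for name, param in model.named_parameters():
        param.requires_grad = ("lora_expert" in name) or ("router" in name)

# Only the still-trainable parameters are handed to the optimizer:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=2e-4)
```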


2. Conflict Between Expanding Fine-Tuning Data and Retaining World Knowledge

2.1 Implementation

Datasets

To understand the impact of large-scale SFT on the world knowledge of an LLM, the authors construct and augment a large-scale dataset covering a variety of tasks.

The training set covers seven tasks (CBQA, coreference resolution, Natural Language Inference (NLI), abstractive summarization, multilingual translation, reading comprehension, and text classification) and is augmented to 5 million samples.

LLaMA-2-7B (Touvron et al., 2023) is used as the base model.

Evaluation

For evaluating world knowledge, CBQA is used as the key benchmark, with Filtered NQ and Filtered TriviaQA as the evaluation sets.

2.2 Expanding Fine-Tuning Data Causes Knowledge Forgetting

When the fine-tuning data is scaled up, performance diverges across two types of tasks (see Figure 2). On tasks such as summarization, NLI, and machine translation, performance improves sharply and then stabilizes, whereas performance on the world knowledge benchmarks drops considerably.

Performance on the world knowledge benchmarks depends on the knowledge and skills learned during pre-training. To investigate this, the model is fine-tuned on a CBQA-only dataset of 250k samples and evaluated on the test set. The results show that performance improves in the early stage of fine-tuning, after which additional samples no longer help.

To verify the hypothesis that SFT on large-scale instruction data disrupts the knowledge stored in the LLM and causes knowledge forgetting, a two-stage fine-tuning experiment is conducted.

In the first stage, the model is fine-tuned on the instruction data excluding the CBQA segment; in the second stage, it is further fine-tuned on the CBQA dataset.

The results show a sharp degradation of the model's world knowledge capability, indicating that the knowledge was damaged during the first stage of fine-tuning.


3. Methodology

3.1 Preliminaries

3.1.1 Mixture of Experts (MoE)

MoE scales up a model's parameter count substantially without a corresponding increase in computational cost. In Transformer-based LLMs, MoE replaces the standard feed-forward network layer of each Transformer block with several independent expert networks ($E_i$) and a gating function ($G(\cdot)$). Formally, for the output $h$ of the attention layer in a block, the output $y$ of the MoE layer is:

\[y = \sum_{i=1}^N G(h)_i E_i(h)\]

where $E_i(h)$ and $G(h)_i$ denote the output of the $i$-th expert and its corresponding weight, respectively. The gating function $G(\cdot)$ can be written as:

\[G(h) = \text{Softmax}(h W_g)\]

where $W_g$ is the trainable weight matrix of the gating function.

MoE-based language models help the experts acquire varied competencies and focus on distinct capabilities. Building on this paradigm, the authors partition the experts and strategically allocate capabilities to address the knowledge forgetting problem.
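As a concrete illustration of the MoE equations above, here is a minimal PyTorch sketch of a dense (soft-routing) MoE layer; the two-layer expert architecture, the sizes, and the soft weighting over all experts are simplifying assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DenseMoELayer(nn.Module):
    """Dense MoE layer: y = sum_i G(h)_i * E_i(h), with G(h) = Softmax(h W_g)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.w_g = nn.Linear(d_model, num_experts, bias=False)  # router weights W_g

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        gate = torch.softmax(self.w_g(h), dim=-1)                       # (..., N)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-1)  # (..., d, N)
        return (expert_out * gate.unsqueeze(-2)).sum(dim=-1)            # weighted sum over experts
```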

3.1.2 LoRA

LoRA (Low-Rank Adaptation) adapts a pre-trained model's weights to a specific task by updating the weight matrices through a low-rank decomposition. Formally:

For a pre-trained matrix \(W\) (e.g., the attention key projection) with \(W_0 \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}\), LoRA updates \(W\) using low-rank matrices \(A\) and \(B\):

\[W = W_0 + AB^T\]

where

  • \(W_0\) is the original weight matrix of the base model,
  • \(A\) has dimensions \(d_{\text{in}} \times r\),
  • \(B\) has dimensions \(d_{\text{out}} \times r\), and
  • \(r\) is the rank of the low-rank decomposition, satisfying \(r \ll d_{\text{in}}\) and \(r \ll d_{\text{out}}\).

Equivalently, LoRA can be described as applying a low-rank update \(\Delta W\) to the model weights \(W\) to improve task-specific adaptability:

\[\Delta W = BA\]

where

  • \(A\) is a \(d_{\text{in}} \times r\) matrix that maps from the input dimension to the low-rank dimension,
  • \(B\) is an \(r \times d_{\text{out}}\) matrix that maps from the low-rank dimension to the output dimension, and
  • \(r\) is the LoRA rank, which determines the capacity of the low-rank decomposition.

In the LoRA forward pass, the output \(y\) is computed from the input \(x\), the frozen weights \(W\), and the update \(\Delta W\):

\[y = xW + \alpha x \Delta W\]

where \(\alpha\) is a scaling factor that adjusts the magnitude of the change LoRA applies on top of the original weights \(W\), controlling how strongly the model adapts to the new task.

This approach leaves the overall model architecture unchanged while keeping the number of trainable parameters small and improving adaptability to specific tasks.
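A minimal PyTorch sketch of a LoRA-augmented linear layer following the equations above (frozen base weights, trainable low-rank factors, scaling factor \(\alpha\)); the initialization constants and default hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W + alpha * (x A) B, i.e. a frozen linear map plus a rank-r update."""

    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_in, d_out) * 0.02,
                                   requires_grad=False)            # frozen W
        self.lora_a = nn.Parameter(torch.randn(d_in, r) * 0.01)    # A: d_in x r
        self.lora_b = nn.Parameter(torch.zeros(r, d_out))          # B: r x d_out (zero init)
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight                     # frozen path x W
        update = (x @ self.lora_a) @ self.lora_b   # low-rank path x (Delta W)
        return base + self.alpha * update
```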

3.2 LoRAMoE

3.2.1 Architecture

The goal is to resolve the conflict between expanding instruction data and preserving the world knowledge inside the LLM during fine-tuning. To this end, the backbone parameters are frozen, and several parallel experts connected by a router are added to the feed-forward network layer of each Transformer block. The expert networks are implemented as LoRA modules to improve training efficiency.

Formally, in the standard Transformer architecture, the forward pass of each decoder block can be simplified as:

\[f(x) = x + f_{\text{FFN}}(x)\]

where \(f_{\text{FFN}}(\cdot)\) denotes the feed-forward network block and \(x\) is its input. The matrix operation of a linear layer can be written as:

\[o = x W_0 + x \Delta W\]

where \(W_0 \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}\) is the parameter matrix and \(\Delta W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}\) is the parameter update during training. \(o\) is the output.

The forward pass of the MoE layer can then be expressed as:

\[o = \sum_{i=1}^N G(x)_i E_i(x)\]

where \(E_i(\cdot)\) and \(G(\cdot) = \text{Softmax}(x W_g)\) denote the \(i\)-th expert and the router, respectively, and \(W_g\) is the router's trainable parameter matrix.

LoRA drastically reduces the number of parameters in the expert networks, which makes training more efficient and cheaper to optimize. The forward pass of the modified linear layer (the LoRAMoE layer) can then be expressed as:

\[o = x W_0 + \alpha \sum_{i=1}^N G(x)_i E_i(x)\]

3.2.2 The Mixed-Distribution Dilemma in Expert Balancing

When an MoE is fine-tuned without constraints, the routing mechanism tends to converge to a state in which a small number of experts receive disproportionately large weights. This imbalance arises because experts that receive more optimization in the early stage of training are preferred ever more strongly by the router.

The conventional remedy is to use the coefficient of variation of expert importance as a loss function so that every expert is weighted equally. However, this assumes a single data distribution and ignores the diversity of data sources; by effectively assuming one homogeneous distribution, it can produce biased results on heterogeneous fine-tuning data.

3.2.3 Localized Balancing Constraint

To address this problem, the experts are split into two groups: one focuses on learning the large set of downstream tasks, and the other focuses on aligning world knowledge with instructions. Experts within each group are balanced, but the balance is not applied uniformly across groups.

Formally, the importance matrix \(Q\) of a LoRAMoE layer is defined such that \(Q_{n,m}\) is the sum of the router values assigned to the \(n\)-th expert over the \(m\)-th training sample:

\[Q_{n,m} = \sum_{j=1}^{T_m} G(x_j)_n\]

where \(N\) is the number of experts and \(T_m\) is the number of tokens in the \(m\)-th training sample. \(x_j\) is the hidden input of the \(j\)-th token to the LoRAMoE layer. A coefficient matrix \(I\) of the same size as \(Q\) is then defined, where \(I_{n,m}\) is the importance coefficient of \(Q_{n,m}\):

\[I_{n,m} = \begin{cases} 1 + \delta, & \text{if } \text{Type}_e(n) = \text{Type}_s(m) \\ 1, & \text{otherwise} \end{cases}\]

where \(\delta \in [0, 1]\) controls the degree of imbalance between the expert groups, and \(\text{Type}_e(n)\) and \(\text{Type}_s(m)\) denote the target type of the \(n\)-th expert and the task type of the \(m\)-th training sample, respectively.


4. Experiments

4.1 Experimental Setup

In the training implementation, the linear layers in the LLM's feed-forward networks are replaced with LoRAMoE layers. Each LoRAMoE layer is initialized with six experts: three dedicated to solving downstream tasks and three to aligning world knowledge.

4.2 Results

Evaluated with 3 million training samples, LoRAMoE performs well on both the world knowledge benchmarks and the other tasks. The model with the LoRAMoE plugin outperforms the model trained with plain SFT alone.

4.3 Visualizing Expert Utilization

Visualizing the weights the router assigns to each expert group on downstream tasks and on knowledge benchmarks shows that the two groups are utilized differently, suggesting that the router automatically routes each task to the expert group with the corresponding ability.


5. Related Work

Parameter-Efficient Fine-Tuning (PEFT)

PEFT has become an important research direction as the parameter counts of language models have grown. Several methods achieve efficient fine-tuning of language models; this work likewise adopts low-rank adaptation in LoRAMoE to improve efficiency.

Mixture of Experts (MoE)

MoE replaces the feed-forward network layer with a small number of activated experts, greatly scaling up the model without a large increase in computational cost. This work combines MoE with LoRA to propose LoRAMoE, which can effectively handle diverse tasks through parameter-efficient fine-tuning.


1 INTRODUCTION

Large Language Models (LLMs) (Touvron et al., 2023; Muennighoff et al., 2022) have demonstrated remarkable capabilities in a variety of tasks. Supervised fine-tuning (SFT) of the models to align them with human instructions is a crucial step in unleashing their full potential (Chung et al., 2022; Ouyang et al., 2022). Although some works (Zhou et al., 2023; Cao et al., 2023) have indicated that models can follow human instruction well with a little fine-tuning data, increasing the amount of data is a straightforward solution when the variety of tasks expands, or when enhanced performance on a specific task is required (as shown in the left of Figure 1).

However, the large-scale increase in fine-tuning data brings new challenges. Specifically, we observe a notable decline in performance on the Closed-Book Question Answering (CBQA) dataset (e.g., TriviaQA (Han et al., 2019), Natural Questions (Kwiatkowski et al., 2019)) when there is a substantial increase in the amount of fine-tuning data, as shown in the blue group of lines on the right of Figure 1. We hypothesize that this significant decline in performance may be related to the collapse of world knowledge (Touvron et al., 2023) (i.e., parametric knowledge (Neeman et al., 2022)) previously learned and stored in the pre-trained models (Petroni et al., 2019; Yu et al., 2023). This hypothesis is demonstrated in two steps. First, we verify that the CBQA dataset relies on the world knowledge stored in the models for making inferences. Second, we demonstrate that the substantial decrease in performance on the CBQA dataset is attributed to the fact that large-scale fine-tuning can markedly change the model’s parameters (as shown in the right of Figure 1), leading to the destruction of world knowledge (i.e., knowledge forgetting). Overall, in vanilla supervised fine-tuning, there is a contradiction between simultaneously improving performance on downstream tasks and maintaining world knowledge of LLMs.

Figure 1: (Left) When the number of fine-tuning data increases from 100,000 to 3 million, the performance of many tasks is significantly improved. (Right) With the amount of instruction data increasing, fine-tuning (training for an epoch using the same set of hyperparameters) can continually change model parameters (shown as the red line), resulting in a decline in performance on the benchmarks that measure world knowledge. The details of training implementation can be seen in Section 2.1.

Findings: Vanilla SFT with massive data can lead to forgetting parametric knowledge stored in LLMs.

An ideal solution is to delineate a specific region within the model dedicated to storing world knowledge, akin to how the hippocampus in the human brain is specialized for memory (Treves & Rolls, 1994; Voss et al., 2017). Such a structure allows the parameters in this part to be frozen and thus protected during the fine-tuning process, thereby keeping world knowledge from being disrupted. However, due to the black-box nature of large language models (Sun et al., 2022; Kaddour et al., 2023), identifying the regions of world knowledge in LLMs is highly challenging (Feng et al., 2023). Instead of identifying these regions, another solution is to keep all the parameters of the model intact through plugin-based fine-tuning. Specifically, it fine-tunes the model by freezing all the parameters of the backbone model and binding parameter changes to an additional network. With a complete backup of the backbone model’s parameters available, the world knowledge stored in it is technically recoverable.

However, fine-tuning with a single plugin is similar in form to direct fine-tuning (Ding et al., 2022; He et al., 2021b). Consequently, the issue of knowledge forgetting persists. Mixture of Experts (MoE) (Jacobs et al., 1991) is an architecture that introduces multiple experts, where data with different characteristics are routed to the corresponding experts for customized processing (Du et al., 2022; Shazeer et al., 2016). Drawing on this idea, we hope to introduce multiple plugins as experts, allowing part of them the opportunity to access the backup, while another part can perform downstream tasks.

In this paper, we propose LoRAMoE to alleviate world knowledge forgetting and simultaneously enhance the LLMs’ capabilities of solving downstream tasks. LoRAMoE is a plugin version of MoE. It changes the model’s architecture by adding multiple parallel plugins as experts (i.e., LoRA (Hu et al., 2021)) in each feed-forward layer and connecting them with routers. We then propose the use of a localized balancing constraint to split the experts in each LoRAMoE layer into distinct groups. Specifically, one group is dedicated to downstream tasks, while the other focuses on aligning world knowledge within the backbone model with human instructions to alleviate knowledge forgetting. Additionally, the localized balancing constraint also balances the importance of all experts within the same expert group, which prevents only a few experts in the same group from being valued by the routers. It enables several experts to collaborate, improving the capability of solving downstream tasks.

Experiment results show that LoRAMoE can effectively keep the world knowledge in language models from being disrupted by large-scale fine-tuning. Further, we confirm the effectiveness of LoRAMoE on capability localization at an interpretable level by visualizing the expert weights for tasks. Our observations reveal that when completing world knowledge benchmarks, the router pays more attention to the output of experts specifically handling these tasks. Conversely, for other downstream tasks, the router focuses on experts from another group. LoRAMoE effectively resolves the conflict by fostering collaboration among experts. Besides, the experimental results show that learning on various downstream tasks also benefits from our method, implying the prospect of our method in multi-task learning.

Our contributions can be summarized as follows:

  1. We find that significantly increasing the amount of supervised fine-tuning data can severely impair the world knowledge inside the LLMs, due to its great modification to their parameters. This indicates that maintaining the world knowledge inside the LLMs is in conflict with the large-scale addition of downstream fine-tuning data.
  2. We introduce a new trainable plugin for LLMs, LoRAMoE, which is similar in architecture to the Mixture of Experts. LoRAMoE can automatically route different types of data to the respective experts during the SFT phase without interfering with the original parameters of the LLM. LoRAMoE employs a localized balancing constraint in training, enhancing expert group specialization and internal balance. It partitions experts into two groups to learn tasks and to align world knowledge with human instructions, reducing knowledge forgetting.
  3. We demonstrate the effectiveness of LoRAMoE through extensive experiments. We can maintain stable knowledge in the model when scaling up fine-tuning data, while performance on other tasks also sees considerable improvement. Our method is further evidenced by the visualization of expert utilization.

2 CONFLICT BETWEEN EXPANDING FINE-TUNING DATA AND RETENTION OF WORLD KNOWLEDGE IN LLMS

In this section, we conduct SFT tasks on the LLM with vast and varied datasets. We find that the world knowledge inside the LLM is severely compromised during the expansion of SFT.

2.1 IMPLEMENTATION

  • Datasets To understand the impact of large-scale SFT on the world knowledge of the LLM, we construct a large-scale dataset that includes a variety of tasks. Namely, there are seven tasks, Closed-Book Question Answering (CBQA), coreference resolution, Natural Language Inference (NLI), abstract summarization, multi-lingual translation, reading comprehension, and text classification. We augmented the training datasets to 5 million through data augmentation methods. More details about our composition of fine-tuning data can be seen in the Appendix A.1.

  • Base Model We utilize LLaMA-2-7B (Touvron et al., 2023) as the base model, considering it stands out as one of the most notable and widely used open-source LLMs in current academia.

  • Evaluation For evaluation of the world knowledge, we use CBQA as a key benchmark of the model’s world knowledge, with reference to the previous work (Touvron et al., 2023; Petroni et al., 2019). Notably, considering previous work that has noted train-test overlap in CBQA datasets (Lewis et al., 2020), we elaborately select parts of the CBQA dataset without train-test overlap for our testing set, namely Filtered NQ and Filtered TriviaQA, to analyze the world knowledge of models.

Figure 2: Performance on the various tasks after expanding the amount of fine-tuning data. For most of the downstream tasks (e.g., NLI and summarization), with the expansion of training data, performance on these tasks remains stable after improvement. Whereas, for the world knowledge benchmark, a significant decline can be witnessed after a large amount of instruction data.

For evaluating the performance on other downstream tasks, we utilize the OpenCompass framework (https://opencompass.org.cn/) to run the evaluation process on the aforementioned tasks.

2.2 THE EXPANSION OF FINE-TUNING DATA LEADS TO THE KNOWLEDGE FORGETTING INSIDE THE LLMS

During the expansion of fine-tuning data, we observed a diverging trend in the performance across two types of tasks, as can be seen in Figure 2:

On some tasks, such as summarization, NLI, machine translation, etc., the performance of the fine-tuned model initially increased significantly and stabilized at a promising level. However, there was a catastrophic decline in the model’s performance on the benchmarks measuring knowledge capability (e.g., TriviaQA, NQ, HotpotQA), even falling much lower than the baseline. Notably, as the training data expands, a continuous decline can be witnessed. Besides, the collapse happens earlier on the filtered test set than on the original one.

Further, we dissect the reasons behind the decline of the performance in world knowledge benchmarks with the expansion of fine-tuning data.

Figure 3: Performance on world knowledge benchmarks after training on CBQA solely. Its performance rises greatly after training with very few samples and remains relatively stable thereafter.

The performance on world knowledge benchmarks highly relies on the knowledge and skills learned during the pre-training phase.


To investigate the relationship between world knowledge benchmarks and the knowledge embedded in pre-trained models, we conduct fine-tuning solely on the CBQA dataset with 250k samples and run evaluation on the test sets without train-test overlap.

The results in Figure 3 show that the performance on these benchmarks can be significantly enhanced through naive training; however, the first one percent of the training process (approximately 1,000 samples) contributes most of the boost, and further increases in training samples do not actually improve the performance to any great extent.

Actually, this phenomenon is reasonable. In the early stages of fine-tuning, the model quickly learns to follow human instructions through training data and align the world knowledge already stored in the model with the instructions (Zhou et al., 2023), thereby increasing the performance of CBQA. However, due to the limited overlap between the training and testing data, the knowledge in the test set is difficult to incorporate during training, so more samples do not improve performance.

Therefore, the model’s capability for completing the world knowledge benchmark is highly dependent on the knowledge and skills acquired during the pre-training phase. Given this, it is naturally assumed that the markedly diminished performance of the model on CBQA stems from the disruption of knowledge stored in the LLM due to large-scale instruction tuning.

SFT process with large-scale instruction data disrupts the stored knowledge in LLMs, thus leading to knowledge forgetting.

To verify this hypothesis, we employed two distinct datasets for sequential fine-tuning of the model. Initially, the model was fine-tuned using instruction data excluding the CBQA segment. Subsequently, we further fine-tuned the model with the CBQA dataset that had been segregated before.

The experimental results are presented in Table 1, where we can observe a notable degradation in the knowledge capabilities of the fine-tuned model, whose performance is inferior to that of the original LLM. This indicates that the world knowledge within the model was compromised during the first stage of fine-tuning, resulting in the model’s inability to forge the alignment between human instructions and the already disrupted knowledge in the subsequent stage of fine-tuning.

Further, we find that there is a massive change in the LLM’s parameter during the expansion of fine-tuning data, as can be seen in the right part of Figure 1. As the previous research has documented that models store knowledge within their parameters during the pre-training process (Petroni et al., 2019; Roberts et al., 2020; AlKhamissi et al., 2022), this further indicates the destruction of the knowledge stored in the parameters during large-scale fine-tuning, which results in the knowledge forgetting.

Overall, enhancing the instruction following capability and facilitating the performance on various downstream tasks of LLM through large-scale vanilla SFT inherently conflicts at the parameter level with retaining the world knowledge stored in the model.

| Tasks | Baseline | SFT solely on CBQA | Two-stage Fine-tuning |
|---|---|---|---|
| TriviaQA | 33.5 | 36.22 | 13.7 |
| NQ | 7.8 | 12.8 | 3.6 |
| HotpotQA | 11.2 | 16.1 | 7.1 |

Performance comparison by task (TriviaQA, NQ, HotpotQA) and approach (Baseline, SFT solely on CBQA, Two-stage Fine-tuning).

Table 1: From left to right are the performance of LLaMA-2-7B, the model fine-tuned solely on the CBQA dataset, and the model first fine-tuned on 3 million instruction samples excluding CBQA and then continue-trained on the CBQA dataset. Even after continuing to fine-tune on the CBQA dataset, the model subjected to large-scale SFT still fails to improve its knowledge-answering ability and falls far below the baseline.

3 METHODOLOGY

Enhancing the language model’s capabilities in various tasks while retaining its world knowledge is crucial for applying them across various downstream tasks. However, as previously mentioned, these two objectives conflict at the parameter level with traditional methods. To achieve this goal, we propose LoRAMoE, an LLM adapter employing the Mixture of Experts (MoE) approach, designed to partition different capabilities within distinct sections of LoRA (Low-Rank Adaptation). In this section, we initially provide a concise overview of the traditional MoE and LoRA methodologies. Subsequently, we delve into how LoRAMoE ingeniously amalgamates the methodologies of MoE and LoRA. This integration effectively leverages the strengths of both methods and capably addresses the conflict issue outlined in Section 2.

3.1 PRELIMINARIES

3.1.1 MIXTURE OF EXPERTS

The Mixture of Experts significantly scales up model parameters without correspondingly increasing computational efforts. For transformers-based large language models, MoE supplants the conventional feed-forward neural network layer in each transformer block with an MoE layer (Shazeer et al., 2016; Fedus et al., 2021; Lepikhin et al., 2020). This MoE layer is composed of $N$ parametrically identical and independent feed-forward neural networks $\{E_i\}_{i=1}^N$ as the experts, coupled with a gating function $G(\cdot)$ as the router. The router is used to model the probability distribution that governs the weights of outputs from these expert networks. Formally, for the output $h$ of the attention layer in any given block, the output $y$ of the MoE layer can be mathematically represented as follows:

\[y = \sum_{i=1}^N G(h)_i E_i(h)\]

where $E_i(h)$ and $G(h)_i$ denote the output and the corresponding weight of $i$-th expert in the MoE layer, respectively. The router $G(\cdot)$ can be written as follows:

\[G(h) = \text{Softmax}(h W_g)\]

where $W_g$ is the trainable weight matrix for router $G(\cdot)$.

The language model based on MoE architecture facilitates the acquisition of varied competencies by its experts and their focus on distinct capabilities, achieved through the mechanism of routing (Du et al., 2022; Riquelme et al., 2021; Bao et al., 2022). Our work draws upon this MoE paradigm and we fine-tuned the experts to strategically distribute competencies, aiming to address the knowledge forgetting issue. To clarify, we freeze the parameters of the backbones and substitute the feedforward neural network layer of experts with LoRA to enhance the efficiency of the training process.

3.1.2 LOW-RANK ADAPTATION

LoRA (Low-Rank Adaptation) has been demonstrated to be an effective and efficient way to adapt pre-trained models to specific tasks (Hu et al., 2021). Formally, for a pre-trained matrix $W$ (e.g., attention-k) with $W_0 \in \mathbb{R}^{d_{in} \times d_{out}}$, LoRA updates the $W$ with a low-rank decomposition:

\[W = W_0 + U V^T\]

where $U \in \mathbb{R}^{d_{in} \times r}$, $V \in \mathbb{R}^{d_{out} \times r}$, and $r$ represents the LoRA rank. The forward process with LoRA is as follows, during which $W$ is frozen:

\[y = W x + \alpha (U V^T) x\]

where $\alpha$ represents a scaling factor that adjusts the magnitude of the changes on the original $W$ made by LoRA modules. It is worth noting that, the forward process of LoRA is different from LoRAMoE. In the latter, LoRA is used as the expert network within the MoE architecture, primarily to facilitate accelerated training and optimization.

3.2 LORAMOE

3.2.1 ARCHITECTURE

Our goal is to tackle the challenge of conflict in the expansion of instruction data with the maintenance of world knowledge inside LLM during the fine-tuning phase. The left of Figure 4 illustrates the forward process of the standard MoE architecture. In the MoE-based architecture, the router function focuses on different experts according to the instruction data categories, allowing them to divide their labor to complete the forward process (Jacobs et al., 1991).

Taking inspiration from that discovery, as shown on the left side of Figure 4, we attempt to extend the pre-trained language model architecture into an MoE-like architecture to tackle the aforementioned conflict, at the same time retaining its original strong capabilities as a language model. Specifically, we freeze the parameters of the backbone model, allowing experts within LoRAMoE the opportunity to leverage the existing world knowledge in the base model. For the feed-forward neural network layer in each transformers block, we define several parallel trainable experts connected by the router. Simultaneously, we replace the fully-connected layer of the expert with a low-rank form to improve training and inference efficiency.

Figure 4: The architecture of LoRAMoE, compared with classic MoE. LoRAMoE utilizes multiple LoRAs as adaptable experts and a router to gate them in the FFN layer of every transformer block. During the training process, only the experts and the router are optimized.

Formally, for the traditional transformers architecture, the forward propagation process of each decoder block can be simplified as follows:

\[f(x) = x + f_{\text{FFN}}(x)\]

where \(f_{\text{FFN}}(\cdot)\) stands for the feed-forward neural network block and \(x\) denotes the input of the FFN block. The matrix operation of the linear layer in the FFN block can be expressed as:

\[o = x W_0 + x \Delta W\]

where \(W_0 \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}\) represents the parameter matrix and \(\Delta W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}\) denotes the parameter update in the training phase. \(o\) is the output of the linear layer with dimension \(d_{\text{out}}\).

To address the conflict issue delineated in Section 2 during the fine-tuning phase, we adopted a strategy distinct from optimizing a single matrix with the entire instruction dataset. Specifically, we substituted the linear layer with the MoE architecture. This modification permits the experts within the MoE layer to collaborate and learn the updated matrix \(\Delta W\). Consider the MoE layer comprising \(N\) experts and the set of experts denoted as \(\{E_i\}_{i=1}^N\), the forward process of the MoE layer can be mathematically expressed as follows:

\[o = \sum_{i=1}^N G(x)_i E_i(x)\]

where \(E_i(\cdot)\) and \(G(\cdot) = \text{Softmax}(x W_g)\) represent the \(i\)-th expert and the router in the MoE layer, respectively. The \(W_g\) is the trainable parameter matrix of the routing function. Through this approach, the experts and routing function work in tandem during the training phase, enabling the experts to develop varied capabilities and efficiently handle diverse types of tasks.

Concurrently, Low-Rank Adaptation has been proven to be both effective and efficient in the fine-tuning of pre-trained language models (Wang et al., 2023; Liu et al., 2022; Pan et al., 2022).

Figure 5: The coefficient of variation for the experts of the unconstrained LoRAMoE progressively escalates and sustains at a high level (i.e., approximately three, similar to the phenomenon observed at Shazeer et al. (2016)), signifying the prolonged predominance of a limited number of experts.

Figure 6: The experts are segregated into two categories: one of which concentrates on learning massive tasks, while another focuses on aligning the world knowledge inside the LLM and human instruction. The routing mechanism assigns similar importance to the experts within a group, and it selectively activates the group more specialized for the given data type.

Specifically, the matrix \(\Delta W_E \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}\) of a single expert \(E(\cdot)\) in the LoRAMoE layer can be written as follows:

\[\Delta W_E = BA\]

where \(A \in \mathbb{R}^{d_{\text{in}} \times r}\), \(B \in \mathbb{R}^{r \times d_{\text{out}}}\), and the rank \(r \ll \min(d_{\text{in}}, d_{\text{out}})\).

LoRA contributes to a significant reduction in the parameters of the expert networks, thereby enhancing efficiency and saving costs during the fine-tuning process. Overall, the forward process of the modified linear layer (i.e., the LoRAMoE layer) in the feed-forward neural network layer can be represented as:

\[o = x W_0 + \alpha \sum_{i=1}^N \omega_i E_i(x), \qquad \omega_i = G(x)_i\]

where \(\omega_i\) denotes the attention weight of the \(i\)-th expert in the LoRAMoE layer and \(\alpha\) is a constant hyper-parameter, approximately equivalent to the learning rate (Hu et al., 2021).
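The forward pass above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' implementation: the base weight is frozen, each expert is a rank-r LoRA pair, the softmax router produces the weights \(\omega_i\), and the layer also returns the router probabilities so a balancing loss can later be computed on them; all names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class LoRAMoELayer(nn.Module):
    """o = x W0 + alpha * sum_i G(x)_i * E_i(x), with E_i(x) = x A_i B_i."""

    def __init__(self, d_in: int, d_out: int, num_experts: int = 6,
                 r: int = 4, alpha: float = 32.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False                      # frozen backbone W0
        self.lora_a = nn.Parameter(torch.randn(num_experts, d_in, r) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_experts, r, d_out))
        self.router = nn.Linear(d_in, num_experts, bias=False)      # router weights W_g
        self.alpha = alpha

    def forward(self, x: torch.Tensor):
        gate = torch.softmax(self.router(x), dim=-1)                # omega_i = G(x)_i, (..., N)
        # per-expert low-rank outputs E_i(x): (..., N, d_out)
        expert_out = torch.einsum('...d,ndr,nro->...no', x, self.lora_a, self.lora_b)
        moe_out = (gate.unsqueeze(-1) * expert_out).sum(dim=-2)     # weighted sum over experts
        return self.base(x) + self.alpha * moe_out, gate            # gate is reused for L_lbc
```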

3.2.2 MIXED DISTRIBUTION DILEMMAS FOR EXPERT BALANCING

When fine-tuning MoE without any constraints, the router mechanism often converges to a state in which a small number of experts receive a disproportionately large share of preferences by the router, as depicted in Figure 5. This imbalance among experts presents a challenge to correct, as experts that receive greater routing weights in the early stages of training undergo more rapid optimization, thereby garnering increased preferences from the router. A similar phenomenon has been documented in the work presented in Shazeer et al. (2016) and Fedus et al. (2021).

A conventional solution for balancing expert utilization involves employing the coefficient of variation of the experts’ importance as the loss function, aimed at equalizing the significance of each expert (Shazeer et al., 2016). This solution assumes that the distribution of training samples for optimising MoE is a single distribution, which inherently eliminates the necessity of considering the diverse origins of data distribution. Specifically, this traditional approach simplifies the modeling process by assuming homogeneity in data sources, an assumption that often does not hold for fine-tuning data containing both factual knowledge QA and other downstream tasks. Therefore, such simplification can lead to significant biases, particularly when encountering datasets with varied distributional characteristics.

Traditional balancing constraints, which aim to allocate a uniform distribution of training samples across all experts, can lead to inaccurate parameter estimation. This is because such constraints do not account for the intrinsic differences in data representation and importance across various categories. Recognizing the disparate nature of data distributions, LoRAMoE strategically assigns data to experts, not uniformly, but based on the observed imbalances. This allocation is governed by a set of weights that are calibrated to reflect the varying significance and representation of different data categories within the overall dataset.

Such a specialized allocation method is pivotal in addressing the challenges posed by uneven data distributions. By tailoring the distribution of training samples to each expert based on the inherent disparities in the data, LoRAMoE facilitates a more accurate and representative parameter estimation. This nuanced approach to data distribution allows for a more effective fitting of the model to diverse data subsets, significantly enhancing the model’s predictive accuracy and generalization capability. This strategy is particularly effective in scenarios where data imbalance could otherwise lead to skewed learning and generalization errors, ensuring that each data category is appropriately represented and modeled within the overall system. To illustrate the concept with a simplified model, let’s assume our training data is sampled from a mixture of two Gaussian distributions. The means (\(\mu_1, \mu_2\)) and variances (\(\sigma^2_1, \sigma^2_2\)) of these distributions are implicit. The proportion of training data from each distribution is denoted as \(p_1\) and \(p_2\) where \(p_1 + p_2 = 1\). Without loss of generality, we assume that \(p_1 \leq p_2\). When a MoE model fits the proposed distribution with balanced weights \(m\), the likelihood of the model given the data can be expressed as:

\[L(\text{model} \mid \text{data}) = \sum_{i=1}^N \log \left( p_1 \mathcal{N}(x_i \mid \mu_1, \sigma^2_1) + p_2 \mathcal{N}(x_i \mid \mu_2, \sigma^2_2) \right)\]

The optimal fit belongs to the same family of mixture distributions as the sampling distribution. Its optimal mean value satisfies the following condition, which is attained exactly when the fitted distribution matches the sampling distribution:

\[\mu_{\text{opt}} = \frac{p_1 \mu_1 + p_2 \mu_2}{p_1 + p_2}\]

In equation 10, we can replace part of the summation with the empirical estimate of the mean of the input \(x\). For an ideal routing network, there must exist a distribution \(N_i\) such that the data allocated to this distribution is independently and identically distributed with one of the peaks in the sampling distribution. Let’s assume this distribution to be \(N_2\). In this case, if \(m \geq p_1\), then the fitting result for distribution \(\mu'_1\) will be:

\[\mu'_1 = \frac{p_1 \mu_1 + (m - p_1) \mu_2}{m}\]

Based on the chain rule of differentiation, we end up with:

\[\frac{\partial L}{\partial \mu_1} = \frac{p_1}{\sigma^2_1}(\mu_1 - \mu'_1) + \frac{p_2}{\sigma^2_2}(\mu_2 - \mu'_2)\]

The inverse result can be derived similarly. Therefore, the best training error is achieved only when the mixing coefficient \(m\) of the prior distribution is consistent with the actual sampling distribution weight \(p_1\). In the next section, we will introduce a localized balancing constraint algorithm, which uses adaptive balancing coefficients to explore the optimal data distribution mixing coefficient.
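A tiny numeric illustration of the biased-mean expression above, under the same simplified two-Gaussian setup (the specific numbers are arbitrary): only when the imposed mixing coefficient \(m\) equals the true proportion \(p_1\) is \(\mu_1\) recovered exactly; otherwise the fitted mean is pulled toward \(\mu_2\).

```python
# Evaluate mu_1' = (p1*mu1 + (m - p1)*mu2) / m for several imposed mixing
# coefficients m, with arbitrary illustrative values for p1, mu1, mu2.
p1, p2 = 0.2, 0.8          # true sampling proportions (p1 + p2 = 1)
mu1, mu2 = 0.0, 10.0       # true component means

for m in (0.2, 0.35, 0.5):                     # imposed mixing coefficients, m >= p1
    mu1_fit = (p1 * mu1 + (m - p1) * mu2) / m
    print(f"m = {m:.2f} -> fitted mu_1' = {mu1_fit:.2f} (true mu_1 = {mu1})")
# Only m = 0.20 gives mu_1' = 0.00; larger m drags the fitted mean toward mu_2.
```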

3.2.3 LOCALIZED BALANCING CONSTRAINT

Considering the aforementioned dilemmas, we introduce a strategic approach as depicted in Figure 6. Unlike previous efforts that solely focused on distributing the learning data among all experts evenly, LoRAMoE methodically segregates the experts into two distinct groups: one concentrates on learning massive tasks, while the other focuses on aligning the world knowledge with instructions (i.e., invoking the world knowledge in the base model to respond to human instructions). While all experts within the same group are balanced, this balance is not uniformly applied across groups. Formally, we define the importance matrix \(Q\) of the LoRAMoE layer, where \(Q_{n,m}\) denotes the sum of router values of the \(n\)-th expert for the \(m\)-th training sample in a batch, which can be represented as follows:

\[Q_{n,m} = \sum_{j=1}^{T_m} G(x_j)_n\]

where \(N\) and \(T_m\) denote the number of experts and the number of tokens of the \(m\)-th training sample, respectively. \(x_j\) is the hidden input of the \(j\)-th token to the LoRAMoE layer. We then define the coefficient matrix \(I\) with the same size as \(Q\), corresponding to the importance matrix \(Q\). \(I_{n,m}\) denotes the importance coefficient of \(Q_{n,m}\), which can be written as follows:

\[I_{n,m} = \begin{cases} 1 + \delta, & \text{if } \text{Type}_e(n) = \text{Type}_s(m) \\ 1, & \text{otherwise} \end{cases}\]

where \(\delta \in [0, 1]\) controls the degree of imbalance between expert groups. \(\text{Type}_e(n)\) and \(\text{Type}_s(m)\) are the pre-defined target type of the \(n\)-th expert and the task type of the \(m\)-th training sample in a batch, respectively. We categorize the instruction data into two distinct groups: CBQA and other downstream task data. The CBQA data serves to enable one expert group to align human instructions with knowledge, whereas the remaining data is allocated to the task-focused experts to boost task performance.
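For illustration, the categorization just described could be expressed as a simple mapping from a sample's source dataset to the group type used by \(\text{Type}_s(\cdot)\); the dataset names and the 0/1 encoding below are assumptions made for the sketch.

```python
# Hypothetical mapping from a training sample's source dataset to the
# group type used by Type_s(m): 0 = knowledge-aligning (CBQA), 1 = task-focused.
CBQA_SETS = {"TriviaQA", "NaturalQuestions", "HotpotQA"}

def sample_type(dataset_name: str) -> int:
    return 0 if dataset_name in CBQA_SETS else 1
```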

Specifically, suppose that \(I_{i,k}\) and \(I_{j,k}\) denote the importance coefficients of the \(i\)-th and \(j\)-th expert for the \(k\)-th sample, respectively. Two experts intentionally assigned to the same panel hold equivalent values at corresponding positions in the coefficient matrix (i.e., \(I_{i,k} = I_{j,k}\)). This implies that their importance is assigned equal weight. On the contrary, two experts in distinct groups possess divergent values at corresponding positions in their coefficient matrix (i.e., \(I_{i,k} \neq I_{j,k}\)).

We define the localized balancing constraint loss \(L_{\text{lbc}}\) to quantify the dispersion of the weighted importance matrix \(Z = I \circ Q\), which can be mathematically represented as:

\[L_{\text{lbc}} = \frac{\sigma^2(Z)}{\mu(Z)}\]

where \(\sigma^2(Z)\) and \(\mu(Z)\) represent the variance and mean of \(Z\), respectively. Specifically, for a specific sample, given that experts on the same panel possess identical values in the coefficient matrix \(I\), the reduction of the optimized loss \(L_{\text{lbc}}\) results in a progressive equalization of their importance. On the other hand, experts engaged in distinct panels are assigned differing values in \(I\), thus, the convergent state of the loss essentially equates to attaining a soft-weighted balance among these different groups with the \(I\) scaling the \(Q\). In this state, the router intensifies its focus on the particular expert group according to the type of task, yet refrains from fully obscuring other groups, thereby preserving the capacity for generalization.

Ultimately, reducing \(L_{\text{lbc}}\) achieves a balance in the importance of experts within the same group, while also ensuring that different groups concentrate on different capabilities. The overall loss of LoRAMoE can be represented as follows:

\[L_{\text{total}} = L + \beta L_{\text{lbc}}\]

where \(L\) is the next-token prediction loss of large language models and \(L_{\text{lbc}}\) is the localized balancing constraint loss for all LoRAMoE layers. \(\beta\) controls the strength of localized balancing constraint.
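A minimal sketch of the localized balancing constraint as defined above, assuming router probabilities are collected per token and that each expert's group and each sample's type are encoded as integer ids (e.g., via the mapping sketched earlier); tensor shapes and names are assumptions.

```python
import torch

def localized_balancing_loss(router_probs: torch.Tensor,
                             sample_types: torch.Tensor,
                             expert_types: torch.Tensor,
                             delta: float = 0.1) -> torch.Tensor:
    """L_lbc = var(Z) / mean(Z), where Z = I * Q.

    router_probs: (batch, tokens, num_experts) softmax outputs of the router
    sample_types: (batch,) task-type id of each sample in the batch
    expert_types: (num_experts,) pre-defined target-type id of each expert
    """
    # Q[n, m]: summed router weight of expert n over the tokens of sample m
    q = router_probs.sum(dim=1).transpose(0, 1)                    # (num_experts, batch)
    # I[n, m] = 1 + delta if Type_e(n) == Type_s(m), else 1
    match = (expert_types.unsqueeze(1) == sample_types.unsqueeze(0)).float()
    z = (1.0 + delta * match) * q
    return z.var() / z.mean()
```

With \(\delta = \beta = 0.1\) (the values used in Section 4.1), the constraint softly prioritizes the matching expert group for each sample without fully silencing the other group.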

During the fine-tuning phase, we freeze the base model and the trainable parameters are those of the experts and routers within the LoRAMoE layers. LoRAMoE significantly conserves resources compared to fine-tuning the full range of parameters of models. In the inference process, the router autonomously determines and assigns output weights to all experts, thereby obviating the need for pre-specified data types.

Performance comparison across tasks by approach (Baseline, SFT solely on CBQA, SFT, LoRA, LoRAMoE, LoRAMoE with Llbc).

Task Baseline SFT solely on CBQA SFT LoRA LoRAMoE LoRAMoE (with Llbc)
WSC 65.4 - 57.8 70.2
Winogrande 61.7 - 28.6 69.6
Flores 0.1 - 36.2 25.9
Xsum 19.7 - 12.8 33.2
Race-middle 30.5 - 16.1 90.0
Race-high 30.4 - 76.0 86.5
RTE 52.7 - 71.2 87.4
ReCoRD 29.4 - 24.3 85.9
AX-g 52.0 - 34.7 87.1
MultiRC 44.0 - 89.1 87.9
TriviaQA 52.2 57.8 51.1 58.1
NQ 18.5 28.6 24.5 28.0
Filtered TriviaQA 33.5 36.2 21.6 35.4
Filtered NQ 7.8 12.8 7.3 12.0
Hotpot QA 11.2 16.1 13.4 16.1

Table 2: Results of LoRAMoE. Contrary to direct full fine-tuning and the use of LoRA-tuning that exhibits reduced performance on world knowledge benchmarks after training, our approach ensures simultaneous growth of both world knowledge benchmarks and other downstream tasks.

4 EXPERIMENTS

4.1 EXPERIMENT SETUP

In this section, we introduce the training implementation for LoRAMoE. We only replace the linear layer in the feed-forward neural network of the LLM with the LoRAMoE layer. We initialize each LoRAMoE layer with six experts, of which three are dedicated to addressing downstream tasks and the other three are responsible for aligning world knowledge in the base model with human instructions. The hyperparameters for the constraint strength β and the degree of imbalance δ are both set to 0.1. For the experts based on the low-rank adapter, α and r are set to 32 and 4, respectively. The dropout is 0.05, and the learning rate for both the experts and the router in the LoRAMoE layer is 2e-4. The training dataset is the same 3-million-sample set described in Section 2.1. We freeze the parameters of the base model, rendering only the experts and routers in LoRAMoE trainable. The global batch size is set to 64, and all our experiments were conducted on 32 A100 80G GPUs.
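For convenience, the hyperparameters listed above can be collected into a configuration object; the dictionary below is only an illustrative restatement of the values from this section (the key names are hypothetical).

```python
# Illustrative restatement of the training setup described in Section 4.1.
loramoe_config = {
    "num_experts": 6,                 # 3 task experts + 3 world-knowledge experts
    "beta": 0.1,                      # strength of the localized balancing constraint
    "delta": 0.1,                     # degree of inter-group imbalance
    "lora_alpha": 32,                 # LoRA scaling factor
    "lora_rank": 4,                   # LoRA rank r
    "dropout": 0.05,
    "learning_rate": 2e-4,            # experts and routers only; backbone frozen
    "global_batch_size": 64,
    "train_samples": 3_000_000,
}
```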

4.2 RESULTS

Table 2 displays the performance of LoRAMoE with 3 million training samples and compares this result with the outcomes of directly applying SFT to the model or utilizing LoRA tuning. We report the results on the same test set as discussed in Section 2.1. The results show that the language model with LoRAMoE gets good performance on both world knowledge benchmarks and others, indicating the effectiveness of avoiding knowledge forgetting while improving various tasks.

For world knowledge benchmarks, first of all, the catastrophic collapse observed in Section 2 did not occur. On the contrary, the model with the LoRAMoE plugin even outperforms the one fine-tuned solely with the CBQA dataset, such as on TriviaQA and HotpotQA. Besides, compared to vanilla SFT using the same amount of data, LoRAMoE achieves up to a 64% improvement in world knowledge benchmarks, with an average increase of 35%. By comparing the performance of LoRAMoE with and without the implementation of the local balance expert loss (the last two columns in Table 2), we observed that Llbc can facilitate the performance on the world knowledge benchmarks.

LoRA tuning is intuitively a method involving smaller degrees of parameter modification than vanilla SFT (He et al., 2021a). However, it is observed that the forgetting of world knowledge still occurs. Its performance in CBQA is generally lower than the baseline, and there is an average decrease of 22% compared to direct single-task tuning. Multiple parallel experts are helpful in achieving excellent results in balancing world knowledge retention and the scaling up of fine-tuning data.

For other downstream tasks, LoRAMoE is capable of achieving performance close to or even surpassing that of direct SFT. For instance, in all reading comprehension tasks (i.e., Race, ReCoRD, MultiRC), LoRAMoE achieved superior performance. Besides, Llbc improves outcomes for LoRAMoE in the vast majority of tasks, both world knowledge benchmarks and others. Notably, for reading comprehension, NLI, and the original CBQA dataset, the benefits of this method were quite substantial, up to 11.8%. This indicates that capability partitioning in the expert groups benefits performance. Further, LoRAMoE holds significant promise for multi-task learning.

4.3 VISUALIZING THE EXPERTS UTILIZATION

To confirm the effectiveness of LoRAMoE in specializing the grouped experts, we visualize their weight assigned by the router when encountered with data from downstream tasks and knowledge benchmarks respectively, as illustrated in Figure 7.

There is a distinct contrast in the utilization of the two expert groups when dealing with world knowledge benchmarks and other downstream tasks. This suggests that the routers can automatically allocate specific tasks to experts with corresponding abilities during the inference phase. Specifically, the expert group requested to prioritize aligning parametric knowledge with human instructions is greatly employed in world knowledge benchmarks (e.g., TriviaQA, Natural Questions, and HotpotQA), underscoring their vital role in preventing world knowledge forgetting. This corresponds to the fact we state in Section 2 that supervised fine-tuning boosts the model’s capabilities in these tasks by associating pre-stored world knowledge in the model with human instructions. On the other hand, experts assigned to focus on enhancing performance in downstream tasks are given increased prominence when encountering these tasks. Through this visualized result, we find that some downstream tasks still require the other group of experts. It is reasonable. For example, in reading comprehension tasks, the knowledge learned by the model during pre-training can better assist in making factual judgments. This phenomenon is even more pronounced in language-based tasks. In the WSC task (Levesque et al., 2012), the router allocates an average of about 45% of its attention to the expert group responsible for world knowledge.

Figure 7: Visualization of routers’ weight on different types of data, where Group 1 refers to the experts dedicated to aligning the world knowledge in the base model with the human instruction and Group 2 refers to the experts that focus on downstream tasks. The utilization rate of the expert groups diverged significantly across tasks.
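The per-group utilization shown in Figure 7 can be thought of as the average router weight each expert group receives over a task's evaluation set; a small sketch of that aggregation is below (function and variable names are assumptions, not the authors' code).

```python
import torch

def group_utilization(router_probs: torch.Tensor,
                      group_ids: torch.Tensor) -> torch.Tensor:
    """Average router weight received by each expert group.

    router_probs: (tokens, num_experts) softmax router outputs collected while
                  running one task's evaluation set through a LoRAMoE layer
    group_ids:    (num_experts,) group id of each expert
                  (e.g., 0 = world-knowledge experts, 1 = downstream-task experts)
    """
    mean_per_expert = router_probs.mean(dim=0)          # (num_experts,), sums to ~1
    num_groups = int(group_ids.max().item()) + 1
    return torch.stack([mean_per_expert[group_ids == g].sum()
                        for g in range(num_groups)])
```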

5 RELATED WORK

Parameter-Efficient Fine-tuning. With the significant rise in the number of parameters in language models, parameter-efficient fine-tuning (PEFT) (He et al., 2021a) has become an important research trend. It can consume fewer resources while fine-tuning large language models. Several methods achieve more efficient fine-tuning of the language model, including LoRA (Hu et al., 2021), adapters (Houlsby et al., 2019), and prompt learning (Lester et al., 2021). PEFT based on low-rank adapters (Hu et al., 2021) is widely used; it introduces two trainable low-rank matrices for each fully connected layer to achieve significant savings in training resources without adding additional inference computation cost. In this paper, we introduce low-rank adaptation into the expert networks of LoRAMoE to reduce fine-tuning consumption, which allows increasing the number of experts to improve performance while not significantly increasing computational resources.

Mixture-of-Experts. Mixture-of-Experts (MoE) (Jacobs et al., 1991) modifies the feed-forward neural network layer into sparsely activated experts, which significantly enlarges the model without remarkably increasing the computational cost. The exploration of MoE has attracted more and more attention in recent years, from the early sample-level MoE (Jacobs et al., 1991) up to the token-level MoE (Shazeer et al., 2016; Lepikhin et al., 2020; Du et al., 2022; Riquelme et al., 2021; Fedus et al., 2021; Xue et al., 2022) that has become mainstream nowadays. Meanwhile, some researchers (Zhou et al., 2022; Chi et al., 2022) aim to investigate the router selection problem in MoE. However, the vast majority of these works try to significantly increase the model parameters while keeping the computational cost low. Differently, Chen et al. (2023) explores the advantages of MoE-based language modeling in continual learning. Our approach is significantly different: we use an MoE-like structure to address the parametric knowledge retention issue in LLMs, rather than significantly expanding the count of model parameters.

Multi-LoRA Architecture. Some researchers have also proposed the utilization of multiple LoRAs to improve the performance of models or to gain advantages in other aspects. Huang et al. (2023) proposed LoraHub, which trains several LoRAs and chooses different LoRA combinations based on data types during the inference phase. MOELoRA (Liu et al., 2023) fine-tuned language models through the incorporation of MoE structures, thereby boosting their efficacy in multitasking within the medical domain. However, these methods take the data type as the input of the router during training, which necessitates prior knowledge of the data type during inference, i.e., choosing the combination of LoRAs based on the data type. Zadouri et al. (2023) introduced Mixture of Vectors and Mixture of LoRA to reduce the resource consumption of fine-tuning large language models and improve performance on unseen tasks. Sheng et al. (2023) proposed S-LoRA, a system that can serve thousands of LoRA adapters on a single machine. These methods are very different from LoRAMoE. We explore for the first time how the expansion of instruction data seriously damages the knowledge of pre-trained language models during the fine-tuning phase. We further propose to address this conflict through the MoE structure and to utilize LoRA to reduce resource consumption. Simultaneously, LoRAMoE is an end-to-end approach that does not require a priori knowledge of data types for inference.

6 CONCLUSION

In this paper, we delve into the consequences of increasing the quantity of instruction data during the supervised fine-tuning stage. We uncover that an extensive expansion in data quantity can critically impair the world knowledge of LLMs, consequently causing world knowledge forgetting. To address this conflict, we introduce LoRAMoE, a new MoE-based trainable LLM plugin that separates networks concentrating on different capabilities with localized balancing constraints. Extensive experiments have shown that LoRAMoE is adept at resolving this conflict. It preserves the world knowledge inside the LLMs while simultaneously enhancing the performance of other downstream tasks within the instruction data during the fine-tuning process.

APPENDIX

A.1 DETAILS ABOUT FINE-TUNING DATASETS

Table 3 shows the specific tasks covered by the 3-million-sample dataset we used in Section 2 and the amount of data for each task. The 5-million-sample fine-tuning set we use includes the 3-million-sample version and its variants produced by data augmentation strategies. The 1-million-sample version is a subset of the original 3-million-sample dataset.

