
Model | LLaMA Pro

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-08

LLaMA Pro: Progressive LLaMA with Block Expansion

  • url: https://arxiv.org/abs/2401.02415
  • pdf: https://arxiv.org/pdf/2401.02415
  • model: https://huggingface.co/TencentARC/LLaMA-Pro-8B
  • abstract: Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model’s knowledge without catastrophic forgetting. In this paper, we experiment on the corpus of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B, excelling in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLaMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.

Contents

TL;DR


  • Develops a domain-adaptive post-pretraining method for large language models
  • Injects domain-specific knowledge via block expansion while preserving general capabilities
  • The resulting LLAMA PRO model achieves state-of-the-art results across a range of benchmarks

1. Introduction

Recent advances in large language models (LLMs) have brought revolutionary changes to natural language processing and improved capabilities on a wide range of real-world tasks. However, they still fall short in domains such as programming, mathematics, biomedicine, and finance, which hinders progress toward general-purpose language agents with broader applications. To address this, this work introduces a domain-adaptive post-pretraining method that injects domain-specific knowledge while preserving the model's general abilities.


2. Related Work

Progress in large language models has gone hand in hand with growth in data and model scale; in particular, general-purpose models can now solve diverse problems and adapt quickly to new tasks. In parallel, domain-adaptive pretraining tailors a model to a specific domain and is broadly split into two stages: general-domain pretraining followed by domain-specific training. This work builds on these techniques and proposes a new strategy that gains task-specific expertise without sacrificing overall performance.


3. Method

3.1 Preliminaries: The LLaMA Block

The LLaMA block is defined as follows.

\[x' = x + \text{MHSA}(\text{RMSNorm}(x)) \\ y = x' + \text{FFN}(\text{RMSNorm}(x')) \tag{1}\]

The input \(x\) has dimension \(n \times d\), and the output \(y\) has the same dimension as the input. MHSA is the core attention operation of the block, whose head outputs are combined as:

\[\text{MHSA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O\]

Building on this structure, this work introduces identity blocks that can be added to the model without changing its output. The approach rests on the following property:

\[\phi_{\text{id}}(x) = x \tag{2}\]

An identity block is configured so that its input equals its output, which keeps the original model's output unchanged at initialization.
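To make the identity construction concrete, here is a minimal PyTorch sketch (not the paper's code; causal masking and rotary embeddings are omitted) of a simplified LLaMA-style block implementing Eq. (1). Zero-initializing the attention output projection \(W^O\) and the FFN down-projection \(W_3\), with no bias terms anywhere, makes both residual branches output zero, so the block reduces to \(\phi_{\text{id}}(x) = x\) at initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class LLaMABlock(nn.Module):
    """Simplified LLaMA block: x' = x + MHSA(RMSNorm(x)); y = x' + FFN(RMSNorm(x'))."""
    def __init__(self, dim=64, n_heads=4, identity_init=False):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, bias=False, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.w1 = nn.Linear(dim, 4 * dim, bias=False)   # gate projection
        self.w2 = nn.Linear(dim, 4 * dim, bias=False)   # up projection
        self.w3 = nn.Linear(4 * dim, dim, bias=False)   # down projection
        if identity_init:
            # Zero the two output projections (W^O and W_3): with no bias terms,
            # both residual branches output 0 and the block is an identity map.
            nn.init.zeros_(self.attn.out_proj.weight)
            nn.init.zeros_(self.w3.weight)

    def forward(self, x):
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                                  # x' = x + MHSA(RMSNorm(x))
        h = self.ffn_norm(x)
        return x + self.w3(F.silu(self.w1(h)) * self.w2(h))  # y = x' + FFN(RMSNorm(x')), SwiGLU FFN

x = torch.randn(2, 8, 64)                                 # (batch, seq_len, hidden)
assert torch.allclose(LLaMABlock(identity_init=True)(x), x)  # identity at initialization
```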

3.2 Model Expansion and Domain-Specific Training

In the expansion stage, the LLM is extended toward a new domain: the inherited blocks are frozen and only the newly added blocks are fine-tuned on a domain-specific corpus. Writing the frozen and newly added parameters as \(\theta_{\text{old}}\) and \(\theta_{\text{new}}\), the objective of this stage can be summarized as

\[\min_{\theta_{\text{new}}} \; \mathcal{L}\big(\mathcal{D}_{\text{domain}};\, \theta_{\text{old}}, \theta_{\text{new}}\big) \tag{3}\]

In this way the model integrates new knowledge effectively while retaining its existing general knowledge.
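Continuing the LLaMABlock sketch above, the expansion itself can be written as a small routine (illustrative names, not the authors' code): inherited blocks are frozen, one zero-initialized copy is stacked on top of each group, and only the copies are handed to the optimizer.

```python
import copy
import torch
from torch import nn

def zero_init_output_projections(block: nn.Module) -> None:
    # Zero W^O and W_3 so the copied block starts as an identity map
    # (attribute names follow the LLaMABlock sketch above).
    nn.init.zeros_(block.attn.out_proj.weight)
    nn.init.zeros_(block.w3.weight)

def expand_and_freeze(blocks: nn.ModuleList, group_size: int = 4) -> nn.ModuleList:
    """Freeze all inherited blocks and stack one zero-initialized identity copy
    on top of every `group_size` consecutive blocks; only the copies stay trainable."""
    expanded = nn.ModuleList()
    for i, block in enumerate(blocks):
        block.requires_grad_(False)              # inherited block: frozen
        expanded.append(block)
        if (i + 1) % group_size == 0:
            new_block = copy.deepcopy(block)     # identity copy of the group's top block
            zero_init_output_projections(new_block)
            new_block.requires_grad_(True)       # only new blocks are trained
            expanded.append(new_block)
    return expanded

blocks = nn.ModuleList([LLaMABlock() for _ in range(32)])   # base model: 32 blocks
expanded = expand_and_freeze(blocks, group_size=4)          # 8 groups of 4 -> 40 blocks
optimizer = torch.optim.AdamW(
    (p for p in expanded.parameters() if p.requires_grad),  # new blocks only
    lr=2e-4, weight_decay=0.1,
)
assert len(expanded) == 40
```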


4. Experiments

4.1 Experimental Setup

Experiments focus on code and math datasets: Stack-dedup, derived from GitHub, and Proof-pile-2, which is rich in mathematical content. Using this data, the model shows strong performance on code and math problems. The experiments use the following settings.

  • Batch size: 1024
  • Sequence length: 4096
  • Learning rate: \(2 \times 10^{-4}\)
  • LR scheduler: Cosine


4.2 Validating the Effect of Block-Expansion Tuning

Tuning after block expansion improves the model's domain-specific performance. The LLaMA-Pro model surpasses existing models on benchmark tests, with particularly strong gains in the code and math domains.

This work presents a new approach that improves performance on specific tasks through specialized post-pretraining and fine-tuning of large language models, and in particular shows how to make effective use of domain-specific data.

4.2.1 Block Expansion Methodology

Block expansion is the core technique introduced in this work; it is designed to integrate domain-specific knowledge effectively while preserving the base model's general abilities. The procedure is as follows.

  • Initialization: each added Transformer block contains zero-initialized linear layers, which makes an identity mapping possible.
  • Block copying: selected blocks of the base model are copied to create identity blocks, extending the model's depth.
  • Domain-specific training: the expanded model, including the identity blocks, is fine-tuned on a domain-specific dataset. During this stage the original blocks are frozen and only the new blocks are trained.


This approach ensures that the model does not lose its existing general abilities while learning new domain knowledge, and it is particularly helpful for improving performance on math and programming tasks.


4.2.2 Supervised Fine-Tuning (SFT)

The supervised fine-tuning stage aims to make the model follow task instructions better. It uses the following data sources.

  • ShareGPT: chat histories between real users and ChatGPT.
  • WizardLM evolution instruction dataset: instruction data of varying complexity.
  • Evolution CodeAlpaca dataset: complex coding tasks generated by ChatGPT together with their solutions.
  • MetaMath: a dataset that reframes problems from multiple perspectives.
  • SlimOrca: a dataset curated for efficient performance.

SFT is run with the following settings.

  • Batch size: 128
  • Sequence length: 4096
  • Warmup ratio: 0.03
  • Learning rate: 2e-5
  • LR scheduler: Cosine
  • Mixed precision: bf16

Through this stage, LLaMA-Pro achieves top performance on a variety of benchmarks, with notable gains on complex multi-step math and programming tasks.


4. Experiments

4.1 Experimental Setup

Pretraining details
This work builds a dataset focused on code and math. For code, it uses the Stack-dedup dataset, a collection of permissively licensed source code from GitHub, and specifically its Python split. For math, it uses Proof-pile-2, a 55-billion-token dataset of scientific papers, web data with mathematical content, and mathematical code.

The base model is initialized from LLaMA2-7B and the number of blocks is expanded from 32 to 40. The block expansion is configured with \(P = 1\), \(M = 4\), and \(N = 8\), so each group grows from 4 blocks to 5. Pretraining on the code and math corpus uses a batch size of 1024, a sequence length of 4096, a warmup ratio of 6%, a learning rate of \(2 \times 10^{-4}\), and a Cosine learning-rate scheduler, together with bf16 mixed precision, a weight decay of 0.1, and gradient clipping at 1.0. Flash-attention is applied to speed up training.

4.2 Validating the Effect of Block-Expansion Tuning

LLAMA PRO is pretrained on the following data sources.

  • Proof-Pile-2: 55B tokens
  • AlgebraicStack: 11B tokens
  • OpenWebMath: 15B tokens
  • ArXiv: 29B tokens
  • The-Stack-Dedup: 22B tokens, weight 1.50

To validate the effect of tuning after block expansion, the model is evaluated on a range of benchmark datasets that jointly cover LLAMA PRO's general language ability and its programming and math problem-solving skills. Evaluation is performed on the following benchmarks.

  • HumanEval: Python programming problems
  • GSM8K: multi-step math word problems
  • MBPP: Python programming problems

4.3 Supervised Fine-Tuning (SFT) Results

In the supervised fine-tuning stage, the following data sources are combined to produce LLaMA-Pro.

  • ShareGPT: chat histories between real users and ChatGPT
  • WizardLM evolution instruction dataset: instruction data of varying complexity
  • Evolution CodeAlpaca dataset: complex coding tasks generated by ChatGPT and their solutions
  • MetaMath: a dataset that reframes questions from multiple perspectives
  • SlimOrca: a curated subset of the OpenOrca data

The training configuration for this stage is a batch size of 128, a sequence length of 4096, a warmup ratio of 0.03, a learning rate of 2e-5, a Cosine learning-rate scheduler, and bf16 mixed precision.

4.4 Ablation Study

The ablation study compares several training strategies to assess their usefulness and to examine how specific design decisions affect model performance. Continual learning is evaluated on the TRACE benchmark using the following strategies.

Training strategies:

  • LoRA: improves training efficiency by tuning only a small set of added parameters instead of retraining the existing weights.
  • SeqFT (Sequential Fine-Tuning): sequentially fine-tunes the existing blocks to adapt to new data.
  • Block Expansion: the main method proposed in this paper; the base model is expanded with new blocks that are then fine-tuned for the target domain.

The ablation study confirms that the proposed Block Expansion method outperforms LoRA and SeqFT, and that it is a useful strategy for improving model stability and knowledge retention in continual-learning settings. A sketch contrasting what each strategy trains is given below.
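A rough sketch (illustrative only, not the authors' code) of what each ablated strategy actually trains. It assumes a Hugging Face-style LLaMA causal LM whose decoder blocks live at `model.model.layers`; `expand_and_freeze` refers to the expansion sketch in Section 3.2 above, and the `peft` usage shows one common way to attach LoRA adapters.

```python
from peft import LoraConfig, get_peft_model

def lora_setup(model):
    # LoRA: base weights stay frozen; small low-rank adapters on the attention
    # projections are the only trainable parameters.
    cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                     task_type="CAUSAL_LM")
    return get_peft_model(model, cfg)

def seqft_setup(model):
    # SeqFT: all existing blocks are fine-tuned on each new task in sequence.
    model.requires_grad_(True)
    return model

def block_expansion_setup(model, group_size=4):
    # Block expansion: freeze every inherited parameter, interleave zero-initialized
    # identity copies, and train only those copies. `expand_and_freeze` is the sketch
    # from Section 3.2; real LLaMA implementations use different attribute names.
    model.requires_grad_(False)
    model.model.layers = expand_and_freeze(model.model.layers, group_size)
    return model
```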

5. Evaluation

Evaluation metrics:

  • Overall Performance (OP): the model's average performance over all learned tasks.
  • Backward Transfer (BWT): the effect that learning new tasks has on performance for earlier tasks, i.e., how well the model retains previously learned information.
Model Technique Overall Performance (OP) Backward Transfer (BWT)
LoRA 37.1 -17.3%
SeqFT 45.5 -14.7%
Block Expansion 46.5 -14.3%
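To make the two metrics concrete, here is a small sketch using their standard definitions (not code from the paper). `R[t, i]` is the score on task \(i\) after sequentially learning tasks up to \(t\); the example matrix is made up purely for illustration.

```python
import numpy as np

def op_and_bwt(R: np.ndarray) -> tuple[float, float]:
    """R is a (T x T) lower-triangular score matrix: R[t, i] is the performance on
    task i after sequentially learning tasks 0..t."""
    T = R.shape[0]
    op = R[T - 1].mean()                                           # average final score over all tasks
    bwt = np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)])   # drop relative to just-learned score
    return float(op), float(bwt)

# Toy 3-task example (values invented for illustration only).
R = np.array([[60.0,  0.0,  0.0],
              [55.0, 62.0,  0.0],
              [52.0, 58.0, 64.0]])
op, bwt = op_and_bwt(R)
print(f"OP = {op:.1f}, BWT = {bwt:.1f}")   # negative BWT means forgetting
```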

By comparing these training strategies, this work analyzes how each behaves during continual learning. The Block Expansion strategy turns out to be the strongest in both overall performance and knowledge retention, showing that adding new blocks for domain-specific training can integrate new information effectively while preserving what the model already knows.

The BWT metric quantifies retention during continual learning; Block Expansion shows the highest (least negative) BWT, meaning that learning new tasks causes the least damage to existing knowledge. These results suggest that the proposed block expansion method improves task-specific ability while minimizing the loss of prior knowledge.


6. Evaluating the Impact of Key Design Choices

To assess the impact of the design choices, LLaMA-Pro is evaluated on the following benchmarks.

  • AI2 Reasoning Challenge: grade-school science questions
  • HellaSwag: commonsense inference
  • MMLU: multitask accuracy across a wide range of subjects
  • TruthfulQA: measures a model's propensity to reproduce falsehoods commonly found online
  • Winogrande: a large-scale, challenging benchmark for commonsense reasoning
  • GSM8k: measures the ability to solve multi-step mathematical reasoning problems

Overall, LLaMA-Pro shows improved performance across general language, programming, and math tasks compared with other models from the LLaMA community, suggesting that the model can integrate domain-specific knowledge effectively while retaining its general abilities.


1 Introduction

The advent of Large Language Models (LLMs) has revolutionized the field of natural language processing, exhibiting remarkable proficiency in a variety of real-world tasks (OpenAI, 2023; Chowdhery et al., 2023).

Figure 2: (a) We begin with a large language model (LLM) pre-trained on a massive unlabeled corpus, resulting in a model with strong general capabilities. Here we select the off-the-shelf LLaMA2 for convenience. (b) We employ backbone expansion and fine-tune the expanded identity blocks using the aspect corpus while freezing the blocks inherited from the base model. The model after post-pretraining can be used for instruction tuning as usual.

Despite the versatility, LLMs still fall short in certain domains, for example, programming, mathematics, biomedicine, or finance. This limitation impedes the progress of developing generic language agents for broader applications.

Existing works (Liu et al., 2023; Li et al., 2023a; Wu et al., 2023b) attempted to improve the multifaceted capabilities of pre-trained LLMs with tailored data recipes. While feasible, they require substantial computational resources and vast amounts of data, which poses a challenge to the democratization of LLM research. Consequently, another line of research, known as domain-adaptive pretraining, focuses on post-pretraining with domain-specific corpora (Gururangan et al., 2020). These approaches have demonstrated efficacy in adapting various LLMs to specific domains (Roziere et al., 2023; Azerbayev et al., 2023; Wu et al., 2023b; Xu et al., 2023b), resulting in enhanced performance on downstream domain-specific tasks at a reduced computational cost.

Nonetheless, a considerable obstacle emerges in catastrophic forgetting (De Lange et al., 2021). Postpretraining often leads to a decline in the model’s original general abilities, inhibiting the fine-tuned performance of the model on diverse tasks (Cheng et al., 2023; Dong et al., 2023). This necessitates a method that can inject domain-specific knowledge into LLMs while preserving their general abilities, thereby enhancing their comprehensive capabilities.

Towards this end, we introduce a simple yet effective post-pretraining method, termed block expansion. We expand the off-the-shelf pre-trained LLM using copied Transformer blocks, as illustrated in Figure 2. The newly added blocks, whose linear layers are zero-initialized to enable identity mapping, are further tuned with only domain-specific corpus while the remaining blocks are frozen. After tuning, the extended pre-trained model excels in both general and domain-specific tasks.

In practice, we extend the pre-trained LLaMA2-7B (Touvron et al., 2023) by eight more blocks, yielding LLAMA PRO, a foundation model with 8.3B parameters, and enhanced performance in programming, coding, and reasoning. We pre-train LLAMA PRO’s expanded blocks on 80B tokens using open-source code and math data for 2830 GPU Hours (16 NVIDIA H800 GPUs for about 7 days). We further perform supervised instruction tuning (fully fine-tuning of all the blocks, aka SFT) on LLAMA PRO with approximately 80M tokens, yielding LLaMA-Pro. It is noted that pre-trained models produced by our block expansion method are well-compatible with the subsequent SFT techniques without specific modification.

As shown in Figure 1, LLaMA-Pro reaches state-of-the-art performance across a broad range of general, code (i.e., HumanEval), and math (i.e., GSM8K) tasks. Furthermore, we assess the capabilities of LLaMA-Pro as a language agent across various scenarios (i.e., MINT-Bench), with a focus on the tool usage abilities and the capacity to ground in environmental and human feedback. We also employ GPT-4 (OpenAI, 2023) automatic evaluation to assess LLAMA PRO’s ability to serve as an effective assistant (i.e., MT-Bench). Comprehensive experimental results indicate the superiority of LLaMA-Pro over other models from the LLaMA family on both benchmarks and practical applications. Our contributions are three-fold:

  • We propose a novel post-pretraining method for LLMs, termed block expansion, enabling the injection of new knowledge while preserving the initial capabilities.
  • We introduce LLAMA PRO and LLaMA-Pro, versatile LLMs that well integrate natural and programming languages, excelling in general tasks, programming, and mathematics.
  • We benchmark the family of LLAMA PRO on extensive datasets, including both traditional and agent-oriented tasks, demonstrating its superiority and great potential in broader complex applications.

2 Related Work

Advancements in Large Language Models. The field of large language models has witnessed significant progress in recent years. The growth in model and data scale has played a crucial role in achieving state-of-the-art performance across various tasks (Hoffmann et al., 2022; Kaplan et al., 2020; Chowdhery et al., 2023). Concurrently, the development of more generalist models has led to the creation of models that can address diverse problems and quickly adapt to new tasks (Radford et al., 2019; Brown et al., 2020). These advancements have been further bolstered by the open-source community, which has released powerful open large language models for research, such as LLaMA (Touvron et al., 2023) and CodeLLaMA (Roziere et al., 2023). Our work builds upon these developments by providing a methodology for specializing large language models in the domain of code, paving the way for future research and applications in this area.

Post-pretraining. Language model applications typically involve a two-step process: an initial general-domain pretraining step, followed by domain-specific training (Roziere et al., 2023; Azerbayev et al., 2023). The fine-tuning step is often aimed at enhancing instruction-following abilities (Sanh et al., 2021; Wei et al., 2021; Wang et al., 2023d) or aligning the model's outputs with human preferences (Ziegler et al., 2019; Ouyang et al., 2022; Bai et al., 2022). Additionally, some studies explore adapting pretrained models to novel domains using parameter-efficient fine-tuning methods (Houlsby et al., 2019; Hu et al., 2021; Wu et al., 2023a). Many works also focus on how to do continual learning after the pretraining phase (Wang et al., 2023b; Gupta et al., 2023; Scialom et al., 2022). In our work, we propose an adaptation strategy that combines continued training with targeted general capability maintenance, allowing large language models to specialize in specific tasks without sacrificing their overall performance.

Progressive Learning. In recent years, progressive training has gained attention for its ability to accelerate the training of large-scale models in both computer vision (Zhang et al., 2023) and NLP research (Yao et al., 2023; Li et al., 2023b). Gong et al. (2019) proposed a stacking method that doubles the model depth at each stage. CompoundGrow (Gu et al., 2020) extends stacking by incorporating FeedForward Network (FFN) expansion into the schedule design. Shen et al. (2022) proposed a staged method that further supports expanding the hidden size of features. Bert2BERT (Chen et al., 2021a) and LiGO (Wang et al., 2023a) support all possible growth dimensions. Our method employs depth growth to preserve general performance while adapting to a specific domain.

3 Method

3.1 Preliminaries: The LLaMA Block

The LLaMA block consists of a multi-head self-attention (MHSA) mechanism followed by a position-wise feed-forward network (FFN) with residual connections and a Swish-Gated Linear Unit (SwiGLU) operation, as Figure 3 shows. Given an input \(x\), the LLaMA block produces an output \(y\) as described by the following equations:

\[x' = x + \text{MHSA}(\text{RMSNorm}(x)) \\ y = x' + \text{FFN}(\text{RMSNorm}(x')) \tag{1}\]

The input \(x\) has a dimension of \(n \times d\), where \(n\) is the sequence length and \(d\) is the hidden size. The output \(y\) has the same dimension as the input \(x\). The MHSA operation is a crucial component of the transformer, defined as:

\[\text{MHSA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O\]

where \(Q, K,\) and \(V\) are the query, key, and value matrices, respectively, and \(W^O\) is the output weight matrix without bias. Each head is computed as:

\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]
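As a concrete reading of these equations, here is a small tensor-level sketch of MHSA (illustrative only: a single sequence, no causal mask or rotary embeddings, names and shapes chosen for clarity).

```python
import torch
import torch.nn.functional as F

def mhsa(x, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention as in the equations above: per-head projections,
    scaled dot-product attention, concatenation, then the output projection W^O."""
    n, d = x.shape                       # (sequence length, hidden size)
    dh = d // n_heads                    # per-head dimension
    # Project and split into heads: (n_heads, n, dh)
    q = (x @ Wq).view(n, n_heads, dh).transpose(0, 1)
    k = (x @ Wk).view(n, n_heads, dh).transpose(0, 1)
    v = (x @ Wv).view(n, n_heads, dh).transpose(0, 1)
    attn = F.softmax(q @ k.transpose(-2, -1) / dh**0.5, dim=-1)   # Attention(QW_i^Q, KW_i^K, VW_i^V)
    heads = attn @ v                                              # head_1, ..., head_h
    concat = heads.transpose(0, 1).reshape(n, d)                  # Concat(head_1, ..., head_h)
    return concat @ Wo                                            # ... W^O (no bias)

d, h = 64, 4
x = torch.randn(8, d)
Wq, Wk, Wv, Wo = (torch.randn(d, d) / d**0.5 for _ in range(4))
y = mhsa(x, Wq, Wk, Wv, Wo, n_heads=h)
assert y.shape == x.shape
```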

Given a model with blocks \((\phi_0, \phi_1, \ldots, \phi_L)\), the block expansion incorporates an identity block \(\phi_{\text{id}}\) after each block in the original model, ensuring that the expanded model maintains the same output after expansion. The identity block is defined as \(\phi_{\text{id}}(x) = x\), where the input and output are identical.

Suppose we have an initial model with \(L\) blocks that needs to be expanded to \(L'\) blocks. First, we partition the original \(L\) blocks into \(N\) groups, with each group containing \(\frac{L}{N}\) blocks. For each group, we create identity copies of the top \(P\) blocks and stack them on top of each group, as depicted in Figure 3. We arrange these blocks in an interleaved manner to maintain the structural characteristic of the transformer model, whose prior is that deeper blocks encode more complex information (Van Aken et al., 2019; Tenney et al., 2019). This process leads to an increased depth in the model while maintaining its output behavior.
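The interleaving itself is just bookkeeping over block indices; a small sketch (illustrative only) for the paper's setting of \(L = 32\), \(N = 8\), \(P = 1\) shows which positions in the expanded stack are zero-initialized identity copies.

```python
def expansion_plan(L: int, N: int, P: int):
    """Return the expanded layer layout: ('orig', i) for an inherited block,
    ('copy', i) for an identity copy of block i stacked on top of its group."""
    group = L // N                      # blocks per group
    plan = []
    for g in range(N):
        members = list(range(g * group, (g + 1) * group))
        plan += [("orig", i) for i in members]
        plan += [("copy", i) for i in members[-P:]]   # identity copies of the top P blocks
    return plan

plan = expansion_plan(L=32, N=8, P=1)
print(len(plan))                         # 40 blocks after expansion (32 + 8)
print([kind for kind, _ in plan[:6]])    # ['orig', 'orig', 'orig', 'orig', 'copy', 'orig']
```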

Shen et al. (2022) proposed constructing the identity block by initializing the scale parameters of the Norm modules inside it to zero. However, this approach may not be effective when applied to the LLaMA block: with the RMSNorm weight \(w\) set to zero we have \(\text{RMSNorm}(x') = 0\), and during backpropagation the gradient of the loss function \(L\) with respect to \(w\) is also zero,

\[\frac{\partial L}{\partial w} = 0\]

so the RMSNorm module can never be trained. This is explained further in Appendix A. Referring to the LLaMA block formulation in Equation 1, the identity can instead be achieved as long as \(\text{MHSA}(\text{RMSNorm}(x)) = 0\) and \(\text{FFN}(\text{RMSNorm}(x')) = 0\). We therefore initialize the \(W^O\) and \(W_3\) weight matrices in the identity blocks to zero. Due to the presence of residual connections and the absence of bias terms in the LLaMA block, only the residual flows through the identity block. As a result, the entire block reduces to an identity block at initialization, preserving the output from the initial model.

The entire training pipeline is depicted in Figure 2. Our method concentrates on the post-pretraining stage, targeting specific domain corpora such as code corpora. We begin by initializing our model with large language models trained on extensive unlabeled general corpora, where all blocks will be fine-tuned. To enhance the model’s capacity for accommodating additional domain knowledge while retaining its general knowledge, we employ block expansion to increase the number of blocks in the LLM. During this process, we only fine-tune the newly added blocks while freezing the original blocks, thereby preserving the general abilities of the model.

Figure 3: LLaMA Block Architecture

(a) An overview of the LLaMA Block, comprising an MHSA mechanism followed by the FFN with SwiGLU activation. (b) The Identity LLaMA block after an identity copy, achieved by initializing the output linear matrix to zero in order to preserve the output from the base LLaMA model.

4 Experiments

This section presents our key experimental findings. We begin with experimental settings (described in Sec. 4.1), and then verify the effectiveness of block expanded tuning after pretraining (described in Sec. 4.2). Next, we give the supervised fine-tuning (SFT) results (described in Sec. 4.3). Finally, ablation studies of the key design choices are presented (described in Sec. 4.4).

4.1 Experimental Settings

Pretrain details. We construct a dataset that concentrates on code and math. For the code component, we rely on the Stack-dedup dataset, which is a compilation of permissively licensed source codes from GitHub. Among all the programming languages available in Stack-dedup, we specifically utilize the Python split. As for the math component, we opt for the Proof-pile-2 dataset (Azerbayev et al., 2023), a 55-billion-token amalgamation of scientific papers, web data containing mathematical content, and mathematical code.

We initialize our base model with LLaMA2-7B and expand the number of blocks from 32 to 40 using an interleaved approach. In the block expansion process, we configure the parameters as \(P = 1\), \(M = 4\), and \(N = 8\), resulting in 8 groups where each group expands from 4 blocks to 5 blocks. For the code and math corpus pretraining, we employ a batch size of 1024, a sequence length of 4096, a warmup ratio of 6%, a learning rate of \(2 \times 10^{-4}\), and a Cosine learning rate scheduler. We also use bf16 mixed precision, a weight decay of 0.1, and gradient clipping at 1.0. To speed up the training process, we apply the flash-attention mechanism.
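For orientation, here is a hedged sketch of how such a run might be configured with the Hugging Face stack. It mirrors the reported hyperparameters but is not the authors' training code; the model loading and argument names assume a recent `transformers` release with flash-attn installed, and the per-device batch and accumulation values are illustrative choices that reach the reported global batch size of 1024 across devices.

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Base model in bf16 with flash attention, as described above.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Reported expanded-block pretraining hyperparameters (the 4096-token sequence
# length is handled by the data pipeline, not by TrainingArguments).
args = TrainingArguments(
    output_dir="llama-pro-pretrain",
    per_device_train_batch_size=8,      # illustrative: devices x per-device batch x
    gradient_accumulation_steps=8,      # accumulation steps = global batch of 1024
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    weight_decay=0.1,
    max_grad_norm=1.0,
    bf16=True,
)
```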

Data Source Tokens Weight
Proof-Pile-2 55B 1.00
AlgebraicStack 11B 1.00
OpenWebMath 15B 1.00
ArXiv 29B 1.00
The-Stack-Dedup 22B 1.50

Table 1: Pretrain Data Sources, Tokens, and Mixture Weights

This table outlines the pretraining data sources, the number of tokens from each source, and their respective weights in the training mixture.

Dataset Query Source Response Source # Instances ¯Nrounds ¯Lprompt ¯Lcompletion
User prompts Human-written GPT-4 63,817 2.9 293.2 1157.1
ShareGPT Human-written/GPT-4 GPT-4 143,000 1.0 602.6 1704.9
WizardLM_evol_instruct_V2 GPT-4 GPT-3.5/GPT-4 517,982 1.0 574.3 599.3
SlimOrca GPT-4 GPT-4 395,000 1.0 209.4 498.2
MetaMath GPT-4 GPT-4 111,272 1.0 652.5 1552.0
Evol-CodeAlpaca GPT-4 GPT-4 - - - -

Table 2: Datasets Information

This table provides details on various datasets including the source of queries and responses, number of instances, average number of rounds (¯Nrounds), average prompt length (¯Lprompt), and average completion length (¯Lcompletion).

Our experiment is conducted on 16 NVIDIA H800 GPUs. LLAMA PRO is trained for a total of 15,900 steps. This training process corresponds to approximately 2830 H800 GPU hours.

  • SFT details
    • During the instruction fine-tuning phase, we combine five data sources to create LLaMA-Pro. These sources include ShareGPT1, which contains real user and ChatGPT chat history records, and the WizardLM evolution instruction dataset (Xu et al., 2023a), offering a wealth of instruction data with varying complexity levels. We also incorporate the evolution CodeAlpaca dataset (Luo et al., 2023), which includes complex coding tasks generated by ChatGPT and their corresponding solutions. Additionally, we use MetaMath (Yu et al., 2023), which reframes questions from multiple perspectives, and SlimOrca (Lian et al., 2023), a curated subset of our OpenOrca data. SlimOrca provides an efficient route to achieve performance comparable to using larger data slices, while only incorporating approximately 500,000 GPT-4 completions.
    • The final SFT dataset consists of approximately 1M samples. To fine-tune the basic models, we employ specific configurations, including a batch size of 128, a sequence length of 4096, a warmup ratio of 0.03, a learning rate of 2e-5, a Cosine learning rate scheduler, and bf16 mixed precision.
  • Evaluation details
    • We conduct a comparative analysis of LLAMA PRO with the latest state-of-the-art (SOTA) Large Language Models (LLMs). The evaluation is performed on six key general benchmarks using the Eleuther AI Language Model Evaluation Harness2, a unified framework designed to test generative language models across a vast array of evaluation tasks. For code-related tasks, we employ the BigCode Evaluation Harness3 to evaluate HumanEval and MBPP, and we report the pass@1 rate of code tasks with greedy decoding.
    • The benchmarks used for evaluation include:
      • AI2 Reasoning Challenge (Clark et al., 2018) (25-shot): a set of grade-school science questions.

https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

https://github.com/EleutherAI/lm-evaluation-harness

https://github.com/bigcode-project/bigcode-evaluation-harness

Model Language Tasks Math Tasks Code Tasks Avg.
Pretrained comparison        
LLAMA PRO (8B) 54.10 77.94 73.95 17.89
CrystalCoder (7B) 47.01 71.97 67.17 10.77
LLaMA2-7B 53.07 78.59 74.03 14.48
CodeLLaMA-7B 39.93 60.80 64.01 5.16
StarCoder-15B 30.38 47.93 56.12 9.48
LLaMA-7B 50.94 77.81 71.43 8.04
OpenLLaMA-v2-7B 43.69 72.20 69.38 3.49
Falcon-7B 47.87 78.13 72.38 4.62
SFT comparison        
LLaMA-Pro 52.30 76.88 72.53 7.35
LLaMA2-7B-Chat 52.90 78.55 71.74 7.96
CodeLLaMA-7B-Instruct 36.52 55.44 64.56 4.70
WizardCoder-Python-7B 41.81 65.06 61.72 2.73
WizardMath-7B 54.10 79.55 72.69 25.42
       
Individual Task Scores        
ARC 47.88 47.88 43.59 25.57
HellaSwag 48.78 48.78 7.35 28.66
MMLU 46.87 46.87 7.96 28.38
TruthfulQA 31.12 31.12 4.70 13.05
Winogrande 29.96 29.96 2.73 33.50
GSM8K 35.69 35.69 25.42 33.63
GSM8K-PoT 41.29 41.29 55.61 10.61
HumanEval 27.79 27.79 19.73 15.32
MBPP 52.57 52.57 44.51 9.42

Table 3: Comparison of evaluation results among several prominent code and language models.

  • HellaSwag (10-shot) (Zellers et al., 2019): a test of commonsense inference, which is easy for humans (approximately 95%) but challenging for SOTA models.
  • MMLU (5-shot) (Hendrycks et al., 2020): a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
  • TruthfulQA (0-shot) (Lin et al., 2021): a test to measure a model’s propensity to reproduce falsehoods commonly found online.
  • Winogrande (5-shot) (Sakaguchi et al., 2021): an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
  • GSM8k (5-shot) (Cobbe et al., 2021): diverse grade school math word problems to measure a model’s ability to solve multi-step mathematical reasoning problems. Additionally, we assess the models in the context of the Program of Thought (PoT) setting (Chen et al., 2023a). The PoT setting utilizes Python code to solve mathematical problems, which serves to evaluate the code generation capabilities of the models.
  • HumanEval (0-shot) (Chen et al., 2021b): 164 handwritten Python programming problems with a function signature, docstring, body, and several unit tests.
  • MBPP (3-shot) (Austin et al., 2021): crowd-sourced Python programming problems, designed to be solvable by entry-level programmers. Each problem consists of a task description in English, a code solution and 3 automated test cases.

4.2 Pretrain Results

We evaluate LLAMA PRO’s performance with benchmark datasets from the Open LLM Leaderboard. Furthermore, we incorporate coding benchmark datasets, including HumanEval pass@1 and MBPP pass@1, as well as the math benchmark GSM8K, to provide a comprehensive evaluation. We compare the performance of LLAMA PRO with a selection of state-of-the-art pretrained models that were trained around the same period with similar size. This includes general-purpose pretrained models like LLaMA2 and code-oriented pretrained models like CodeLLaMA. The results are presented in Table 3.

The results highlight that LLAMA PRO effectively balances natural language processing and coding capabilities. It not only preserves the general performance of its base model, LLaMA2-7B, but also surpasses it in the average performance of general language tasks. Conversely, CodeLLaMA-7B sacrifices general performance to enhance its code ability. We attribute this improvement to our expansion design, which freezes the initial LLaMA blocks to maintain their capabilities and increases the blocks to accommodate more domain-specific knowledge.

Figure 4: We compare LLAMA PRO’s general performance and code performance to a set of models trained around the same time, spanning from general LLMs to code-oriented LLMs. The size of the blobs is proportional to the number of tokens trained. Mistral-7B is not included here, as the number of tokens is not reported in its paper.

As depicted in Figure 4, LLAMA PRO shows robust general performance alongside code performance that is on par with code-oriented LLMs. Situated on the Pareto frontier, LLAMA PRO has undergone fine-tuning with an additional 80B tokens in conjunction with LLaMA2, which more than doubles the code tasks average performance. In contrast, CodeLLaMA is fine-tuned with 500B tokens. LLAMA PRO excels in general performance while maintaining code performance that is competitive with code-oriented LLMs, whether they are trained from scratch, such as StarCoder-15B and CrystalCoder, or fine-tuned like CodeLLaMA-7B.

4.3 SFT Results

Modern LLMs typically undergo supervised fine-tuning or instruction tuning after pretraining on vast amounts of unlabeled data. In this section, we aim to demonstrate that our expansion strategy can adapt to this widely used training pipeline, just as traditional LLMs do.

Table 3 presents a comparison of evaluation results among several prominent supervised fine-tuning (SFT) LLMs from the LLaMA community, across general tasks, math tasks, and code tasks benchmarks. As a singular SFT model, LLaMA-Pro attains state-of-the-art performance, even when compared to specifically tuned models such as WizardCoder and WizardMath. This demonstrates its more comprehensive capabilities.

Model MT Bench
Alpaca-13B 4.53
CodeLLaMA-7B-Instruct 5.71
Vicuna-7B 6.17
LLaMA2-7B-Chat 6.27
LLaMA-Pro 6.32

Table 4: GPT-4 automatic evaluation of Chatbot models. LLaMA-Pro outperforms widely used LLaMA community chatbots.

Model Interaction Turns 1 Interaction Turns 2 Interaction Turns 3 Interaction Turns 4 Interaction Turns 5 Avg.
AgentLM-7B 0.0 4.44 7.34 7.85 7.34 4.71
CodeLLaMA-7B-Instruct 0.34 4.27 8.70 12.12 13.99 7.37
LLaMA2-7B-Chat 1.02 12.63 5.29 10.24 14.68 5.77
Mistral-Instruct-v0.1 1.54 6.66 6.48 13.31 11.95 11.02
LLaMA-Pro 0.68 11.95 6.48 14.16 11.95 10.38

Table 5: In the tool-augmented reasoning assessments, we evaluate the model’s proficiency in integrating tools into its reasoning workflow. The model’s effectiveness is measured by its success rate across various stages of interaction.

As seen in Figure 1, LLaMA-Pro boosts both code and math tasks to state-of-the-art performances while maintaining reliable general performance. We enhance the average performance of LLaMA2-7B-chat and CodeLLaMA-7B-instruct by 13.81% and 14.50% respectively, which highlights the benefits of balancing textual and coding abilities.

To assess the comprehensive conversational performance of the LLaMA-Pro assistant, we evaluate it using the MT-Bench with GPT-4 automatic scoring, as proposed by Vicuna (Zheng et al., 2023). As depicted in Table 4, LLaMA-Pro surpasses widely used chatbots from the LLaMA community. This indicates its potential as a chatbot capable of providing helpful responses, in addition to its impressive performance in traditional benchmarks. The details of MT-Bench can be found in the Appendix C.

We use MINT-Bench (Wang et al., 2023c) to evaluate our model’s ability to solve multi-turn interactions by using tools. MINT-Bench tests LLMs’ ability to use tools by generating and executing Python code, focusing on tool-augmented task-solving and leveraging natural language feedback. MINT includes eight datasets covering reasoning, code generation, and decision-making. The details of MINT can be found in the Appendix B. The results are shown in Table 5. LLaMA-Pro achieves SOTA performance compared to similar size models in multi-turn interactions with the use of tools.

4.4 Ablation Study

We evaluate various training strategies, including LoRA, fine-tuning, and the block expansion training approach that we propose, using the TRACE benchmark (Wang et al., 2023b). TRACE is designed to assess continual learning in LLMs and comprises eight distinct datasets that span challenging tasks such as domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning. We assess the ability of different strategies to retain the model’s existing knowledge while incorporating new skills. Details are provided in the Appendix D.

We employ Overall Performance (OP (Chaudhry et al., 2018)) and Backward Transfer (BWT (Lopez-Paz and Ranzato, 2017)) scores as evaluation metrics. After incrementally learning the t-th task, the model's score on the i-th task (where \(i \le t\)) is denoted as \(R^{D}_{t,i}\). The OP and BWT scores are calculated

Model Technique Overall Performance (OP) Backward Transfer (BWT)
LoRA 37.1 -17.3%
SeqFT 45.5 -14.7%
Block Expansion 46.5 -14.3%

Table 6: Performance comparison of various training strategies on the TRACE benchmark following their continual learning phase with LLaMA2-7B. The table presents the Overall Performance (OP) and Backward Transfer (BWT) scores for each strategy, demonstrating the superior adaptability of the proposed block expansion training approach.

Figure 5: Training loss with varying added blocks and mixture-of-expert (MoE) expansion.

using the following formulas:
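(Standard definitions, following Chaudhry et al., 2018, and Lopez-Paz and Ranzato, 2017:)

\[\text{OP}_t = \frac{1}{t} \sum_{i=1}^{t} R^{D}_{t,i}\]

\[\text{BWT}_t = \frac{1}{t-1} \sum_{i=1}^{t-1} \left( R^{D}_{t,i} - R^{D}_{i,i} \right)\]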

Table 6 presents the performance of different strategies on the TRACE benchmark following their continual learning phase with LLaMA2-7B. The results show that block expansion training exhibits superior task-specific adaptability compared to sequential fine-tuning and LoRA, as evidenced by its better OP and BWT scores.

Apart from the aspect of code corpus, we explore our method on another domain: law, with the freelaw subset of Pile dataset as our pretrain corpus (Gao et al., 2020). We evaluate on UNFAIR-ToS (Lippi et al., 2019) of the LexGLUE benchmark (Chalkidis et al., 2021). The details can be found in the Appendix E. In our experiment, we assess the scalability of our block expansion method in terms of training loss and downstream task performance as we increase the number of added blocks. We also compare our method with the Mixture-of-Expert (MoE) expansion method (Fedus et al., 2022).

We first examine the training loss with varying added blocks. As seen in Figure 5, the training loss of the models consistently decreases as training progresses, regardless of the number of added blocks. Moreover, the loss decreases more rapidly as we increase the size of the model. These findings suggest that our method exhibits strong scalability with larger models and more data. The training loss of MoE is comparable to our method with four added blocks.

Performance Metrics Across Various Tasks and Block Additions

Block Addition Law Task ARC HellaSwag MMLU TruthfulQA Winogrande Avg. Unfair-ToS
Add 1 Block 52.30 77.92 26.12 77.89 38.62 39.62 37.30 41.74
Add 2 Block 53.16 77.91 77.89 38.62 39.62 37.30 41.74 41.35
Add 4 Block 52.39 76.92 23.12 39.10 37.80 38.92 40.53 39.83
Add 8 Block 52.90 76.63 39.83 22.52 39.03 73.16 73.01 72.22
Add 16 Block 51.88 76.59 40.13 71.82 72.77 72.23 47.20 72.38
Add 32 Block 50.77 76.72 40.13 72.23 47.20 72.38 55.96 56.52
Avg. 61.71 63.05 58.59 65.91 65.76 65.23 61.92 15.08
Mixture-of-Expert (MoE) 51.45 - - - - - - -
Prefix Stacking (8 Block) 27.82 - - - - - - -
Suffix Stacking (8 Block) 52.56 - - - - - - -

This table displays performance metrics across various tasks with different block additions in a model. The table includes scores for tasks like Law, ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and Unfair-ToS, as well as average scores. Additional techniques like Mixture-of-Expert (MoE), Prefix Stacking, and Suffix Stacking are also included.

Table 7: Evaluation results for different numbers of added blocks, alongside Mixture-of-Expert (MoE), prefix-stacking, and suffix-stacking alternatives, on general language tasks and the domain-specific Unfair-ToS task.

Figure 6: By fine-tuning both LLaMA2-7B and LLAMA PRO using the same instruction dataset, LLAMA PRO consistently outperforms LLaMA2-7B across all tasks. This result highlights the effectiveness of our method, as it demonstrates that LLAMA PRO successfully encodes more domain knowledge during the pretraining process.

However, a lower overall training loss does not necessarily guarantee superior performance on domain-specific tasks. Therefore, we evaluate models of different sizes on both general language tasks and Unfair-ToS, as shown in Table 7. All the expanded models effectively preserve the general capabilities of the initial model. For the domain-specific task, larger models achieve better performance. We find that adding eight blocks provides optimal performance with minimal cost compared to larger models, hence we adopt this as our default strategy.

We also analyze the impact of the position where the identity blocks are added, either at the bottom or the top of the model, compared to adding them interleaved, as shown in Table 7. We observe that adding blocks at the bottom results in poor evaluation performance, likely because it disrupts the model’s foundation, causing errors to propagate throughout the model. Adding blocks at the top of the model (Gong et al., 2019) preserves the initial model’s performance, but its performance on domain-specific tasks is lower than when adding blocks interleaved.

As highlighted in the LIMA study (Zhou et al., 2023), the majority of knowledge in large language models is acquired during pretraining, with only a limited amount of instruction tuning data required to generate high-quality output. To investigate the extent of knowledge encoded during pretraining, we conducted a comparative analysis between LLaMA2-7B and LLAMA PRO using the same instruction dataset, as illustrated in Figure 6. Our results showed that LLAMA PRO consistently outperforms LLaMA2-7B across all tasks, indicating that our method effectively enables LLAMA PRO to encode more domain-specific knowledge during the pretraining phase.

5 Conclusion

In this study, we introduced a novel block expansion method for Large Language Models (LLMs) post-pretraining, aiming to enhance domain-specific abilities while preserving the original general capabilities. Our approach effectively balances the model’s performance across both general and domain-specific tasks. We demonstrated the effectiveness of our method through LLAMA PRO, an LLM initialized from LLaMA2-7B with 8 added blocks, which outperformed other LLaMA-series models on comprehensive benchmarks.

The work highlights the importance of balancing general and domain-specific abilities in LLMs and offers a promising approach to achieving this balance. Future research could explore broader applications of our block expansion method in other domains, for instance, it is an important task for multimodal large language models (Ge et al., 2023; Bai et al., 2023) to preserve the original language ability.
