
Fusion | Knowledge Fusion of Large Language Models

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-02-03

Knowledge Fusion of Large Language Models

  • url: https://arxiv.org/abs/2401.10491
  • pdf: https://arxiv.org/pdf/2401.10491
  • model: https://huggingface.co/Wanfq/FuseLLM-7B
  • abstract: While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures–Llama-2, MPT, and OpenLLaMA–across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at url.

[Index tags: Knowledge Fusion, knowledge merging, and related terms]


Contents

TL;DR


Knowledge Fusion of LLMs (Large Language Models)

  1. Background and problem definition: proposes LLM fusion as a way to address the development cost and environmental impact of building LLMs from scratch.
  2. Comparison with prior work: unlike existing model ensembling and weight-merging methods, it fuses the knowledge of LLMs with different architectures.
  3. Proposed method: FUSELLM fuses the probability distributions of multiple source LLMs and transfers their knowledge into a target LLM.

[Findings]

  • Probability-distribution fusion: a new approach that integrates the generative probability distributions of diverse source LLMs and transfers the fused knowledge to a target LLM.
  • Token alignment and fusion functions: aligns the tokenization schemes of different LLMs and proposes two new fusion functions for probability distributions (minimum cross-entropy and cross-entropy-weighted averaging).

[Claims]

  • Performance gains: FUSELLM outperforms existing models on numerous benchmarks, in particular surpassing ensembling and weight-merging methods.
  • Overcoming architectural constraints: enables fusion across LLMs with different architectures, allowing more flexible model integration and reuse.
  • Efficient knowledge integration: effectively transfers the collective knowledge of the source LLMs to the target LLM, maintaining strong performance even when the training corpus is limited in size.

[Evidence]

  • Experimental results: on three representative benchmarks (BBH, CS, ME), FUSELLM improves over the original Llama-2 model by an average of 5.16%, 1.25%, and 6.36%, respectively.
  • Effective fusion mechanism: tracks the performance gains driven by the fused probability distributions and demonstrates the effectiveness of token alignment and the fusion functions.

1. Introduction

With the success of LLMs such as the GPT and LLaMA series on natural language processing (NLP) tasks, companies increasingly see a strategic need to develop their own LLMs. However, LLM development requires massive training datasets, advanced techniques, enormous computational resources, and skilled personnel, making it very costly. Instead of training from scratch, an alternative is to combine existing LLMs into a new, more powerful model. This is called knowledge fusion.

Existing Model Fusion Methods

Existing approaches to model fusion fall into (1) ensemble methods and (2) weight merging.

  • (1) Ensemble methods combine the outputs of multiple models to improve performance, but they require maintaining several trained models and running each one independently at inference time, which can be impractical in many use cases.
  • (2) Weight merging, by contrast, merges multiple neural networks at the parameter level; it can scale, but it only applies to models that share the same architecture.

The Proposed FUSELLM Method

This paper explores fusing LLMs from a probability-distribution perspective. For a given input text, it assumes that the probability distributions generated by different source LLMs reflect their inherent knowledge of that text. FUSELLM therefore leverages the generative distributions of the source LLMs to transfer their knowledge into a target LLM. To this end, it develops a new strategy for aligning the tokenizations of different LLMs and explores two methods for fusing the probability distributions they generate. FUSELLM focuses on minimizing the divergence between the target LLM's probability distributions and those of the source LLMs.

Experimental Setup and Results

Using three representative open-source LLMs, Llama-2, OpenLLaMA, and MPT, as source models, the effectiveness of FUSELLM is demonstrated empirically. Across three benchmarks covering 42 tasks, the target model outperforms the source models and the baseline model on most tasks.


2. Related Work

Model Fusion

The traditional technique for model fusion is model ensembling, which combines the outputs of multiple models to improve overall system performance. Recently, an ensemble framework was introduced that leverages the diverse strengths of multiple open-source LLMs and combines their top-ranked candidate outputs.

Weight merging is another approach that fuses models at the parameter level, merging the weights of models with identical structures to improve performance. However, weight merging is limited to models that share the same architecture.

Knowledge Distillation

Knowledge distillation trains a student model under the guidance of one or more teacher models. In NLP it has mainly been applied to text classification, training the student to replicate the teacher's output distribution or intermediate-layer features. This paper shares a framework similar to multi-teacher knowledge distillation, but there is no constraint on the size of the target model, and after fusion the target model is expected to outperform the source models.


3. Knowledge Fusion of LLMs

3.1 Preliminaries

The main goal of LLM fusion is to externalize the knowledge embedded in multiple source LLMs and integrate it into a target LLM. Given \(K\) source LLMs \(\{M_s^j\}_{j=1}^K\), the next-token probability distributions predicted by each model are assessed, and the most accurate predictions are used to continually train the target LLM \(M_t\).

The causal language modeling (CLM) objective maximizes the likelihood of a given sequence \(t = (t_1, t_2, \ldots, t_N)\) under the language model parameters \(\theta\).

\[L_{CLM} = -\sum_{i=1}^N \log P_\theta(t_i \mid t_{<i})\]

This objective decomposes the sequence likelihood into token-level cross-entropy losses, comparing each token's predicted distribution with its one-hot representation.

3.2 Fusion of LLMs

A language model's probability distribution matrix is taken to reflect its inherent knowledge of the text. Distribution matrices produced for the same text by different LLMs therefore represent the diverse knowledge embedded in each model. The proposed FUSELLM approach fuses LLMs via probabilistic modeling, merging the probability distributions of the source LLMs to create a single unified LLM.

For each text in the corpus \(C\), the \(K\) source LLMs are applied to obtain probability distribution matrices \(\{P_{\theta_j}\}_{j=1}^K\). These matrices are then aligned and fused into a single compact representation \(P_t\).

\[P_t = \text{Fusion}(\{P_{\theta_j}\}_{j=1}^K)\]

To transfer the capabilities of the source LLMs to the target LLM, alignment is enforced between the target LLM's predictions \(Q_t\) and the fused representation matrix \(P_t\). The fusion objective is:

\[L_{Fusion} = D_{KL}(Q_t \| P_t)\]

The overall objective for continual training is a weighted combination of the CLM objective \(L_{CLM}\) and the fusion objective \(L_{Fusion}\):

\[L_{total} = \lambda L_{CLM} + (1 - \lambda)L_{Fusion}\]

3.3 Implementation of FUSELLM

Token alignment: ensuring token alignment across multiple LLMs is crucial for knowledge fusion. A minimum edit distance (MinED) strategy maps tokens produced by different tokenizers based on MinED, which raises the alignment success rate.

Fusion strategies: to combine the collective knowledge of the source LLMs while preserving their unique strengths, each LLM's prediction quality is evaluated and its distribution matrix is weighted accordingly. Two fusion functions are introduced: minimum cross-entropy (MinCE) and cross-entropy-weighted averaging (AvgCE).

Algorithm 1 describes the complete FUSELLM procedure.


4. Experiments

4.1 Experimental Setup

Dataset for continual training: MiniPile is used. It consists of about 1 million documents and 1.8 billion tokens, less than 0.1% of Llama-2's 2 trillion training tokens.

Training details: the Llama-2 7B target LLM is trained with a batch size of 128 and a maximum length of 2048 on a single node with 8 NVIDIA A100 GPUs. Training is implemented on top of Huggingface Transformers and FlashAttention and takes about 33 hours for a single epoch.

Evaluation: three benchmarks are used: Big-Bench Hard (BBH), Common Sense (CS), and MultiPL-E (ME).

4.2 Overall Results

Table 1 presents the overall results of FUSELLM and the baseline methods on BBH. Across all 27 tasks, FUSELLM improves average relative performance over the original Llama-2 by 5.16%.

Table 2 shows the zero-shot performance of FUSELLM and the baselines on the CS benchmark, where FUSELLM improves relative performance over Llama-2 by 1.25%.

Table 3 reports the zero-shot performance of FUSELLM on the ME benchmark, an average improvement of 6.36% over Llama-2.

4.3 Fused Probability Distributions

The effectiveness of the fused probability distributions obtained from multiple LLMs is examined, and the trend of performance improvement is tracked over training. Figure 2 compares the few-shot CoT performance of Llama-2 CLM and FUSELLM on BBH; FUSELLM improves exact-match (EM) accuracy by 2.5%.

4.4 Analysis of the Implementation

  • Number of source LLMs: Table 4 presents results for fusing different numbers of LLMs, showing that FUSELLM's performance improves as the number of models increases from 1 to 3.
  • Token alignment criteria: Table 5 (upper) compares the two alignment criteria, showing that the MinED method consistently outperforms the EM method.
  • Fusion function: Table 5 (lower) compares the two fusion functions, showing that FUSELLM with MinCE consistently outperforms AvgCE on all benchmarks.

4.5 FUSELLM vs. Knowledge Distillation

Table 6 compares FUSELLM with traditional knowledge distillation, showing that FUSELLM achieves better performance by integrating three 7B models with different architectures.

4.6 FUSELLM vs. Ensemble/Merging

Table 7 presents the perplexity of FUSELLM and the other fusion methods on the test sets, showing that FUSELLM exploits collective knowledge more effectively than ensembling and weight merging.


1 INTRODUCTION

With the continuous success of large language models (LLMs) such as GPT (Brown et al., 2020) and LLaMA (Touvron et al., 2023) series across a wide range of natural language processing (NLP) tasks, it has become a strategic imperative for corporations to create their own LLMs. However, the costs associated with LLM development are astronomical. In addition to requiring vast amounts of training data, advanced techniques, substantial computational resources, and skilled labor, the development process also exerts significant pressure on energy consumption and the environment (Rillig et al., 2023). While these LLMs exhibit structural and functional differences, they share similar capabilities across a spectrum of NLP tasks. Consequently, beyond the traditional approach of training an LLM from scratch, an alternative option is to combine existing LLMs into a new, more powerful one, which is termed knowledge fusion of LLMs in this paper. If successful, this fusion not only cuts the cost of initial training but also allows the integrated model to benefit from the strengths of all the LLMs. This new model can also be fine-tuned and adapted for various downstream tasks. Moreover, the fusion can also happen among fine-tuned LLMs that specialize in a specific task.

The endeavor to integrate the capabilities of multiple models has been a long-standing pursuit. For example, ensemble methods (Littlestone & Warmuth, 1994; Jiang et al., 2023) directly aggregate the outputs of different models to enhance prediction performance and robustness. However, this approach requires maintaining multiple trained models and executing each during inference, which is impractical for LLMs due to their substantial memory and inference time requirements. Likewise, this approach doesn’t facilitate fine-tuning, which is essential for many LLMs. Another approach is to directly merge several neural networks into a single network through parameter-wise arithmetic operations (Wortsman et al., 2022; Jin et al., 2022). This approach typically assumes uniform network architectures and attempts to establish mappings between the weights of distinct neural networks, which is often unattainable in the context of LLMs. Moreover, weight merging may lead to suboptimal results when substantial differences exist in the parameter space (Li et al., 2022).

Figure 1: Illustration of conventional model fusion techniques (ensemble and weight merging) and our knowledge fusion approach for LLMs (FUSELLM). Different animal icons represent different LLMs, with various species denoting LLMs possessing differing architectures. FUSELLM externalizes the knowledge from multiple LLMs and transfers their capabilities to a target LLM.

In this paper, we explore the fusion of LLMs from a probabilistic distribution perspective. For an input text, we argue that the probabilistic distributions generated by different source LLMs can reflect their inherent knowledge in understanding this text. Therefore, the proposed FUSELLM leverages the generative distributions of source LLMs to externalize both their collective knowledge and individual strengths and transfer them to the target LLM through lightweight continual training. To achieve this, we develop a new strategy for aligning tokenizations originating from different LLMs and explore two methods for fusing the probability distributions generated by these diverse LLMs. During the continual training, FUSELLM places significant emphasis on minimizing the divergence between the target LLM’s probabilistic distributions and those of the source LLMs.

To empirically demonstrate the effectiveness of FUSELLM, we examine a challenging yet general scenario of LLMs fusion, where the source models share minimal commonalities. Specifically, we focus on three popular open-source LLMs that possess distinct architectures and functionalities: Llama-2 (Touvron et al., 2023), OpenLLaMA (Geng & Liu, 2023), and MPT (Team, 2023). Evaluations across three benchmarks, which consist of a total of 42 tasks spanning reasoning, commonsense, and code generation, confirm that the target model trained by our method outperforms each source LLM and the baseline in most tasks. Moreover, we simulate the existence of functionally distinct LLMs with identical architecture by continually training a single base model on several domain-specific corpora. When evaluated based on perplexity, our method demonstrates superior potential in combining the capabilities of these structurally identical LLMs compared to traditional ensemble and weight merging methods.

To sum up, this paper explores a novel challenge called LLMs fusion, with the goal of creating a unified model that effectively utilizes the collective capabilities and unique strengths of diverse LLMs. Illustrated in Figure 1, our proposed approach distinguishes itself from traditional ensemble and weight merging techniques by prioritizing the fusion of multiple LLMs through knowledge externalization and transfer. This study yields several findings that may spark future research. Firstly, while we demonstrate the effectiveness of our method through lightweight continual training on a compact, high-quality corpus, the thoughtful selection of the training corpus can be a crucial consideration, particularly with regard to its relevance to downstream tasks. Secondly, in scenarios where the capabilities of source LLMs vary significantly, the fusion function appears to be crucial in effectively combining their respective strengths. Lastly, when compared to traditional model ensemble and merging techniques, the field of LLMs fusion appears to be a more promising avenue for exploration, especially in light of the diverse structures and substantial model sizes of LLMs.

Model Fusion The integration of capabilities from diverse models has been a long-standing objective, with existing approaches mainly falling into two categories. Firstly, the traditional technique of model ensemble combines the outputs of multiple models to enhance overall system performance (Littlestone & Warmuth, 1994; Sagi & Rokach, 2018). Note that this technique doesn’t involve the explicit merging of multiple models into a new one. Common methods for model ensemble typically employ weighted averaging (Littlestone & Warmuth, 1994) or majority voting (Monteith et al., 2011) to consolidate predictions from various models. Recently, Jiang et al. (2023) introduced an ensemble framework designed to leverage the diverse strengths of multiple open-source LLMs. This framework first employs a pairwise comparison method to detect subtle distinctions among candidate outputs. Then, it combines the top-ranked candidates to produce an enhanced output, capitalizing on their strengths while mitigating their weaknesses.

Secondly, weight merging presents another approach that facilitates model fusion at the parameter level. Gupta et al. (2020) and Wortsman et al. (2022) merged weights from models with identical structures, obtained through different strategies or configurations, to achieve improved overall performance. Similarly, Cha et al. (2021), Rame et al. (2022), and Arpit et al. (2022) explored weighted averaging of models derived from different configurations to enhance out-of-distribution generalization. Furthermore, Jin et al. (2022) merged models designed for specific domains or tasks to create a generalist capable of addressing all domains or tasks. Going beyond parameter merging of entire models, Wang et al. (2022b), Huang et al. (2023), and Zhang et al. (2023) applied linear mathematical operations to adapter parameters to achieve superior generalization performance.

In a nutshell, while model ensemble requires the parallel deployment of multiple models, weight merging is generally limited to models with identical architectures. In contrast, the approach proposed in this paper supports the fusion of multiple LLMs with diverse architectures by explicitly transferring their knowledge and capabilities to a target LLM.

Knowledge Distillation Knowledge distillation (Hinton et al., 2015), initially proposed for model compression, involves training a student model under the guidance of one or more teacher models. In the NLP community, knowledge distillation has been widely applied to text classification tasks. These applications include training the student model to replicate the teacher’s output distribution (Sanh et al., 2019; Turc et al., 2019), as well as features (Sun et al., 2019; Jiao et al., 2020) and relations (Wang et al., 2020) derived from intermediate layers of the teacher model. In the realm of text generation, the conventional approach focuses on minimizing the KL divergence between the student and teacher generation distributions. This is achieved by using the teacher’s probability distributions at each time step as supervision (Khanuja et al., 2021; Gu et al., 2023; Agarwal et al., 2023) or by directly training on the teacher’s generated texts (Peng et al., 2023; Xu et al., 2023).

While our method shares a framework similar to multi-teacher knowledge distillation, there are two significant distinctions. First, in traditional knowledge distillation, the student models are typically constrained to be smaller in size than the teachers. In our scenario, however, there are no limitations on the size of the target model. Second, traditional knowledge distillation often results in the student models lagging behind the teachers in performance after distillation. In contrast, we anticipate that after the fusion, the target model will surpass any of the source models in performance.

3 KNOWLEDGE FUSION OF LLMS

The primary objective of LLMs fusion is to externalize the collective knowledge embedded within multiple source LLMs and integrate their capabilities into a target LLM. Given K source LLMs \(\{M_s^j\}_{j=1}^K\) with varying architectures, each having undergone individual pre-training or fine-tuning on distinct datasets, the key idea behind our approach is to initially stimulate LLMs to manifest their inherent knowledge by challenging them to predict the next token. The probabilistic distributions of these predictions are thoroughly assessed, and the most accurate predictions are utilized to continually train the target LLM \(M_t\) on a corpus C using the causal language modeling objective. In the following sections, we start with a brief introduction to the preliminaries, followed by a detailed explanation of our LLMs fusion framework. Finally, we delve into the implementation details.

3.1 PRELIMINARIES

The above objective decomposes sequence likelihood into token-level cross-entropy losses, comparing each token's predicted distribution to its one-hot representation. To provide a more generalized perspective, we reframe this token-level view into a sequential distribution format. Specifically, for the text sequence t, we aggregate the token-level predictions and create a probabilistic distribution matrix, \(P_\theta^t \in \mathbb{R}^{N \times V}\), where the i-th row represents the distribution predicted by the model for the i-th token over a vocabulary of size V. The CLM objective can then be interpreted as reducing the discrepancy between \(P_\theta^t\) and the one-hot label matrix, \(O^t \in \{0, 1\}^{N \times V}\), where each row is a one-hot representation of the corresponding gold token. Formally, the CLM objective is transformed into the following representation:

\[L_{CLM} = \mathbb{E}_{t \sim C}\big[\mathbb{D}(P_\theta^t, O^t)\big]\]

where \(\mathbb{D}(\cdot,\cdot)\) denotes the discrepancy between the two matrices (here, the token-level cross-entropy).
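To make the matrix view concrete, here is a minimal sketch of building \(P_\theta^t\) and the token-level CLM loss for one text. It is an illustration only: the "gpt2" checkpoint stands in for any Hugging Face causal LM, and nothing here is the paper's own code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM (e.g., a Llama-2 checkpoint) plays the same role.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Knowledge fusion transfers capabilities across models."
ids = tok(text, return_tensors="pt").input_ids          # [1, N]

with torch.no_grad():
    logits = model(ids).logits                          # [1, N, V]

# Rows of P are the model's next-token distributions: the matrix P_theta^t above.
P = F.softmax(logits[:, :-1], dim=-1)                   # [1, N-1, V]
labels = ids[:, 1:]                                     # gold next tokens (one-hot targets)

# Token-level cross-entropy against the one-hot label matrix = the CLM loss.
L_clm = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                        labels.reshape(-1))
```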

3.2 LLMS FUSION

Taking this perspective on a language model, we argue that the probabilistic distribution matrix can reflect its certain inherent knowledge in understanding the text. Consequently, different probabilistic distribution matrices for the same text, originating from various LLMs, can be used to represent the diverse knowledge embedded within these models. Acknowledging this, the proposed FUSELLM approach tackles LLMs fusion through probabilistic modeling, aiming to create a unified LLM by merging the probabilistic distributions of the source LLMs. To achieve this, when starting with a set of LLMs to fuse, FUSELLM undergoes lightweight continual training of the target LLM on a raw text corpus that mirrors the pre-training dataset. Instead of relying solely on the CLM objective, FUSELLM places significant emphasis on minimizing the divergence between the target LLM’s probabilistic distributions and those of the source LLMs.

For each text in the corpus C, we apply the provided K source LLMs and obtain a set of probabilistic distribution matrices, denoted as \(\{P_{\theta_j}\}_{j=1}^K\), where \(\theta_j\) represents the parameters of the j-th LLM. Utilizing these matrices, we externalize the knowledge from individual models into a unified space, essentially creating unified probabilistic representations over the text. We acknowledge that variances in vocabulary among the source LLMs can lead to misaligned matrices \(\{P_{\theta_j}\}_{j=1}^K\). To address this, we employ a token alignment strategy, which is explained in Section 3.3, to foster more coherent probabilistic interpretations across models.

Having aligned the probabilistic matrices, we proceed to fuse them into a single compact representation. Various fusion strategies can be applied for this purpose, as detailed in Section 3.3. We use \(P_t\) to represent the fused representation matrix as follows:

\[P_t = \text{Fusion}(\{P_{\theta_j}\}_{j=1}^K)\]

where \(\text{Fusion}(\cdot)\) denotes the function that combines multiple matrices, and the resulting matrix \(P_t\) is seen as a representation of the collective knowledge and distinctive strengths of the source LLMs.

To transfer the capabilities of source LLMs to the target LLM, we enforce alignment between the target LLM's predictions and the fused representation matrix \(P_t\). We use \(Q_t\) to represent the output distribution matrix of the target LLM for text t, and then define our fusion objective as follows:

\[L_{Fusion} = D_{KL}(Q_t \,\|\, P_t)\]

The overall objective for our continual training consists of a weighted combination of the causal language modeling objective \(L_{CLM}\) and the fusion objective \(L_{Fusion}\) as follows:

\[L_{total} = \lambda L_{CLM} + (1 - \lambda) L_{Fusion}\]

3.3 IMPLEMENTATION OF FUSELLM

In this section, we present the implementation details of token alignment and the fusion function for fusing different LLMs in our FUSELLM method.

Token Alignment Ensuring token alignment across multiple LLMs is crucial for effective knowledge fusion, as it guarantees proper mapping of probabilistic distribution matrices. Fu et al. (2023) employed dynamic programming to recursively minimize the total cost of editing one sequence of tokens to match the other. If a one-to-one mapping exists between two tokens, the corresponding distributions are perfectly mapped. Otherwise, the mapped distribution degenerates into a one-hot vector. Since tokens generated by different tokenizers for the same sequence typically exhibit limited differences, we propose to enhance the success rate of token alignment by replacing the exact match (EM) constraint in Fu et al. (2023) with a minimum edit distance (MinED) strategy, which maps tokens from different tokenizers based on MinED. This relaxation of token alignment helps preserve substantial information in the distribution matrices while introducing minor errors. For more details of the token alignment, please refer to Appendix A.
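The following is a minimal sketch of the MinED relaxation described above: identical tokens are aligned exactly, and non-identical pairs are still mapped when their edit distance is small. The helper names and the `max_dist` threshold are illustrative assumptions, not the authors' implementation.

```python
from difflib import SequenceMatcher

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def align_tokens(src_tokens, tgt_tokens, max_dist=2):
    """Map source token positions onto target token positions.

    Exact matches (the EM criterion) are always kept; non-identical pairs are
    accepted if their edit distance is small (the MinED relaxation), so their
    distributions can still be mapped instead of degenerating to one-hot vectors.
    """
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    mapping = {}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":                               # identical tokens: perfect map
            mapping.update(zip(range(i1, i2), range(j1, j2)))
        elif tag == "replace":                           # near matches: MinED criterion
            for i, j in zip(range(i1, i2), range(j1, j2)):
                if edit_distance(src_tokens[i], tgt_tokens[j]) <= max_dist:
                    mapping[i] = j
    return mapping

# e.g., align_tokens(["it", "get", "s"], ["it", "gets"]) maps "get" onto "gets".
```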

Fusion Strategies To combine the collective knowledge of source LLMs while preserving their unique strengths, it is essential to evaluate the quality of different LLMs and assign varying levels of importance to their respective distribution matrices. For this purpose, when dealing with text t, we utilize the cross-entropy loss between the distribution matrices and the gold labels as an indicator of the prediction quality of the LLMs (Marion et al., 2023). A lower cross-entropy score for a source LLM signifies a more accurate understanding of the text, and its prediction should be accorded greater significance. Based on this criterion, we introduce two fusion functions: (1) MinCE: This function outputs the distribution matrix with the minimum cross-entropy score; (2) AvgCE: This function produces a weighted average of the distribution matrices based on cross-entropy scores.
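Below is a small sketch of the two fusion functions, assuming the distribution matrices have already been token-aligned to a shared vocabulary. The exact AvgCE weighting (a softmax over negative cross-entropy) is an assumption of this sketch; the paper only states that the weights are based on cross-entropy scores.

```python
import torch
import torch.nn.functional as F

def fuse_distributions(dist_mats, labels, mode="MinCE"):
    """dist_mats: list of [N, V] probability matrices from the source LLMs
    (already token-aligned); labels: [N] gold token ids. Returns fused [N, V]."""
    # Cross-entropy of each source LLM against the gold tokens (lower = better).
    ces = torch.stack([F.nll_loss(p.clamp_min(1e-12).log(), labels) for p in dist_mats])
    stacked = torch.stack(dist_mats)                     # [K, N, V]
    if mode == "MinCE":                                  # keep the single best matrix
        return stacked[ces.argmin()]
    # AvgCE: weight matrices so that lower cross-entropy gets higher weight.
    weights = F.softmax(-ces, dim=0)                     # one plausible weighting scheme
    return (weights[:, None, None] * stacked).sum(dim=0)
```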

The complete process of the FUSELLM method is described in Algorithm 1.

Algorithm 1 (FUSELLM for LLMs fusion). Require: source LLMs \(\{M_s^j\}_{j=1}^K\) and corpus C. Initialize the target LLM \(M_t\) with one of the source LLMs; then, for each text t in C, obtain the source distribution matrices, align and fuse them into \(P_t\), and update \(M_t\) with the combined objective \(L_{total}\).
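Pulling these pieces together, the sketch below shows roughly what one continual-training step looks like. Here `fused_dists` stands for the pre-computed, aligned, and fused matrix \(P_t\) for the batch; the divergence direction and the default λ = 0.9 are illustrative choices, not the reference implementation.

```python
import torch.nn.functional as F

def fusellm_step(target_model, input_ids, fused_dists, lam=0.9):
    """One continual-training step: CLM loss on gold tokens plus a fusion loss
    pulling the target distribution Q_t toward the fused source matrix P_t."""
    logits = target_model(input_ids).logits[:, :-1]          # target predictions Q_t
    labels = input_ids[:, 1:]
    log_q = F.log_softmax(logits, dim=-1)

    # Causal LM loss against the gold next tokens (one-hot targets).
    l_clm = F.nll_loss(log_q.reshape(-1, log_q.size(-1)), labels.reshape(-1))

    # Fusion loss: divergence between the fused matrix P_t and Q_t.
    p = fused_dists.clamp_min(1e-12)                         # [B, N-1, V], pre-aligned
    l_fusion = (p * (p.log() - log_q)).sum(-1).mean()

    loss = lam * l_clm + (1 - lam) * l_fusion                # L_total
    loss.backward()
    return loss.item()
```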

4 EXPERIMENTS

In our experiments, we consider a general but challenging scenario of LLMs fusion where the source models share minimal commonalities in architectures or functionalities. Specifically, we conduct experiments on the 7B scale and select three representative open-source models: Llama-2, OpenLLaMA, and MPT as the source LLMs for fusion. Regarding the target LLM, we opt for another Llama-2 7B, which is generally the most robust one among the three source LLMs. The target LLM starts with the same pre-trained weights as its source counterpart but differs in that it updates parameters during training. To evaluate the performance of FUSELLM, we conduct experiments on benchmarks assessing the capabilities of LLMs in reasoning, commonsense, and code generation.

4.1 EXPERIMENTAL SETUP

Dataset for continual training To continually train the target LLM for LLMs fusion, it is essential to have a compact yet diverse training dataset. We have chosen MiniPile, a meticulously curated dataset resulting from a thorough clustering and filtering process. MiniPile comprises approximately 1 million documents across 22 domains and 1.8 billion tokens, constituting less than 0.1% of the 2 trillion training tokens of Llama-2. More dataset details can be found in Appendix B.
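A minimal sketch of preparing such a corpus with the Hugging Face `datasets` library; the `JeanKaddour/minipile` Hub id and the tokenizer checkpoint are assumptions for illustration (gated models require access approval).

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("JeanKaddour/minipile", split="train")        # ~1M documents
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def tokenize(batch):
    # Truncate to the 2048-token context used for continual training.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)
```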

Fusion function For the fusion function, we use the minimum cross-entropy (MinCE). However, the impact of employing alternative fusion functions will be examined in Section 4.4.

Training details We train the target LLM of Llama-2 7B using a batch size of 128 and a maximum length of 2048 on a single node equipped with 8 NVIDIA A100 GPUs, each with 40GB of memory.

Our training framework is implemented based on the Huggingface Transformers (Wolf et al., 2020) and accelerated with FlashAttention (Dao et al., 2022). We empirically set the combination weight λ in Eq. 5 to 0.9. The training consists of only a single epoch, which takes approximately 33 hours. For further hyper-parameter details, please refer to Appendix C.

Evaluation We evaluate FUSELLM on three benchmarks that represent different core capabilities of LLMs, spanning reasoning, commonsense, and code generation.

• Big-Bench Hard (BBH) (Suzgun et al., 2022) is a benchmark to evaluate the general reasoning ability of LLMs. It contains 23 multiple-choice tasks and 4 free-form generation tasks from Big-Bench (Srivastava et al., 2022), which can be classified into four categories: algorithmic and arithmetic reasoning, natural language understanding, world knowledge, and multilingual knowledge and reasoning. We follow previous work (Wang et al., 2023b) to generate the predictions based on few-shot chain-of-thought (CoT) prompts and then calculate the exact match (EM) accuracy.
• Common Sense (CS) is a benchmark to evaluate the commonsense capability of LLMs. We consider 5 standard multiple-choice tasks: ARC easy and challenge (Clark et al., 2018), BoolQ (Clark et al., 2019a), HellaSwag (Zellers et al., 2019), and OpenBookQA (Mihaylov et al., 2018). We employ the lm-evaluation-harness (Gao et al., 2021) to conduct a likelihood-based zero-shot evaluation. Specifically, we select the option with the highest likelihood given the context and report the accuracy.
• MultiPL-E (ME) (Cassano et al., 2022) is a multilingual programming benchmark to assess the coding ability of LLMs. It is translated from the Python benchmark (Chen et al., 2021) into parallel datasets in 18 programming languages. We use the bigcode-evaluation-harness (Ben Allal et al., 2022) to perform zero-shot code generation in 10 popular programming languages in the HumanEval category and report the pass@1 (Chen et al., 2021) based on 20 generated samples for each question (a pass@k sketch follows this list).
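For reference, pass@1 from 20 samples per problem can be computed with the unbiased pass@k estimator of Chen et al. (2021); the sample counts in the usage line are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., pass@1 from 20 generated samples per problem, with a hypothetical 3 correct
score = pass_at_k(n=20, c=3, k=1)   # 0.15
```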

Baselines In our experiments, we compare our FUSELLM with two sets of baselines: (1) original LLMs, including Llama-2 7B, OpenLLaMA 7B, and MPT 7B; and (2) Llama-2 CLM: continually trained Llama-2 7B on MiniPile using only the causal language modeling objective.

4.2 OVERALL RESULTS

Table 1 presents the overall results of FUSELLM in comparison to the baseline methods on BBH. We can observe that the three source LLMs exhibit varying performance across the 27 BBH tasks, with Llama-2 generally outperforming the others. After continual training with a compact and diverse corpus, Llama-2 CLM shows a relative improvement of 1.86% compared to Llama-2, although this improvement is relatively modest and inconsistent across tasks. Overall, FUSELLM demonstrates an average relative performance gain of 5.16% over the original Llama-2 across all 27 tasks. In specific tasks, the enhancements achieved by FUSELLM are substantial (e.g., from 54.40 to 65.20 in the Hyperbaton task). In tasks such as Dyck Languages, where simple continual pre-training leads to a decline in performance, FUSELLM leverages the combined strengths of the individual source LLMs to recover performance improvements. Note that FUSELLM occasionally exhibits degraded performance on tasks such as Geometric Shapes and Word Sorting, which could be attributed to two reasons. First, the other source LLMs, apart from Llama-2, perform poorly on these tasks, affecting the fusion results. Second, the relevance between the continual training dataset and downstream tasks also contributes to the performance degradation.

Table 2 shows the zero-shot performance of FUSELLM and the baseline methods on the Common Sense (CS) benchmark. The results demonstrate that FUSELLM consistently surpasses the baselines across all five tasks, achieving a relative performance improvement of 1.25% over Llama-2. In contrast, Llama-2 CLM exhibits a marginal improvement, with only a 0.16% relative enhancement compared to Llama-2. Notably, substantial improvements from Llama-2 to FUSELLM are observed in the challenging ARC-challenge (2.40%) and OpenBookQA (2.71%) tasks, highlighting the effectiveness of FUSELLM in leveraging collective knowledge to address intricate problems.

For the code generation evaluation, the zero-shot performance of FUSELLM on the MultiPL-E (ME) benchmark is reported in Table 3. We observe that FUSELLM outperforms Llama-2 in 9 out of the 10 tasks, with a notable enhancement in the pass@1 score for specific programming languages such as R, increasing from 4.97 to 5.84. Given that both OpenLLaMA and MPT demonstrate remarkable performances in code generation tasks compared to Llama-2, the fusion result via FUSELLM achieves an average performance gain of 6.36%, which is considerably higher than the 1.37% improvement observed in Llama-2 CLM. However, it's important to note that FUSELLM still trails the original OpenLLaMA and MPT on these code generation tasks, which is likely due to the limited amount of code data in the continual training corpus.

[Table 1 rows: Boolean Expressions, Causal Judgement, Date Understanding, Disambiguation QA, Dyck Languages, Formal Fallacies, Geometric Shapes, Hyperbaton, Logical Deduction (3 objects), Logical Deduction (5 objects), Logical Deduction (7 objects), Movie Recommendation, Multistep Arithmetic Two, Navigate, Object Counting, Penguins in a Table, Reasoning about Colored Objects, Ruin Names, Salient Translation Error Detection, Snarks, Sports Understanding, Temporal Sequences, Tracking Shuffled Objects (3 objects), Tracking Shuffled Objects (5 objects), Tracking Shuffled Objects (7 objects), Web of Lies, Word Sorting, Avg. 27 Tasks]

Table 1: Overall results of FUSELLM and baselines in reasoning evaluations on Big-Bench Hard (BBH), where percentages indicate the rate of improvement/decrease compared to Llama-2.

Table 3: Overall results of FUSELLM and baselines in code generation evaluations on MultiPL-E (ME), where percentages indicate the rate of improvement/decrease compared to Llama-2.

¹ Since MiniPile lacks specific data percentages for individual domains, we approximate this by considering the percentage of the Github domain in The Pile.

4.3 THE FUSED PROBABILISTIC DISTRIBUTIONS

We investigate the effectiveness of the fused probabilistic distributions obtained from multiple LLMs and track the trend of performance improvement during the training process. Figure 2 illustrates the comparison of few-shot CoT performance between Llama-2 CLM and FUSELLM with varying scales of training data on BBH. Our observations reveal that FUSELLM enhances the exact match (EM) accuracy by 2.5% compared to Llama-2 CLM and achieves the best performance of Llama-2 CLM within 0.52 billion tokens. Notably, this represents a 3.9× reduction in token requirements compared to the 1.57 billion tokens needed by Llama-2 CLM. These results suggest that the probabilistic distributions derived from LLMs contain knowledge that is more readily learnable than the original text sequences, which accelerates the optimization process. This finding aligns with the observations in Hsieh et al. (2023). We further conduct an experiment to show that our performance improvement stems from the integration of knowledge from multiple LLMs rather than solely from continual training. The results and analysis are shown in Appendix G.

Figure 2: Effect of the fused distributions in accelerating the optimization process on BBH, where the x-axis denotes the number of training tokens and the y-axis denotes the exact match accuracy.

4.4 ANALYSIS OF IMPLEMENTATION PROCESS

In this section, we delve into the crucial elements of FUSELLM’s implementation, including the number of source LLMs, the criteria for token alignment, and the choice of the fusion function.

Number of source LLMs. In Table 4, we present the results of fusing different numbers of LLMs. We note that the performance of FUSELLM demonstrates apparent improvement as the number of models increases from 1 to 3. Nevertheless, the benefits of integrating additional models exhibit variations across benchmarks. Remarkably, a consistent performance improvement is observed in BBH. Whereas in CS or ME, the advantages are more prominent when fusing two models. This phenomenon may be attributed to the considerable performance differences among the three models on various tasks in BBH, while the performance differences in tasks of CS or ME are relatively smaller.

Model                  BBH
OpenLLaMA              33.87
MPT                    33.38
Llama-2                39.70
Llama-2 CLM            40.44 (+1.86%)
Llama-2 + OpenLLaMA    41.00 (+3.27%)
Llama-2 + MPT          41.16 (+3.68%)
FUSELLM                41.75 (+5.16%)

Table 4: Results of FUSELLM by incorporating varying numbers of models.

Criteria for token alignment. During the fusion of LLMs, ensuring the proper alignment of tokens and vocabularies from multiple models is of paramount importance. In Table 5 (upper), we present a comparison of two alignment criteria. It is evident that the proposed MinED method, which is based on minimum edit distance, consistently outperforms the EM method introduced by Fu et al. (2023), which relies on exact matching. We suggest that this performance enhancement results from MinED's ability to relax the constraints of EM, as tokens separated by distinct tokenizers within the same sequence often exhibit minor discrepancies. Consequently, MinED effectively supplements a considerable amount of useful token information while introducing negligible errors.

Fusion function. In Section 3.3, we introduce two variations of the fusion function for FUSELLM: one utilizing a distribution matrix with minimum cross entropy (MinCE) and the other adopting a weighted average of distribution matrices based on cross entropy (AvgCE). A comparison of the two functions is presented in Table 5 (down). The findings demonstrate that FUSELLM with MinCE consistently outperforms AvgCE across all benchmarks. This can be attributed to the distortions introduced by the straightforward weighted summation used in AvgCE, which may diminish the distinct advantages of individual LLMs.

Table 5: Comparison of different token alignment criteria (upper) and fusion functions (down).

4.5 FUSELLM vs. KNOWLEDGE DISTILLATION

While knowledge distillation techniques can also be utilized to enhance an LLM's capabilities, FUSELLM stands out due to two distinct aspects, as previously outlined. In this section, we compare FUSELLM with traditional knowledge distillation. Specifically, we extract probabilistic distributions from Llama-2 13B and apply the conventional knowledge distillation method to transfer its abilities into Llama-2 7B. As illustrated in Table 6, the distilled model (Llama-2 KD) outperforms the original Llama-2 7B across all benchmarks, demonstrating the effectiveness of knowledge distillation. However, when compared to FUSELLM, the improvement achieved by Llama-2 KD is relatively modest, especially in the case of BBH (2.97% vs. 5.16%). This suggests that the superior results achieved by FUSELLM through the integration of three 7B models with diverse architectures via continual training outweigh the benefits of simply distilling knowledge from a single 13B model. This observation highlights the idea that “More is different, but different can also be more” (Tay et al., 2022).

Table 6: Comparison of FUSELLM and knowledge distillation. Llama-2 KD denotes the enhanced Llama-2 7B achieved via knowledge distillation from Llama-2 13B. Percentages indicate the rate of improvement compared to Llama-2.

4.6 FUSELLM vs. ENSEMBLE/MERGING

As previously mentioned, conventional techniques such as model ensemble and weight merging are commonly employed to fuse multiple LLMs. To compare the efficacy of our FUSELLM with these existing fusion methods, we conduct experiments simulating scenarios where multiple LLMs originated from the same base model but were trained on distinct corpora. We first select three relevant domains (PhilPapers, NIH ExPorter, and USPTO Backgrounds) from The Pile and use 1 billion tokens from each domain to continually train Pythia 1B (Biderman et al., 2023), resulting in three distinct LLMs with identical structures. Then, we apply different fusion techniques to these LLMs: (1) The ensemble method calculates a weighted average of the probabilities generated by all LLMs, considering the performance of each model; (2) The weight merging method merges multiple LLMs into a single one within the parameter space, with the merging weights determined by model performance; (3) FUSELLM undergoes continual training on 0.1 billion tokens sampled from the three domains. The results of perplexity for FUSELLM and the other fusion methods on the test sets are presented in Table 7. We measure perplexity in bits per UTF-8 encoded byte (BPB) following the implementation in The Pile. We observe that after training with 1 billion tokens, the capabilities of the original LLM are transferred to each domain-specific LLM, resulting in decreased performance in other domains. While all fusion techniques can integrate the strengths of diverse models, FUSELLM consistently achieves the lowest average perplexity across the three domains. This underscores its potential for harnessing collective knowledge more effectively than ensemble and weight merging methods.

Table 7: Comparison of perplexity between FUSELLM and ensemble & weight merging.
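For contrast, here is a minimal sketch of the two baseline fusion techniques described above: probability-level ensembling and parameter-space weight merging (the latter only applies to identical architectures). The performance-based weights are left as plain inputs; how they are derived from model performance is not shown here.

```python
import torch

def ensemble_probs(prob_list, weights):
    """Model ensemble: weighted average of each model's output probabilities."""
    total = sum(weights)
    return sum((w / total) * p for w, p in zip(weights, prob_list))

@torch.no_grad()
def merge_weights(models, weights):
    """Weight merging: average parameters in weight space.
    Only meaningful when all models share the exact same architecture."""
    total = sum(weights)
    sds = [m.state_dict() for m in models]
    merged = {}
    for name, ref in sds[0].items():
        if ref.is_floating_point():          # average float tensors only
            merged[name] = sum((w / total) * sd[name] for w, sd in zip(weights, sds))
        else:
            merged[name] = ref               # keep integer buffers as-is
    models[0].load_state_dict(merged)        # reuse the first model as the container
    return models[0]
```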

5 CONCLUSION

In this study, we have explored the realm of knowledge fusion for LLMs to create a unified model that combines the capabilities and distinctive strengths of multiple structurally diverse LLMs. We introduced a novel method, FUSELLM, which leverages the generative distributions of these source LLMs to externalize their knowledge and employs them in the continual training of the target LLM. Through a series of experiments, we have demonstrated the superiority of FUSELLM over individual source LLMs and established baselines. Notably, in a simulated experiment featuring multiple structurally identical LLMs, FUSELLM has showcased its competitive effectiveness compared to ensemble and weight merging methods. Hence, the domain of LLMs fusion emerges as a more promising avenue for exploration, particularly given the diverse structures and substantial model sizes of LLMs. We believe that these findings will inspire future research endeavors.

APPENDIX

A. DETAILS OF TOKEN ALIGNMENT

For an input text, token alignment involves aligning two distribution matrices from two source LLMs. Therefore, the alignment comprises two dimensions: token-wise with respect to the text and distribution-wise with respect to the vocabulary. To provide a clear explanation, we show an example of different methods for token alignment in Figure 3.

In the token dimension, we utilize the dynamic programming approach to recursively minimize the total cost of editing one sequence of tokens to align with another. When the mapped tokens are identical, such as the token “now” in the given example, these tokens are successfully aligned, allowing for the corresponding distributions to align subsequently. However, when the mapped tokens exhibit differences, such as the “get” and “gets” tokens in the example, the previous EM method proposed by Fu et al. (2023) does not align these tokens, resulting in the distributions degenerating into one-hot vectors. In contrast, our proposed MinED method successfully aligns the “gets” token with the “get” token, as they exhibit the minimal edit distance in the vocabularies from the two source LLMs.

Concerning the distribution dimension, the alignment is performed between two vocabularies from different tokenizers of two source LLMs. Therefore, for distribution values with identical tokens, such as “current 0.05” and “current 0.04”, they will be aligned effectively. For distribution values involving different tokens, such as “immediate 0.04” and “immediately 0.03”, the EM method disregards this value. However, the proposed MinED method maps “immediately” to “immediate” due to their minimal edit distance, resulting in the successful alignment of these distribution values.

Figure 3: An example of different methods for token alignment.

B. DETAILS OF MINIPILE

MiniPile is curated from The Pile (Gao et al., 2020) through a three-stage pruning process: (1) extracting embeddings for all documents with E5-Large (Wang et al., 2022a), which is a sentence embedding model, (2) clustering the embeddings using K-means, and (3) filtering out low-quality clusters. Therefore, MiniPile retains a compact scale while exhibiting extensive diversity, making it a prevalent choice for efficient training of LLMs (Kaddour et al., 2023; Sanyal et al., 2023).
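An illustrative sketch of that three-stage pruning pipeline; `embed_documents`, the cluster count, and the list of dropped clusters are placeholders, not the MiniPile curators' actual settings.

```python
from sklearn.cluster import KMeans

def prune_corpus(documents, embed_documents, n_clusters=220, drop_clusters=None):
    """Three-stage pruning: (1) embed documents (e.g., with a sentence embedding
    model such as E5-Large), (2) cluster the embeddings with K-means,
    (3) drop documents belonging to low-quality clusters."""
    embeddings = embed_documents(documents)                        # stage 1
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)  # stage 2
    labels = km.fit_predict(embeddings)
    drop = set(drop_clusters or [])                                # stage 3
    return [doc for doc, label in zip(documents, labels) if label not in drop]
```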

C. TRAINING DETAILS

Our model is optimized using the AdamW optimizer with β1 = 0.9 and β2 = 0.95, with gradient clipping set to 1.0 and weight decay to 0.1. A cosine learning rate schedule is employed, with a maximum learning rate of 1e-5 and a warmup ratio of 0.008. To accelerate the training, we employ packing (Raffel et al., 2020), where multiple training instances are grouped into a single sequence separated by end-of-sequence tokens, allowing for training on more tokens in each batch.
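A minimal sketch of the packing step described above, which concatenates training instances separated by end-of-sequence tokens into fixed-length sequences; the greedy chunking policy is an assumption of this sketch.

```python
def pack_sequences(token_lists, eos_id, max_len=2048):
    """Greedily pack tokenized instances, separated by EOS, into fixed-length
    sequences so that each batch trains on more tokens."""
    packed, buf = [], []
    for toks in token_lists:
        buf.extend(toks + [eos_id])
        while len(buf) >= max_len:
            packed.append(buf[:max_len])
            buf = buf[max_len:]
    if buf:
        packed.append(buf)       # final (possibly shorter) chunk
    return packed
```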

D. ADDITIONAL EVALUATION RESULTS

To further illustrate the effectiveness of FUSELLM, we incorporate additional generative benchmarks related to knowledge-based question-answering, reading comprehension, content analysis, machine translation, and theorem application. The results presented in Table 8 highlight FuseLLM’s superiority over all source LLMs across all tasks.

• TriviaQA (Joshi et al., 2017) is a benchmark to evaluate the knowledge-based question-answering ability. We conduct a zero-shot evaluation and report the EM accuracy. • DROP (Dua et al., 2019) is a benchmark to evaluate the reading comprehension ability. We conduct a few-shot evaluation with CoT prompts and report the EM accuracy. • LAMBADA (Paperno et al., 2016) is a benchmark to evaluate the content analysis ability. We conduct a zero-shot evaluation and report the EM accuracy. • IWSLT2017 (Cettolo et al., 2017) is a benchmark to evaluate the machine translation ability. We conduct a zero-shot evaluation and report the BLEU (Papineni et al., 2002) score. • SciBench (Wang et al., 2023a) is a benchmark to evaluate the theorem application ability. We conduct a few-shot evaluation with CoT prompts and report the EM accuracy.

Table 8: Overall results of FUSELLM and baselines in additional generative benchmarks, where percentages indicate the rate of improvement/decrease compared to Llama-2.

E. FUSELLM vs. PREVIOUS MODEL FUSION METHODS

The motivation behind FuseLLM is to integrate the collective knowledge of multiple LLMs with diverse architectures and pre-training corpora. Consequently, the traditional fusion method of model merging, which demands identical model architectures, is not directly applicable in this context. While the model ensemble technique aggregates predictions from multiple LLMs, the drawback lies in the substantial memory and time costs when maintaining multiple source LLMs during inference. We further compare FUSELLM with an ensemble method for LLMs, LLM-Blender (Jiang et al., 2023), which ranks and combines the output texts from multiple LLMs with ranker and fuser models. Specifically, we conduct experiments on the Big-Bench Hard and MultiPL-E benchmarks using the open-source ranker and fuser models. Notably, the CommonSense benchmark, which uses perplexity-based evaluation, cannot be adapted to the LLM-Blender method. The experimental results are shown in Table 9, where LLM-Blender (Rank&Fuse) refers to using the ranker to obtain the top three results and then using the fuser to combine them, and LLM-Blender (Rank) represents simply using the ranker to obtain the top one result. We observed a notable performance deterioration after fusion when employing both the ranker and fuser. This could be attributed to the fuser model's training within the instruction-tuning context, potentially leading to inadequate generalization to the test tasks. Furthermore, while LLM-Blender (Rank) outperforms LLM-Blender (Rank&Fuse), it remains inferior to the best-performing source LLM. This suggests that the ranker model is unable to efficiently discriminate the optimal responses when combining different LLMs.

Table 9: Comparison of FUSELLM and LLM-Blender.

F. INCORPORATING INSTRUCTION-TUNING MODELS WITH FUSELLM

Recall that the proposed FUSELLM involves extracting distribution matrices from multiple distinct source LLMs and continually training the target LLM. Therefore, FUSELLM is also applicable to instruction-tuning models, provided that all corresponding continual-training samples adhere to the instruction-tuning format and mask the instruction part when calculating the training loss.

To confirm this, we conduct new experiments on the fusion of instruction-tuning LLMs. Specifically, we initially fine-tune Llama-2, OpenLLaMA, and MPT using 20k samples from Evol-Instruct (Xu et al., 2023), ShareGPT (Chiang et al., 2023), and Open-Platypus (Lee et al., 2023) datasets, respectively. Consequently, the three source LLMs transitioned into instruction-tuning LLMs. Then, we sample another 5k samples from each of the aforementioned datasets to create a corpus for continual training, specifying Llama-2 Evol-Instruct as the target LLM for knowledge fusion. We assess the instruction-following performance on the Vicuna Benchmark using GPT-4 as an evaluator following Chiang et al. (2023), which gives a score from 1 to 10 for each answer. The results shown in Table 10 demonstrate that FuseLLM surpasses each individual source instruction-tuning LLM, achieving the best performance with GPT-4 judgment.

Table 10: Results of fusing instruction-tuning models with FUSELLM.

G. CAUSE OF PERFORMANCE IMPROVEMENT

To further demonstrate that our performance improvement stems from the integration of knowledge from multiple LLMs rather than solely from continual training, we conduct an evaluation on an alternative corpus, RedPajama (Computer, 2023). To mitigate the impact of different tokenizers, we employ the bits per UTF-8 encoded byte (BPB) metric proposed by Gao et al. (2020), where a smaller value indicates lower perplexity. Then, we compute the percentage of test samples in each domain exhibiting decreased BPB from Llama-2 CLM to FUSELLM and from Llama-2 to OpenLLaMA or MPT. The results in Figure 4 suggest that when FUSELLM outperforms Llama-2 CLM, the performance of OpenLLaMA or MPT typically surpasses that of Llama-2, as evidenced by the slashed bars in each domain. This phenomenon is particularly pronounced in domains such as Arxiv, StackExchange, Wikipedia, and Github, where it exceeds 95%. This compelling evidence suggests that the performance enhancements achieved by FUSELLM are indeed attributed to the integration of knowledge from multiple LLMs.

Figure 4: Perplexity comparison on RedPajama. The bars denote the percentage of examples with reduced perplexity when transitioning from CLM to FUSELLM (solid) and from Llama-2 to OpenLLaMA or MPT (slashed).
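For clarity, the bits-per-UTF-8-byte metric used here and in Section 4.6 is the total negative log-likelihood (in nats) converted to bits and divided by the number of encoded bytes; a minimal sketch with a hypothetical example:

```python
import math

def bits_per_byte(total_nll_nats: float, n_utf8_bytes: int) -> float:
    """Bits per UTF-8 encoded byte (BPB): NLL in nats -> bits, normalized by bytes."""
    return total_nll_nats / (n_utf8_bytes * math.log(2))

# Hypothetical numbers: 1.2e6 nats of total NLL over 2.5e6 UTF-8 bytes
print(bits_per_byte(1.2e6, 2_500_000))   # ~0.69 BPB (lower is better)
```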

H. STATIC VS. DYNAMIC DESIGN OF λ

Table 11: Comparison of different designs for λ.

In this experiment, we set the λ to increase linearly from 0.7 to 1.0. Table 11 presents a comparison between the static and dynamic designs of λ, demonstrating that the two approaches yield comparable performance. Therefore, we opt for the static design for simplicity.

I. USING STRONGEST LLM AS TARGET

In the experiments, we consistently employ Llama-2 as the target LLM to maintain a fixed setup. We further conduct supplementary experiments on code generation tasks using the strongest LLM, OpenLLaMA, as the target LLM. The results are shown in Table 12. We observe a decrease in the performance of OpenLLaMA CLM compared to the source OpenLLaMA and MPT. This decline is attributed to the inclusion of the StarCoder (Li et al., 2023) data in OpenLLaMA’s pre-training corpus, whereas there is a limited amount of code data in our continual training corpus. For the same reason, even though FuseLLM showcases improved performance compared to OpenLLaMA CLM, it still lags behind the original OpenLLaMA.

Table 12: Overall results of FUSELLM using OpenLLaMA as the target LLM in code generation evaluations on MultiPL-E (ME).

J. CASE STUDIES

In Table 13, Table 14, and Table 15, we present case studies to demonstrate how FUSELLM combines the strengths of multiple source LLMs to produce accurate results in different tasks.

Table 13: Case studies on the Logical Deduction (3 objects) task.

Table 14: Case studies on the Hyperbaton task.

Table 15: Case studies on the Disambiguation QA task.
