MoE | Mixtral of Experts

MinWoo(Daniel) Park | Tech Blog

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-05

Mixtral of Experts

  • url: https://arxiv.org/abs/2401.04088
  • pdf: https://arxiv.org/pdf/2401.04088
  • abstract: We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

TL;DR


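The Mixtral of Experts paper, which applies a Mixture of Experts (MoE) design, has been released. It is a technical report from Mistral AI that adds MoE layers to the Mistral model.

  1. Introducing Mixtral 8x7B: a high-performing sparse mixture of experts (SMoE) model released under the Apache 2.0 license
  2. Efficient architecture: each token uses only a fraction of the parameters, which improves inference speed and throughput
  3. Benchmark results: matches or outperforms Llama 2 70B and GPT-3.5 on most benchmarks, with especially strong results in mathematics and multilingual understanding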

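1. Mixtral model architecture

Mixtral 8x7B is a decoder-only sparse model built on a mixture of experts. A router assigns each input vector to 2 of 8 experts, and the layer's output is the weighted sum of the two selected experts' outputs. This design increases the parameter count while keeping cost and latency under control.

1.1 Structural characteristics of Mixtral 8x7B

Mixtral 8x7B uses a transformer-based architecture but replaces the standard feedforward block with a Mixture of Experts (MoE) layer. For each token it dynamically selects two experts and combines their outputs to produce the final output. Because each token uses only a fraction of the parameters, this approach improves inference speed and throughput.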

1.2 Mathematical formulation

\[y = \sum_{i=0}^{n-1} G(x)_i \cdot E_i(x)\]

Here $G(x)_i$ is the gating network's output for the $i$-th expert and $E_i(x)$ is the output of the $i$-th expert network. The gating function $G(x)$ applies a linear transformation to the input $x$ and then takes a softmax over the top-$K$ logits:

\[G(x) := \text{Softmax}(\text{TopK}(x \cdot W_g))\]

Because the gates of unselected experts are exactly zero, only the $K$ selected experts need to be evaluated, which keeps the compute used per token low.


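2. Datasets and benchmark performance

Mixtral was pretrained on multilingual data and supports a context size of 32k tokens. It matches or exceeds Llama 2 70B and GPT-3.5 on most benchmarks, with the largest gains in mathematics, code generation, and multilingual understanding.

  • Math and code generation: strong results on GSM8K and MATH (math) and on Humaneval and MBPP (code)
  • Multilingual: outperforms Llama 2 70B on ARC Challenge, Hellaswag, and MMLU in French, German, Spanish, and Italian
  • Long-range performance: reliable information retrieval across the 32k-token context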


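3. Method

Mixtral uses MoE (Mixture of Experts) layers to select experts per token and activates only those experts. This makes it possible to grow the model's parameter count while keeping compute cost under control.

  1. Assumption: limiting the number of parameters used to process each token can optimize speed and cost.
  2. Method: MoE layers dynamically select and activate only the experts that are needed.
  3. Conclusion: Mixtral keeps the total parameter count large while limiting the active parameters used in computation, and so maintains high performance.

3.1 MoE implementation

The MoE layer keeps the model's total parameter count (47B) while limiting the number of active parameters per token (13B), which enables efficient inference. This reduces both resource usage and cost.

3.2 Performance and efficiency

MoE layers can run efficiently on a single GPU with specialized kernels, and can be distributed across multiple GPUs through Expert Parallelism. This improves the model's scalability and flexibility.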


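4. Comparison and evaluation

The model matches or outperforms existing models such as GPT-3.5 and Llama 2 70B on most benchmarks, with particularly strong results on mathematics and multilingual benchmarks. Mixtral also shows a more balanced profile on bias and sentiment benchmarks.

4.1 Comparison with competing models

Compared with Llama 2 70B and GPT-3.5, Mixtral performs better on most benchmarks. Its math and code generation results stand out.

4.2 Bias and sentiment benchmarks

On the BBQ and BOLD benchmarks, Mixtral shows less bias than Llama 2 70B and a more positive sentiment profile. This strengthens the model's case for fairness and social responsibility.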


1 Introduction

In this paper, we present Mixtral 8x7B, a sparse mixture of experts model (SMoE) with open weights, licensed under Apache 2.0. Mixtral outperforms Llama 2 70B and GPT-3.5 on most benchmarks. As it only uses a subset of its parameters for every token, Mixtral allows faster inference speed at low batch-sizes, and higher throughput at large batch-sizes.

Mixtral is a sparse mixture-of-experts network. It is a decoder-only model where the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively. This technique increases the number of parameters of a model while controlling cost and latency, as the model only uses a fraction of the total set of parameters per token.

Mixtral is pretrained with multilingual data using a context size of 32k tokens. It either matches or exceeds the performance of Llama 2 70B and GPT-3.5, over several benchmarks. In particular, Mixtral demonstrates superior capabilities in mathematics, code generation, and tasks that require multilingual understanding, significantly outperforming Llama 2 70B in these domains. Experiments show that Mixtral is able to successfully retrieve information from its context window of 32k tokens, regardless of the sequence length and the location of the information in the sequence.

Code: https://github.com/mistralai/mistral-src

Webpage: https://mistral.ai/news/mixtral-of-experts/

Figure 1: Mixture of Experts Layer. Each input vector is assigned to 2 of the 8 experts by a router. The layer’s output is the weighted sum of the outputs of the two selected experts. In Mixtral, an expert is a standard feedforward block as in a vanilla transformer architecture.

We also present Mixtral 8x7B – Instruct, a chat model fine-tuned to follow instructions using supervised fine-tuning and Direct Preference Optimization [25]. Its performance notably surpasses that of GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B – chat model on human evaluation benchmarks. Mixtral – Instruct also demonstrates reduced biases, and a more balanced sentiment profile in benchmarks such as BBQ, and BOLD. We release both Mixtral 8x7B and Mixtral 8x7B – Instruct under the Apache 2.0 license (footnote 1), free for academic and commercial usage, ensuring broad accessibility and potential for diverse applications. To enable the community to run Mixtral with a fully open-source stack, we submitted changes to the vLLM project, which integrates Megablocks CUDA kernels for efficient inference. Skypilot also allows the deployment of vLLM endpoints on any instance in the cloud.

2 Architectural details

Mixtral is based on a transformer architecture [31] and uses the same modifications as described in [18], with the notable exceptions that Mixtral supports a fully dense context length of 32k tokens, and the feed-forward blocks are replaced by Mixture-of-Expert layers (Section 2.1). The model architecture parameters are summarized in Table 1.

2.1 Sparse Mixture of Experts

We present a brief overview of the Mixture of Experts layer (Figure 1). For a more in-depth overview, see [12]. The output of the MoE module for a given input $x$ is determined by the weighted sum of the outputs of the expert networks, where the weights are given by the gating network’s output. That is, given $n$ expert networks $\{E_0, E_1, \ldots, E_{n-1}\}$, the output of the expert layer is given by:

\[\sum_{i=0}^{n-1} G(x)_i \cdot E_i(x).\]

Here, $G(x)_i$ denotes the $n$-dimensional output of the gating network for the $i$-th expert, and $E_i(x)$ is the output of the $i$-th expert network. If the gating vector is sparse, we can avoid computing the outputs of experts whose gates are zero. There are multiple alternative ways of implementing $G(x)$ [6, 15, 35], but a simple and performant one is implemented by taking the softmax over the Top-K logits of a linear layer [28]. We use

\[G(x) := \text{Softmax}(\text{TopK}(x \cdot W_g)),\]

where $(\text{TopK}(\ell))_i := \ell_i$ if $\ell_i$ is among the top-$K$ coordinates of logits $\ell \in \mathbb{R}^n$ and $(\text{TopK}(\ell))_i := -\infty$ otherwise. The value of $K$ – the number of experts used per token – is a hyper-parameter that modulates the amount of compute used to process each token. If one increases $n$ while keeping $K$ fixed, one can increase the model’s parameter count while keeping its computational cost effectively constant. This motivates a distinction between the model’s total parameter count (commonly referenced as the sparse parameter count), which grows with $n$, and the number of parameters used for processing an individual token (called the active parameter count), which grows with $K$ up to $n$.
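The gating function described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `top_k_gating`, the array shapes, and the random router weights are all hypothetical.

```python
import numpy as np

def top_k_gating(x, w_g, k=2):
    """Return gate weights G(x) with at most k non-zero entries per token.

    x:   (tokens, dim) input vectors
    w_g: (dim, n_experts) router weights (hypothetical values in this sketch)
    """
    logits = x @ w_g                                        # (tokens, n_experts)
    # Indices of the k largest logits for each token.
    top_idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    # TopK: keep the selected logits and set all others to -inf.
    masked = np.full_like(logits, -np.inf)
    kept = np.take_along_axis(logits, top_idx, axis=-1)
    np.put_along_axis(masked, top_idx, kept, axis=-1)
    # Softmax over the masked logits; exp(-inf) = 0, so unselected gates are exactly zero.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))        # 4 tokens, model dimension 16 (toy sizes)
w_g = rng.normal(size=(16, 8))      # 8 experts
gates = top_k_gating(x, w_g, k=2)
```

Each row of `gates` sums to 1 and has exactly two non-zero entries, so growing the number of experts while keeping `k` fixed adds parameters without adding expert evaluations per token.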

Footnote 1: https://mistral.ai/news/mixtral-of-experts/

MoE layers can be run efficiently on single GPUs with high performance specialized kernels. For example, Megablocks [13] casts the feed-forward network (FFN) operations of the MoE layer as large sparse matrix multiplications, significantly enhancing the execution speed and naturally handling cases where different experts get a variable number of tokens assigned to them. Moreover, the MoE layer can be distributed to multiple GPUs through standard Model Parallelism techniques, and through a particular kind of partitioning strategy called Expert Parallelism (EP) [28]. During the MoE layer’s execution, tokens meant to be processed by a specific expert are routed to the corresponding GPU for processing, and the expert’s output is returned to the original token location. Note that EP introduces challenges in load balancing, as it is essential to distribute the workload evenly across the GPUs to prevent overloading individual GPUs or hitting computational bottlenecks.
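To make the routing and load-balancing concern concrete, here is a minimal sketch that groups tokens by their assigned expert and measures how uneven the resulting load is. The helper names `dispatch_by_expert` and `load_imbalance` and the toy assignments are hypothetical; a real Expert Parallelism setup would send each group to a different device, which this single-process sketch does not do.

```python
import numpy as np

def dispatch_by_expert(top_experts, n_experts):
    """Group token indices by the expert they were routed to.

    top_experts: (tokens, k) integer array of selected expert ids per token.
    Returns a list where entry e holds the indices of tokens sent to expert e.
    """
    n_tokens, k = top_experts.shape
    token_ids = np.repeat(np.arange(n_tokens), k)   # token index for each (token, slot) pair
    flat = top_experts.reshape(-1)
    return [token_ids[flat == e] for e in range(n_experts)]

def load_imbalance(groups):
    """Ratio of the busiest expert's load to the mean load (1.0 means perfectly balanced)."""
    loads = np.array([len(g) for g in groups], dtype=float)
    return loads.max() / loads.mean()

# Toy routing decision: 4 tokens, 2 experts each, with expert 0 over-subscribed.
top_experts = np.array([[0, 3], [0, 5], [0, 3], [2, 7]])
groups = dispatch_by_expert(top_experts, n_experts=8)
imbalance = load_imbalance(groups)
```

In this toy case expert 0 receives three of the eight assignments, so the device hosting it would do three times the average work.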

In a Transformer model, the MoE layer is applied independently per token and replaces the feed-forward (FFN) sub-block of the transformer block. For Mixtral we use the same SwiGLU architecture as the expert function $E_i(x)$ and set $K = 2$. This means each token is routed to two SwiGLU sub-blocks with different sets of weights. Taking this all together, the output $y$ for an input token $x$ is computed as:

\[y = \sum_{i=0}^{n-1} \text{Softmax}(\text{Top2}(x \cdot W_g))_i \cdot \text{SwiGLU}_i(x).\]

This formulation is similar to the GShard architecture [21], with the exceptions that we replace all FFN sub-blocks by MoE layers while GShard replaces every other block, and that GShard uses a more elaborate gating strategy for the second expert assigned to each token.
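The per-token computation described in this section (top-2 routing over SwiGLU experts) can be sketched as follows. This is a toy NumPy version under stated assumptions: the weight names `w1`, `w2`, `w3`, the SwiGLU form `(silu(x W1) * (x W3)) W2`, and all sizes are illustrative choices, and the loop over experts stands in for the specialized kernels mentioned earlier.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu_expert(x, w1, w2, w3):
    """SwiGLU feed-forward block: (silu(x W1) * (x W3)) W2."""
    return (silu(x @ w1) * (x @ w3)) @ w2

def moe_layer(x, w_g, experts, k=2):
    """y = sum_i Softmax(TopK(x W_g))_i * SwiGLU_i(x), evaluating only selected experts."""
    logits = x @ w_g                                         # (tokens, n_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -k:]            # (tokens, k) selected experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over the k kept logits
    y = np.zeros_like(x)
    for e, (w1, w2, w3) in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)           # tokens routed to expert e
        if token_ids.size == 0:
            continue                                         # gate is zero: skip this expert
        gate = weights[token_ids, slot][:, None]
        y[token_ids] += gate * swiglu_expert(x[token_ids], w1, w2, w3)
    return y

rng = np.random.default_rng(0)
dim, hidden, n_experts = 8, 16, 8                            # toy sizes, not Mixtral's
experts = [(rng.normal(size=(dim, hidden)),
            rng.normal(size=(hidden, dim)),
            rng.normal(size=(dim, hidden))) for _ in range(n_experts)]
w_g = rng.normal(size=(dim, n_experts))
x = rng.normal(size=(5, dim))
y = moe_layer(x, w_g, experts, k=2)
```

Setting `k` equal to the number of experts turns this into a dense softmax mixture, which is a convenient way to check the sparse path against a straightforward reference.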

3 Results

We compare Mixtral to Llama, and re-run all benchmarks with our own evaluation pipeline for fair comparison. We measure performance on a wide variety of tasks categorized as follows:

  • Commonsense Reasoning (0-shot): Hellaswag [32], Winogrande [26], PIQA [3], SIQA [27], OpenbookQA [22], ARC-Easy, ARC-Challenge [8], CommonsenseQA [30]
  • World Knowledge (5-shot): NaturalQuestions [20], TriviaQA [19]
  • Reading Comprehension (0-shot): BoolQ [7], QuAC [5]
  • Math: GSM8K [9] (8-shot) with maj@8 and MATH [17] (4-shot) with maj@4
  • Code: Humaneval [4] (0-shot) and MBPP [1] (3-shot)
  • Popular aggregated results: MMLU [16] (5-shot), BBH [29] (3-shot), and AGI Eval [34] (3-5-shot, English multiple-choice questions only)

Figure 2: Performance of Mixtral and different Llama models on a wide range of benchmarks. All models were re-evaluated on all metrics with our evaluation pipeline for accurate comparison. Mixtral outperforms or matches Llama 2 70B on all benchmarks. In particular, it is vastly superior in mathematics and code generation.

Table 2: Comparison of Mixtral with Llama. Mixtral outperforms or matches Llama 2 70B performance on almost all popular benchmarks while using 5x fewer active parameters during inference.

Figure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x7B) vs Llama 2 (7B/13B/70B). Mixtral largely outperforms Llama 2 70B on all benchmarks, except on reading comprehension benchmarks while using 5x lower active parameters. It is also vastly superior to Llama 2 70B on code and math.

Detailed results for Mixtral, Mistral 7B and Llama 2 7B/13B/70B and Llama 1 34B (footnote 2) are reported in Table 2. Figure 2 compares the performance of Mixtral with the Llama models in different categories. Mixtral surpasses Llama 2 70B across most metrics. In particular, Mixtral displays a superior performance in code and mathematics benchmarks.

Size and Efficiency. We compare our performance to the Llama 2 family, aiming to understand Mixtral models’ efficiency in the cost-performance spectrum (see Figure 3). As a sparse Mixture-of-Experts model, Mixtral only uses 13B active parameters for each token. With 5x lower active parameters, Mixtral is able to outperform Llama 2 70B across most categories.

Note that this analysis focuses on the active parameter count (see Section 2.1), which is directly proportional to the inference compute cost, but does not consider the memory costs and hardware utilization. The memory costs for serving Mixtral are proportional to its sparse parameter count, 47B, which is still smaller than Llama 2 70B. As for device utilization, we note that the SMoEs layer introduces additional overhead due to the routing mechanism and due to the increased memory loads when running more than one expert per device. They are more suitable for batched workloads where one can reach a good degree of arithmetic intensity.
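The 47B total and 13B active figures can be approximately reproduced with simple arithmetic. The sketch below assumes hyperparameters that are not listed in this post (model dimension 4096, 32 layers, FFN hidden size 14336, 32 query heads, 8 key/value heads, head dimension 128, vocabulary 32000), bias-free linear layers, and untied input and output embeddings. Treat those values and the function name `mixtral_param_counts` as assumptions to be checked against the paper's Table 1.

```python
def mixtral_param_counts(dim=4096, n_layers=32, hidden_dim=14336, n_heads=32,
                         n_kv_heads=8, head_dim=128, vocab_size=32000,
                         n_experts=8, top_k=2):
    """Return (total, active-per-token) parameter counts under the assumed sizes."""
    # Attention: Wq and Wo use all query heads, Wk and Wv use the key/value heads.
    attn = 2 * dim * n_heads * head_dim + 2 * dim * n_kv_heads * head_dim
    expert = 3 * dim * hidden_dim          # three SwiGLU weight matrices per expert
    router = dim * n_experts               # gating matrix W_g
    norms = 2 * dim                        # two normalization layers per block
    # Parameters every token passes through, regardless of routing.
    shared = n_layers * (attn + router + norms) + dim + 2 * vocab_size * dim
    total = shared + n_layers * n_experts * expert
    active = shared + n_layers * top_k * expert
    return total, active

total, active = mixtral_param_counts()
print(f"total:  {total / 1e9:.1f}B")    # total:  46.7B
print(f"active: {active / 1e9:.1f}B")   # active: 12.9B
```

Under these assumptions the expert weights account for almost all of the total, which is why memory scales with the sparse count while per-token compute scales with the active count.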

Comparison with Llama 2 70B and GPT-3.5. In Table 3, we report the performance of Mixtral 8x7B compared to Llama 2 70B and GPT-3.5. We observe that Mixtral performs similarly or above the two other models. On MMLU, Mixtral obtains a better performance, despite its significantly smaller capacity (47B parameters compared to 70B). For MT Bench, we report the performance of the latest GPT-3.5-Turbo model available, gpt-3.5-turbo-1106.

Footnote 2: Since Llama 2 34B was not open-sourced, we report results for Llama 1 34B.

Table 3: Comparison of Mixtral with Llama 2 70B and GPT-3.5. Mixtral outperforms or matches Llama 2 70B and GPT-3.5 performance on most metrics.

Evaluation Differences. On some benchmarks, there are some differences between our evaluation protocol and the one reported in the Llama 2 paper: 1) on MBPP, we use the hand-verified subset 2) on TriviaQA, we do not provide Wikipedia contexts.

3.1 Multilingual benchmarks

Compared to Mistral 7B, we significantly upsample the proportion of multilingual data during pretraining. The extra capacity allows Mixtral to perform well on multilingual benchmarks while maintaining a high accuracy in English. In particular, Mixtral significantly outperforms Llama 2 70B in French, German, Spanish, and Italian, as shown in Table 4.

Table 4: Comparison of Mixtral with Llama on Multilingual Benchmarks. On ARC Challenge, Hellaswag, and MMLU, Mixtral outperforms Llama 2 70B on 4 languages: French, German, Spanish, and Italian.

3.2 Long range performance

To assess the capabilities of Mixtral to tackle long context, we evaluate it on the passkey retrieval task introduced in [23], a synthetic task designed to measure the ability of the model to retrieve a passkey inserted randomly in a long prompt. Results in Figure 4 (Left) show that Mixtral achieves a 100% retrieval accuracy regardless of the context length or the position of passkey in the sequence. Figure 4 (Right) shows that the perplexity of Mixtral on a subset of the proof-pile dataset [2] decreases monotonically as the size of the context increases.

Figure 4: Long range performance of Mixtral. (Left) Mixtral has 100% retrieval accuracy of the Passkey task regardless of the location of the passkey and length of the input sequence. (Right) The perplexity of Mixtral on the proof-pile dataset decreases monotonically as the context length increases.

3.3 Bias Benchmarks

To identify possible flaws to be corrected by fine-tuning / preference modeling, we measure the base model performance on Bias Benchmark for QA (BBQ) [24] and Bias in Open-Ended Language Generation Dataset (BOLD) [10]. BBQ is a dataset of hand-written question sets that target attested social biases against nine different socially-relevant categories: age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socio-economic status, sexual orientation. BOLD is a large-scale dataset that consists of 23,679 English text generation prompts for bias benchmarking across five domains.

Figure 5: Bias Benchmarks. Compared to Llama 2 70B, Mixtral presents less bias (higher accuracy on BBQ, lower std on BOLD) and displays more positive sentiment (higher avg on BOLD).

We benchmark Llama 2 and Mixtral on BBQ and BOLD with our evaluation framework and report the results in Figure 5. Compared to Llama 2, Mixtral presents less bias on the BBQ benchmark (56.0% vs 51.5%). For each group in BOLD, a higher average sentiment score means more positive sentiments and a lower standard deviation indicates less bias within the group. Overall, Mixtral displays more positive sentiments than Llama 2, with similar variances within each group.

4 Instruction Fine-tuning

We train Mixtral – Instruct using supervised fine-tuning (SFT) on an instruction dataset followed by Direct Preference Optimization (DPO) [25] on a paired feedback dataset. Mixtral – Instruct reaches a score of 8.30 on MT-Bench [33] (see Table 3), making it the best open-weights model as of December 2023. Independent human evaluation conducted by LMSys is reported in Figure 6 (footnote 3) and shows that Mixtral – Instruct outperforms GPT-3.5-Turbo, Gemini Pro, Claude-2.1, and Llama 2 70B chat.

Figure 6: LMSys Leaderboard. (Screenshot from Dec 22, 2023) Mixtral 8x7B Instruct v0.1 achieves an Arena Elo rating of 1121 outperforming Claude-2.1 (1117), all versions of GPT-3.5-Turbo (1117 best), Gemini Pro (1111), and Llama-2-70b-chat (1077). Mixtral is currently the best open-weights model by a large margin.

Footnote 3: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

5 Routing analysis

In this section, we perform a small analysis on the expert selection by the router. In particular, we are interested to see if during training some experts specialized to some specific domains (e.g. mathematics, biology, philosophy, etc.).

To investigate this, we measure the distribution of selected experts on different subsets of The Pile validation dataset [14]. Results are presented in Figure 7, for layers 0, 15, and 31 (layers 0 and 31 respectively being the first and the last layers of the model). Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic. For instance, at all layers, the distribution of expert assignment is very similar for ArXiv papers (written in Latex), for biology (PubMed Abstracts), and for Philosophy (PhilPapers) documents.

Only for DM Mathematics we note a marginally different distribution of experts. This divergence is likely a consequence of the dataset’s synthetic nature and its limited coverage of the natural language spectrum, and is particularly noticeable at the first and last layers, where the hidden states are very correlated to the input and output embeddings respectively.

This suggests that the router does exhibit some structured syntactic behavior. Figure 8 shows examples of text from different domains (Python code, mathematics, and English), where each token is highlighted with a background color corresponding to its selected expert. The figure shows that words such as ‘self’ in Python and ‘Question’ in English often get routed through the same expert even though they involve multiple tokens. Similarly, in code, the indentation tokens are always assigned to the same experts, particularly at the first and last layers where the hidden states are more correlated to the input and output of the model.

We also note from Figure 8 that consecutive tokens are often assigned the same experts. In fact, we observe some degree of positional locality in The Pile datasets. Table 5 shows the proportion of consecutive tokens that get the same expert assignments per domain and layer. The proportion of repeated consecutive assignments is significantly higher than random for higher layers. This has implications in how one might optimize the model for fast training and inference. For example, cases with high locality are more likely to cause over-subscription of certain experts when doing Expert Parallelism. Conversely, this locality can be leveraged for caching, as is done in [11]. A more complete view of these same-expert frequencies is provided for all layers and across datasets in Figure 10 in the Appendix.

Figure 7: Proportion of tokens assigned to each expert on different domains from The Pile dataset for layers 0, 15, and 31. The gray dashed vertical line marks 1/8, i.e. the proportion expected with uniform sampling. Here, we consider experts that are either selected as a first or second choice by the router. A breakdown of the proportion of assignments done in each case can be seen in Figure 9 in the Appendix.

Table 5: Percentage of expert assignment repetitions. We evaluate the proportion of times the same expert is assigned to a token $i$ and its following token $i+1$. We report whether the first chosen expert is the same, or whether the same expert is observed as first or second choice in consecutive tokens. For reference, the expected proportion of repetitions in the case of random assignments is $\frac{1}{8} = 12.5\%$ for “First choice” and $1 - \frac{6}{8} \cdot \frac{5}{7} \approx 46\%$ for “First and second choice”. Repetitions at the first layer are close to random, but are significantly higher at layers 15 and 31. The high number of repetitions shows that expert choice exhibits high temporal locality at these layers.
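The random-assignment baselines and the repetition statistic itself can be checked with a short sketch. It assumes that "first or second choice" means the two-expert sets of consecutive tokens share at least one expert, which is the reading consistent with the roughly 46% baseline. The function names and the toy routing trace are hypothetical.

```python
from math import comb
import numpy as np

def random_repeat_baselines(n_experts=8, k=2):
    """Expected repetition rates if every token picked its experts uniformly at random."""
    first = 1 / n_experts
    # Two random k-subsets are disjoint with probability C(n - k, k) / C(n, k).
    any_choice = 1 - comb(n_experts - k, k) / comb(n_experts, k)
    return first, any_choice

def repeat_rates(top_experts):
    """Measured repetition rates for a (tokens, k) routing trace; column 0 is the first choice."""
    prev, nxt = top_experts[:-1], top_experts[1:]
    first = np.mean(prev[:, 0] == nxt[:, 0])
    # True where token i and token i+1 share at least one selected expert.
    overlap = (prev[:, :, None] == nxt[:, None, :]).any(axis=(1, 2))
    return float(first), float(overlap.mean())

base_first, base_any = random_repeat_baselines()      # 0.125 and 13/28 (about 0.464)
trace = np.array([[0, 1], [0, 2], [3, 4], [4, 5]])    # toy trace for 4 consecutive tokens
obs_first, obs_any = repeat_rates(trace)
```

Comparing `obs_first` and `obs_any` on a real routing trace against the two baselines is one way to quantify the temporal locality reported for layers 15 and 31.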

6 Conclusion

In this paper, we introduced Mixtral 8x7B, the first mixture-of-experts network to reach a state-of-the-art performance among open-source models. Mixtral 8x7B Instruct outperforms Claude-2.1, Gemini Pro, and GPT-3.5 Turbo on human evaluation benchmarks. Because it only uses two experts at each time step, Mixtral only uses 13B active parameters per token while outperforming the previous best model using 70B parameters per token (Llama 2 70B). We are making our trained and fine-tuned models publicly available under the Apache 2.0 license. By sharing our models, we aim to facilitate the development of new techniques and applications that can benefit a wide range of industries and domains.

Figure 8: Text samples where each token is colored with the first expert choice. The selection of experts appears to be more aligned with the syntax rather than the domain, especially at the initial and final layers.
