
MoE | Understanding MoE

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-15

Towards Understanding Mixture of Experts in Deep Learning

  • url: https://arxiv.org/abs/2208.02813
  • pdf: https://arxiv.org/pdf/2208.02813
  • abstract: The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a router, has achieved great success in deep learning. However, the understanding of such architecture remains elusive. In this paper, we formally study how the MoE layer improves the performance of neural network learning and why the mixture model will not collapse into a single model. Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE. To further understand this, we consider a challenging classification problem with intrinsic cluster structures, which is hard to learn using a single expert. Yet with the MoE layer, by choosing the experts as two-layer nonlinear convolutional neural networks (CNNs), we show that the problem can be learned successfully. Furthermore, our theory shows that the router can learn the cluster-center features, which helps divide the input complex problem into simpler linear classification sub-problems that individual experts can conquer. To our knowledge, this is the first result towards formally understanding the mechanism of the MoE layer for deep learning.

[MoE Key Index Marking]


Contents

TL;DR


  1. Problem and background: improving the understanding of the MoE architecture and studying its practical application
  2. Method: investigating how and why a nonlinear MoE model learns the cluster-center signals and achieves high test accuracy
  3. Main results: the nonlinear MoE model accurately learns the cluster structure and achieves high test accuracy

1. Introduction

The Mixture-of-Experts (MoE) architecture is widely used as an effective way to scale up neural networks for solving complex problems. In particular, by dynamically assigning experts, it offers efficient computation together with high accuracy. This paper provides a theoretical understanding of the MoE architecture, along with a mathematical analysis of how nonlinear expert networks specialize in specific clusters and thereby achieve strong performance.


2. Related Work

The MoE model is an architecture that combines multiple expert models to increase model capacity on complex datasets, and early work applied it to a wide range of machine learning problems. More recently, it has been applied to deep learning architectures, achieving efficient training and reduced inference time. These studies provide successful applications of MoE, but a theoretical justification has not yet been fully established.


3. Problem Setting and Preliminaries

3.1 Data Distribution

This work assumes a “mixture of classification” data distribution. Each of several clusters generates data using its own feature vectors, and each data point consists of multiple patches, which allows the MoE to learn more effectively through experts specialized for each cluster.

3.2 MoE Architecture

An MoE layer consists of multiple expert networks and a single gating network; the gating function dynamically selects the best expert for each input to maximize processing efficiency. In this work, each expert is a two-layer CNN with a nonlinear activation function.


4. Main Results

4.1 Single-Expert Performance

A single expert, regardless of its activation function, performs poorly on the proposed data distribution (test accuracy of at most 87.5%). This demonstrates the limitation of a single expert and motivates the need for multiple experts.

4.2 Nonlinear MoE Performance

The nonlinear MoE accurately learns the cluster structure and achieves high test accuracy, which means that each expert specializes in a specific cluster and the gating network correctly recognizes this and dispatches the data accordingly.


5. Key Techniques

The success of MoE stems from the gating network learning the cluster-center features and, based on them, dividing the problem into simpler sub-problems that individual experts can solve. Techniques are introduced to address the routing discontinuity and expert load-imbalance problems that can arise in this process.

6. Experiments

The nonlinear MoE model performed well on both synthetic and real datasets. In particular, the advantage of MoE was more pronounced on datasets with a clear cluster structure.

(I had a similar experience previously when building a classification model with BERT; see the related GitHub.)

7. Conclusion

This work deepens the theoretical understanding of the MoE architecture and provides a mathematical basis for how nonlinear expert networks operate effectively on cluster-structured data.


[Reference 1] Mathematical Understanding of MoE and Model Architecture

Mixture-of-Experts (MoE) is a model architecture that solves complex problems by combining multiple experts: several experts, each optimized for a specific task, are combined, and the most suitable expert is selected dynamically depending on the input.


1. Basic Structure of MoE

An MoE model basically consists of the following components:

  • Expert networks: each expert is designed to be optimized for a particular type of data or task.
  • Gating network: analyzes the input data and decides which expert should process it.

Mathematically, the output of an MoE is expressed as a weighted sum of all experts' outputs.

\[F(x) = \sum_{m=1}^M \pi_m(x) f_m(x)\]

\(F(x)\) is the model output, \(\pi_m(x)\) is the weight of expert \(m\) computed by the gating network for input \(x\), and \(f_m(x)\) is the output of expert \(m\).

1.1 Definition of the Gating Function

The output of the gating network is normalized with a softmax function into weights over all experts.

\[\pi_m(x) = \frac{\exp(g_m(x))}{\sum_{j=1}^M \exp(g_j(x))}\]

\(g_m(x)\) is the output of the gating network, which scores the suitability of each expert.
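
The two formulas above fit in a few lines of code. Below is a minimal sketch, assuming toy linear experts and small hypothetical sizes (`d = 8`, `M = 4`); it computes the gate values \(\pi_m(x)\) with a softmax over the gating logits and returns the weighted sum \(F(x)\). The function names and the linear experts are illustrative placeholders, not part of the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def soft_moe_forward(x, Theta, expert_weights):
    """Soft-routing MoE: F(x) = sum_m pi_m(x) * f_m(x).

    x              : (d,) input vector
    Theta          : (d, M) gating parameters, one column per expert
    expert_weights : list of M (d,) vectors; f_m(x) = w_m . x (toy linear experts)
    """
    g = Theta.T @ x                                 # gating logits g_m(x)
    pi = softmax(g)                                 # gate values pi_m(x), sum to 1
    f = np.array([w @ x for w in expert_weights])   # expert outputs f_m(x)
    return float(pi @ f), pi                        # weighted sum F(x) and the gates

# Toy usage with hypothetical sizes: d = 8 features, M = 4 experts.
rng = np.random.default_rng(0)
d, M = 8, 4
x = rng.normal(size=d)
Theta = rng.normal(size=(d, M))
experts = [rng.normal(size=d) for _ in range(M)]
F_x, gates = soft_moe_forward(x, Theta, experts)
print(F_x, gates)
```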


2. Motivation for MoE Training

Because each expert is trained on a particular subset of the data, an MoE can learn efficiently even on large datasets with complex data distributions. The core of MoE is to dynamically select the most suitable expert for each input; this dynamic selection mechanism optimizes the performance of the overall model.

2.1 Role of the Gating Network

The gating network plays an important role during training. For each input, it evaluates the suitability of all experts and, based on this information, assigns the input to the most suitable expert. This process can be described by the following optimization problem.

\[\min_{\Theta} \mathcal{L}(F(x; \Theta), y)\]

\(\Theta\) denotes all model parameters, \(\mathcal{L}\) is the loss function, and \(y\) is the ground-truth label; the training objective of the gating network is to find the parameters \(\Theta\) that minimize the loss of the whole MoE model.

2.2 Expert Specialization

How the experts specialize depends on the characteristics of the training dataset. Each expert is trained to recognize particular patterns or features, which happens through the following optimization.

\[\min_{W_m} \sum_{x \in X_m} \mathcal{L}(f_m(x; W_m), y)\]

\(X_m\) is the set of inputs assigned to expert \(m\), and \(W_m\) denotes the parameters of expert \(m\).
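
As a hedged illustration of the specialization objective above, the sketch below refits each toy expert only on the subset \(X_m\) currently assigned to it by hard routing. The ridge-regularized least-squares experts and the random assignments are simplifications chosen for brevity; they are not the CNN experts analyzed in the paper.

```python
import numpy as np

def train_experts_on_assignments(X, y, assignments, M):
    """Fit expert m (here: ridge-regularized least squares) only on X_m,
    the examples currently routed to it -- a toy version of
    min_{W_m} sum_{x in X_m} L(f_m(x; W_m), y)."""
    d = X.shape[1]
    experts = []
    for m in range(M):
        mask = assignments == m
        if not mask.any():                         # expert received no data this round
            experts.append(np.zeros(d))
            continue
        Xm, ym = X[mask], y[mask]
        W = np.linalg.solve(Xm.T @ Xm + 1e-3 * np.eye(d), Xm.T @ ym)
        experts.append(W)
    return experts

# Hypothetical usage: 100 examples, d = 5 features, M = 3 experts, random routing.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = rng.choice([-1.0, 1.0], size=100)
assignments = rng.integers(0, 3, size=100)
experts = train_experts_on_assignments(X, y, assignments, M=3)
```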


3. Theoretical Results and Empirical Validation

Experiments confirm that MoE models achieve higher accuracy than single-expert models. This is presumably because the MoE learns the characteristics of diverse data better and routes each input to the most suitable expert.


1 Introduction

The Mixture-of-Expert (MoE) structure (Jacobs et al., 1991; Jordan and Jacobs, 1994) is a classic design that substantially scales up the model capacity and only introduces small computation overhead. In recent years, the MoE layer (Eigen et al., 2013; Shazeer et al., 2017), which is an extension of the MoE model to deep neural networks, has achieved remarkable success in deep learning. Generally speaking, an MoE layer contains many experts that share the same network architecture and are trained by the same algorithm, with a gating (or routing) function that routes individual inputs to a few experts among all the candidates. Through the sparse gating function, the router in the MoE layer can route each input to the top-\(K\) (\(K \geq 2\)) best experts (Shazeer et al., 2017), or the single (\(K = 1\)) best expert (Fedus et al., 2021). This routing scheme only costs the computation of \(K\) experts for a new input, which enjoys fast inference time.

Despite the great empirical success of the MoE layer, the theoretical understanding of such architecture is still elusive. In practice, all experts have the same structure, initialized from the same weight distribution (Fedus et al., 2021) and are trained with the same optimization configuration. The router is also initialized to dispatch the data uniformly. It is unclear why the experts can diverge to different functions that are specialized to make predictions for different inputs, and why the router can automatically learn to dispatch data, especially when they are all trained using simple local search algorithms such as gradient descent. Therefore, we aim to answer the following questions:

Why do the experts in MoE diversify instead of collapsing into a single model? And how can the router learn to dispatch the data to the right expert?

In this paper, in order to answer the above question, we consider the natural “mixture of classification” data distribution with cluster structure and theoretically study the behavior and benefit of the MoE layer. We focus on the simplest setting of the mixture of linear classification, where the data distribution has multiple clusters, and each cluster uses separate (linear) feature vectors to represent the labels. In detail, we consider the data generated as a combination of feature patches, cluster patches, and noise patches (See Definition 3.1 for more details). We study training an MoE layer based on the data generated from the “mixture of classification” distribution using gradient descent, where each expert is chosen to be a two-layer CNN. The main contributions of this paper are summarized as follows:

  • We first prove a negative result (Theorem 4.1) that any single expert, such as two-layer CNNs with arbitrary activation function, cannot achieve a test accuracy of more than 87.5% on our data distribution.
  • Empirically, we found that the mixture of linear experts performs better than the single expert but is still significantly worse than the mixture of non-linear experts. Figure 1 provides such a result in a special case of our data distribution with four clusters. Although a mixture of linear models can represent the labeling function of this data distribution with 100% accuracy, it fails to learn so after training. We can see that the underlying cluster structure cannot be recovered by the mixture of linear experts, and neither the router nor the experts are diversified enough after training. In contrast, the mixture of non-linear experts can correctly recover the cluster structure and diversify.
  • Motivated by the negative result and the experiment on the toy data, we study a sparsely-gated MoE model with two-layer CNNs trained by gradient descent. We prove that this MoE model can achieve nearly 100% test accuracy efficiently (Theorem 4.2).
  • Along with the result on the test accuracy, we formally prove that each expert of the sparsely-gated MoE model will be specialized to a specific portion of the data (i.e., at least one cluster), which is determined by the initialization of the weights. In the meantime, the router can learn the cluster-center features and route the input data to the right experts.
  • Finally, we also conduct extensive experiments on both synthetic and real datasets to corroborate our theory.

Notation. We use lower case letters, lower case bold face letters, and upper case bold face letters to denote scalars, vectors, and matrices respectively. We denote a union of disjoint sets \((A_i : i \in I)\) by \(\bigsqcup_{i \in I} A_i\). For a vector \(x\), we use \(\|x\|_2\) to denote its Euclidean norm. For a matrix \(W\), we use \(\|W\|_F\) to denote its Frobenius norm. Given two sequences \(\{x_n\}\) and \(\{y_n\}\), we denote \(x_n = O(y_n)\) if \(\|x_n\| \leq C_1 \|y_n\|\) for some absolute positive constant \(C_1\), \(x_n = \Omega(y_n)\) if \(\|x_n\| \geq C_2 \|y_n\|\) for some absolute positive constant \(C_2\), and \(x_n = \Theta(y_n)\) if \(C_3 \|y_n\| \leq \|x_n\| \leq C_4 \|y_n\|\) for some absolute constants \(C_3, C_4 > 0\). We also use \(\tilde{O}(\cdot)\) to hide logarithmic factors of \(d\) in \(O(\cdot)\). Additionally, we denote \(x_n = \text{poly}(y_n)\) if \(x_n = O(y_n^D)\) for some positive constant \(D\), and \(x_n = \text{polylog}(y_n)\) if \(x_n = \text{poly}(\log(y_n))\). We also denote by \(x_n = o(y_n)\) if \(\lim_{n \to \infty} x_n / y_n = 0\). Finally we use \([N]\) to denote the index set \(\{1, \ldots, N\}\).

Figure 1: Visualization of the training of MoE with nonlinear expert and linear expert. Different colors denote router’s dispatch to different experts. The lines denote the decision boundary of the MoE model. The data points are visualized on 2d space via t-SNE (Van der Maaten and Hinton, 2008). The MoE architecture follows Section 3 where nonlinear experts use activation function \(\sigma(z) = z^3\). For this visualization, we let the expert number \(M = 4\) and cluster number \(K = 4\). We generate \(n = 1,600\) data points from the distribution illustrated in Section 3 with \(\alpha \in (0.5, 2)\), \(\beta \in (1, 2)\), \(\gamma \in (1, 2)\), and \(\sigma_p = 1\). More details of the visualization are discussed in Appendix A.

Mixture of Experts Model. The mixture of experts model (Jacobs et al., 1991; Jordan and Jacobs, 1994) has long been studied in the machine learning community. These MoE models are based on various base expert models such as support vector machines (Collobert et al., 2002), Gaussian processes (Tresp, 2001), or hidden Markov models (Jordan et al., 1997). In order to increase the model capacity to deal with complex vision and speech data, Eigen et al. (2013) extended the MoE structure to deep neural networks, and proposed a deep MoE model composed of multiple layers of routers and experts. Shazeer et al. (2017) simplified the MoE layer by making the output of the gating function sparse for each example, which greatly improves the training stability and reduces the computational cost. Since then, the MoE layer with different base neural network structures (Shazeer et al., 2017; Dauphin et al., 2017; Vaswani et al., 2017) has been proposed and achieved tremendous successes in a variety of language tasks. Very recently, Fedus et al. (2021) improved the performance of the MoE layer by routing one example to only a single expert instead of \(K\) experts, which further reduces the routing computation while preserving the model quality.

In this paper, we consider a “mixture of classification” model. This type of model can be dated back to (De Veaux, 1989; Jordan and Jacobs, 1994; Faria and Soromenho, 2010) and has been applied to many tasks including object recognition (Quattoni et al., 2004), human action recognition (Wang and Mori, 2009), and machine translation (Liang et al., 2006). In order to learn the unknown parameters for mixture of linear regressions/classification models, (Anandkumar et al., 2012; Hsu et al., 2012; Chaganty and Liang, 2013; Anandkumar et al., 2014; Li and Liang, 2018) studied the method of moments and tensor factorization. Another line of work studies specific algorithms such as Expectation-Maximization (EM) algorithm (Khalili and Chen, 2007; Yi et al., 2014; Balakrishnan et al., 2017; Wang et al., 2015).

Theoretical Understanding of Deep Learning. In recent years, great efforts have been made to establish the theoretical foundation of deep learning. A series of studies have proved the convergence (Jacot et al., 2018; Li and Liang, 2018; Du et al., 2019; Allen-Zhu et al., 2019b; Zou et al., 2018) and generalization (Allen-Zhu et al., 2019a; Arora et al., 2019a,b; Cao and Gu, 2019) guarantees in the so-called “neural tangent kernel” (NTK) regime, where the parameters stay close to the initialization, and the neural network function is approximately linear in its parameters. A recent line of works (Allen-Zhu and Li, 2019; Bai and Lee, 2019; Allen-Zhu and Li, 2020a,b,c; Li et al., 2020; Cao et al., 2022; Zou et al., 2021; Wen and Li, 2021) studied the learning dynamic of neural networks beyond the NTK regime. It is worthwhile to mention that our analysis of the MoE model is also beyond the NTK regime.

3 Problem Setting and Preliminaries

We consider an MoE layer with each expert being a two-layer CNN trained by gradient descent (GD) over \(n\) independent training examples \(\{(x_i, y_i)\}_{i=1}^n\) generated from a data distribution \(D\). In this section, we will first introduce our data model \(D\), and then explain our neural network model and the details of the training algorithm.

3.1 Data distribution

We consider a binary classification problem over \(P\)-patch inputs, where each patch has \(d\) dimensions. In particular, each labeled data point is represented by \((x, y)\), where the input \(x = (x^{(1)}, x^{(2)}, \ldots, x^{(P)}) \in (\mathbb{R}^d)^P\) is a collection of \(P\) patches and \(y \in \{\pm 1\}\) is the data label. We consider data generated from \(K\) clusters. Each cluster \(k \in [K]\) has a label signal vector \(v_k\) and a cluster-center signal vector \(c_k\) with \(\|v_k\|_2 = \|c_k\|_2 = 1\). For simplicity, we assume that all the signals \(\{v_k\}_{k \in [K]} \cup \{c_k\}_{k \in [K]}\) are orthogonal to each other.

Definition 3.1. A data pair \((x, y) \in (\mathbb{R}^d)^P \times \{\pm 1\}\) is generated from the distribution \(D\) as follows (a minimal simulation sketch is given after the list):

  • Uniformly draw a pair \((k, k')\) with \(k \neq k'\) from \(\{1, \ldots, K\}\).
  • Generate the label \(y \in \{\pm 1\}\) uniformly, generate a Rademacher random variable \(\epsilon \in \{\pm 1\}\).
  • Independently generate random variables \(\alpha, \beta, \gamma\) from distribution \(D_\alpha, D_\beta, D_\gamma\). In this paper, we assume there exist absolute constants \(C_1, C_2\) such that almost surely \(0 < C_1 \leq \alpha, \beta, \gamma \leq C_2\).
  • Generate \(x\) as a collection of \(P\) patches: \(x = (x^{(1)}, x^{(2)}, \ldots, x^{(P)}) \in (\mathbb{R}^d)^P\), where:
    • Feature signal. One and only one patch is given by \(y \alpha v_k\).
    • Cluster-center signal. One and only one patch is given by \(\beta c_k\).
    • Feature noise. One and only one patch is given by \(\epsilon \gamma v_{k'}\).
    • Random noise. The rest of the \(P - 3\) patches are Gaussian noises independently drawn from \(N(0, (\sigma_p^2 / d) \cdot I_d)\), where \(\sigma_p\) is an absolute constant.
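
The generative process of Definition 3.1 is easy to simulate. The snippet below is a minimal NumPy reading of the definition: it assumes orthonormal signal vectors taken from a random orthogonal basis and \(\alpha, \beta, \gamma \sim \text{Uniform}(0.5, 2)\) (the paper only requires them to be bounded positive constants), and it does not shuffle the patch order.

```python
import numpy as np

def make_signals(d, K, rng):
    """2K orthonormal vectors: label signals v_k and cluster-center signals c_k."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, 2 * K)))
    return Q[:, :K].T, Q[:, K:].T                     # v: (K, d), c: (K, d)

def sample_example(v, c, P, sigma_p, rng):
    """One draw from the 'mixture of classification' distribution (Definition 3.1)."""
    K, d = v.shape
    k, k2 = rng.choice(K, size=2, replace=False)      # cluster pair (k, k'), k != k'
    y = rng.choice([-1, 1])                           # uniform label
    eps = rng.choice([-1, 1])                         # Rademacher sign for the feature noise
    alpha, beta, gamma = rng.uniform(0.5, 2.0, size=3)  # assumed Uniform(0.5, 2)
    patches = rng.normal(scale=sigma_p / np.sqrt(d), size=(P, d))  # N(0, (sigma_p^2/d) I_d)
    patches[0] = y * alpha * v[k]                     # feature signal
    patches[1] = beta * c[k]                          # cluster-center signal
    patches[2] = eps * gamma * v[k2]                  # feature noise
    return patches, y, k

# Hypothetical usage mirroring the synthetic experiments: d = 50, K = 4, P = 4.
rng = np.random.default_rng(0)
v, c = make_signals(d=50, K=4, rng=rng)
x, y, cluster = sample_example(v, c, P=4, sigma_p=1.0, rng=rng)
```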

How to learn this type of data? Since the positions of signals and noises are not specified in Definition 3.1, it is natural to use the CNN structure that applies the same function to each patch. We point out that the strength of the feature noises \(\gamma\) could be as large as the strength of the feature signals \(\alpha\). As we will see later in Theorem 4.1, this classification problem is hard to learn with a single expert, such as any two-layer CNNs (any activation function with any number of neurons). However, such a classification problem has an intrinsic clustering structure that may be utilized to achieve better performance. Examples can be divided into \(K\) clusters \(\cup_{k \in [K]} \Omega_k\) based on the cluster-center signals: an example \((x, y) \in \Omega_k\) if and only if at least one patch of \(x\) aligns with \(c_k\). It is not difficult to show that the binary classification sub-problem over \(\Omega_k\) can be easily solved by an individual expert. We expect the MoE can learn this data cluster structure from the cluster-center signals.

Significance of our result. Although this data can be learned by existing works on a mixture of linear classifiers with sophisticated algorithms (Anandkumar et al., 2012; Hsu et al., 2012; Chaganty and Liang, 2013), the focus of our paper is training a mixture of nonlinear neural networks, a more practical model used in real applications. When an MoE is trained by variants of gradient descent, we show that the experts automatically learn to specialize on each cluster, while the router automatically learns to dispatch the data to the experts according to their specialty. Although from a representation point of view, it is not hard to see that the concept class can be represented by MoEs, our result is very significant as we prove that gradient descent from random initialization can find a good MoE with non-linear experts efficiently. To make our results even more compelling, we empirically show that MoE with linear experts, despite also being able to represent the concept class, cannot be trained to find a good classifier efficiently.

3.2 Structure of the MoE layer

An MoE layer consists of a set of \(M\) “expert networks” \(f_1, \ldots, f_M\), and a gating network which is generally set to be linear (Shazeer et al., 2017; Fedus et al., 2021). Denote by \(f_m(x; W)\) the output of the \(m\)-th expert network with input \(x\) and parameter \(W\). Define an \(M\)-dimensional vector \(h(x; \Theta) = \sum_{p \in [P]} \Theta^\top x^{(p)}\) as the output of the gating network parameterized by \(\Theta = [\theta_1, \ldots, \theta_M] \in \mathbb{R}^{d \times M}\). The output \(F\) of the MoE layer can be written as follows:

\[F = \sum_{m \in T_x} \pi_m(x; \Theta) f_m(x; W)\]

where \(T_x \subseteq [M]\) is a set of selected indices and \(\pi_m(x; \Theta)\)’s are route gate values given by

\[\pi_m(x; \Theta) = \frac{e^{h_m(x; \Theta)}}{\sum_{j \in T_x} e^{h_j(x; \Theta)}}\]

Expert Model. In practice, one often uses nonlinear neural networks as experts in the MoE layer. In fact, we found that the non-linearity of the expert is essential for the success of the MoE layer (see Section 6). For the \(m\)-th expert, we consider a convolutional neural network as follows:

\[f_m(x; W) = \sum_{j=1}^{J} \sum_{p=1}^{P} \sigma(w_{m,j}^\top x^{(p)})\]

where \(w_{m,j} \in \mathbb{R}^d\) is the weight vector of the \(j\)-th filter (i.e., neuron) in the \(m\)-th expert, \(J\) is the number of filters (i.e., neurons). We denote \(W_m = [w_{m,1}, \ldots, w_{m,J}] \in \mathbb{R}^{d \times J}\) as the weight matrix of the \(m\)-th expert and further let \(W = \{W_m\}_{m \in [M]}\) as the collection of expert weight matrices. For nonlinear CNN, we consider the cubic activation function \(\sigma(z) = z^3\), which is one of the simplest nonlinear activation functions (Vecci et al., 1998). We also include the experiment for other activation functions such as RELU in Appendix Table 7.

Top-1 Routing Model. A simple choice of the selection set \(T_x\) would be the whole expert set \(T_x = [M]\) (Jordan and Jacobs, 1994), which is the case for the so-called soft-routing model. However, it would be time-consuming to use soft-routing in deep learning. In this paper, we consider “switch routing”, which is introduced by Fedus et al. (2021) to make the gating network sparse and save computation time. For each input \(x\), instead of using all the experts, we only pick one expert from \([M]\), i.e., \(|T_x| = 1\). In particular, we choose \(T_x = \arg\max_m \{h_m(x; \Theta)\}\).
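
Putting the pieces of this section together, here is a hedged sketch of a noisy top-1 (“switch”) forward pass: a linear gating network summed over patches, uniform routing noise as used during training (see Section 3.3), and cubic-activation experts that apply the same filters to every patch. Treat it as an illustrative reading of the setup rather than the paper's exact implementation; all sizes in the usage example are hypothetical.

```python
import numpy as np

def cubic_expert(x_patches, W_m):
    """Two-layer CNN expert with cubic activation:
    sum over patches p and filters j of sigma(w_{m,j} . x^(p)), sigma(z) = z**3."""
    return float(((x_patches @ W_m) ** 3).sum())      # (P, d) @ (d, J) -> (P, J), then sum

def top1_moe_forward(x_patches, Theta, experts, rng, train=True):
    """Noisy top-1 routing: pick argmax_m h_m(x) + r_m with r_m ~ Unif[0, 1] at training time."""
    h = x_patches.sum(axis=0) @ Theta                 # h_m(x) = sum_p theta_m . x^(p)
    noise = rng.uniform(0.0, 1.0, size=h.shape) if train else 0.0
    m = int(np.argmax(h + noise))                     # selected expert, T_x = {m}
    pi_m = 1.0                                        # softmax over a single index equals 1
    return pi_m * cubic_expert(x_patches, experts[m]), m

# Hypothetical usage: P = 4 patches, d = 50, M = 8 experts, J = 16 filters per expert.
rng = np.random.default_rng(0)
P, d, M, J = 4, 50, 8, 16
x = rng.normal(size=(P, d))
Theta = np.zeros((d, M))                              # zero-initialized gating network
experts = [rng.normal(scale=0.01, size=(d, J)) for _ in range(M)]
out, routed_to = top1_moe_forward(x, Theta, experts, rng)
```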


Figure 2: Illustration of an MoE layer. For each input x, the router will only select one expert to perform computations. The choice is based on the output of the gating network (dotted line). The expert layer returns the output of the selected expert (gray box) multiplied by the route gate value (softmax of the gating function output).

3.3 Training Algorithm


Algorithm 1 Gradient descent with random initialization

Require: Number of iterations \(T\), expert learning rate \(\eta\), router learning rate \(\eta_r\), initialization scale \(\sigma_0\), training set \(S = \{(x_i, y_i)\}_{i=1}^n\).

In Algorithm 1, \(\ell\) is the logistic loss defined as \(\ell(z) = \log(1 + \exp(-z))\). We initialize \(\Theta^{(0)}\) to be zero and initialize each entry of \(W^{(0)}\) by i.i.d. \(\mathcal{N}(0, \sigma_0^2)\). Zero initialization of the gating network is widely used in MoE training. As discussed in Shazeer et al. (2017), it can help avoid out-of-memory errors and initialize the network in a state of approximately equal expert load (see (5.1) for the definition of expert load).

Instead of directly using the gradient of empirical loss (3.2) to update weights, we add perturbation to the router and use the gradient of the perturbed empirical loss to update the weights. In particular, the training example \(x_i\) will be distributed to \(\arg\max_m \{h_m(x_i; \Theta^{(t)}) + r_m^{(t), i}\}\) instead, where \(\{r_m^{(t), i}\}_{m \in [M], i \in [n]}\) are random noises. Adding noise term is a widely used training strategy for sparsely-gated MoE layer (Shazeer et al., 2017; Fedus et al., 2021), which can encourage exploration across the experts and stabilize the MoE training. In this paper, we draw \(\{r_m^{(t), i}\}_{m \in [M], i \in [n]}\) independently from the uniform distribution \(\text{Unif}[0, 1]\) and denote its collection as \(r^{(t)}\). Therefore, the perturbed empirical loss at iteration \(t\) can be written as

\[\mathcal{L}^{(t)}(\Theta, W) = \sum_{i=1}^{n} \ell\left(y_i \sum_{m \in T_{x_i}} \pi_m(x_i; \Theta) f_m(x_i; W)\right)\]

where the selection set \(T_{x_i}\) is determined by the perturbed gating outputs described above. The experts are updated using a normalized gradient step with expert learning rate \(\eta > 0\) (the exact update is given in Algorithm 1). Starting from the initialization \(\Theta^{(0)}\), the gradient update rule for the gating network is

\[\Theta^{(t+1)} = \Theta^{(t)} - \eta_r \nabla_\Theta \mathcal{L}^{(t)}(\Theta^{(t)}, W^{(t)})\]

where \(\eta_r > 0\) is the router learning rate. In practice, the experts are trained by Adam (Kingma and Ba, 2015) to make sure they have similar learning speeds. Here we use a normalized gradient, which can be viewed as a simpler alternative to Adam (Jelassi et al., 2021).
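
A minimal sketch of the normalized-gradient idea, under the assumption that each expert's gradient is rescaled by its Frobenius norm before the step (the gradients themselves are left abstract, and the exact normalization in Algorithm 1 may differ in detail): experts that receive more data get larger raw gradients, but after normalization all experts move at roughly the same speed, which breaks the self-reinforcing load-imbalance loop discussed in Section 5.

```python
import numpy as np

def normalized_gradient_step(W_experts, grads, eta, eps=1e-12):
    """Per-expert normalized update: W_m <- W_m - eta * grad_m / ||grad_m||_F."""
    new_W = []
    for W_m, g_m in zip(W_experts, grads):
        norm = np.linalg.norm(g_m) + eps              # Frobenius norm of this expert's gradient
        new_W.append(W_m - eta * g_m / norm)
    return new_W

def router_step(Theta, grad_Theta, eta_r):
    """Plain (unnormalized) gradient step for the gating network."""
    return Theta - eta_r * grad_Theta

# Hypothetical usage with random placeholder gradients (d = 50, J = 16, M = 8).
rng = np.random.default_rng(0)
experts = [rng.normal(scale=0.01, size=(50, 16)) for _ in range(8)]
grads = [rng.normal(size=(50, 16)) for _ in range(8)]
experts = normalized_gradient_step(experts, grads, eta=0.1)
```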

4 Main Results

In this section, we will present our main results. We first provide a negative result for learning with a single expert.

Theorem 4.1 (Single expert performs poorly). Suppose \(D_\alpha = D_\gamma\) in Definition 3.1, then any function with the form \(F(x) = \sum_{p=1}^{P} f(x^{(p)})\) will get large test error \(P_{(x,y) \sim D}(yF(x) \leq 0) \geq \frac{1}{8}\).

Theorem 4.1 indicates that if the feature noise has the same strength as the feature signal, i.e., \(D_\alpha = D_\gamma\), any two-layer CNN of the form \(F(x) = \sum_{p=1}^{P} \sum_{j} \sigma(w_j^\top x^{(p)} + b_j)\) cannot perform well on the classification problem defined in Definition 3.1, where \(\sigma\) can be any activation function. Theorem 4.1 also shows that a simple ensemble of the experts may not improve the performance, because the ensemble of the two-layer CNNs is still of the form of the function defined in Theorem 4.1.

As a comparison, the following theorem gives the learning guarantees for training an MoE layer that follows the structure defined in Section 3.2 with cubic activation function.

Theorem 4.2 (Nonlinear MoE performs well). Suppose the training data size \(n = \Omega(d)\). Choose the number of experts \(M = \Theta(K \log K \log \log d)\), filter size \(J = \Theta(\log M \log \log d)\), initialization scale \(\sigma_0 \in [d^{-1/3}, d^{-0.01}]\), learning rate \(\eta = \tilde{O}(\sigma_0)\), \(\eta_r = \Theta(M^2)\eta\). Then with probability at least \(1 - o(1)\), Algorithm 1 is able to output \((\Theta^{(T)}, W^{(T)})\) within \(T = \tilde{O}(\eta^{-1})\) iterations such that the non-linear MoE defined in Section 3.2 satisfies:

  • Training error is zero, i.e., \(y_i F(x_i; \Theta^{(T)}, W^{(T)}) > 0, \forall i \in [n]\).
  • Test error is nearly zero, i.e., \(P_{(x,y) \sim D}(yF(x; \Theta^{(T)}, W^{(T)}) \leq 0) = o(1)\).

More importantly:

  • (Each expert is good on one cluster) Each expert \(m \in M_k\) performs well on the cluster \(\Omega_k\): \(P_{(x,y) \sim D}(y f_m(x; W^{(T)}) \leq 0 \mid (x, y) \in \Omega_k) = o(1)\).
  • (Router only distributes examples to good experts) With probability at least \(1 - o(1)\), an example \(x \in \Omega_k\) will be routed to one of the experts in \(M_k\).

Theorem 4.2 shows that a non-linear MoE performs well on the classification problem in Definition 3.1. In addition, the router will learn the cluster structure and divide the problem into \(K\) simpler sub-problems, each of which is associated with one cluster. In particular, each cluster will be classified accurately by a subset of experts. On the other hand, each expert will perform well on at least one cluster.

Furthermore, together with Theorem 4.1, Theorem 4.2 suggests that there exist problem instances in Definition 3.1 (i.e., \(D_\alpha = D_\gamma\)) such that an MoE provably outperforms a single expert.

5 Overview of Key Techniques

A successful MoE layer needs to ensure that the router can learn the cluster-center features and divide the complex problem in Definition 3.1 into simpler linear classification sub-problems that individual experts can conquer. Finding such a gating network is difficult because this problem is highly non-convex. In the following, we will introduce the main difficulties in analyzing the MoE layer and the corresponding key techniques to overcome those barriers.

Main Difficulty 1: Discontinuities in Routing. Compared with the traditional soft-routing model, the sparse routing model saves computation and greatly reduces the inference time. However, this form of sparsity also causes discontinuities in routing (Shazeer et al., 2017). In fact, even a small perturbation of the gating network outputs \(h(x; \Theta) + \delta\) may change the router behavior drastically if the second largest gating network output is close to the largest gating network output.

Key Technique 1: Stability by Smoothing. We point out that the noise term added to the gating network output ensures a smooth transition between different routing behavior, which makes the router more stable. This is proved in the following lemma.

Lemma 5.1. Let \(h, \hat{h} \in \mathbb{R}^M\) be the output of the gating network and \(\{r_m\}_{m=1}^M\) be the noise independently drawn from \(\text{Unif}[0,1]\). Denote by \(p, \hat{p} \in \mathbb{R}^M\) the probabilities that the experts get routed, i.e., \(p_m = P(\arg\max_{m' \in [M]} \{h_{m'} + r_{m'}\} = m)\) and \(\hat{p}_m = P(\arg\max_{m' \in [M]} \{\hat{h}_{m'} + r_{m'}\} = m)\). Then we have that \(\|p - \hat{p}\|_\infty \leq M^2 \|h - \hat{h}\|_\infty\).

Lemma 5.1 implies that when the change of the gating network outputs between iterations \(t\) and \(t'\), i.e., \(\|h(x; \Theta^{(t)}) - h(x; \Theta^{(t')})\|_\infty\), is small, the router behavior will be similar. So adding noise provides a smooth transition from time \(t\) to \(t'\). It is also worth noting that \(\Theta\) is zero initialized, so \(h(x; \Theta^{(0)}) = 0\) and thus each expert gets routed with the same probability \(p_m = \frac{1}{M}\) by symmetry. Therefore, at the early stage of training, when \(\|h(x; \Theta^{(t)}) - h(x; \Theta^{(0)})\|_\infty\) is small, the router will almost uniformly pick one expert from \([M]\), which helps exploration across experts.
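
Lemma 5.1 can also be checked numerically. The snippet below estimates the routing probabilities \(p_m\) for two nearby gating outputs \(h\) and \(\hat{h}\) by Monte Carlo over the uniform noise and compares \(\|p - \hat{p}\|_\infty\) with the bound \(M^2 \|h - \hat{h}\|_\infty\); the sample size and the particular \(h\) are arbitrary illustrative choices.

```python
import numpy as np

def routing_probs(h, rng, n_samples=200_000):
    """Estimate p_m = P(argmax_{m'} h_{m'} + r_{m'} = m) with r ~ Unif[0, 1]^M."""
    M = len(h)
    r = rng.uniform(0.0, 1.0, size=(n_samples, M))
    winners = np.argmax(h + r, axis=1)
    return np.bincount(winners, minlength=M) / n_samples

rng = np.random.default_rng(0)
h = np.array([0.30, 0.28, 0.10, 0.05])                # gating outputs for M = 4 experts
h_hat = h + np.array([0.01, -0.01, 0.0, 0.0])         # a small perturbation
p, p_hat = routing_probs(h, rng), routing_probs(h_hat, rng)
lhs = np.abs(p - p_hat).max()
rhs = len(h) ** 2 * np.abs(h - h_hat).max()
print(f"||p - p_hat||_inf = {lhs:.4f}  <=  M^2 ||h - h_hat||_inf = {rhs:.4f}")
```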

Main Difficulty 2: No “Real” Expert. At the beginning of the training, the gating network is zero, and the experts are randomly initialized. Thus it is hard for the router to learn the right features because all the experts look the same: they share the same network architecture and are trained by the same algorithm. The only difference would be the initialization. Moreover, if the router makes a mistake at the beginning of the training, the experts may amplify the mistake because the experts will be trained based on mistakenly dispatched data.

Key Technique 2: Experts from Exploration. Motivated by Key Technique 1, we introduce an exploration stage to the analysis of the MoE layer, during which the router almost uniformly picks one expert from \([M]\). This stage starts at \(t = 0\) and ends at \(T_1 = \hat{\eta}^{-1} \sigma_0^{0.5}\), and the gating network remains nearly unchanged: \(\|h(x; \Theta^{(t)}) - h(x; \Theta^{(0)})\|_\infty = O(\sigma_0^{1.5})\). Because the experts are treated almost equally during the exploration stage, we can show that the experts become specialized to some specific task only based on the initialization. In particular, the expert set \([M]\) can be divided into \(K\) nonempty disjoint sets \([M] = \bigsqcup_k M_k\), where \(M_k := \{m \mid \arg\max_{k' \in [K], j \in [J]} \langle v_{k'}, w_{m,j}^{(0)} \rangle = k\}\). For nonlinear MoE with cubic activation function, the following lemma further shows that experts in different sets \(M_k\) will diverge at the end of the exploration stage.

Lemma 5.2. Under the same condition as in Theorem 4.2, with probability at least \(1 - o(1)\), at the end of the exploration stage every expert \(m \in M_k\) achieves nearly zero test error on the cluster \(\Omega_k\) but high test error on the other clusters \(\Omega_{k'}\), \(k' \neq k\).

Main Difficulty 3: Expert Load Imbalance. Given the training data set \(S = \{(x_i, y_i)\}_{i=1}^n\), the load of expert \(m\) at iteration \(t\) is defined as

\[\text{Load}_m^{(t)} = \frac{1}{n} \sum_{i=1}^{n} P(m_{i,t} = m)\]

where \(P(m_{i,t} = m)\) is the probability that the input \(x_i\) is routed to expert \(m\) at iteration \(t\). Eigen et al. (2013) first described the load imbalance issues in the training of the MoE layer. The gating network may converge to a state where it always produces large \(\text{Load}_m^{(t)}\) for the same few experts. This imbalance in expert load is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more frequently by the router (Shazeer et al., 2017; Fedus et al., 2021). Expert load imbalance issues not only cause memory and performance problems in practice but also impede the theoretical analysis of the expert training.

Key Technique 3: Normalized Gradient Descent. Lemma 5.2 shows that the experts will diverge into \(\bigsqcup_{k \in [K]} M_k\). Normalized gradient descent can help different experts in the same \(M_k\) be trained at the same speed regardless of the imbalance load caused by the router. Because the self-reinforcing circle no longer exists, we can prove that the router will treat different experts in the same \(M_k\) almost equally and dispatch almost the same amount of data to them (See Section E.2 in Appendix for details). This load imbalance issue can be further avoided by adding load balancing loss (Eigen et al., 2013; Shazeer et al., 2017; Fedus et al., 2021), or advanced MoE layer structures such as BASE Layers (Lewis et al., 2021; Dua et al., 2021) and Hash Layers (Roller et al., 2021).

Road Map: Here we provide the road map of the proof of Theorem 4.2 and the full proof is presented in Appendix E. The training process can be decomposed into several stages. The first stage is called the Exploration stage. During this stage, the experts will diverge into \(K\) professional groups \(\bigsqcup_{k=1}^{K} M_k = [M]\). In particular, we will show that \(M_k\) is not empty for all \(k \in [K]\). Besides, for all \(m \in M_k\), \(f_m\) is a good classifier over \(\Omega_k\). The second stage is called the router learning stage. During this stage, the router will learn to dispatch \(x \in \Omega_k\) to one of the experts in \(M_k\). Finally, we will give the generalization analysis for the MoEs from the previous two stages.

6 Experiments

Table 1: Comparison between MoE (linear) and MoE (nonlinear) in our setting. We report results of top-1 gating with noise for both linear and nonlinear models. Over ten random experiments, we report the average value ± standard deviation for both test accuracy and dispatch entropy.

Figure 3: Illustration of router dispatch entropy. We demonstrate the change of entropy of MoE during training on the synthetic data. MoE (linear)-1 and MoE (nonlinear)-1 refer to Setting 1 in Table 1. MoE (linear)-2 and MoE (nonlinear)-2 refer to Setting 2 in Table 1.

6.1 Synthetic-data Experiments

Datasets. We generate 16,000 training examples and 16,000 test examples from the data distribution defined in Definition 3.1 with cluster number \(K = 4\), patch number \(P = 4\) and dimension \(d = 50\). We randomly shuffle the order of the patches of \(x\) after we generate data \((x, y)\). We consider two parameter settings:

  1. \(\alpha \sim \text{Uniform}(0.5, 2)\), \(\beta \sim \text{Uniform}(1, 2)\), \(\gamma \sim \text{Uniform}(0.5, 3)\) and \(\sigma_p = 1\);
  2. \(\alpha \sim \text{Uniform}(0.5, 2)\), \(\beta \sim \text{Uniform}(1, 2)\), \(\gamma \sim \text{Uniform}(0.5, 3)\) and \(\sigma_p = 2\).

Note that Theorem 4.1 shows that when \(\alpha\) and \(\gamma\) follow the same distribution, neither single linear expert nor single nonlinear expert can give good performance. Here we consider a more general and difficult setting when \(\alpha\) and \(\gamma\) are from different distributions.

Models. We consider the performances of single linear CNN, single nonlinear CNN, linear MoE, and nonlinear MoE. The single nonlinear CNN architecture follows (3.1) with cubic activation function, while single linear CNN follows (3.1) with identity activation function. For both linear and nonlinear MoEs, we consider a mixture of 8 experts with each expert being a single linear CNN or a single nonlinear CNN. Finally, we train single models with gradient descent and train the MoEs with Algorithm 1. We run 10 random experiments and report the average accuracy with standard deviation.

Evaluation. To evaluate how well the router learned the underlying cluster structure of the data, we define the entropy of the router’s dispatch as follows. Denote by \(n_{k,m}\) the number of data in cluster \(k\) that are dispatched to expert \(m\). The total number of data dispatched to expert \(m\) is \(n_m = \sum_{k=1}^K n_{k,m}\). The dispatch entropy is

\[H = -\sum_{m=1}^M \sum_{k=1}^K \frac{n_{k,m}}{n_m} \log \left( \frac{n_{k,m}}{n_m} \right)\]

When each expert receives the data from at most one cluster, the dispatch entropy will be zero. And a uniform dispatch will result in the maximum dispatch entropy.
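
The dispatch entropy is straightforward to compute from the cluster-by-expert count matrix. A minimal sketch following the formula above literally (zero counts and empty experts contribute nothing):

```python
import numpy as np

def dispatch_entropy(counts):
    """counts[k, m] = number of examples from cluster k routed to expert m.
    Returns H = -sum_m sum_k (n_{k,m} / n_m) * log(n_{k,m} / n_m)."""
    counts = np.asarray(counts, dtype=float)
    n_m = counts.sum(axis=0)                          # total examples per expert
    H = 0.0
    for m in range(counts.shape[1]):
        if n_m[m] == 0:
            continue                                  # expert received no data
        p = counts[:, m] / n_m[m]
        p = p[p > 0]                                  # treat 0 * log(0) as 0
        H -= (p * np.log(p)).sum()
    return H

# Perfect specialization (each expert sees one cluster) gives zero entropy,
# while a uniform dispatch gives the maximum value.
perfect = np.eye(4) * 100                             # 4 clusters x 4 experts
uniform = np.full((4, 4), 25)
print(dispatch_entropy(perfect), dispatch_entropy(uniform))
```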

As shown in Table 1, the linear MoE does not perform as well as the nonlinear MoE in Setting 1, with around 6% less test accuracy and much higher variance. With stronger random noise (Setting 2), the difference between the nonlinear MoE and linear MoE becomes even more significant. We also observe that the final dispatch entropy of nonlinear MoE is nearly zero while that of the linear MoE is large. In Figure 3, we further demonstrate the change of dispatch entropy during the training process. The dispatch entropy of nonlinear MoE significantly decreases, while that of linear MoE remains large. Such a phenomenon indicates that the nonlinear MoE can successfully learn the underlying cluster structure of the data while the linear MoE fails to do so.

6.2 Real-data Experiments

We further conduct experiments on real image datasets and demonstrate the importance of the clustering data structure to the MoE layer in deep neural networks.

Datasets. We consider the CIFAR-10 dataset (Krizhevsky, 2009) and the 10-class classification task. Furthermore, we create a CIFAR-10-Rotate dataset that has a strong underlying cluster structure that is independent of its labeling function. Specifically, we rotate the images by 30 degrees and merge the rotated dataset with the original one. The task is to predict if the image is rotated, which is a binary classification problem. We deem that some of the classes in CIFAR-10 form underlying clusters in CIFAR-10-Rotate. In Appendix A, we explain in detail how we generate CIFAR-10-Rotate and present some specific examples.

Models. For the MoE, we consider a mixture of 4 experts with a linear gating network. For the expert/single model architectures, we consider a CNN with 2 convolutional layers (architecture details are illustrated in Appendix A). For a more thorough evaluation, we also consider expert/single models with architectures including MobileNetV2 (Sandler et al., 2018) and ResNet18 (He et al., 2016). The training process of MoE also follows Algorithm 1.

The experiment results are shown in Table 2, where we compare single and mixture models of different architectures over CIFAR-10 and CIFAR-10-Rotate datasets. We observe that the improvement of MoEs over single models differs largely on the different datasets. On CIFAR-10, the performance of MoEs is very close to the single models. However, on the CIFAR-10-Rotate dataset, we can observe a significant performance improvement from single models to MoEs. Such results indicate the advantage of MoE over single models depends on the task and the cluster structure of the data.

7 Conclusion and Future Work

In this work, we formally study the mechanism of the Mixture of Experts (MoE) layer for deep learning. To our knowledge, we provide the first theoretical result toward understanding how the MoE layer works in deep learning. Our empirical evidence reveals that the cluster structure of the data plays an important role in the success of the MoE layer. Motivated by these empirical observations, we study a data distribution with cluster structure and show that Mixture-of-Experts provably improves the test accuracy of a single expert of two-layer CNNs.

There are several important future directions. First, our current results are for CNNs. It is interesting to extend our results to other neural network architectures, such as transformers. Second, our data distribution is motivated by the classification problem of image data. We plan to extend our analysis to other types of data (e.g., natural language data).
