
MoE | Outrageously LNN, MoE Layer

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-12-08

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

  • url: https://arxiv.org/abs/1701.06538
  • pdf: https://arxiv.org/pdf/1701.06538
  • github: The Sparsely Gated Mixture of Experts Layer for PyTorch
  • abstract: The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.

Contents

TL;DR


  • Improves deep learning model performance through conditional computation and a mixture of experts
  • Shows that high model capacity (number of parameters) can be achieved at a small computational cost
  • Improves on state-of-the-art results across language modeling and machine translation tasks

1 Introduction and Related Work

1.1 Conditional Computation

The success of deep learning has relied heavily on scaling both the training dataset and the model size. When datasets are sufficiently large, increasing a neural network's capacity improves prediction accuracy. However, when the entire model is activated for every example, training costs blow up as model size and the number of training examples grow together.

Various forms of conditional computation have been proposed to overcome this. In these approaches, large parts of the network are activated or deactivated on a per-example basis, and various forms of reinforcement learning and back-propagation have been proposed for training the gating decisions.

\[\text{Cost} \propto \text{Model Size} \times \text{Number of Training Examples}\]

Because model size and the number of training examples are typically scaled up together, this cost blows up roughly quadratically for dense models; conditional computation keeps the per-example cost roughly constant, so the total cost grows closer to linearly.
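As a rough back-of-the-envelope illustration (the symbols $N$, $D$, and $N_{\text{active}}$ are introduced here for clarity and are not the paper's notation): if the number of training examples $D$ is scaled in proportion to the parameter count $N$, then

\[\text{Cost}_{\text{dense}} \propto N \cdot D \propto N^2, \qquad \text{Cost}_{\text{conditional}} \propto N_{\text{active}} \cdot D \propto N \quad (N_{\text{active}} \text{ held fixed}).\]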

1.2 Approach: The Sparsely-Gated Mixture-of-Experts Layer

The paper implements conditional computation with a Sparsely-Gated Mixture-of-Experts (MoE) layer. The MoE consists of a number of expert networks and a trainable gating network that selects a sparse combination of experts for each input; all parts are trained jointly by back-propagation.

\[y = \sum_{i=1}^{n} G(x)_i \cdot E_i(x)\]

In the equation above, $G(x)_i$ is the output of the gating network and $E_i(x)$ is the output of the $i$-th expert network. Whenever $G(x)_i = 0$, $E_i(x)$ need not be computed, which saves computation.
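A small worked example with made-up numbers: with $n = 4$ experts and a sparse gate output $G(x) = (0.7,\, 0.3,\, 0,\, 0)$, only two experts are evaluated,

\[y = 0.7 \cdot E_1(x) + 0.3 \cdot E_2(x), \quad \text{while } E_3(x) \text{ and } E_4(x) \text{ are never computed.}\]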

1.3 Related Work

The mixture-of-experts approach has been studied extensively with a variety of expert architectures and configurations. In prior work the entire model was a mixture of experts; this paper instead applies an MoE layer that selects a potentially different combination of experts at each position in the text, achieving higher capacity and efficiency than previous approaches.


2 Structure of the Mixture-of-Experts Layer

The MoE layer consists of $n$ expert networks and a gating network that outputs a sparse $n$-dimensional vector. Each expert has its own parameters but accepts inputs and produces outputs of the same size. Activating only selected experts improves computational efficiency.

\[G(x) = \text{Softmax}(W_g \cdot x)\] \[y = \sum_{i \in \text{Top-K}} G(x)_i \cdot E_i(x)\]

In the equations above, $W_g$ is the weight matrix of the gating network, and the Top-K selection saves additional computation by evaluating only the $k$ highest-weighted experts.
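A minimal PyTorch sketch of this gate, following the two formulas above (tensor shapes and names are illustrative assumptions; the paper's noise term and load-balancing losses are omitted):

```python
import torch
import torch.nn.functional as F

def topk_softmax_gate(x, w_g, k=2):
    """Sparse gate: keep only the k largest gating logits, softmax over them,
    and leave every other expert with an exactly-zero gate value."""
    logits = x @ w_g                              # [batch, n_experts], i.e. W_g . x
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # the Top-K experts per example
    gates = torch.zeros_like(logits)
    gates.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
    return gates                                  # rows sum to 1; only k entries are nonzero

# illustrative sizes: batch of 8 tokens, d_model = 16, n_experts = 4
x = torch.randn(8, 16)
w_g = torch.randn(16, 4, requires_grad=True)
print(topk_softmax_gate(x, w_g, k=2))
```

Taking the softmax over only the surviving logits is equivalent to setting the discarded logits to $-\infty$ before the softmax, which is how the paper phrases the same operation.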


3 Addressing Performance Challenges

3.1 The Shrinking Batch Problem

Large batch sizes are known to be essential for computational efficiency, but introducing an MoE shrinks the batch that each expert receives. The paper addresses this by mixing data parallelism and model parallelism.

Resulting batch size per expert:

\[\text{New Batch Size} = \frac{k \cdot b \cdot d}{n}\]

With $k$ active experts per example, a per-device batch size of $b$, $d$ devices, and $n$ experts, this scheme increases the batch size per expert and thus improves overall computational efficiency.
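Plugging in illustrative numbers, $k = 4$ active experts per example, per-device batch $b = 1024$, $d = 32$ devices, and $n = 256$ experts:

\[\text{New Batch Size} = \frac{k \cdot b \cdot d}{n} = \frac{4 \cdot 1024 \cdot 32}{256} = 512 \;\text{examples per expert}, \quad \text{versus } \frac{k \cdot b}{n} = 16 \text{ on a single device.}\]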

3.2 Network Bandwidth

Because expert inputs and outputs must be sent across the network, network bandwidth is an important performance factor. The paper addresses this by keeping the ratio of an expert's computation to the size of its input and output high.


4 Balancing Expert Utilization

The gating network tends to assign disproportionately large weights to a few experts. To balance expert utilization, an additional loss term is introduced.

\[L_{importance} = w_{importance} \cdot \left(\frac{\sigma}{\mu}\right)^2\]

In the equation above, $\sigma$ is the standard deviation of the experts' importance values and $\mu$ is their mean, so the penalty is the squared coefficient of variation. This additional loss encourages all experts to be used with roughly equal importance.
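For instance, with illustrative importance values $(4, 2, 1, 1)$ over four experts, $\mu = 2$ and $\sigma^2 = 1.5$, so the unscaled penalty is $(\sigma/\mu)^2 = 1.5 / 4 = 0.375$; perfectly balanced values $(2, 2, 2, 2)$ give a penalty of $0$.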

  • 4.1 Components
    • $n$ “expert networks” ($E_1, \ldots, E_n$)
    • one “gating network” ($G$)
  • 4.2 Gating Network
    • Outputs a sparse $n$-dimensional vector
    • Decides which experts to activate
  • 4.3 Expert Networks
    • Identical input and output sizes
    • Restricted in this work to feed-forward networks with identical architectures
  • 4.4 MoE Output Computation
    • \[y = \sum_i G(x)_i \cdot E_i(x)\]
    • When $G(x)_i$ is 0, $E_i(x)$ need not be computed
  • 4.5 Computational Efficiency
    • Only a handful of the many experts are evaluated per example
    • A hierarchical MoE can be used when the number of experts is very large
  • 4.6 Related Models
    • Similar in spirit to other conditional-computation models
    • Experts that are simple weight matrices: similar to the parameterized weight matrix of Cho & Bengio (2014)
    • Experts with a single hidden layer: similar to the block-wise dropout of Bengio et al. (2015)


5 Experiments

MoE models of various sizes and computational budgets exceed previous state-of-the-art results on language modeling and machine translation benchmarks. On larger datasets, higher model capacity yields larger performance gains.


[Reference Note 1] Classical MoE vs. OMoE (the sparsely-gated MoE of this paper)

  1. Classical MoE (Mixture of Experts)

A Mixture-of-Experts (MoE) model consists of several expert networks, each specialized for part of the input space. A classical MoE uses a single gating network to assign weights to all experts for every input, and takes the following mathematical form.

\[y = \sum_{i=1}^n g_i(x) E_i(x)\]
  • \(g_i(x)\) is the output of the gating network: the weight of the \(i\)-th expert for input \(x\).
  • \(E_i(x)\) is the output of the \(i\)-th expert.
  • \(n\) is the total number of experts.


2. The Paper's Approach: Sparsely-Gated MoE

The sparsely-gated MoE proposed in this paper limits the number of experts activated for each input, improving computational efficiency. This approach uses the following formulation.

\[y = \sum_{i \in S(x)} g_i(x) E_i(x)\] \[S(x) = \text{top-k}(g(x))\]

\(S(x)\) is the set of the top \(k\) experts with the largest gate values for input \(x\).

2.2 Mathematical Background and Connections

  1. Sparse selection: the sparse gating mechanism activates only a few experts per input in order to save computational resources. Only the top \(k\) elements of the gating vector \(g(x)\) are kept and the rest are set to zero, so the computation for the inactive parts of the network is skipped entirely.
  2. Mathematical meaning: \(S(x) = \text{top-k}(g(x))\) selects the \(k\) largest values of the gating network's output, determining which experts to activate. Only the experts most relevant to the given input are evaluated, maximizing processing efficiency.
  3. Computation savings: since only the top \(k\) experts are activated per input, only a fraction of the full set of experts is used in the computation, which reduces computational cost.
  • Advantages
    • Computational efficiency: limiting the number of active experts reduces the required computation.
    • Specialization: each expert can be trained more precisely on particular types of inputs, improving overall model quality.
  • Disadvantages
    • Challenges from sparsity: too much sparsity can leave some experts under-trained.
    • Hyperparameter selection: choosing an appropriate value of \(k\) strongly affects performance and often requires experimental tuning.

Approaches like this reflect the ongoing effort in deep learning research to balance computational cost against model quality, using mathematical structure to solve large problems efficiently.


1 Introduction

Exploiting scale in both training data and model size has been central to the success of deep learning. When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy. This has been shown in domains such as text (Sutskever et al., 2014; Bahdanau et al., 2014; Jozefowicz et al., 2016; Wu et al., 2016), images (Krizhevsky et al., 2012; Le et al., 2012), and audio (Hinton et al., 2012; Amodei et al., 2015). For typical deep learning models, where the entire model is activated for every example, this leads to a roughly quadratic blow-up in training costs, as both the model size and the number of training examples increase. Unfortunately, the advances in computing power and distributed computation fall short of meeting such demand.

Various forms of conditional computation have been proposed as a way to increase model capacity without a proportional increase in computational costs (Davis & Arel, 2013; Bengio et al., 2013; Eigen et al., 2013; Ludovic Denoyer, 2014; Cho & Bengio, 2014; Bengio et al., 2015; Almahairi et al., 2015). In these schemes, large parts of a network are active or inactive on a per-example basis. The gating decisions may be binary or sparse and continuous, stochastic or deterministic. Various forms of reinforcement learning and back-propagation are proposed for training the gating decisions.

Figure 1: A Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.

While these ideas are promising in theory, no work to date has yet demonstrated massive improvements in model capacity, training time, or model quality. We blame this on a combination of the following challenges:

  • Modern computing devices, especially GPUs, are much faster at arithmetic than at branching. Most of the works above recognize this and propose turning on/off large chunks of the network with each gating decision.
  • Large batch sizes are critical for performance, as they amortize the costs of parameter transfers and updates. Conditional computation reduces the batch sizes for the conditionally active chunks of the network.
  • Network bandwidth can be a bottleneck. A cluster of GPUs may have computational power thousands of times greater than the aggregate inter-device network bandwidth. To be computationally efficient, the relative computational versus network demands of an algorithm must exceed this ratio. Embedding layers, which can be seen as a form of conditional computation, are handicapped by this very problem. Since the embeddings generally need to be sent across the network, the number of (example, parameter) interactions is limited by network bandwidth instead of computational capacity.
  • Depending on the scheme, loss terms may be necessary to achieve the desired level of sparsity per-chunk and/or per example. Bengio et al. (2015) use three such terms. These issues can affect both model quality and load-balancing.
  • Model capacity is most critical for very large data sets. The existing literature on conditional computation deals with relatively small image recognition data sets consisting of up to 600,000 images. It is hard to imagine that the labels of these images provide a sufficient signal to adequately train a model with millions, let alone billions of parameters.

In this work, we for the first time address all of the above challenges and finally realize the promise of conditional computation. We obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets.

While the introduced technique is generic, in this paper we focus on language modeling and machine translation tasks, which are known to benefit from very large models. In particular, we apply a MoE convolutionally between stacked LSTM layers (Hochreiter & Schmidhuber, 1997), as in Figure 1. The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position. The different experts tend to become highly specialized based on syntax and semantics (see Appendix E Table 9). On both language modeling and machine translation benchmarks, we improve on best published results at a fraction of the computational cost.

Since its introduction more than two decades ago (Jacobs et al., 1991; Jordan & Jacobs, 1994), the mixture-of-experts approach has been the subject of much research. Different types of expert architectures have been proposed such as SVMs (Collobert et al., 2002), Gaussian Processes (Tresp, 2001; Theis & Bethge, 2015; Deisenroth & Ng, 2015), Dirichlet Processes (Shahbaba & Neal, 2009), and deep networks. Other work has focused on different expert configurations such as a hierarchical structure (Yao et al., 2009), infinite numbers of experts (Rasmussen & Ghahramani, 2002), and adding experts sequentially (Aljundi et al., 2016). Garmash & Monz (2016) suggest an ensemble model in the format of mixture of experts for machine translation. The gating network is trained on a pre-trained ensemble NMT model.

The works above concern top-level mixtures of experts. The mixture of experts is the whole model. Eigen et al. (2013) introduce the idea of using multiple MoEs with their own gating networks as parts of a deep model. It is intuitive that the latter approach is more powerful, since complex problems may contain many sub-problems each requiring different experts. They also allude in their conclusion to the potential to introduce sparsity, turning MoEs into a vehicle for conditional computation.

Our work builds on this use of MoEs as a general purpose neural network component. While Eigen et al. (2013) uses two stacked MoEs allowing for two sets of gating decisions, our convolutional application of the MoE allows for different gating decisions at each position in the text. We also realize sparse gating and demonstrate its use as a practical way to massively increase model capacity.

2 THE STRUCTURE OF THE MIXTURE-OF-EXPERTS LAYER

The Mixture-of-Experts (MoE) layer consists of a set of $n$ “expert networks” $E_1, \cdots, E_n$, and a “gating network” $G$ whose output is a sparse $n$-dimensional vector. Figure 1 shows an overview of the MoE module. The experts are themselves neural networks, each with their own parameters. Although in principle we only require that the experts accept the same sized inputs and produce the same-sized outputs, in our initial investigations in this paper, we restrict ourselves to the case where the models are feed-forward networks with identical architectures, but with separate parameters.

Let us denote by $G(x)$ and $E_i(x)$ the output of the gating network and the output of the $i$-th expert network for a given input $x$. The output $y$ of the MoE module can be written as follows:

\[y = \sum_i G(x)_i \cdot E_i(x)\]

We save computation based on the sparsity of the output of $G(x)$. Wherever $G(x)_i = 0$, we need not compute $E_i(x)$. In our experiments, we have up to thousands of experts, but only need to evaluate a handful of them for every example. If the number of experts is very large, we can reduce the branching factor by using a two-level hierarchical MoE. In a hierarchical MoE, a primary gating network chooses a sparse weighted combination of “experts”, each of which is itself a secondary mixture-of-experts with its own gating network. In the following we focus on ordinary MoEs. We provide more details on hierarchical MoEs in Appendix B.
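A toy PyTorch sketch of this dispatch pattern, where each expert runs only on the examples routed to it (class and variable names are assumptions, not the paper's reference implementation; the hierarchical variant and the load-balancing losses are omitted):

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Minimal sparsely-gated MoE: each example is processed by its top-k experts only."""

    def __init__(self, d_model, d_hidden, n_experts, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # gating weights W_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)])

    def forward(self, x):                                    # x: [batch, d_model]
        top_vals, top_idx = self.gate(x).topk(self.k, dim=-1)
        gates = torch.softmax(top_vals, dim=-1)              # weights of the selected experts
        y = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (top_idx == e).nonzero(as_tuple=True)  # examples routed to expert e
            if rows.numel() == 0:
                continue                                     # G(x)_e = 0 for the whole batch: skip
            y[rows] += gates[rows, slots].unsqueeze(-1) * expert(x[rows])
        return y

moe = SparseMoE(d_model=16, d_hidden=64, n_experts=8, k=2)
print(moe(torch.randn(4, 16)).shape)                         # torch.Size([4, 16])
```

In practice the per-expert Python loop would be replaced by a batched dispatch and combine, but the loop makes the conditional computation explicit: unselected experts are never evaluated.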

Our implementation is related to other models of conditional computation. A MoE whose experts are simple weight matrices is similar to the parameterized weight matrix proposed in (Cho & Bengio, 2014). A MoE whose experts have one hidden layer is similar to the block-wise dropout described in (Bengio et al., 2015), where the dropped-out layer is sandwiched between fully-activated layers.

2.1 GATING NETWORK

Softmax Gating: A simple choice of non-sparse gating function (Jordan & Jacobs, 1994) is to multiply the input by a trainable weight matrix $W_g$ and then apply the Softmax function.

Noisy Top-K Gating: We add two components to the Softmax gating network: sparsity and noise. Before taking the softmax function, we add tunable Gaussian noise, then keep only the top k values, setting the rest to −∞ (which causes the corresponding gate values to equal 0). The sparsity serves to save computation, as described above. While this form of sparsity creates some theoretically scary discontinuities in the output of the gating function, we have not yet observed this to be a problem in practice. The noise term helps with load balancing, as will be discussed in Appendix A. The amount of noise per component is controlled by a second trainable weight matrix $W_{noise}$.
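A sketch of noisy top-k gating as described above (a toy reading of the mechanism; the softplus on the noise logits and the variable names are assumptions drawn from the paper's description rather than verified reference code):

```python
import torch
import torch.nn.functional as F

def noisy_topk_gate(x, w_g, w_noise, k=2, training=True):
    """Add input-dependent Gaussian noise to the gating logits, keep the top k,
    set the rest to -inf, and take a softmax (so their gate values become exactly 0)."""
    clean_logits = x @ w_g                                   # [batch, n_experts]
    if training:
        noise_std = F.softplus(x @ w_noise)                  # per-component noise scale from W_noise
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    else:
        logits = clean_logits
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)                 # everything outside the top k -> -inf
    return F.softmax(masked, dim=-1)

# illustrative sizes only
x = torch.randn(8, 16)
w_g = torch.randn(16, 4, requires_grad=True)
w_noise = torch.randn(16, 4, requires_grad=True)
gates = noisy_topk_gate(x, w_g, w_noise, k=2)
print((gates > 0).sum(dim=-1))                               # each row has exactly k nonzero gates
```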

Training the Gating Network We train the gating network by simple back-propagation, along with the rest of the model. If we choose k > 1, the gate values for the top k experts have nonzero derivatives with respect to the weights of the gating network. This type of occasionally-sensitive behavior is described in (Bengio et al., 2013) with respect to noisy rectifiers. Gradients also backpropagate through the gating network to its inputs. Our method differs here from (Bengio et al., 2015) who use boolean gates and a REINFORCE-style approach to train the gating network.

3 ADDRESSING PERFORMANCE CHALLENGES

3.1 THE SHRINKING BATCH PROBLEM

On modern CPUs and GPUs, large batch sizes are necessary for computational efficiency, so as to amortize the overhead of parameter loads and updates. If the gating network chooses k out of n experts for each example, then for a batch of b examples, each expert receives a much smaller batch of approximately $\frac{kb}{n} \ll b$ examples. This causes a naive MoE implementation to become very inefficient as the number of experts increases. The solution to this shrinking batch problem is to make the original batch size as large as possible. However, batch size tends to be limited by the memory necessary to store activations between the forwards and backwards passes. We propose the following techniques for increasing the batch size:

Mixing Data Parallelism and Model Parallelism: In a conventional distributed training setting, multiple copies of the model on different devices asynchronously process distinct batches of data, and parameters are synchronized through a set of parameter servers. In our technique, these different batches run synchronously so that they can be combined for the MoE layer. We distribute the standard layers of the model and the gating network according to conventional data-parallel schemes, but keep only one shared copy of each expert. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately $\frac{kbd}{n}$ examples. Thus, we achieve a factor of d improvement in expert batch size. In the case of a hierarchical MoE (Section B), the primary gating network employs data parallelism, and the secondary MoEs employ model parallelism. Each secondary MoE resides on one device.

This technique allows us to increase the number of experts (and hence the number of parameters) by proportionally increasing the number of devices in the training cluster. The total batch size increases, keeping the batch size per expert constant. The memory and bandwidth requirements per device also remain constant, as do the step times, as does the amount of time necessary to process a number of training examples equal to the number of parameters in the model. It is our goal to train a trillion-parameter model on a trillion-word corpus. We have not scaled our systems this far as of the writing of this paper, but it should be possible by adding more hardware.

Taking Advantage of Convolutionality: In our language models, we apply the same MoE to each time step of the previous layer. If we wait for the previous layer to finish, we can apply the MoE to all the time steps together as one big batch. Doing so increases the size of the input batch to the MoE layer by a factor of the number of unrolled time steps.
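A short sketch of this batching trick with assumed shapes: the per-timestep inputs are flattened into one large MoE batch and reshaped back afterwards.

```python
import torch

T, B, D = 20, 32, 512                  # unrolled timesteps, batch size, model width (illustrative)
h = torch.randn(T, B, D)               # outputs of the previous LSTM layer for all timesteps

moe_in = h.reshape(T * B, D)           # one combined batch, T times larger than a single step
moe_out = moe_in                       # placeholder: run the MoE layer here
y = moe_out.reshape(T, B, D)           # restore the (time, batch, width) layout
print(moe_in.shape)                    # torch.Size([640, 512])
```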

Increasing Batch Size for a Recurrent MoE: We suspect that even more powerful models may involve applying a MoE recurrently. For example, the weight matrices of a LSTM or other RNN could be replaced by a MoE. Sadly, such models break the convolutional trick from the last paragraph, since the input to the MoE at one timestep depends on the output of the MoE at the previous timestep. Gruslys et al. (2016) describe a technique for drastically reducing the number of stored activations in an unrolled RNN, at the cost of recomputing forward activations. This would allow for a large increase in batch size.

3.2 NETWORK BANDWIDTH

Another major performance concern in distributed computing is network bandwidth. Since the experts are stationary (see above) and the number of gating parameters is small, most of the communication involves sending the inputs and outputs of the experts across the network. To maintain computational efficiency, the ratio of an expert’s computation to the size of its input and output must exceed the ratio of computational to network capacity of the computing device. For GPUs, this may be thousands to one. In our experiments, we use experts with one hidden layer containing thousands of ReLU-activated units. Since the weight matrices in the expert have sizes input_size × hidden_size and hidden_size × output_size, the ratio of computation to input and output is equal to the size of the hidden layer. Conveniently, we can increase computational efficiency simply by using a larger hidden layer, or more hidden layers.
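A quick check with assumed sizes, counting a multiply-and-add as one operation: for an expert with input/output width $d = 1024$ and hidden size $h = 4096$, per example,

\[\frac{\text{ops}}{\text{values transferred}} \approx \frac{d \cdot h + h \cdot d}{d + d} = h = 4096,\]

which comfortably exceeds a compute-to-bandwidth ratio of thousands to one.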

4 BALANCING EXPERT UTILIZATION

We have observed that the gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network. Eigen et al. (2013) describe the same phenomenon, and use a hard constraint at the beginning of training to avoid this local minimum. Bengio et al. (2015) include a soft constraint on the batch-wise average of each gate.[1]

We take a soft constraint approach. We define the importance of an expert relative to a batch of training examples to be the batchwise sum of the gate values for that expert. We define an additional loss $L_{importance}$, which is added to the overall loss function for the model. This loss is equal to the square of the coefficient of variation of the set of importance values, multiplied by a hand-tuned scaling factor $w_{importance}$. This additional loss encourages all experts to have equal importance.
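A minimal sketch of this loss (function and variable names are assumptions; the input is whatever gate values the gating network produced for the current batch):

```python
import torch

def importance_loss(gates, w_importance=0.1, eps=1e-10):
    """L_importance = w_importance * CV(importance)^2, where importance is the
    batchwise sum of gate values per expert and CV is the coefficient of variation."""
    importance = gates.sum(dim=0)                      # [n_experts]
    mean = importance.mean()
    var = ((importance - mean) ** 2).mean()            # population variance
    return w_importance * var / (mean ** 2 + eps)      # squared coefficient of variation, scaled

# illustrative: gate values for a batch of 8 examples over 4 experts
gates = torch.rand(8, 4)
print(importance_loss(gates))
```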

[1] Bengio et al. (2015) also include two additional losses. One controls per-example sparsity, which we do not need since it is enforced by the fixed value of k. A third loss encourages diversity of gate values. In our experiments, we find that the gate values naturally diversify as the experts specialize (in a virtuous cycle), and we do not need to enforce diversity of gate values.

While this loss function can ensure equal importance, experts may still receive very different numbers of examples. For example, one expert may receive a few examples with large weights, and another may receive many examples with small weights. This can cause memory and performance problems on distributed hardware. To solve this problem, we introduce a second loss function, $L_{load}$, which ensures balanced loads. Appendix A contains the definition of this function, along with experimental results.

5 EXPERIMENTS

5.1 BILLION WORD LANGUAGE MODELING BENCHMARK

Dataset: This dataset, introduced by (Chelba et al., 2013) consists of shuffled unique sentences from news articles, totaling approximately 829 million words, with a vocabulary of 793,471 words.

Previous State-of-the-Art: The best previously published results (Jozefowicz et al., 2016) use models consisting of one or more stacked Long Short-Term Memory (LSTM) layers (Hochreiter & Schmidhuber, 1997; Gers et al., 2000). The number of parameters in the LSTM layers of these models vary from 2 million to 151 million. Quality increases greatly with parameter count, as do computational costs. Results for these models form the top line of Figure 2-right.

MoE Models: Our models consist of two stacked LSTM layers with a MoE layer between them (see Figure 1). We vary the sizes of the layers and the number of experts. For full details on model architecture, training regimen, additional baselines and results, see Appendix C.

Low Computation, Varied Capacity: To investigate the effects of adding capacity, we trained a series of MoE models all with roughly equal computational costs: about 8 million multiply-and-adds per training example per timestep in the forwards pass, excluding the softmax layer. We call this metric (ops/timestep). We trained models with flat MoEs containing 4, 32, and 256 experts, and models with hierarchical MoEs containing 256, 1024, and 4096 experts. Each expert had about 1 million parameters. For all the MoE layers, 4 experts were active per input.

The results of these models are shown in Figure 2-left. The model with 4 always-active experts performed (unsurprisingly) similarly to the computationally-matched baseline models, while the largest of the models (4096 experts) achieved an impressive 24% lower perplexity on the test set.

Figure 2: Model comparison on 1-Billion-Word Language-Modeling Benchmark. On the left, we plot test perplexity as a function of model capacity for models with similar computational budgets of approximately 8-million-ops-per-timestep. On the right, we plot test perplexity as a function of computational budget. The top line represents the LSTM models from (Jozefowicz et al., 2016). The bottom line represents 4-billion parameter MoE models with different computational budgets.

Varied Computation, High Capacity: In addition to the largest model from the previous section, we trained two more MoE models with similarly high capacity (4 billion parameters), but higher computation budgets. These models had larger LSTMs, and fewer but larger experts. Details can be found in Appendix C.2. Results of these three models form the bottom line of Figure 2-right. Table 1 compares the results of these models to the best previously-published result on this dataset. Even the fastest of these models beats the best published result (when controlling for the number of training epochs), despite requiring only 6% of the computation.

Table 1: Summary of high-capacity MoE-augmented models with varying computational budgets, vs. best previously published results (Jozefowicz et al., 2016). Details in Appendix C.

Computational Efficiency: We trained our models using TensorFlow (Abadi et al., 2016) on clusters containing 16-32 Tesla K40 GPUs. For each of our models, we determine computational efficiency in TFLOPS/GPU by dividing the number of floating point operations required to process one training batch by the observed step time and the number of GPUs in the cluster. The operation counts used here are higher than the ones we report in our ops/timestep numbers in that we include the backwards pass, we include the importance-sampling-based training of the softmax layer, and we count a multiply-and-add as two separate operations. For all of our MoE models, the floating point operations involved in the experts represent between 37% and 46% of the total.

For our baseline models with no MoE, observed computational efficiency ranged from 1.07-1.29 TFLOPS/GPU. For our low-computation MoE models, computational efficiency ranged from 0.74-0.90 TFLOPS/GPU, except for the 4-expert model which did not make full use of the available parallelism. Our highest-computation MoE model was more efficient at 1.56 TFLOPS/GPU, likely due to the larger matrices. These numbers represent a significant fraction of the theoretical maximum of 4.29 TFLOPS/GPU claimed by NVIDIA. Detailed results are in Appendix C, Table 7.

5.2 100 BILLION WORD GOOGLE NEWS CORPUS

Figure 3: Language modeling on a 100 billion word corpus. Models have similar computational budgets (8 million ops/timestep).

On the 1-billion-word corpus, adding additional capacity seems to produce diminishing returns as the number of parameters in the MoE layer exceeds 1 billion, as can be seen in Figure 2-left. We hypothesized that for a larger training set, even higher capacities would produce significant quality improvements.

We constructed a similar training set consisting of shuffled unique sentences from Google’s internal news corpus, totalling roughly 100 billion words. Similarly to the previous section, we tested a series of models with similar computational costs of about 8 million ops/timestep. In addition to a baseline LSTM model, we trained models augmented with MoE layers containing 32, 256, 1024, 4096, 16384, 65536, and 131072 experts. This corresponds to up to 137 billion parameters in the MoE layer. Details on architecture, training, and results are given in Appendix D.

Results: Figure 3 shows test perplexity as a function of capacity after training on 10 billion words (top line) and 100 billion words (bottom line). When training over the full 100 billion words, test perplexity improves significantly up to 65536 experts (68 billion parameters), dropping 39% lower than the computationally matched baseline, but degrades at 131072 experts, possibly a result of too much sparsity. The widening gap between the two lines demonstrates (unsurprisingly) that increased model capacity helps more on larger training sets.

Even at 65536 experts (99.994% layer sparsity), computational efficiency for the model stays at a respectable 0.72 TFLOPS/GPU.

5.3 MACHINE TRANSLATION (SINGLE LANGUAGE PAIR)

Model Architecture: Our model was a modified version of the GNMT model described in (Wu et al., 2016). To reduce computation, we decreased the number of LSTM layers in the encoder and decoder from 9 and 8 to 3 and 2 respectively. We inserted MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). Each MoE layer contained up to 2048 experts each with about two million parameters, adding a total of about 8 billion parameters to the models. Further details on model architecture, testing procedure and results can be found in Appendix E.

Datasets: We benchmarked our method on the WMT’14 En→Fr and En→De corpora, whose training sets have 36M sentence pairs and 5M sentence pairs, respectively. The experimental protocols were also similar to those in (Wu et al., 2016): newstest2014 was used as the test set to compare against previous work (Luong et al., 2015a; Zhou et al., 2016; Wu et al., 2016), while the combination of newstest2012 and newstest2013 was used as the development set. We also tested the same model on Google’s production English to French data.

Table 2: Results on WMT’14 En→Fr newstest2014 (bold values represent best results).

Results: Tables 2, 3, and 4 show the results of our largest models, compared with published results. Our approach achieved BLEU scores of 40.56 and 26.03 on the WMT’14 En→Fr and En→De benchmarks. As our models did not use RL refinement, these results constitute significant gains of 1.34 and 1.12 BLEU score on top of the strong baselines in (Wu et al., 2016). The perplexity scores are also better. On the Google Production dataset, our model achieved 1.01 higher test BLEU score even after training for only one sixth of the time.

5.4 MULTILINGUAL MACHINE TRANSLATION

Dataset: (Johnson et al., 2016) train a single GNMT (Wu et al., 2016) model on a very large combined dataset of twelve language pairs. Results are somewhat worse than those for 12 separately trained single-pair GNMT models. This is not surprising, given that the twelve models have 12 times the capacity and twelve times the aggregate training of the one model. We repeat this experiment with a single MoE-augmented model. See Appendix E for details on model architecture. We train our model on the same dataset as (Johnson et al., 2016) and process the same number of training examples (about 3 billion sentence pairs). Our training time was shorter due to the lower computational budget of our model.

Results: Results for the single-pair GNMT models, the multilingual GNMT model and the multilingual MoE model are given in Table 5. The MoE model achieves 19% lower perplexity on the dev set than the multilingual GNMT model. On BLEU score, the MoE model significantly beats the multilingual GNMT model on 11 of the 12 language pairs (by as much as 5.84 points), and even beats the monolingual GNMT models on 8 of 12 language pairs. The poor performance on English → Korean seems to be a result of severe overtraining, as for the rarer language pairs a small number of real examples were highly oversampled in the training corpus.

Table 5: Multilingual Machine Translation (bold values represent best results).

6 CONCLUSION

This work is the first to demonstrate major wins from conditional computation in deep networks. We carefully identified the design considerations and challenges of conditional computing and addressed them with a combination of algorithmic and engineering solutions. While we focused on text, conditional computation may help in other domains as well, provided sufficiently large training sets. We look forward to seeing many novel implementations and applications of conditional computation in the years to come.
