
ALiBi

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-07-06

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi)

  • url: https://arxiv.org/abs/2108.12409
  • pdf: https://arxiv.org/pdf/2108.12409
  • abstract: Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add Positional Embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a Sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi’s inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.


TL;DR


  • ALiBi method: replaces positional embeddings with a linear bias on the attention scores, keeping performance consistent at inference time even on longer inputs
  • Experimental results: outperforms existing position methods on WikiText-103 and matches the sinusoidal baseline on CC100+RoBERTa while training faster and using less memory
  • Mathematical background: folds the linear bias directly into the softmax over query-key scores, handling positional information efficiently

1 Introduction

This paper focuses on a key design decision for transformer-based language models, the training sequence length \(L\), and asks how a model can be applied to longer sequences at inference time. RNN language models were trained on short-\(L\) sequences and simply assumed to generalize to longer contexts, whereas a transformer needs an explicit ability to extrapolate, that is, to keep performing well on sequence lengths it never saw during training. To this end, the paper replaces the usual sinusoidal positional embedding with Attention with Linear Biases (ALiBi), which drops positional embeddings entirely and relies only on a linear bias to achieve efficient extrapolation.


2 Limitations of Current Approaches

2.1 Background and Experimental Setup

A transformer language model receives a list of tokens and outputs a probability distribution over the next token. Its inputs are subsequences of (much longer) training or evaluation sequences, and \(L\) denotes the length of each input subsequence during training. To measure extrapolation, what matters is performance when evaluating sequences with \(L_{\text{valid}} > L\): \(L\) is the subsequence length during training and \(L_{\text{valid}}\) the length at validation time.

2.2 Measuring Extrapolation

Sinusoidal positional embeddings are non-learned vectors added to the input token embeddings and are the standard choice in transformer language models. The paper finds that a model trained with \(L = 512\) tokens, when run on the validation set with \(L + k\) tokens, improves at first, but performance plateaus and then starts to degrade beyond roughly \(k = 50\).


3 Attention with Linear Biases Enables Input Length Extrapolation (ALiBi)

ALiBi adds no positional embeddings to the transformer. Instead, after the query-key dot product, it adds a linear bias with a fixed, head-specific slope \(m\):

\[\text{softmax}(q_i K^\top + m \cdot [- (i - 1), \ldots, -2, -1, 0]),\]

where \(m\) is a head-specific slope set before training. The bias penalizes each query-key pair according to their distance, and the penalty grows as the distance increases.
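To make this concrete, below is a minimal sketch of the biased attention computation for a single query (PyTorch assumed; the function and variable names are illustrative and this is not the official implementation from the ALiBi repository):

```python
import torch

def alibi_attention_weights(q_i: torch.Tensor, K: torch.Tensor, head_slope: float) -> torch.Tensor:
    """q_i: (1, d) query at position i; K: (i, d) keys 1..i; head_slope: the fixed slope m."""
    i = K.shape[0]
    distances = torch.arange(-(i - 1), 1, dtype=q_i.dtype)   # [-(i-1), ..., -1, 0]
    scores = q_i @ K.T + head_slope * distances              # linear bias added after the dot product
    return torch.softmax(scores, dim=-1)                     # no positional embeddings anywhere
```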

Experimental Approach

The extrapolation ability of each position method is tested on the WikiText-103 corpus with the transformer language model of Baevski & Auli. The training set is roughly 103 million tokens of English Wikipedia, and the model has 16 transformer layers with a model dimension of 1024, 8 heads, and a feedforward inner dimension of 4096.

The model ties its word embedding and softmax matrices. Apart from the position method and the training subsequence length, no other hyperparameters are changed.

Training a transformer language model with ALiBi adds a small amount of memory compared to the sinusoidal model, but it can be trained on much shorter \(L\) sequences, which translates into large memory savings overall. ALiBi is particularly strong on long sequences, which also enables generating longer outputs.


4 Results

ALiBi outperforms existing position methods on WikiText-103 and matches the sinusoidal baseline on CC100+RoBERTa. Because a model trained with ALiBi on short sequences still handles long sequences effectively, it cuts memory and compute costs at both training and inference time. These results show that ALiBi is a practical way to inject positional information efficiently into every layer of the model.


5 Related Work

5.1 Wennberg & Henter (2021)

  • Introduces a relative position method that adds a bias to the attention scores as a function of the distance between the query and the key.
  • The radial basis function (RBF) used here is a function whose value depends on the distance from a center point; several learnable parameters adjust the bias for each distance. In general, an RBF around a point \(x\) takes the form \(\phi(\|x - c\|)\), where \(c\) is the center and \(\|x - c\|\) is the Euclidean distance (a concrete example follows this list).
  • Scope: applied mainly to text classification; language modeling and extrapolation were not studied.
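For concreteness, a common radial basis function is the Gaussian kernel (an illustrative example only, not necessarily the exact form used by Wennberg & Henter, 2021):

\[\phi(\|x - c\|) = \exp\!\left(-\frac{\|x - c\|^{2}}{2\sigma^{2}}\right),\]

where \(\sigma\) controls how quickly the bias decays with distance.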

5.2 Transformer-XL (Dai et al., 2019)

  • A language model with a cache mechanism that lets it attend to more tokens at inference time than it was trained on.
  • The cache stores representations of previous tokens and feeds them into the next step, extending the model's context window so that sequences longer than \(L\) can be handled.
  • Limitations: the output length is still bounded by the training length \(L\), and its relative position method runs slowly.

5.3 Longformer (Beltagy et al., 2020)

  • A model designed for long documents, aimed at processing long input sequences efficiently.
  • It restructures attention into a sliding-window pattern (compare the zephyr model, though with a different goal), keeping memory and compute manageable so the model can handle long sequences.
  • Requirement: to be effective, the model must be partially trained on longer sequences.
  • Comparison with ALiBi: ALiBi handles long sequences by extrapolation, without any additional training.

5.4 Other Related Work (Rosendahl et al., 2019; Neishi & Yoshinaga, 2019, etc.)

  • Application areas: machine translation, sequence-to-sequence models, arithmetic tasks with pre-trained models, reinforcement learning, image and speech recognition, and protein structure prediction.
  • Studies of extrapolation generally analyze how a model generalizes and maintains performance on data or structures it did not see during training, which is central to extending the applicability and performance limits of trained models.


6 Notes


[Reference Note 1] Sinusoidal Embedding

A method commonly used so that a model can recognize the relative positions of words within a sentence.

1. What is Sinusoidal Embedding?

Sinusoidal embedding originated in signal processing and has become a standard tool in deep learning, most prominently in natural language processing as the technique the Transformer uses to represent positional information. Each word's position is encoded with sine and cosine values at position-dependent frequencies and phases, which helps the model recognize the relative positions of words within a sentence.


2. Origin and Background

Sinusoidal embedding came into wide use with the Transformer architecture. The Transformer was introduced in 2017 by a Google research team to overcome the sequence-processing limitations of RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks). Its core is the self-attention mechanism presented in the paper ‘Attention Is All You Need’, supported by sinusoidal embeddings.


3. Definition

Sinusoidal embedding is defined as follows.

\[PE(pos, 2i) = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)\]

Here \(pos\) is the word's position, \(i\) is the dimension index, and \(d_{\text{model}}\) is the model dimension. The term \(10000^{2i/d_{\text{model}}}\) sets the frequency of each dimension; since every dimension gets a different frequency, the model can learn signals across a range of frequencies and capture varied patterns in the input. A small implementation sketch of this definition follows the list below.

  • Even after the positional encoding is added, encodings at nearby positions keep a similar pattern, which helps the model track changes in position,
  • the periodic pattern makes long-range dependencies easier for the model to learn, and
  • it covers long sequences and extends consistently to positions not seen during training.
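A minimal sketch of the definition above (PyTorch assumed; the function name is illustrative, not from any particular library):

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Returns a (max_len, d_model) matrix of fixed position encodings (d_model assumed even)."""
    pos = torch.arange(max_len, dtype=torch.float32)[:, None]       # (max_len, 1)
    i = torch.arange(d_model // 2, dtype=torch.float32)[None, :]    # (1, d_model/2) pair index
    angles = pos / torch.pow(10000.0, 2 * i / d_model)              # one frequency per dimension pair
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                                 # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angles)                                 # PE(pos, 2i+1)
    return pe
```

The resulting rows are simply added to the token embeddings at the input of the first layer.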


4. Strengths and Weaknesses

4.1 Strengths

  • Efficiency: the embeddings are precomputed and simply added to the inputs, so their computational cost is low.
  • Generalization: positions are encoded consistently even for sequence lengths not seen during training.
  • Long-range dependencies: the periodic pattern makes long-range dependencies easier for the model to learn.

4.2 Weaknesses

  • Fixed pattern: sinusoidal embedding uses a fixed set of frequencies, so it can be hard to optimize for a specific task.
  • Limited flexibility: compared with learned positional encodings there is nothing to adapt, which can limit how flexibly the model behaves in a particular context.

When a Transformer processes the sentence “The quick brown fox jumps over the lazy dog”, sinusoidal embedding is applied according to each word's position. For example, if ‘quick’ is in the second position, its positional encoding is computed from the \(\sin\) and \(\cos\) values above and added to its input vector, letting the model recognize that ‘quick’ follows ‘the’ and understand the context more precisely.

Thanks to this efficiency, sinusoidal embedding has become a core component of many modern NLP models, and its advantages stand out especially for long documents and complex linguistic structures.


1 INTRODUCTION

When constructing a transformer-based language model, a major design decision is the length of training sequences, denoted \(L\) herein, which has to date been equivalent to the length of inference sequences. More context, achieved by larger \(L\), improves predictions at inference time. But longer sequences are more expensive to train on.

Before transformers, RNN language models were trained on shorter-\(L\) sequences and assumed to generalize to longer contexts at inference time (Mikolov et al., 2010; Mikolov & Zweig, 2012; Zaremba et al., 2014). Vaswani et al. (2017), introducing the transformer, speculated that it “may […] extrapolate to sequence lengths longer than the ones encountered during training.” We define extrapolation as a model’s ability to continue performing well as the number of input tokens during validation increases beyond the number of tokens on which the model was trained. We find that transformer language models (LMs) that use Sinusoidal position embeddings have very weak extrapolation abilities; see Figure 1.

We demonstrate that this failure to extrapolate is caused by the position embedding method. As shown in Figure 1, recent alternatives to the original Sinusoidal position method (Su et al., 2021; Raffel et al., 2020) have improved extrapolation. However, the better of these, the T5 bias, is considerably slower than the Sinusoidal approach and uses extra memory and parameters (Figure 2).

We therefore introduce Attention with Linear Biases (ALiBi) to facilitate efficient extrapolation. ALiBi negatively biases attention scores with a linearly decreasing penalty proportional to the distance between the relevant key and query. Our simple approach eliminates position embeddings.

1 Code & models: https://github.com/ofirpress/attention_with_linear_biases

2 Figure 7 in the appendix plots training speed, in words per second, against \(L\).

Figure 1: Extrapolation: as the (validation-set’s) input sequence gets longer (x-axis), current position methods (Sinusoidal, rotary, and T5) show degraded perplexity (y-axis, lower is better), but our method (§3) does not. Models were trained on WikiText-103 with sequences of \(L = 512\) (left) or \(L = 1,024\) (right) tokens. T5 ran out of memory on our 32GB GPU. For more detail on exact perplexities and runtimes, see Tables 2 and 3 in the appendix.

Compared to a Sinusoidal model trained on the same input length, our method requires no additional runtime or parameters and incurs a negligible (0–0.7%) memory increase. ALiBi can be implemented by changing only a few lines of existing transformer code.

Using ALiBi, a transformer LM can be trained on short-\(L\) sequences and therefore at much lower cost, and it can still be reliably applied to long sequences at runtime. For example, a 1.3 billion parameter LM trained on \(L = 1024\) tokens with ALiBi achieves the same perplexity as a Sinusoidal model trained on \(L = 2048\) when both are tested on sequences of 2048 tokens, even though our model is 11% faster and uses 11% less memory.

Though performance peaks at around two times the number of tokens that the model was trained on, ALiBi maintains strong performance even on sequences of length 10,000. In recently explored settings where NLP training examples are given as context to an LM (Brown et al., 2020), our approach will allow exposure to more examples. Additionally, it enables the generation of longer outputs.

2 CURRENT APPROACHES DO NOT EXTRAPOLATE EFFICIENTLY

We show for the first time that the Sinusoidal position method, which technically should be able to extrapolate, in practice has very limited extrapolation capabilities. Though the rotary position method improves over the Sinusoidal one, it still does not achieve satisfying results. Holding everything else constant, we are the first to observe that the T5 bias method leads to better extrapolation than either of these, and so we conclude that extrapolation ability depends heavily on the position embedding. Unfortunately, the T5 bias is computationally costly (Figure 2).

2.1 BACKGROUND AND EXPERIMENTAL SETUP

A transformer LM receives a list of tokens and outputs a probability distribution representing its prediction for the next token. We call the input list the current input subsequence since the inputs to language models are typically subsequences from (much longer) training or evaluation sequences. During both training and perplexity evaluation (i.e., scoring a fixed sequence), many predictions can be calculated at once; this is done using a “causal mask” that ensures each position’s prediction is influenced only by tokens to its left. Let \(L\) be the length of each input subsequence during training; it includes \(L\) predictions, which on average have access to \((L+1)/2\) tokens of (left) context. To explore a model’s extrapolation abilities, we are interested in cases where sequences of length \(L_{\text{valid}} > L\) are considered at evaluation time. When \(L\) differs between inference and training, we use \(L\) to refer to the length of subsequences during training and \(L_{\text{valid}}\) to refer to their length at validation.

Figure 2: A comparison of batched training, inference speed and memory use of the Sinusoidal, rotary, T5 bias, and our ALiBi position methods. The speed differences between our method and the Sinusoidal are within 1% during training and 3% for inference, which is insignificant on our hardware. ALiBi uses 100MB of extra memory when training on input lengths 1024 and 3072 in this setting. Memory usage is lower in all approaches when training on 3072 tokens (compared to 1024) since we break batches into multiple updates. See Table 1 in the appendix for exact numbers.

Nonoverlapping Inference: To train on or evaluate a sequence longer than \(L\) tokens, it is typical to segment the sequence into \(L\)-length subsequences and train on or evaluate them independently. Unless otherwise stated, we use nonoverlapping inference to report perplexity scores.
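As a concrete illustration, a minimal sketch of nonoverlapping perplexity evaluation (Python; `chunk_nll` is a hypothetical scoring function, not part of the paper's released code):

```python
import math
from typing import Callable, List

def nonoverlapping_perplexity(tokens: List[int], L: int,
                              chunk_nll: Callable[[List[int]], float]) -> float:
    """Split `tokens` into independent L-length subsequences and average the per-token NLL."""
    total_nll, total_tokens = 0.0, 0
    for start in range(0, len(tokens), L):
        chunk = tokens[start:start + L]        # each chunk is scored independently
        total_nll += chunk_nll(chunk)          # summed negative log-likelihood of the chunk
        total_tokens += len(chunk)
    return math.exp(total_nll / total_tokens)  # perplexity
```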

Extrapolation During Inference: Formally, the functions that define a transformer layer are agnostic to input length; they map from some arbitrary, unfixed number of input vectors to the same number of output vectors. When transformers are applied to data that is inherently sequential, like text, positional information is injected into the inputs in various ways.

Vaswani et al. (2017) discussed two options for embedding positions into vectors to be added to word embeddings: learning embeddings for specific positions and unlearned Sinusoidal embeddings. They observed similar performance between these two but preferred the Sinusoidal approach, which they argued might extrapolate to longer input sequences during inference. We find that this model cannot extrapolate to more than a few dozen tokens beyond \(L\).

Experiment Setup: We first test the extrapolation abilities of various position methods on the WikiText-103 corpus (Merity et al., 2016) using the transformer language model of Baevski & Auli (2018). We use this model because of its prominent role in recent language modeling developments (Khandelwal et al., 2020; Press et al., 2021). The training set is about 103 million tokens from English Wikipedia (half a gigabyte). The model has 16 transformer layers of dimension 1024, with 8 heads, and a feedforward inner dimension of 4096. This model ties the word embedding and softmax matrices (Press & Wolf, 2017; Inan et al., 2017). In our experiments, other than varying the position method and training subsequence length, we modify no other hyperparameters, including the random seed and number of training epochs (205).

2.2 Measuring Extrapolation

Sinusoidal Position Embeddings: Sinusoidal position embeddings (Vaswani et al., 2017; §3.5) are constant, non-learned vectors that are added to token embeddings on input to the first layer of the transformer. They are frequently used in transformer language modeling (Baevski & Auli, 2018; Lewis et al., 2021) and machine translation (Vaswani et al., 2017; Ott et al., 2018) models. We first consider the unmodified model of Baevski & Auli (2018), which uses Sinusoidal position embeddings, and train it on \(L = 512\) tokens; we then run inference with it on the validation set on \(L + k\) tokens, with \(k\) ranging from 0 to 15,000. Figure 1 (left) and the corresponding Table 2 (in the appendix) show that while the model improves perplexity up to \(k = 20\), performance stops improving and stays steady from \(k = 20\) to \(k = 50\) and then begins degrading. Similar results are obtained for a model trained with \(L = 1024\) tokens (Figure 1 (right) and Table 3 in the appendix). That model improves for up to \(L_{\text{valid}} = L + 50\) tokens, after which performance declines.

3 These include the embedding lookup, feedforward sublayer, and softmax layer, which act independently on vector inputs, as well as the attention sublayers, whose parameters do not depend on input length (and which must handle variable-length inputs, e.g., due to causal masking).

4 The learned Positional Embedding approach does not have a way to encode positions greater than L; it therefore has no ability to extrapolate.

Rotary Position Embeddings

The rotary method was introduced by Su et al. (2021) and has recently been popularized by the open source GPT-3 (Brown et al., 2020) implementation GPT-J (Wang & Komatsuzaki, 2021). Instead of adding Sinusoidal embeddings at the bottom of the transformer, they multiply the keys and queries of every attention layer by Sinusoidal embeddings.

Unlike the Sinusoidal or learned Positional Embedding approach, the rotary method injects position information into the model at every layer, not just at the initial one. In addition, it adds no position information to the values of the self-attention sublayer. The output of a self-attention sublayer is a linearly transformed, weighted sum of the input value vectors; therefore, by not inserting position information into the values, the outputs of each transformer-layer contain no explicit position information. We suspect that this segregation of position information may be beneficial for extrapolation, and we draw inspiration from it in the design of our method (§3).

We apply the rotary position embedding method to our Baevski & Auli baseline. The perplexity results (Figure 1 and Appendix Tables 2 and 3) are better than the Sinusoidal approach: the model with \(L = 512\) (\(L = 1024\)) improves perplexity with up to \(k = 200\) (\(k = 100\)) more tokens than it saw during training, but this comes at the cost of slower training and inference (Figure 2).
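For reference, the core rotary operation can be sketched as follows: each (even, odd) pair of query/key dimensions is rotated by a position-dependent angle before the dot product. This is a simplified illustration, not the RoFormer_pytorch implementation used in the paper's experiments:

```python
import torch

def apply_rotary(x: torch.Tensor) -> torch.Tensor:
    """x: (seq_len, d) queries or keys with d even; rotates each (even, odd) dimension pair."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]                            # (seq_len, 1)
    inv_freq = 1.0 / torch.pow(10000.0, torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos * inv_freq                                                              # (seq_len, d/2)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * torch.cos(angles) - x_odd * torch.sin(angles)
    out[:, 1::2] = x_even * torch.sin(angles) + x_odd * torch.cos(angles)
    return out
```

Because both queries and keys are rotated, their dot product depends only on the relative offset between positions.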

T5 Bias

Though most models use trained or Sinusoidal position embeddings, the T5 model of Raffel et al. (2020) uses a relative position method (Shaw et al., 2018; Huang et al., 2019) that adds no position information to word embeddings (as in the previous method). Instead, it modifies the way attention values are computed. We refer to this as the “T5 bias” method. To compute attention values in the unmodified transformer, we compute the dot product of every query with every relevant key and then softmax these attention values. In this method, we compute the attention values as before, but then we add a learned, shared bias to each query-key score that is dependent on just the distance between the query and key. Therefore, all query-key scores where the query and key distance are zero (i.e., the query and key represent the same token) get a specific learned bias, all scores where the query and key are one word away get a different learned bias, and so on, up to a certain point, from where multiple different distances share the same learned bias (which might be beneficial for extrapolation). As in the rotary method, the T5 bias injects position information into the model at every layer and integrates no explicit position information into the self-attention value vectors.
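A simplified sketch in the spirit of the T5 bias is given below: one learned scalar per (head, relative distance), added to the query-key scores. The real T5 implementation buckets distances logarithmically so that far-away distances share biases; the clipping used here is a simplification, and the class name is illustrative:

```python
import torch
import torch.nn as nn

class LearnedRelativeBias(nn.Module):
    """One learned bias per (relative distance, head); distances beyond max_distance share a bias."""

    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(max_distance + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        """Returns a (num_heads, seq_len, seq_len) additive bias for the attention scores."""
        pos = torch.arange(seq_len)
        dist = (pos[:, None] - pos[None, :]).clamp(min=0, max=self.max_distance)  # i - j, clipped
        return self.bias(dist).permute(2, 0, 1)
```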

Raffel et al. (2020) propose that the T5 bias may allow extrapolation, but they did not report experiments testing this. Here, we show that the T5 bias does allow language models to extrapolate. We do this by again modifying the Baevski & Auli model, this time to insert the T5 bias into it.

As Figure 1 shows, the T5 bias improves perplexity with longer sequences than the ones it was trained on, i.e., \(k = 600\) (\(k = 800\)) extra tokens for a model trained on \(L = 512\) (\(L = 1024\)) input tokens. Unfortunately, this impressive performance comes at a cost: training is at least twice as slow as with the Sinusoidal model. Therefore, this model’s extrapolation ability provides no efficiency advantage. For example, to do inference on 1024 tokens, we could either train the Sinusoidal model with \(L = 1024\) or train the T5 bias model on \(L = 512\) tokens and extrapolate to 1024 for inference. However, the \(L = 1024\) Sinusoidal model runs at 28.5k words per second (WPS), while the \(L = 512\) T5 bias model runs at 14.4k WPS (Appendix Table 1), so there is no speedup when training on shorter sequences with this method.

5 Our rotary method implementation is based on the code in https://github.com/JunnYu/RoFormer_pytorch, which is linked to from the official repository of Su et al. (2021): https://github.com/ZhuiyiTechnology/roformer. After we finished running our experiments with the rotary method, we were informed that the runtime of the code linked above could be optimized, making it only 2% slower than the Sinusoidal approach. This optimization would not change extrapolation performance.

6 This method is similar to the one used in Parikh et al. (2016, Equation 7).

7 Our T5 bias implementation is based on the one used in HuggingFace Transformers (Wolf et al., 2020), which in turn is based on the official Mesh Tensorflow T5 code.

8 Narang et al. (2021) benchmarked the T5 bias as being just 8.7% slower than the Sinusoidal approach; thus, while always incurring a runtime penalty, this method’s runtime could be faster depending on the choice of hardware and software frameworks used. Narang et al. used the Tensorflow T5 library running on TPUs, while we used the PyTorch Fairseq library running on GPUs.

3 ATTENTION WITH LINEAR BIASES (ALIBI)

In the transformer model of Vaswani et al. (2017), position embeddings are added to the word embeddings at the bottom of the network. For an input subsequence of length \(L\), the attention sublayer computes the attention scores for the \(i\)-th query \(q_i \in \mathbb{R}^{1 \times d}\), (\(1 \leq i \leq L\)) in each head, given the first \(i\) keys \(K \in \mathbb{R}^{i \times d}\), where \(d\) is the head dimension:

\[\text{softmax}(q_i K^\top)\]

These attention scores are then multiplied by the values to return the output of the attention sublayer.
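For reference, a minimal sketch of this unmodified attention computation over all positions at once, using a causal mask (projections, the \(1/\sqrt{d}\) scaling, and dropout are omitted, as in the paper's simplified notation):

```python
import torch

def causal_attention_weights(Q: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Q, K: (L, d). Returns (L, L) weights where row i attends only to keys 1..i."""
    L = Q.shape[0]
    scores = Q @ K.T                                                         # query-key dot products
    causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)  # block future keys
    return torch.softmax(scores + causal_mask, dim=-1)
```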

Figure 3: When computing attention scores for each head, our linearly biased attention method, ALiBi, adds a constant bias (right) to each attention score (\(q_i \cdot k_j\), left). As in the unmodified attention sublayer, the softmax function is then applied to these scores, and the rest of the computation is unmodified. \(m\) is a head-specific scalar that is set and not learned throughout training. We show that our method for setting \(m\) values generalizes to multiple text domains, models and training compute budgets. When using ALiBi, we do not add Positional Embeddings at the bottom of the network.

When using ALiBi, we do not add position embeddings at any point in the network. The only modification we apply is after the query-key dot product, where we add a static, non-learned bias:

\[\text{softmax}(q_i K^\top + m \cdot [- (i - 1), \ldots, -2, -1, 0]),\]

where scalar \(m\) is a head-specific slope fixed before training. Figure 3 offers a visualization. For our models with 8 heads, the slopes that we used are the geometric sequence:

\[\left\{\frac{1}{2^{1}}, \frac{1}{2^{2}}, \ldots, \frac{1}{2^{8}}\right\}\]

For models that require 16 heads, we interpolate those 8 slopes by geometrically averaging every consecutive pair, resulting in the geometric sequence that starts at \(\frac{1}{2^{0.5}}\) and has \(\frac{1}{2^{0.5}}\) as its ratio:

\[\frac{1}{2^{0.5}}, \frac{1}{2^{1}}, \frac{1}{2^{1.5}}, \ldots, \frac{1}{2^{8}}\]

In general, for \(n\) heads, our set of slopes is the geometric sequence that starts at \(2^{-8/n}\) and uses that same value as its ratio.
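A minimal sketch of this slope schedule (Python; the helper name is illustrative and this simple version does nothing beyond the formula above):

```python
def alibi_slopes(n_heads: int) -> list:
    """Geometric sequence starting at 2**(-8/n_heads) with that same value as its ratio."""
    start = 2 ** (-8.0 / n_heads)
    return [start ** (k + 1) for k in range(n_heads)]

# alibi_slopes(8)  -> [1/2, 1/4, ..., 1/256]
# alibi_slopes(16) -> starts at 1/2**0.5 and ends at 1/2**8
```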

In §4, we observe that this set of slopes works on a wide variety of text domains and model sizes. Therefore, we do not believe that it is necessary to tune these slope values every time a new model is trained on a new dataset. This makes our method similar to the Sinusoidal approach, where the hyperparameters (the start and end of the geometric progression of wavelengths) were set once by Vaswani et al. (2017) and then reused in different models of different sizes on different datasets.

ALiBi has an inductive bias towards recency; it penalizes attention scores between distant query-key pairs, with the penalty increasing as the distance between a key and a query grows. The different heads increase their penalties at different rates, depending on the slope magnitude.

We initially experimented with making the slopes trainable, but this did not yield strong extrapolation results. A brief manual exploration of around ten slope sets led us to discover the set of slopes that we finally picked. Our main insight from this exploration is that the slope sets that work best are those with slopes in the \((0, 1)\) range, with the slopes’ density increasing as we get closer to 0. We also found our method to be robust to slope choice. Even randomly sampling from the exponential distribution worked well in some cases (although that method had high variance).

Since ALiBi is a relative position method, we add position information at every layer to the keys and queries but not to the values, as is done in the T5 bias and rotary methods. We hypothesize that these properties might be beneficial for extrapolation.

9 For simplicity we omit the key, query, value and final output projections, dropout, and the scaling factor.

10 The ALiBi bias is not multiplied by the \(\sqrt{d_k}\) scaling factor from Equation 1 of Vaswani et al. (2017).

11 In our experiments, trainable slopes also slowed down the training speed by 3%.

Implementation. ALiBi is easy to implement, with all changes accomplished in a few lines of code. We implement it by modifying the mask matrix by adding the linear biases to it (in practice, when training a transformer LM, query \(q_i\) attends only to keys \(1\) to \(i\); this is implemented by adding a mask matrix to the query-key dot product before the softmax operation is applied). This means that there is no runtime penalty when using our method since we add no operations to the network.
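A minimal sketch of this mask-based implementation (PyTorch assumed; the function name is illustrative and this is not the paper's fairseq code): the linear biases are folded into the causal mask, giving one \(L \times L\) mask per head that is added to the scores before the softmax.

```python
import torch

def alibi_causal_mask(n_heads: int, L: int) -> torch.Tensor:
    """Returns an (n_heads, L, L) additive mask: -inf above the diagonal (future keys)
    and the head-specific linear bias m * (j - i) at or below it."""
    slopes = torch.tensor([2 ** (-8.0 * (k + 1) / n_heads) for k in range(n_heads)])
    pos = torch.arange(L)
    relative = (pos[None, :] - pos[:, None]).clamp(max=0).float()      # j - i, zeroed for future keys
    causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    return slopes.view(n_heads, 1, 1) * relative + causal              # broadcasts to (n_heads, L, L)
```

Adding this mask to the per-head score tensor is the only change; no positional embeddings are used anywhere else in the network.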

Compared to the Sinusoidal model trained on the same input lengths, ALiBi incurs a memory increase (up to 100MB in some of our experiments): in the unmodified transformer, the mask is of size \(L \times L\); when using ALiBi, the mask is a slightly larger \(n \times L \times L\) (where \(n\) is the number of heads) since the linear biases added for each head use a different slope. But, as we show, ALiBi enables training on much smaller sequences while still achieving (and occasionally surpassing) results obtained using Sinusoidal embeddings on longer sequences, which saves multiple gigabytes of memory.

4 RESULTS

We first show that on WikiText-103 ALiBi is efficient and enables training models with short input subsequences that outperform strong baselines even when the ALiBi models extrapolate to more than six times the number of tokens that they were trained on. We then take the same hyperparameters for our method (the set of slopes) that worked on WikiText-103 and show that – with no modification – they provide strong results on a dataset in a very different domain: books. Finally, we show that a 1.3B parameter model trained with ALiBi on a much larger (461 GB) dataset with much more compute provides a superior alternative to the Sinusoidal method since it achieves similar perplexity scores while running faster and using less memory (since it is trained on shorter inputs).

While multiple alternatives to the position methods presented in Vaswani et al. (2017) have been proposed, few have been adopted in large (1B or more parameter) LMs since that setting is much more challenging than the smaller scale experiments. GPT-3 and Jurassic-1 (Lieber et al., 2021) use the learned position embedding method from Vaswani et al., and GPT-J uses the rotary method. Our results on the 1.3B parameter model show our method’s ability to generalize to larger models, dataset sizes and training durations without retuning the hyperparameter.

4.1 RESULTS ON WIKITEXT-103 AND TORONTO BOOKCORPUS

We first develop our method on the WikiText-103 corpus (Merity et al., 2016), replacing the Sinusoidal position embeddings in the language model of Baevski & Auli (2018) with ALiBi.

Figure 4: ALiBi models trained and evaluated on varying sequence lengths on the WikiText-103 validation set and the Sinusoidal baseline (not evaluated on longer sequences). All of our models outperform the Sinusoidal ones even when trained on fewer tokens. Appendix Table 5 has exact perplexities, more ALiBi models (trained on fewer tokens), and results for rotary and T5 bias models.

Figure 4 (and the corresponding Appendix Table 5) shows our results for models trained with varying numbers of input subsequence tokens (\(L\)), extrapolating to longer subsequence lengths on the validation dataset. Our first observation is that, without extrapolation, for every \(L\), our models outperform those using the Sinusoidal method, sometimes by a significant amount. For example, the Baevski & Auli model achieves 18.67±0.24 (std. dev.) perplexity when trained with \(L = 3072\) input tokens, but our \(L = 3072\) model achieves 17.60 perplexity (when both models evaluate with \(L_{\text{valid}} = 3072\)).

Our second observation is that all of our models can extrapolate, and they obtain improved perplexity scores when handling more tokens than they observed during training. For example, our model trained on 512 tokens (which achieves 19.73 perplexity when evaluating subsequences of length 512 in the development set) achieves a perplexity score of 18.40 on the development set when extrapolating to subsequences of length 3072. Surprisingly, this surpasses the score that the \(L = 3072\) Sinusoidal model obtains on the development set by a statistically significant margin. Note that all our models trained on \(L = 512\) to \(L = 2048\) outperform the Sinusoidal baseline trained on \(L = 3072\) when extrapolating to \(L_{\text{valid}} = 3072\) even though those models all take much less time to train since they train on shorter subsequences (Appendix Figure 8 compares training speed to perplexity for these models)! The \(L = 512\) model is 1.84 times faster to train and yet still outperforms the \(L = 3072\) Sinusoidal model when extrapolating to \(L_{\text{valid}} = 3072\). In addition, training the \(L = 3072\) Sinusoidal model requires a GPU with more than 16 GB of memory to fit the large attention matrices, which our \(L = 512\) model outperforms even though it can be trained on a GPU with much less memory due to much smaller attention matrices.

Additionally, Table 5 (in the appendix) also shows that, for \(L\)s of 1024 and 3072, our method performs better than the rotary and T5 bias models even when \(L_{\text{valid}} = L\) (i.e., no extrapolation is occurring). Figure 1 (and the corresponding Appendix Tables 2 and 3) more broadly explore our method vs. the other position methods. They show that the T5 bias (the best of the baselines) improves perplexity until \(L_{\text{valid}}\) is around \(2L\), but on the WikiText-103 dataset our method continually improves perplexity until at least around \(3L\), with the \(L = 512\) model improving perplexity even when \(L_{\text{valid}}\) exceeds 12k tokens. Even when unable to improve perplexity given longer sequences, ALiBi always maintains strong performance as more tokens are added.

Appendix Table 6 shows that our results on the validation set also transfer to the test set of WikiText-103. Currently, almost all models that present results on WikiText-103 use sliding window evaluation (defined in §B) to compute perplexities. We apply that method to our (and to the Sinusoidal, rotary, and T5 bias) models in Appendix Table 7. We find that our \(L = 3072\) model surpasses the performance of Transformer-XL (Dai et al., 2019), the Sandwich (Press et al., 2020), and Shortformer (Press et al., 2021) models. Our results are similar to the ones obtained with staged training (Press et al., 2021) but fall short of results obtained by Routing Transformer (Roy et al., 2020) and kNN-LM (Khandelwal et al., 2020). The methods used in those models are orthogonal to ours, and we hypothesize that combining them with ours might lead to even larger performance increases.

After developing our method on WikiText-103, in Appendix Section A.3, we run one set of experiments on a different domain (books) using a similar model architecture and without modifying any of the ALiBi hyperparameters (the slopes) and show that our results fully transfer to this new domain. Our models are able to both surpass the Sinusoidal baseline when not extrapolating while also outperforming it when extrapolating to longer sequences.

4.2 Results on the CC100+RoBERTa Corpus

Our final set of experiments investigates whether ALiBi transfers to a larger model trained with a larger computational budget on a larger dataset than the ones we previously used. We show that our method achieves strong results in this more challenging setting, obtaining similar performance to the Sinusoidal baseline while using significantly less memory, since we train on shorter subsequences.

The dataset we choose is a combination of the datasets used to train the RoBERTa (Liu et al., 2019) implementation of BERT (Devlin et al., 2019) and the English part of the CC-100 corpus introduced in Conneau et al. (2020), for a total of 461 GB. The RoBERTa training corpus—i.e., the Toronto Book Corpus (Zhu et al., 2015), English Wikipedia, CC-News (Nagel, 2016), OpenWebText (Gokaslan & Cohen, 2019), and Stories (Trinh & Le, 2018) — is 161 gigabytes, and the English part of the CC-100 corpus is 300 gigabytes. The validation set contains 649K tokens.

Our models for this dataset have 25 transformer layers with 16 heads and a dimension of 2048, with a feedforward inner dimension of 8192. These models have 1.3B parameters. We train our models for one epoch, which is 50k updates on 128 V100 GPUs.

In Figure 5 (left), we compare the validation perplexity for \(L_{\text{valid}} = 1024\) throughout the training process for an ALiBi model trained with \(L = 512\) compared to the Sinusoidal model trained with \(L = 1024\). Since our model is trained on shorter sequences, it is 7% faster and uses 1.6 GB less memory. We halt training of the Sinusoidal baseline when our model reaches the end of its training (one epoch). At that time, our model is just 0.06 perplexity away from the baseline even though it was trained on sequences that are half the length of those the baseline used and requires less memory.

In Figure 5 (right), results become even more impressive, showing that our model trained on \(L = 1024\) outperforms by 0.09 perplexity the Sinusoidal model trained on \(L = 2048\) (when evaluating with \(L_{\text{valid}} = 2048\)) even though our model uses 3.1 GB less memory. Our model maintains a lead in perplexity over the Sinusoidal model during the entire training process. By sampling five evenly distributed points across the training process, we compute that our \(L = 1024\) model reaches a given perplexity value, on average, 11% faster than the Sinusoidal model does.

Since our models in these comparisons use much less memory, they allow for stacking more layers, which would further improve performance (with negligible, if any, runtime cost). To keep our experiments as straightforward as possible, however, we do not add layers to our models.

Appendix Table 12 presents additional results comparing our models to the Sinusoidal baseline when both are trained on the same \(L\), showing that ALiBi performs similarly to the Sinusoidal baseline when not extrapolating. This contrasts with the results presented on the smaller datasets, where ALiBi consistently outperforms other position methods even when not extrapolating, suggesting that ALiBi’s inductive bias provides additional benefits for lower-resource language modeling.

Figure 6: The ALiBi and Sinusoidal models (with both \(L = 512\) and 1024) trained for 50k updates (1 epoch) on the CC100+RoBERTa corpus, extrapolating on the validation set. ALiBi achieves the best results at around \(2L\) but maintains strong performance even up to 10,000 tokens in these experiments.

Figure 6 shows that our models trained on \(L = 512\) and \(L = 1024\) achieve the best results when extrapolating to about double the tokens that they were trained on. Specifically, the \(L = 512\) model (that obtains 9.79 perplexity when \(L_{\text{valid}} = 512\)) achieves its best score (9.3) when extrapolating to about double the number of tokens it was trained on.

One possible explanation is that the subsequences the model observes during training are up to \(L\) tokens long. When performing inference on subsequences of length \(2L\), half of the subsequences the model consumes are as long as the examples seen during training. When inference is performed on subsequences of length \(2L + 1\) or longer, less than half of the predictions the model makes are on subsequences of lengths seen during training, and that might degrade performance.
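As a worked check of this argument (the arithmetic below is a reviewer's addition, not taken from the paper): within an evaluation input of length \(L_{\text{valid}}\), the prediction at position \(i\) conditions on a prefix of length \(i\), so the fraction of predictions whose context length was seen during training is

\[\frac{L}{L_{\text{valid}}}, \qquad \text{e.g.}\quad \frac{L}{2L} = \frac{1}{2}, \qquad \frac{L}{2L + 1} < \frac{1}{2}.\]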

The Sinusoidal model cannot extrapolate at all in this setting, with its performance degrading for both the \(L = 512\) and \(L = 1024\) models as soon as one token more than \(L\) is added during evaluation.

In Appendix B, we find that ALiBi’s edge over Sinusoidal embeddings is largely explained by its improved avoidance of the early token curse. We posit that future work building on ALiBi might achieve further gains by more efficiently exploiting longer histories.

In parallel with our work, Wennberg & Henter (2021) introduce a relative position method that, like our method, adds a bias to attention scores that is a function of the distance between the key and query elements. Unlike our ALiBi method, which uses a non-learned linear function, their method uses a radial-basis function, with multiple trainable parameters (in our experiments, this led to a slight decrease in runtime). In addition, they present experiments on text classification, not on language modeling. They do not explore extrapolation. The Distance Aware Transformer (Wu et al., 2021) multiplies attention scores by a bias that is a function of the distance between the key and query. This function uses a different, learned parameter in every head. They show results only on text classification. In our experiments (not presented), multiplying attention scores by the bias (instead of adding, as in ALiBi) degraded performance.

Transformer-XL (Dai et al., 2019) presented a language model that uses a cache and can attend to more tokens during inference than it was trained on (by increasing the length of the cache). However, this work presents results only where output length is limited to \(L\) (the training length), and their relative position method is very slow (Press et al., 2021). The Longformer (Beltagy et al., 2020) adapts models trained on shorter sequences to document-level tasks. However, to achieve this they had to partially train their models on longer sequences. Our ALiBi method enables extrapolation without any additional training on longer sequences.

To our knowledge, extrapolation has not been previously explored in transformer language modeling, but it has been investigated previously and concurrently with transformers on other tasks, such as machine translation (Rosendahl et al., 2019; Neishi & Yoshinaga, 2019; Newman et al., 2020; Kiyono et al., 2021), sequence-to-sequence models trained on an artificial dataset (Hupkes et al., 2020), pretrained sequence-to-sequence models tested on arithmetic tasks (Nogueira et al., 2021, Appendix C), models trained with reinforcement learning (Lampinen et al., 2021), image, speech recognition, and machine translation models (Likhomanenko et al., 2021), and protein structure prediction (Jumper et al., 2021, Appendix 1.5).

6 CONCLUSION

We showed that the Sinusoidal position embedding approach does not enable transformers to extrapolate to inputs longer than the ones they were trained on. We then established that extrapolation in transformers can be enabled by just changing the position method. We showed that our ALiBi method offers an extremely simple replacement for existing position approaches and allows models to extrapolate. In addition, when not extrapolating, our method achieves either better perplexity than the Sinusoidal method (in models smaller than 1B parameters, trained on less data) or similar perplexity (in larger, billion parameter models trained on much more data). ALiBi is simple to implement and does not slow down runtime or require extra parameters (but does occasionally require a negligible amount of extra memory). Using our method, we sped up the training of a 1.3 billion parameter model evaluated on the same input sequence length as GPT-3 (2048).
