Architecture | Mamba

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-04-02

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

  • url: https://arxiv.org/abs/2312.00752
  • pdf: https://arxiv.org/pdf/2312.00752
  • abstract: Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers’ computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5×higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

Mamba 1

Release Date: 2023.12

  • Content-based reasoning capability
  • Efficient on long sequences
  • State-of-the-art performance across modalities
Mamba-2 (Phi-Mamba)

Release Date: 2024.08

  • Trained with minimal data (3-5B tokens)
  • Leverages pre-existing Transformer knowledge
  • Superior performance among non-Transformer models


Overview

  1. Mamba-2 is trained (distilled) effectively from a small amount of data by leveraging the knowledge of an existing Transformer model.
  2. It proposes a hybrid model that combines the strengths of Transformers and SSMs (Hybrid Phi-Mamba, see Section 3).
  3. However, some of the claims are so far supported only under a limited set of experimental settings.

| Feature | Mamba | Mamba-2 (Phi-Mamba) |
| --- | --- | --- |
| Architecture | Selective state space model (SSM) | Selective state space model (SSM) |
| Training approach | Trained from scratch | Distilled from a Transformer model |
| Training data | Large-scale (exact amount not specified) | 3B tokens (Phi-Mamba), 5B tokens (Hybrid Phi-Mamba) |
| Base model | Standalone architecture | Based on the Phi-1.5 architecture; some attention layers are retained in the hybrid variant |
| Key advantages | Linear-time complexity; efficient on long sequences; strong performance across modalities | Leverages Transformer strengths; effective performance from little training data; reuses existing Transformer resources |
| Inference speed | 5× higher throughput than Transformers | No specific figure reported, but retains linear-time complexity |
| Performance comparison | Outperforms Transformers of the same size in the reported experiments | Best among open-source non-Transformer models in the reported experiments |
| Key contribution | Introduces selective state spaces | MOHAWK, an effective method for distilling Transformer knowledge into SSMs |
| Scalability | Performance improves up to million-length sequences | Not stated explicitly, but SSM properties favor long sequences |
| Hybrid version | Not mentioned | Hybrid Phi-Mamba variant exists |


Contents

TL;DR


  • Selective state space models (SSMs) are a new class of sequence models that process data efficiently on the basis of continuity and selectivity.
  • These models scale linearly in sequence length while retaining the modeling power of Transformers.
  • The Mamba architecture is reported to deliver high performance across diverse domains, with improved results on real long sequences.

In the review discussion, some felt that the mathematical theory is thin and that more validation on perplexity and other benchmarks is still needed.


1. Introduction

Foundation models (FMs), pretrained on massive data and then adapted to a variety of downstream tasks, have emerged as an effective paradigm in modern machine learning. The backbone of these FMs are sequence models that operate on arbitrary input sequences from domains such as language, images, speech, audio, time series, and genomics (Brown et al. 2020; Dosovitskiy et al. 2020; Ismail Fawaz et al. 2019; Oord et al. 2016; Poli et al. 2023; Sutskever, Vinyals, and Quoc V Le 2014). While the concept is agnostic to a particular architecture, modern FMs are predominantly based on the Transformer (Vaswani et al. 2017) and its core attention layer (Bahdanau, Cho, and Bengio 2015). (Core problem statement) Self-attention routes information densely within a context window, which makes it possible to model complex data, but this property brings fundamental drawbacks: it cannot model anything outside the finite window, and it scales quadratically with window length. To overcome these drawbacks, a large body of research has produced more efficient attention variants (Tay, Dehghani, Bahri, et al. 2022), but usually at the cost of performance. So far, none of these variants have been shown to be empirically effective across domains. (In Mamba's case, however, most of the follow-up work has been empirical validation, so the community seems to be watching it closely.)

Recently, structured state space sequence models (SSMs) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021) have emerged as a promising architecture class for sequence modeling. These models can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models (Kalman 1960). They can be computed efficiently as either a recurrence or a convolution, with linear or near-linear scaling in sequence length. They also have principled mechanisms for modeling long-range dependencies in certain data modalities and have dominated benchmarks such as the Long Range Arena (Tay, Dehghani, Abnar, et al. 2021). (They have, however, given up a lot in perplexity, and it is also hard for them to be completely free of what is called data leakage or contamination.) Many flavors of SSMs (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Y. Li et al. 2023; Ma et al. 2023; Orvieto et al. 2023; Smith, Warrington, and Linderman 2023) have been successful in domains involving continuous signal data such as audio and vision (Goel et al. 2022; Nguyen, Goel, et al. 2022; Saon, Gupta, and Cui 2023). However, they have been less effective at modeling discrete, information-dense data such as text.

The paper proposes a new class of selective state space models that achieves the modeling power of Transformers while scaling linearly in sequence length.

Selection mechanism. First, the authors identify a key limitation of prior models: the ability to efficiently select data in an input-dependent manner. Building on intuition from important synthetic tasks such as selective copying and induction heads, they design a simple selection mechanism that parameterizes the SSM parameters as functions of the input. This allows the model to filter out irrelevant information and remember relevant information indefinitely.

Hardware-aware algorithm. This simple change poses a technical challenge for computing the model; in fact, all prior SSMs had to be time- and input-invariant to be computationally efficient. The paper overcomes this with a hardware-aware algorithm that computes the model recurrently with a scan instead of a convolution, and does not materialize the expanded state, avoiding IO between levels of the GPU memory hierarchy. The resulting implementation is faster than prior methods both in theory (scaling linearly in sequence length, versus pseudo-linear for convolution-based SSMs) and on modern hardware (up to 3× faster on A100 GPUs).

Architecture. By combining the design of prior SSM architectures with the MLP block of Transformers into a single block, the paper arrives at a simple, homogeneous architecture design (Mamba). Selective SSMs are fully recurrent models with key properties that make them suitable as the backbone of general foundation models.

  • (i) High quality: selectivity brings strong performance on dense modalities such as language and genomics.
  • (ii) Fast training and inference: computation and memory scale linearly in sequence length during training, and autoregressive inference takes constant time per step since no cache of previous elements is needed.
  • (iii) Long context: quality and efficiency together yield performance improvements on real data up to sequence length 1M.

The paper empirically validates Mamba's potential (the crux of the paper) in both pretraining quality and domain-specific task performance across several modalities and settings:

  • Synthetics. On important synthetic tasks such as copying and induction heads, which have been proposed as key to large language models, Mamba not only solves them easily but can extrapolate solutions indefinitely.
  • Audio and genomics. Mamba outperforms prior state-of-the-art models such as SaShiMi, Hyena, and Transformers on modeling audio waveforms and DNA sequences, in both pretraining quality and downstream metrics (e.g., more than halving FID on a challenging speech generation dataset). In both settings, performance improves with longer context.
  • Language modeling. Mamba is the first linear-time sequence model that truly achieves Transformer-quality performance, in both pretraining perplexity and downstream evaluations. With scaling laws up to 1B parameters, Mamba exceeds a wide range of baselines, including strong modern Transformer training recipes based on LLaMa (Touvron et al. 2023). The Mamba language model has 5× the generation throughput of a Transformer of similar size, and Mamba-3B matches the quality of Transformers twice its size, including an average of 4 points higher on common-sense reasoning compared to Pythia-3B.

Model code and pretrained checkpoints are publicly available at https://github.com/state-spaces/mamba.


2. State Space Models

Structured state space sequence models (S4) are a new class of sequence models related to RNNs, CNNs, and classical state space models. They transform an input sequence $x(t)$ into an output sequence $y(t)$ using the following equations.

\(h'(t) = Ah(t) + Bx(t)\) \(y(t) = Ch(t)\)

$A, B, C$ are model parameters, and the model can be computed in linear or near-linear time in the sequence length. S4 models use a discretization rule to convert the continuous parameters $(\Delta, A, B)$ into discrete parameters $(\overline{A}, \overline{B})$. For example, the zero-order hold (ZOH) rule is defined as follows.

\(\overline{A} = \exp(\Delta A)\) \(\overline{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B\)

This discretization step connects S4 models to continuous-time systems and enables behavior similar to the gating mechanism of RNNs.
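To make the discretization concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper; it assumes a diagonal $A$ so the ZOH formulas reduce to elementwise operations) that discretizes one channel and runs the resulting discrete recurrence:

```python
import numpy as np

def zoh_discretize(delta, A, B):
    """Zero-order hold for a diagonal A (shape (N,)).
    A_bar = exp(delta*A); B_bar = (delta*A)^{-1}(exp(delta*A) - 1) * delta*B,
    which simplifies to (A_bar - 1) / A * B elementwise."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """Discrete recurrence h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C . h_t."""
    h = np.zeros_like(A_bar)
    y = np.empty(len(x))
    for t, xt in enumerate(x):
        h = A_bar * h + B_bar * xt
        y[t] = C @ h
    return y

# toy usage: N = 4 hidden states, a single input channel, fixed step size
N, delta = 4, 0.1
A = -(np.arange(N) + 1.0)            # simple stable diagonal A
B, C = np.ones(N), np.linspace(1.0, 0.25, N)
A_bar, B_bar = zoh_discretize(delta, A, B)
y = ssm_recurrence(A_bar, B_bar, C, np.sin(np.linspace(0, 3, 32)))
```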


3. Selective State Space Models

Selective SSMs overcome the limitations of prior SSMs by incorporating a selection capability, improving both the efficiency and the effectiveness of sequence models. The selection mechanism lets the model parameters vary with the input, giving the model the ability to focus on or ignore particular inputs.

\[\textbf{Algorithm 2: SSM + Selection (S6)} \\ \textbf{Input:} \, x : (B, L, D) \\ \textbf{Output:} \, y : (B, L, D) \\ 1. \, A : (D, N) \leftarrow \text{Parameter} \\ 2. \, B : (B, L, N) \leftarrow s_B(x) \\ 3. \, C : (B, L, N) \leftarrow s_C(x) \\ 4. \, \Delta : (B, L, D) \leftarrow \tau_\Delta (\text{Parameter} + s_\Delta (x)) \\ 5. \, \overline{A}, \overline{B} : (B, L, D, N) \leftarrow \text{discretize}(\Delta, A, B) \\ 6. \, y \leftarrow \text{SSM}(\overline{A}, \overline{B}, C)(x) \\ 7. \, \textbf{return} \, y \\\]

This algorithm uses input-dependent $\Delta, B, C$ parameters to overcome the time invariance of standard SSMs, while a hardware-aware algorithm keeps the computation efficient. The selective scan exploits the memory hierarchy of modern hardware to manage the state efficiently, using kernel fusion and parallel scan to improve performance.
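For intuition, a naive reference implementation of this selective (S6) scan for a single batch element is sketched below; it is a plain Python loop, not the fused hardware-aware kernel described in the paper, and the weight names (`W_B`, `W_C`, `W_dt`, `dt_bias`) are illustrative:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, A, W_B, W_C, W_dt, dt_bias):
    """Naive S6 scan for one batch element.
    x: (L, D) inputs; A: (D, N) diagonal state matrix (negative entries);
    W_B, W_C: (D, N) projections giving B_t, C_t = s_B(x_t), s_C(x_t);
    W_dt: (D,) projection to a scalar, broadcast to Delta_t of shape (D,)."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                          # one N-dim state per channel
    y = np.empty((L, D))
    for t in range(L):
        B_t = x[t] @ W_B                          # (N,)   input-dependent B
        C_t = x[t] @ W_C                          # (N,)   input-dependent C
        dt = softplus(x[t] @ W_dt + dt_bias)      # (D,)   input-dependent step size
        A_bar = np.exp(dt[:, None] * A)           # (D, N) ZOH discretization per step
        B_bar = (A_bar - 1.0) / A * B_t           # (D, N)
        h = A_bar * h + B_bar * x[t][:, None]     # selective state update
        y[t] = h @ C_t                            # (D,)   readout
    return y

L, D, N = 16, 8, 4
rng = np.random.default_rng(0)
A = -np.tile(np.arange(1.0, N + 1.0), (D, 1))
y = selective_scan(rng.standard_normal((L, D)), A,
                   0.1 * rng.standard_normal((D, N)),
                   0.1 * rng.standard_normal((D, N)),
                   0.1 * rng.standard_normal(D), np.full(D, -2.0))
```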


4. Empirical Evaluation

Section 4.1 tests Mamba's ability to solve the two synthetic tasks motivated in Section 3.1. The evaluation then covers three domains, each with autoregressive pretraining and downstream tasks:

  • 4.2 Language model pretraining and zero-shot downstream evaluation
    • Pretraining is assessed via scaling laws, and zero-shot evaluation measures downstream task performance.
  • 4.3 DNA sequence pretraining and fine-tuning on a long-sequence classification task
    • After pretraining on DNA sequence data, the model is fine-tuned on a long-sequence classification task.
  • 4.4 Audio waveform pretraining and quality of autoregressively generated speech clips
    • After pretraining on audio waveforms, the quality of autoregressively generated speech clips is evaluated.

Finally, Section 4.5 shows Mamba's computational efficiency at training and inference time, and Section 4.6 ablates the architecture and the components of the selective SSM.


4.1 Synthetic Tasks

4.1.1 Selective Copying

The Selective Copying task modifies the standard Copying task by randomizing the spacing between tokens, exposing the limits of models that merely keep track of time; it requires the ability to selectively memorize particular tokens. Earlier work introduced gating mechanisms to add data dependence, but these turn out to be limited because they do not interact along the sequence axis (Dao et al. 2023). The selective SSM (S6), in contrast, overcomes this limitation and solves the task easily.

4.1.2 Induction Heads

The induction heads task is a simple associative-recall task that is a useful predictor of the in-context learning ability of large language models: the model must remember that "Potter" should follow "Harry". Mamba generalizes perfectly on this task, handling sequences up to 1M tokens, far longer than anything seen during training.


4.2 Language Modeling

The Mamba architecture follows GPT-3 specifications and is compared against other architectures on language model pretraining and zero-shot evaluation. Mamba matches or exceeds strong baselines such as Transformer++, and its advantage grows as the sequence length increases.

4.3 DNA Modeling

On DNA sequences, Mamba addresses the long-standing long-range dependency problem and performs well in both pretraining and fine-tuning. The selective SSM processes information efficiently even at long context lengths, improving performance on long sequences that are difficult for prior models.

4.4 Audio Modeling and Generation

Compared against the SaShiMi architecture for audio waveform modeling, Mamba performs better: at long context lengths it achieves better (lower) bits per byte (BPB), and it produces higher-fidelity audio generations.

4.5 Speed and Memory Benchmarks

Mamba's selective scan is up to 40× faster than a standard scan implementation, and at inference Mamba achieves 5× higher throughput than a Transformer of similar size, since it can use much larger batch sizes.

4.6 Model Ablations

Detailed ablations over the architecture and the selective SSM layer assess the contribution of Mamba's main components. The selection mechanism contributes most of the gains, and using real-valued rather than complex-valued SSMs can be the better choice for hardware efficiency.


5. Discussion

This section discusses related work, limitations, and future directions.

Related work. Appendix A explains how the selection mechanism relates to similar concepts. Appendix B provides extended related work on SSMs and other related models.

Expressivity of the selection mechanism

The selection mechanism constructs $\Delta$ from a projection of the input. Table 9 shows that even a $\Delta$ projection of dimension 1 yields a large performance gain, and increasing the dimension improves performance further at a modest cost in parameters. The $\Delta$ projection size is fixed at 64.

Selectivity of the SSM state dimension

Increasing the SSM state dimension $N$ amounts to expanding the dimensionality of the recurrent state, and Table 10 shows this can significantly improve performance, especially when $B$ and $C$ are selective. This expansion improves performance at almost no cost in parameters or FLOPs.

Structured SSMs on the continuous-discrete spectrum

Structured SSMs were originally defined as discretizations of continuous systems and carry a strong inductive bias toward continuous-time data modalities such as perceptual signals (e.g., audio, video). As discussed in Sections 3.1 and 3.5, the selection mechanism overcomes their weaknesses on discrete modalities such as text and DNA, but it can impede performance on data where LTI SSMs excel. The audio waveform analysis examines this tradeoff in more detail.

Downstream affordances. Transformer-based foundation models have a rich ecosystem of interaction modes such as fine-tuning, adaptation, prompting, in-context learning, instruction tuning, RLHF, and quantization. Whether Transformer alternatives such as SSMs have similar properties and affordances is of particular interest.

Scaling. The empirical evaluation is limited to model sizes below most strong open-source LLMs; it remains to be seen whether Mamba still compares favorably at these larger sizes. Scaling SSMs may also involve further engineering challenges and model adjustments not discussed in this paper.


6. Conclusion

Introducing a selection mechanism into structured state space models lets them perform context-dependent reasoning while scaling linearly in sequence length. Integrated into a simple attention-free architecture, Mamba achieves state-of-the-art results across diverse domains, matching or exceeding the performance of strong Transformer models. The authors are excited about the broad application of selective state space models as foundation-model backbones for emerging modalities that need long context, such as genomics, audio, and video. The results suggest that Mamba is a strong candidate as a general sequence model backbone.


1. Introduction

Foundation models (FMs), or large models pretrained on massive data then adapted for downstream tasks, have emerged as an effective paradigm in modern machine learning. The backbone of these FMs are often sequence models, operating on arbitrary sequences of inputs from a wide variety of domains such as language, images, speech, audio, time series, and genomics (Brown et al. 2020; Dosovitskiy et al. 2020; Ismail Fawaz et al. 2019; Oord et al. 2016; Poli et al. 2023; Sutskever, Vinyals, and Quoc V Le 2014). While this concept is agnostic to a particular choice of model architecture, modern FMs are predominantly based on a single type of sequence model: the Transformer (Vaswani et al. 2017) and its core attention layer (Bahdanau, Cho, and Bengio 2015). The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length. An enormous body of research has appeared on more efficient variants of attention to overcome these drawbacks (Tay, Dehghani, Bahri, et al. 2022), but often at the expense of the very properties that make it effective. As of yet, none of these variants have been shown to be empirically effective at scale across domains.

Recently, structured state space sequence models (SSMs) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021) have emerged as a promising class of architectures for sequence modeling. These models can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models (Kalman 1960). This class of models can be computed very efficiently as either a recurrence or convolution, with linear or near-linear scaling in sequence length. Additionally, they have principled mechanisms for modeling long-range dependencies (Gu, Dao, et al. 2020) in certain data modalities, and have dominated benchmarks such as the Long Range Arena (Tay, Dehghani, Abnar, et al. 2021). Many flavors of SSMs (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Y. Li et al. 2023; Ma et al. 2023; Orvieto et al. 2023; Smith, Warrington, and Linderman 2023) have been successful in domains involving continuous signal data such as audio and vision (Goel et al. 2022; Nguyen, Goel, et al. 2022; Saon, Gupta, and Cui 2023). However, they have been less effective at modeling discrete and information-dense data such as text.

We propose a new class of selective state space models, that improves on prior work on several axes to achieve the modeling power of Transformers while scaling linearly in sequence length.

Selection Mechanism. First, we identify a key limitation of prior models: the ability to efficiently select data in an input-dependent manner (i.e. focus on or ignore particular inputs). Building on intuition based on important synthetic tasks such as selective copy and induction heads, we design a simple selection mechanism by parameterizing the SSM parameters based on the input. This allows the model to filter out irrelevant information and remember relevant information indefinitely.

Hardware-aware Algorithm. This simple change poses a technical challenge for the computation of the model; in fact, all prior SSMs models must be time- and input-invariant in order to be computationally efficient. We overcome this with a hardware-aware algorithm that computes the model recurrently with a scan instead of convolution, but does not materialize the expanded state in order to avoid IO access between different levels of the GPU memory hierarchy. The resulting implementation is faster than previous methods both in theory (scaling linearly in sequence length, compared to pseudo-linear for all convolution-based SSMs) and on modern hardware (up to 3× faster on A100 GPUs).

Architecture. We simplify prior deep sequence model architectures by combining the design of prior SSM architectures (Dao, Fu, Saab, et al. 2023) with the MLP block of Transformers into a single block, leading to a simple and homogenous architecture design (Mamba) incorporating selective state spaces. Selective SSMs, and by extension the Mamba architecture, are fully recurrent models with key properties that make them suitable as the backbone of general foundation models operating on sequences. (i) High quality: selectivity brings strong performance on dense modalities such as language and genomics. (ii) Fast training and inference: computation and memory scales linearly in sequence length during training, and unrolling the model autoregressively during inference requires only constant time per step since it does not require a cache of previous elements. (iii) Long context: the quality and efficiency together yield performance improvements on real data up to sequence length 1M.

We empirically validate Mamba’s potential as a general sequence FM backbone, in both pretraining quality and domain- specific task performance, on several types of modalities and settings:

  • Synthetics. On important synthetic tasks such as copying and induction heads that have been proposed as being key to large language models, Mamba not only solves them easily but can extrapolate solutions indefinitely long (>1M tokens).
  • Audio and Genomics. Mamba out-performs prior state-of-the-art models such as SaShiMi, Hyena, and Transformers on modeling audio waveforms and DNA sequences, both in pretraining quality and downstream metrics (e.g. reducing FID on a challenging speech generation dataset by more than half). In both settings, its performance improves with longer context up to million-length sequences.
  • Language Modeling. Mamba is the first linear-time sequence model that truly achieves Transformer-quality performance, both in pretraining perplexity and downstream evaluations. With scaling laws up to 1B parameters, we show that Mamba exceeds the performance of a large range of baselines, including very strong modern Transformer training recipes based on LLaMa (Touvron et al. 2023). Our Mamba language model has 5× generation throughput compared to Transformers of similar size, and Mamba-3B’s quality matches that of Transformers twice its size (e.g. 4 points higher avg. on common sense reasoning compared to Pythia-3B and even exceeding Pythia-7B).

Model code and pre-trained checkpoints are open-sourced at https://github.com/state-spaces/mamba.

2 State Space Models

Structured state space sequence models (S4) are a recent class of sequence models for deep learning that are broadly related to RNNs, CNNs, and classical state space models. They are inspired by a particular continuous system (1) that maps a 1-dimensional function or sequence $x(t) \in \mathbb{R} \mapsto y(t) \in \mathbb{R}$ through an implicit latent state $h(t) \in \mathbb{R}^N$.

Figure 1: (Overview.) Structured SSMs independently map each channel (e.g. $D = 5$) of an input $x$ to output $y$ through a higher dimensional latent state $h$ (e.g. $N = 4$). Prior SSMs avoid materializing this large effective state ($DN$, times batch size $B$ and sequence length $L$) through clever alternate computation paths requiring time-invariance: the $(\Delta, A, B, C)$ parameters are constant across time. Our selection mechanism adds back input-dependent dynamics, which also requires a careful hardware-aware algorithm to only materialize the expanded states in more efficient levels of the GPU memory hierarchy.

Concretely, S4 models are defined with four parameters $(\Delta, A, B, C)$, which define a sequence-to-sequence transformation in two stages.

\[h'(t) = Ah(t) + Bx(t) \tag{1a}\] \[y(t) = Ch(t) \tag{1b}\] \[h_t = \overline{A}h_{t-1} + \overline{B}x_t \tag{2a}\] \[y_t = Ch_t \tag{2b}\] \[\overline{K} = (C\overline{B}, C\overline{A}\overline{B}, \ldots, C\overline{A}^k\overline{B}, \ldots) \tag{3a}\] \[y = x * \overline{K} \tag{3b}\]

Discretization. The first stage transforms the “continuous parameters” $(\Delta, A, B)$ to “discrete parameters” $(\overline{A}, \overline{B})$ through fixed formulas $\overline{A} = f_A(\Delta, A)$ and $\overline{B} = f_B(\Delta, A, B)$, where the pair $(f_A, f_B)$ is called a discretization rule. Various rules can be used such as the zero-order hold (ZOH) defined in equation (4).

\[\overline{A} = \exp(\Delta A)\] \[\overline{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B \tag{4}\]

Discretization has deep connections to continuous-time systems which can endow them with additional properties such as resolution invariance (Nguyen, Goel, et al. 2022) and automatically ensuring that the model is properly normalized (Gu, Johnson, Timalsina, et al. 2023; Orvieto et al. 2023). It also has connections to gating mechanisms of RNNs (Gu, Gulcehre, et al. 2020; Tallec and Ollivier 2018) which we will revisit in Section 3.5. However, from a mechanical point of view, discretization can simply be viewed as the first step of the computation graph in the forward pass of an SSM. Alternate flavors of SSMs can bypass the discretization step and parameterize $(A, B)$ directly instead (Zhang et al. 2023), which may be easier to reason about.

Computation. After the parameters have been transformed from $(\Delta, A, B, C) \mapsto (\overline{A}, \overline{B}, C)$, the model can be computed in two ways, either as a linear recurrence (2) or a global convolution (3). Commonly, the model uses the convolutional mode (3) for efficient parallelizable training (where the whole input sequence is seen ahead of time), and is switched into recurrent mode (2) for efficient autoregressive inference (where the inputs are seen one timestep at a time).
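As a quick sanity check of the recurrence/convolution duality between (2) and (3), the sketch below (illustrative; single channel, diagonal $\overline{A}$) builds the kernel $\overline{K}$ explicitly and compares it against the unrolled recurrence:

```python
import numpy as np

def lti_recurrent(A_bar, B_bar, C, x):
    """Recurrent mode (2): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros_like(A_bar)
    y = np.empty(len(x))
    for t, xt in enumerate(x):
        h = A_bar * h + B_bar * xt
        y[t] = C @ h
    return y

def lti_convolutional(A_bar, B_bar, C, x):
    """Convolutional mode (3): y = x * K_bar with K_bar_k = C A_bar^k B_bar."""
    L = len(x)
    K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
    return np.array([np.dot(K[: t + 1][::-1], x[: t + 1]) for t in range(L)])

N, L = 4, 32
rng = np.random.default_rng(1)
A_bar = np.exp(-0.1 * (np.arange(N) + 1.0))          # time-invariant diagonal
B_bar, C, x = rng.standard_normal(N), rng.standard_normal(N), rng.standard_normal(L)
assert np.allclose(lti_recurrent(A_bar, B_bar, C, x),
                   lti_convolutional(A_bar, B_bar, C, x))
```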

Linear Time Invariance (LTI). An important property of equations (1) to (3) is that the model’s dynamics are constant through time. In other words $(\Delta, A, B, C)$, and consequently $(\overline{A}, \overline{B})$ as well, are fixed for all time-steps. This property is called linear time invariance (LTI), which is deeply connected to recurrence and convolutions. Informally, we think of LTI SSMs as being equivalent to any linear recurrence (2a) or convolution (3b), and use LTI as an umbrella term for these classes of models.

Thus far, all structured SSMs have been LTI (e.g. computed as convolutions) because of fundamental efficiency constraints, discussed in Section 3.3. However, a core insight of this work is that LTI models have fundamental limitations in modeling certain types of data, and our technical contributions involve removing the LTI constraint while overcoming the efficiency bottlenecks.

Finally, we note that structured SSMs are so named because computing them efficiently also requires imposing structure on the $A$ matrix. The most popular form of structure is diagonal (Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Smith, Warrington, and Linderman 2023), which we also use.

In this case, the $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$ matrices can all be represented by $N$ numbers. To operate over an input sequence $x$ of batch size $B$ and length $L$ with $D$ channels, the SSM is applied independently to each channel. Note that in this case, the total hidden state has dimension $DN$ per input, and computing it over the sequence length requires $O(BLDN)$ time and memory; this is the root of the fundamental efficiency bottleneck addressed in Section 3.3.

General State Space Models. We note that the term state space model has a very broad meaning which simply represents the notion of any recurrent process with a latent state. It has been used to refer to many disparate concepts in different disciplines, including Markov decision processes (MDP) (reinforcement learning (Hafner et al. 2020)), dynamic causal modeling (DCM) (computational neuroscience (Friston, Harrison, and Penny 2003)), Kalman filters (controls (Kalman 1960)), hidden Markov models (HMM) and linear dynamical systems (LDS) (machine learning), and recurrent (and sometimes convolutional) models at large (deep learning).

Throughout this entire paper, we use the term “SSM” to refer exclusively to the class of structured SSMs or S4 models (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Hasani et al. 2023; Ma et al. 2023; Smith, Warrington, and Linderman 2023) and use these terms interchangeably. For convenience, we may also include derivatives of such models, such as those focusing on either the linear-recurrence or global-convolution viewpoints (Y. Li et al. 2023; Orvieto et al. 2023; Poli et al. 2023), and clarify nuances when necessary.

SSM Architectures. SSMs are standalone sequence transformations that can be incorporated into end-to-end neural network architectures. (We also sometimes call SSM architectures SSNNs, which are to SSM layers as CNNs are to linear convolution layers.) We discuss some of the most well-known SSM architectures, many of which will also serve as our primary baselines.

  • Linear attention (Katharopoulos et al. 2020) is an approximation of self-attention involving a recurrence which can be viewed as a degenerate linear SSM.
  • H3 (Dao, Fu, Saab, et al. 2023) generalized this recurrence to use S4; it can be viewed as an architecture with an SSM sandwiched by two gated connections (Figure 3). H3 also inserts a standard local convolution, which they frame as a shift-SSM, before the main SSM layer.
  • Hyena (Poli et al. 2023) uses the same architecture as H3 but replaces the S4 layer with an MLP-parameterized global convolution (Romero et al. 2021).
  • RetNet (Y. Sun et al. 2023) adds an additional gate to the architecture and uses a simpler SSM, allowing an alternative parallelizable computation path, using a variant of multi-head attention (MHA) instead of convolutions.
  • RWKV (B. Peng et al. 2023) is a recent RNN designed for language modeling based on another linear attention approximation, the attention-free Transformer (S. Zhai et al. 2021). Its main “WKV” mechanism involves LTI recurrences and can be viewed as the ratio of two SSMs.

Other closely related SSMs and architectures are discussed further in an extended related work (Appendix B). We highlight in particular S5 (Smith, Warrington, and Linderman 2023), QRNN (Bradbury et al. 2016), and SRU (Lei et al. 2017), which we view as the most closely related methods to our core selective SSM.

3 Selective State Space Models

We motivate our selection mechanism using intuition from synthetic tasks (Section 3.1), then explain how to incorporate this mechanism into state space models (Section 3.2). The resulting time-varying SSMs cannot use convolutions, presenting a technical challenge of how to compute them efficiently. We overcome this with a hardware-aware algorithm that exploits the memory hierarchy on modern hardware (Section 3.3). We then describe a simple SSM architecture without attention or even MLP blocks (Section 3.4). Finally, we discuss some additional properties of selection mechanisms (Section 3.5).

3.1 Motivation: Selection as a Means of Compression

We argue that a fundamental problem of sequence modeling is compressing context into a smaller state. In fact, we can view the tradeoffs of popular sequence models from this point of view. For example, attention is both effective and inefficient because it explicitly does not compress context at all. This can be seen from the fact that autoregressive inference requires explicitly storing the entire context (i.e. the KV cache), which directly causes the slow linear-time inference and quadratic-time training of Transformers. On the other hand, recurrent models are efficient because they have a finite state, implying constant-time inference and linear-time training. However, their effectiveness is limited by how well this state has compressed the context.

To understand this principle, we focus on two running examples of synthetic tasks (Figure 2).

  • The Selective Copying task modifies the popular Copying task (Arjovsky, Shah, and Bengio 2016) by varying the position of the tokens to memorize. It requires content-aware reasoning to be able to memorize the relevant tokens (colored) and filter out the irrelevant ones (white).
  • The Induction Heads task is a well-known mechanism hypothesized to explain the majority of in-context learning abilities of LLMs (Olsson et al. 2022). It requires context-aware reasoning to know when to produce the correct output in the appropriate context (black).

These tasks reveal the failure mode of LTI models. From the recurrent view, their constant dynamics (e.g. the (𝑨, 𝑩) transitions in (2)) cannot let them select the correct information from their context, or affect the hidden state passed along the sequence in an input-dependent way. From the convolutional view, it is known that global convolutions can solve the vanilla Copying task (Romero et al. 2021) because it only requires time-awareness, but that they have difficulty with the Selective Copying task because of lack of content-awareness (Figure 2). More concretely, the spacing between inputs-to-outputs is varying and cannot be modeled by static convolution kernels.

In summary, the efficiency vs. effectiveness tradeoff of sequence models is characterized by how well they compress their state: efficient models must have a small state, while effective models must have a state that contains all necessary information from the context. In turn, we propose that a fundamental principle for building sequence models is selectivity: or the context-aware ability to focus on or filter out inputs into a sequential state. In particular, a selection mechanism controls how information propagates or interacts along the sequence dimension (see Section 3.5 for more discussion).

3.2 Improving SSMs with Selection

One method of incorporating a selection mechanism into models is by letting their parameters that affect interactions along the sequence (e.g. the recurrent dynamics of an RNN or the convolution kernel of a CNN) be input-dependent. Algorithms 1 and 2 illustrate the main selection mechanism that we use. The main difference is simply making several parameters $\Delta$, $B$, $C$ functions of the input, along with the associated changes to tensor shapes throughout. In particular, we highlight that these parameters now have a length dimension $L$, meaning that the model has changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2.) This loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.

We specifically choose $s_B(x) = \text{Linear}_N(x)$, $s_C(x) = \text{Linear}_N(x)$, $s_\Delta(x) = \text{Broadcast}_D(\text{Linear}_1(x))$, and $\tau_\Delta = \text{softplus}$, where $\text{Linear}_d$ is a parameterized projection to dimension $d$. The choice of $s_\Delta$ and $\tau_\Delta$ is due to a connection to RNN gating mechanisms explained in Section 3.5.
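A shape-level sketch of these projections (a minimal batched NumPy version for illustration; weight names are placeholders, and the broadcast over $D$ is made explicit):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selection_projections(x, W_B, W_C, W_dt, dt_bias):
    """x: (batch, L, D). Returns the input-dependent SSM parameters:
    B, C: (batch, L, N) via Linear_N, and
    Delta: (batch, L, D) via softplus(Broadcast_D(Linear_1(x)) + bias)."""
    B = x @ W_B                          # s_B(x) = Linear_N(x)        -> (batch, L, N)
    C = x @ W_C                          # s_C(x) = Linear_N(x)        -> (batch, L, N)
    dt = x @ W_dt                        # Linear_1(x)                 -> (batch, L, 1)
    delta = softplus(dt + dt_bias)       # Broadcast_D, then softplus  -> (batch, L, D)
    return B, C, delta

batch, L, D, N = 2, 16, 8, 4
rng = np.random.default_rng(0)
B, C, delta = selection_projections(
    rng.standard_normal((batch, L, D)),
    0.1 * rng.standard_normal((D, N)), 0.1 * rng.standard_normal((D, N)),
    0.1 * rng.standard_normal((D, 1)), np.zeros(D))
assert B.shape == (batch, L, N) and delta.shape == (batch, L, D)
```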

Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer based on context, a key ability for LLMs.

\[\begin{array}{ll} \textbf{Algorithm 1: SSM (S4)} & \\ \textbf{Input:} & x : (B, L, D) \\ \textbf{Output:} & y : (B, L, D) \\ 1. & A : (D, N) \leftarrow \text{Parameter} \\ 2. & B : (D, N) \leftarrow \text{Parameter} \\ 3. & C : (D, N) \leftarrow \text{Parameter} \\ 4. & \Delta : (D) \leftarrow \tau_\Delta (\text{Parameter}) \\ 5. & \overline{A}, \overline{B} : (D, N) \leftarrow \text{discretize}(\Delta, A, B) \\ 6. & y \leftarrow \text{SSM}(\overline{A}, \overline{B}, C)(x) \\ 7. & \textbf{return} \, y \\ \end{array}\] \[\begin{array}{ll} \textbf{Algorithm 2: SSM + Selection (S6)} & \\ \textbf{Input:} & x : (B, L, D) \\ \textbf{Output:} & y : (B, L, D) \\ 1. & A : (D, N) \leftarrow \text{Parameter} \\ & \quad \text{⊲ Represents structured } N \times N \text{ matrix} \\ 2. & B : (B, L, N) \leftarrow s_B(x) \\ 3. & C : (B, L, N) \leftarrow s_C(x) \\ 4. & \Delta : (B, L, D) \leftarrow \tau_\Delta (\text{Parameter} + s_\Delta (x)) \\ 5. & \overline{A}, \overline{B} : (B, L, D, N) \leftarrow \text{discretize}(\Delta, A, B) \\ 6. & y \leftarrow \text{SSM}(\overline{A}, \overline{B}, C)(x) \\ 7. & \textbf{return} \, y \\ & \quad \text{⊲ Time-varying: recurrence (scan) only} \\ \end{array}\]

3.3 Efficient Implementation of Selective SSMs

Hardware-friendly primitives such as convolutions (Krizhevsky, Sutskever, and Hinton 2012) and attention (Bahdanau, Cho, and Bengio 2015; Vaswani et al. 2017) enjoy widespread application. Here we aim to make selective SSMs efficient on modern hardware (GPUs) as well. The selection mechanism is quite natural, and earlier works attempted to incorporate special cases of selection, such as letting $\Delta$ vary over time in recurrent SSMs (Gu, Dao, et al. 2020). However, as previously mentioned, a core limitation in the usage of SSMs is their computational efficiency, which was why S4 and all derivatives used LTI (non-selective) models, most commonly in the form of global convolutions.

3.3.1 Motivation of Prior Models

We first revisit this motivation and overview our approach to overcome limitations of prior methods.

  • At a high level, recurrent models such as SSMs always balance a tradeoff between expressivity and speed: as discussed in Section 3.1, models with larger hidden state dimension should be more effective but slower. Thus we want to maximize hidden state dimension without paying speed and memory costs.
  • Note that the recurrent mode is more flexible than the convolution mode, since the latter (3) is derived from expanding the former (2) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021). However, this would require computing and materializing the latent state $h$ with shape $(B, L, D, N)$, which is much larger (by a factor of $N$, the SSM state dimension) than the input $x$ and output $y$ of shape $(B, L, D)$. Thus the more efficient convolution mode was introduced which could bypass the state computation and materializes a convolution kernel (3a) of size only $(B, L, D)$.
  • Prior LTI state space models leverage the dual recurrent-convolutional forms to increase the effective state dimension by a factor of $N$ (≈ 10 - 100), much larger than traditional RNNs, without efficiency penalties.

3.3.2 Overview of Selective Scan: Hardware-Aware State Expansion

The selection mechanism is designed to overcome the limitations of LTI models; at the same time, we therefore need to revisit the computation problem of SSMs. We address this with three classical techniques: kernel fusion, parallel scan, and recomputation. We make two main observations:

  • The naive recurrent computation uses $O(BLDN)$ FLOPs while the convolutional computation uses $O(BLD \log(L))$ FLOPs, and the former has a lower constant factor. Thus for long sequences and not-too-large state dimension $N$, the recurrent mode can actually use fewer FLOPs.
  • The two challenges are the sequential nature of recurrence, and the large memory usage. To address the latter, just like the convolutional mode, we can attempt to not actually materialize the full state $h$.

The main idea is to leverage properties of modern accelerators (GPUs) to materialize the state $h$ only in more efficient levels of the memory hierarchy. In particular, most operations (except matrix multiplication) are bounded by memory bandwidth (Dao, Fu, Ermon, et al. 2022; Ivanov et al. 2021; Williams, Waterman, and Patterson 2009). This includes our scan operation, and we use kernel fusion to reduce the amount of memory IOs, leading to a significant speedup compared to a standard implementation.

Concretely, instead of preparing the scan input $(\overline{A}, \overline{B})$ of size $(B, L, D, N)$ in GPU HBM (high-bandwidth memory), we load the SSM parameters $(\Delta, A, B, C)$ directly from slow HBM to fast SRAM, perform the discretization and recurrence in SRAM, and then write the final outputs of size $(B, L, D)$ back to HBM.

To avoid the sequential recurrence, we observe that despite not being linear it can still be parallelized with a work-efficient parallel scan algorithm (Blelloch 1990; Martin and Cundy 2018; Smith, Warrington, and Linderman 2023).

Finally, we must also avoid saving the intermediate states, which are necessary for backpropagation. We carefully apply the classic technique of recomputation to reduce the memory requirements: the intermediate states are not stored but recomputed in the backward pass when the inputs are loaded from HBM to SRAM. As a result, the fused selective scan layer has the same memory requirements as an optimized transformer implementation with FlashAttention.
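Although the selective recurrence is no longer time-invariant, each step $h_t = \overline{A}_t h_{t-1} + \overline{B}_t x_t$ is an affine map of $h_{t-1}$, and composing affine maps is associative; that associativity is what a work-efficient parallel scan exploits. A small illustrative sketch (sequential for clarity, but organized around the associative combine a parallel scan would use):

```python
import numpy as np

def combine(f, g):
    """Compose two elementwise affine maps h -> a*h + b (f applied first, then g)."""
    (fa, fb), (ga, gb) = f, g
    return ga * fa, ga * fb + gb

def scan_affine(A_bar, Bx):
    """Inclusive scan over the per-step affine maps (A_bar[t], Bx[t]), where
    Bx[t] = B_bar[t] * x[t]. A parallel scan would apply `combine` in a tree;
    here we fold left for clarity. Returns all hidden states (h_{-1} = 0)."""
    acc = (np.ones_like(A_bar[0]), np.zeros_like(A_bar[0]))   # identity map
    hs = []
    for t in range(len(A_bar)):
        acc = combine(acc, (A_bar[t], Bx[t]))
        hs.append(acc[1])            # h_t = acc_a * h_{-1} + acc_b = acc_b
    return np.stack(hs)

# check against the direct recurrence with time-varying (selective) parameters
L, N = 64, 4
rng = np.random.default_rng(0)
A_bar = np.exp(-rng.uniform(0.0, 1.0, size=(L, N)))
Bx = rng.standard_normal((L, N))
h, ref = np.zeros(N), []
for t in range(L):
    h = A_bar[t] * h + Bx[t]
    ref.append(h.copy())
assert np.allclose(scan_affine(A_bar, Bx), np.stack(ref))
```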

Details of the fused kernel and recomputation are in Appendix D. The full Selective SSM layer and algorithm are illustrated in Figure 1.

3.4 A Simplified SSM Architecture

As with structured SSMs, selective SSMs are standalone sequence transformations that can be flexibly incorporated into neural networks. The H3 architecture is the basis for the most well-known SSM architectures (Section 2), which are generally comprised of a block inspired by linear attention interleaved with an MLP (multi-layer perceptron) block. We simplify this architecture by combining these two components into one, which is stacked homogenously (Figure 3). This is inspired by the gated attention unit (GAU) (Hua et al. 2022), which did something similar for attention.

This architecture involves expanding the model dimension $D$ by a controllable expansion factor $E$. For each block, most of the parameters ($3ED^2$) are in the linear projections ($2ED^2$ for input projections, $ED^2$ for output projection) while the inner SSM contributes less. The number of SSM parameters (projections for $\Delta$, $B$, $C$, and the matrix $A$) are much smaller in comparison. We repeat this block, interleaved with standard normalization and residual connections, to form the Mamba architecture. We always fix to $E = 2$ in our experiments and use two stacks of the block to match the $12D^2$ parameters of a Transformer’s interleaved MHA (multi-head attention) and MLP blocks. We use the SiLU / Swish activation function (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V Le 2017), motivated so that the Gated MLP becomes the popular “SwiGLU” variant (Chowdhery et al. 2023; Dauphin et al. 2017; Shazeer 2020; Touvron et al. 2023). Finally, we additionally use an optional normalization layer (we choose LayerNorm (J. L. Ba, Kiros, and Hinton 2016)), motivated by RetNet’s usage of a normalization layer in a similar location (Y. Sun et al. 2023).
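A back-of-the-envelope check of that parameter accounting (a sketch under the description above; it ignores the comparatively small $\Delta$/$B$/$C$ projections, the local convolution, and $A$):

```python
def mamba_block_params(D, E=2):
    """Dominant parameters of one Mamba block: input projections to the two
    E*D-wide branches (2*E*D^2) plus the output projection (E*D^2) = 3*E*D^2."""
    return 2 * E * D * D + E * D * D

D = 1024
per_block = mamba_block_params(D)             # 3*E*D^2 = 6*D^2 when E = 2
print(per_block, 2 * per_block == 12 * D**2)  # two blocks ~ one Transformer
                                              # layer's 12*D^2 (MHA + MLP)
```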

3.5 Properties of Selection Mechanisms

The selection mechanism is a broader concept that can be applied in different ways, such as to more traditional RNNs or CNNs, to different parameters (e.g. $A$ in Algorithm 2), or using different transformations $s(x)$.

Figure 3: (Architecture.) Our simplified block design combines the H3 block, which is the basis of most SSM architectures, with the ubiquitous MLP block of modern neural networks. Instead of interleaving these two blocks, we simply repeat the Mamba block homogenously. Compared to the H3 block, Mamba replaces the first multiplicative gate with an activation function. Compared to the MLP block, Mamba adds an SSM to the main branch. For 𝜎 we use the SiLU / Swish activation (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V Le 2017).

3.5.1 Connection to Gating Mechanisms

We highlight the most important connection: the classical gating mechanism of RNNs is an instance of our selection mechanism for SSMs. We note that the connection between RNN gating and the discretization of continuous-time systems is well established (Funahashi and Nakamura 1993; Tallec and Ollivier 2018). In fact, Theorem 1 is an improvement of Gu, Johnson, Goel, et al. (2021, Lemma 3.1) generalizing to the ZOH discretization and input-dependent gates (proof in Appendix C). More broadly, $\Delta$ in SSMs can be seen to play a generalized role of the RNN gating mechanism. In line with prior work, we adopt the view that discretization of SSMs is the principled foundation of heuristic gating mechanisms.

Theorem 1. When $N = 1$, $A = -1$, $B = 1$, $s_\Delta = \text{Linear}(x)$, and $\tau_\Delta = \text{softplus}$, then the selective SSM recurrence (Algorithm 2) takes the form

\[g_t = \sigma (\text{Linear}(x_t )) \\ h_t = (1 - g_t )h_{t-1} + g_t x_t . \tag{5}\]

As mentioned in Section 3.2, our specific choices of $s_\Delta$, $\tau_\Delta$ are from this connection. In particular, note that if a given input $x_t$ should be completely ignored (as necessary in the synthetic tasks), all $D$ channels should ignore it, and so we project the input down to 1 dimension before repeating/broadcasting with $\Delta$.
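Filling in the algebra behind Theorem 1 (a quick check rather than the paper's proof in Appendix C): with $N = 1$, $A = -1$, $B = 1$, the ZOH formulas (4) give

\[\overline{A}_t = \exp(-\Delta_t), \qquad \overline{B}_t = (-\Delta_t)^{-1}\big(\exp(-\Delta_t) - 1\big)\cdot \Delta_t = 1 - \exp(-\Delta_t),\]

and with $\Delta_t = \text{softplus}(z_t)$, $z_t = \text{Linear}(x_t)$,

\[\exp(-\Delta_t) = \exp\!\big(-\log(1 + e^{z_t})\big) = \frac{1}{1 + e^{z_t}} = 1 - \sigma(z_t),\]

so $\overline{A}_t = 1 - g_t$ and $\overline{B}_t = g_t$ with $g_t = \sigma(\text{Linear}(x_t))$, which is exactly the gated recurrence (5).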

3.5.2 Interpretation of Selection Mechanisms

We elaborate on three particular mechanistic effects of selection.

Variable Spacing. Selectivity allows filtering out irrelevant noise tokens that may occur between inputs of interest. This is exemplified by the Selective Copying task, but occurs ubiquitously in common data modalities, particularly for discrete data – for example, the presence of language fillers such as “um”. This property arises because the model can mechanistically filter out any particular input $x_t$, for example in the gated RNN case (Theorem 1) when $g_t \rightarrow 0$.

Filtering Context. It has been empirically observed that many sequence models do not improve with longer context (F. Shi et al. 2023), despite the principle that more context should lead to strictly better performance. An explanation is that many sequence models cannot effectively ignore irrelevant context when necessary; an intuitive example is global convolutions (and general LTI models). On the other hand, selective models can simply reset their state at any time to remove extraneous history, and thus their performance in principle improves monotonically with context length (e.g. Section 4.3.2).

Boundary Resetting. In settings where multiple independent sequences are stitched together, Transformers can keep them separate by instantiating a particular attention mask, while LTI models will bleed information between the sequences. Selective SSMs can also reset their state at boundaries (e.g. $\Delta_t \rightarrow \infty$, or Theorem 1 when $g_t \rightarrow 1$). These settings may occur artificially (e.g. packing documents together to improve hardware utilization) or naturally (e.g. episode boundaries in reinforcement learning (Lu et al. 2023)).

Additionally, we elaborate on the effects of each selective parameter.

Interpretation of $\Delta$. In general, $\Delta$ controls the balance between how much to focus or ignore the current input $x_t$. It generalizes RNN gates (e.g. $g_t$ in Theorem 1): mechanically, a large $\Delta$ resets the state $h$ and focuses on the current input $x$, while a small $\Delta$ persists the state and ignores the current input. SSMs (1)-(2) can be interpreted as a continuous system discretized by a timestep $\Delta$, and in this context, the intuition is that large $\Delta \rightarrow \infty$ represents the system focusing on the current input for longer (thus “selecting” it and forgetting its current state) while a small $\Delta \rightarrow 0$ represents a transient input that is ignored.

Interpretation of $A$. We remark that while the $A$ parameter could also be selective, it ultimately affects the model only through its interaction with $\Delta$ via $\overline{A} = \exp(\Delta A)$ (the discretization (4)). Thus, selectivity in $\Delta$ is enough to ensure selectivity in $(\overline{A}, \overline{B})$, and is the main source of improvement. We hypothesize that making $A$ selective in addition to (or instead of) $\Delta$ would have similar performance, and leave it out for simplicity.

Interpretation of $B$ and $C$. As discussed in Section 3.1, the most important property of selectivity is filtering out irrelevant information so that a sequence model’s context can be compressed into an efficient state. In an SSM, modifying $B$ and $C$ to be selective allows finer-grained control over whether to let an input $x_t$ into the state $h_t$, or the state into the output $y_t$. These can be interpreted as allowing the model to modulate the recurrent dynamics based on content (input) and context (hidden states) respectively.

3.6 Additional Model Details

Real vs. Complex. Most prior SSMs use complex numbers in their state $h$, which is necessary for strong performance on many tasks in perceptual modalities (Gu, Goel, and Ré 2022). However, it has been empirically observed that completely real-valued SSMs seem to work fine, and possibly even better, in some settings (Ma et al. 2023). We use real values as the default, which work well for all but one of our tasks; we hypothesize that the complex-real tradeoff is related to the continuous-discrete spectrum in data modalities, where complex numbers are helpful for continuous modalities (e.g. audio, video) but not discrete (e.g. text, DNA).

Initialization. Most prior SSMs also suggest special initializations, particularly in the complex-valued case, which can help in several settings such as low-data regimes. Our default initialization for the complex case is S4D-Lin and for the real case is S4D-Real (Gu, Gupta, et al. 2022), which is based on the HIPPO theory (Gu, Dao, et al. 2020). These define the $n$-th element of $A$ as $-1/2 + ni$ and $-(n + 1)$ respectively. However, we expect many initializations to work fine, particularly in the large-data and real-valued SSM regimes; some ablations are considered in Section 4.6.
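Spelled out, those two defaults are (an illustrative helper, not the reference implementation's API):

```python
import numpy as np

def s4d_real_A(N):
    """S4D-Real initialization: the n-th diagonal element of A is -(n + 1)."""
    return -(np.arange(N) + 1.0)

def s4d_lin_A(N):
    """S4D-Lin initialization (complex case): the n-th element is -1/2 + n*i."""
    return -0.5 + 1j * np.arange(N)

print(s4d_real_A(4))   # [-1. -2. -3. -4.]
print(s4d_lin_A(4))    # [-0.5+0.j -0.5+1.j -0.5+2.j -0.5+3.j]
```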

Parameterization of $\Delta$. We defined the selective adjustment to $\Delta$ as $s_\Delta (x) = \text{Broadcast}_D (\text{Linear}_1 (x))$, which was motivated by the mechanics of $\Delta$ (Section 3.5). We observe that it can be generalized from dimension 1 to a larger dimension $R$. We set this to be a small fraction of $D$, which uses a negligible number of parameters compared to the main Linear projections in the block. We additionally note that the broadcasting operation can instead be viewed as another Linear projection, initialized to a specific pattern of 1’s and 0’s; if this projection is trainable, this leads to the alternative $s_\Delta (x) = \text{Linear}_D (\text{Linear}_R (x))$, which can be viewed as a low-rank projection. In our experiments, the $\Delta$ parameter (which can be viewed as a bias term) is initialized to $\tau_\Delta^{-1}$ following prior work on SSMs (Gu, Johnson, Timalsina, et al. 2023).
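A small sketch of that generalization (illustrative; `R`, the weight names, and the bias handling are assumptions, not the paper's code). With $R = 1$ and the up-projection frozen to ones, it reduces to the $\text{Broadcast}_D(\text{Linear}_1(x))$ form used above:

```python
import numpy as np

def s_delta_lowrank(x, W_down, W_up, dt_bias):
    """Low-rank Delta parameterization: softplus(Linear_D(Linear_R(x)) + bias).
    x: (L, D); W_down: (D, R); W_up: (R, D); dt_bias: (D,). Returns (L, D)."""
    z = (x @ W_down) @ W_up + dt_bias
    return np.log1p(np.exp(z))               # softplus keeps Delta positive

L, D, R = 16, 8, 2
rng = np.random.default_rng(0)
delta = s_delta_lowrank(rng.standard_normal((L, D)),
                        0.1 * rng.standard_normal((D, R)),
                        0.1 * rng.standard_normal((R, D)),
                        np.full(D, -1.0))
assert delta.shape == (L, D) and (delta > 0).all()
```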

Remark 3.1. For brevity in our experimental results, we sometimes abbreviate selective SSMs as S6 models, because they are S4 models with a selection mechanism and computed with a scan.

4 Empirical Evaluation

In Section 4.1 we test Mamba’s ability to solve the two synthetic tasks motivated in Section 3.1. We then evaluate on three domains, each evaluated on autoregressive pretraining as well as downstream tasks.

  • Section 4.2: language model pretraining (scaling laws), and zero-shot downstream evaluation.
  • Section 4.3: DNA sequence pretraining, and fine-tuning on a long-sequence classification task.
  • Section 4.4: audio waveform pretraining, and the quality of autoregressively generated speech clips.

Finally, Section 4.5 shows Mamba’s computational efficiency at both training and inference time, and Section 4.6 ablates various components of the architecture and selective SSMs.

4.1 Synthetic Tasks

Full experiment details for these tasks including task details and training protocol are in Appendix E.1.

4.1.1 Selective Copying

The Copying task is one of the most well-studied synthetic tasks for sequence modeling, originally designed to test the memorization abilities of recurrent models. As discussed in Section 3.1, LTI SSMs (linear recurrences and global convolutions) can easily solve this task by only keeping track of time instead of reasoning about the data; for example, by constructing a convolution kernel of exactly the right length (Figure 2). This was explicitly validated in earlier work on global convolutions (Romero et al. 2021). The Selective Copying task prevents this shortcut by randomizing the spacing between tokens. Note that this task has been introduced before as the Denoising task (Jing et al. 2019).

Note that many previous works argue that adding architecture gating (multiplicative interactions) can endow models with “data-dependence” and solve related tasks (Dao, Fu, Saab, et al. 2023; Poli et al. 2023). However, we find this explanation insufficient intuitively because such gating does not interact along the sequence axis, and cannot affect the spacing between tokens. In particular architecture gating is not an instance of a selection mechanism (Appendix A).

Table 1 confirms that gated architectures such as H3 and Mamba only partially improve performance, while the selection mechanism (modifying S4 to S6) easily solves this task, particularly when combined with these more powerful architectures.

4.1.2 Induction Heads

Induction heads (Olsson et al. 2022) is a simple task from the mechanistic interpretability lens (Elhage et al. 2021) that is surprisingly predictive of the in-context learning ability of LLMs. It requires models to perform associative recall and copy: for example, if the model has seen a bigram such as “Harry Potter” in the sequence, then the next time “Harry” appears in the same sequence, the model should be able to predict “Potter” by copying from history.

Dataset. We train a 2-layer model on the induction heads task at sequence length 256, with a vocab size of 16, which is comparable to prior work on this task (Dao, Fu, Saab, et al. 2023) but with longer sequences. We additionally investigate generalization and extrapolation abilities by evaluating on a range of sequence lengths from $2^6 = 64$ up to $2^{20} = 1048576$ at test time.

Models. Following established work on induction heads, we use 2-layer models, which allows attention to mechanistically solve the induction heads task (Olsson et al. 2022). We test both multi-head attention (8 heads, with various positional encodings) and SSM variants. We use a model dimension $D$ of 64 for Mamba and 128 for the other models.

Results. Table 2 shows that Mamba—or more precisely, its selective SSM layer—has the ability to solve the task perfectly because of its ability to selectively remember the relevant token while ignoring everything else in between. It generalizes perfectly to million-length sequences, or 4000× longer than it saw during training, while no other method goes beyond 2×.

Table 1: (Selective Copying.) Accuracy for combinations of architectures and inner sequence layers.

Table 2: (Induction Heads.) Models are trained on sequence length $2^8 = 256$, and tested on increasing sequence lengths of $2^6 = 64$ up to $2^{20} = 1048576$. Full numbers in Table 11.

Figure 4: (Scaling Laws.) Models of size ≈ 125𝑀 to ≈ 1.3𝐵 parameters, trained on the Pile. Mamba scales better than all other attention-free models and is the first to match the performance of a very strong “Transformer++” recipe that has now become standard, particularly as the sequence length grows.

Out of positional encoding variants for attention models, xPos (which was designed for length extrapolation) is slightly better than the others; also note that all attention models were only tested up to sequence length $2^{14} = 16384$ due to memory limitations. Out of other SSMs, H3 and Hyena are similar, contrary to the findings in Poli et al. (2023).

4.2 Language Modeling

We evaluate the Mamba architecture on standard autoregressive language modeling against other architectures, on both pretraining metrics (perplexity) and zero-shot evaluations. We set the model sizes (depth and width) to mirror GPT3 specifications. We use the Pile dataset (L. Gao, Biderman, et al. 2020), and follow the training recipe described in Brown et al. (2020). All training details are in Appendix E.2.

4.2.1 Scaling Laws

For baselines, we compare against the standard Transformer architecture (GPT3 architecture), as well as the strongest Transformer recipe we know of (here referred to as Transformer++), based on the PaLM and LLaMa architectures (e.g. rotary embedding, SwiGLU MLP, RMSNorm instead of LayerNorm, no linear bias, and higher learning rates). We also compare against other recent subquadratic architectures (Figure 4). All model details are in Appendix E.2.

Figure 4 shows scaling laws under the standard Chinchilla (Hoffmann et al. 2022) protocol, on models from ≈ 125𝑀 to ≈ 1.3𝐵 parameters. Mamba is the first attention-free model to match the performance of a very strong Transformer recipe (Transformer++) that has now become standard, particularly as the sequence length grows. (We note that full results on context length 8k are missing for the RWKV and RetNet baselines, prior strong recurrent models that can also be interpreted as SSMs, because of a lack of efficient implementations leading to out-of-memory or unrealistic computation requirements.)

Table 3 shows the performance of Mamba on a range of popular downstream zero-shot evaluation tasks. We compare against the most well-known open source models at these sizes, most importantly Pythia (Biderman et al. 2023) and RWKV (B. Peng et al. 2023) which were trained with the same tokenizer, dataset, and training length (300B tokens) as our models. (Note that Mamba and Pythia are trained with context length 2048, while RWKV was trained with context length 1024.)

Table 3: (Zero-shot Evaluations.) Best results for each size in bold. We compare against open source LMs with various tokenizers, trained for up to 300B tokens. Pile refers to the validation split, comparing only against models trained on the same dataset and tokenizer (GPT-NeoX-20B). For each model size, Mamba is best-in-class on every single evaluation result, and generally matches baselines at twice the model size.

4.3 DNA Modeling

Motivated by the success of large language models, there has been recent exploration into using the foundation model paradigm for genomics. DNA has been likened to language in that it consists of sequences of discrete tokens with a finite vocabulary. It is also known for requiring long-range dependencies to model (Avsec et al. 2021). We investigate Mamba as a FM backbone for pretraining and fine-tuning in the same setting as recent works on long-sequence models for DNA (Nguyen, Poli, et al. 2023). In particular, we focus on two explorations of scaling laws across model size and sequence length (Figure 5), and a difficult downstream synthetic classification task requiring long context (Figure 6).

For pretraining, we largely follow a standard causal language modeling (next token prediction) setup for the training and model details (see also Appendix E.2). For the dataset, we largely follow the setup of HyenaDNA (Nguyen, Poli, et al. 2023), which uses the HG38 dataset for pretraining consisting of a single human genome with about 4.5 billion tokens (DNA base pairs) in the training split.

Figure 5: (DNA Scaling Laws.) Pretraining on the HG38 (human genome) dataset. (Left) Fixing short context length $2^{10} = 1024$ and increasing size from ≈ 200K to ≈ 40M parameters, Mamba scales better than baselines. (Right) Fixing model size and increasing sequence lengths while keeping tokens/batch and total training tokens fixed. Unlike baselines, the selection mechanism of Mamba facilitates better performance with increasing context length.

4.3.1 Scaling: Model Size

In this experiment, we investigate the scaling properties of genomics foundation models with various model backbones (Figure 5 Left).

Training. To advantage the baselines, we train on a short sequence length of 1024; as shown in Section 4.3.2, we expect results to favor Mamba even more at longer sequence lengths. We fix a global batch size of 1024, for a total of $2^{20} \approx 1M$ tokens per batch. Models were trained for 10K gradient steps for a total of 10B tokens.

Results. Figure 5 (Left) shows that Mamba’s pretraining perplexity improves smoothly with model size, and that Mamba scales better than both HyenaDNA and Transformer++. For example, at the largest model size of ≈ 40M parameters, the curve shows that Mamba can match the Transformer++ and HyenaDNA models with roughly 3× to 4× fewer parameters.

4.3.2 Scaling: Context Length

In the next DNA experiment, we investigate the scaling properties of models with respect to sequence length. We only compare the HyenaDNA and Mamba models, as quadratic attention becomes prohibitively expensive at longer sequence lengths. We pretrain models on sequence lengths $2^{10} = 1024$, $2^{12} = 4096$, $2^{14} = 16384$, $2^{16} = 65536$, $2^{18} = 262144$, $2^{20} = 1048576$. We fix a model size of 6 layers by width 128 (about 1.3M-1.4M parameters). Models were trained for 20K gradient steps for a total of ≈ 330B tokens. The longer sequence lengths used sequence length warmup similar to (Nguyen, Poli, et al. 2023).

Results. Figure 5 (Right) shows that Mamba is able to make use of longer context even up to extremely long sequences of length 1M, and its pretraining perplexity improves as the context increases. On the other hand, the HyenaDNA model gets worse with sequence length. This is intuitive from the discussion in Section 3.5 on properties of the selection mechanism. In particular, LTI models cannot selectively ignore information; from a convolutional perspective, a very long convolution kernel is aggregating all information across a long sequence, which may be very noisy. Note that while HyenaDNA claims to improve with longer context, their results do not control for computation time.

4.3.3 Synthetic Species Classification

We evaluate models on a downstream task of classifying between 5 different species by randomly sampling a contiguous segment of their DNA. This task is adapted from HyenaDNA, which used the species {human, lemur, mouse, pig, hippo}. We modify the task to be significantly more challenging by classifying between the five great ape species {human, chimpanzee, gorilla, orangutan, bonobo}, which are known to share 99% of their DNA.

Figure 6: (Great Apes DNA Classification.) Accuracy after fine-tuning on sequences of length 2^10 = 1024 up to 2^20 = 1048576 using pretrained models of the same context length. Numerical results in Table 13.

Figure 7: (Audio Pretraining.) Mamba improves performance over prior state-of-the-art (SaShiMi) in autoregressive audio modeling, while improving up to minute-long context or million-length sequences (controlling for computation).

4.4 Audio Modeling and Generation

For the audio waveform modality, we compare primarily to the SaShiMi architecture and training protocols (Goel et al. 2022). This model comprises:

  1. a U-Net backbone with two stages of pooling by a factor 𝑝 that doubles the model dimension 𝐷 per stage,
  2. alternating S4 and MLP blocks in each stage.

We consider replacing the S4+MLP blocks with Mamba blocks. Experiment details are in Appendix E.4.
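A minimal sketch of that substitution is shown below, assuming a generic `mamba_block_fn` constructor (e.g. a block from an off-the-shelf Mamba implementation); the pooling and width-doubling details are simplified relative to the actual SaShiMi code.

```python
import torch.nn as nn

class MambaUNetStage(nn.Module):
    """One SaShiMi-style stage: pool the sequence by a factor p, double the
    model dimension D, then apply Mamba blocks in place of the original
    S4+MLP pairs. `mamba_block_fn(d_model)` is a placeholder for any
    Mamba block implementation."""
    def __init__(self, d_model: int, n_blocks: int, pool: int, mamba_block_fn):
        super().__init__()
        self.pool = pool
        self.down = nn.Linear(d_model * pool, d_model * 2)   # pool by p, double D
        self.blocks = nn.ModuleList(
            mamba_block_fn(d_model * 2) for _ in range(n_blocks)
        )

    def forward(self, x):                  # x: (batch, length, d_model)
        b, length, d = x.shape
        x = x.reshape(b, length // self.pool, d * self.pool)  # downsample
        x = self.down(x)
        for block in self.blocks:
            x = block(x) + x               # residual around each Mamba block
        return x
```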

4.4.1 Long-Context Autoregressive Pretraining

We evaluate pretraining quality (autoregressive next-sample prediction) on YouTubeMix (DeepSound 2017), a standard piano music dataset used by prior work consisting of 4 hours of solo piano music, sampled at a rate of 16000 Hz. Pretraining details largely follow the standard language modeling setup (Section 4.2). Figure 7 evaluates the effect of increasing training sequence lengths from 2^13 = 8192 to 2^20 ≈ 10^6, while keeping computation fixed. (There are some slight edge cases to the way the data is curated, which may lead to kinks in the scaling curves. For example, only minute-long clips were available, so the maximum sequence length is actually bounded by 60 s · 16000 Hz = 960000.)

Both Mamba and the SaShiMi (S4+MLP) baseline improve consistently with longer context lengths; Mamba is better throughout, and the gap widens at longer lengths. The main metric is bits per byte (BPB), which is a constant factor log(2) of the standard negative log-likelihood (NLL) loss for pretraining other modalities.
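For reference, the conversion is just a change of logarithm base: assuming one token per byte (e.g. 8-bit quantized audio samples), BPB is the per-token NLL in nats divided by ln 2. A tiny illustrative helper:

```python
import math

def nll_to_bits_per_byte(nll_nats: float) -> float:
    """Convert average per-token NLL (in nats) to bits per byte,
    assuming each token corresponds to one byte."""
    return nll_nats / math.log(2)

# e.g. an NLL of 0.70 nats/token corresponds to about 1.01 bits per byte
```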

We note one important detail: this is the only experiment in this paper in which we switched from the real parameterization to complex (Section 3.6). We show additional ablations in Appendix E.4.

4.4.2 Autoregressive Speech Generation

SC09 is a benchmark speech generation dataset (Donahue, McAuley, and Puckette 2019; Warden 2018), consisting of 1-second clips sampled at 16000 Hz of the digits “zero” through “nine” with highly variable characteristics. We largely follow the autoregressive training setup and generation protocol of Goel et al. (2022).

Table 4 shows automated metrics of the Mamba-UNet model compared to a variety of baselines from Goel et al. (2022): WaveNet (Oord et al. 2016), SampleRNN (Mehri et al. 2017), WaveGAN (Donahue, McAuley, and Puckette 2019), DiffWave (Z. Kong et al. 2021), and SaShiMi. A small Mamba model outperforms the state-of-the-art (and much larger) GAN- and diffusion-based models. A larger model parameter-matched to the baselines further improves on fidelity metrics dramatically.

Table 5 takes the small Mamba model and investigates combinations of different architectures for the outer stages and center stage. It shows that Mamba is consistently better than S4+MLP in the outer blocks, and Mamba > S4+MLP > MHA+MLP in the center blocks.

Table 4: (SC09) Automated metrics for unconditional generation on a challenging dataset of fixed-length speech clips. (Top to Bottom) Autoregressive baselines, non-autoregressive baselines, Mamba, and dataset metrics.

Table 5: (SC09 Model Ablations) Models with 6M parameters. In SaShiMi's U-Net backbone, there are 8 center blocks operating on sequence length 1000, sandwiched on each side by 8 outer blocks on sequence length 4000, sandwiched on each side by 8 outer blocks on sequence length 16000 (40 blocks total). The architecture of the 8 center blocks is ablated independently of the rest. Note that Transformers (MHA+MLP) were not tested in the more important outer blocks because of efficiency constraints.

4.5 Speed and Memory Benchmarks

We benchmark the speed of the SSM scan operation (state expansion 𝑁 = 16), as well as the end-to-end inference throughput of Mamba, in Figure 8. Our efficient SSM scan is faster than the best attention implementation that we know of (FlashAttention-2 (Dao 2024)) beyond sequence length 2K, and up to 20-40× faster than a standard scan implementation in PyTorch. Mamba achieves 4-5× higher inference throughput than a Transformer of similar size, since without the KV cache it can use much higher batch sizes. For example, a Mamba-6.9B (untrained) would have higher inference throughput than a 5× smaller Transformer-1.3B. Details in Appendix E.5, which additionally includes a benchmark of memory consumption.
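The inference-time throughput advantage comes from the recurrent view: the model carries a fixed-size state of shape (batch, D, N) per layer instead of a key-value cache that grows with the generated length. A minimal sketch of one recurrent step of a diagonal selective SSM, with simplified discretization and illustrative shapes (not the fused CUDA kernel used in practice):

```python
import torch

def selective_ssm_step(h, x, A, B, C, delta):
    """One inference step. Shapes (illustrative):
      h:     (batch, D, N) fixed-size recurrent state
      x:     (batch, D)    current token's features
      A:     (D, N)        diagonal state matrix
      B, C:  (batch, N)    input-dependent projections
      delta: (batch, D)    input-dependent step size
    Memory stays constant in sequence length; there is no KV cache."""
    dA = torch.exp(delta.unsqueeze(-1) * A)                       # discretize A
    dBx = delta.unsqueeze(-1) * B.unsqueeze(1) * x.unsqueeze(-1)  # discretize B*x
    h = dA * h + dBx                                              # state update
    y = (h * C.unsqueeze(1)).sum(dim=-1)                          # (batch, D) output
    return h, y
```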

Figure 8: (Efficiency Benchmarks.) (Left) Training: our efficient scan is 40× faster than a standard implementation. (Right) Inference: as a recurrent model, Mamba can achieve 5× higher throughput than Transformers.

4.6 Model Ablations

We perform a series of detailed ablations on components of our model, focusing on the setting of language modeling with size ≈ 350M models at Chinchilla token counts (same setting as Figure 4).

4.6.1 Architecture

Table 6 investigates the effects of the architecture (block) and its inner SSM layer (Figure 3). We find that

  • Among previous non-selective (LTI) SSMs, which are equivalent to global convolutions, performance is very similar.
  • Replacing the complex-valued S4 variant from previous work with a real-valued one does not affect performance much, suggesting that (at least for LM) real-valued SSMs may be a better choice when accounting for hardware efficiency.
  • Replacing any of these with a selective SSM (S6) significantly improves performance, validating the motivation of Section 3.
  • The Mamba architecture performs similarly to the H3 architecture (and seems slightly better when using a selective layer).

Table 6: (Ablations: Architecture and SSM layer.) The Mamba block performs similarly to H3 while being simpler. In the inner layer, there is little difference among different parameterizations of LTI models, while selective SSMs (S6) provide a large improvement. More specifically, the S4 (real) variant is S4D-Real and the S4 (complex) variant is S4D-Lin.

We also investigate interleaving the Mamba block with other blocks such as MLP (a traditional architecture) and MHA (a hybrid attention architecture) in Appendix E.2.2.

4.6.2 Selective SSM

Table 7 ablates the selective SSM layer by considering different combinations of selective Δ, 𝑩, and 𝑪 parameters (Algorithm 2), showing that Δ is the most important parameter due to its connection to RNN gating (Theorem 1).

Table 8 considers different initializations of the SSM, which have been shown to make a large difference in some data modalities and settings (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022). On language modeling, we find that simpler real-valued diagonal initializations (S4D-Real, row 3) instead of more standard complex-valued parameterizations (S4D-Lin, row 1) perform better. Random initializations also work well, consistent with findings from prior work (Mehta et al. 2023).
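For concreteness, the two diagonal initializations being compared can be sketched as follows; the formulas follow the S4D parameterizations (S4D-Real: A_n = -(n+1); S4D-Lin: A_n = -1/2 + iπn) and are shown here as an illustration rather than the exact training code.

```python
import math
import torch

def s4d_real_init(d_state: int) -> torch.Tensor:
    # S4D-Real: real-valued diagonal A with A_n = -(n + 1)
    return -(torch.arange(d_state, dtype=torch.float32) + 1.0)

def s4d_lin_init(d_state: int) -> torch.Tensor:
    # S4D-Lin: complex-valued diagonal A with A_n = -1/2 + i*pi*n
    n = torch.arange(d_state, dtype=torch.float32)
    return torch.complex(torch.full_like(n, -0.5), math.pi * n)
```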

Table 9 and Table 10 consider varying the dimension of the Δ and (𝑩, 𝑪) projections respectively. Changing them from static to selective provides the most benefit, while increasing the dimensions further generally improves performance modestly with a small increase in parameter count.
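A minimal sketch of this selective parameterization, loosely following Algorithm 2: B and C are per-timestep linear projections of the input with dimension N, while Δ goes through a low-rank projection (the Δ-projection dimension ablated in Table 9), is expanded back to D, and passes through a softplus to stay positive. Module and argument names are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Input-dependent Δ, B, C for a selective SSM (illustrative sketch)."""
    def __init__(self, d_model: int, d_state: int = 16, dt_rank: int = 64):
        super().__init__()
        self.to_B = nn.Linear(d_model, d_state)      # selective B
        self.to_C = nn.Linear(d_model, d_state)      # selective C
        self.to_dt = nn.Linear(d_model, dt_rank)     # low-rank Δ projection
        self.dt_proj = nn.Linear(dt_rank, d_model)   # expand back to D

    def forward(self, x):                            # x: (batch, L, d_model)
        B = self.to_B(x)                             # (batch, L, N)
        C = self.to_C(x)                             # (batch, L, N)
        delta = F.softplus(self.dt_proj(self.to_dt(x)))  # (batch, L, D), > 0
        return delta, B, C
```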

Of particular note is the dramatic improvement of the selective SSM when the state size 𝑁 is increased, with over a 1.0 perplexity improvement for a cost of only 1% additional parameters. This validates our core motivation in Sections 3.1 and 3.3.

5 Discussion

We discuss related work, limitations, and some future directions.

Related Work. Appendix A discusses how the selection mechanism relates to similar concepts. Appendix B has an extended related work of SSMs and other related models.

Table 9: (Ablations: Expressivity of Δ.) The selection mechanism of Δ constructs it with a projection of the input. Projecting it even to dim. 1 provides a large increase in performance; increasing it further provides further improvements at the cost of a modest increase in parameters. State size fixed to N = 16.

Table 10: (Ablations: SSM state dimension.) (Top) Constant 𝑩 and 𝑪 (Bottom) Selective 𝑩 and 𝑪. Increasing the SSM state dimension 𝑁 , which can be viewed as an expansion factor on the dimension of the recurrent state, can significantly improve performance for a negligible cost in parameters/FLOPs, but only when 𝑩 and 𝑪 are also selective. Size of Δ projection fixed to 64.

No Free Lunch: Continuous-Discrete Spectrum. Structured SSMs were originally defined as discretizations of continuous systems (1), and have had a strong inductive bias toward continuous-time data modalities such as perceptual signals (e.g. audio, video). As discussed in Sections 3.1 and 3.5, the selection mechanism overcomes their weaknesses on discrete modalities such as text and DNA; but this conversely can impede their performance on data that LTI SSMs excel on. Our ablations on audio waveforms examine this tradeoff in more detail.

Downstream Affordances. Transformer-based foundation models (particularly LLMs) have a rich ecosystem of properties and modes of interaction with pretrained models, such as fine-tuning, adaptation, prompting, in-context learning, instruction tuning, RLHF, quantization, and so on. We are particularly interested in whether Transformer alternatives such as SSMs have similar properties and affordances.

Scaling. Our empirical evaluation is limited to small model sizes, below the threshold of most strong open source LLMs (e.g. Llama (Touvron et al. 2023)) as well as other recurrent models such as RWKV (B. Peng et al. 2023) and RetNet (Y. Sun et al. 2023), which have been evaluated at the 7B parameter scale and beyond. It remains to assess whether Mamba still compares favorably at these larger sizes. We also note that scaling SSMs may involve further engineering challenges and adjustments to the model that are not discussed in this paper.

6 Conclusion

We introduce a selection mechanism to structured state space models, allowing them to perform context-dependent reasoning while scaling linearly in sequence length. When incorporated into a simple attention-free architecture, Mamba achieves state-of-the-art results on a diverse set of domains, where it matches or exceeds the performance of strong Transformer models. We are excited about the broad applications of selective state space models to build foundation models for different domains, especially in emerging modalities requiring long context such as genomics, audio, and video. Our results suggest that Mamba is a strong candidate to be a general sequence model backbone.
