Mamba Performance Benchmarks

Inference Speed Comparison

Throughput (tokens/sec)

Mamba-1.4B vs Transformer-1.3B on single A100 80GB:

| Sequence Length | Mamba-1.4B | Transformer-1.3B | Speedup |
|---|---|---|---|
| 512 | 8,300 | 6,200 | 1.3× |
| 1024 | 7,800 | 4,100 | 1.9× |
| 2048 | 7,200 | 2,300 | 3.1× |
| 4096 | 6,800 | 1,200 | 5.7× |
| 8192 | 6,400 | 600 | 10.7× |
| 16384 | 6,100 | OOM | n/a |

Key insight: The speedup grows with sequence length because Mamba's cost scales as O(n) in sequence length, while Transformer attention scales as O(n²).
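The Speedup column is the ratio of the two throughput figures. A minimal sketch that recomputes it from the table values; the dictionary layout and the `speedup` helper are illustrative names, not part of any benchmark harness:

```python
# Throughput in tokens/sec, copied from the table above:
# sequence length -> (Mamba-1.4B, Transformer-1.3B)
throughput = {
    512: (8300, 6200),
    1024: (7800, 4100),
    2048: (7200, 2300),
    4096: (6800, 1200),
    8192: (6400, 600),
}


def speedup(mamba_tps: float, transformer_tps: float) -> float:
    """Ratio of Mamba throughput to Transformer throughput."""
    return mamba_tps / transformer_tps


# Rounded to one decimal place, as in the table.
speedups = {n: round(speedup(m, t), 1) for n, (m, t) in throughput.items()}

for seq_len, ratio in speedups.items():
    print(f"{seq_len:>5}: {ratio}x")
```

The 16384 row is left out because the Transformer run is OOM, so no ratio is defined.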

Latency (ms per token)

Generation latency (batch size 1, autoregressive):

| Model | First Token | Per Token | 100 Tokens Total |
|---|---|---|---|
| Mamba-130M | 3 ms | 0.8 ms | 83 ms |
| Transformer-130M | 5 ms | 1.2 ms | 125 ms |
| Mamba-1.4B | 12 ms | 3.2 ms | 332 ms |
| Transformer-1.3B | 18 ms | 8.5 ms | 868 ms |
| Mamba-2.8B | 20 ms | 6.1 ms | 630 ms |
| Transformer-2.7B | 35 ms | 18.2 ms | 1855 ms |

Mamba advantage: Constant per-token latency regardless of context length
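The 100-token totals are consistent with a simple model: time to first token plus 100 decode steps at the per-token latency. A small sketch under that assumption (the function name is illustrative):

```python
def total_latency_ms(first_token_ms: float, per_token_ms: float, n_tokens: int = 100) -> float:
    """First-token latency plus n_tokens decode steps at a constant per-token cost."""
    return first_token_ms + per_token_ms * n_tokens


# Inputs copied from the table above.
mamba_1_4b = total_latency_ms(12, 3.2)        # table lists 332 ms
transformer_1_3b = total_latency_ms(18, 8.5)  # table lists 868 ms

print(f"Mamba-1.4B: {mamba_1_4b:.0f} ms, Transformer-1.3B: {transformer_1_3b:.0f} ms")
```

The constant per-token term is what the Mamba rows rely on; for a Transformer the per-token cost would normally rise with context, so a single per-token figure is a simplification.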

Memory Usage

Training Memory (BF16, per GPU)

Mamba-1.4B training memory breakdown:

| Sequence Length | Activations | Gradients | Optimizer | Total | vs Transformer |
|---|---|---|---|---|---|
| 512 | 2.1 GB | 3.2 GB | 11.2 GB | 16.5 GB | 0.9× |
| 1024 | 3.8 GB | 3.2 GB | 11.2 GB | 18.2 GB | 0.6× |
| 2048 | 7.2 GB | 3.2 GB | 11.2 GB | 21.6 GB | 0.4× |
| 4096 | 14.1 GB | 3.2 GB | 11.2 GB | 28.5 GB | 0.25× |
| 8192 | 28.0 GB | 3.2 GB | 11.2 GB | 42.4 GB | 0.15× |

Note: Transformer OOMs at 8K sequence length on 40GB A100
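In the breakdown above, the gradient and optimizer columns do not change with sequence length, so only activations drive the growth in total memory. A sketch that recomputes the totals from the table's components (variable names are illustrative):

```python
# Fixed costs in GB, copied from the table above.
GRADIENTS_GB = 3.2
OPTIMIZER_GB = 11.2

# Activation memory in GB per sequence length, copied from the table above.
activations_gb = {512: 2.1, 1024: 3.8, 2048: 7.2, 4096: 14.1, 8192: 28.0}

# Total = activations + gradients + optimizer state.
totals_gb = {
    seq_len: round(act + GRADIENTS_GB + OPTIMIZER_GB, 1)
    for seq_len, act in activations_gb.items()
}

for seq_len, total in totals_gb.items():
    print(f"{seq_len:>5}: {total} GB")
```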

Inference Memory (FP16, batch size 1)

| Model | KV Cache (8K ctx) | State (Mamba) | Ratio |
|---|---|---|---|
| 130M | 2.1 GB | 0 MB | |
| 370M | 5.2 GB | 0 MB | |
| 1.4B | 19.7 GB | 0 MB | |
| 2.8B | 38.4 GB | 0 MB | |

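Mamba stores no KV cache, so inference memory stays constant as the context grows.

It does keep a small fixed-size recurrent state, which the table above lists as 0 MB. Approximate state sizes: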

  • 130M: ~3 MB (d_model × d_state × n_layers = 768 × 16 × 24)
  • 2.8B: ~13 MB (2560 × 16 × 64)
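The products in parentheses count state elements, not bytes. A minimal sketch that evaluates them and converts to mebibytes, assuming 2 bytes per element (FP16, matching this section's heading); the helper names are illustrative. Under that assumption the results come out below the approximate sizes quoted above, which suggests the quoted figures include terms the simplified formula leaves out. That reading is an assumption, not something stated here.

```python
def state_elements(d_model: int, d_state: int, n_layers: int) -> int:
    """Element count from the simplified formula d_model * d_state * n_layers."""
    return d_model * d_state * n_layers


def to_mib(n_elements: int, bytes_per_element: int = 2) -> float:
    """Convert an element count to MiB; 2 bytes per element assumes FP16."""
    return n_elements * bytes_per_element / 2**20


elems_130m = state_elements(768, 16, 24)   # 294,912 elements
elems_2_8b = state_elements(2560, 16, 64)  # 2,621,440 elements

print(f"130M: {elems_130m:,} elements, {to_mib(elems_130m):.2f} MiB")
print(f"2.8B: {elems_2_8b:,} elements, {to_mib(elems_2_8b):.2f} MiB")
```

Either way, the state is a few megabytes at most and independent of context length, against gigabytes of KV cache at 8K context.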

Language Modeling Benchmarks

Perplexity on Common Datasets

Models trained on The Pile (300B tokens):

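| Model | Params | Pile (val) | WikiText-103 | C4 | Lambada |
|---|---|---|---|---|---|
| Pythia | 160M | 29.6 | 28.4 | 23.1 | 51.2 |
| Mamba | 130M | 28.1 | 26.7 | 21.8 | 48.3 |
| Pythia | 410M | 18.3 | 17.6 | 16.2 | 32.1 |
| Mamba | 370M | 16.7 | 16.2 | 15.1 | 28.4 |
| Pythia | 1.4B | 10.8 | 10.2 | 11.3 | 15.2 |
| Mamba | 1.4B | 9.1 | 9.6 | 10.1 | 12.8 |
| Pythia | 2.8B | 8.3 | 7.9 | 9.2 | 10.6 |
| Mamba | 2.8B | 7.4 | 7.2 | 8.3 | 9.1 |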

Mamba has lower perplexity than the similarly sized Pythia (Transformer) model in every row, by roughly 5-16% across these datasets.
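The relative gap can be checked directly from the table. A short sketch for the Pile (val) column, where the pairing of each Mamba model with the nearest-size Pythia model follows the row order above and the helper name is illustrative:

```python
# (Mamba perplexity, Pythia perplexity) on Pile (val), copied from the table above.
pile_val = {
    "Mamba-130M vs Pythia-160M": (28.1, 29.6),
    "Mamba-370M vs Pythia-410M": (16.7, 18.3),
    "Mamba-1.4B vs Pythia-1.4B": (9.1, 10.8),
    "Mamba-2.8B vs Pythia-2.8B": (7.4, 8.3),
}


def relative_reduction_pct(mamba_ppl: float, baseline_ppl: float) -> float:
    """Percentage by which Mamba's perplexity is below the baseline's."""
    return (baseline_ppl - mamba_ppl) / baseline_ppl * 100


reductions = {
    pair: round(relative_reduction_pct(m, b), 1) for pair, (m, b) in pile_val.items()
}

for pair, pct in reductions.items():
    print(f"{pair}: {pct}% lower perplexity")
```

On this column the reductions range from about 5% at the smallest size to about 16% at 1.4B.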

Zero-Shot Task Performance

Mamba-2.8B vs Transformer-2.7B on common benchmarks:

| Task | Mamba-2.8B | Transformer-2.7B | Delta |
|---|---|---|---|
| HellaSwag | 61.3 | 58.7 | +2.6 |
| PIQA | 78.1 | 76.4 | +1.7 |
| ARC-Easy | 68.2 | 65.9 | +2.3 |
| ARC-Challenge | 42.7 | 40.1 | +2.6 |
| WinoGrande | 64.8 | 62.3 | +2.5 |
| OpenBookQA | 43.2 | 41.8 | +1.4 |
| BoolQ | 71.4 | 68.2 | +3.2 |
| MMLU (5-shot) | 35.2 | 33.8 | +1.4 |

Average improvement: +2.2 points across benchmarks

Audio Modeling Benchmarks

SC09 (Speech Commands)

Task: Audio classification (10 classes)

| Model | Params | Accuracy | Inference (ms) |
|---|---|---|---|
| Transformer | 8.2M | 96.2% | 18 ms |
| S4 | 6.1M | 97.1% | 8 ms |
| Mamba | 6.3M | 98.4% | 6 ms |

LJSpeech (Speech Generation)

Task: Text-to-speech quality (MOS score)

| Model | Params | MOS ↑ | RTF ↓ |
|---|---|---|---|
| Transformer | 12M | 3.82 | 0.45 |
| Conformer | 11M | 3.91 | 0.38 |
| Mamba | 10M | 4.03 | 0.21 |

RTF (Real-Time Factor): lower is better. An RTF of 0.21 means synthesis runs roughly 5× faster than real time (1 / 0.21 ≈ 4.8).
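Real-time factor is processing time divided by audio duration, so its reciprocal gives the speed relative to real time. A minimal sketch using the RTF values from the table (the helper name is illustrative):

```python
def speed_vs_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


# RTF values copied from the table above.
rtf_by_model = {"Transformer": 0.45, "Conformer": 0.38, "Mamba": 0.21}

speeds = {name: round(speed_vs_real_time(rtf), 2) for name, rtf in rtf_by_model.items()}

for name, speed in speeds.items():
    print(f"{name}: {speed}x real time")
```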

Genomics Benchmarks

Human Reference Genome (HG38)

Task: Next nucleotide prediction

| Model | Context Length | Perplexity | Throughput |
|---|---|---|---|
| Transformer | 1024 | 3.21 | 1,200 bp/s |
| Hyena | 32768 | 2.87 | 8,500 bp/s |
| Mamba | 1M | 2.14 | 45,000 bp/s |

Mamba handles million-length sequences efficiently

Scaling Laws

Compute-Optimal Training

FLOPs vs perplexity (The Pile validation):

| Model Size | Training FLOPs | Mamba Perplexity | Transformer Perplexity |
|---|---|---|---|
| 130M | 6e19 | 28.1 | 29.6 |
| 370M | 3e20 | 16.7 | 18.3 |
| 790M | 8e20 | 12.3 | 13.9 |
| 1.4B | 2e21 | 9.1 | 10.8 |
| 2.8B | 6e21 | 7.4 | 8.3 |

Scaling coefficient: Mamba reaches the same perplexity as a Transformer with about 0.8× the compute

Parameter Efficiency

Perplexity 10.0 target on The Pile:

| Model Type | Parameters Needed | Memory (inference) |
|---|---|---|
| Transformer | 1.6B | 3.2 GB |
| Mamba | 1.1B | 2.2 GB |

Mamba needs ~30% fewer parameters for same performance
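The memory column is consistent with weights stored at 2 bytes per parameter (FP16), and the saving follows from the parameter counts. A short sketch under that assumption, counting weights only and ignoring activations and state; the helper name is illustrative:

```python
def fp16_weight_memory_gb(n_params: float) -> float:
    """Weight memory in GB, assuming 2 bytes per parameter (FP16)."""
    return n_params * 2 / 1e9


# Parameter counts copied from the table above.
transformer_params = 1.6e9
mamba_params = 1.1e9

transformer_gb = fp16_weight_memory_gb(transformer_params)  # table lists 3.2 GB
mamba_gb = fp16_weight_memory_gb(mamba_params)              # table lists 2.2 GB

# Fraction of parameters saved relative to the Transformer.
param_saving_pct = (1 - mamba_params / transformer_params) * 100

print(f"Transformer: {transformer_gb:.1f} GB, Mamba: {mamba_gb:.1f} GB")
print(f"Parameter saving: {param_saving_pct:.2f}%")
```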

Long-Range Arena (LRA)

Task: Long-context understanding benchmarks

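| Task | Length | Transformer | S4 | Mamba |
|---|---|---|---|---|
| ListOps | 2K | 36.4% | 59.6% | 61.2% |
| Text | 4K | 64.3% | 86.8% | 88.1% |
| Retrieval | 4K | 57.5% | 90.9% | 92.3% |
| Image | 1K | 42.4% | 88.7% | 89.4% |
| PathFinder | 1K | 71.4% | 86.1% | 87.8% |
| Path-X | 16K | OOM | 88.3% | 91.2% |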

Average: Mamba 85.0%, S4 83.4%, Transformer 54.4% (the Transformer average excludes Path-X, where it runs out of memory)
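The averages can be reproduced from the table if the Transformer's OOM entry on Path-X is skipped rather than counted as zero. A sketch under that assumption, with `None` standing in for the OOM result (names are illustrative):

```python
# Accuracy (%) per task, in table order:
# ListOps, Text, Retrieval, Image, PathFinder, Path-X
lra_scores = {
    "Transformer": [36.4, 64.3, 57.5, 42.4, 71.4, None],  # None marks the Path-X OOM
    "S4": [59.6, 86.8, 90.9, 88.7, 86.1, 88.3],
    "Mamba": [61.2, 88.1, 92.3, 89.4, 87.8, 91.2],
}


def mean_completed(scores: list) -> float:
    """Average over the tasks that produced a score, skipping OOM entries."""
    completed = [s for s in scores if s is not None]
    return sum(completed) / len(completed)


averages = {model: round(mean_completed(s), 1) for model, s in lra_scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}%")
```

Because the Transformer average covers five tasks and the others cover six, the three figures are not strictly like-for-like.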

Training Throughput

Tokens/sec During Training

8× A100 80GB cluster, BF16, different sequence lengths:

| Model | Seq Len 512 | Seq Len 2K | Seq Len 8K | Seq Len 32K |
|---|---|---|---|---|
| Transformer-1.3B | 180K | 52K | OOM | OOM |
| Mamba-1.4B | 195K | 158K | 121K | 89K |
| Transformer-2.7B | 92K | 26K | OOM | OOM |
| Mamba-2.8B | 98K | 81K | 62K | 45K |

Mamba scales to longer sequences without OOM

Hardware Utilization

GPU Memory Bandwidth

Mamba-1.4B inference on different GPUs:

| GPU | Memory BW | Tokens/sec | Efficiency |
|---|---|---|---|
| A100 80GB | 2.0 TB/s | 6,800 | 85% |
| A100 40GB | 1.6 TB/s | 5,400 | 84% |
| V100 32GB | 900 GB/s | 3,100 | 86% |
| RTX 4090 | 1.0 TB/s | 3,600 | 90% |

High efficiency: Mamba inference is memory-bandwidth bound, so throughput scales roughly in proportion to each GPU's memory bandwidth.

Multi-GPU Scaling

Mamba-2.8B training throughput:

| GPUs | Tokens/sec | Scaling Efficiency |
|---|---|---|
| 1× A100 | 12,300 | 100% |
| 2× A100 | 23,800 | 97% |
| 4× A100 | 46,100 | 94% |
| 8× A100 | 89,400 | 91% |
| 16× A100 | 172,000 | 87% |

Near-linear scaling up to 16 GPUs
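Scaling efficiency here is measured throughput divided by ideal linear throughput (GPU count times single-GPU throughput). A sketch that recomputes it from the table's throughput values; the function name is illustrative, and because the throughput figures are themselves rounded, the results land within about a point of the table's percentages:

```python
# Training throughput in tokens/sec per GPU count, copied from the table above.
throughput = {1: 12300, 2: 23800, 4: 46100, 8: 89400, 16: 172000}


def scaling_efficiency(n_gpus: int, tokens_per_sec: float, single_gpu_tps: float) -> float:
    """Measured throughput as a fraction of perfect linear scaling."""
    return tokens_per_sec / (n_gpus * single_gpu_tps)


efficiency_pct = {
    n: round(100 * scaling_efficiency(n, tps, throughput[1]), 1)
    for n, tps in throughput.items()
}

for n, pct in efficiency_pct.items():
    print(f"{n:>2} GPUs: {pct}%")
```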

Cost Analysis

Training Cost (USD)

Training to The Pile perplexity 10.0 on cloud GPUs:

| Model | Cloud GPUs | Hours | Cost (A100) | Cost (H100) |
|---|---|---|---|---|
| Transformer-1.6B | 8× A100 | 280 | $8,400 | $4,200 |
| Mamba-1.1B | 8× A100 | 180 | $5,400 | $2,700 |

Savings: 36% cost reduction vs Transformer
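The A100 costs in the table follow from GPU-hours times an hourly rate. A sketch that backs out the implied rate and the saving; the rate is derived from the table rather than stated in it, and the helper name is illustrative:

```python
N_GPUS = 8  # both runs use 8x A100 per the table above


def implied_rate_per_gpu_hour(total_cost_usd: float, hours: float, n_gpus: int = N_GPUS) -> float:
    """Hourly price per GPU implied by a total cost and wall-clock hours."""
    return total_cost_usd / (hours * n_gpus)


# Hours and A100 costs copied from the table above.
transformer_rate = implied_rate_per_gpu_hour(8400, 280)
mamba_rate = implied_rate_per_gpu_hour(5400, 180)

# Relative saving of the Mamba run over the Transformer run.
savings_pct = (8400 - 5400) / 8400 * 100

print(f"Implied A100 rate: ${transformer_rate:.2f} and ${mamba_rate:.2f} per GPU-hour")
print(f"Savings: {savings_pct:.1f}%")
```

Both rows imply the same hourly rate, so the saving comes entirely from the shorter training time.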

Inference Cost (USD/million tokens)

API-style inference (batch size 1, 2K context):

| Model | Latency | Cost/M tokens | Quality (perplexity) |
|---|---|---|---|
| Transformer-1.3B | 8.5 ms/tok | $0.42 | 10.8 |
| Mamba-1.4B | 3.2 ms/tok | $0.18 | 9.1 |

Mamba provides about 2.7× lower per-token latency (3.2 ms vs 8.5 ms), 57% lower cost per million tokens, and lower perplexity (9.1 vs 10.8).

Resources