docs: add TurboQuant benchmark results and documentation

Performance (Qwen3-14B Q4_K_M, RX 9070 XT 16GB):
- turbo4: 1812 pp512 / 49 tg128 / PPL +0.010 vs f16 / 72% KV savings
- turbo3: 1836 pp512 / 50 tg128 / PPL +0.051 vs f16 / 78% KV savings
- All context lengths up to 40K work without OOM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:32:53 +02:00
parent bd571adc99
commit 04b2771048
1 changed files with 106 additions and 0 deletions
@@ -0,0 +1,106 @@
 # TurboQuant KV Cache Quantization (Experimental)
 TurboQuant compresses the KV cache to 3-4 bits per dimension (3.5-4.5 bpw once the
 per-block fp16 norm is counted) using a Walsh-Hadamard Transform followed by
 quantization against a precomputed Lloyd-Max codebook.
 Based on: Zandieh, Daliri, Hadian, Mirrokni — "TurboQuant: Online Vector
 Quantization with Near-optimal Distortion Rate", ICLR 2026.
 [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
 ## Usage
 ```bash
 # TurboQuant 4-bit (4.5 bpw) — recommended default
 llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4
 # TurboQuant 3-bit (3.5 bpw) — maximum compression
 llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3
 # Mixed: turbo K with f16 V (or vice versa)
 llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo4 --cache-type-v f16
 ```
 Flash attention (`-fa 1`) is required.
 ## Performance
 All benchmarks: Qwen3-14B Q4_K_M, AMD RX 9070 XT (16 GB VRAM, gfx1201), ROCm 6.1, NixOS.
 ### Throughput (tokens/second)
 | KV Type | pp512 | pp2K | pp8K | pp16K | pp32K | pp40K | tg128 | bpw |
 |---------|-------|------|------|-------|-------|-------|-------|-----|
 | f16/f16 | 1894 | 1847 | 1345 | 943 | 563 | 471 | 53.4 | 16.0 |
 | q8_0/q8_0 | 1694 | 1841 | 1340 | — | — | — | 52.1 | 8.5 |
 | **turbo4/turbo4** | **1812** | **1816** | **1321** | **926** | **550** | **460** | **49.3** | **4.5** |
 | **turbo3/turbo3** | **1836** | **1816** | **1319** | **924** | **548** | — | **49.6** | **3.5** |
 ### Perplexity (lower is better)
 Measured on a ~960 KB corpus of C++ source code with a context length of 2048 tokens.
 | KV Type | PPL | Delta vs f16 |
 |---------|-----|-------------|
 | f16 | 1.7031 | — |
 | q8_0 | 1.7044 | +0.001 |
 | turbo4 | 1.7134 | +0.010 |
 | turbo3 | 1.7544 | +0.051 |
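
To put the absolute deltas in context, the short Python sketch below recomputes them as relative increases over the f16 baseline, using only the PPL values from the table. The `perplexity` helper shows the usual definition (the exponential of the mean negative log-likelihood per token). It is included for illustration and is not taken from the benchmark tooling; all function names here are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# PPL values copied from the table above.
PPL = {"f16": 1.7031, "q8_0": 1.7044, "turbo4": 1.7134, "turbo3": 1.7544}

def relative_increase_pct(kv_type, baseline="f16"):
    """Percentage increase in PPL over the baseline KV type."""
    return 100.0 * (PPL[kv_type] / PPL[baseline] - 1.0)

for name in ("q8_0", "turbo4", "turbo3"):
    delta = PPL[name] - PPL["f16"]
    print(f"{name}: +{delta:.3f} PPL ({relative_increase_pct(name):.2f}%)")
```

On these numbers turbo4 sits about 0.6% above f16 and turbo3 about 3.0%.
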
 ### KV Cache Memory
 | KV Type | Bytes per element | Savings vs f16 |
 |---------|------------------|----------------|
 | f16 | 2.000 | — |
 | q8_0 | 1.0625 (34/32) | 47% |
 | turbo4 | 0.5625 (18/32) | 72% |
 | turbo3 | 0.4375 (14/32) | 78% |
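
The figures in this table follow directly from the block layout (block bytes divided by 32 elements). The sketch below reproduces that arithmetic and then estimates total KV cache size for a model shape. The shape (40 layers, 8 KV heads, head_dim 128, 40960 tokens) is a hypothetical example chosen for illustration, not a measurement from this document, and the function names are made up for the sketch.

```python
BLOCK_SIZE = 32  # elements per quantized block, as in the GGML Types table

def bytes_per_element(block_bytes, block_size=BLOCK_SIZE):
    """Storage cost of one KV element, including the per-block norm."""
    return block_bytes / block_size

def savings_vs_f16(bpe):
    """Fraction of memory saved relative to 2 bytes per element."""
    return 1.0 - bpe / 2.0

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bpe):
    """K and V each hold n_layers * n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bpe

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=40960)

# f16 is expressed as 64 bytes per 32 elements so all types share one code path.
for name, block_bytes in (("f16", 64), ("turbo4", 18), ("turbo3", 14)):
    bpe = bytes_per_element(block_bytes)
    mib = kv_cache_bytes(bpe=bpe, **shape) / 2**20
    print(f"{name}: {bpe} B/elem, {savings_vs_f16(bpe):.0%} saved, {mib:.0f} MiB")
```

For this assumed shape the cache shrinks from 6400 MiB at f16 to 1800 MiB with turbo4 and 1400 MiB with turbo3.
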
 ## How it works
 ```
 Encode: x → ||x|| → x/||x|| → FWHT(x/||x||) → nearest_centroid() → pack_bits → (norm_fp16, packed_indices)
 Decode: unpack → centroid_lookup → inverse_FWHT → × norm → x̃
 ```
 The Fast Walsh-Hadamard Transform (FWHT) is an orthogonal rotation that tends to spread
 a vector's energy across all coordinates, so each rotated coordinate follows roughly the
 same distribution and a single Lloyd-Max codebook can serve all of them. The codebook
 centroids are precomputed for the Beta((d-1)/2, (d-1)/2) distribution that arises after
 FWHT of unit vectors in d=128 dimensions.
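
The encode and decode pipelines above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the code in this commit: the codebook is a uniform placeholder grid over ±3σ rather than the precomputed Lloyd-Max centroids, bit packing is omitted, and the names `fwht`, `placeholder_codebook`, `encode`, and `decode` are hypothetical.

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    a = list(v)
    n = len(a)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in a]

D = 128
SIGMA = 1.0 / math.sqrt(D)  # RMS size of one coordinate of a unit vector

def placeholder_codebook(bits):
    """Uniform grid over +/-3 sigma. NOT the Lloyd-Max codebook from the source."""
    levels = 1 << bits
    step = 6.0 / levels
    return [(-3.0 + step * (k + 0.5)) * SIGMA for k in range(levels)]

def encode(x, codebook):
    """Return (norm, indices): normalize, rotate, pick the nearest centroid."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        return 0.0, [0] * len(x)
    rotated = fwht([t / norm for t in x])
    indices = [min(range(len(codebook)), key=lambda k: abs(codebook[k] - r))
               for r in rotated]
    return norm, indices

def decode(norm, indices, codebook):
    """Look up centroids, apply the inverse FWHT, and rescale by the norm."""
    return [norm * t for t in fwht([codebook[k] for k in indices])]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(D)]
cb = placeholder_codebook(4)
norm, idx = encode(x, cb)
x_hat = decode(norm, idx, cb)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat))) / norm
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because the orthonormal FWHT is its own inverse, `decode` reuses the same routine as `encode`. A one-hot unit vector shows the spreading effect clearly: `fwht([1.0] + [0.0] * 127)` yields 128 entries that all have magnitude 1/√128.
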
 The current implementation uses a pre-dequantize strategy: turbo KV data is bulk-converted
 to f16 before standard Flash Attention runs. In the throughput table above this costs
 roughly 3-4% on pp512 and 7-8% on tg128 relative to f16, while avoiding the complexity
 of a fused FA kernel.
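
The shape of the pre-dequantize strategy can be illustrated with a toy example: expand the whole compressed cache to floats first, then hand the result to an attention routine that knows nothing about the compressed format. Everything here is a hypothetical stand-in. The `(scale, ints)` block format and the `dequantize`, `attention`, and `attend_predequantized` functions are invented for the sketch and do not reflect the actual kernels.

```python
import math

def dequantize(block):
    """Stand-in dequantizer for a hypothetical (scale, [int, ...]) block."""
    scale, q = block
    return [scale * v for v in q]

def attention(q, keys, values):
    """Plain softmax attention for one query over float keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += (w / total) * vi
    return out

def attend_predequantized(q, k_blocks, v_blocks):
    # Step 1: bulk-convert the whole compressed cache to floats.
    keys = [dequantize(b) for b in k_blocks]
    values = [dequantize(b) for b in v_blocks]
    # Step 2: run the unmodified attention routine on the float copies.
    return attention(q, keys, values)

k_blocks = [(0.5, [2, 0]), (0.5, [2, 0])]
v_blocks = [(1.0, [1, 0]), (1.0, [3, 2])]
print(attend_predequantized([1.0, 0.0], k_blocks, v_blocks))
```

The trade-off is visible in step 1: the float copies are as large as an uncompressed cache for the duration of the call, which is the cost a fused kernel would avoid by dequantizing block by block inside the attention loop.
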
 ## Requirements
 - Flash Attention enabled (`-fa 1`)
 - head_dim = 128 (common in Llama, Qwen, and Mistral models; some models, including several Gemma variants, use other head sizes and are not covered)
 - AMD ROCm (gfx1201 tested) or CPU
 ## Limitations
 - head_dim must be exactly 128 (the codebook is precomputed for d=128, and the FWHT needs a power-of-two length)
 - Mixed turbo/quantized combinations (e.g. turbo4/q8_0) are not supported — use turbo/turbo or turbo/f16
 - No CUDA support yet (HIP/ROCm only for GPU path)
 - Pre-dequantize strategy means full KV cache is converted to f16 per FA call (a fused kernel would eliminate this)
 ## GGML Types
 | Type | GGML ID | Block Size | Block Bytes | bpw |
 |------|---------|-----------|-------------|-----|
 | GGML_TYPE_TURBO3_0 | 41 | 32 | 14 (2 norm + 12 packed) | 3.5 |
 | GGML_TYPE_TURBO4_0 | 42 | 32 | 18 (2 norm + 16 packed) | 4.5 |
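
The block sizes in this table can be checked with a small packing sketch: 32 indices of 3 or 4 bits each, plus a 2-byte fp16 norm. The byte order and bit order used below (norm first, little-endian, indices packed from the least significant bit) are assumptions made for illustration. The actual in-memory layout of these GGML types is not specified in this document, and `pack_block` and `unpack_block` are hypothetical names.

```python
import struct

BLOCK_SIZE = 32

def pack_block(norm, indices, bits):
    """Pack an fp16 norm followed by BLOCK_SIZE indices of `bits` bits each."""
    assert len(indices) == BLOCK_SIZE
    acc = 0
    for i, idx in enumerate(indices):
        assert 0 <= idx < (1 << bits)
        acc |= idx << (i * bits)  # assumed LSB-first packing
    payload = acc.to_bytes(BLOCK_SIZE * bits // 8, "little")
    return struct.pack("<e", norm) + payload  # "<e" = little-endian fp16

def unpack_block(block, bits):
    """Inverse of pack_block: returns (norm, indices)."""
    (norm,) = struct.unpack("<e", block[:2])
    acc = int.from_bytes(block[2:], "little")
    mask = (1 << bits) - 1
    return norm, [(acc >> (i * bits)) & mask for i in range(BLOCK_SIZE)]

idx3 = [i % 8 for i in range(BLOCK_SIZE)]
idx4 = [i % 16 for i in range(BLOCK_SIZE)]
block3 = pack_block(1.5, idx3, bits=3)
block4 = pack_block(1.5, idx4, bits=4)
for name, block in (("turbo3", block3), ("turbo4", block4)):
    print(f"{name}: {len(block)} bytes, {len(block) * 8 / BLOCK_SIZE} bpw")
```

With this layout a 3-bit block occupies 14 bytes (3.5 bpw) and a 4-bit block 18 bytes (4.5 bpw), matching the table.
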
 ## Build
 ```bash
 # AMD ROCm (RX 9070 XT)
 cmake -B build -DGGML_HIP=ON -DGPU_TARGETS="gfx1201" -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j$(nproc)
 # Run tests
 ./build/bin/test-turboquant
 # Benchmark
 ./build/bin/llama-bench -m model.gguf -fa 1 -ngl 99 -ctk turbo4 -ctv turbo4 -p 512 -n 128
 ```