Performance (Qwen3-14B Q4_K_M, RX 9070 XT 16GB): - turbo4: 1812 pp512 / 49 tg128 / PPL +0.010 vs f16 / 72% KV savings - turbo3: 1836 pp512 / 50 tg128 / PPL +0.051 vs f16 / 78% KV savings - All context lengths up to 40K work without OOM Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3.6 KiB
TurboQuant KV Cache Quantization (Experimental)
TurboQuant compresses the KV cache to 3-4 bits per dimension using Walsh-Hadamard Transform + optimal vector quantization.
Based on: Zandieh, Daliri, Hadian, Mirrokni — "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", ICLR 2026. arXiv:2504.19874
Usage
# TurboQuant 4-bit (4.5 bpw) — recommended default
llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4
# TurboQuant 3-bit (3.5 bpw) — maximum compression
llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3
# Mixed: turbo K with f16 V (or vice versa)
llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo4 --cache-type-v f16
Flash attention (-fa 1) is required.
Performance
All benchmarks: Qwen3-14B Q4_K_M, AMD RX 9070 XT (16 GB VRAM, gfx1201), ROCm 6.1, NixOS.
Throughput (tokens/second)
| KV Type | pp512 | pp2K | pp8K | pp16K | pp32K | pp40K | tg128 | bpw |
|---|---|---|---|---|---|---|---|---|
| f16/f16 | 1894 | 1847 | 1345 | 943 | 563 | 471 | 53.4 | 16.0 |
| q8_0/q8_0 | 1694 | 1841 | 1340 | — | — | — | 52.1 | 8.0 |
| turbo4/turbo4 | 1812 | 1816 | 1321 | 926 | 550 | 460 | 49.3 | 4.5 |
| turbo3/turbo3 | 1836 | 1816 | 1319 | 924 | 548 | — | 49.6 | 3.5 |
Perplexity (lower is better)
Measured on ~960KB C++ source code corpus, context=2048.
| KV Type | PPL | Delta vs f16 |
|---|---|---|
| f16 | 1.7031 | — |
| q8_0 | 1.7044 | +0.001 |
| turbo4 | 1.7134 | +0.010 |
| turbo3 | 1.7544 | +0.051 |
KV Cache Memory
| KV Type | Bytes per element | Savings vs f16 |
|---|---|---|
| f16 | 2.000 | — |
| q8_0 | 1.000 | 50% |
| turbo4 | 0.5625 (18/32) | 72% |
| turbo3 | 0.4375 (14/32) | 78% |
How it works
Encode: x → ||x|| → x/||x|| → FWHT(x/||x||) → nearest_centroid() → pack_bits → (norm_fp16, packed_indices)
Decode: unpack → centroid_lookup → inverse_FWHT → × norm → x̃
The Fast Walsh-Hadamard Transform (FWHT) rotates the input vector into a domain where the Lloyd-Max codebook is optimal. The codebook centroids are precomputed for the Beta((d-1)/2, (d-1)/2) distribution that arises after FWHT of unit vectors in d=128 dimensions.
Current implementation uses a pre-dequantize strategy: turbo KV data is bulk-converted to f16 before standard Flash Attention runs. This adds minimal overhead (~3% pp, ~8% tg) while avoiding the complexity of a fused FA kernel.
Requirements
- Flash Attention enabled (
-fa 1) - head_dim = 128 (covers Llama, Qwen, Mistral, Gemma, and most current models)
- AMD ROCm (gfx1201 tested) or CPU
Limitations
- head_dim must be exactly 128 (FWHT requirement)
- Mixed turbo/quantized combinations (e.g. turbo4/q8_0) are not supported — use turbo/turbo or turbo/f16
- No CUDA support yet (HIP/ROCm only for GPU path)
- Pre-dequantize strategy means full KV cache is converted to f16 per FA call (a fused kernel would eliminate this)
GGML Types
| Type | GGML ID | Block Size | Block Bytes | bpw |
|---|---|---|---|---|
| GGML_TYPE_TURBO3_0 | 41 | 32 | 14 (2 norm + 12 packed) | 3.5 |
| GGML_TYPE_TURBO4_0 | 42 | 32 | 18 (2 norm + 16 packed) | 4.5 |
Build
# AMD ROCm (RX 9070 XT)
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS="gfx1201" -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Run tests
./build/bin/test-turboquant
# Benchmark
./build/bin/llama-bench -m model.gguf -fa 1 -ngl 99 -ctk turbo4 -ctv turbo4 -p 512 -n 128