Files
Pascal Wachowski 04b2771048 docs: add TurboQuant benchmark results and documentation
Performance (Qwen3-14B Q4_K_M, RX 9070 XT 16GB):
- turbo4: 1812 pp512 / 49 tg128 / PPL +0.010 vs f16 / 72% KV savings
- turbo3: 1836 pp512 / 50 tg128 / PPL +0.051 vs f16 / 78% KV savings
- All context lengths up to 40K work without OOM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:48:00 +02:00

3.6 KiB
Raw Permalink Blame History

TurboQuant KV Cache Quantization (Experimental)

TurboQuant compresses the KV cache to 3-4 bits per dimension using Walsh-Hadamard Transform + optimal vector quantization.

Based on: Zandieh, Daliri, Hadian, Mirrokni — "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", ICLR 2026. arXiv:2504.19874

Usage

# TurboQuant 4-bit (4.5 bpw) — recommended default
llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4

# TurboQuant 3-bit (3.5 bpw) — maximum compression
llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3

# Mixed: turbo K with f16 V (or vice versa)
llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo4 --cache-type-v f16

Flash attention (-fa 1) is required.

Performance

All benchmarks: Qwen3-14B Q4_K_M, AMD RX 9070 XT (16 GB VRAM, gfx1201), ROCm 6.1, NixOS.

Throughput (tokens/second)

KV Type pp512 pp2K pp8K pp16K pp32K pp40K tg128 bpw
f16/f16 1894 1847 1345 943 563 471 53.4 16.0
q8_0/q8_0 1694 1841 1340 52.1 8.0
turbo4/turbo4 1812 1816 1321 926 550 460 49.3 4.5
turbo3/turbo3 1836 1816 1319 924 548 49.6 3.5

Perplexity (lower is better)

Measured on ~960KB C++ source code corpus, context=2048.

KV Type PPL Delta vs f16
f16 1.7031
q8_0 1.7044 +0.001
turbo4 1.7134 +0.010
turbo3 1.7544 +0.051

KV Cache Memory

KV Type Bytes per element Savings vs f16
f16 2.000
q8_0 1.000 50%
turbo4 0.5625 (18/32) 72%
turbo3 0.4375 (14/32) 78%

How it works

Encode: x → ||x|| → x/||x|| → FWHT(x/||x||) → nearest_centroid() → pack_bits → (norm_fp16, packed_indices)
Decode: unpack → centroid_lookup → inverse_FWHT → × norm → x̃

The Fast Walsh-Hadamard Transform (FWHT) rotates the input vector into a domain where the Lloyd-Max codebook is optimal. The codebook centroids are precomputed for the Beta((d-1)/2, (d-1)/2) distribution that arises after FWHT of unit vectors in d=128 dimensions.

Current implementation uses a pre-dequantize strategy: turbo KV data is bulk-converted to f16 before standard Flash Attention runs. This adds minimal overhead (~3% pp, ~8% tg) while avoiding the complexity of a fused FA kernel.

Requirements

  • Flash Attention enabled (-fa 1)
  • head_dim = 128 (covers Llama, Qwen, Mistral, Gemma, and most current models)
  • AMD ROCm (gfx1201 tested) or CPU

Limitations

  • head_dim must be exactly 128 (FWHT requirement)
  • Mixed turbo/quantized combinations (e.g. turbo4/q8_0) are not supported — use turbo/turbo or turbo/f16
  • No CUDA support yet (HIP/ROCm only for GPU path)
  • Pre-dequantize strategy means full KV cache is converted to f16 per FA call (a fused kernel would eliminate this)

GGML Types

Type GGML ID Block Size Block Bytes bpw
GGML_TYPE_TURBO3_0 41 32 14 (2 norm + 12 packed) 3.5
GGML_TYPE_TURBO4_0 42 32 18 (2 norm + 16 packed) 4.5

Build

# AMD ROCm (RX 9070 XT)
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS="gfx1201" -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run tests
./build/bin/test-turboquant

# Benchmark
./build/bin/llama-bench -m model.gguf -fa 1 -ngl 99 -ctk turbo4 -ctv turbo4 -p 512 -n 128