Files

T

Pascal Wachowski 04b2771048 docs: add TurboQuant benchmark results and documentation

Performance (Qwen3-14B Q4_K_M, RX 9070 XT 16GB):
- turbo4: 1812 pp512 / 49 tg128 / PPL +0.010 vs f16 / 72% KV savings
- turbo3: 1836 pp512 / 50 tg128 / PPL +0.051 vs f16 / 78% KV savings
- All context lengths up to 40K work without OOM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-29 20:48:00 +02:00

3.6 KiB

Raw Blame History

TurboQuant KV Cache Quantization (Experimental)

TurboQuant compresses the KV cache to 3-4 bits per dimension using Walsh-Hadamard Transform + optimal vector quantization.

Based on: Zandieh, Daliri, Hadian, Mirrokni — "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", ICLR 2026. arXiv:2504.19874

Usage

# TurboQuant 4-bit (4.5 bpw) — recommended default
llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo4 --cache-type-v turbo4

# TurboQuant 3-bit (3.5 bpw) — maximum compression
llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo3 --cache-type-v turbo3

# Mixed: turbo K with f16 V (or vice versa)
llama-cli -m model.gguf -fa 1 -ngl 99 --cache-type-k turbo4 --cache-type-v f16

Flash attention (-fa 1) is required.

Performance

All benchmarks: Qwen3-14B Q4_K_M, AMD RX 9070 XT (16 GB VRAM, gfx1201), ROCm 6.1, NixOS.

Throughput (tokens/second)

KV Type	pp512	pp2K	pp8K	pp16K	pp32K	pp40K	tg128	bpw
f16/f16	1894	1847	1345	943	563	471	53.4	16.0
q8_0/q8_0	1694	1841	1340	—	—	—	52.1	8.0
turbo4/turbo4	1812	1816	1321	926	550	460	49.3	4.5
turbo3/turbo3	1836	1816	1319	924	548	—	49.6	3.5

Perplexity (lower is better)

Measured on ~960KB C++ source code corpus, context=2048.

KV Type	PPL	Delta vs f16
f16	1.7031	—
q8_0	1.7044	+0.001
turbo4	1.7134	+0.010
turbo3	1.7544	+0.051

KV Cache Memory

KV Type	Bytes per element	Savings vs f16
f16	2.000	—
q8_0	1.000	50%
turbo4	0.5625 (18/32)	72%
turbo3	0.4375 (14/32)	78%

How it works

Encode: x → ||x|| → x/||x|| → FWHT(x/||x||) → nearest_centroid() → pack_bits → (norm_fp16, packed_indices)
Decode: unpack → centroid_lookup → inverse_FWHT → × norm → x̃

The Fast Walsh-Hadamard Transform (FWHT) rotates the input vector into a domain where the Lloyd-Max codebook is optimal. The codebook centroids are precomputed for the Beta((d-1)/2, (d-1)/2) distribution that arises after FWHT of unit vectors in d=128 dimensions.

Current implementation uses a pre-dequantize strategy: turbo KV data is bulk-converted to f16 before standard Flash Attention runs. This adds minimal overhead (~3% pp, ~8% tg) while avoiding the complexity of a fused FA kernel.

Requirements

Flash Attention enabled (-fa 1)
head_dim = 128 (covers Llama, Qwen, Mistral, Gemma, and most current models)
AMD ROCm (gfx1201 tested) or CPU

Limitations

head_dim must be exactly 128 (FWHT requirement)
Mixed turbo/quantized combinations (e.g. turbo4/q8_0) are not supported — use turbo/turbo or turbo/f16
No CUDA support yet (HIP/ROCm only for GPU path)
Pre-dequantize strategy means full KV cache is converted to f16 per FA call (a fused kernel would eliminate this)

GGML Types

Type	GGML ID	Block Size	Block Bytes	bpw
GGML_TYPE_TURBO3_0	41	32	14 (2 norm + 12 packed)	3.5
GGML_TYPE_TURBO4_0	42	32	18 (2 norm + 16 packed)	4.5

Build

# AMD ROCm (RX 9070 XT)
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS="gfx1201" -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run tests
./build/bin/test-turboquant

# Benchmark
./build/bin/llama-bench -m model.gguf -fa 1 -ngl 99 -ctk turbo4 -ctv turbo4 -p 512 -n 128

3.6 KiB Raw Blame History Unescape Escape