Pascal Wachowski bd571adc99 feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)
Implements TurboQuant (Zandieh et al., ICLR 2026) KV-cache vector
quantization targeting AMD RDNA 4 (gfx1201, RX 9070 XT).

Algorithm: L2-normalize → FWHT(128) → Lloyd-Max scalar quantize → bitpack
Decode: unpack → codebook lookup → inverse FWHT → denormalize
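
The encode/decode pipeline above can be sketched on the CPU as follows. This is a minimal pure-Python illustration, not the HIP kernels from this commit: the function names are hypothetical, and `CODEBOOK` is a placeholder of 8 uniformly spaced levels, not the Lloyd-Max centroids the commit uses. The bitpack step is omitted here. The orthonormal scaling (1/sqrt(n)) is what makes the FWHT its own inverse.

```python
import math

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b  # butterfly
        h *= 2
    scale = 1.0 / math.sqrt(n)  # orthonormal scaling => fwht(fwht(x)) == x
    return [x * scale for x in out]

# ASSUMPTION: placeholder 3-bit codebook (uniform levels sized for coordinates
# with std ~ 1/sqrt(128)); the real Lloyd-Max centroids are not reproduced here.
SIGMA = 1.0 / math.sqrt(128)
CODEBOOK = [(k - 3.5) * 0.6 * SIGMA for k in range(8)]

def encode(x, codebook):
    """L2-normalize -> FWHT -> nearest-centroid index per coordinate."""
    norm = math.sqrt(sum(t * t for t in x))
    unit = [t / norm for t in x] if norm > 0.0 else list(x)
    rotated = fwht(unit)
    idx = [min(range(len(codebook)), key=lambda k: abs(codebook[k] - c))
           for c in rotated]
    return norm, idx

def decode(norm, idx, codebook):
    """Codebook lookup -> inverse FWHT (same transform) -> denormalize."""
    rotated = [codebook[k] for k in idx]
    return [norm * t for t in fwht(rotated)]
```

Calling `decode(*encode(x, CODEBOOK), CODEBOOK)` on a 128-element vector returns an approximation of `x`; the error depends entirely on the codebook, so the placeholder above says nothing about the MSE figures quoted in this commit.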

Two new GGML types:
- GGML_TYPE_TURBO3_0: 3-bit, 3.5 bpw, MSE*d=0.034 (block_size=32, 14 bytes)
- GGML_TYPE_TURBO4_0: 4-bit, 4.5 bpw, MSE*d=0.009 (block_size=32, 18 bytes)
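
The bits-per-weight figures follow directly from the block sizes. The sketch below assumes each 32-element block stores its packed indices plus one 2-byte scale; that split is an assumption that happens to match the 14- and 18-byte sizes listed above, not a layout confirmed by this message. The helper names are hypothetical.

```python
def block_bytes(bits, block_size=32, scale_bytes=2):
    """Bytes per block: packed indices plus an ASSUMED 2-byte per-block scale."""
    return block_size * bits // 8 + scale_bytes

def bits_per_weight(total_bytes, block_size=32):
    """Effective storage cost per element, in bits."""
    return total_bytes * 8 / block_size

def kv_saving_vs_f16(bpw):
    """Fraction of KV-cache memory saved relative to 16-bit storage."""
    return 1.0 - bpw / 16.0

for name, bits in (("turbo3", 3), ("turbo4", 4)):
    size = block_bytes(bits)
    bpw = bits_per_weight(size)
    print(name, size, "bytes/block,", bpw, "bpw,",
          f"{100 * kv_saving_vs_f16(bpw):.0f}% less than f16")
```

Under that assumption the 3-bit type comes to 12 + 2 = 14 bytes (3.5 bpw) and the 4-bit type to 16 + 2 = 18 bytes (4.5 bpw).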

Architecture (pre-dequantize strategy):
- Write path: FWHT-aware set-rows kernels (128 threads, shared-mem FWHT)
- Read path: bulk dequantize turbo→f16 before standard Flash Attention
- Stride scaling preserves ggml_permute dim swaps (critical fix)

Performance (Qwen3-14B Q4_K_M, RX 9070 XT, 16 GB VRAM):
  f16/f16:        1865 pp512,  54 tg128 (baseline)
  q8_0/q8_0:      1694 pp512,  52 tg128
  turbo4/turbo4:  1813 pp512,  49 tg128 (-3% pp, -9% tg, 72% less KV VRAM)
  turbo3/turbo3:  1983 pp512,  49 tg128 (+6% pp, -9% tg, 78% less KV VRAM)

Usage: llama-cli -fa 1 --cache-type-k turbo4 --cache-type-v turbo4

Includes 7 CPU reference tests validating FWHT self-inverse, MSE against
paper values, bitpack determinism, and dequantize sanity.
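
A bitpack determinism check of the kind described above might look like the sketch below. The bit order (a little-endian bit stream, element 0 in the lowest bits) and the function names are assumptions for illustration; the actual block layout in this commit may differ.

```python
def pack(indices, bits):
    """Pack small integers into bytes as one little-endian bit stream (ASSUMED layout)."""
    acc = 0
    for pos, value in enumerate(indices):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"index {value} does not fit in {bits} bits")
        acc |= value << (pos * bits)
    nbytes = (len(indices) * bits + 7) // 8
    return acc.to_bytes(nbytes, "little")

def unpack(data, bits, count):
    """Inverse of pack(): recover `count` indices of `bits` bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (pos * bits)) & mask for pos in range(count)]

indices = [i % 8 for i in range(32)]      # one 32-element block of 3-bit indices
packed = pack(indices, 3)
assert len(packed) == 12                  # 32 * 3 bits = 96 bits = 12 bytes
assert pack(indices, 3) == packed         # same input, same bytes
assert unpack(packed, 3, 32) == indices   # lossless round trip
```

Packing 32 4-bit indices the same way yields 16 bytes.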

Requires head_dim=128 (covers most current models including Llama, Qwen,
Mistral, Gemma). Guard added to KV cache init with clear error message.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:48:00 +02:00