llama.cpp + TurboQuant KV Cache
Fork of llama.cpp with TurboQuant KV-cache vector quantization for AMD ROCm.
Compresses the KV cache to 3-4 bits per dimension using a Walsh-Hadamard Transform followed by Lloyd-Max optimal scalar quantization (Zandieh et al., ICLR 2026). In the benchmarks below, this reduces KV-cache VRAM by 72-78% at a throughput cost of under 10%.
Results
All benchmarks: Qwen3-14B Q4_K_M, AMD RX 9070 XT (16 GB VRAM, gfx1201), ROCm 6.1, NixOS.
Throughput
| KV Type (K/V) | pp512 (t/s) | pp8K (t/s) | pp32K (t/s) | tg128 (t/s) | bpw | KV Savings |
|---|---|---|---|---|---|---|
| f16/f16 | 1894 | 1345 | 563 | 53.4 | 16.0 | — |
| q8_0/q8_0 | 1694 | 1340 | — | 52.1 | 8.0 | 50% |
| turbo4/turbo4 | 1812 | 1321 | 550 | 49.3 | 4.5 | 72% |
| turbo3/turbo3 | 1836 | 1319 | 548 | 49.6 | 3.5 | 78% |
Quality (Perplexity)
| KV Type | PPL | Delta vs f16 |
|---|---|---|
| f16 | 1.7031 | — |
| q8_0 | 1.7044 | +0.001 |
| turbo4 | 1.7134 | +0.010 |
| turbo3 | 1.7544 | +0.051 |
Quick Start
# Build (AMD ROCm)
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS="gfx1201" -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Run with TurboQuant KV cache
./build/bin/llama-cli -m model.gguf -fa 1 -ngl 99 \
--cache-type-k turbo4 --cache-type-v turbo4
# Benchmark
./build/bin/llama-bench -m model.gguf -fa 1 -ngl 99 \
-ctk turbo4 -ctv turbo4 -p 512 -n 128
# Run tests
./build/bin/test-turboquant
Flash attention (-fa 1) is required.
Available Types
| Type | Bits | Block | Bytes | Use Case |
|---|---|---|---|---|
| turbo4 | 4-bit | 32 | 18 | Recommended default, near-lossless |
| turbo3 | 3-bit | 32 | 14 | Maximum compression, slightly higher PPL |
Mixed configurations work: --cache-type-k turbo4 --cache-type-v f16 (or vice versa).
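The Block and Bytes columns above are enough to reproduce the bpw and KV Savings figures in the throughput table. The following is a small arithmetic sketch; the function names are illustrative and not part of the fork, and the mixed-configuration estimate assumes K and V hold the same number of values.

```python
def bits_per_value(block_bytes: int, block_size: int) -> float:
    """Average storage cost of one cached value, in bits."""
    return block_bytes * 8 / block_size


def savings_vs_f16(bpv: float) -> float:
    """Fraction of KV-cache memory saved relative to 16-bit storage."""
    return 1.0 - bpv / 16.0


def mixed_savings(bpv_k: float, bpv_v: float) -> float:
    """Savings for a mixed K/V configuration (assumes K and V are the same size)."""
    return savings_vs_f16((bpv_k + bpv_v) / 2)


# (block size, bytes per block) copied from the Available Types table
TYPES = {"turbo4": (32, 18), "turbo3": (32, 14)}

for name, (block, nbytes) in TYPES.items():
    bpv = bits_per_value(nbytes, block)
    print(f"{name}: {bpv:.1f} bits/value, {savings_vs_f16(bpv):.0%} smaller than f16")

# Example of a mixed setup: turbo4 keys with f16 values
turbo4_bpv = bits_per_value(18, 32)
print(f"turbo4/f16: {mixed_savings(turbo4_bpv, 16.0):.0%} smaller than f16/f16")
```

The block sizes are consistent with 32 packed indices plus one 2-byte fp16 norm per block (16 + 2 bytes for turbo4, 12 + 2 bytes for turbo3), although the exact struct layout is not stated here.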
How It Works
Encode: x → L2norm → x/‖x‖ → FWHT(128) → scalar quantize → bitpack → (norm_fp16, indices)
Decode: unpack → codebook lookup → inverse FWHT → ×norm → x̃
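To make the bitpack and unpack steps concrete, here is a minimal sketch that packs 3-bit or 4-bit codebook indices into bytes and recovers them. The bit order (least-significant bits first) and the helper names are assumptions chosen for illustration; the layout actually used by the HIP kernels may differ.

```python
def pack_indices(indices, bits):
    """Pack integers in [0, 2**bits) into bytes, least-significant bits first."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for idx in indices:
        if not 0 <= idx < (1 << bits):
            raise ValueError(f"index {idx} does not fit in {bits} bits")
        acc |= idx << nbits
        nbits += bits
        while nbits >= 8:            # flush every complete byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                        # flush a trailing partial byte
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_indices(data, bits, count):
    """Inverse of pack_indices: recover `count` indices of width `bits`."""
    acc = 0
    nbits = 0
    mask = (1 << bits) - 1
    source = iter(data)
    out = []
    for _ in range(count):
        while nbits < bits:          # refill the accumulator one byte at a time
            acc |= next(source) << nbits
            nbits += 8
        out.append(acc & mask)
        acc >>= bits
        nbits -= bits
    return out


indices = [i % 8 for i in range(32)]       # one block of 32 three-bit indices
packed = pack_indices(indices, bits=3)
print(len(packed), "bytes for 32 three-bit indices")
print(unpack_indices(packed, bits=3, count=32) == indices)
```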
The Fast Walsh-Hadamard Transform rotates each normalized input vector so that its coordinates follow a known distribution, which is the setting in which a fixed Lloyd-Max scalar quantizer is optimal. The codebook centroids are precomputed for the Beta((d-1)/2, (d-1)/2) distribution that the coordinates of unit vectors in d=128 dimensions follow after the FWHT.
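The encode and decode pipelines can be sketched end to end in plain Python. This is an illustrative CPU sketch under stated assumptions, not the fork's kernels: the codebook is fitted empirically with Lloyd iterations on sampled unit-vector coordinates rather than precomputed analytically, the norm is kept as a Python float instead of fp16, bitpacking is omitted, and every function name is hypothetical.

```python
import bisect
import math
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)   # with this scaling the transform is its own inverse
    return [x * scale for x in v]


def sphere_coordinates(dim, n_vectors, rng):
    """Coordinates of random unit vectors: the distribution the codebook targets."""
    out = []
    for _ in range(n_vectors):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in g))
        out.extend(t / norm for t in g)
    return out


def lloyd_max_codebook(samples, levels, iters=30):
    """Empirical 1-D Lloyd iteration: assign to nearest centroid, then take cell means."""
    s = sorted(samples)
    centroids = [s[(2 * k + 1) * len(s) // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        for x in s:
            k = bisect.bisect(bounds, x)   # index of the nearest centroid
            sums[k] += x
            counts[k] += 1
        centroids = [sums[k] / counts[k] if counts[k] else centroids[k]
                     for k in range(levels)]
    return centroids


def encode(x, codebook):
    """x -> (norm, indices): normalize, rotate with the FWHT, quantize each coordinate."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        return 0.0, [0] * len(x)           # decode multiplies by the norm, giving zeros
    rotated = fwht([t / norm for t in x])
    bounds = [(a + b) / 2 for a, b in zip(codebook, codebook[1:])]
    return norm, [bisect.bisect(bounds, r) for r in rotated]


def decode(norm, indices, codebook):
    """(norm, indices) -> approximate x: codebook lookup, inverse FWHT, rescale."""
    return [norm * t for t in fwht([codebook[i] for i in indices])]


rng = random.Random(0)
codebook = lloyd_max_codebook(sphere_coordinates(128, 40, rng), levels=16)
x = [rng.gauss(0.0, 1.0) for _ in range(128)]
norm, indices = encode(x, codebook)
x_hat = decode(norm, indices, codebook)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat))) / norm
print(f"relative reconstruction error with 16 levels: {err:.3f}")
```

Using the orthonormal scaling keeps the sketch short, because the same `fwht` function serves as both the forward and the inverse transform.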
The current implementation uses a pre-dequantize strategy: turbo KV data is bulk-converted to f16 before the standard Flash Attention kernel runs. This adds little overhead and avoids the complexity of a fused FA kernel.
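The shape of the pre-dequantize strategy can be shown with a toy example: convert the whole compressed cache to floats first, then call an unmodified attention routine. This sketch makes several simplifying assumptions: the compressed row format (a scale plus small integer codes) is a stand-in and not the turbo format, Python floats replace f16, and the attention function is a plain single-query softmax rather than Flash Attention.

```python
import math


def attention(q, keys, values):
    """Standard single-query softmax attention over float vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)                                # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def attention_predequant(q, k_cache, v_cache, dequantize):
    """Bulk-convert the compressed cache, then reuse the unmodified attention."""
    keys = [dequantize(row) for row in k_cache]
    values = [dequantize(row) for row in v_cache]
    return attention(q, keys, values)


def toy_dequantize(row):
    """Toy stand-in for a compressed row: (scale, integer codes)."""
    scale, codes = row
    return [scale * c for c in codes]


k_cache = [(0.5, [1, -2, 0, 3]), (0.25, [4, 4, -1, 0])]
v_cache = [(1.0, [1, 0, 0, 0]), (1.0, [0, 1, 0, 0])]
q = [0.1, 0.2, -0.3, 0.4]
print(attention_predequant(q, k_cache, v_cache, toy_dequantize))
```

The design trade-off is that the attention code never needs to know about the compressed format, at the cost of a temporary dequantized copy of the cache.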
Requirements
- Flash Attention enabled (-fa 1)
- head_dim = 128 (Llama, Qwen, Mistral, Gemma, DeepSeek, Phi, and most current models)
- AMD ROCm (gfx1201 tested) or CPU
Limitations
- head_dim must be exactly 128
- No CUDA support yet (HIP/ROCm only for GPU path)
- Mixed turbo + quantized V (e.g. turbo4/q8_0) not supported — use turbo/turbo or turbo/f16
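The requirements and limitations above can be collected into a small pre-flight check. This is a hypothetical helper for illustration only: it is not the validation guard in src/llama-kv-cache.cpp, and it flags only the restrictions stated in this README, so combinations the README does not mention pass through unflagged.

```python
TURBO_TYPES = {"turbo3", "turbo4"}


def check_kv_config(head_dim, type_k, type_v, flash_attn):
    """Return a list of problems; an empty list means no documented restriction is violated."""
    problems = []
    if type_k not in TURBO_TYPES and type_v not in TURBO_TYPES:
        return problems                      # no turbo type involved, nothing to check
    if not flash_attn:
        problems.append("turbo cache types require Flash Attention (-fa 1)")
    if head_dim != 128:
        problems.append(f"head_dim must be exactly 128, got {head_dim}")
    # Documented limitation: turbo K with a non-turbo quantized V is unsupported;
    # the README recommends turbo/turbo or turbo/f16 instead.
    if type_k in TURBO_TYPES and type_v not in TURBO_TYPES and type_v != "f16":
        problems.append(f"{type_k}/{type_v} is not supported; use turbo/turbo or turbo/f16")
    return problems


print(check_kv_config(128, "turbo4", "turbo4", flash_attn=True))
print(check_kv_config(64, "turbo4", "q8_0", flash_attn=False))
```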
What Changed vs Upstream llama.cpp
22 files changed, ~1550 lines added. See docs/turboquant.md for full details.
Key files:
- ggml/src/ggml-cuda/set-rows.cu: FWHT encode kernels (128 threads, shared-memory FWHT)
- ggml/src/ggml-cuda/convert.cu: FWHT decode kernels (bulk dequantize)
- ggml/src/ggml-cuda/fattn.cu: Pre-dequantize integration in Flash Attention
- ggml/src/ggml-quants.c: CPU reference implementation
- src/llama-kv-cache.cpp: head_dim=128 validation guard
- tests/test-turboquant.cpp: 7 CPU reference tests
License
MIT License — same as upstream llama.cpp. See LICENSE.
TurboQuant algorithm: arXiv:2504.19874 (Zandieh, Daliri, Hadian, Mirrokni, ICLR 2026).
Acknowledgments
- llama.cpp by ggml-org — the foundation this builds on
- TurboQuant paper by Zandieh et al.
- Reference implementation by 0xSero
- llama.cpp discussion #20969 — community research thread