claudeopus46 457e76fc0e
Build Actions Cache / ubuntu-24-vulkan-cache (push) Has been cancelled
Build Actions Cache / ubuntu-24-openvino-cache (push) Has been cancelled
Build Actions Cache / windows-2022-rocm-cache (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[cuda_version:12.4.0 dockerfile:.devops/cuda.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:cuda cuda12 ubuntu_version:22.04]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[cuda_version:13.1.0 dockerfile:.devops/cuda-new.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:cuda13 ubuntu_version:24.04]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/cpu.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:cpu]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/cpu.Dockerfile free_disk_space:false full:true light:true platforms:linux/arm64 runs_on:ubuntu-24.04 server:true tag:cpu]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/intel.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:intel]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/musa.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:musa]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/openvino.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:openvino]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/rocm.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:rocm]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/s390x.Dockerfile free_disk_space:false full:true light:true platforms:linux/s390x runs_on:ubuntu-24.04-s390x server:true tag:s390x]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/vulkan.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:vulkan]) (push) Has been cancelled
Publish Docker image / Create and push git tag (push) Has been cancelled
Update Winget Package / Update Winget Package (push) Has been cancelled
EditorConfig Checker / editorconfig (push) Has been cancelled
Close inactive issues / close-issues (push) Has been cancelled
CI (msys) / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Has been cancelled
CI (msys) / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Has been cancelled
CI (cross) / debian-13-loongarch64-cpu-cross (push) Has been cancelled
CI (cross) / debian-13-loongarch64-vulkan-cross (push) Has been cancelled
CI (cross) / ubuntu-24-riscv64-cpu-spacemit-ime-cross (push) Has been cancelled
fix: match Ollama's proven gfx1103 approach — gfx1102 target + rocBLAS
Remove GGML_CUDA_FORCE_MMQ — let rocBLAS handle large batch GEMMs
using gfx1102 TensileLibrary (available in ROCm 7.2). The GPU is
spoofed as gfx1102 via HSA_OVERRIDE_GFX_VERSION=11.0.2 at runtime,
matching Ollama's working configuration.

FORCE_MMQ caused crashes because MMQ kernel launch_bounds are tuned
for GPUs with many CUs and cannot fit on the 6-CU iGPU for large
matrix dimensions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 03:05:06 +02:00
2026-02-02 08:51:25 +02:00
2024-11-24 08:03:25 -08:00
2026-02-02 08:38:55 +02:00

llama.cpp + TurboQuant KV Cache

License: MIT Based on llama.cpp

Fork of llama.cpp with TurboQuant KV-cache vector quantization for AMD ROCm.

Compresses the KV cache to 3-4 bits per dimension using Walsh-Hadamard Transform + Lloyd-Max optimal quantization (Zandieh et al., ICLR 2026). Reduces KV cache VRAM by 72-78% with less than 10% performance overhead.

Results

All benchmarks: Qwen3-14B Q4_K_M, AMD RX 9070 XT (16 GB VRAM, gfx1201), ROCm 6.1, NixOS.

Throughput

KV Type pp512 pp8K pp32K tg128 bpw KV Savings
f16/f16 1894 1345 563 53.4 16.0
q8_0/q8_0 1694 1340 52.1 8.0 50%
turbo4/turbo4 1812 1321 550 49.3 4.5 72%
turbo3/turbo3 1836 1319 548 49.6 3.5 78%

Quality (Perplexity)

KV Type PPL Delta vs f16
f16 1.7031
q8_0 1.7044 +0.001
turbo4 1.7134 +0.010
turbo3 1.7544 +0.051

Quick Start

# Build (AMD ROCm)
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS="gfx1201" -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run with TurboQuant KV cache
./build/bin/llama-cli -m model.gguf -fa 1 -ngl 99 \
  --cache-type-k turbo4 --cache-type-v turbo4

# Benchmark
./build/bin/llama-bench -m model.gguf -fa 1 -ngl 99 \
  -ctk turbo4 -ctv turbo4 -p 512 -n 128

# Run tests
./build/bin/test-turboquant

Flash attention (-fa 1) is required.

Available Types

Type Bits Block Bytes Use Case
turbo4 4-bit 32 18 Recommended default — near-lossless
turbo3 3-bit 32 14 Maximum compression — slightly higher PPL

Mixed configurations work: --cache-type-k turbo4 --cache-type-v f16 (or vice versa).

How It Works

Encode: x → L2norm → x/‖x‖ → FWHT(128) → scalar quantize → bitpack → (norm_fp16, indices)
Decode: unpack → codebook lookup → inverse FWHT → ×norm → x̃

The Fast Walsh-Hadamard Transform rotates input vectors into a domain where Lloyd-Max scalar quantization is optimal. The codebook centroids are precomputed for the Beta((d-1)/2, (d-1)/2) distribution arising after FWHT of unit vectors in d=128 dimensions.

Current implementation uses pre-dequantize strategy: turbo KV data is bulk-converted to f16 before standard Flash Attention. This adds minimal overhead while avoiding the complexity of a fused FA kernel.

Requirements

  • Flash Attention enabled (-fa 1)
  • head_dim = 128 (Llama, Qwen, Mistral, Gemma, DeepSeek, Phi, and most current models)
  • AMD ROCm (gfx1201 tested) or CPU

Limitations

  • head_dim must be exactly 128
  • No CUDA support yet (HIP/ROCm only for GPU path)
  • Mixed turbo + quantized V (e.g. turbo4/q8_0) not supported — use turbo/turbo or turbo/f16

What Changed vs Upstream llama.cpp

22 files changed, ~1550 lines added. See docs/turboquant.md for full details.

Key files:

  • ggml/src/ggml-cuda/set-rows.cu — FWHT encode kernels (128 threads, shared-memory FWHT)
  • ggml/src/ggml-cuda/convert.cu — FWHT decode kernels (bulk dequantize)
  • ggml/src/ggml-cuda/fattn.cu — Pre-dequantize integration in Flash Attention
  • ggml/src/ggml-quants.c — CPU reference implementation
  • src/llama-kv-cache.cpp — head_dim=128 validation guard
  • tests/test-turboquant.cpp — 7 CPU reference tests

License

MIT License — same as upstream llama.cpp. See LICENSE.

TurboQuant algorithm: arXiv:2504.19874 (Zandieh, Daliri, Hadian, Mirrokni, ICLR 2026).

Acknowledgments

S
Description
TurboQuant KV-cache quantization for AMD ROCm - fork of llama.cpp
Readme MIT 293 MiB
Languages
C++ 56.8%
C 12.7%
Python 7.3%
Cuda 6.1%
HTML 3.8%
Other 13%