claudeopus46/llama.cpp-turboquant

claudeopus46 94127d7b33

fix: force MMQ kernels to bypass rocBLAS TensileLibrary on gfx1103

ROCm 7.2 rocBLAS has no TensileLibrary for gfx1103 (RDNA3 iGPU) and
the gfx1102 library kernels crash due to register file differences.
Force MMQ (matrix multiply quantized) kernels which are compiled by
hipcc for the actual target arch, bypassing rocBLAS entirely.

This matches how Ollama successfully runs on AMD 780M / gfx1103.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-03 01:44:34 +02:00

.devops

fix: force MMQ kernels to bypass rocBLAS TensileLibrary on gfx1103

2026-04-03 01:44:34 +02:00

.gemini

contributing: tighten AI usage policy (#18388 )

2025-12-29 16:01:32 +01:00

.github

docker : fix and enable ARM64 image build (#20929 )

2026-03-28 01:45:09 +01:00

benches

benches : add Nemotron 3 Nano on DGX Spark (#20652 )

2026-03-16 21:50:43 +02:00

ci: Allow ninja to be used during unit test (#20742 )

2026-03-25 21:00:49 +08:00

cmake

ci : add sanitizer runs for server (#19291 )

2026-02-03 22:41:20 +02:00

common

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

docs

docs: add TurboQuant benchmark results and documentation

2026-03-29 20:48:00 +02:00

examples

[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093 )

2026-03-29 09:02:45 +08:00

ggml

fix: FWHT butterfly loop warp divergence on RDNA3 iGPU (gfx1103)

2026-04-03 00:19:35 +02:00

gguf-py

convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011 )

2026-03-26 16:49:09 +01:00

grammars

docs : document that JSON Schema is not available to model when using response_format (#18492 )

2025-12-30 15:13:49 -06:00

include

llama: fix llama-model-saver (#20503 )

2026-03-25 12:53:16 +02:00

licenses

refactor : remove libcurl, use OpenSSL when available (#18828 )

2026-01-14 18:02:47 +01:00

media

media : add transparent icon svg and png [no ci] (#15891 )

2025-09-10 14:51:28 +03:00

models

common/parser: fix handling of tool definition with missing properties key (#21128 )

2026-03-28 20:41:32 +01:00

pocs

ggml : move AMX to the CPU backend (#10570 )

2024-11-29 21:54:58 +01:00

requirements

ci : limit requirements versions (#20980 )

2026-03-25 10:55:37 +02:00

scripts

vendor : update cpp-httplib to 0.40.0 (#21100 )

2026-03-28 08:59:44 +01:00

src

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

tests

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

tools

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

vendor

vendor : update cpp-httplib to 0.40.0 (#21100 )

2026-03-28 08:59:44 +01:00

.clang-format

fix: apply clang-format to CUDA macros (#16017 )

2025-09-16 08:59:19 +02:00

.clang-tidy

clang-tidy : disable warning about performance enum size (#16127 )

2025-09-22 19:57:46 +02:00

.dockerignore

ci : fix docker build number and tag name (#9638 )

2024-09-25 17:26:01 +02:00

.ecrc

common : Update stb_image.h to latest version (#9161 )

2024-08-27 08:58:50 +03:00

.editorconfig

editorconfig : ignore benches/ (#17140 )

2025-11-10 12:17:19 +02:00

.flake8

llama : move end-user examples to tools directory (#13249 )

2025-05-02 20:27:13 +02:00

.gitignore

scripts : update get-hellaswag.sh and get-winogrande.sh (#20542 )

2026-03-14 11:21:50 +01:00

.gitmodules

ggml : remove kompute backend (#14501 )

2025-07-03 07:48:32 +03:00

.pre-commit-config.yaml

convert.py : add python logging instead of print() (#6511 )

2024-05-03 22:36:41 +03:00

AGENTS.md

docs : explicit about banning accounts that violates policy (#19593 )

2026-03-21 15:50:16 +01:00

AUTHORS

authors : update (#19263 )

2026-02-02 08:51:25 +02:00

build-xcframework.sh

build : remove LLAMA_HTTPLIB option (#19623 )

2026-02-15 15:38:50 +01:00

CLAUDE.md

contributing: tighten AI usage policy (#18388 )

2025-12-29 16:01:32 +01:00

CMakeLists.txt

server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui (#20158 )

2026-03-27 17:25:55 +01:00

CMakePresets.json

cmake : Add CMake presets for Linux and GCC (#14656 )

2025-07-13 08:12:36 +03:00

CODEOWNERS

Add codeowners for scripts/snapdragon and docs/snapdragon (#20915 )

2026-03-23 14:57:18 -07:00

CONTRIBUTING.md

docs : explicit about banning accounts that violates policy (#19593 )

2026-03-21 15:50:16 +01:00

convert_hf_to_gguf_update.py

convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011 )

2026-03-26 16:49:09 +01:00

convert_hf_to_gguf.py

mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027 )

2026-03-27 00:07:55 +01:00

convert_llama_ggml_to_gguf.py

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

convert_lora_to_gguf.py

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

flake.lock

flake.lock: Update (#10470 )

2024-11-24 08:03:25 -08:00

flake.nix

fix(nix): remove non-functional llama-cpp cachix cache from flake.nix (#15295 )

2025-08-13 11:21:31 -07:00

LICENSE

docs : Minor cleanups (#19252 )

2026-02-02 08:38:55 +02:00

Makefile

make : remove make in favor of CMake (#15449 )

2025-08-20 13:31:16 +03:00

mypy.ini

convert : partially revert PR #4818 (#5041 )

2024-01-20 18:14:18 -05:00

poetry.lock

build(python): Package scripts with pip-0517 compliance

2024-07-04 15:39:13 +00:00

pyproject.toml

gguf-py : bump sentencepiece version (#19319 )

2026-02-06 21:05:19 +01:00

pyrightconfig.json

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

README.md

docs: replace README with TurboQuant fork documentation

2026-03-29 20:57:01 +02:00

requirements.txt

tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034 )

2025-03-05 13:05:13 +00:00

SECURITY.md

docs : fix broken link and typo (#19560 )

2026-02-13 09:38:09 +01:00

ty.toml

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

README.md

llama.cpp + TurboQuant KV Cache

Fork of llama.cpp with TurboQuant KV-cache vector quantization for AMD ROCm.

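Compresses the KV cache to 3-4 bit indices per dimension (3.5-4.5 bits per value once the per-block fp16 norm is counted) using a Fast Walsh-Hadamard Transform followed by Lloyd-Max scalar quantization (Zandieh et al., ICLR 2026). In the benchmarks below this reduces KV-cache VRAM by 72-78% relative to f16, with throughput staying within 10% of the f16 baseline.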

Results

All benchmarks: Qwen3-14B Q4_K_M, AMD RX 9070 XT (16 GB VRAM, gfx1201), ROCm 6.1, NixOS.

Throughput

KV Type (K/V)	pp512 (t/s)	pp8K (t/s)	pp32K (t/s)	tg128 (t/s)	bpw	KV Savings vs f16
f16/f16	1894	1345	563	53.4	16.0	—
q8_0/q8_0	1694	1340	—	52.1	8.5	47%
turbo4/turbo4	1812	1321	550	49.3	4.5	72%
turbo3/turbo3	1836	1319	548	49.6	3.5	78%
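
The "less than 10%" overhead figure can be checked directly against the table. The sketch below copies the f16, turbo4, and turbo3 rows and computes the relative slowdown per test; it assumes the figures are tokens per second as reported by llama-bench, and the function names are illustrative only.

```python
# Throughput figures copied from the table above (assumed to be tokens/s).
BASELINE_F16 = {"pp512": 1894, "pp8K": 1345, "pp32K": 563, "tg128": 53.4}
TURBO = {
    "turbo4": {"pp512": 1812, "pp8K": 1321, "pp32K": 550, "tg128": 49.3},
    "turbo3": {"pp512": 1836, "pp8K": 1319, "pp32K": 548, "tg128": 49.6},
}


def overhead_percent(baseline, measured):
    """Relative slowdown versus the baseline, in percent."""
    return 100.0 * (1.0 - measured / baseline)


def overhead_table():
    """Map each turbo KV type to its per-test slowdown versus f16."""
    return {
        kv_type: {test: overhead_percent(BASELINE_F16[test], value)
                  for test, value in row.items()}
        for kv_type, row in TURBO.items()
    }


if __name__ == "__main__":
    for kv_type, row in overhead_table().items():
        print(kv_type, {test: round(pct, 1) for test, pct in row.items()})
```

In this data the largest slowdown is on token generation (tg128); the prompt-processing tests lose less.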

Quality (Perplexity)

KV Type	PPL	Delta vs f16
f16	1.7031	—
q8_0	1.7044	+0.001
turbo4	1.7134	+0.010
turbo3	1.7544	+0.051

Quick Start

# Build (AMD ROCm)
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS="gfx1201" -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run with TurboQuant KV cache
./build/bin/llama-cli -m model.gguf -fa 1 -ngl 99 \
  --cache-type-k turbo4 --cache-type-v turbo4

# Benchmark
./build/bin/llama-bench -m model.gguf -fa 1 -ngl 99 \
  -ctk turbo4 -ctv turbo4 -p 512 -n 128

# Run tests
./build/bin/test-turboquant

Flash attention (-fa 1) is required.
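
To compare several cache types in one go, the benchmark command above can be generated programmatically. This is a minimal sketch that only builds argument lists from the flags already shown in this section; the helper names and the default binary path are assumptions, and nothing is executed unless you pass the list to `subprocess.run` yourself.

```python
import shlex


def bench_command(model, cache_k, cache_v, prompt_tokens=512, gen_tokens=128,
                  binary="./build/bin/llama-bench"):
    """Build one llama-bench invocation using the flags from Quick Start."""
    return [binary, "-m", model, "-fa", "1", "-ngl", "99",
            "-ctk", cache_k, "-ctv", cache_v,
            "-p", str(prompt_tokens), "-n", str(gen_tokens)]


def bench_matrix(model, cache_types=("f16", "turbo4", "turbo3")):
    """One command per cache type, using the same type for K and V."""
    return [bench_command(model, t, t) for t in cache_types]


if __name__ == "__main__":
    for cmd in bench_matrix("model.gguf"):
        # Print a copy-pasteable shell line; run it with subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```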

Available Types

Type	Index Bits	Block Size (values)	Bytes per Block	Use Case
`turbo4`	4-bit	32	18	Recommended default — near-lossless
`turbo3`	3-bit	32	14	Maximum compression — slightly higher PPL

Mixed configurations work: --cache-type-k turbo4 --cache-type-v f16 (or vice versa).
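
The block sizes in the table follow from simple arithmetic: 32 indices at 3 or 4 bits each, plus a 2-byte fp16 norm. The sketch below reproduces those numbers and shows one way to pack the indices. The little-endian, LSB-first bit layout is an assumption made for illustration; the actual layout used by the kernels is not described here.

```python
BLOCK = 32       # values per block, from the table above
NORM_BYTES = 2   # one fp16 norm per block (assumed from the byte counts)


def pack_indices(indices, bits):
    """Pack small integers into bytes, least-significant bits first (assumed layout)."""
    acc = 0
    for position, index in enumerate(indices):
        if not 0 <= index < (1 << bits):
            raise ValueError(f"index {index} does not fit in {bits} bits")
        acc |= index << (position * bits)
    return acc.to_bytes((len(indices) * bits + 7) // 8, "little")


def unpack_indices(data, bits, count):
    """Inverse of pack_indices."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (position * bits)) & mask for position in range(count)]


def block_bytes(bits):
    """Bytes per block: packed indices plus the norm."""
    return NORM_BYTES + (BLOCK * bits + 7) // 8


def bits_per_value(bits):
    """Effective storage cost per cached value, including the norm."""
    return 8 * block_bytes(bits) / BLOCK


def kv_savings_vs_f16(bits):
    """Fraction of KV-cache memory saved relative to 16-bit storage."""
    return 1.0 - bits_per_value(bits) / 16.0
```

With these definitions, `block_bytes(4)` and `block_bytes(3)` give the 18 and 14 bytes listed in the table, and `kv_savings_vs_f16` rounds to the 72% and 78% savings quoted under Results.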

How It Works

Encode: x → L2norm → x/‖x‖ → FWHT(128) → scalar quantize → bitpack → (norm_fp16, indices)
Decode: unpack → codebook lookup → inverse FWHT → ×norm → x̃
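
The two pipelines above can be sketched in plain Python as a CPU-side illustration. This is not the fork's kernel code: the codebook here is an evenly spaced placeholder rather than the precomputed Lloyd-Max table, bit packing is omitted, and all function names are hypothetical.

```python
import math
import random
import struct

D = 128  # head dimension required by this fork


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; applying it twice returns the input."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def placeholder_codebook(bits, d=D):
    """Evenly spaced levels over +/-3 standard deviations; NOT the Lloyd-Max table."""
    levels = 1 << bits
    sigma = 1.0 / math.sqrt(d)
    return [(-3.0 + 6.0 * (k + 0.5) / levels) * sigma for k in range(levels)]


def to_fp16(value):
    """Round a float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def encode(x, codebook):
    """x -> (fp16 norm, nearest-centroid index per rotated coordinate)."""
    norm = math.sqrt(sum(v * v for v in x))
    unit = [v / norm for v in x] if norm > 0.0 else list(x)
    rotated = fwht(unit)
    indices = [min(range(len(codebook)), key=lambda k: abs(codebook[k] - r))
               for r in rotated]
    return to_fp16(norm), indices


def decode(norm, indices, codebook):
    """Codebook lookup, inverse FWHT, then rescale by the stored norm."""
    rotated = [codebook[i] for i in indices]
    return [norm * v for v in fwht(rotated)]


def demo_vector(seed=0, d=D):
    """Deterministic Gaussian test vector."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def relative_error(x, y):
    """L2 distance between x and y, relative to the norm of x."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return diff / math.sqrt(sum(a * a for a in x))
```

A round trip such as `decode(*encode(x, cb), cb)` with `cb = placeholder_codebook(4)` returns an approximation of `x`; swapping in a distribution-matched codebook is what reduces the reconstruction error further.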

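The Fast Walsh-Hadamard Transform (FWHT) is an orthogonal rotation that spreads each unit vector's energy across all coordinates, so every coordinate follows approximately the same known distribution. A single Lloyd-Max scalar codebook, which is the minimum mean-squared-error quantizer for that distribution, can then be applied to all coordinates. The codebook centroids are precomputed for the Beta((d-1)/2, (d-1)/2) distribution, rescaled to [-1, 1], that describes one coordinate of a rotated unit vector in d=128 dimensions.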
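
As a rough illustration of how such centroids can be obtained, the sketch below draws samples from the Beta((d-1)/2, (d-1)/2) distribution mapped onto [-1, 1] and runs the Lloyd-Max iteration on them. It is sample-based and approximate; the fork ships precomputed tables, whose exact values and derivation are not reproduced here, and the function names are hypothetical.

```python
import random

D = 128  # dimension of the unit vectors


def sample_coordinates(count, d=D, seed=0):
    """Draw one coordinate of a random unit vector: 2 * Beta(a, a) - 1 with a = (d-1)/2."""
    rng = random.Random(seed)
    a = (d - 1) / 2.0
    return [2.0 * rng.betavariate(a, a) - 1.0 for _ in range(count)]


def lloyd_max(samples, bits, iterations=50):
    """Alternate nearest-centroid assignment and centroid = cell mean."""
    data = sorted(samples)
    levels = 1 << bits
    n = len(data)
    # Start from evenly spaced quantiles of the sample.
    centroids = [data[(2 * k + 1) * n // (2 * levels)] for k in range(levels)]
    for _ in range(iterations):
        sums = [0.0] * levels
        counts = [0] * levels
        k = 0
        for s in data:
            # Data is sorted, so the active cell index only moves forward.
            while k + 1 < levels and s > 0.5 * (centroids[k] + centroids[k + 1]):
                k += 1
            sums[k] += s
            counts[k] += 1
        centroids = [sums[j] / counts[j] if counts[j] else centroids[j]
                     for j in range(levels)]
    return centroids


def quantization_mse(samples, centroids):
    """Mean squared error when each sample is replaced by its nearest centroid."""
    return sum(min((s - c) ** 2 for c in centroids) for s in samples) / len(samples)
```

For example, `lloyd_max(sample_coordinates(20000), bits=3)` yields eight centroids that are roughly symmetric around zero, denser near the centre of the distribution than in the tails.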

The current implementation uses a pre-dequantize strategy: turbo KV data is bulk-converted to f16 before the standard Flash Attention kernel runs. This costs less than 10% throughput in the benchmarks above while avoiding the complexity of a fused Flash Attention kernel.

Requirements

Flash Attention enabled (-fa 1)
head_dim = 128 (Llama, Qwen, Mistral, Gemma, DeepSeek, Phi, and most current models)
AMD ROCm (gfx1201 tested) or CPU

Limitations

head_dim must be exactly 128
No CUDA support yet (HIP/ROCm only for GPU path)
Mixed turbo + quantized V (e.g. turbo4/q8_0) not supported — use turbo/turbo or turbo/f16
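
The requirements and limitations above amount to a small set of checks. The function below is a hypothetical pre-flight helper, not the guard in src/llama-kv-cache.cpp; it encodes only the rules stated in this README and deliberately says nothing about combinations the README does not mention, such as a quantized K with a turbo V.

```python
TURBO_TYPES = {"turbo3", "turbo4"}


def check_kv_cache_config(head_dim, type_k, type_v, flash_attn):
    """Return a list of problems; an empty list means no stated rule is violated."""
    problems = []
    if type_k not in TURBO_TYPES and type_v not in TURBO_TYPES:
        return problems  # the rules below only apply when a turbo type is in use
    if not flash_attn:
        problems.append("Flash Attention must be enabled (-fa 1)")
    if head_dim != 128:
        problems.append(f"head_dim must be exactly 128, got {head_dim}")
    if type_k in TURBO_TYPES and type_v not in TURBO_TYPES and type_v != "f16":
        problems.append(f"turbo K with quantized V ({type_k}/{type_v}) is not supported; "
                        "use turbo/turbo or turbo/f16")
    return problems


if __name__ == "__main__":
    for config in [(128, "turbo4", "turbo4", True),
                   (128, "turbo4", "q8_0", True),
                   (64, "turbo3", "f16", False)]:
        print(config, check_kv_cache_config(*config) or "ok")
```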

What Changed vs Upstream llama.cpp

22 files changed, ~1550 lines added. See docs/turboquant.md for full details.

Key files:

ggml/src/ggml-cuda/set-rows.cu — FWHT encode kernels (128 threads, shared-memory FWHT)
ggml/src/ggml-cuda/convert.cu — FWHT decode kernels (bulk dequantize)
ggml/src/ggml-cuda/fattn.cu — Pre-dequantize integration in Flash Attention
ggml/src/ggml-quants.c — CPU reference implementation
src/llama-kv-cache.cpp — head_dim=128 validation guard
tests/test-turboquant.cpp — 7 CPU reference tests

License

MIT License — same as upstream llama.cpp. See LICENSE.

TurboQuant algorithm: arXiv:2504.19874 (Zandieh, Daliri, Hadian, Mirrokni, ICLR 2026).

Acknowledgments

llama.cpp by ggml-org — the foundation this builds on
TurboQuant paper by Zandieh et al.
Reference implementation by 0xSero
llama.cpp discussion #20969 — community research thread

Languages

C++ 56.8%

C 12.7%

Python 7.3%

Cuda 6.1%

HTML 3.8%

Other 13%
