claudeopus46/llama.cpp-turboquant

claudeopus46 457e76fc0e

fix: match Ollama's proven gfx1103 approach — gfx1102 target + rocBLAS

Remove GGML_CUDA_FORCE_MMQ — let rocBLAS handle large batch GEMMs
using gfx1102 TensileLibrary (available in ROCm 7.2). The GPU is
spoofed as gfx1102 via HSA_OVERRIDE_GFX_VERSION=11.0.2 at runtime,
matching Ollama's working configuration.

FORCE_MMQ caused crashes because MMQ kernel launch_bounds are tuned
for GPUs with many CUs and cannot fit on the 6-CU iGPU for large
matrix dimensions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-03 03:05:06 +02:00

.devops

fix: match Ollama's proven gfx1103 approach — gfx1102 target + rocBLAS

2026-04-03 03:05:06 +02:00

.gemini

contributing: tighten AI usage policy (#18388 )

2025-12-29 16:01:32 +01:00

.github

docker : fix and enable ARM64 image build (#20929 )

2026-03-28 01:45:09 +01:00

benches

benches : add Nemotron 3 Nano on DGX Spark (#20652 )

2026-03-16 21:50:43 +02:00

ci: Allow ninja to be used during unit test (#20742 )

2026-03-25 21:00:49 +08:00

cmake

ci : add sanitizer runs for server (#19291 )

2026-02-03 22:41:20 +02:00

common

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

docs

docs: add TurboQuant benchmark results and documentation

2026-03-29 20:48:00 +02:00

examples

[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093 )

2026-03-29 09:02:45 +08:00

ggml

fix: FWHT butterfly loop warp divergence on RDNA3 iGPU (gfx1103)

2026-04-03 00:19:35 +02:00

gguf-py

convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011 )

2026-03-26 16:49:09 +01:00

grammars

docs : document that JSON Schema is not available to model when using response_format (#18492 )

2025-12-30 15:13:49 -06:00

include

llama: fix llama-model-saver (#20503 )

2026-03-25 12:53:16 +02:00

licenses

refactor : remove libcurl, use OpenSSL when available (#18828 )

2026-01-14 18:02:47 +01:00

media

media : add transparent icon svg and png [no ci] (#15891 )

2025-09-10 14:51:28 +03:00

models

common/parser: fix handling of tool definition with missing properties key (#21128 )

2026-03-28 20:41:32 +01:00

pocs

ggml : move AMX to the CPU backend (#10570 )

2024-11-29 21:54:58 +01:00

requirements

ci : limit requirements versions (#20980 )

2026-03-25 10:55:37 +02:00

scripts

vendor : update cpp-httplib to 0.40.0 (#21100 )

2026-03-28 08:59:44 +01:00

src

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

tests

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

tools

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

vendor

vendor : update cpp-httplib to 0.40.0 (#21100 )

2026-03-28 08:59:44 +01:00

.clang-format

fix: apply clang-format to CUDA macros (#16017 )

2025-09-16 08:59:19 +02:00

.clang-tidy

clang-tidy : disable warning about performance enum size (#16127 )

2025-09-22 19:57:46 +02:00

.dockerignore

ci : fix docker build number and tag name (#9638 )

2024-09-25 17:26:01 +02:00

.ecrc

common : Update stb_image.h to latest version (#9161 )

2024-08-27 08:58:50 +03:00

.editorconfig

editorconfig : ignore benches/ (#17140 )

2025-11-10 12:17:19 +02:00

.flake8

llama : move end-user examples to tools directory (#13249 )

2025-05-02 20:27:13 +02:00

.gitignore

scripts : update get-hellaswag.sh and get-winogrande.sh (#20542 )

2026-03-14 11:21:50 +01:00

.gitmodules

ggml : remove kompute backend (#14501 )

2025-07-03 07:48:32 +03:00

.pre-commit-config.yaml

convert.py : add python logging instead of print() (#6511 )

2024-05-03 22:36:41 +03:00

AGENTS.md

docs : explicit about banning accounts that violates policy (#19593 )

2026-03-21 15:50:16 +01:00

AUTHORS

authors : update (#19263 )

2026-02-02 08:51:25 +02:00

build-xcframework.sh

build : remove LLAMA_HTTPLIB option (#19623 )

2026-02-15 15:38:50 +01:00

CLAUDE.md

contributing: tighten AI usage policy (#18388 )

2025-12-29 16:01:32 +01:00

CMakeLists.txt

server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui (#20158 )

2026-03-27 17:25:55 +01:00

CMakePresets.json

cmake : Add CMake presets for Linux and GCC (#14656 )

2025-07-13 08:12:36 +03:00

CODEOWNERS

Add codeowners for scripts/snapdragon and docs/snapdragon (#20915 )

2026-03-23 14:57:18 -07:00

CONTRIBUTING.md

docs : explicit about banning accounts that violates policy (#19593 )

2026-03-21 15:50:16 +01:00

convert_hf_to_gguf_update.py

convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011 )

2026-03-26 16:49:09 +01:00

convert_hf_to_gguf.py

mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027 )

2026-03-27 00:07:55 +01:00

convert_llama_ggml_to_gguf.py

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

convert_lora_to_gguf.py

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

flake.lock

flake.lock: Update (#10470 )

2024-11-24 08:03:25 -08:00

flake.nix

fix(nix): remove non-functional llama-cpp cachix cache from flake.nix (#15295 )

2025-08-13 11:21:31 -07:00

LICENSE

docs : Minor cleanups (#19252 )

2026-02-02 08:38:55 +02:00

Makefile

make : remove make in favor of CMake (#15449 )

2025-08-20 13:31:16 +03:00

mypy.ini

convert : partially revert PR #4818 (#5041 )

2024-01-20 18:14:18 -05:00

poetry.lock

build(python): Package scripts with pip-0517 compliance

2024-07-04 15:39:13 +00:00

pyproject.toml

gguf-py : bump sentencepiece version (#19319 )

2026-02-06 21:05:19 +01:00

pyrightconfig.json

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

README.md

docs: replace README with TurboQuant fork documentation

2026-03-29 20:57:01 +02:00

requirements.txt

tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034 )

2025-03-05 13:05:13 +00:00

SECURITY.md

docs : fix broken link and typo (#19560 )

2026-02-13 09:38:09 +01:00

ty.toml

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

README.md

llama.cpp + TurboQuant KV Cache

Fork of llama.cpp with TurboQuant KV-cache vector quantization for AMD ROCm.

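Compresses the KV cache to 3 or 4 bits per value (3.5 or 4.5 bits once the per-block fp16 norm is counted) using a Fast Walsh-Hadamard Transform followed by Lloyd-Max scalar quantization (Zandieh et al., ICLR 2026). In the benchmarks below this reduces KV cache VRAM by 72-78% relative to f16, with less than 10% throughput overhead.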

Results

All benchmarks: Qwen3-14B Q4_K_M, AMD RX 9070 XT (16 GB VRAM, gfx1201), ROCm 6.1, NixOS.

Throughput

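| KV Type | pp512 (t/s) | pp8K (t/s) | pp32K (t/s) | tg128 (t/s) | bpw | KV Savings |
|---|---|---|---|---|---|---|
| f16/f16 | 1894 | 1345 | 563 | 53.4 | 16.0 | - |
| q8_0/q8_0 | 1694 | 1340 | - | 52.1 | 8.0 | 50% |
| turbo4/turbo4 | 1812 | 1321 | 550 | 49.3 | 4.5 | 72% |
| turbo3/turbo3 | 1836 | 1319 | 548 | 49.6 | 3.5 | 78% |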
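
The savings column follows directly from the bpw column, and the overhead claim can be checked against the tg128 column. A small sketch of that arithmetic, using only figures copied from the table above (the function names are illustrative, not part of the fork):

```python
# Figures copied from the Throughput table; nothing here is measured anew.
F16_BPW = 16.0
F16_TG128 = 53.4

ROWS = {
    "q8_0":   {"bpw": 8.0, "tg128": 52.1},
    "turbo4": {"bpw": 4.5, "tg128": 49.3},
    "turbo3": {"bpw": 3.5, "tg128": 49.6},
}


def kv_savings(bpw, baseline=F16_BPW):
    """Fraction of KV-cache memory saved relative to the f16 baseline."""
    return 1.0 - bpw / baseline


def slowdown(tokens_per_s, baseline=F16_TG128):
    """Fractional throughput loss relative to the f16 baseline."""
    return 1.0 - tokens_per_s / baseline


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(f"{name}: saves {kv_savings(row['bpw']):.1%} of KV memory, "
              f"tg128 slower by {slowdown(row['tg128']):.1%}")
```

Both turbo rows come out below a 10% token-generation slowdown, which matches the overhead figure quoted in the introduction.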

Quality (Perplexity)

| KV Type | PPL | Delta vs f16 |
|---|---|---|
| f16 | 1.7031 | - |
| q8_0 | 1.7044 | +0.001 |
| turbo4 | 1.7134 | +0.010 |
| turbo3 | 1.7544 | +0.051 |

Quick Start

# Build (AMD ROCm)
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS="gfx1201" -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run with TurboQuant KV cache
./build/bin/llama-cli -m model.gguf -fa 1 -ngl 99 \
  --cache-type-k turbo4 --cache-type-v turbo4

# Benchmark
./build/bin/llama-bench -m model.gguf -fa 1 -ngl 99 \
  -ctk turbo4 -ctv turbo4 -p 512 -n 128

# Run tests
./build/bin/test-turboquant

Flash attention (-fa 1) is required.

Available Types

| Type | Bits | Block (values) | Bytes per block | Use Case |
|---|---|---|---|---|
| `turbo4` | 4-bit | 32 | 18 | Recommended default: near-lossless |
| `turbo3` | 3-bit | 32 | 14 | Maximum compression: slightly higher PPL |
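
The block sizes in the table are consistent with a 2-byte fp16 norm followed by 32 packed indices (32 × 4 bits = 16 bytes, 32 × 3 bits = 12 bytes). The sketch below illustrates that arithmetic. The field order and the LSB-first bit order are assumptions made for illustration; the README does not specify the fork's actual memory layout, and `pack_block` / `unpack_block` are hypothetical names.

```python
import struct

BLOCK = 32  # values per block, from the table above


def pack_block(norm, indices, bits):
    """Pack one block as an fp16 norm followed by `bits`-bit indices.

    Layout (norm first, indices LSB-first) is an assumption for illustration.
    """
    if len(indices) != BLOCK:
        raise ValueError("expected exactly 32 indices")
    acc = 0
    for i, idx in enumerate(indices):
        if not 0 <= idx < (1 << bits):
            raise ValueError("index does not fit in the requested bit width")
        acc |= idx << (i * bits)
    payload = acc.to_bytes(BLOCK * bits // 8, "little")
    return struct.pack("<e", norm) + payload  # "<e" is little-endian fp16


def unpack_block(blob, bits):
    """Inverse of pack_block: returns (norm, list of indices)."""
    (norm,) = struct.unpack("<e", blob[:2])
    acc = int.from_bytes(blob[2:], "little")
    mask = (1 << bits) - 1
    return norm, [(acc >> (i * bits)) & mask for i in range(BLOCK)]


if __name__ == "__main__":
    for bits in (4, 3):
        blob = pack_block(1.5, [i % (1 << bits) for i in range(BLOCK)], bits)
        print(f"{bits}-bit block: {len(blob)} bytes, "
              f"{len(blob) * 8 / BLOCK} bits per value")
```

Dividing the block size in bits by 32 values gives the 4.5 and 3.5 bpw figures reported in the Throughput table.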

Mixed configurations work: --cache-type-k turbo4 --cache-type-v f16 (or vice versa).

How It Works

Encode: x → L2norm → x/‖x‖ → FWHT(128) → scalar quantize → bitpack → (norm_fp16, indices)
Decode: unpack → codebook lookup → inverse FWHT → ×norm → x̃
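
A minimal CPU sketch of the encode and decode pipeline above, in plain Python. It is not the fork's reference implementation: the codebook here is a uniform placeholder rather than the precomputed Lloyd-Max centroids, bit packing is omitted, and all names are illustrative.

```python
import math


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


# Placeholder 16-level uniform codebook (NOT the Lloyd-Max centroids).
PLACEHOLDER_CODEBOOK = [(2 * k - 15) * 0.0177 for k in range(16)]


def encode(x, codebook=PLACEHOLDER_CODEBOOK):
    """x -> (norm, indices): normalize, rotate with FWHT, pick nearest centroid."""
    norm = math.sqrt(sum(t * t for t in x))
    unit = [t / norm for t in x] if norm > 0.0 else list(x)
    rotated = fwht(unit)
    indices = [min(range(len(codebook)), key=lambda k: abs(codebook[k] - r))
               for r in rotated]
    return norm, indices


def decode(norm, indices, codebook=PLACEHOLDER_CODEBOOK):
    """(norm, indices) -> approximate x: lookup, inverse FWHT, rescale."""
    rotated = [codebook[k] for k in indices]
    # With 1/sqrt(n) scaling the transform is its own inverse.
    return [norm * t for t in fwht(rotated)]
```

Because the normalized transform is orthogonal and self-inverse, the same `fwht` routine serves both directions, and the only lossy step is the scalar quantization.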

The Fast Walsh-Hadamard Transform rotates input vectors into a domain where Lloyd-Max scalar quantization is optimal. The codebook centroids are precomputed for the Beta((d-1)/2, (d-1)/2) distribution arising after FWHT of unit vectors in d=128 dimensions.
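
To make the codebook idea concrete, the sketch below estimates Lloyd-Max centroids empirically instead of from the closed-form density. It assumes the rotated coordinates behave like coordinates of a uniformly random unit vector in d=128 dimensions (for such a coordinate u, (u+1)/2 follows the Beta law named above). The sample sizes, iteration count, and function names are illustrative choices, and the resulting values are not the fork's precomputed centroids.

```python
import math
import random


def sample_unit_coords(d=128, n_vectors=100, seed=0):
    """Coordinates of random unit vectors, drawn as normalized Gaussians."""
    rng = random.Random(seed)
    coords = []
    for _ in range(n_vectors):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in g))
        coords.extend(t / norm for t in g)
    return coords


def lloyd_max(samples, levels, iters=30):
    """1-D Lloyd iteration: alternate midpoint boundaries and cell means."""
    data = sorted(samples)
    # Start from evenly spaced quantiles so no cell begins empty.
    centroids = [data[(2 * k + 1) * len(data) // (2 * levels)]
                 for k in range(levels)]
    for _ in range(iters):
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        k = 0
        for x in data:  # data is sorted, so the cell index only moves forward
            while k < levels - 1 and x > bounds[k]:
                k += 1
            sums[k] += x
            counts[k] += 1
        centroids = [sums[j] / counts[j] if counts[j] else centroids[j]
                     for j in range(levels)]
    return centroids


if __name__ == "__main__":
    codebook = lloyd_max(sample_unit_coords(), levels=8)  # 3-bit example
    print([round(c, 4) for c in codebook])
```

The centroids cluster more densely near zero than a uniform grid would, which is where most of the probability mass of the rotated coordinates lies.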

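The current implementation uses a pre-dequantize strategy: turbo KV data is bulk-converted to f16 before the standard Flash Attention kernel runs. This avoids the complexity of a fused FA kernel at a modest cost: in the Throughput table above, turbo3 and turbo4 stay within about 8% of f16/f16 in every column.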
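
The control flow of the pre-dequantize strategy can be sketched as follows. This is a conceptual illustration in plain Python, not the HIP kernels: `decode_row` is a hypothetical callback standing in for the bulk dequantize step, Python floats stand in for f16, and `attention` is ordinary softmax attention for a single query rather than a Flash Attention kernel.

```python
import math


def attention(q, keys, values):
    """Plain softmax attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attend_pre_dequantized(q, k_rows, v_rows, decode_row):
    """Dequantize the whole K and V cache first, then run standard attention."""
    keys = [decode_row(row) for row in k_rows]
    values = [decode_row(row) for row in v_rows]
    return attention(q, keys, values)


if __name__ == "__main__":
    # Toy "quantized" rows: (norm, unit-scale values); decode just rescales.
    toy_decode = lambda row: [row[0] * t for t in row[1]]
    k_rows = [(2.0, [0.5, 0.0]), (2.0, [0.0, 0.5])]
    v_rows = [(1.0, [1.0, 0.0]), (1.0, [0.0, 1.0])]
    print(attend_pre_dequantized([1.0, 0.0], k_rows, v_rows, toy_decode))
```

The attention routine never sees the compressed format, which is why an existing kernel can be reused unchanged. A fused kernel would instead decode inside the attention loop and skip the temporary f16 copy.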

Requirements

Flash Attention enabled (-fa 1)
head_dim = 128 (Llama, Qwen, Mistral, Gemma, DeepSeek, Phi, and most current models)
AMD ROCm (gfx1201 tested) or CPU

Limitations

head_dim must be exactly 128
No CUDA support yet (HIP/ROCm only for GPU path)
Mixed turbo + quantized V (e.g. turbo4/q8_0) not supported — use turbo/turbo or turbo/f16
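
The requirements and limitations above amount to a small set of configuration checks. The sketch below expresses them as a hypothetical helper; it is not the guard in src/llama-kv-cache.cpp, and the function name and messages are invented for illustration. The README does not say whether turbo3 and turbo4 may be mixed with each other, so this sketch allows it as an assumption.

```python
TURBO_TYPES = {"turbo3", "turbo4"}


def check_kv_config(head_dim, flash_attn, type_k, type_v):
    """Return a list of problems; an empty list means the config looks valid."""
    if type_k not in TURBO_TYPES and type_v not in TURBO_TYPES:
        return []  # no turbo types involved, nothing to check here
    problems = []
    if not flash_attn:
        problems.append("turbo KV types require Flash Attention (-fa 1)")
    if head_dim != 128:
        problems.append(f"head_dim must be exactly 128, got {head_dim}")
    for side, kv_type in (("K", type_k), ("V", type_v)):
        # The non-turbo side of a mixed configuration must be f16.
        if kv_type not in TURBO_TYPES and kv_type != "f16":
            problems.append(
                f"{side} cache type {kv_type!r} cannot be mixed with turbo; "
                "use a turbo type or f16")
    return problems


if __name__ == "__main__":
    print(check_kv_config(128, True, "turbo4", "turbo4"))  # valid
    print(check_kv_config(128, True, "turbo4", "q8_0"))    # mixed, rejected
    print(check_kv_config(64, False, "turbo3", "f16"))     # two problems
```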

What Changed vs Upstream llama.cpp

22 files changed, ~1550 lines added. See docs/turboquant.md for full details.

Key files:

ggml/src/ggml-cuda/set-rows.cu — FWHT encode kernels (128 threads, shared-memory FWHT)
ggml/src/ggml-cuda/convert.cu — FWHT decode kernels (bulk dequantize)
ggml/src/ggml-cuda/fattn.cu — Pre-dequantize integration in Flash Attention
ggml/src/ggml-quants.c — CPU reference implementation
src/llama-kv-cache.cpp — head_dim=128 validation guard
tests/test-turboquant.cpp — 7 CPU reference tests

License

MIT License — same as upstream llama.cpp. See LICENSE.

TurboQuant algorithm: arXiv:2504.19874 (Zandieh, Daliri, Hadian, Mirrokni, ICLR 2026).

Acknowledgments

llama.cpp by ggml-org — the foundation this builds on
TurboQuant paper by Zandieh et al.
Reference implementation by 0xSero
llama.cpp discussion #20969 — community research thread
