claudeopus46/llama.cpp-turboquant

claudeopus46 1ea0b4798d

fix: FWHT butterfly loop warp divergence on RDNA3 iGPU (gfx1103)

The `if (tid < 64)` guard only let half the threads participate in the
128-element Walsh-Hadamard Transform, leaving elements 64-127
untransformed. This caused warp divergence and register pressure that
crashes low-CU iGPUs (AMD 780M / gfx1103 with 6 CUs).

Replace with proper bounds check `if (i + h < TURBO_HEAD_DIM)` so all
128 threads participate naturally. Fixes both correctness (full FWHT
over all elements) and GPU occupancy on resource-constrained hardware.

Affects: dequantize_block_turbo3_0, dequantize_block_turbo4_0,
k_set_rows_turbo3, k_set_rows_turbo4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-03 00:19:35 +02:00

.devops

devops: including compute-runtime for intel.Dockerfile (#21076 )

2026-03-29 13:34:03 +08:00

.gemini

contributing: tighten AI usage policy (#18388 )

2025-12-29 16:01:32 +01:00

.github

docker : fix and enable ARM64 image build (#20929 )

2026-03-28 01:45:09 +01:00

benches

benches : add Nemotron 3 Nano on DGX Spark (#20652 )

2026-03-16 21:50:43 +02:00

ci: Allow ninja to be used during unit test (#20742 )

2026-03-25 21:00:49 +08:00

cmake

ci : add sanitizer runs for server (#19291 )

2026-02-03 22:41:20 +02:00

common

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

docs

docs: add TurboQuant benchmark results and documentation

2026-03-29 20:48:00 +02:00

examples

[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093 )

2026-03-29 09:02:45 +08:00

ggml

fix: FWHT butterfly loop warp divergence on RDNA3 iGPU (gfx1103)

2026-04-03 00:19:35 +02:00

gguf-py

convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011 )

2026-03-26 16:49:09 +01:00

grammars

docs : document that JSON Schema is not available to model when using response_format (#18492 )

2025-12-30 15:13:49 -06:00

include

llama: fix llama-model-saver (#20503 )

2026-03-25 12:53:16 +02:00

licenses

refactor : remove libcurl, use OpenSSL when available (#18828 )

2026-01-14 18:02:47 +01:00

media

media : add transparent icon svg and png [no ci] (#15891 )

2025-09-10 14:51:28 +03:00

models

common/parser: fix handling of tool definition with missing properties key (#21128 )

2026-03-28 20:41:32 +01:00

pocs

ggml : move AMX to the CPU backend (#10570 )

2024-11-29 21:54:58 +01:00

requirements

ci : limit requirements versions (#20980 )

2026-03-25 10:55:37 +02:00

scripts

vendor : update cpp-httplib to 0.40.0 (#21100 )

2026-03-28 08:59:44 +01:00

src

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

tests

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

tools

feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)

2026-03-29 20:48:00 +02:00

vendor

vendor : update cpp-httplib to 0.40.0 (#21100 )

2026-03-28 08:59:44 +01:00

.clang-format

fix: apply clang-format to CUDA macros (#16017 )

2025-09-16 08:59:19 +02:00

.clang-tidy

clang-tidy : disable warning about performance enum size (#16127 )

2025-09-22 19:57:46 +02:00

.dockerignore

ci : fix docker build number and tag name (#9638 )

2024-09-25 17:26:01 +02:00

.ecrc

common : Update stb_image.h to latest version (#9161 )

2024-08-27 08:58:50 +03:00

.editorconfig

editorconfig : ignore benches/ (#17140 )

2025-11-10 12:17:19 +02:00

.flake8

llama : move end-user examples to tools directory (#13249 )

2025-05-02 20:27:13 +02:00

.gitignore

scripts : update get-hellaswag.sh and get-winogrande.sh (#20542 )

2026-03-14 11:21:50 +01:00

.gitmodules

ggml : remove kompute backend (#14501 )

2025-07-03 07:48:32 +03:00

.pre-commit-config.yaml

convert.py : add python logging instead of print() (#6511 )

2024-05-03 22:36:41 +03:00

AGENTS.md

docs : explicit about banning accounts that violates policy (#19593 )

2026-03-21 15:50:16 +01:00

AUTHORS

authors : update (#19263 )

2026-02-02 08:51:25 +02:00

build-xcframework.sh

build : remove LLAMA_HTTPLIB option (#19623 )

2026-02-15 15:38:50 +01:00

CLAUDE.md

contributing: tighten AI usage policy (#18388 )

2025-12-29 16:01:32 +01:00

CMakeLists.txt

server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui (#20158 )

2026-03-27 17:25:55 +01:00

CMakePresets.json

cmake : Add CMake presets for Linux and GCC (#14656 )

2025-07-13 08:12:36 +03:00

CODEOWNERS

Add codeowners for scripts/snapdragon and docs/snapdragon (#20915 )

2026-03-23 14:57:18 -07:00

CONTRIBUTING.md

docs : explicit about banning accounts that violates policy (#19593 )

2026-03-21 15:50:16 +01:00

convert_hf_to_gguf_update.py

convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011 )

2026-03-26 16:49:09 +01:00

convert_hf_to_gguf.py

mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027 )

2026-03-27 00:07:55 +01:00

convert_llama_ggml_to_gguf.py

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

convert_lora_to_gguf.py

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

flake.lock

flake.lock: Update (#10470 )

2024-11-24 08:03:25 -08:00

flake.nix

fix(nix): remove non-functional llama-cpp cachix cache from flake.nix (#15295 )

2025-08-13 11:21:31 -07:00

LICENSE

docs : Minor cleanups (#19252 )

2026-02-02 08:38:55 +02:00

Makefile

make : remove make in favor of CMake (#15449 )

2025-08-20 13:31:16 +03:00

mypy.ini

convert : partially revert PR #4818 (#5041 )

2024-01-20 18:14:18 -05:00

poetry.lock

build(python): Package scripts with pip-0517 compliance

2024-07-04 15:39:13 +00:00

pyproject.toml

gguf-py : bump sentencepiece version (#19319 )

2026-02-06 21:05:19 +01:00

pyrightconfig.json

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

README.md

docs: replace README with TurboQuant fork documentation

2026-03-29 20:57:01 +02:00

requirements.txt

tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034 )

2025-03-05 13:05:13 +00:00

SECURITY.md

docs : fix broken link and typo (#19560 )

2026-02-13 09:38:09 +01:00

ty.toml

ci : switch from pyright to ty (#20826 )

2026-03-21 08:54:34 +01:00

README.md

llama.cpp + TurboQuant KV Cache

Fork of llama.cpp with TurboQuant KV-cache vector quantization for AMD ROCm.

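Compresses the KV cache to 3-4 bits per dimension (3.5-4.5 bits per value once per-block overhead is included) using a Walsh-Hadamard Transform followed by Lloyd-Max optimal scalar quantization (Zandieh et al., ICLR 2026). In the benchmarks below this reduces KV cache VRAM by 72-78% relative to f16, with less than 10% loss in throughput.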

Results

All benchmarks: Qwen3-14B Q4_K_M, AMD RX 9070 XT (16 GB VRAM, gfx1201), ROCm 6.1, NixOS.

Throughput

KV Type	pp512	pp8K	pp32K	tg128	bpw	KV Savings
f16/f16	1894	1345	563	53.4	16.0	—
q8_0/q8_0	1694	1340	—	52.1	8.0	50%
turbo4/turbo4	1812	1321	550	49.3	4.5	72%
turbo3/turbo3	1836	1319	548	49.6	3.5	78%
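
The KV Savings column appears to follow directly from the bpw column when f16 (16 bits per value) is taken as the baseline. A minimal sketch of that arithmetic, where the helper name `kv_savings` is an illustrative assumption and not part of the fork:

```python
F16_BITS = 16.0  # bits per value in the f16 baseline


def kv_savings(bits_per_value: float, baseline_bits: float = F16_BITS) -> float:
    """Fraction of KV-cache memory saved relative to the baseline type."""
    return 1.0 - bits_per_value / baseline_bits


# bpw values copied from the Throughput table above
bpw = {"q8_0": 8.0, "turbo4": 4.5, "turbo3": 3.5}
savings = {name: kv_savings(bits) for name, bits in bpw.items()}

for name, frac in savings.items():
    print(f"{name}: {frac:.1%} smaller than f16")
```

The table's 72% and 78% are these fractions rounded to whole percentages.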

Quality (Perplexity)

KV Type	PPL	Delta vs f16
f16	1.7031	—
q8_0	1.7044	+0.001
turbo4	1.7134	+0.010
turbo3	1.7544	+0.051

Quick Start

# Build (AMD ROCm)
cmake -B build -DGGML_HIP=ON -DGPU_TARGETS="gfx1201" -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run with TurboQuant KV cache
./build/bin/llama-cli -m model.gguf -fa 1 -ngl 99 \
  --cache-type-k turbo4 --cache-type-v turbo4

# Benchmark
./build/bin/llama-bench -m model.gguf -fa 1 -ngl 99 \
  -ctk turbo4 -ctv turbo4 -p 512 -n 128

# Run tests
./build/bin/test-turboquant

Flash attention (-fa 1) is required.

Available Types

Type	Bits	Block	Bytes	Use Case
`turbo4`	4-bit	32	18	Recommended default — near-lossless
`turbo3`	3-bit	32	14	Maximum compression — slightly higher PPL

Mixed configurations work: --cache-type-k turbo4 --cache-type-v f16 (or vice versa).
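
The Bytes column is consistent with 32 packed indices plus 2 extra bytes per block. The sketch below assumes those 2 bytes hold one fp16 value (for example a norm or scale) placed before the indices, and assumes a least-significant-bit-first packing order. The fork's actual field order and bit layout are not shown in this README, so `encode_block` and its layout are hypothetical.

```python
import struct

BLOCK = 32  # values per block, from the table above


def pack_indices(indices, bits):
    """Pack small unsigned integers into bytes, least-significant bits first."""
    acc = 0
    for pos, idx in enumerate(indices):
        if not 0 <= idx < (1 << bits):
            raise ValueError("index out of range for the given bit width")
        acc |= idx << (pos * bits)
    return acc.to_bytes((len(indices) * bits + 7) // 8, "little")


def unpack_indices(data, bits, count):
    """Inverse of pack_indices."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (pos * bits)) & mask for pos in range(count)]


def encode_block(scale, indices, bits):
    """Hypothetical layout: 2-byte fp16 scale followed by the packed indices."""
    return struct.pack("<e", scale) + pack_indices(indices, bits)


idx4 = [i % 16 for i in range(BLOCK)]
idx3 = [i % 8 for i in range(BLOCK)]
block4 = encode_block(1.0, idx4, bits=4)
block3 = encode_block(1.0, idx3, bits=3)

print(len(block4), len(block3))                          # 18 14
print(len(block4) * 8 / BLOCK, len(block3) * 8 / BLOCK)  # 4.5 3.5
```

Under these assumptions the block sizes reproduce both the Bytes column here and the bpw column in the Throughput table.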

How It Works

Encode: x → L2norm → x/‖x‖ → FWHT(128) → scalar quantize → bitpack → (norm_fp16, indices)
Decode: unpack → codebook lookup → inverse FWHT → ×norm → x̃

The Fast Walsh-Hadamard Transform (FWHT) rotates each normalized input vector so that its coordinates approximately follow a known distribution, which lets a single precomputed Lloyd-Max scalar codebook be applied to every coordinate. The codebook centroids are precomputed for the Beta((d-1)/2, (d-1)/2) distribution that arises after the FWHT of unit vectors in d=128 dimensions.
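
The encode and decode pipeline can be sketched on the CPU in a few lines. This is an illustration under stated assumptions, not the fork's kernel code: the FWHT is the orthonormal variant (scaled by 1/sqrt(n), so it is its own inverse), bit packing is omitted, and `make_codebook` returns evenly spaced placeholder levels instead of the Lloyd-Max centroids the fork precomputes. All function names are hypothetical.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly on the pair (i, i + h)
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def make_codebook(bits, dim):
    """Placeholder: evenly spaced levels within +/-3/sqrt(dim), NOT Lloyd-Max centroids."""
    levels = 1 << bits
    half_width = 3.0 / math.sqrt(dim)
    step = 2 * half_width / levels
    return [-half_width + (k + 0.5) * step for k in range(levels)]


def encode(x, codebook):
    """x -> (norm, indices): normalize, rotate with the FWHT, pick the nearest level."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return 0.0, [0] * len(x)
    rotated = fwht([v / norm for v in x])
    indices = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - r))
        for r in rotated
    ]
    return norm, indices


def decode(norm, indices, codebook):
    """(norm, indices) -> approximate x: codebook lookup, inverse FWHT, rescale."""
    return [norm * v for v in fwht([codebook[i] for i in indices])]


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(128)]
codebook = make_codebook(bits=4, dim=128)
norm, indices = encode(x, codebook)
x_hat = decode(norm, indices, codebook)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat))) / norm
print(f"relative L2 error: {rel_err:.3f}")
```

Because the orthonormal FWHT preserves lengths, the reconstruction error in the original domain equals the quantization error in the rotated domain, which is why the choice of codebook determines quality.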

The current implementation uses a pre-dequantize strategy: turbo KV data is bulk-converted to f16 before the standard Flash Attention kernel runs. This avoids the complexity of a fused FA kernel, and the extra conversion pass keeps overhead low (see the Throughput table).

Requirements

Flash Attention enabled (-fa 1)
head_dim = 128 (Llama, Qwen, Mistral, Gemma, DeepSeek, Phi, and most current models)
AMD ROCm (gfx1201 tested) or CPU

Limitations

head_dim must be exactly 128
No CUDA support yet (HIP/ROCm only for GPU path)
Mixed turbo + quantized V (e.g. turbo4/q8_0) not supported — use turbo/turbo or turbo/f16

What Changed vs Upstream llama.cpp

22 files changed, ~1550 lines added. See docs/turboquant.md for full details.

Key files:

ggml/src/ggml-cuda/set-rows.cu — FWHT encode kernels (128 threads, shared-memory FWHT)
ggml/src/ggml-cuda/convert.cu — FWHT decode kernels (bulk dequantize)
ggml/src/ggml-cuda/fattn.cu — Pre-dequantize integration in Flash Attention
ggml/src/ggml-quants.c — CPU reference implementation
src/llama-kv-cache.cpp — head_dim=128 validation guard
tests/test-turboquant.cpp — 7 CPU reference tests

License

MIT License — same as upstream llama.cpp. See LICENSE.

TurboQuant algorithm: arXiv:2504.19874 (Zandieh, Daliri, Hadian, Mirrokni, ICLR 2026).

Acknowledgments

llama.cpp by ggml-org — the foundation this builds on
TurboQuant paper by Zandieh et al.
Reference implementation by 0xSero
llama.cpp discussion #20969 — community research thread

Languages

C++ 56.8%

C 12.7%

Python 7.3%

Cuda 6.1%

HTML 3.8%

Other 13%
