llama.cpp-turboquant

Author	SHA1	Message	Date
claudeopus46	457e76fc0e	fix: match Ollama's proven gfx1103 approach — gfx1102 target + rocBLAS Remove GGML_CUDA_FORCE_MMQ — let rocBLAS handle large batch GEMMs using gfx1102 TensileLibrary (available in ROCm 7.2). The GPU is spoofed as gfx1102 via HSA_OVERRIDE_GFX_VERSION=11.0.2 at runtime, matching Ollama's working configuration. FORCE_MMQ caused crashes because MMQ kernel launch_bounds are tuned for GPUs with many CUs and cannot fit on the 6-CU iGPU for large matrix dimensions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 03:05:06 +02:00
claudeopus46	94127d7b33	fix: force MMQ kernels to bypass rocBLAS TensileLibrary on gfx1103 ROCm 7.2 rocBLAS has no TensileLibrary for gfx1103 (RDNA3 iGPU) and the gfx1102 library kernels crash due to register file differences. Force MMQ (matrix multiply quantized) kernels which are compiled by hipcc for the actual target arch, bypassing rocBLAS entirely. This matches how Ollama successfully runs on AMD 780M / gfx1103. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 01:44:34 +02:00
claudeopus46	cec946de76	fix: add gfx1103 TensileLibrary alias from gfx1102 ROCm 7.2 ships rocBLAS TensileLibrary for gfx1102 but not gfx1103 (RDNA3 iGPU). Copy gfx1102 library as gfx1103 so rocBLAS matmuls work on AMD 780M without HSA_OVERRIDE_GFX_VERSION spoofing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 00:49:45 +02:00
claudeopus46	7eb4c866ff	fix: disable WMMA flash attention for gfx1103 iGPU compatibility rocWMMA does not support RDNA3 iGPU (gfx1103). Flash attention still works via the vec codepath. This allows native gfx1103 compilation without HSA_OVERRIDE_GFX_VERSION spoofing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 00:32:57 +02:00
claudeopus46	1ea0b4798d	fix: FWHT butterfly loop warp divergence on RDNA3 iGPU (gfx1103) The `if (tid < 64)` guard only let half the threads participate in the 128-element Walsh-Hadamard Transform, leaving elements 64-127 untransformed. This caused warp divergence and register pressure that crashes low-CU iGPUs (AMD 780M / gfx1103 with 6 CUs). Replace with proper bounds check `if (i + h < TURBO_HEAD_DIM)` so all 128 threads participate naturally. Fixes both correctness (full FWHT over all elements) and GPU occupancy on resource-constrained hardware. Affects: dequantize_block_turbo3_0, dequantize_block_turbo4_0, k_set_rows_turbo3, k_set_rows_turbo4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 00:19:35 +02:00
Pascal Wachowski	d1b8f40933	docs: replace README with TurboQuant fork documentation Clear fork identification, MIT license attribution, benchmark results, build instructions, and links to upstream llama.cpp and the paper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> b8583	2026-03-29 20:57:01 +02:00
Pascal Wachowski	04b2771048	docs: add TurboQuant benchmark results and documentation Performance (Qwen3-14B Q4_K_M, RX 9070 XT 16GB): - turbo4: 1812 pp512 / 49 tg128 / PPL +0.010 vs f16 / 72% KV savings - turbo3: 1836 pp512 / 50 tg128 / PPL +0.051 vs f16 / 78% KV savings - All context lengths up to 40K work without OOM Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 20:48:00 +02:00
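The KV savings quoted in this row line up with the block layouts given in commit bd571adc99 (14 bytes per 32-element block for turbo3, 18 bytes for turbo4) when compared against a 16-bit f16 cache. The sketch below reproduces that arithmetic. The helper names are hypothetical, and treating "savings" as a pure bits-per-element ratio against f16 is an assumption, not something the commit states.

```python
# Hypothetical helpers for checking the quoted figures; not part of llama.cpp.

F16_BITS = 16  # assumed baseline: one f16 value per KV element


def bits_per_weight(block_bytes: int, block_size: int) -> float:
    """Storage cost per element for a block of `block_size` values."""
    return block_bytes * 8 / block_size


def kv_savings(bpw: float, baseline_bits: int = F16_BITS) -> float:
    """Fraction of KV memory saved relative to the baseline element width."""
    return 1.0 - bpw / baseline_bits


if __name__ == "__main__":
    # Block sizes and byte counts are taken from the commit message.
    for name, block_bytes in (("turbo3", 14), ("turbo4", 18)):
        bpw = bits_per_weight(block_bytes, 32)
        print(f"{name}: {bpw} bpw, {kv_savings(bpw):.1%} smaller than f16")
```

Under these assumptions the ratios come out at roughly 78% for turbo3 and 72% for turbo4, matching the numbers in the commit message.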
Pascal Wachowski	bd571adc99	feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4) Implements TurboQuant (Zandieh et al., ICLR 2026) KV-cache vector quantization targeting AMD RDNA 4 (gfx1201, RX 9070 XT). Algorithm: L2-normalize → FWHT(128) → Lloyd-Max scalar quantize → bitpack Decode: unpack → codebook lookup → inverse FWHT → denormalize Two new GGML types: - GGML_TYPE_TURBO3_0: 3-bit, 3.5 bpw, MSEd=0.034 (block_size=32, 14 bytes) - GGML_TYPE_TURBO4_0: 4-bit, 4.5 bpw, MSEd=0.009 (block_size=32, 18 bytes) Architecture (pre-dequantize strategy): - Write path: FWHT-aware set-rows kernels (128 threads, shared-mem FWHT) - Read path: bulk dequantize turbo→f16 before standard Flash Attention - Stride scaling preserves ggml_permute dim swaps (critical fix) Performance (Qwen3-14B Q4_K_M, RX 9070 XT, 16 GB VRAM): f16/f16: 1865 pp512, 54 tg128 (baseline) q8_0/q8_0: 1694 pp512, 52 tg128 turbo4/turbo4: 1813 pp512, 49 tg128 (-3% pp, -9% tg, 72% less KV VRAM) turbo3/turbo3: 1983 pp512, 49 tg128 (+6% pp, -9% tg, 78% less KV VRAM) Usage: llama-cli -fa 1 --cache-type-k turbo4 --cache-type-v turbo4 Includes 7 CPU reference tests validating FWHT self-inverse, MSE against paper values, bitpack determinism, and dequantize sanity. Requires head_dim=128 (covers most current models including Llama, Qwen, Mistral, Gemma). Guard added to KV cache init with clear error message. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 20:48:00 +02:00
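The encode and decode pipeline in this commit relies on a 128-point fast Walsh-Hadamard transform, and the commit mentions CPU tests for the FWHT self-inverse property. As a rough illustration only, here is a plain Python reference of an unnormalized FWHT. It is not the fork's HIP kernel, the function name is hypothetical, and whether the real implementation folds in a normalization factor is not stated in the commit. In this unnormalized form, applying the transform twice returns the input scaled by the length.

```python
# CPU reference sketch of an unnormalized fast Walsh-Hadamard transform.
# Hypothetical helper; the fork's GPU kernels are organized differently.

def fwht(values):
    """Return the Walsh-Hadamard transform of `values` (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Each stage pairs element i with element i + h inside blocks of 2 * h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = list(range(128))  # head_dim = 128, as required by the commit
    twice = fwht(fwht(x))
    # Self-inverse up to a factor of n for the unnormalized transform.
    print(twice == [128 * v for v in x])
```

Every index from 0 to 127 is written in each of the seven stages, which is the property a full transform over all elements needs.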
Sigbjørn Skjæret	7c203670f8	add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150 )	2026-03-29 19:45:40 +02:00
Gaurav Garg	ec16a072f0	Optimize MOE GEMV kernel for BS > 1. (#20905 ) * Optimize MOE GEMV kernel for BS > 1. The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row. New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync). This change doesn't increase any compilation time as a single template instance is needed per type. This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization. * Remove em-dashes * Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8 * Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2026-03-29 18:35:18 +02:00
Max Krasnyansky	f5d1c4179f	hexagon: dma optimizations (mostly fixing regressions) (#21137 ) * hex-fa: add simple dma cache for Mask I noticed that we were refetching the mask rows over and over. This simple cache avoids that. * hex-dma: unset in-order desc bit which caused significant perf regression We don't rely on true in order processing of the DMA descriptors anywhere. Turns out this mode caused significant regression of around 3-4 TPS during token gen. * hex-rope: update comment to clarify that we don't need in-order DMA completions	2026-03-29 06:40:13 -07:00
Davi Henrique Linhares	2405d59cb6	devops: including compute-runtime for intel.Dockerfile (#21076 )	2026-03-29 13:34:03 +08:00
Neo Zhang	afe65aa282	[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093 ) * use half cores to build, avoid OS hang * reduce the output text num to short test time * avoid to return 0	2026-03-29 09:02:45 +08:00
Sigbjørn Skjæret	65097181e4	fix **/x glob matching (#21129 )	2026-03-28 22:27:38 +01:00
Piotr Wilkin (ilintar)	98ae0a0d36	common/parser: fix handling of tool definition with missing properties key (#21128 )	2026-03-28 20:41:32 +01:00
Sigbjørn Skjæret	3a14a542f5	common : add character class support to glob_match (#21111 ) * add character class support to glob_match * remove pointless reference	2026-03-28 19:57:37 +01:00
BlueMöhre	968189729f	WebUI: Replace illegal nested button elements (#21026 ) * remove/replace nested button elements * map rest props to outer element * solve TODO * chore: update webui build output	2026-03-28 17:57:59 +01:00
Adrien	e397d3885c	common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (#21124 ) The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV when a JSON schema "pattern" field contains a non-capturing group (?:...). Root cause: when the parser sees '(' followed by '?', it pushes a warning but does not advance past '?:'. The recursive transform() call then interprets '?' as a quantifier and calls seq.back() on an empty vector, causing undefined behavior. This commonly occurs when serving OpenAI-compatible tool calls from clients that include complex regex patterns in their JSON schemas (e.g., date validation patterns like ^(?:(?:\d\d[2468][048]\|...)-02-29\|...)$). The fix: - Skip '?:' after '(' to treat non-capturing groups as regular groups - For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely, handling escaped characters to avoid miscounting parenthesis depth - Adjust the ')' unbalanced-parentheses check using direct char comparisons instead of substr - Add test cases for non-capturing groups (C++ only, as the JS/Python implementations do not yet support this syntax)	2026-03-28 17:55:38 +01:00
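The fix described in this row has two parts: treat `(?:` as the start of an ordinary group, and skip other `(?...` constructs up to their matching `)` while ignoring escaped parentheses. The Python sketch below illustrates that scanning logic in isolation. It is not the C++ converter from the commit, the function names are hypothetical, and it deliberately ignores parentheses inside character classes such as `[)]`.

```python
# Illustrative sketch of escape-aware group scanning; hypothetical helpers,
# not the llama.cpp implementation.

def find_group_end(pattern: str, open_idx: int) -> int:
    """Return the index of the ')' that closes the '(' at open_idx."""
    depth = 0
    i = open_idx
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    raise ValueError("unbalanced parentheses")


def group_body(pattern: str, open_idx: int):
    """Return (body, next_idx); body is None for unsupported (?=, (?! and similar."""
    close_idx = find_group_end(pattern, open_idx)
    start = open_idx + 1
    if pattern.startswith("?:", start):
        # Non-capturing group: drop the '?:' marker and keep the body.
        return pattern[start + 2:close_idx], close_idx + 1
    if pattern.startswith("?", start):
        # Unsupported extension syntax: skip the whole group.
        return None, close_idx + 1
    return pattern[start:close_idx], close_idx + 1


if __name__ == "__main__":
    print(group_body(r"(?:ab)+", 0))
    print(group_body(r"(?=x\))y", 0))
```

Advancing past `?:` before recursing is what keeps a later stage from reading `?` as a quantifier applied to an empty sequence, which is the failure mode the commit message describes.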
Aldehir Rojas	e6f2ec01ff	common : add reasoning_format = none support to gpt-oss (#21094 )	2026-03-28 09:33:39 -05:00
Georgi Gerganov	edfb440a2f	server : fix processing of multiple back-to-back mtmd chunks (#21107 )	2026-03-28 16:27:36 +02:00
Adrien Gallouët	3d66da1809	ci : gracefully shut down the server (#21110 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-28 14:49:57 +01:00
Woof Dog	82b703f8bc	Document custom default webui preferences in server README (#19771 )	2026-03-28 14:19:16 +01:00
Aleksander Grygier	51a84efc53	webui: Conversation forking + branching improvements (#21021 ) * refactor: Make `DialogConfirmation` extensible with children slot * feat: Add conversation forking logic * feat: Conversation forking UI * feat: Update delete/edit dialogs and logic for forks * refactor: Improve Chat Sidebar UX and add MCP Servers entry * refactor: Cleanup * feat: Update message in place when editing leaf nodes * chore: Cleanup * chore: Cleanup * chore: Cleanup * chore: Cleanup * chore: Cleanup * chore: Cleanup * refactor: Post-review improvements * chore: update webui build output * test: Update Storybook test * chore: update webui build output * chore: update webui build output	2026-03-28 13:38:15 +01:00
Adrien Gallouët	b0f0dd3e51	vendor : update cpp-httplib to 0.40.0 (#21100 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-28 08:59:44 +01:00
Ruben Ortlam	0eb4764182	vulkan: add noncontiguous GLU support (#21081 ) * vulkan: add noncontiguous GLU support * fix compile issue	2026-03-28 08:44:56 +01:00
Piotr Wilkin (ilintar)	1f5d15e665	common/parser: fix reasoning whitespace bugs + extra parser tests (#21085 ) * fix whitespace reasoning issues + add reconstruction tests * Proper fix * fix Nemotron autoparser test expectations to include newline in marker	2026-03-28 07:29:26 +01:00
Sigbjørn Skjæret	c46758d28f	cli : add /glob command (#21084 ) * add /glob command * output error when max files reached * support globbing outside curdir	2026-03-28 02:33:04 +01:00
Ts-sound	bf934f28db	docker : fix and enable ARM64 image build (#20929 ) * CI: fix ARM64 image build error & enable compilation * Update .github/workflows/docker.yml Co-authored-by: Aaron Teo <taronaeo@gmail.com> * CI: revert ggml/src/ggml-cpu/CMakeLists.txt * Update .github/workflows/docker.yml Co-authored-by: Aaron Teo <taronaeo@gmail.com> * CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04") * CI: change cpu.Dockerfile gcc to 14; * CI : cpu.Dockerfile , update pip install . * Update .github/workflows/docker.yml Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2026-03-28 01:45:09 +01:00
Adrien Gallouët	5c1a7b8355	server : add custom socket options to disable SO_REUSEPORT (#21056 ) * server : add custom socket options to disable SO_REUSEPORT Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add --reuse-port $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update tools/server/README.md (llama-gen-docs) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix windows Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-28 01:12:43 +01:00
Aldehir Rojas	59d840209a	common : inhibit lazy grammar sampler while reasoning is active (#20970 ) * common : inhibit grammar while reasoning budget is active * cont : update force_pos in accept * cont : fix tests * cont : tweak should apply logic * cont : return early not using grammar sampler * Add tests * cont : prevent backend sampling when reasoning budget enabled * cont : fix typo --------- Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>	2026-03-27 18:30:40 +01:00
Kusha Gharahi	ff934e29bc	server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui (#20158 ) * introduce LLAMA_SERVER_NO_WEBUI * LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI * LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE * MIssed this * Add useWebUi to package.nix	2026-03-27 17:25:55 +01:00
Yiwei Shao	ee051c1e4e	hexagon: support for IQ4_NL and MXFP4 (#21018 ) * ggml-hexagon: add IQ4_NL and MXFP4 HMX matmul support - Add IQ4_NL quantization type support to Hexagon backend (buffer set/get tensor repack, mul_mat, mul_mat_id dispatch) - Implement HVX IQ4_NL vec_dot kernels (1x1, 2x1, 2x2) with LUT-based 4-bit index to int8 kvalue dequantization - Add MXFP4 HMX dequantization path with E8M0 scale conversion, including batch-4 fast path and single-tile fallback - Unify quantized row size / scale offset logic to handle Q4_0, Q8_0, IQ4_NL, and MXFP4 in the DMA fetch path * ggml-hexagon: fix SKIP_QUANTIZE src1 address mismatch in mixed-quant models * Fix the pragma indent	2026-03-27 09:22:41 -07:00
Aleksander Grygier	e6f6770515	webui: Improve Chat Messages initial scroll + auto-scroll logic + add lazy loading with transitions to content blocks (#20999 ) * refactor: Always use agentic content renderer for Assistant Message * feat: Improve initial scroll + auto-scroll logic + implement fade in action for content blocks * chore: update webui build output	2026-03-27 17:01:36 +01:00
AN Long	48cda24c11	server: remove the verbose_prompt parameter (#21059 ) * server: respect the verbose_prompt parameter * Revert "server: respect the verbose_prompt parameter" This reverts commit 8ed885cf375b2c8ba641c661f3667df70b9797f4. * Remove --verbose-prompt parameter from llama-server * Using set_examples instead of set_excludes	2026-03-27 13:36:13 +02:00
Xuan-Son Nguyen	871f1a2d2f	mtmd: add more sanity checks (#21047 )	2026-03-27 11:00:52 +01:00
Xuan-Son Nguyen	20197b6fe3	server: add built-in tools backend support (#20898 ) * wip: server_tools * refactor * displayName -> display_name * snake_case everywhere * rm redundant field * change arg to --tools all * add readme mention * llama-gen-docs	2026-03-27 10:07:11 +01:00
Radoslav Gerganov	ba38f3becc	rpc : proper handling of data pointers to CPU buffers (#21030 ) The compute graph may contain tensors pointing to CPU buffers. In these cases the buffer address is serialized as 0 and sent over the wire. However, the data pointer is serialized as-is and this prevents proper validation on the server side. This patch fixes this by serializing the data pointer as 0 for non-RPC buffers and doing proper validation on the server side. closes: #21006	2026-03-27 10:59:35 +02:00
mtmcp	37f230dd7c	completion : session_tokens insert range in completion tool (no-op → correct) (#20917 ) The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after decoding. Should be embd.begin(), embd.end(). Introduced in commit `2b6dfe8`.	2026-03-27 09:25:58 +01:00
mtmcp	a308e584ca	completion : Fix segfault on model load failure (#21049 )	2026-03-27 10:01:13 +02:00
Pascal	d0fa2c9fbb	Send reasoning content back to the model across turns via the reasoning_content API field (#21036 ) * webui: send reasoning_content back to model in context Preserve assistant reasoning across turns by extracting it from internal tags and sending it as a separate reasoning_content field in the API payload. The server and Jinja templates handle native formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...). Adds "Exclude reasoning from context" toggle in Settings > Developer (off by default, so reasoning is preserved). Includes unit tests. * webui: add syncable parameter for excludeReasoningFromContext * chore: update webui build output	2026-03-27 08:17:35 +01:00
ren	9bcb4eff4d	metal : Fix dimension constraint violation in matmul2d descriptor (#21048 ) Updates Metal tensor API test probe to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).	2026-03-27 09:05:21 +02:00
KokerZhou	6861f6509a	CANN: update docker images to 8.5.0 and improve CANN.md (#20801 ) * cann: update docker images to 8.5.0 - bump CANN base image from 8.3.rc2 to 8.5.0 - bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0 Move to newer stable releases. * cann: update CANN.md * Update CANN.md to include BF16 support Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions. * Fix formatting issues in CANN.md Fix 234: Trailing whitespace	2026-03-27 08:53:00 +08:00
Saba Fallah	1743d98057	mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027 ) * mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-27 00:07:55 +01:00
uvos	7ca0c9cca7	hip: use fnuz fp8 for conversion on CDNA3 (#21040 )	2026-03-26 23:06:33 +01:00
Xuan-Son Nguyen	8c60b8a2be	ci: pin external actions to exact commit SHA (#21033 )	2026-03-26 20:44:00 +01:00
Adrien Gallouët	287b5b1eab	common : add getpwuid fallback for HF cache when HOME is not set (#21035 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-26 20:34:23 +01:00
Xuan-Son Nguyen	a73bbd5d92	mtmd: refactor image preprocessing (#21031 ) * mtmd: refactor image pre-processing * correct some places * correct lfm2 * fix deepseek-ocr on server * add comment to clarify about mtmd_image_preprocessor_dyn_size	2026-03-26 19:49:20 +01:00
lhez	ded446b34c	opencl: allow large buffer for adreno (#20997 )	2026-03-26 08:52:21 -07:00
Michael Wand	f8d4abae86	convert : support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505 ) * convert : fix Qwen3.5 NVFP4 conversion * Updated copilot concerns and rebased * move into _LinearAttentionVReorderBase and simplify * --flake * new_name not needed * Added input_scale to gguf * Fixed input_scale addition as tensor * Added input scale to loader and named _in_s * Update convert_hf_to_gguf.py Re-removed input_scale from aux cleanup Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-26 16:52:06 +01:00
Pavel Zloi	3d5acab3e7	convert : add RuGPT3XL (RuGPT3XLForCausalLM) support (#21011 ) * Support of ruGPT3XL model added * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * chkhsh for ruGPT3XL model added * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fixing chkhsh for ruGPT3XL, rerun updated and _qkv_parts in RuGPT3XLModel --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-26 16:49:09 +01:00

8588 Commits