claudeopus46
457e76fc0e
fix: match Ollama's proven gfx1103 approach — gfx1102 target + rocBLAS
...
Build Actions Cache / ubuntu-24-vulkan-cache (push) Has been cancelled
Build Actions Cache / ubuntu-24-openvino-cache (push) Has been cancelled
Build Actions Cache / windows-2022-rocm-cache (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[cuda_version:12.4.0 dockerfile:.devops/cuda.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:cuda cuda12 ubuntu_version:22.04]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[cuda_version:13.1.0 dockerfile:.devops/cuda-new.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:cuda13 ubuntu_version:24.04]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/cpu.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:cpu]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/cpu.Dockerfile free_disk_space:false full:true light:true platforms:linux/arm64 runs_on:ubuntu-24.04 server:true tag:cpu]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/intel.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:intel]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/musa.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:musa]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/openvino.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:openvino]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/rocm.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:rocm]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/s390x.Dockerfile free_disk_space:false full:true light:true platforms:linux/s390x runs_on:ubuntu-24.04-s390x server:true tag:s390x]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/vulkan.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 runs_on:ubuntu-24.04 server:true tag:vulkan]) (push) Has been cancelled
Publish Docker image / Create and push git tag (push) Has been cancelled
Update Winget Package / Update Winget Package (push) Has been cancelled
EditorConfig Checker / editorconfig (push) Has been cancelled
Close inactive issues / close-issues (push) Has been cancelled
CI (msys) / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Has been cancelled
CI (msys) / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Has been cancelled
CI (cross) / debian-13-loongarch64-cpu-cross (push) Has been cancelled
CI (cross) / debian-13-loongarch64-vulkan-cross (push) Has been cancelled
CI (cross) / ubuntu-24-riscv64-cpu-spacemit-ime-cross (push) Has been cancelled
Remove GGML_CUDA_FORCE_MMQ — let rocBLAS handle large batch GEMMs
using gfx1102 TensileLibrary (available in ROCm 7.2). The GPU is
spoofed as gfx1102 via HSA_OVERRIDE_GFX_VERSION=11.0.2 at runtime,
matching Ollama's working configuration.
FORCE_MMQ caused crashes because MMQ kernel launch_bounds are tuned
for GPUs with many CUs and cannot fit on the 6-CU iGPU for large
matrix dimensions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-03 03:05:06 +02:00
claudeopus46
94127d7b33
fix: force MMQ kernels to bypass rocBLAS TensileLibrary on gfx1103
...
Build Actions Cache / ubuntu-24-vulkan-cache (push) Has been cancelled
Build Actions Cache / ubuntu-24-openvino-cache (push) Has been cancelled
Build Actions Cache / windows-2022-rocm-cache (push) Has been cancelled
EditorConfig Checker / editorconfig (push) Has been cancelled
Close inactive issues / close-issues (push) Has been cancelled
ROCm 7.2 rocBLAS has no TensileLibrary for gfx1103 (RDNA3 iGPU) and
the gfx1102 library kernels crash due to register file differences.
Force MMQ (matrix multiply quantized) kernels which are compiled by
hipcc for the actual target arch, bypassing rocBLAS entirely.
This matches how Ollama successfully runs on AMD 780M / gfx1103.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-03 01:44:34 +02:00
claudeopus46
cec946de76
fix: add gfx1103 TensileLibrary alias from gfx1102
...
EditorConfig Checker / editorconfig (push) Has been cancelled
Build Actions Cache / ubuntu-24-vulkan-cache (push) Has been cancelled
Build Actions Cache / ubuntu-24-openvino-cache (push) Has been cancelled
Build Actions Cache / windows-2022-rocm-cache (push) Has been cancelled
ROCm 7.2 ships rocBLAS TensileLibrary for gfx1102 but not gfx1103
(RDNA3 iGPU). Copy gfx1102 library as gfx1103 so rocBLAS matmuls
work on AMD 780M without HSA_OVERRIDE_GFX_VERSION spoofing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-03 00:49:45 +02:00
claudeopus46
7eb4c866ff
fix: disable WMMA flash attention for gfx1103 iGPU compatibility
...
EditorConfig Checker / editorconfig (push) Has been cancelled
rocWMMA does not support RDNA3 iGPU (gfx1103). Flash attention still
works via the vec codepath. This allows native gfx1103 compilation
without HSA_OVERRIDE_GFX_VERSION spoofing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-03 00:32:57 +02:00
claudeopus46
1ea0b4798d
fix: FWHT butterfly loop warp divergence on RDNA3 iGPU (gfx1103)
...
EditorConfig Checker / editorconfig (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-cuda (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm2 (push) Has been cancelled
CI (self-hosted) / ggml-ci-linux-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-intel-openvino-gpu-low-perf (push) Has been cancelled
CI / build-cmake-pkg (push) Has been cancelled
CI / macOS-latest-arm64 (push) Has been cancelled
CI / macOS-latest-x64 (push) Has been cancelled
CI / macOS-latest-arm64-webgpu (push) Has been cancelled
CI / ubuntu-cpu (arm64, ubuntu-22.04-arm) (push) Has been cancelled
CI / ubuntu-cpu (ppc64le, ubuntu-24.04-ppc64le) (push) Has been cancelled
CI / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
CI / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
CI / ubuntu-latest-rpc (push) Has been cancelled
CI / ubuntu-24-vulkan (push) Has been cancelled
CI / ubuntu-24-webgpu (push) Has been cancelled
CI / ubuntu-24-webgpu-wasm (push) Has been cancelled
CI / ubuntu-22-hip (push) Has been cancelled
CI / ubuntu-22-musa (push) Has been cancelled
CI / ubuntu-22-sycl (push) Has been cancelled
CI / ubuntu-22-sycl-fp16 (push) Has been cancelled
CI / ubuntu-24-openvino-CPU (push) Has been cancelled
CI / ubuntu-24-openvino-GPU (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Has been cancelled
CI / windows-latest (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Has been cancelled
CI / windows-latest (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DG… (push) Has been cancelled
CI / windows-latest (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Has been cancelled
CI / ubuntu-latest-cuda (push) Has been cancelled
CI / windows-2022-cuda (12.4) (push) Has been cancelled
CI / windows-latest-sycl (push) Has been cancelled
CI / windows-latest-hip (push) Has been cancelled
CI / ubuntu-cpu-riscv64-native (push) Has been cancelled
CI / ggml-ci-x64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-x64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai-graviton4 (push) Has been cancelled
HIP quality check / ubuntu-22-hip-quality-check (push) Has been cancelled
Release / macOS-arm64 (push) Has been cancelled
Release / macOS-x64 (push) Has been cancelled
Release / ubuntu-22-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
Release / ubuntu-22-cpu (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-22-vulkan (push) Has been cancelled
Release / ubuntu-24-openvino (push) Has been cancelled
Release / windows-cpu (arm64) (push) Has been cancelled
Release / windows-cpu (x64) (push) Has been cancelled
Release / windows (arm64, opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON, ggml-opencl) (push) Has been cancelled
Release / windows (x64, vulkan, -DGGML_VULKAN=ON, ggml-vulkan) (push) Has been cancelled
Release / windows-cuda (12.4) (push) Has been cancelled
Release / windows-cuda (13.1) (push) Has been cancelled
Release / windows-sycl (push) Has been cancelled
Release / ubuntu-22-rocm (7.2, x64, gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1151;gfx1150;gfx1200;gfx1201) (push) Has been cancelled
Release / windows-hip (gfx1150;gfx1151;gfx1200;gfx1201;gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032, radeon) (push) Has been cancelled
Release / ios-xcode-build (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 910b, on) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 910b, on) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2, backend-sampling) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1, backend-sampling) (push) Has been cancelled
Server (self-hosted) / server-cuda (GPUx1) (push) Has been cancelled
Server (self-hosted) / server-cuda (GPUx1, backend-sampling) (push) Has been cancelled
Server / server (default) (push) Has been cancelled
Server / server (backend-sampling) (push) Has been cancelled
Server / server-windows (push) Has been cancelled
Release / release (push) Has been cancelled
The `if (tid < 64)` guard only let half the threads participate in the
128-element Walsh-Hadamard Transform, leaving elements 64-127
untransformed. This caused warp divergence and register pressure that
crashes low-CU iGPUs (AMD 780M / gfx1103 with 6 CUs).
Replace with proper bounds check `if (i + h < TURBO_HEAD_DIM)` so all
128 threads participate naturally. Fixes both correctness (full FWHT
over all elements) and GPU occupancy on resource-constrained hardware.
Affects: dequantize_block_turbo3_0, dequantize_block_turbo4_0,
k_set_rows_turbo3, k_set_rows_turbo4
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-03 00:19:35 +02:00
Pascal Wachowski
d1b8f40933
docs: replace README with TurboQuant fork documentation
...
EditorConfig Checker / editorconfig (push) Has been cancelled
Build Actions Cache / ubuntu-24-vulkan-cache (push) Has been cancelled
Build Actions Cache / ubuntu-24-openvino-cache (push) Has been cancelled
Build Actions Cache / windows-2022-rocm-cache (push) Has been cancelled
Copilot Setup Steps / copilot-setup-steps (push) Has been cancelled
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
Update Operations Documentation / update-ops-docs (push) Has been cancelled
Clear fork identification, MIT license attribution, benchmark results,
build instructions, and links to upstream llama.cpp and the paper.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
b8583
2026-03-29 20:57:01 +02:00
Pascal Wachowski
04b2771048
docs: add TurboQuant benchmark results and documentation
...
Performance (Qwen3-14B Q4_K_M, RX 9070 XT 16GB):
- turbo4: 1812 pp512 / 49 tg128 / PPL +0.010 vs f16 / 72% KV savings
- turbo3: 1836 pp512 / 50 tg128 / PPL +0.051 vs f16 / 78% KV savings
- All context lengths up to 40K work without OOM
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-29 20:48:00 +02:00
Pascal Wachowski
bd571adc99
feat: TurboQuant KV-cache quantization for AMD ROCm (turbo3/turbo4)
...
Implements TurboQuant (Zandieh et al., ICLR 2026) KV-cache vector
quantization targeting AMD RDNA 4 (gfx1201, RX 9070 XT).
Algorithm: L2-normalize → FWHT(128) → Lloyd-Max scalar quantize → bitpack
Decode: unpack → codebook lookup → inverse FWHT → denormalize
Two new GGML types:
- GGML_TYPE_TURBO3_0: 3-bit, 3.5 bpw, MSE*d=0.034 (block_size=32, 14 bytes)
- GGML_TYPE_TURBO4_0: 4-bit, 4.5 bpw, MSE*d=0.009 (block_size=32, 18 bytes)
Architecture (pre-dequantize strategy):
- Write path: FWHT-aware set-rows kernels (128 threads, shared-mem FWHT)
- Read path: bulk dequantize turbo→f16 before standard Flash Attention
- Stride scaling preserves ggml_permute dim swaps (critical fix)
Performance (Qwen3-14B Q4_K_M, RX 9070 XT, 16 GB VRAM):
f16/f16: 1865 pp512, 54 tg128 (baseline)
q8_0/q8_0: 1694 pp512, 52 tg128
turbo4/turbo4: 1813 pp512, 49 tg128 (-3% pp, -9% tg, 72% less KV VRAM)
turbo3/turbo3: 1983 pp512, 49 tg128 (+6% pp, -9% tg, 78% less KV VRAM)
Usage: llama-cli -fa 1 --cache-type-k turbo4 --cache-type-v turbo4
Includes 7 CPU reference tests validating FWHT self-inverse, MSE against
paper values, bitpack determinism, and dequantize sanity.
Requires head_dim=128 (covers most current models including Llama, Qwen,
Mistral, Gemma). Guard added to KV cache init with clear error message.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-29 20:48:00 +02:00
Sigbjørn Skjæret
7c203670f8
add missing ROPE_FACTORS_LONG/SHORT for MiniCPM ( #21150 )
2026-03-29 19:45:40 +02:00
Gaurav Garg
ec16a072f0
Optimize MOE GEMV kernel for BS > 1. ( #20905 )
...
* Optimize MOE GEMV kernel for BS > 1.
The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row.
New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).
This change doesn't increase any compilation time as a single template instance is needed per type. This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization.
* Remove em-dashes
* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits
Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com >
2026-03-29 18:35:18 +02:00
Max Krasnyansky
f5d1c4179f
hexagon: dma optimizations (mostly fixing regressions) ( #21137 )
...
* hex-fa: add simple dma cache for Mask
I noticed that we were refetch the mask rows over and over.
This simple cache avoids that.
* hex-dma: unset in-order desc bit which caused signficant perf regression
We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.
* hex-rope: update comment to clarify that we don't need in-order DMA completions
2026-03-29 06:40:13 -07:00
Davi Henrique Linhares
2405d59cb6
devops: including compute-runtime for intel.Dockerfile ( #21076 )
2026-03-29 13:34:03 +08:00
Neo Zhang
afe65aa282
[SYCL] Enhance build script to use half cores to build, avoid OS hang ( #21093 )
...
* use half cores to build, avoid OS hang
* reduce the output text num to short test time
* avoid to return 0
2026-03-29 09:02:45 +08:00
Sigbjørn Skjæret
65097181e4
fix **/x glob matching ( #21129 )
2026-03-28 22:27:38 +01:00
Piotr Wilkin (ilintar)
98ae0a0d36
common/parser: fix handling of tool definition with missing properties key ( #21128 )
2026-03-28 20:41:32 +01:00
Sigbjørn Skjæret
3a14a542f5
common : add character class support to glob_match ( #21111 )
...
* add character class support to glob_match
* remove pointless reference
2026-03-28 19:57:37 +01:00
BlueMöhre
968189729f
WebUI: Replace illegal nested button elements ( #21026 )
...
* remove/replace nested button elements
* map rest props to outer element
* solve TODO
* chore: update webui build output
2026-03-28 17:57:59 +01:00
Adrien
e397d3885c
common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter ( #21124 )
...
The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).
Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.
This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).
The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
implementations do not yet support this syntax)
2026-03-28 17:55:38 +01:00
Aldehir Rojas
e6f2ec01ff
common : add reasoning_format = none support to gpt-oss ( #21094 )
2026-03-28 09:33:39 -05:00
Georgi Gerganov
edfb440a2f
server : fix processing of multiple back-to-back mtmd chunks ( #21107 )
2026-03-28 16:27:36 +02:00
Adrien Gallouët
3d66da1809
ci : gracefully shut down the server ( #21110 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
2026-03-28 14:49:57 +01:00
Woof Dog
82b703f8bc
Document custom default webui preferences in server README ( #19771 )
2026-03-28 14:19:16 +01:00
Aleksander Grygier
51a84efc53
webui: Conversation forking + branching improvements ( #21021 )
...
* refactor: Make `DialogConfirmation` extensible with children slot
* feat: Add conversation forking logic
* feat: Conversation forking UI
* feat: Update delete/edit dialogs and logic for forks
* refactor: Improve Chat Sidebar UX and add MCP Servers entry
* refactor: Cleanup
* feat: Update message in place when editing leaf nodes
* chore: Cleanup
* chore: Cleanup
* chore: Cleanup
* chore: Cleanup
* chore: Cleanup
* chore: Cleanup
* refactor: Post-review improvements
* chore: update webui build output
* test: Update Storybook test
* chore: update webui build output
* chore: update webui build output
2026-03-28 13:38:15 +01:00
Adrien Gallouët
b0f0dd3e51
vendor : update cpp-httplib to 0.40.0 ( #21100 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
2026-03-28 08:59:44 +01:00
Ruben Ortlam
0eb4764182
vulkan: add noncontiguous GLU support ( #21081 )
...
* vulkan: add noncontiguous GLU support
* fix compile issue
2026-03-28 08:44:56 +01:00
Piotr Wilkin (ilintar)
1f5d15e665
common/parser: fix reasoning whitespace bugs + extra parser tests ( #21085 )
...
* fix whitespace reasoning issues + add reconstruction tests
* Proper fix
* fix Nemotron autoparser test expectations to include newline in marker
2026-03-28 07:29:26 +01:00
Sigbjørn Skjæret
c46758d28f
cli : add /glob command ( #21084 )
...
* add /glob command
* output error when max files reached
* support globbing outside curdir
2026-03-28 02:33:04 +01:00
Ts-sound
bf934f28db
docker : fix and enable ARM64 image build ( #20929 )
...
* CI: fix ARM64 image build error & enable compilation
* Update .github/workflows/docker.yml
Co-authored-by: Aaron Teo <taronaeo@gmail.com >
* CI: revert ggml/src/ggml-cpu/CMakeLists.txt
* Update .github/workflows/docker.yml
Co-authored-by: Aaron Teo <taronaeo@gmail.com >
* CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04")
* CI: change cpu.Dockerfile gcc to 14;
* CI : cpu.Dockerfile , update pip install .
* Update .github/workflows/docker.yml
Co-authored-by: Aaron Teo <taronaeo@gmail.com >
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com >
2026-03-28 01:45:09 +01:00
Adrien Gallouët
5c1a7b8355
server : add custom socket options to disable SO_REUSEPORT ( #21056 )
...
* server : add custom socket options to disable SO_REUSEPORT
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
* Add --reuse-port
$ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
$ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
* Update tools/server/README.md (llama-gen-docs)
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
* Fix windows
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
2026-03-28 01:12:43 +01:00
Aldehir Rojas
59d840209a
common : inhibit lazy grammar sampler while reasoning is active ( #20970 )
...
* common : inhibit grammar while reasoning budget is active
* cont : update force_pos in accept
* cont : fix tests
* cont : tweak should apply logic
* cont : return early not using grammar sampler
* Add tests
* cont : prevent backend sampling when reasoning budget enabled
* cont : fix typo
---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com >
2026-03-27 18:30:40 +01:00
Kusha Gharahi
ff934e29bc
server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui ( #20158 )
...
* introduce LLAMA_SERVER_NO_WEBUI
* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI
* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE
* MIssed this
* Add useWebUi to package.nix
2026-03-27 17:25:55 +01:00
Yiwei Shao
ee051c1e4e
hexagon: support for IQ4_NL and MXFP4 ( #21018 )
...
* ggml-hexagon: add IQ4_NL and MXFP4 HMX matmul support
- Add IQ4_NL quantization type support to Hexagon backend (buffer
set/get tensor repack, mul_mat, mul_mat_id dispatch)
- Implement HVX IQ4_NL vec_dot kernels (1x1, 2x1, 2x2) with
LUT-based 4-bit index to int8 kvalue dequantization
- Add MXFP4 HMX dequantization path with E8M0 scale conversion,
including batch-4 fast path and single-tile fallback
- Unify quantized row size / scale offset logic to handle Q4_0,
Q8_0, IQ4_NL, and MXFP4 in the DMA fetch path
* ggml-hexagon: fix SKIP_QUANTIZE src1 address mismatch in mixed-quant models
* Fix the pragma indent
2026-03-27 09:22:41 -07:00
Aleksander Grygier
e6f6770515
webui: Improve Chat Messages initial scroll + auto-scroll logic + add lazy loading with transitions to content blocks ( #20999 )
...
* refactor: Always use agentic content renderer for Assistant Message
* feat: Improve initial scroll + auto-scroll logic + implement fade in action for content blocks
* chore: update webui build output
2026-03-27 17:01:36 +01:00
AN Long
48cda24c11
server: remove the verbose_prompt parameter ( #21059 )
...
* server: respect the verbose_prompt parameter
* Revert "server: respect the verbose_prompt parameter"
This reverts commit 8ed885cf375b2c8ba641c661f3667df70b9797f4.
* Remove --verbose-prompt parameter from llama-server
* Using set_examples instead of set_excludes
2026-03-27 13:36:13 +02:00
Xuan-Son Nguyen
871f1a2d2f
mtmd: add more sanity checks ( #21047 )
2026-03-27 11:00:52 +01:00
Xuan-Son Nguyen
20197b6fe3
server: add built-in tools backend support ( #20898 )
...
* wip: server_tools
* refactor
* displayName -> display_name
* snake_case everywhere
* rm redundant field
* change arg to --tools all
* add readme mention
* llama-gen-docs
2026-03-27 10:07:11 +01:00
Radoslav Gerganov
ba38f3becc
rpc : proper handling of data pointers to CPU buffers ( #21030 )
...
The compute graph may contain tensors pointing to CPU buffers. In these
cases the buffer address is serialized as 0 and sent over the wire.
However, the data pointer is serialized as-is and this prevents proper
validation on the server side. This patches fixes this by serializing
the data pointer as 0 for non-RPC buffers and doing proper validation on
the server side.
closes : #21006
2026-03-27 10:59:35 +02:00
mtmcp
37f230dd7c
completion : session_tokens insert range in completion tool (no-op → correct) ( #20917 )
...
The embd.begin(), embd.begin() range is empty and inserts nothing, so session_tokens never gets updated after
decoding. Should be embd.begin(), embd.end(). Introduced in commit 2b6dfe8 .
2026-03-27 09:25:58 +01:00
mtmcp
a308e584ca
completion : Fix segfault on model load failure ( #21049 )
2026-03-27 10:01:13 +02:00
Pascal
d0fa2c9fbb
Send reasoning content back to the model across turns via the reasoning_content API field ( #21036 )
...
* webui: send reasoning_content back to model in context
Preserve assistant reasoning across turns by extracting it from
internal tags and sending it as a separate reasoning_content field
in the API payload. The server and Jinja templates handle native
formatting (e.g. <think> tags for Qwen, GLM, DeepSeek...).
Adds "Exclude reasoning from context" toggle in Settings > Developer
(off by default, so reasoning is preserved). Includes unit tests.
* webui: add syncable parameter for excludeReasoningFromContext
* chore: update webui build output
2026-03-27 08:17:35 +01:00
ren
9bcb4eff4d
metal : Fix dimension constraint violation in matmul2d descriptor ( #21048 )
...
Updates Metal tensor API test probe to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).
2026-03-27 09:05:21 +02:00
KokerZhou
6861f6509a
CANN: update docker images to 8.5.0 and improve CANN.md ( #20801 )
...
* cann: update docker images to 8.5.0
- bump CANN base image from 8.3.rc2 to 8.5.0
- bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0
Move to newer stable releases.
* cann: update CANN.md
* Update CANN.md to include BF16 support
Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions.
* Fix formatting issues in CANN.md
Fix 234: Trailing whitespace
2026-03-27 08:53:00 +08:00
Saba Fallah
1743d98057
mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr ( #21027 )
...
* mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr
* Update src/llama-quant.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
2026-03-27 00:07:55 +01:00
uvos
7ca0c9cca7
hip: use fnuz fp8 for conversion on CDNA3 ( #21040 )
2026-03-26 23:06:33 +01:00
Xuan-Son Nguyen
8c60b8a2be
ci: pin external actions to exact commit SHA ( #21033 )
2026-03-26 20:44:00 +01:00
Adrien Gallouët
287b5b1eab
common : add getpwuid fallback for HF cache when HOME is not set ( #21035 )
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
2026-03-26 20:34:23 +01:00
Xuan-Son Nguyen
a73bbd5d92
mtmd: refactor image preprocessing ( #21031 )
...
* mtmd: refactor image pre-processing
* correct some places
* correct lfm2
* fix deepseek-ocr on server
* add comment to clarify about mtmd_image_preprocessor_dyn_size
2026-03-26 19:49:20 +01:00
lhez
ded446b34c
opencl: allow large buffer for adreno ( #20997 )
2026-03-26 08:52:21 -07:00
Michael Wand
f8d4abae86
convert : support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales ( #20505 )
...
* convert : fix Qwen3.5 NVFP4 conversion
* Updated copilot concerns and rebased
* move into _LinearAttentionVReorderBase and simplify
* --flake
* new_name not needed
* Added input_scale to gguf
* Fixed input_scale addition as tensor
* Added input scale to loader and named _in_s
* Update convert_hf_to_gguf.py
Re-removed input_scale from aux cleanup
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
2026-03-26 16:52:06 +01:00
Pavel Zloi
3d5acab3e7
convert : add RuGPT3XL (RuGPT3XLForCausalLM) support ( #21011 )
...
* Support of ruGPT3XL model added
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* chkhsh for ruGPT3XL model added
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
* Fixing chkhsh for ruGPT3XL, rerun updated and _qkv_parts in RuGPT3XLModel
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com >
2026-03-26 16:49:09 +01:00