Commit Graph

8719 Commits

Author SHA1 Message Date
Ruben Ortlam 4cabbe36e0 state 2026-04-09 13:00:31 +02:00
Ruben Ortlam 9f001cae27 state 2026-04-09 12:51:43 +02:00
Ruben Ortlam 88335c0490 state 2026-04-09 12:39:51 +02:00
Ruben Ortlam 204023c897 state 2026-04-09 12:36:15 +02:00
Ruben Ortlam d88d722fc1 state 2026-04-09 12:32:08 +02:00
Ruben Ortlam 96d9516329 state 2026-04-09 12:25:27 +02:00
Ruben Ortlam 8a108eddb4 state 2026-04-09 12:05:15 +02:00
Ruben Ortlam 47dde34e00 state 2026-04-09 11:58:46 +02:00
Ruben Ortlam 8d0e158076 state 2026-04-09 11:51:39 +02:00
Ruben Ortlam aade0f81dd state 2026-04-09 11:42:50 +02:00
Ruben Ortlam 700270239d state 2026-04-09 11:24:21 +02:00
Ruben Ortlam ddaafa3dc1 state 2026-04-09 11:11:17 +02:00
Ruben Ortlam e5e0be0add state 2026-04-09 11:00:36 +02:00
Ruben Ortlam 3c4eae7dc9 state 2026-04-09 07:50:05 +02:00
Ruben Ortlam 7e2799c8c9 state 2026-04-09 07:40:02 +02:00
Ruben Ortlam cd0722594a state 2026-04-09 07:25:33 +02:00
Martin Klacer 5c4aae66e1 devops: kleidiai: provide KleidiAI-Enabled ARM Release Artifact (#21259)
* Unified macOS release setup with strategy-matrix block
 * Added KleidiAI arm64 macOS release definition


Change-Id: I05520889ffc646488a178d06817a17f29274465a

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
b8703
2026-04-08 13:06:12 +08:00
Aman Gupta c5ce4bc227 CUDA: make cuda graphs props check faster (#21472)
* CUDA: compute fast hash instead of expensive props check

* use seen node

* use memcp
b8702
2026-04-08 09:05:51 +08:00
iacopPBK 66c4f9ded0 ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (#21168)
* ds_read_b128 for q4_0 and q4_1 mmq kernels

     Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, its faster on both.

* Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation

* Explicit for loop in mmq, renamed vec into tmp

* Fixed max_cpy usage in the loading loop

* Fixed typo in q4_1 kernel

* Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Renoved trailing white line 500

* Update mmq.cuh removed other whitelines

* Remove trailing whitespaces

---------

Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: iacopPBK <iacop@deneb.com>
b8701
2026-04-07 21:47:42 +02:00
Daniel Bevenius 93bdc61563 gguf-py : fix missing comma after bad merge in tensor-mapping (#21558)
This commit adds a missing comma in the vision encoder attention qkv
block.

The motivation for this change is that without the comma there will be
a string concatenation of the Kimi-K2.5 and the Nemotron Nano v2 VL
tensor mappings which will be broken.
2026-04-07 21:24:25 +02:00
Georgi Gerganov 4eb19514dd kv-cache : support attention rotation for heterogeneous iSWA (#21513)
* kv-cache : support attention rotation for heterogeneous iSWA

* cont : remove assert
b8699
2026-04-07 20:31:28 +03:00
Reese Levine 957d717ce5 ggml-webgpu: parameterize submission size and add iOS specific limits (#21533)
* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs

* Start work on removing parameter buffer pools

* Simplify and optimize further

* simplify profile futures

* Fix stride

* Try using a single command buffer per batch

* formatting

* Add parameters for different browsers in-flight submissions

* Update handling of batch size too

* Throttle ios as much as possible

* Increase timeout for llvm-pipe testing
b8698
2026-04-07 20:30:01 +03:00
Aman Gupta de1aa6fa73 CUDA: check for buffer overlap before fusing (#21566)
* CUDA: check for buffer overlap before fusing

* use ggml_cuda_check_fusion_memory_ranges
b8697
2026-04-08 00:57:04 +08:00
Aaron Teo 69c28f1547 llama-server: fix model params not propagated (#21509)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b8696
2026-04-07 21:39:41 +08:00
Son H. Nguyen 0d049d6a92 unicode : add custom Qwen2 regex handler to fix segfault on long input (#21257)
* unicode : add custom Qwen2 regex handler to fix segfault on long input

std::regex uses recursive backtracking internally, which causes a stack
overflow (segfault) when tokenizing long sequences of repeated characters
(e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in
the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to
the std::regex fallback path instead of using a custom handler.

Add unicode_regex_split_custom_qwen2() following the established pattern
used by gpt2, llama3, kimi_k2, and afmoe custom handlers.

Closes: https://github.com/ggml-org/llama.cpp/issues/21113

* cont : remove TODO comment

* cont : update comment to reflect original regex

* use the correct regex in the comment this time... [no ci]

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-04-07 16:13:38 +03:00
Johannes Gäßler a8ec0df461 llama: remove per-arch tensor name lists (#21531) b8694 2026-04-07 15:02:03 +02:00
Georgi Gerganov e8f5082697 server : fix restore for checkpoints with pos_min == 0 (#21510) b8693 2026-04-07 15:29:17 +03:00
Georgi Gerganov 22fc79134e ggml : deprecate GGML_OP_ADD1 (#21363)
* ggml : deprecate GGML_OP_ADD1

* cont : remove tests

* cont : re-enable vulkan check
b8692
2026-04-07 15:28:27 +03:00
Tom Overlund 2a619f6fbc ggml: Vulkan build, Linux -- output error string for errno on fork failure (#20868) (#20904) b8691 2026-04-07 13:54:55 +02:00
mkoker edd4d9bca5 vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (#21029)
Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL
in the flash attention base shader. Register them in the shader
generator, pipeline creation, and enable in the scalar/coopmat1 FA
support check.
b8690
2026-04-07 13:41:29 +02:00
Aldehir Rojas 482192f12d webui : store reasoning_content so it is sent back in subsequent requests (#21249) 2026-04-07 13:32:44 +02:00
Antoine Viallon 71a81f6fcc ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) (#21519)
GGML_CUDA_CC_CDNA2 was set to 0x910
Fix by setting the constant to 0x90a to match the actual gfx90a ISA.
b8688
2026-04-07 12:18:55 +02:00
Aleksander Grygier ecce0087da fix: Detect streaming state in reasoning content blocks (#21549) 2026-04-07 12:04:41 +02:00
Kabir08 d1f82e382d Fix rtl text rendering (#21382)
* Fix Arabic RTL text rendering in web UI

- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Ensure bidirectional text support for mixed Arabic/English content

* Clean up commented duplicate function

Remove the commented-out duplicate transformMdastNode function
that was left over from refactoring.

* Fix Arabic RTL text rendering in web UI

- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Minor code formatting improvements

This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI.

* Implement rehype plugin for comprehensive RTL text support

- Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children
- Replace DOMParser-based approach with efficient HAST tree processing
- Remove hardcoded element lists for better maintainability
- Ensure proper bidirectional text rendering for mixed RTL/LTR content

* Fix RTL text rendering with rehype plugin and cleanup

* fix: prettier formatting
2026-04-07 11:37:20 +02:00
PMZFX 0988accf82 [SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527)
Extend the existing reorder optimization to Q8_0. The reorder
separates scale factors from weight data for coalesced memory
access -- was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing.

On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x)
on Qwen3.5-27B. BW utilization: 21% -> 66%.

The key fix beyond the kernels: Q8_0 was missing from the type
check in ggml_backend_sycl_buffer_init_tensor() that allocates
the extra struct carrying the reorder flag -- so the optimization
was silently skipped.

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.

Fixes: #21517
b8685
2026-04-07 16:12:49 +08:00
Dmytro Romanov 0033f53a07 docs: fix typo in build.md (emdawbwebgpu -> emdawnwebgpu) (#21518) b8684 2026-04-07 12:37:26 +08:00
Masashi Yoshimura d0a6dfeb28 ggml-webgpu: Add the support of MUL_MAT_ID (#21147)
* Add mul_mat_id support to WebGPU

* Apply suggestion from @reeselevine

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
b8683
2026-04-06 13:08:46 -07:00
Pasha Khosravi 2e1f0a889e ggml: add Q1_0 1-bit quantization support (CPU) (#21273)
* ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU)

* add generic fallback for x86

* remove Q1_0 (group size 32)

* rename Q1_0_g128 => Q1_0

* fix Q1_0 LlamaFileType Enum

* Fix trailing spaces; add generic fallback for othre backends

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix /r/n spacing + arch-fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8682
2026-04-06 20:55:21 +02:00
Bipin Yadav 506200cf8b cli: fix stripping of \n in multiline input (#21485)
* llama-cli: fix stripping of \n in multiline input

* Change & string to string_view

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix EditorConfig linter error

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8681
2026-04-06 20:54:06 +02:00
Gaurav Garg 15f786e658 [CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159)
* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized and more optimized kernel for cases where nblocks_stream_k is multiple of ntiles_dst.
Make nblocks_stream_k to multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs

* Address review comments

* Address review comments

* Revert variable names to original
b8680
2026-04-06 20:34:29 +02:00
Aman Gupta 94ca829b60 llama-bench: add -fitc and -fitt to arguments (#21304)
* llama-bench: add `-fitc` and `-fitt` to arguments

* update README.md

* address review comments

* update compare-llama-bench.py
b8679
2026-04-06 22:26:02 +08:00
Aldehir Rojas 4aa962e2b0 vocab : add byte token handling to BPE detokenizer for Gemma4 (#21488) b8678 2026-04-06 09:08:37 -05:00
Sigbjørn Skjæret 941146b3f1 convert : fix block_ff_dim retrieval for lfm2 (#21508) 2026-04-06 14:05:18 +02:00
lainon1 482d862bcb server : handle unsuccessful sink.write in chunked stream provider (#21478)
Check the return value of sink.write() in the chunked content provider
and return false when the write fails, matching cpp-httplib's own
streaming contract. This prevents logging chunks as sent when the sink
rejected them and properly aborts the stream on connection failure.
b8676
2026-04-06 14:03:02 +02:00
Xuan-Son Nguyen 3979f2bb08 docs: add hunyuan-ocr gguf, also add test [no ci] (#21490) 2026-04-06 14:02:37 +02:00
Georgi Gerganov 400ac8e194 convert : set "add bos" == True for Gemma 4 (#21500)
* convert : set "add bos" == True for Gemma 4

* cont : handle old GGUFs
2026-04-06 13:52:07 +03:00
Neo Zhang f51fd36d79 sycl : handle other FA case (#21377) 2026-04-06 13:28:00 +03:00
Yarden Tal 25eec6f327 hexagon: slight optimization for argosrt output init (#21463) b8672 2026-04-05 18:30:25 -07:00
anchortense 58190cc84d llama : correct platform-independent loading of BOOL metadata (#21428)
* model-loader : fix GGUF bool array conversion

* model-loader : fix remaining GGUF bool pointer uses
b8671
2026-04-06 01:40:38 +02:00
Richard Davison af76639f72 model : add HunyuanOCR support (#21395)
* HunyuanOCR: add support for text and vision models

- Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge
- Add separate HUNYUAN_OCR chat template (content-before-role format)
- Handle HunyuanOCR's invalid pad_token_id=-1 in converter
- Fix EOS/EOT token IDs from generation_config.json
- Support xdrope RoPE scaling type
- Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.)
- Register HunYuanVLForConditionalGeneration for both text and mmproj conversion

* fix proper mapping

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* address comments

* update

* Fix typecheck

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8670
2026-04-05 23:32:14 +02:00