llama.cpp-rocm

Author	SHA1	Message	Date
Ruben Ortlam	4cabbe36e0	state	2026-04-09 13:00:31 +02:00
Ruben Ortlam	9f001cae27	state	2026-04-09 12:51:43 +02:00
Ruben Ortlam	88335c0490	state	2026-04-09 12:39:51 +02:00
Ruben Ortlam	204023c897	state	2026-04-09 12:36:15 +02:00
Ruben Ortlam	d88d722fc1	state	2026-04-09 12:32:08 +02:00
Ruben Ortlam	96d9516329	state	2026-04-09 12:25:27 +02:00
Ruben Ortlam	8a108eddb4	state	2026-04-09 12:05:15 +02:00
Ruben Ortlam	47dde34e00	state	2026-04-09 11:58:46 +02:00
Ruben Ortlam	8d0e158076	state	2026-04-09 11:51:39 +02:00
Ruben Ortlam	aade0f81dd	state	2026-04-09 11:42:50 +02:00
Ruben Ortlam	700270239d	state	2026-04-09 11:24:21 +02:00
Ruben Ortlam	ddaafa3dc1	state	2026-04-09 11:11:17 +02:00
Ruben Ortlam	e5e0be0add	state	2026-04-09 11:00:36 +02:00
Ruben Ortlam	3c4eae7dc9	state	2026-04-09 07:50:05 +02:00
Ruben Ortlam	7e2799c8c9	state	2026-04-09 07:40:02 +02:00
Ruben Ortlam	cd0722594a	state	2026-04-09 07:25:33 +02:00
Martin Klacer	5c4aae66e1	devops: kleidiai: provide KleidiAI-Enabled ARM Release Artifact (#21259) * Unified macOS release setup with strategy-matrix block * Added KleidiAI arm64 macOS release definition Change-Id: I05520889ffc646488a178d06817a17f29274465a Signed-off-by: Martin Klacer <martin.klacer@arm.com> b8703	2026-04-08 13:06:12 +08:00
Aman Gupta	c5ce4bc227	CUDA: make cuda graphs props check faster (#21472) * CUDA: compute fast hash instead of expensive props check * use seen node * use memcp b8702	2026-04-08 09:05:51 +08:00
iacopPBK	66c4f9ded0	ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (#21168 ) * ds_read_b128 for q4_0 and q4_1 mmq kernels Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, its faster on both. * Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation * Explicit for loop in mmq, renamed vec into tmp * Fixed max_cpy usage in the loading loop * Fixed typo in q4_1 kernel * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Renoved trailing white line 500 * Update mmq.cuh removed other whitelines * Remove trailing whitespaces --------- Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: iacopPBK <iacop@deneb.com> b8701	2026-04-07 21:47:42 +02:00
Daniel Bevenius	93bdc61563	gguf-py : fix missing comma after bad merge in tensor-mapping (#21558 ) This commit adds a missing comma in the vision encoder attention qkv block. The motivation for this change is that without the comma there will be a string concatenation of the Kimi-K2.5 and the Nemotron Nano v2 VL tensor mappings which will be broken.	2026-04-07 21:24:25 +02:00
Georgi Gerganov	4eb19514dd	kv-cache : support attention rotation for heterogeneous iSWA (#21513) * kv-cache : support attention rotation for heterogeneous iSWA * cont : remove assert b8699	2026-04-07 20:31:28 +03:00
Reese Levine	957d717ce5	ggml-webgpu: parameterize submission size and add iOS specific limits (#21533 ) * Work towards removing bitcast * Move rest of existing types over * Add timeout back to wait and remove synchronous set_tensor/memset_tensor * move to unpackf16 for wider compatibility * cleanup * Remove deadlock condition in free_bufs * Start work on removing parameter buffer pools * Simplify and optimize further * simplify profile futures * Fix stride * Try using a single command buffer per batch * formatting * Add parameters for different browsers in-flight submissions * Update handling of batch size too * Throttle ios as much as possible * Increase timeout for llvm-pipe testing b8698	2026-04-07 20:30:01 +03:00
Aman Gupta	de1aa6fa73	CUDA: check for buffer overlap before fusing (#21566) * CUDA: check for buffer overlap before fusing * use ggml_cuda_check_fusion_memory_ranges b8697	2026-04-08 00:57:04 +08:00
Aaron Teo	69c28f1547	llama-server: fix model params not propagated (#21509) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b8696	2026-04-07 21:39:41 +08:00
Son H. Nguyen	0d049d6a92	unicode : add custom Qwen2 regex handler to fix segfault on long input (#21257 ) * unicode : add custom Qwen2 regex handler to fix segfault on long input std::regex uses recursive backtracking internally, which causes a stack overflow (segfault) when tokenizing long sequences of repeated characters (e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to the std::regex fallback path instead of using a custom handler. Add unicode_regex_split_custom_qwen2() following the established pattern used by gpt2, llama3, kimi_k2, and afmoe custom handlers. Closes: https://github.com/ggml-org/llama.cpp/issues/21113 * cont : remove TODO comment * cont : update comment to reflect original regex * use the correct regex in the comment this time... [no ci] --------- Co-authored-by: Aldehir Rojas <hello@alde.dev>	2026-04-07 16:13:38 +03:00
Johannes Gäßler	a8ec0df461	llama: remove per-arch tensor name lists (#21531) b8694	2026-04-07 15:02:03 +02:00
Georgi Gerganov	e8f5082697	server : fix restore for checkpoints with pos_min == 0 (#21510) b8693	2026-04-07 15:29:17 +03:00
Georgi Gerganov	22fc79134e	ggml : deprecate GGML_OP_ADD1 (#21363) * ggml : deprecate GGML_OP_ADD1 * cont : remove tests * cont : re-enable vulkan check b8692	2026-04-07 15:28:27 +03:00
Tom Overlund	2a619f6fbc	ggml: Vulkan build, Linux -- output error string for errno on fork failure (#20868) (#20904) b8691	2026-04-07 13:54:55 +02:00
mkoker	edd4d9bca5	vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (#21029) Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL in the flash attention base shader. Register them in the shader generator, pipeline creation, and enable in the scalar/coopmat1 FA support check. b8690	2026-04-07 13:41:29 +02:00
Aldehir Rojas	482192f12d	webui : store reasoning_content so it is sent back in subsequent requests (#21249)	2026-04-07 13:32:44 +02:00
Antoine Viallon	71a81f6fcc	ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) (#21519) GGML_CUDA_CC_CDNA2 was set to 0x910 Fix by setting the constant to 0x90a to match the actual gfx90a ISA. b8688	2026-04-07 12:18:55 +02:00
Aleksander Grygier	ecce0087da	fix: Detect streaming state in reasoning content blocks (#21549)	2026-04-07 12:04:41 +02:00
Kabir08	d1f82e382d	Fix rtl text rendering (#21382 ) * Fix Arabic RTL text rendering in web UI - Add dir='auto' attributes to markdown containers and blocks - Implement post-processing to add dir='auto' to all text elements - Replace directional CSS properties with logical properties for proper RTL list alignment - Ensure bidirectional text support for mixed Arabic/English content * Clean up commented duplicate function Remove the commented-out duplicate transformMdastNode function that was left over from refactoring. * Fix Arabic RTL text rendering in web UI - Add dir='auto' attributes to markdown containers and blocks - Implement post-processing to add dir='auto' to all text elements - Replace directional CSS properties with logical properties for proper RTL list alignment - Minor code formatting improvements This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI. * Implement rehype plugin for comprehensive RTL text support - Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children - Replace DOMParser-based approach with efficient HAST tree processing - Remove hardcoded element lists for better maintainability - Ensure proper bidirectional text rendering for mixed RTL/LTR content * Fix RTL text rendering with rehype plugin and cleanup * fix: prettier formatting	2026-04-07 11:37:20 +02:00
PMZFX	0988accf82	[SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527 ) Extend the existing reorder optimization to Q8_0. The reorder separates scale factors from weight data for coalesced memory access -- was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing. On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x) on Qwen3.5-27B. BW utilization: 21% -> 66%. The key fix beyond the kernels: Q8_0 was missing from the type check in ggml_backend_sycl_buffer_init_tensor() that allocates the extra struct carrying the reorder flag -- so the optimization was silently skipped. AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware. Fixes: #21517 b8685	2026-04-07 16:12:49 +08:00
Dmytro Romanov	0033f53a07	docs: fix typo in build.md (emdawbwebgpu -> emdawnwebgpu) (#21518) b8684	2026-04-07 12:37:26 +08:00
Masashi Yoshimura	d0a6dfeb28	ggml-webgpu: Add the support of `MUL_MAT_ID` (#21147) * Add mul_mat_id support to WebGPU * Apply suggestion from @reeselevine --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com> b8683	2026-04-06 13:08:46 -07:00
Pasha Khosravi	2e1f0a889e	ggml: add Q1_0 1-bit quantization support (CPU) (#21273 ) * ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU) * add generic fallback for x86 * remove Q1_0 (group size 32) * rename Q1_0_g128 => Q1_0 * fix Q1_0 LlamaFileType Enum * Fix trailing spaces; add generic fallback for othre backends * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix /r/n spacing + arch-fallback --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8682	2026-04-06 20:55:21 +02:00
Bipin Yadav	506200cf8b	cli: fix stripping of \n in multiline input (#21485 ) * llama-cli: fix stripping of \n in multiline input * Change & string to string_view * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix EditorConfig linter error --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8681	2026-04-06 20:54:06 +02:00
Gaurav Garg	15f786e658	[CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159 ) * Write an optimized flash_attn_stream_k_fixup kernel Write a specialized and more optimized kernel for cases where nblocks_stream_k is multiple of ntiles_dst. Make nblocks_stream_k to multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst * Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs * Address review comments * Address review comments * Revert variable names to original b8680	2026-04-06 20:34:29 +02:00
Aman Gupta	94ca829b60	llama-bench: add `-fitc` and `-fitt` to arguments (#21304) * llama-bench: add `-fitc` and `-fitt` to arguments * update README.md * address review comments * update compare-llama-bench.py b8679	2026-04-06 22:26:02 +08:00
Aldehir Rojas	4aa962e2b0	vocab : add byte token handling to BPE detokenizer for Gemma4 (#21488) b8678	2026-04-06 09:08:37 -05:00
Sigbjørn Skjæret	941146b3f1	convert : fix block_ff_dim retrieval for lfm2 (#21508)	2026-04-06 14:05:18 +02:00
lainon1	482d862bcb	server : handle unsuccessful sink.write in chunked stream provider (#21478) Check the return value of sink.write() in the chunked content provider and return false when the write fails, matching cpp-httplib's own streaming contract. This prevents logging chunks as sent when the sink rejected them and properly aborts the stream on connection failure. b8676	2026-04-06 14:03:02 +02:00
Xuan-Son Nguyen	3979f2bb08	docs: add hunyuan-ocr gguf, also add test [no ci] (#21490)	2026-04-06 14:02:37 +02:00
Georgi Gerganov	400ac8e194	convert : set "add bos" == True for Gemma 4 (#21500) * convert : set "add bos" == True for Gemma 4 * cont : handle old GGUFs	2026-04-06 13:52:07 +03:00
Neo Zhang	f51fd36d79	sycl : handle other FA case (#21377)	2026-04-06 13:28:00 +03:00
Yarden Tal	25eec6f327	hexagon: slight optimization for argosrt output init (#21463) b8672	2026-04-05 18:30:25 -07:00
anchortense	58190cc84d	llama : correct platform-independent loading of BOOL metadata (#21428) * model-loader : fix GGUF bool array conversion * model-loader : fix remaining GGUF bool pointer uses b8671	2026-04-06 01:40:38 +02:00
Richard Davison	af76639f72	model : add HunyuanOCR support (#21395 ) * HunyuanOCR: add support for text and vision models - Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge - Add separate HUNYUAN_OCR chat template (content-before-role format) - Handle HunyuanOCR's invalid pad_token_id=-1 in converter - Fix EOS/EOT token IDs from generation_config.json - Support xdrope RoPE scaling type - Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.) - Register HunYuanVLForConditionalGeneration for both text and mmproj conversion * fix proper mapping * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * address comments * update * Fix typecheck * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8670	2026-04-05 23:32:14 +02:00

8719 Commits