llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-29 17:17:40 +02:00

Author	SHA1	Message	Date
Aman Gupta	8c146a8366	DeepSeek V4 (#24162 ) * convert: add dsv4 conversion * add basic setup * add llm_graph_input_dsv4 * add save-load state * add sinkhorn eps - correction by @fairydreaming * add rope fix * cleanup dead code * fix bugs * support pro model: added by @fairydreaming * remove redundant V cache * Chat template * remove debugging leftovers * Add mechanism for inlining templates based on architecture * s/deepseek-v4-flash/deepseek4/g * s/deepseek-v4-flash/deepseek4/g continued * enable graph reuse * enable FA * fix test llama archs * rename * compatibility with antirez ds4 GGUFs * simplified set_gguf_parameters() by calling super class method, replaced moe.score_func with expert_gating_func. * reserve worst-case kv-cache * revert max split inputs * address review comments * add padding to enable FA * pad only the final value of plan.n_kv to 256 * remove built-in cpp chat template * cont: remove cpp built-in template * rm outdated test * replace ggml_view_3d() with ggml_reshape_3d() Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * only support n_seq=1 for now * remove unused var * cont: remove unused var * use scale bias * use correct ptr for can_reuse * remove gen-chat-inline-templates.py * simplify graph reuse * cont: cleanup * remove unused inputs * enable partial checkpointing * add correct shape for kq_mask + set llama_model_n_swa to 0 for dsv4 * precompute source_idx + add comment about dummy write * support multi-seq * remove restored_trim_pos * use split_equal when possible * fix indent * address review comments * use LLM_KV * fix ci --------- Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com> Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b9840	2026-06-29 16:58:51 +08:00
seryogakovalyov	6cb18b2f2e	tools/ui: restore Tailwind scanning in ignored worktrees (#24879 ) b9839	2026-06-29 10:55:52 +02:00
o7si	277a105dc8	common : remove unused regex-partial (#25118 ) b9838	2026-06-29 08:48:39 +02:00
Xuan-Son Nguyen	b3fed31b99	jinja, chat: add --reasoning-preserve flag (#25105 ) * jinja, chat: add --reasoning-preserve flag * correct help message b9837	2026-06-28 23:33:51 +02:00
Aleksander Grygier	dbdaece23d	Revert "ui: fix accessibility for hover-gated interactive elements assisted by claude(in debugging and tests) (#24727 )" (#25098 )	2026-06-28 21:30:03 +02:00
Pascal	7cb8576e7c	ui: fix stop and reasoning skip in single-model mode (#25084 ) b9835	2026-06-28 21:06:43 +02:00
Ruixiang Wang	fa72bc6826	dflash: refactor draft model conversion (#25110 ) * dflash: refactor draft model conversion * apply fix for eagle3 convert	2026-06-28 20:31:48 +02:00
Aldehir Rojas	c818263f2a	chat : implement minicpm5 parser (#24889 ) * Add minicpm5 tool call parser * Refactor MiniCPM5 PEG parser per review feedback * Fix jinja min/max API to match Jinja2 * modify by review * MiniCPM5: use autoparser for XML tool calls and fix grammar preserved-token triggers * MiniCPM5: fix streaming tool-arg placeholder and remove alt XML markers * skip min/max attribute tests in -py mode * test-jinja: use real expected output for min/max attribute tests * MiniCPM5: revert shared mapper and history fallbacks per review Drop streaming tool-arg placeholder workarounds from the generic PEG mapper and restore strict tool-call argument JSON parsing so MiniCPM5 support stays limited to autoparser/diff-analyzer changes. * chat : refactor minicpm5 back to dedicated parser * cont : simplify grammar * cont : refactor * cont : fixes * cont : rename template to openbmb-MiniCPM5-1B.jinja * cont : add message delimiters * cont : fix tests --------- Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: 张涛 <> b9833	2026-06-28 16:53:32 +02:00
Xuan-Son Nguyen	f68a788b0b	jinja: add --dump-prog for debugging (#25086 ) * jinja: add --dump-prog for debugging * Update common/jinja/runtime.cpp Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com> --------- Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com> b9832	2026-06-28 15:50:31 +02:00
Ruixiang Wang	d1b34251bc	spec : add DFlash support (#22105 ) * spec: add DFlash v2 support * dflash: support sliding window attention per layer_types * docs: add dflash section --------- Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> b9831	2026-06-28 16:01:34 +03:00
Adrien Gallouët	c1a1c8ee94	common : allow --offline in llama download (#25091 ) Expose the existing --offline flag to `llama download` so a script can run it to check whether a model is already cached and ready to be served without touching the network. Also fix a latent use-after-free in the URL-task on_done callback: first_path is block-scoped and was captured by reference, but invoked after the block ends. Signed-off-by: Adrien Gallouët <angt@huggingface.co> b9830	2026-06-28 12:34:11 +02:00
Georgi Gerganov	27c8bb4f63	logs : reduce v2 (#25078 ) * server : reduce logs * cont : common * cont : spec * cont : CMN_ -> COM_ b9829	2026-06-28 08:52:15 +03:00
Hongqiang Wang	ebd048fc5e	opencl: flash attention improvement (#25069 ) * opencl: rework FA kernel for f16 and f32 * opencl: flash-attention prefill prepass kernels - flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple - flash_attn_mask_pad_f16 pads the matching mask tile - flash_attn_blk_f16 classifies each KV tile per query block as fully masked / mixed / fully unmasked, so the main kernel can skip fully-masked tiles and the mask lookup for fully-unmasked ones * opencl: FA kernels for q4_0 and q8_0 * opencl: `set_rows` for f32 to q8_0/q4_0 * opencl: dequant kernels for q4_0 and q8_0 * opencl: add FA tile tuning table with override * opencl: wire host side for FA * opencl: q4_0 MoE tensors are also SOA'ed * opencl: cosmetic fix * opencl: refactor, also clarify some code paths in comments * opencl: fix infinity for `-cl-finite-math-only` --------- Co-authored-by: Li He <lih@qti.qualcomm.com> b9828	2026-06-27 15:36:06 -07:00
Gaurav Garg	0ed235ea2c	[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057 ) * [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies. When tensors are not fully contiguous but each row is contiguous, it now uses cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel. This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps. * Add new tests that execute the new optimized strided copy path * Return unsupported for strided copy in OpenVINO, as new tests are failing b9827	2026-06-27 17:46:21 +05:30
Neo Zhang	9bebfcb4bc	sycl : fix failed ut cases of norm (#25044 ) b9826	2026-06-27 12:13:43 +03:00
Ruben Ortlam	0b6529d818	vulkan: fix step operator for 0 input (#25036 ) b9825	2026-06-27 10:57:31 +02:00
Christian Kastner	c299a92c38	binaries : Improve rpc-server and export-graph-ops names. (#25045 ) Tests are generally prefixed with -test, so rename export-graph-ops accordingly. rpc-server is probably too generic a name for /usr/bin. Because it should work with any ggml application, it is renamed to ggml-rpc-server. b9824	2026-06-27 10:31:29 +03:00
Sigbjørn Skjæret	0275c0f800	ci : add windows-openvino to check-release (#25022 ) b9823	2026-06-27 10:30:56 +03:00
Sigbjørn Skjæret	83d385b429	tests : fix test-chat-template --no-common option (#25075 ) b9822	2026-06-27 10:30:19 +03:00
Adrien Gallouët	050ee92d04	app : allow --version, --licenses & --help (#25054 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b9821	2026-06-26 23:18:11 +02:00
Andreas Kieslinger	3fc4e10527	sched : reintroduce less synchronizations during split compute (#20793 ) * CUDA: Improve performance via less synchronizations between token (#17795) * Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async() * Adds function to relax sync requirements between input copies on supported backends (CUDA for now) * Exchanges synchronous copy with async copy function. * Adds macro guards to allow compilation in non-CUDA builds * Reworked backend detection in ggml-backend.cpp to avoid linking conflicts * Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues * Minor cleanup * Makes opt-in to relax use of explicit syncs more general. Backends like vulkan which require a synchronization between HtoD copies and graph execution could also adopt this change now. * Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU. * Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization * Simplifies synchronizations to adhere to `saaasg` pattern. * Apply suggestion from @ggerganov (src->buffer to buf_src) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestion from @ggerganov (src->buffer to buf_src) v2 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestions from @johannesgaessler code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Adds single-GPU synchronizations to multi-GPU settings to fix hip backend pipeline parallel bugs. * Scheduler Hardening: Exclude hip/MUSA from copy_from_host CPU split -> GPU split optimization * Scheduler Hardening: Re-adding original additional synchronizations for non-async backends * Adds disclaimer to hip/musa exclusion of copy_from_host. Highlights that it is out of precaution, but that no perf-impact is visible, and that it can be revisited separately anytime. 
--------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b9820	2026-06-26 17:18:30 +03:00
Adrien Gallouët	5d8ccdf9d1	devops : add llama in all docker images (#25035 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-06-26 15:15:48 +02:00
Xuan-Son Nguyen	024930c6ad	arg: fix handling --spec-draft-hf and --hf-repo-v (#25043 ) * arg: fix handling --spec-draft-hf and --hf-repo-v * fix missing mparams.hf_file	2026-06-26 14:36:03 +02:00
Ravi Panchumarthy	5397c36194	openvino: Update to OV 2026.2.1, self-contained release packages, operator improvements (#24974 ) * Update to OV 2026.2.1, Make OV release packages self-contained * Update to OV 2026.2.1, Make OV release packages self-contained * OpenVINO Backend: Remove compute_op_type hardcoded sets (#222) * OpenVINO Backend: Remove compute_op_type hardcoded sets * revert get_op_type removal * OpenVINO backend: enable softmax with sink input * OpenVINO backend: opt mul_mat_id convert process for large size * OpenVINO backend: Modify add_id to support 2D/4D * OpenVINO Backend: Add glu_swiglu_oai * PR review: fix paths * PR review: fix path consistency --------- Co-authored-by: Mostafa <mostafas.main.email@gmail.com> Co-authored-by: Xuejun <Xuejun.Zhai@intel.com> b9817	2026-06-26 15:07:19 +03:00
Georgi Gerganov	e7ea94afcb	sync : ggml b9816	2026-06-26 15:04:42 +03:00
Georgi Gerganov	96183e9820	ggml : bump version to 0.15.3 (ggml/1550)	2026-06-26 15:04:42 +03:00
nullname	487a6cc164	vulkan: opt mul_mat_vecq for mi50 (#22933 ) b9814	2026-06-26 13:49:24 +02:00
Jiang, Fish	5a6a0dd7e1	vulkan: add INTEL_XE1 arch enum and enable coopmat1 on Intel Xe-LPG Plus (#24404 ) * vulkan: add INTEL_PRE_XE2 arch enum and enable coopmat1 on Intel Xe-LPG Plus (1/3, Xe1-ARLH) Co-authored-by: Xia, Jie <jie.xia@intel.com> Co-authored-by: Liu, Russell <russell.liu@intel.com> * Address comments of bf16 and trailing whitespace * Rename INTEL_PRE_XE2 to INTEL_XE1 and remove driver workaround * Add Windows driver check --------- Co-authored-by: Xia, Jie <jie.xia@intel.com> Co-authored-by: Liu, Russell <russell.liu@intel.com> b9813	2026-06-26 13:26:22 +02:00
Sanjay Ahari	ded1561b42	ui: fix accessibility for hover-gated interactive elements assisted by claude(in debugging and tests) (#24727 )	2026-06-26 12:55:38 +02:00
Jeff Bolz	9df06805ee	vulkan: Workaround compiler bug in conv2d coopmat2 path (#24924 ) * vulkan: Workaround compiler bug in conv2d coopmat2 path * apply same workaround to CONV_3D * Apply suggestion from @jeffbolznv b9811	2026-06-26 11:53:32 +02:00
leonardHONG	2f18fe13c5	CUDA: add cublasSgemmBatched mapping for HIP/MUSA vendor headers (#25033 ) b9810	2026-06-26 11:42:56 +02:00
Tarek Dakhran	c16c35b814	ggml-cpu: fix SVE leftover path in ggml_vec_dot_f32 (#24699 ) * ggml-cpu: fix SVE leftover path in ggml_vec_dot_f32 2D convolutions with kernel size 9 produced different results on SVE enabled ARM devices. After debugging it turned out that ggml_vec_dot_f32 was using data from inactive lanes. Use svmla_f32_m(pg, sum1, ax1, ay1) so inactive lanes retain sum1. * cont : clean-up --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-06-26 10:41:56 +03:00
Pascal	1a87dcdc45	server + ui: SSE Replay Buffer (#23226 ) * server: SSE replay buffer, survives client disconnect Opt in on POST /v1/chat/completions when the client sends X-Stream-Resume: 1 and a non empty X-Conversation-Id. The conv id is the session identity end to end, no extra opaque token. The drain runs detached server side and buffers SSE bytes, the generation survives HTTP disconnect, F5, or lets users switch from iOS Safari to another app without losing the actively generated response. Routes: GET /v1/stream/<conv_id>?from=N replay GET /v1/streams[?conversation_id=X] list, drives sidebar spinners DELETE /v1/stream/<conv_id> Stop, idempotent Router parent fans out to children for list and delete, probes on GET to route to the owner, fans out DELETE on POST so "one session per conv" holds across model swaps. WebUI: the layout snapshots /v1/streams at mount and on visibilitychange, the sidebar reflects live inferences across all convs. The chat page reattaches on mount, append vs fresh is detected from existing content so continue mid stream keeps its prefix. update_slots: on llama_memory_seq_rm refusal at a deep position, full clear of the seq and reprefill from zero instead of GGML_ABORT. OAI strict path unchanged when the opt in headers are absent. * server: create stream session only after post_tasks succeeds * server, ui: drop X-Stream-Resume, X-Conversation-Id alone enables the replay buffer * server: drop magic 17, derive the X-Conversation-Id header length from sizeof at build time * refactor: address review feedback from ngxson * server-context: cleaning * server-stream: fix use-after-free on rd Guard stop_producer with a shared alive flag, flipped by on_stream_end before rd dies. Prevents a late cancel (session eviction by a later POST on the same conv_id, or a DELETE arriving after the producer ended) from touching a destroyed rd. 
* ui: fix cross-conversation contamination Scope streaming flags per conv so one finishing does not unflag the others, guard discoverActiveStream against concurrent runs to avoid duplicate attaches, and stop racing syncRemoteRunningStreams for the sidebar set. * server-http: keep request alive in detached SSE drain The response next() lambda may reach into request via &req long after on_complete reset the request shared_ptr. Capture request in the detached thread so it outlives the drain. ui: address review feedback from coder543 Forward Authorization to /v1/stream and /v1/streams fetches, the resumable routes must obey --api-key like the rest of the API. Wrap reader.read() in a try/catch, the underlying connection drop rejects with TypeError instead of resolving done=true, treat it as a premature end of stream so the existing resume loop kicks in. Freeze the model at session start in chatStreamingStates.model and thread it through cancel and resume, the dropdown selection may have changed since the POST and the server side identity is fixed at that time. * format * ui: remove unused selectedModelName * server-stream: poll session->is_cancelled() in stream_aware_should_stop Address review feedback from coder543. The cancel propagation through rd.stop() relies on the slot eventually processing the cancel task and posting a result that notifies the recv condvar, remove_waiting_task_ids does not notify directly. Add a defensive poll on session->is_cancelled() so the producer-side next() loop exits on its next iteration after cancel() without waiting for the cancel task to round trip through a slot. * server-stream, ui: replace GET /v1/streams with POST /v1/streams/lookup Address review feedback from coder543. Listing live sessions leaks the conversation_id of every concurrent user, which defeats the random UUID unguessability. 
The new route takes {conversation_ids: [...]} in the body and returns matches only for the ids the caller already owns, so foreign UUIDs stay private. The router fans out the same POST to every child and aggregates, the WebUI passes the convs visible in its sidebar. * ui: read conv ids from IndexedDB in syncRemoteRunningStreams The conversations store is not hydrated yet at +layout onMount, so the sidebar spinners stayed off for background convs until the user clicked on them. Read straight from the DB to dodge the init race. * server-models: deduplicate stream lookup timeouts behind one constant * ui: extract visibility kick grace into a stream constant, bump to 1000 ms * make it safer & more simple * server-stream: survive client disconnect via stream_pipe::finish_producer After the RAII rewrite the generation stopped the moment the client disconnected. httplib bails its content provider on the is_peer_alive check at the top of write_content_chunked, so returning true from the provider never keeps it producing: the response resets, rd is destroyed and its task gets cancelled. Reinstate the disconnect survival inside the pipe. stream_pipe gains finish_producer, which pumps the response next() into the ring buffer until the generation ends, and mark_producer_done for the clean wire end. server-http only triggers them: mark before sink.done on a clean close, finish in on_complete when the peer left early. No detach, no stream logic in server-http beyond the trigger, and the strict OAI path is untouched when no pipe is attached. Known limitation: finish_producer pumps synchronously on the http worker, so a disconnected stream keeps its worker busy until the generation ends. A follow-up will move the drain off the http worker so no worker is held. 
* server-stream: drain disconnected streams on a manager owned thread The previous commit pumped the post disconnect drain synchronously in on_complete, on the http worker, so a disconnected stream kept its worker busy until the generation ended. Under a wave of reloads or tab closes that pins workers from the pool. Move the drain off the http worker. on_complete now hands the response to stream_session_manager::adopt_orphan, which pumps it to completion on a manager owned thread and releases the worker at once. One thread per disconnected stream still generating, stored in a list, joined and reaped on the next adopt, by the GC, and at shutdown. No detach, the thread lifecycle is fully owned by the manager. needs_drain gates the handoff so a cleanly finished stream never spawns a thread, and the strict OAI path stays untouched when no pipe is attached. stop_gc now cancels sessions before finalizing them, so an in flight drain sees is_cancelled and exits instead of blocking the shutdown join until the generation ends naturally. * ui: add missing JSDoc * server-stream: drain on the http worker, drop the manager thread Address @ngxson review: httplib runs a large dynamic pool and a worker blocked in next() sits on a condvar instead of burning cpu, so draining the rest of the generation on that worker is fine and much simpler than a dedicated thread. on_complete calls finish_producer directly again. Removes adopt_orphan, the orphan thread list and its reaping, the stop_gc session cancel that only existed to unblock those threads, and the now dead drain_shutdown flag. * server-stream: split stream_pipe into producer and consumer classes Address @ngxson review: one class covering both ends was messy. stream_pipe is now a base holding the session and is_cancelled, with stream_pipe_producer (write, mark_producer_done, finish_producer, cleanup, finalizes on destruct) and stream_pipe_consumer (read only, no finalize) deriving from it. 
Drops the is_producer_ discriminator and its runtime guards, the type now encodes the role. res.spipe is retyped to shared_ptr<stream_pipe_producer> since it is only ever a producer. No behavior change. * server-stream: rename producer methods to unix pipe semantics Address @ngxson review: mark_producer_done becomes done(), finish_producer becomes close(), matching a unix pipe write end. The producer_done_ member follows as done_. write() is unchanged. No behavior change. * server, ui: route resumable streams via a conv map, persist resume identity Address ngxson review: drop the polling probe, proxy_post records a conv_id -> model map and the stream routes resolve the owning child with one lookup. The map is the single source of truth, the ::model suffix stays for child session uniqueness but the router never parses it. UI: the server keys a session by the POST time identity (conv::model), but reload probed with the bare conv id and missed model tagged sessions, so F5 stopped the stream and sidebar spinners stayed off. Persist the model and rebuild the exact identity on resume, single conv and bulk sidebar both send it. Add unit coverage for the identity round trip. * ui: resolve continue target by id to stop cross-conversation flash on switch * ui: skip stream resume when the abort is intentional * server: move the conv id to model map into a self contained tracker Address review from ngxson: server_models held two mutexes side by side, the global one and a bare conv_model_mu guarding a loose map, which made the locking hard to follow. Wrap the map and its lock in a small conv_model_tracker struct that owns its mutex, one mutex per struct. The remember, lookup and forget methods move inline into the tracker, server_models exposes a single conv_models member and the routes call models.conv_models.lookup and friends. No behavior change, the map stays the single source of truth for routing resumable streams to a child. 
* ui: replace stream magic values with enums and shared constants Address review from allozaur: lift the inline literals around the resumable stream code into named symbols so the intent is explicit and reusable. * ui: fold the stream resume and discovery helpers into ChatService Address review from allozaur: drop the two standalone stream-.service files. They were used only by the chat service and store, carried no shared state, and did not follow the static class pattern the other services use, so a separate abstraction was not warranted. Move the helpers onto ChatService as static methods. No behavior change, tests now exercise them through ChatService. docs: document the SSE replay buffer in server README-dev Add the resumable streaming section, list stream_session_manager in the backend component inventory, and link PR 23226 in the related PRs. * ui: align attachServerStream call with onCompletionId param in handleStreamResponse * server-http: rename del_ to del to match get and post * ui: address review feedback from allozaur * ui: drop duplicate SSE constants, keep sse.ts canonical * ui: use svelte:document for the visibilitychange listener address review from allozaur: replace the manual document.addEventListener in onMount with a declarative <svelte:document onvisibilitychange>. svelte handles attach, detach and SSR, so the typeof document guard and the onMount cleanup go away. onMount keeps only the first load snapshot. * server: trim redundant stream drain comments Address review from ngxson * server: balance and clean up stream comments remove redundant comments and tighten the verbose ones across the resumable stream code, keeping the concurrency and lifetime rationale that is not obvious from the code. 
also fix two stale comments in server.cpp and server-models.h that still described the old ::model suffix probe and fan out routing, now replaced by the conv_id -> model map Address review from ngxson * ui: balance and clean up stream comments dedup repeated rationale (frozen conv::model identity, the lookup privacy note, the abort patterns) down to one canonical spot, tighten the verbose blocks, and keep the concurrency and resume-offset reasoning. fix stale comments in stream-identity.ts and chat.service.ts that still described the old loopback probe and fan out routing, now the conv_id -> model map. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-06-26 09:31:29 +02:00
Jassieluo	e7e3f35090	sycl : clamp softmax input to avoid underflow (#24941 )	2026-06-26 15:02:42 +08:00
Xuan-Son Nguyen	b11f7c16bc	mtmd: add more validations (#25013 ) * mtmd: add more validations * fix * refactor a bit * type check for get_arr_int	2026-06-26 08:43:29 +02:00
leonardHONG	f818065d75	CUDA: batch out_prod broadcast (dps2>1) path with cublasSgemmBatched (#24426 )	2026-06-26 08:51:25 +03:00
Arsen Arutunan	960d628f46	mamba2: remove hardcoded 2x expansion factor and invalid d_inner % d_state check (#23082 ) * mamba2: remove hardcoded 2x expansion factor, support any expand value * mamba2: remove invalid d_inner %% d_state check (unrelated parameters) * Update convert_hf_to_gguf.py: make expand optional with default 2 Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * mamba2: apply expand fix to refactored conversion/mamba.py * also check for mamba_expand --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com> b9804	2026-06-26 08:50:54 +03:00
shaofeiqi	5c7c22c3e1	opencl: flush profiling batch at shutdown for incomplete batches (#25016 ) b9803	2026-06-25 18:48:24 -07:00
Sigbjørn Skjæret	beac5309f1	xcframework : disable mtmd video on i/tv/visionos (#25018 ) b9802	2026-06-26 00:13:59 +02:00
Tarek Dakhran	9d5d882d8c	model : Add label for LFM2.5-230M (#25008 )	2026-06-25 18:58:52 +02:00
Oliver Simons	1ec44d178d	CUDA: Various fixes to `cpy.cu` (#25000 ) * Add failing test-case to test-backend-ops Extracted from https://github.com/ggml-org/llama.cpp/issues/24072 * Minimize repro with help of AI N = 8 * (65535 - 1) + 1 = 524273 * Port and adjust workaround from https://github.com/LostRuins/koboldcpp/commit/0ba798341e0c70517cb226cb63c966b086a3b5b3 Fall-back should share code, also relax y-z constraint to be inclusive * Add test-case + fallback also for y dim * Fix x-guards which is 2^{31}-1, so inclusive of INT_MAX * Fix overflow problems for transposed copy kernel	2026-06-25 17:29:23 +02:00
Xuan-Son Nguyen	c7cddefcbd	misc: fix labeler (#25012 )	2026-06-25 17:23:37 +02:00
Xuan-Son Nguyen	e9d1b76d0a	server: use status code 403 for disabled features (#24970 ) * server: use status code 403 for disabled features * cont * fix test case	2026-06-25 16:36:40 +02:00
Xuan-Son Nguyen	099bf06952	misc: update labels (#24920 ) * misc: update labels * bring back examples, add mtmd	2026-06-25 16:26:56 +02:00
Xuan-Son Nguyen	60bc8866b1	common: refactor model handling (#24980 ) * common: refactor models handling * remote preset * cont * rm skip_download option * missing header * fix plan.model_files * fix --offline case * move hf_plan to download * refactor * rm redundant curr_ex, add comments * adapt	2026-06-25 15:17:51 +02:00
Kashif Rasul	e8ecce53b8	docs : Eagle3 qwen3 draft model support (#24977 ) * eagle3: accept Eagle3LlamaForCausalLM draft checkpoints * docs: add eagle3 speculative decoding section * docs: address eagle3 review comments * docs: add more angelslim eagle3 models * docs: add gpt-oss eagle3 models and link to pr 18039	2026-06-25 15:58:00 +03:00
Adrien Gallouët	683b04cc4a	app : add the llama download subcommand (#24982 ) * app : add the download command (with llama-download) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove llama-download tool for now Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-06-25 13:36:36 +02:00
fairydreaming	f728adab68	ggml : address integer overflows in binary ops CUDA implementation (#24706 ) * ggml : address integer overflows in binary ops CUDA implementation * ggml : add size_t casts to avoid integer overflows * ggml : add more asserts checking integer overflows in binary ops CUDA implementation --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-06-25 10:06:44 +02:00
Pascal	3e61ea0e2f	ui: fix always-show-sidebar-on-desktop setting after navigation refactor (#24979 )	2026-06-25 09:45:55 +02:00
Christopher Albert	fdbd6abee2	tests : synchronize contexts at end of test-thread-safety (#24935 ) Assisted-by: Claude	2026-06-25 09:22:51 +03:00
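
Note on 0ed235ea2c (cudaMemcpy2DAsync fast path): the commit message describes strided copies in which each row is contiguous but rows are separated by a gap, so the whole copy is a 2D pitched block copy. The idea can be illustrated on the host with plain memcpy. This is only a sketch under stated assumptions: `copy_2d_pitched` and `pack_rows_example` are hypothetical names, not functions from ggml or CUDA, and the argument order simply follows that of cudaMemcpy2D (without the copy kind).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-side helper (not part of ggml): copies `rows` rows of
// `row_bytes` contiguous bytes each, where consecutive rows start
// `src_pitch` / `dst_pitch` bytes apart in the source / destination.
inline void copy_2d_pitched(void * dst, size_t dst_pitch,
                            const void * src, size_t src_pitch,
                            size_t row_bytes, size_t rows) {
    // A row must fit inside its pitch, otherwise rows would overlap.
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    auto *       d = static_cast<unsigned char *>(dst);
    const auto * s = static_cast<const unsigned char *>(src);
    for (size_t r = 0; r < rows; ++r) {
        // One block copy per row instead of one copy per element.
        std::memcpy(d + r * dst_pitch, s + r * src_pitch, row_bytes);
    }
}

// Example: 3 rows of 4 bytes stored with a stride of 6 bytes per row are
// packed into a dense 3 x 4 buffer; the 2 gap bytes per row are skipped.
inline std::vector<unsigned char> pack_rows_example() {
    std::vector<unsigned char> src(3 * 6);
    for (size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<unsigned char>(i);
    }
    std::vector<unsigned char> dst(3 * 4, 0);
    copy_2d_pitched(dst.data(), 4, src.data(), 6, 4, 3);
    return dst;
}
```

The per-row block copy is what replaces the element-wise scalar copy the commit message calls slow; on the GPU the loop itself would be handled by the pitched-copy call rather than written out.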

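Note on c1a1c8ee94 (latent use-after-free in the on_done callback): the commit message says a block-scoped `first_path` was captured by reference and the callback ran after the block had ended. The pattern, and one common way to avoid it, can be sketched as follows. The `task` struct, `make_tasks` and the path string are hypothetical stand-ins for illustration; the actual fix in the commit may differ.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a task that runs a callback when it completes.
struct task {
    std::function<std::string()> on_done;
};

inline std::vector<task> make_tasks() {
    std::vector<task> tasks;
    {
        std::string first_path = "models/example.gguf"; // block-scoped
        task t;
        // Capturing [&first_path] here would dangle: the string is
        // destroyed at the closing brace, but on_done is called later.
        // Capturing by value gives the callback its own copy instead.
        t.on_done = [first_path]() { return first_path; };
        tasks.push_back(std::move(t));
    }
    // first_path no longer exists here; the callbacks are still valid
    // because each one owns a copy of the string.
    return tasks;
}
```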

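Note on 1a87dcdc45 (SSE replay buffer): the commit message lists a replay route of the form `GET /v1/stream/<conv_id>?from=N` backed by buffered SSE bytes. A minimal sketch of the replay idea is shown below. It assumes that `N` is a byte offset into the buffered output, which the commit message does not state, and it uses a simple append-only string rather than the ring buffer the message mentions. `replay_buffer` and its methods are hypothetical names, not the server's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical append-only replay buffer (assumption: offsets are bytes).
class replay_buffer {
public:
    // Producer side: append raw SSE bytes as they are generated.
    void write(const std::string & bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        data_ += bytes;
    }

    // Consumer side: return everything from byte offset `from` onward.
    // An offset at or past the end yields an empty string.
    std::string read_from(size_t from) const {
        std::lock_guard<std::mutex> lock(mu_);
        return from >= data_.size() ? std::string() : data_.substr(from);
    }

    // Number of bytes buffered so far.
    size_t size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return data_.size();
    }

private:
    mutable std::mutex mu_; // guards data_ for concurrent writer and readers
    std::string        data_;
};
```

A reconnecting client that had already received the first `k` bytes would call `read_from(k)` to get only what it missed; because the buffer belongs to the session rather than to the connection, it can outlive a client disconnect.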
9840 Commits