llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-09 07:16:44 +02:00

Author	SHA1	Message	Date
Aldehir Rojas	42a0afd594	server : do not parse when flushing http headers (#24281 )	2026-06-08 13:32:41 -05:00
Xuan-Son Nguyen	8f83d6c271	mtmd : add video input support (#24269 ) * wip * ok: lazy bitmap API * remember to free lazy text * wip * add mtmd_helper_video * support video input on server (base64 input) * add MTMD_VIDEO config * add timestamp * update CLI * cli: allow auto-completion for video * add --video arg * fix build * update docs * rename as suggested	2026-06-08 14:40:12 +03:00
Xuan-Son Nguyen	715b86a366	cli: fix spinner not show during prompt processing (#24283 )	2026-06-08 11:11:45 +02:00
ddh0	9e3b928fd8	common : relax sampler name matching (#23744 ) * common : relax sampler name matching Currently, in some cases, the alternative names for samplers (like `top-k` and `min-p` instead of the canonical `top_k` and `min_p`) are not always recognized by the `common_sampler_types_from_names` function in `common/sampling.cpp`. This PR changes the signature of this function to remove the `bool allow_alt_names` flag, and removes all occurences of the flag from call sites. Therefore, the function will now always match all known names. I also changed the logic of the function to unconditionally check the provided sampler names against both the canonical and alternative names, and to be case-insensitive. This fixes an issue I was seeing wherein samplers specified in the `llama-server` UI were not recognized as valid when the alternative names were used. * add more alt names * cont. fix * cast to unsigned char for correctness * common : unify sampler name mapping * annotate canonical vs. alt sampler name mappings per @CISC * Update common/sampling.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * common : auto-generate sampler name aliases per @ngxson * use merged map for matching * use `.merge` instead of iterating * nit: simplify comment * nit: use insert everywhere, not index assignment --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-06-07 22:48:11 +02:00
Aman Gupta	04eb4c446d	llama : add Gemma4 MTP (#23398 )	2026-06-07 20:50:54 +08:00
Xuan-Son Nguyen	31e82494c0	mtmd: support "frame merge" for qwen-vl-based models (#21858 ) * feat: add video support for Qwen3.5 * various clean up * revise the design * fix llava-uhd case * nits * nits 2 --------- Co-authored-by: andrewmd5 <1297077+andrewmd5@users.noreply.github.com>	2026-06-06 21:17:25 +02:00
Adrien Gallouët	6b80c74f28	completion : remove useless statics (#24226 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-06-06 12:16:16 +02:00
Adrien Gallouët	588f0dc2ce	completion : fix format specifier in LOG_INF (#24213 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-06-06 11:24:27 +02:00
Xuan-Son Nguyen	f5c6ae1827	mtmd, server: add "placeholder bitmap" for counting tokens , add /input_tokens API (#23913 ) mtmd: add "placeholder bitmap" for counting tokens w/o preprocessing * fast path skip preproc for placeholder * fix build * correct the api * add server endpoint + tests * add object name * update docs * add proxy handling * fix build * fix audio input path * use is_placeholder in process_mtmd_prompt() * nits * nits (2) * docs: clarify chat/completions/input_tokens is not official * fix merge problem	2026-06-06 11:06:51 +02:00
Gabe Goodhart	64086f2b2f	model, mtmd: Granite4 Vision (#23545 ) * feat(convert): Get language model conversion working for 4.1 vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb python-side vision projector names and mappings There are several awkward things here: 1. Most of these are essentially identical to the audio qformer tensors. On the c++ side, that's mapped using the prefix, so the rest of the GGUF name needs to align, but on the python side there's no prefix notion, so they all get duplicated. 2. There are a couple of net-new tensors for vision, in particular PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as belonging to the qformer portion, but the GGUF name is simply proj_norm which conflicts with the ideal name for this new PROJ_NORM that is not qualified as part of the qformer. To get around this, I used "proj_layernorm" as the GGUF name. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python side architecture name Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python-side plumbing for setting FEATURE_LAYERS hparam Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side tensor naming defines NOTE: Usage of these hasn't been updated to include prefix yet Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Convert vision_feature_layer to an ordered vector We need to preserve the ordering of these feature index values so that they can be mapped to the sub-tensors within the stacked projectors. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Add architecture label plumbing Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(wip): Add partial conversion for mmproj This handles stacking the projector tensors and setting the new harams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add gguf_writer and constant support for new hparams and deepstack layer arr Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Full conversion for mmproj w/ tensor mappings Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add lm_head skip for mmproj for 4.0 Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias text_config architecture in convert_lora_to_gguf.py Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add --trust-remote-code arg to convert_lora_to_gguf.py This defaults to False, but allows a user to enable it programmaticly instead of using the interactive prompt. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias model.language_model. -> model. for lora adapters Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Extend language model tensor dealiasing in adapters Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary registration for GraniteSpeech in language model Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb through mm prefix formatting for qformer tensors Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Refactor vision projector tensors to use predictor ID as the block This is cleaner than stacking them. The modeling file hard-codes single-layer qformers, so we can punt on the multiipule multi-layer projectors problem. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add spatial offests array hparam conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add stub plumbing for granite vision in mtmd Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new hparam and tensor naming in clip-impl.h New hparams: - KEY_PROJ_SAMPLE_QUERY_SIDE - KEY_PROJ_SAMPLE_WINDOW_SIDE - KEY_PROJ_SPATIAL_OFFSETS New tensors: - TN_MULTI_PROJ_IMG_POS - TN_MULTI_PROJ_QUERY - TN_MULTI_PROJ_LAYERNORM - TN_MULTI_PROJ_LINEAR - TN_MULTI_PROJ_NORM Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Move deepstack_layer_arr to llm hparam instead of mmproj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove IS_DEEPSTACK_LAYERS This appears to have been added during Qwen3 VL (https://github.com/ggml-org/llama.cpp/pull/16780), but it was never actually used. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: n_deepstack_layers -> deepstack_layer_arr The old logic hard coded a correspondence between the first N layers of the LLM and the 1->N entries in the input embeddings. Now, that relationship is maintained at loading time if the GGUF value is single-valued. If it is multi-valued, it loads directly allowing for deepstack layers to be spaced out throughout the model. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use try/catch for single/multi valued deepstack info The alternative would be to use get_key_or_arr, but then the single value would be populated through the entire array and we'd need to detect that and update it with the right correspondence. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add deepstack injection point for granite LLM The use of ggml_add here assumes that the elements of inp_embd will be pre- arranged to be the full embedding length with only the vision-mask'ed portions non-zero from the projector. This matches how Qwen3VL does it. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: add missing vision attn layernorm eps Branch: Granite4Vision AI-usage: full (OpenCode + Qwen 3.6-35B) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix missing prefix template for TN_QF_PROJ_LINEAR It's not strictly necessary since vision uses the blockwise version, but it makes the loading consistent. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add embedding scale and image grid pinpoints hparams in conversion Also remove dead parsing for self._deepstack_layer_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add mtmd KEY_ section for hparams shared with the LLM In this case, we need the EMBEDDING_SCALE so we can unscale the image embeddings to compensate for applying embedding scale to the input embeddings Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Implement c++ hparam parsing Branch: Granite4Vision AI-usage: draft (Claude Code) Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Flatten pinpoints in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing break Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: No reason to have modality prefix for img_pos Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tensor loading Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the right portion of speech for tensor loading! Also plumb through the layernorm -> post_norm naming change Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add logging of deepstack_layers_arr if set I also changed the print_f output type to int32_t to avoid printing overflow values for -1. This could cause overflows on the other side, but I can't imagine a value for any of the current array hparams that would trigger that. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Make sure input embeddings are cont before f_embedding_scale Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add init and mmproj_embd cases for g4v The n_mmproj_embd is 1+ to make space for the text embedding and all 8 projectors Branch: Granite4Vision AI-usage: draft (Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Invert (h, w) -> (w, h) pinpoints Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Reorder projectors based on llm index and skip the first injection The multi-projector stack has a strange asymmetry based on how it's currently implemented for qwen3vl: on the mmproj side, it's all N projectors, but the output of the "first" (by inp_embd index) projector is automatically consumed as if it were a standard single-projector mmproj, so the deepstack portion needs to only contain the 1-N entries. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix mmproj hparams in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix ordering/logic for deepstack injection in granite Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix preprocessing config to match what the model needs Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * wip: Partial port of Eli's implementation This is still pretty broken, but it's getting closer. It now happily generates tokens, but the values are quite incorrect still. I suspect it's caused by the mapping of projectors from safetensors to their respective orders here. Also, this implementation breaks encapsulation pretty badly in mtmd_encode. This will need a big refactor to put the G4V-specific encoding logic somewhere more appropriate. Branch: Granite4Vision AI-usage: draft (Claude Code, Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix the pre-scaling on the input embeddings to correctly invert the scale We've got tokens! They still don't line up quite right, so something's a little off, but we're getting much closer now. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: invert embedding multiplier -> base_scale at load Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix setting image_resize_pad after new enum introduced Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add G4V to mmproj mapping in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Re-add padding disable for non-hybrid hybrid models Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Simplify G4V n_tokens computation This is slightly more efficient and flexible for when we implement the unpad cropping. IMO, it's also clearer that it is adding the number of image_newline tokens (embeddings) to the grid, rather than recomputing the entire count. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new clip APIs for post-tile-encoding assembly Granite 4 Vision uses llava-next style pack-and-unpad which requires injecting the learned newline after each row of the tile grid. A row here is a single row of the grid which is composed of (grid_x * cols_per_tile) * (grid_y * rows_per_tile), so the result is newlines injected in between individual tile rows, thus not something that can be handled with the standard llava-uhd block-wise endcoding. Branch: Granite4Vision AI-usage: draft (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add model interfaces for granite 4 vision assembler I'm on the fence about the best organization of this. These free functions allow the per-architecture logic in clip.cpp to access the model-specific graph building, but they still require a fair bit of model-specific logic in clip.cpp which is not ideal. I think a better approach may be to replicate what is done with the graph builders themselves (and possibly even make the assembler part of the model's existing graph builder). Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(mtmd): Consolidate assembler logic into clip_assembler class family Just like `clip_graph` is the base class for building the model-specific encoder graphs, `clip_assembler` will be the base class for building the model-specific assembler graphs. This allows the assembly pattern to follow how the encoder pattern is implemented where the model-specific logic lives in a subclass co-located with the encoder graph builder that gets constructed by a simple factory method. Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment improvement Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: granite_vision -> granite4_vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack These pieces were never used on the c++ side (removed there in an earlier commit), so this is just cleanup that I missed before. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Oops! I did not mean to commit one of my prompt files But now it's too far back in history to effectively rebase out, even with interactive and --rebase-merges :( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing <algorithm> include for std::find It seems that this was already pulled in on some platforms, but not on others Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix Flake8 warnings in granite conversion module Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove clip_assembler in favor of clip_image_f32.append_token Per conversation in the PR, the clip_assembler pattern was too invasive. This is a compromise that limits model-specific blocks to add_media where each preprocessed tile is annotated with an injection type, after which all the token counting logic is generic and the newline injection itself is handled in the graph based on the value for the given tile image. Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(convert): Split n_deepstack_layers and deepstack_layers (array) Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix GGUF key for deepstack_layers_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs This follows how gemma3 and gemma4 handle embedding scaling by skipping the multiplier for raw input embeddings. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Fully revert changes to n_deepstack_layers and qwen3vl* Since we're going to keep the GGUF KVs separate, it makes sense to just keep the hparams separate too to limit the scope of this branch. The down side is that n_deepstack_layers and deepstack_mapping_arr are potentially conflicting. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Revert removal of "is_deepstack_layers" GGUF KV This KV is not used at all on the c++ side, so it's fully dead, but there's also no need to conflate this cleanup with the addition of G4V. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary ggml_cont and build_forward_expand in cbx Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Clean up comments Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Tighter and more flexible code for g4v_build_block This could be refactored to look a lot more like granite-speech, but the overall block constructs before/after the qformer are pretty different, so for now I'm going to leave it as is and just tighten a bit. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary `unordered_set` include Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add architecture guard on deepstack_mapping_arr printout Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary AI-gen comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Always initialize deepstack_mapping_arr with -1 values This was causing `test-llama-archs` to fail, likely due to trying to save the uninitialized values, then re-loading them. It's safer to always initialize so that other models don't forget and end up with undefined behavior. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove TODO about block/vs non-block tensor mapping Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Move is_vision_feature_layer logic into clip_hparams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a bool for append_token Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove unnecessary comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unused get_model api yikes! Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rearrange helpers for g4v to be private members and use build_attn Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix off-by-one in vision layer index This was inherited from the Claude Code implementation that pushed the negative index inversion down into the model file. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix norm/post_norm mixup in conversion face. palm. :( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: More descriptive tensor names Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Apply PR cleanup for new conversion changes AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix(convert): Remove duplicate V_ENC_EMBD_IMGNL Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: append_token -> add_newline Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment cleanup Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Cleaner error handling/checking NOTE: format_string is not available in granite.cpp (and including clip-impl.h to get it doesn't compile, so I think it violates the intended encapsulation), so std::stringstream is the simplest answer. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2026-06-05 17:44:59 +02:00
therealkenc	86591c7536	cli: fix model params not propagated (#23893 ) Fixes #23847	2026-06-05 17:29:41 +02:00
Mario	9c955c48b0	Fix link to available UI settings (#24169 ) The current link is to a non-existent file. I had a look at the repo, spotted the file containing the UI configuration key and updated the link	2026-06-05 14:39:32 +02:00
Xuan-Son Nguyen	cc7bef34e2	ui: add ignore-scripts=true to npmrc (#24149 )	2026-06-05 14:31:03 +02:00
Pedro Cuenca	ad1b88ca0d	docs: Update quantization readme (#24133 ) * Update quantization readme * install requirements * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * dos2unix suggestions --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-06-05 12:21:26 +02:00
Georgi Gerganov	7c158fbb4a	server : disable on-device spec checkpoints (#24108 )	2026-06-04 19:30:59 +03:00
viggy	42b2d60e57	webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (#23132 ) * use child snippets for landing and chat message elements * make ... icon visible in conversation history menu * conversation history forward tab fix * add snippet fix for fork icon in conversation history * focus/keyboard fix for attachment x icon and scroll left/right * formatting * fix scroll down issue * simply Statistics and pointer events in scrolldown * create storybook tests and move to folder * improve tests to actually assert on element	2026-06-04 17:59:00 +02:00
Bartowski	e7bcf1c3a8	Move duplicated imatrix code into single common imatrix-loader.cpp (#22445 ) * Deduplicate imatrix loading code * Add back LLAMA_TRACE, early exit on quantize missing metadata	2026-06-04 17:45:40 +02:00
Aleksander Grygier	21444c822e	ui: Fixed packages (#24119 ) * chore(ui): pin package versions to currently installed - Update all dependencies and devDependencies to match exactly what's in package-lock.json - This ensures reproducible builds by locking to specific versions rather than semver ranges * chore: Update packages * chore: Move remaining dependencies to devDependencies * fix: Add missing `mermaid` package * chore: Update `cookie` package to `v1.1.1` * chore: Formatting * test: Update test configs	2026-06-04 16:23:08 +02:00
MagicExists	526977068f	ui: added single line reasoning preview (#23601 ) * webui: added single line reasoning preview. * patch: reduce width slightly for the previewing section * refactor: move formatter constants to the right file * feat: reimplement reasoning preview with throttled dynamic per-line rendering * chore: fix spacing Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: refactor to requested changes * refactor: grouped by capture pattern instead of block-level + inline * ui: fax interrupt state only trigger for 1st reasoning message * chore: make reasoning preview respects showThoughtInProgress setting * chore; newline at EOF Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * fix: thread rawContent so collapsible content can handle compute preview * patch: showThoughtInProgress accidentally blocks rawContent being passed * chore: fix lint * chore: change smoke test --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-06-04 16:09:43 +02:00
Yongyue Sun	6f3a9f3dee	server: avoid unnecessary checkpoint restore when new tokens are present (#24110 ) * server: avoid unnecessary checkpoint restore when new tokens are present The pos_min_thold calculation unconditionally subtracts 1 to ensure at least one token is evaluated for logits when no new tokens exist. However, when the request contains new tokens beyond the cached prefix, this -1 is overly conservative and may trigger an unnecessary checkpoint restore. Conditionally apply the -1 only when n_past >= task.n_tokens() (no new tokens), avoiding redundant KV state restoration when there is actual work to do. * cont : add ref --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-06-04 16:09:01 +03:00
Pascal	4586479852	webui: fix tool selector toggle/counter, key tools by stable identity (#24065 ) * webui: fix tool selector toggle/counter, key tools by stable identity Key the disabled set, counts and toggles by a stable per-tool key instead of bare function name, deduped from one canonical list. Per-tool checkboxes become presentational (single row handler, no nested button), category checkboxes drop the tristate (n/total carries partial). One getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name. * ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity	2026-06-04 13:09:49 +02:00
A B	0066404085	server : add header to tools/server/server-http.h (#24089 )	2026-06-04 13:14:46 +03:00
Andrea Richiardi	7ac5a4225e	cmake: skip cvector-generator and export-lora when CPU backend is disabled (#24053 )	2026-06-04 13:13:19 +03:00
Andrei	e3ba22d6cc	fix(mtmd): handle Gemma 4 audio projector embedding size (#24091 ) * mtmd: handle Gemma 4 audio projector embedding size * rm projection_dim from clip_n_mmproj_embd --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-06-04 11:51:23 +02:00
Andrei	94a220cd67	mtmd: fix Gemma 4 unified FPE (#24088 )	2026-06-03 21:51:18 +02:00
Aman Gupta	166fe29492	qwen35: use post-norm hidden state for MTP (#24025 ) * qwen35: use post-norm hidden state for MTP * rename pre_norm to nextn * fix step35	2026-06-04 01:29:09 +08:00
Xuan-Son Nguyen	c8d6a00636	mtmd: enable non-causal vision for gemma 4 unified (#24082 )	2026-06-03 19:05:17 +02:00
Xuan-Son Nguyen	a731805ced	mtmd, model: allow skip build_vit() (#24077 ) * add model * nits	2026-06-03 17:10:35 +02:00
Aleksander Grygier	ee4cf705bb	ui: Mermaid Diagrams in chat + interactive preview (#24032 )	2026-06-03 16:55:36 +02:00
Mikhail Podvitskii	4fb16eccce	model: add Mellum architecture (#23966 ) * model: support for Mellum architecture * model: improve mellum.py formatting * model: improve mellum.py formatting once again * deps: downgrade transformers to 4.57.6 (to fix CI) * deps: remove huggingface_hub dependency * deps: remove huggingface_hub from test requirements --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-06-02 22:11:12 +03:00
Daniel Bevenius	0b7154066e	common : fix state save in common_prompt_batch_decode (#23468 ) * common : fix state save in common_prompt_batch_decode This commit addresses a bug in common_prompt_batch_decode that affects the session state store/restore in completion.cpp and save-load-state.cpp. The motivation for this is that currently the code is saving n-1 tokens in both the session_tokens and in the KV cache. Then when loading the session tokens, and if the prompt matches, it would replay the last saved token (n-1) into the next position, effectively replaying the same token in the wrong position. The fix is to store all n tokens in session_tokens, while the memory state only reflects n-1 processed tokens as the saving happens before the last token is decoded in common_prompt_batch_decode. I ran both completion.cpp and save-load-state.cpp with a transformer, a recurrent, and a hybrid model. Resolves: https://github.com/ggml-org/llama.cpp/issues/23400 Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>	2026-06-02 15:44:15 +02:00
Xuan-Son Nguyen	60130d18f9	server: add SSE ping interval (#24013 )	2026-06-02 14:14:55 +02:00
Marcos Del Sol Vives	69cea5b669	ui: simplify network error handling (#23431 ) Previously error to string conversion was split in two different files, with one converting errors into strings, and another function analyzing those strings to generate yet another string. Now the the error handling for network fetches has been centralised and uses directly HTTP error codes whereas possible to generate the human-readable error strings. It also fixes an issue where all JSON errors reported from the backend, such as "Invalid API key", would get turned incorrectly in to "Failed to connect to server" due to poor matching logic in the now-gone getErrorMessage function.	2026-06-02 10:45:25 +02:00
Aleksander Grygier	f8e67fc583	ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434 ) * feat: Add "Thinking" toggle and status icon + redesign Chat Form Actions Add panel * test: Update test reference * fix: Icon * fix: E2E test command * fix: wait for greeting h1 to be visible in e2e test * fix: remove duplicate PDF option in attachment dropdown * fix: use label-based group toggle to avoid stale references * refactor: inline MCP server and tool toggles in mobile sheet * fix: serve correct build directory in e2e playwright config * feat: add reasoning effort levels selector in model dropdown * feat: Reasoning effort * refactor: Make server origin configurable via environment variable * feat: Add chat template thinking detector utility * feat: Add thinking support detection to models store * refactor: Update model selector components with thinking detection and message-specific indicators * feat: Update chat form components for model selection and thinking support * feat: Improve Reasoning controls UI * refactor: Apply suggestions from code review Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * fix: Model tags * refactor: Cleanup * refactor: Remove unneeded components * refactor: Cleanup	2026-06-02 10:23:19 +02:00
Pascal	354ebac8cb	server: real-time reasoning interruption via control endpoint (#23971 ) * server: real-time reasoning interruption via control endpoint Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work. * ui: track reasoning phase via explicit streaming state Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation. * ui: extract control endpoint and action into constants Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth. * server: target reasoning control by completion id Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message. * ui: target reasoning control by completion id Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}. * server: address review from @ngxson Move the control fields into task_params and drop the redundant comments on the control path. * server: document the reasoning control endpoint * Update tools/ui/src/lib/types/database.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: rename cmplId to completionId Per @allozaur review, clearer name for the streamed completion id. * ui: wire completion id capture through the agentic flow The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id. * ui: target reasoning control model from the message The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-06-02 07:26:20 +02:00
Georgi Gerganov	5dcb711666	speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988 ) * speculative : add common_speculative_n_max helper function Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative. Assisted-by: llama.cpp:local pi * cont : draft context always has n_parallel outputs * llama : log n_outputs_max * speculative : remove draft-simple auto-enable * ci : enable server tests on PRs	2026-06-01 22:26:58 +03:00
Aman Gupta	de6f727aae	llama: limit max outputs of `llama_context` (#23861 ) * llama: save more VRAM by reserving n_outputs == n_seqs when possible * add n_outputs_per_seq * move n_outputs_max to server-context * change ubatch to batch everywhere	2026-06-01 18:01:38 +03:00
Junwon Hwang	48b88c3b00	model: Add EXAONE 4.5 implementations (#21733 ) * Add EXAONE 4.5 and Add GQA for MMproj * mtmd: EXAONE 4.5 vision markers and projector path EXAONE 4.5 uses <vision> and </vision> for image boundaries; Qwen keeps <\|vision_start\|> and <\|vision_end\|>. Route EXAONE 4.5 through the Qwen2.5-VL-style encode path (window attention pattern, optional mmproj input norm). Update exaone4_5 projector weights and convert_hf_to_gguf for mmproj export. * mtmd: load EXAONE4 nextn tensors correctly Align EXAONE4 tensor registration with EXAONE_MOE for NextN/MTP slots and avoid skip-flag propagation on duplicated rope_freqs so model loading succeeds for EXAONE 4.5 GGUF. * Minor fixes * Address PR feedback * Address PR feedback * Fix EXAONE after merge * Fix EXAONE 4.5 conversion * Address PR feedback * Refactor EXAONE 4.5 conversion * Address PR feedback * Fix unintended deletion * Minor fix --------- Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai>	2026-06-01 11:48:53 +02:00
Eric Zhang	6f165c1c64	server : handle If-None-Match weak ETags (#23916 )	2026-05-31 16:21:08 -05:00
Eric Zhang	3292da09f6	ui: fix ETag truncation with MSVC compiler (#23917 )	2026-05-31 11:21:23 +02:00
Pascal	d749821db3	webui: add custom CSS injection via config (#23904 ) * webui: add custom CSS injection via config register a customCSS setting in the Developer section under Custom JSON, syncable so it rides the existing ui-config pass through. inject the value into a single style element in the head, reactive on the setting. lets an operator theme a prebuilt binary through --ui-config without rebuilding, and lets a user set it from the settings panel. * ui: address review from @niutech and @allozaur, rename custom JSON key and CSS field * ui: address review from @allozaur, move custom CSS injection to a style tag in svelte:head * ui: inject custom CSS through a svelte action instead of a bound element move the textContent write into a use: action on the head style node. the action is the idiomatic way to touch a node, so the no-dom-manipulating lint rule is satisfied without a disable. value stays text through textContent, never parsed as HTML. * Update tools/ui/src/lib/constants/settings-keys.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: address review from @allozaur, rename custom config key to customJson with migration rename the custom config key to customJson across the type, the chat request builder, the settings save check and the custom tools reader, keeping the custom API param name unchanged. add a non destructive migration that copies the legacy custom key to customJson at startup. only render the head style tag when custom CSS is set. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-05-30 23:49:31 +02:00
Gaurav Garg	aa46bda89b	Support `-fa auto` in llama-bench (#23714 ) * Support `-fa auto` in llama-bench Make the default value of `-ngl` -1, similar to other tools. Update README with latest usage and examples * Address review comments	2026-05-31 02:03:57 +05:30
Pascal	d38d50e7ff	ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910 )	2026-05-30 16:50:54 +02:00
Xuan-Son Nguyen	0821c5fcfd	server: in SSE mode, send HTTP headers when slot starts (#23884 ) * server: in SSE mode, send HTTP headers when slot starts * ref to pr * stream should be false by default	2026-05-30 00:06:29 +02:00
Ruixiang Wang	689a9a470e	server-bench : add speed-bench for speculative decoding benchmarking (#23869 ) * spec: add speed-bench support for benchmarking * speed-bench : add trailing newline to requirements.txt * speed-bench : bump datasets to 4.8.0 to fix ty check * server-bench : remove now-unused type: ignore after datasets bump	2026-05-29 23:09:47 +02:00
Pascal	5a46b46acd	app: add llama update self updater (#23865 ) * wip: llama update POC * cleaning: llama update * llama-gen-docs * app: delegate llama update to the install script * app: spawn the installer detached so llama update can replace a running binary * cleaning: inline llama update into llama.cpp, drop app-update.{cpp,h} * app: make llama_update static Address review from @angt	2026-05-29 23:02:40 +02:00
ValdikSS	22d66b567e	ui: handle audio/vnd.wave as audio WAV file (#23754 ) Firefox on Linux uses this MIME type	2026-05-29 21:41:35 +02:00
Xuan-Son Nguyen	b5f52280fb	server: remove obsolete scripts (#23870 )	2026-05-29 19:47:30 +02:00
Xuan-Son Nguyen	06d26dfdff	download: add option to skip_download (#23059 ) * download: add option to skip_download * fix * fix 2 * if file doesn't exist, respect skip_download flag	2026-05-29 16:30:55 +02:00
Saba Fallah	da3f990a47	mtmd: Add DeepSeekOCR 2 Support (#20975 ) * mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution * introduced clip_image_f32::add_viewsep * address PR review - drop redundant ggml_cpy ops in both deepseekocr versions build - drop no-op ggml_cont in build_sam - assert num_image_tokens deepseekocr2 - view_seperator as (1, n_embd) at conversion (for both versions) - drop redundant ggml_reshape_2d * Update tools/mtmd/models/deepseekocr2.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-05-29 16:13:51 +02:00

1 2 3 4 5 ...

918 Commits