* common : relax sampler name matching
Currently, alternative sampler names (like `top-k` and `min-p` instead of
the canonical `top_k` and `min_p`) are not always recognized by the
`common_sampler_types_from_names` function in `common/sampling.cpp`.
This PR changes the signature of this function to remove the `bool
allow_alt_names` flag, and removes all occurrences of the flag from call
sites. The function will therefore always match all known names.
I also changed the logic of the function to unconditionally check the
provided sampler names against both the canonical and alternative names,
and to be case-insensitive.
This fixes an issue I was seeing wherein samplers specified in the
`llama-server` UI were not recognized as valid when the alternative
names were used.
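A minimal sketch of the matching described above, not the actual code in
`common/sampling.cpp`. The enum, the helper names, and the rule that
aliases are generated by replacing `_` with `-` are assumptions made for
illustration only.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the real sampler type enum.
enum class sampler_type { top_k, top_p, min_p, temperature };

// Lowercase a name. The cast to unsigned char avoids undefined behavior
// when std::tolower is handed a negative char value.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Build one merged map: canonical names plus auto-generated aliases.
// Assumed aliasing rule: '_' is replaced by '-'.
static std::unordered_map<std::string, sampler_type> build_name_map() {
    const std::unordered_map<std::string, sampler_type> canonical = {
        {"top_k",       sampler_type::top_k},
        {"top_p",       sampler_type::top_p},
        {"min_p",       sampler_type::min_p},
        {"temperature", sampler_type::temperature},
    };
    std::unordered_map<std::string, sampler_type> result = canonical;
    for (const auto & [name, type] : canonical) {
        std::string alias = name;
        std::replace(alias.begin(), alias.end(), '_', '-');
        result.insert({alias, type}); // insert never overwrites an existing key
    }
    return result;
}

// Match every provided name against the merged map, case-insensitively.
// Unknown names are skipped.
static std::vector<sampler_type> sampler_types_from_names(const std::vector<std::string> & names) {
    static const auto name_map = build_name_map();
    std::vector<sampler_type> out;
    for (const auto & name : names) {
        const auto it = name_map.find(to_lower(name));
        if (it != name_map.end()) {
            out.push_back(it->second);
        }
    }
    return out;
}
```

With this sketch, `Top-K`, `top_k` and `TOP-K` all resolve to the same
sampler type, with no flag needed at the call site.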
* add more alt names
* cont. fix
* cast to unsigned char for correctness
* common : unify sampler name mapping
* annotate canonical vs. alt sampler name mappings per @CISC
* Update common/sampling.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : auto-generate sampler name aliases per @ngxson
* use merged map for matching
* use `.merge` instead of iterating
* nit: simplify comment
* nit: use insert everywhere, not index assignment
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : fix state save in common_prompt_batch_decode
This commit addresses a bug in common_prompt_batch_decode that affects
the session state store/restore in completion.cpp and
save-load-state.cpp.
Currently the code saves n-1 tokens in both session_tokens and the KV
cache. When the session tokens are loaded and the prompt matches, the
last saved token (n-1) is replayed into the next position, effectively
replaying the same token in the wrong position.
The fix is to store all n tokens in session_tokens, while the memory
state only reflects n-1 processed tokens as the saving happens before
the last token is decoded in common_prompt_batch_decode.
I ran both completion.cpp and save-load-state.cpp with a transformer, a
recurrent, and a hybrid model.
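The layout after the fix can be illustrated with a toy model. This is a
sketch under stated assumptions, not the code in
common_prompt_batch_decode: `saved_session`, `save_session` and
`tokens_to_decode` are hypothetical names, and tokens are plain ints.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using token = int;

// Hypothetical snapshot of a saved session.
struct saved_session {
    std::vector<token> tokens;      // all n prompt tokens
    size_t             n_processed; // tokens reflected in the saved memory state (n-1)
};

// The save happens before the last prompt token is decoded, so the
// memory state covers n-1 tokens while the token list holds all n.
static saved_session save_session(const std::vector<token> & prompt) {
    saved_session s;
    s.tokens      = prompt;
    s.n_processed = prompt.empty() ? 0 : prompt.size() - 1;
    return s;
}

// After a restore, return the prompt tokens that still need decoding.
// Decoding resumes at position (prompt.size() - result.size()).
static std::vector<token> tokens_to_decode(const saved_session & s,
                                           const std::vector<token> & prompt) {
    size_t n_match = 0;
    while (n_match < s.tokens.size() && n_match < prompt.size() &&
           s.tokens[n_match] == prompt[n_match]) {
        n_match++;
    }
    // Never reuse more than the memory state actually contains.
    const size_t n_reuse = std::min(n_match, s.n_processed);
    return std::vector<token>(prompt.begin() + n_reuse, prompt.end());
}
```

For a fully matching prompt of n tokens, only the last token remains to
be decoded, and it lands at position n-1 rather than one position later.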
Resolves: https://github.com/ggml-org/llama.cpp/issues/23400
Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>
* feat: Add "Thinking" toggle and status icon + redesign Chat Form Actions Add panel
* test: Update test reference
* fix: Icon
* fix: E2E test command
* fix: wait for greeting h1 to be visible in e2e test
* fix: remove duplicate PDF option in attachment dropdown
* fix: use label-based group toggle to avoid stale references
* refactor: inline MCP server and tool toggles in mobile sheet
* fix: serve correct build directory in e2e playwright config
* feat: add reasoning effort levels selector in model dropdown
* feat: Reasoning effort
* refactor: Make server origin configurable via environment variable
* feat: Add chat template thinking detector utility
* feat: Add thinking support detection to models store
* refactor: Update model selector components with thinking detection and message-specific indicators
* feat: Update chat form components for model selection and thinking support
* feat: Improve Reasoning controls UI
* refactor: Apply suggestions from code review
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix: Model tags
* refactor: Cleanup
* refactor: Remove unneeded components
* refactor: Cleanup
* server: real-time reasoning interruption via control endpoint
Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }; the opt-in
reasoning_control flag arms the budget sampler on demand. Works with
both router and single-model setups. Includes a minimal WebUI button as
a skeleton for further UI work.
* ui: track reasoning phase via explicit streaming state
Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.
* ui: extract control endpoint and action into constants
Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.
* server: target reasoning control by completion id
Address @ngxson review on the control endpoint.
Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.
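A rough sketch of why matching on the completion id sidesteps the
TOCTOU. The `slot` struct, its fields, and the messages are hypothetical
stand-ins rather than the server's real types. Only the idea is taken
from the commit: match the live completion, guard on the
reasoning_control flag, and answer with a success flag plus an optional
message.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a server slot.
struct slot {
    int         id;
    bool        is_processing;     // a completion is currently live on this slot
    bool        reasoning_control; // opt-in flag set by the original request
    std::string cmpl_id;           // id of the completion running on this slot
    bool        reasoning_ended = false;
};

struct control_result {
    bool        success;
    std::string message; // empty when there is nothing to report
};

// Apply "reasoning_end" to the slot that is running the given completion.
// A slot that was reassigned carries a different completion id, so a
// stale request matches nothing instead of hitting the wrong task.
static control_result reasoning_end(std::vector<slot> & slots, const std::string & cmpl_id) {
    for (auto & s : slots) {
        if (!s.is_processing || s.cmpl_id != cmpl_id) {
            continue;
        }
        if (!s.reasoning_control) {
            return {false, "reasoning control is not enabled for this completion"};
        }
        s.reasoning_ended = true;
        return {true, ""};
    }
    return {false, "no live completion matches the given id"};
}
```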
* ui: target reasoning control by completion id
Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.
* server: address review from @ngxson
Move the control fields into task_params and drop the redundant
comments on the control path.
* server: document the reasoning control endpoint
* Update tools/ui/src/lib/types/database.d.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* ui: rename cmplId to completionId
Per @allozaur review, clearer name for the streamed completion id.
* ui: wire completion id capture through the agentic flow
The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.
* ui: target reasoning control model from the message
The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* speculative : add common_speculative_n_max helper function
Extract the speculative max-draft-size logic from server_n_outputs_max
into a reusable common_speculative_n_max() function in common/speculative.
Assisted-by: llama.cpp:local pi
* cont : draft context always has n_parallel outputs
* llama : log n_outputs_max
* speculative : remove draft-simple auto-enable
* ci : enable server tests on PRs
* llama: save more VRAM by reserving n_outputs == n_seqs when possible
* add n_outputs_per_seq
* move n_outputs_max to server-context
* change ubatch to batch everywhere
* common : add common_chat_split_by_role
* cont : fix spans to reach end of message
* server: fix checkpoints creation
- extract message_spans from chat templates
- find the prompt token position before the latest user message
- split prompt batching at that position
- create a context checkpoint before the latest user input
- avoid periodic mid-prompt checkpoints when that position is known
- handle multimodal prompts when mapping text/template positions to server prompt tokens
- add --checkpoint-min-step to control minimum spacing between checkpoints
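The steps above can be sketched roughly as follows. The sketch assumes
the message spans have already been mapped to token positions, which is
the part the real change has to handle for multimodal prompts.
`message_span`, `checkpoint_pos` and `batch_ends` are hypothetical names,
and `n_batch` is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical span of one chat message in prompt-token coordinates,
// half-open: [tok_begin, tok_end).
struct message_span {
    std::string role;
    size_t      tok_begin;
    size_t      tok_end;
};

// Token position just before the latest user message, if there is one.
static std::optional<size_t> checkpoint_pos(const std::vector<message_span> & spans) {
    for (auto it = spans.rbegin(); it != spans.rend(); ++it) {
        if (it->role == "user") {
            return it->tok_begin;
        }
    }
    return std::nullopt;
}

// Split [0, n_tokens) into batches of at most n_batch tokens, forcing a
// batch boundary at split_pos so a checkpoint can be taken exactly there.
static std::vector<size_t> batch_ends(size_t n_tokens, size_t n_batch,
                                      std::optional<size_t> split_pos) {
    std::vector<size_t> ends;
    size_t pos = 0;
    while (pos < n_tokens) {
        size_t end = std::min(pos + n_batch, n_tokens);
        if (split_pos && *split_pos > pos && *split_pos < end) {
            end = *split_pos; // cut the batch short at the checkpoint position
        }
        ends.push_back(end);
        pos = end;
    }
    return ends;
}
```

Forcing the boundary means the checkpoint captures the state right
before the latest user input, so no periodic mid-prompt checkpoint is
needed when that position is known.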
* cont : clean-up
* Support autoparser detection for message barriers
* server: fix message span delimiter and update docs
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
* Move to backend sampling for MTP draft path
Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits.
Make backend sampling more robust and fall back to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K.
* Allow sampler chains to be partially offloaded to backend
* Add --spec-draft-backend-sampling argument. Enabled by default.
ggml_backend_dev_by_name always appends a nullptr sentinel to the devices
vector. Skipping nullptr entries prevents assertion failure in
ggml_backend_dev_name.
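The guard amounts to skipping null entries before touching a device. A
minimal sketch with a stand-in `device` struct follows. It does not use
the real ggml types, and `device_names` is a hypothetical helper.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a backend device handle target.
struct device {
    std::string name;
};

// Collect device names, skipping the nullptr sentinel at the end of the
// list instead of dereferencing it.
static std::vector<std::string> device_names(const std::vector<const device *> & devices) {
    std::vector<std::string> names;
    for (const device * dev : devices) {
        if (dev == nullptr) {
            continue; // sentinel entry, not a real device
        }
        names.push_back(dev->name);
    }
    return names;
}
```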
Assisted-by: llama.cpp:local pi
* save-load-state : refactor into separate phase functions
- Split monolithic main() into 4 self-contained phase functions, each
managing its own context/sampler/batch lifecycle
- Each function tokenizes internally using its local ctx instance
- main() is now a clean orchestrator: init -> run phases -> assert results
- Proper resource cleanup on every exit path (return {} on error)
Assisted-by: llama.cpp:local pi
* save-load-state : use params.out_file instead of separate state_file
- Remove state_file parameter from all phase functions
- Each function accesses params.out_file directly
- Initialize params.out_file in main alongside params.prompt
Assisted-by: llama.cpp:local pi
* save-load-state : use smart pointers for ctx and smpl
- Replace raw llama_context* with llama_context_ptr
- Replace raw llama_sampler* with llama_sampler_ptr
- Remove all manual llama_free() and llama_sampler_free() calls
- Keep llama_batch as raw (managed manually with llama_batch_free)
Assisted-by: llama.cpp:local pi
* save-load-state : add local llama_batch_ptr RAII wrapper
- Add llama_batch_ptr struct holding llama_batch by value
- Calls llama_batch_free() in destructor
- Eliminates all manual llama_batch_free() calls
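A sketch of what such a wrapper might look like. To keep it compilable
without llama.h, `mock_batch`, `mock_batch_init` and `mock_batch_free`
stand in for llama_batch, llama_batch_init and llama_batch_free. The
deleted copy operations and the reference-returning `get()` are
assumptions, not necessarily what the commit contains.

```cpp
// Stand-ins for the llama.h batch type and its init/free functions.
struct mock_batch {
    int * token    = nullptr;
    int   n_tokens = 0;
};

static int g_live_batches = 0; // counts batches not yet freed (for the demo only)

static mock_batch mock_batch_init(int n_tokens_alloc) {
    g_live_batches++;
    return { new int[n_tokens_alloc](), 0 };
}

static void mock_batch_free(mock_batch batch) {
    delete[] batch.token;
    g_live_batches--;
}

// RAII wrapper: holds the batch by value and frees it in the destructor,
// so every exit path releases it without a manual free call.
struct batch_ptr {
    mock_batch batch;

    explicit batch_ptr(mock_batch b) : batch(b) {}
    ~batch_ptr() { mock_batch_free(batch); }

    // Copying would lead to a double free, so it is disabled.
    batch_ptr(const batch_ptr &) = delete;
    batch_ptr & operator=(const batch_ptr &) = delete;

    mock_batch & get() { return batch; }
};
```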
Assisted-by: llama.cpp:local pi
* save-load-state : replace printf/fprintf with logging macros
- Add log.h include
- Replace fprintf(stderr, ...) errors with LOG_ERR
- Replace fprintf(stderr, ...) info with LOG_TRC
- Replace printf output with LOG
Assisted-by: llama.cpp:local pi
* save-load-state : refactor tests to check results inline
Each follow-up phase now accepts an expected result and performs
the comparison internally instead of collecting results in main().
Assisted-by: llama.cpp:local pi
* save-load-state : improve test output readability
Add phase labels, remove redundant run prefixes, and show
PASS after each test.
Assisted-by: llama.cpp:local pi
* pi : add rule about git signing
* save-load-state : simplify llama_batch_ptr
Change get() to return a reference and remove operator*().
Use batch.get() throughout for consistency.
Assisted-by: llama.cpp:local pi
* save-load-state : extract generate_tokens helper
Factor out the repeated token generation loop into a shared
helper function used by all phases.
Assisted-by: llama.cpp:local pi
* save-load-state : update comments to use test terminology
Replace "Phase" with "Test" and list each test's steps
as bullet points.
Assisted-by: llama.cpp:local pi
* save-load-state : rename test functions
Rename to test_baseline, test_state_load, test_seq_cp_host,
test_seq_cp_device. Update comments and logs accordingly.
Assisted-by: llama.cpp:local pi
* pi : add rule to never git push without confirmation
Assisted-by: llama.cpp:local pi
* common : add model_only option to common_init_from_params
Add bool model_only parameter to skip context creation,
sampler init, and context-dependent setup.
Use in save-load-state to initialize only the model,
with each test creating its own context.
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* common : delegate assistant continuation to template handler
* server : implement echo parameter to exclude assistant prefill in the response
* server : fix tests for prefill
* server : use existing llama template
* cont : clean up
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
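The resulting precedence can be sketched as below. `server_params` and
`resolve_embd_normalize` are hypothetical names, and the JSON body lookup
is reduced to an optional value.

```cpp
#include <optional>

// Hypothetical subset of the server parameters. The default of 2 (L2)
// is the previously hard-coded value mentioned above.
struct server_params {
    int embd_normalize = 2;
};

// The per-request field wins when present; otherwise the handler falls
// back to the value given on the command line.
static int resolve_embd_normalize(const server_params & params,
                                  std::optional<int> request_value) {
    return request_value.value_or(params.embd_normalize);
}
```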
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently, speculative checkpointing has to restart from a checkpoint
when some draft tokens are not accepted, which wastes work re-running
the target. This PR adds the ability to roll back up to `draft_max` by
storing the GDN intermediates.
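The idea of rolling back via stored intermediates can be shown with a
toy recurrence. This is an illustration only: the state is a single
float and the update rule is invented, so it says nothing about the
actual gated delta net kernels. `recurrent_state` and `rollback_buffer`
are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Toy recurrent state. The real state is a tensor per sequence.
struct recurrent_state {
    float value = 0.0f;
};

struct rollback_buffer {
    // snapshots[i] is the state after consuming i draft tokens;
    // snapshots[0] is the state before the draft started.
    std::vector<recurrent_state> snapshots;

    // Run the toy recurrence over the draft inputs, keeping every
    // intermediate state instead of only the final one.
    void run(recurrent_state start, const std::vector<float> & inputs) {
        snapshots.clear();
        snapshots.push_back(start);
        recurrent_state s = start;
        for (float x : inputs) {
            s.value = 0.5f * s.value + x; // invented update rule
            snapshots.push_back(s);
        }
    }

    // State after keeping only the first n_accepted draft tokens.
    // Throws std::out_of_range if n_accepted exceeds the draft length.
    recurrent_state rollback(size_t n_accepted) const {
        return snapshots.at(n_accepted);
    }
};
```

When only part of the draft is accepted, the state for the accepted
prefix is read back directly instead of being recomputed from an earlier
checkpoint.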
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: https://github.com/ggml-org/llama.cpp/commit/8c05923630110223669f069af2000e9cf10c02bc
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>